One Star

Why removing so many times?

I have an update job that updates data for the "Bonds" entity in my "SecurityMaster" model.
Below is the output from exist.log. Why is it trying to remove the same Id, 28646, so many times? It takes about 10 seconds to update each record. EXTREMELY SLOW!
I really hope I'm doing something wrong here... any ideas?
2011-03-08 02:41:42,158 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:42,159 INFO (RpcConnection.java :307) - query took 0ms.
2011-03-08 02:41:43,098 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:44,537 INFO (RpcConnection.java :307) - query took 0ms.
2011-03-08 02:41:45,449 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:46,990 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:47,037 INFO (RpcConnection.java :307) - query took 0ms.
2011-03-08 02:41:48,042 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:49,538 INFO (RpcConnection.java :307) - query took 0ms.
2011-03-08 02:41:50,280 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:51,139 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
2011-03-08 02:41:52,040 INFO (RpcConnection.java :307) - query took 0ms.
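
To quantify the repetition, the "Removing document" lines can be counted per document. The following is a minimal sketch that assumes the log format shown above; the function name `summarize_removals` and the regular expression are illustrative and not part of eXist or Talend.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2011-03-08 02:41:42,158 INFO (NativeBroker.java :2222) - Removing document SecurityMaster.Bonds.28646 (38216) ...
REMOVE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Removing document (\S+)"
)


def summarize_removals(lines):
    """Return {document: (removal count, seconds between first and last removal)}."""
    counts = Counter()
    first, last = {}, {}
    for line in lines:
        match = REMOVE_RE.match(line)
        if not match:
            continue  # skip "query took ..." and any other lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        doc = match.group(2)
        counts[doc] += 1
        first.setdefault(doc, stamp)  # keep only the earliest timestamp
        last[doc] = stamp             # overwritten until the latest one
    return {
        doc: (n, (last[doc] - first[doc]).total_seconds())
        for doc, n in counts.items()
    }
```

Calling `summarize_removals(open("exist.log"))` on the excerpt above would report 7 removals of SecurityMaster.Bonds.28646 within roughly 9 seconds.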

  • MDM
14 REPLIES
One Star

Re: Why removing so many times?

Hello Talend Support,
Do you have any idea what is happening above? I have been updating 50K records for 2 days now. This is such bad performance.

Please help/suggest. What more information do you need?
One Star

Re: Why removing so many times?

When I use bulk mode, it does not improve the performance either. Am I doing something wrong?

Employee

Re: Why removing so many times?

Hi muraliv,
To improve performance with tMDMBulkLoad, have a look at the component's help page (select it in the designer, then press F1).
A scenario recently added in 4.2.1 explains how to chunk the XML that you send to tMDMBulkLoad.
With medium to high volumes, tWriteXMLField slows down the process. One workaround is to create small temporary XML chunks to speed up the data integration.
It also lets you add parallelisation to your job.
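
The chunking idea can be sketched outside Talend as follows. This is only an illustration in Python, not the scenario from the component help: the function name `write_xml_chunks`, the file naming and the chunk size are assumptions, and in a real job the chunks would be produced by Talend components and then fed to tMDMBulkLoad.

```python
import os
import tempfile
from itertools import islice


def write_xml_chunks(records, chunk_size, out_dir):
    """Write XML record strings into files of at most chunk_size records each.

    Returns the list of chunk file paths, in the order they were written.
    """
    it = iter(records)
    paths = []
    while True:
        chunk = list(islice(it, chunk_size))  # next chunk_size records, or fewer
        if not chunk:
            break
        # Hypothetical naming scheme: chunk_00000.xml, chunk_00001.xml, ...
        path = os.path.join(out_dir, f"chunk_{len(paths):05d}.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(chunk))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # 1200 dummy records split into chunks of 500
    records = (f"<Bonds><Id>{i}</Id></Bonds>" for i in range(1, 1201))
    with tempfile.TemporaryDirectory() as tmp:
        print(len(write_xml_chunks(records, 500, tmp)))
```

Because each chunk is an independent file, separate chunks can be handed to separate workers, which is where the parallelisation comes from.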
At a customer site, I reached an average of 90-100 rows/sec on 300k rows using Talend MDM EE 4.1.2 (eXist) with this technique.
Hope that helps,
Cyril.
One Star

Re: Why removing so many times?

Cyril,
I created the job as per the documentation, but I still see the same poor performance on "updates". Did you reach 90-100 rows/sec with inserts or with updates? I get good speed with inserts but not with updates.
To take it up a notch, I created the XML files in chunks of 500 records, distributed them to 3 machines, and kicked off the process last night. It's still running! It actually got worse. My talendMDM test server is CE 4.1.2 with 4 CPUs and 8G on Linux. I obviously did not try this update on my PRODUCTION server, which is talendMDM EE 4.1.2.
- Would EE perform better than CE?
- Do I need to do something with the eXist logging to improve performance?
- Also, I do not see any entries in the journal for all the updates that occurred. No documentation states that the journal will be skipped, at least from what I have read.

Please suggest.
Employee

Re: Why removing so many times?

My mistake, I thought it was about inserting data. Indeed, I reached 100 rows/sec on inserts with 4.1.2 eXist, but not on updates (actually, I haven't benchmarked this kind of job on updates).
Do you have an index set on the primary key of your entity for this update?
If the underlying database is still eXist, performance should be the same in CE and EE.
But since 4.2.x, EE lets you use Qizx as the database, which greatly enhances performance. Moreover, since 4.2.x, tMDMBulkLoad has been deeply reworked to achieve better performance on inserts with Qizx (1500/2000+ rows/sec). I haven't tested it yet on big updates, but I'm pretty sure that performance must be way higher than before, as data is indexed on the fly with Qizx.
I personally customize the default log4j configuration to show fewer logs. It might improve performance a little.
Finally, tMDMBulkLoad doesn't write to the journal. Once again, sorry, I thought it was about inserting, not updating...
One Star

Re: Why removing so many times?

Cyril,
Thanks for the confirmation. Does Qizx come with 4.2.x, or does it need to be purchased separately? I'm already tired of eXist.
Employee

Re: Why removing so many times?

Qizx has been the default XML database since Talend MDM EE 4.2.x, though you can still choose eXist.
Regards,
Cyril.
One Star

Re: Why removing so many times?

How do I get talendMDM EE 4.2.x? Who do I contact? I currently have MDM EE 4.1.2.
Thanks!
Employee

Re: Why removing so many times?

Don't hesitate to contact support by opening a new ticket to ask for the upgrade, or ask your sales representative.
Regards,
Cyril.
One Star

Re: Why removing so many times?

Cyril,
I was reading about Qizx and something concerning caught my eye...
http://www.xmlmind.com/qizx/product.html
Is Qizx a Native XML Database ?
Qizx can do all what Native XML Databases do. The main difference is that most NXDbs are optimized for updates, and offer relatively poor search speed.
Qizx is the opposite: it is optimized for high querying speed, not for intensive updating of XML data (in fact Qizx stores and indexes XML documents quite fast, but an update transaction on a document creates a new copy of the document, making modifications of large documents not very efficient).

Will migrating to EE 4.2.2 resolve my issue with UPDATES?
Employee

Re: Why removing so many times?

Hi,
The XML database used in Talend Enterprise Edition has no direct link with Qizx.
Information that you might find on the internet about Qizx will not be accurate here, as Qizx and the Talend MDM XML database share almost nothing in common.
It is an XML database developed by Talend R&D to achieve maximum performance and to support 100+ million records with sub-second response times on complex MDM queries and massive update operations.

Benjamin
One Star

Re: Why removing so many times?

Benjamin,
That's very reassuring. Thanks! I'm implementing it very soon and will post my feedback.
Employee

Re: Why removing so many times?

Hi,
Do not forget to use tMDMBulkLoad in order to achieve maximum throughput.
Benjamin
One Star

Re: Why removing so many times?

The only problem I have with using bulk mode is that it does not write to the journal.
I implemented the flow of data from talendMDM to the database by scanning the journal for updates and pushing the changes to the DB twice a day or on demand.
How else do you recommend implementing the data push to a DB so that downstream applications can consume master data?