As a part of a sub-job, I'm - pulling a limited set of columns (user_id, uuid, crc) of the ca 20 million records. - For each row I use a tMap looking up another CRC with uuid, from another table (20 million records) - inside the tMap I compare the two CRCs and outputs a field flagging if the crcs are different. - this i output/update to the same table as I input from. MySQLInput -> tMap -> MySQLOutput The tMap is configured to: - loading the lookup once - store temp data on disc (with path set) - max buffer size is set to 200 000 The MySQL ouptut-component is set to - update only - commit every 50 000 - batch size 10 000
Problem: This is going at a approx 1000 r/s which is tooo slow. Is there something I can tweak with the current set of components? Or do I need to use the ELT-components? Or even write this whole part of the subjob as pure SQL? Any hints is appreciated!
Hi For performance issue, I can't give you a confirmed answer. As you say, you have use "store temp data on disc" and remove tedious columns. But i'm sure pure SQL must be the best way considering performance. Regards, Pedro
Hi geuder pls try this way if u use index keys for ur tables...
Before running the job first disable the index and then enable the index after the job is over....design the job as given below... tmysqlrow-->"ALTER TABLE tablename DISABLE KEYS" on subjob ok tmysqlinput--tmap--tmysqlouput on subjob ok tmysqlrow--> "ALTER TABLE tablename ENABLE KEYS"
Thanks! I haven't yet tried to turn off and on the indexs, but I moved the join from the tMap to sql in the MySQL input component and loaded the table with the bulk loading components. Total subjob ran 20 million rows in 21 minutes, which is a great improvement.
Hi, i think it wont search the whole table if index is defined if index is removed then use a sorter before the join. try to move joins to the input components and use a bulk execute. Did i just answer what was just answered in the above posts!!!! But then that is it! else you can simply execute a SQL query!!
Hi all, I took this further and created a number of different versions of: MySQLInput -> tMap -> "Output"
MySQLInput: is now joining the two tables and outputs only ID of the records mismatching (what I need) tMap: is only adding a row (an int flag) "Output" i did in three version: a) MySQLBulkOutputExecute b) MySQLBulkOutput triggering MySQLBulkExec c) MySQLBulkOutput->MySQLRow (Disable Index)->MySQLBulkExec->MySQLRow (Enable Index) Results: a) 16 900 r/s b) 12 800 r/s c) 10 000 r/s To me this sounds wrong. The (c) version with bulk loading AND with indexes switched of during update should be the fastest. Am I missing something?