Hi,

I would like to know how Talend Open Studio handles huge amounts of data. For instance, if I want to select a 20 GB MySQL table, apply some transformations to the fields, and load the data into another database, how does Talend Open Studio do it? In my example, I have tMysqlInput --> tMap --> tMysqlOutput.

I tried the two following things:
-> Limited the JVM memory usage with the option -Xmx6144M (6 GB memory usage)
-> In the tMap component, specified the "temp data directory disk", so I guess Talend writes data there to free some memory.

By doing this, my job seems to work. If I don't do this, the job crashes with a memory exception. But what is Talend doing exactly? Does it just store the data in a temporary file and perform read/write access on this file? Does it use a specific algorithm to store the data temporarily? Can it still crash if there is too much data? Is the Talend Platform for Big Data much more optimized than the Open Studio version?

I know it's a lot of questions, but if someone has a few answers, it would really help me.

Best regards,
Bertrand.
Hi nicolasdiogo,

I'm using a MySQL database. Which component would be more suitable?

I actually found a very useful option on the tMysqlInput that I hadn't seen at first: "Enable stream". On the tMysqlOutput I also changed the "Number of rows per insert" and the "Commit every" options.

If you have any suggestions, please let me know.

Best regards,
There is a common problem with MySQL database input components (the same applies to tPostgresqlInput). The JDBC driver collects all rows until the end of the result set and only then starts handing them to the application (your Talend job). Check "Enable stream" in the advanced settings and that's it.
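For context, the "Enable stream" checkbox maps to a well-known MySQL Connector/J setting. Outside Talend, plain JDBC gets the same row-at-a-time behavior roughly like this (a minimal sketch, not the component's actual generated code; needs a live MySQL connection to run):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamDemo {
    // Returns a statement configured so MySQL Connector/J streams rows
    // one at a time instead of buffering the entire result set in the
    // JVM heap -- the driver behavior behind Talend's "Enable stream".
    static Statement streamingStatement(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // Sentinel value documented by Connector/J: stream row by row.
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt;
    }
}
```

With an ordinary statement, a 20 GB SELECT is materialized client-side before the first row reaches the job, which is exactly the OutOfMemoryError pattern described above.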
Hi,

Thank you for your answers.

Nicolasdiogo: if one row of a commit fails, the whole commit is aborted. However, all the previous commits succeeded, so the rows associated with those commits are loaded into the target table.

Jlolling: I activated "Enable stream" and the results were great for memory. My job was able to transfer all 25 GB of data using only a few GB of memory (depending on the commit size of the tMysqlOutput component). However, the transfer speed doesn't seem very good (I reached a maximum of 3 GB/s by playing with the commit size of the tMysqlOutput component).

I also tried the bulk components. With them, the job took about the same time as with tMysqlOutput, but almost the whole run was spent writing the file to the hard drive (as a reminder, the bulk component writes all the data to a file and then imports the file into the database). So the job spends a long time writing the file, and something like 3 seconds importing the file into the database.

So my question is: why does it take so long to write the file? I checked my hard drive and it wasn't under heavy load. So what is limiting the job? Does the bulk component really need that much time to write to disk? Is the incoming stream of data slow (due to poor optimization of the tMysqlInput component)?

If you have any ideas, optimization suggestions or anything else, please let me know.

Best regards,
Bertrand
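On the bulk question: the bulk components' first phase is plain sequential file writing, so if the disk sits idle the bottleneck is usually the incoming row stream (or an unbuffered writer), not the disk itself. A minimal sketch of that phase, with hypothetical names (this is not Talend's generated code):

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;

public class BulkFileDemo {
    // Write rows tab-separated with a large write buffer. Writing row by
    // row without buffering makes this step syscall-bound, which can make
    // the bulk-file phase slow even while the disk looks mostly idle.
    static long writeBulkFile(Path file, Iterable<String[]> rows) throws IOException {
        long count = 0;
        try (BufferedWriter out =
                 new BufferedWriter(new FileWriter(file.toFile()), 1 << 20)) {
            for (String[] row : rows) {
                out.write(String.join("\t", row));
                out.newLine();
                count++;
            }
        }
        return count;
    }
    // The database then ingests the file in one shot, e.g.:
    //   LOAD DATA LOCAL INFILE '/tmp/bulk.tsv' INTO TABLE target;
}
```

If this phase is slow while disk utilization is low, the rows are arriving slowly from the source query, which points back at the input side rather than the bulk writer.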
I am facing the same issue: I have a billion records as input from a MySQL table, and tMap looks up 15 dimension tables for keys and writes to the temp directory. It has been writing to disk for 8 days and the inserts have not started yet. We have enabled the stream and commit every 10000 records. What more optimization can be done here? Or does it have to load all the looked-up data, store it on the temp disk, and only then start inserting into the MySQL output table? Really stuck here and need some help. Thanks in advance.
Hi pankaj,

Can you create a master table for your input table, based on a timestamp, key, etc., and segment the data into batches using a query with a BETWEEN clause? Use this master table and iterate over the input data in those batches. This may be slower, but insertion will start from the first iteration. Also try using the ELT components.

Thanks
Vaibhav
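The batching idea above is mostly range arithmetic: take the min and max key from the master table, precompute contiguous ranges, and run one `SELECT ... WHERE id BETWEEN lo AND hi` per range so inserts start after the first batch instead of after the whole scan. A sketch with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchRanges {
    // Split [minKey, maxKey] into contiguous [lo, hi] ranges covering at
    // most batchSize keys each. Each pair feeds one
    //   SELECT ... WHERE id BETWEEN ? AND ?
    // so every batch is extracted, transformed and committed on its own.
    static List<long[]> ranges(long minKey, long maxKey, long batchSize) {
        List<long[]> out = new ArrayList<>();
        for (long lo = minKey; lo <= maxKey; lo += batchSize) {
            out.add(new long[]{lo, Math.min(lo + batchSize - 1, maxKey)});
        }
        return out;
    }
}
```

In a Talend job the same pattern is typically an iterate link driving the input component with the range bounds as context variables; the ranges themselves are what bound the per-batch memory.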
Hi Bertrand,

I would look at the job differently. As per the previous post, you can write data from the source to a flat file in chunks, then load from the file into the database, remove the old file, and loop again. This would also give you some control over job execution and help you restart in case of failure.

Thanks
Vaibhav
Hi,

I have a MySQL source table with a huge amount of data, more than 15000000 rows. I am using the tMysqlInput component but am unable to extract data from the table; the connection fails every time. Is there a heap size issue with the extract? Please help. I am simply using tMysqlInput -> tMap -> tPostgresqlOutput.

Thanks,
Naveen
Hi,

I am facing the same issue. I have 6 billion records to load and the input is SQL Server. I did not find the "Enable stream" option in the tMSSqlInput component, and when I increased the JVM size to -Xmx1024m there was no increase in loading speed. What should be done to increase the data loading speed in Talend?
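One note on why the option is missing: "Enable stream" is a MySQL-driver-specific trick, so tMSSqlInput has no equivalent checkbox. With the Microsoft JDBC driver for SQL Server, the closest equivalents are adaptive response buffering (a real driver property, the default in recent versions) plus a modest fetch size. A sketch under those assumptions (Microsoft driver, not jTDS; needs a live connection to run):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MssqlFetchDemo {
    // With the Microsoft SQL Server JDBC driver, a forward-only read-only
    // statement plus a fetch size pulls rows in chunks instead of
    // materializing the whole result set client-side.
    static PreparedStatement chunkedSelect(Connection conn, String sql)
            throws SQLException {
        PreparedStatement stmt = conn.prepareStatement(sql,
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(1000); // fetch in chunks of 1000 rows
        return stmt;
    }
    // Driver-level setting on the connection URL (Microsoft driver):
    //   jdbc:sqlserver://host;databaseName=db;responseBuffering=adaptive
}
```

Also note that -Xmx mostly controls whether the job survives, not how fast it runs; if memory is not the constraint, raising it will not speed up loading, and batching the source query (as suggested earlier in this thread) is more likely to help.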