Hi all, I recently finished writing an ETL job. The script is complicated and a massive amount of data needs to be extracted. It refers to a lot of views, but I must avoid any views and extract data from the tables directly. However, when I run the job, it takes about 30 minutes. What can I do to make the job run faster? I have already cut out the unnecessary fields. What should I base further structural optimization on? Can you give me some relevant information or links? Thanks for your reply.
I have been researching this topic for a whole day. After a lot of tests, I found this: the subjob takes 10 minutes when run on its own, but if I drag the subjob into a larger job, like this: subjob --main--> tMap --> tMSSqlOutputBulk, with subjob_2 as the lookup (subjob_2 takes just 1 minute when run independently), the new job takes 30 minutes. What's the problem? The last component of the subjob is tBufferOutput.
Hi Joe, As you seem to load a lot of data into memory with your lookup, you should consider increasing the memory allocated to the JVM that executes your job. To do that, in 4.1.0 and later: in the Run tab, select Advanced Settings, then select Use Specific JVM Arguments. Change the minimum (-Xms) and maximum (-Xmx) memory allocated to give more memory to your job. Make sure the values fit the machine you are running the job on. If you use an older version, let me know and I will try to find out how it was done before (if I remember correctly, it was a JVM Arguments section next to the Contexts in the Run tab).
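To confirm the settings actually took effect, you can print the JVM's heap limits from inside the job. This is just a sketch: the class name `HeapCheck` is illustrative, and in Talend you would paste the body of `main` into a tJava component rather than create a separate class.

```java
// Sketch: print the heap limits the running JVM actually received,
// so you can verify your -Xms/-Xmx settings (e.g. -Xms1024m -Xmx2048m).
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb = rt.maxMemory() / (1024 * 1024);     // ceiling set by -Xmx
        long totalMb = rt.totalMemory() / (1024 * 1024); // heap currently committed
        System.out.println("Max heap (-Xmx): " + maxMb + " MB");
        System.out.println("Committed heap: " + totalMb + " MB");
    }
}
```

If the printed max heap is much smaller than what you set in Advanced Settings, the job is not picking up your JVM arguments (for example, if it is launched through a scheduler with its own settings).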
Joe, here's what's probably happening: when you run your subjob independently, it has its own heap, and the memory it requires is probably less than the heap limit. When you run both jobs together, they share the heap, and most probably you are starting to brush up against the maximum value.

Java uses a process called the garbage collector (GC) to manage memory. Garbage collection runs on demand and is an expensive process. When you are operating very close to your maximum heap, garbage collection executes more and more often to make room for new objects in memory. In this situation, what happens is what I call "garbage collector thrashing": the GC is only able to reclaim small amounts of memory, so it runs over and over and over, making room for only a few new objects each time. Because garbage collection is expensive, having it run that often will cause your program to slow to a crawl.

When developing your jobs, keep in mind the things that consume memory: tMap lookups, sorting, aggregating, and probably the most obvious, buffering output. If you optimize the size, row count, and number of columns used in these operations, you will reduce the total heap used by the job.

If your current heap is the default 1024 MB, try doubling it and see if that resolves your issue. Setting the heap higher will allow your program to run longer between GC runs, and each GC run will become much more efficient. I always try to target around 60-75% heap utilization in my jobs; if you're using 99% of your heap, chances are your GC is thrashing.
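If you want to check where you stand against that 60-75% target, you can compute heap utilization from the standard `Runtime` API. A minimal sketch; the class and method names are mine, not anything from Talend, and the 90% warning threshold is just an illustrative choice:

```java
// Sketch: measure current heap utilization as a percentage of the -Xmx ceiling.
public class HeapUtilization {
    /** Used heap as a percentage of the maximum heap the JVM may grow to. */
    public static double heapUtilizationPct() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // bytes currently in use
        return 100.0 * used / rt.maxMemory();
    }

    public static void main(String[] args) {
        double pct = heapUtilizationPct();
        System.out.printf("Heap utilization: %.1f%%%n", pct);
        if (pct > 90.0) { // well past the 60-75% comfort zone described above
            System.out.println("Warning: near max heap; GC thrashing likely. Raise -Xmx.");
        }
    }
}
```

Logging this at a few points in a long-running job (e.g. from a tJava between subjobs) makes it easy to see whether a lookup or buffer is the step that pushes you toward the ceiling.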
Really good explanation from JohnGarrettMartin. I'll just add something about heap size and memory usage: keep in mind that it will also depend on your OS and architecture. If I remember correctly, on 32-bit Windows you will not be able to allocate more than about 1.5 GB of memory to a single Java process...