I have a tough question about one of my components that is not scaling well. I am running a query using the tPostgresqlInput component. The query, shown below, checks for circular references between employees and their managers in a single table. For testing purposes, I have been using a 25,000 record dataset. When I run this query in PG Admin or DB Visualizer, it takes less than 2 seconds. Likewise, in SAP BODS (our current ETL tool) the query runs in under 2 seconds. However, when I run it through tPostgresqlInput it takes over 5 minutes.
I have tried a few different things to speed this query up, but to no avail. Inside Talend, I have tried adding additional memory to the JVM and running the subjob in parallel. Comparing the run time of the query between Talend and the database level shows that there is something hampering performance when the query is run through Talend. Any help understanding why Talend struggles with this query would be greatly appreciated. We need to be able to scale this query to process up to 500,000 records.
Thank you for any help! I truly appreciate it!
Running Job In DB Visualizer: 0.003 seconds
Running Query in Talend: 346 seconds
WITH RECURSIVE circular_managers(unique_id, mgr_unique_id, depth, path, cycle) AS (
    SELECT u.unique_id, u.mgr_unique_id, 1,
        ARRAY[u.unique_id || '']::varchar[],
        false
    FROM table1 u
    WHERE u.unique_id IS NOT NULL
    UNION ALL
    SELECT u.unique_id, u.mgr_unique_id, cm.depth + 1,
        path || u.unique_id,
        u.unique_id = ANY(path)
    FROM table1 u, circular_managers cm
    WHERE u.unique_id = cm.mgr_unique_id
        AND u.unique_id IS NOT NULL
        AND NOT cycle
)
SELECT unique_id, array_to_string(path, ' > ') AS path
FROM circular_managers
WHERE cycle
    AND array_to_string(path, ' > ') NOT LIKE ' >%'
    AND array_to_string(path, ' > ') NOT LIKE '% > > %'
    AND path[1] = path[array_upper(path, 1)]
GROUP BY 1, 2
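To make it easier to see what the recursive CTE is doing, here is an illustrative sketch (not the Talend job itself) of the same check in plain Java: walk each employee's manager chain and flag any ids that sit on a reporting cycle. The `Map` of `unique_id -> mgr_unique_id` stands in for `table1`; names are hypothetical. Something like this can be handy for sanity-checking the query's output on a small dataset.

```java
import java.util.*;

public class CircularManagers {
    // Returns the ids that sit on a reporting cycle (a -> b -> ... -> a),
    // the same rows the CTE's "cycle" flag marks.
    public static Set<String> findCycles(Map<String, String> managerOf) {
        Set<String> onCycle = new HashSet<>();
        for (String start : managerOf.keySet()) {
            List<String> path = new ArrayList<>();
            Set<String> seen = new HashSet<>();
            String cur = start;
            // Follow the manager chain until it ends or revisits a node.
            while (cur != null && managerOf.containsKey(cur) && !seen.contains(cur)) {
                seen.add(cur);
                path.add(cur);
                cur = managerOf.get(cur);
            }
            if (cur != null && seen.contains(cur)) {
                // Cycle found: everything from the first occurrence of cur onward is on it.
                int i = path.indexOf(cur);
                onCycle.addAll(path.subList(i, path.size()));
            }
        }
        return onCycle;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("a", "b"); m.put("b", "c"); m.put("c", "a"); // a -> b -> c -> a
        m.put("d", "a");  // d reports into the cycle but is not on it
        System.out.println(new TreeSet<>(findCycles(m))); // [a, b, c]
    }
}
```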
What does your ETL job look like? Are you able to run your job successfully in Talend Studio?
Could you please clarify which Talend version/edition you are using?
I am using Enterprise Edition, version is 6.4.1.
The job will run successfully, but when I run it for larger datasets, it is exponentially slower. For example, a 5,000 record dataset takes 2 seconds to run. A 25,000 record dataset takes 5 minutes to run. A 100,000 record dataset has taken over 3 hours to run. When I run the input query using PG Admin or DB Visualizer, the query takes less than a second.
Using different logging options, I can see that the component queryCircRef is the one taking a long time to run. I have also tried creating a custom function at the database level, but I see the same results: running the custom function at the database level is quick, but running it through Talend is very slow.
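One thing worth ruling out is the JDBC driver's fetch behaviour: GUI tools like DB Visualizer typically report the time to the first batch of rows, while Talend pulls the full result set. With the PostgreSQL JDBC driver, `defaultRowFetchSize` makes the driver stream rows with a cursor (autocommit must be off for this to apply), and `prepareThreshold=0` avoids server-side prepared statements whose plans can differ from what PG Admin shows. A minimal sketch of the properties involved, assuming hypothetical credentials; in the component this roughly corresponds to the cursor option in the advanced settings:

```java
import java.util.Properties;

public class PgStreamingProps {
    // Build pgJDBC connection properties that favour streaming over buffering.
    public static Properties streamingProps(String user, String password, int fetchSize) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Fetch rows in batches via a cursor instead of materializing everything.
        props.setProperty("defaultRowFetchSize", Integer.toString(fetchSize));
        // Use the simple protocol path: no server-side prepared-statement plans.
        props.setProperty("prepareThreshold", "0");
        return props;
    }

    public static void main(String[] args) {
        Properties p = streamingProps("talend", "secret", 10000);
        // To use these: DriverManager.getConnection(url, p), then
        // conn.setAutoCommit(false) so the fetch size takes effect.
        System.out.println(p.getProperty("defaultRowFetchSize")); // 10000
    }
}
```

If the job speeds up with these settings in a plain-JDBC test, the bottleneck is the driver configuration Talend uses rather than the query itself.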
Thanks for your help!
Nickster19, were you able to find any solution to the performance issue? We are having the same issue extracting data from a PostgreSQL database. After enabling parallelization at the subjob level and increasing the JVM -Xmx to almost 18 GB, the job still runs at 70 rows/sec per thread, which comes to around 350 rows/second.
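For scale: at an aggregate 350 rows/second, a 500,000 record run would still take roughly 24 minutes, so throwing memory or threads at it does not get close to the sub-second database-level timings. A quick back-of-the-envelope check:

```java
public class ThroughputEstimate {
    // Seconds needed to process the given row count at a steady rate.
    public static long secondsToProcess(long rows, double rowsPerSecond) {
        return Math.round(rows / rowsPerSecond);
    }

    public static void main(String[] args) {
        // 5 threads at ~70 rows/sec each = ~350 rows/sec overall.
        long secs = secondsToProcess(500_000, 350.0);
        System.out.println(secs + " seconds"); // 1429 seconds, ~24 minutes
    }
}
```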