I need to load different CSV files into different databases. The database to connect to should be set dynamically from the file path. I have the flow below:
tFileList --> tFileInputDelimited --> tJavaFlex --> tPostgresqlOutput
I am extracting the DB name and table name from the CSV file path, setting them to context variables in tJavaFlex, and then using those context variables in the tPostgresqlOutput database name and table name properties.
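For illustration, here is a minimal sketch of the kind of path parsing that can go in tJavaFlex. It assumes a hypothetical folder layout like /data/&lt;dbname&gt;/&lt;tablename&gt;.csv; the class and method names are mine, not Talend's:

```java
import java.io.File;

public class PathParser {
    // Hypothetical: derive db and table from a path like /data/<dbname>/<tablename>.csv
    public static String[] dbAndTableFromPath(String csvPath) {
        File f = new File(csvPath);
        String table = f.getName().replaceFirst("\\.csv$", ""); // strip extension
        String db = f.getParentFile().getName();                // parent folder = db name
        return new String[] { db, table };
    }

    public static void main(String[] args) {
        String[] parts = dbAndTableFromPath("/data/salesdb/orders.csv");
        System.out.println(parts[0] + "." + parts[1]); // prints "salesdb.orders"
    }
}
```

In the actual job the two values would be assigned to context variables (e.g. context.dbName, context.tableName) instead of being returned.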
But I am getting
org.postgresql.util.PSQLException: ERROR: zero-length delimited identifier at or near """"
because the value arrives as null or an empty string for the dbname or table properties, even though I can see the values coming through properly in the tJavaFlex logs.
The reason may be that the tPostgresqlOutput connection details are initialized even before the flow starts.
Someone please help me on this...
Thanks in advance
Thanks for your reply. Yes, when I created a subjob with the same database and table name, it works. But I want dynamic database/table names, which is possible only through iteration. When iterating, how can I create a subjob within that?
I have attached a screenshot of my workflow. Here, in the tJava component, I set the database/table name and the directory from which to load the CSVs.
The 2nd approach is interesting and I will try it out. But as I mentioned earlier, the CSVs are dynamic, and so are the jobs. So I can't create a job specific to each CSV (e.g. if var.equals("Customer") or var.equals("Order")). From the folder path I identify the database and connect to the specific tables. I don't have control over the CSVs or their contents. The jobs you mentioned would have to be created dynamically, and that is the challenge.
Yes, in fact it is really, really dynamic. I only know the database name and table name from the file path. I am also using the dynamic data type to map the columns, so I need not worry about the CSV columns and schema.
The main issue is that the database connection is set before loading. As a simple flow, tFileList --> tFileInputDelimited --> tJavaFlex --> tPostgresqlOutput should work, where tJavaFlex sets the database name and table name. But that's not happening: even before tFileList starts loading, tPostgresqlOutput throws the error.
Yes, I am connecting to multiple databases, and to tables inside those databases, dynamically while loading the CSVs.
I tried a workaround: first find the file/db/table combinations, then loop over that map and run the same flow I mentioned at the beginning, and it is working now.
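The workaround can be sketched roughly as follows: build an ordered map of CSV path to "db.table" up front, then iterate over it so each iteration runs the load flow with a fixed database and table. This is a hypothetical standalone version, again assuming the .../&lt;db&gt;/&lt;table&gt;.csv layout:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FileMapper {
    // Hypothetical: given CSV paths laid out as .../<db>/<table>.csv, build an
    // ordered map of path -> "db.table" to iterate over, so every iteration
    // opens one connection with a known database and table.
    public static Map<String, String> mapFiles(List<String> csvPaths) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String path : csvPaths) {
            String[] parts = path.replace('\\', '/').split("/");
            String table = parts[parts.length - 1].replaceFirst("\\.csv$", "");
            String db = parts[parts.length - 2];
            result.put(path, db + "." + table);
        }
        return result;
    }
}
```

In the Talend job, each map entry becomes one pass of the iterate link, with the db/table pushed into context variables before the load subjob starts.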
But now the problem is performance. It is taking more than an hour for 300K records. In tPostgresqlOutput, since I am using "Insert or update" together with dynamic table creation (action on table), it is pretty slow now. I think I need to find a way to segregate the records to be inserted from those to be updated and handle them separately.
If there is any easy approach/component that can be used to segregate them, please let me know.
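One possible way to segregate, sketched outside Talend: fetch the existing primary keys from the target table once (a single SELECT), then split the incoming rows into plain inserts and plain updates so each batch can take the fast path. This is only an assumption about how I might do it; the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UpsertSplitter {
    // Hypothetical: existingKeys holds the primary keys already present in the
    // target table; rows are the incoming CSV records; keyColumn is the index
    // of the key field. Rows whose key exists become updates, the rest inserts.
    public static Map<String, List<String[]>> split(Set<String> existingKeys,
                                                    List<String[]> rows,
                                                    int keyColumn) {
        List<String[]> inserts = new ArrayList<>();
        List<String[]> updates = new ArrayList<>();
        for (String[] row : rows) {
            if (existingKeys.contains(row[keyColumn])) {
                updates.add(row);
            } else {
                inserts.add(row);
            }
        }
        Map<String, List<String[]>> out = new HashMap<>();
        out.put("insert", inserts);
        out.put("update", updates);
        return out;
    }
}
```

The two resulting batches could then go to two separate output components, one set to "Insert" only and one set to "Update" only, avoiding the per-row existence check of "Insert or update".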
Anyway, thanks a lot for all your inputs.
I am running it locally; both the DB server and the file repository are on the same system.
I tested the same 300K records with only "Insert" plus parallelism, and it finished in less than 4 seconds. So I can conclude that it is the "Insert or update" without parallelism (as I need to create tables dynamically, which is not available with parallelism) that makes it slower.
Thank you very much for the input. I will try that.