Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Five Stars


Hi,

    I need to load different CSV files into different databases. The database to connect to should be set dynamically from the file path. I have the flow below:

tFileList --> tFileInputDelimited --> tJavaFlex --> tPostgresqlOutput

I extract the DB name and table name from the CSV file path in tJavaFlex, set them in context variables, and then use those context variables in the tPostgresqlOutput database name and table name properties.
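For illustration, the extraction in tJavaFlex could look like the sketch below. The path layout (`/<root>/<dbname>/<tablename>/<file>.csv`) and all class/variable names are assumptions for this example, not the poster's actual code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvPathParser {

    // Assumed layout: /<root>/<dbname>/<tablename>/<file>.csv
    // Returns {dbName, tableName}; fails fast if the path is too shallow.
    public static String[] parse(String csvPath) {
        Path p = Paths.get(csvPath);
        int n = p.getNameCount();
        if (n < 3) {
            throw new IllegalArgumentException("Unexpected path: " + csvPath);
        }
        String tableName = p.getName(n - 2).toString(); // parent folder
        String dbName    = p.getName(n - 3).toString(); // grandparent folder
        return new String[] { dbName, tableName };
    }
}
```

In tJavaFlex the two values would then be assigned to context variables, e.g. `context.dbName = parts[0];`.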

 

But I am getting 

org.postgresql.util.PSQLException: ERROR: zero-length delimited identifier at or near """"

because the value for the dbname or table properties arrives as null or an empty string, even though I can see the correct values in the tJavaFlex logs.

The reason may be that the tPostgresqlOutput connection details are initialized even before the flow starts.

 

Can someone please help me with this?

Thanks in advance

Sree

Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Any idea or pointers?

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

You could look at the generated code to check what's happening, but your assumption is probably right.
Make a subjob to which you pass the context variables; in the subjob you create the flexible connection.
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Thanks for your reply. Yes, when I created a subjob with a fixed database and table name it works. But I want dynamic database/table names, which is possible only by iterating. When iterating, how can I create a subjob within that loop?

 

I have attached a screenshot of my workflow. In the tJava component I set the database/table name and the directory from which to load the CSV.

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

First option:
After the tFileList, add a tRunJob ...
In the tRunJob you add context variables targetTable and fileNamePath, which you set dynamically in your (main) job.
Do all files have the same columns? I assume not... so let's assume you have customer and order files:

In the tRunJob you start:
* tJava1 -> If (context.targetTable.equals("customer")) -> tFileInputDelimited -> tPostgresqlOutput
* tJava1 -> If (context.targetTable.equals("orders")) -> tFileInputDelimited -> tPostgresqlOutput

In every job you still need to define the file input and output mapping.
Note that .equals() is case sensitive; use .equalsIgnoreCase() otherwise.

Second option (preferred):
After the tFileList, add a tRunJob with the dynamic job checkbox enabled. A job is then triggered based on a context variable holding the name of the job to run; the job name must match the context variable exactly.

Create a job for each file type.
In the tRunJob you still add the context variables targetTable and fileNamePath, which you set dynamically in your (main) job.
In every job you still need to define the file input and output mapping.
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Hi Dijk,

      The 2nd approach is interesting and I will try it out. But as I mentioned earlier, the CSVs are dynamic, and so are the jobs. So I can't create a job specific to each CSV type, like if var.equals("Customer") or var.equals("Order"). I identify the database from the folder path and connect to specific tables. I don't have control over the CSVs or their contents. The jobs you mentioned would have to be created dynamically, and that is the challenge.

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

From what you are saying, it sounds like there are a couple of folders which indicate the target/output database table ... Do the files within a subfolder always have the same structure and contain the same columns in the same order? If not, first create a process to determine which type each file is.

If there's no control... how do you know which file contains, say, customers? There should be something that indicates what type of data is present in your file: the folder, the filename, or its header.

If it's really, really dynamic, you need to use regex to extract parts of the data from within the file (the header should indicate the type) and use this to determine the type of data. But you still need to map your input data onto the output columns in the Postgres DB (unless you store one column that contains the whole content of one row from the file).
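As a rough sketch of that header-based approach, assuming the header contains a distinguishing column name such as customer_id or order_id (both column names are made up for illustration):

```java
import java.util.regex.Pattern;

public class FileTypeSniffer {

    // Illustrative patterns; the real distinguishing columns depend on the files.
    private static final Pattern CUSTOMER = Pattern.compile("(?i)\\bcustomer_id\\b");
    private static final Pattern ORDERS   = Pattern.compile("(?i)\\border_id\\b");

    // Classify a file by its first (header) line.
    public static String classify(String headerLine) {
        if (CUSTOMER.matcher(headerLine).find()) return "customer";
        if (ORDERS.matcher(headerLine).find())   return "orders";
        return "unknown";
    }
}
```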

Good luck!
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Yes, in fact it is really, really dynamic. I know the database name and table name from the file path. I am also using the dynamic schema datatype to map the columns, so I don't need to worry about the CSV columns and schema.

The main issue is that the database connection is set before loading. As a simple flow, tFileList --> tFileInputDelimited --> tJavaFlex --> tPostgresqlOutput should work, with tJavaFlex setting the database name and table name. But that's not happening: even before tFileList starts iterating, tPostgresqlOutput throws the error.

 

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

By setting the database connection, do you actually connect to multiple databases? Or multiple tables? Or both?
Maybe ... just a 'wild' idea... you could generate INSERT INTO scripts and upload them as a bulk load?
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Yes, I'm connecting to multiple databases, and to tables inside those databases, dynamically while loading the CSVs.

I tried a workaround: first find the file/db/table combinations, then loop over that map and run the same flow I mentioned at the beginning. It is working now.

 

But now the problem is performance. It takes more than an hour for 300K records. In tPostgresqlOutput, since I'm using 'insert or update' together with dynamic table creation (action on table), it is pretty slow. I think I need to find a way to segregate the records to be inserted from those to be updated and handle them separately.

 

If there is an easy approach or component that can be used to segregate them, please let me know.

 

Anyway, thanks a lot for all your inputs.

 

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Your network (WiFi) is probably the performance killer... If you need high throughput, I suggest generating batch (INSERT INTO / t-sql) files, uploading the file to your DB server first, and then processing it there.
If you use Talend, I'm not sure whether you can use/change the batch size... and make sure you 'trim' your input data.

Complaints about performance issues are becoming more frequent, especially when the cloud is involved: upload speed is often not as fast as download speed. And if you work from home, I guarantee that HOME -> VPN -> Office -> Cloud is the serial killer.
Even cloud up/down speeds are managed and based on concurrent user loads ... so...
And what if your (cloud) database is replicated, or audit logs are active ...

Upload a 25 MByte file and see what your average upload speed is... I think it's a very good indicator to use for your max throughput calculation over TCP/IP. The next step is your DB.
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

I am running it locally; both the DB server and the file repo are on the same system.

 

I tested the same 300K records with only "insert" with parallelism and it finished in less than 4 seconds. So I can conclude that it is the "insert and update" without parallelism (I need to create tables dynamically, which is not available with parallelism) that makes it slow.

Ten Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Oh, that sucks... maybe first do plain inserts into temp tables, then run a t-sql statement with an exists check against the target table.
This is how I do a lot of stuff to fill a DWH (data vault with hash keys)... the overhead of insert/update is way too much...
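One way to sketch that set-based staging pattern for Postgres is below. The table and key column names are placeholders, and the identifiers are assumed to come from a trusted source (never from user input, since they are concatenated into SQL):

```java
public class StagingMerge {

    // Build the two statements for a delete-then-insert merge:
    // 1) remove target rows that will be replaced, 2) bulk insert the staging rows.
    // Assumes staging and target share the same column layout.
    public static String[] mergeStatements(String staging, String target, String keyCol) {
        String delete = "DELETE FROM " + target + " t USING " + staging
                      + " s WHERE t." + keyCol + " = s." + keyCol;
        String insert = "INSERT INTO " + target + " SELECT * FROM " + staging;
        return new String[] { delete, insert };
    }
}
```

Run both statements inside one transaction; because the merge is set-based, the per-row overhead of 'insert or update' disappears. On Postgres 9.5+ an alternative is a single INSERT ... ON CONFLICT (key) DO UPDATE statement.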
Five Stars

Re: Loading different csv files to databases by setting databases name from context variable/current file path dynamically

Thank you very much for the input. I will try that.