I have a basic job setup which loads the contents of a CSV file into a database table using the "DYNAMIC" column type. This works great for quickly getting raw data in a database.
However, the latest file I am testing this on has duplicate column headers. Is there a way of handling duplicate column names when using Dynamic Schema? Maybe by adding a number to the end of the duplicate column or something along those lines?
This would be a basic example:
This would be imported as something like below to get away from the duplicate column:
The column names will need to be different. If you can add something to make the column names unique, that will solve this issue.
OK, thanks. Seems it's not possible then. I wanted to take away the manual aspect of having to check the column names as the CSV files I'm dealing with can have 100s of columns
You could build a pre-processing job to deal with this scenario.
Unfortunately this is a problem that is quite unusual. Data with columns using the same name is not something that any integration tool will solve dynamically, natively and intelligently. If I have a dataset that I have not seen before and it has two columns called "name", I would have to spend a fair amount of time looking at the data to try and assess what each column was actually holding. As a human I have a "world knowledge" that might help me in doing this.
A simple process to tokenize the header row, look for duplicates and then reassign duplicates with another name (maybe just adding a number to the existing name) should be relatively easy to achieve.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Part 2 of a series on Context Variables
Learn how to do cool things with Context Variables
Read about some useful Context Variable ideas