I have attached a screenshot of my Talend job with this post. It reads thousands of Excel files from a folder, converts rows into columns using Talend's unpivot component, and then applies some transformations in tMap. Everything works fine until the data is loaded into MS SQL: because I have enabled the "Insert if not exists" option, the upload slows down over time as the database grows bigger and bigger. It would be great if someone could help me understand:

1. How the bulk-upload feature for MS SQL works in Talend, so that the load is no longer row by row.
2. Whether there is a better way to prevent duplicate records, or a component available to prevent duplication.

Thanks
Re: Bulk Upload from a CSV and Removing Duplicates
Have some patience. It slows down because "Insert if not exist" queries the table to see whether the record already exists before each insert, so the database effectively does two operations per row. A few things to try:

- Make sure a proper index exists on the key fields; that alone may speed things up.
- Try a plain "Insert" and let the database reject the duplicate rows (with "Die on error" unchecked).
- Don't expect bulk upload on its own to help: it will die if any insert is rejected because of a duplicate.

There is no dedicated component that filters rows based on whether they already exist in the DB. You would have to use a tMap with an inner-join lookup against the target table and send the inner-join rejects (rows with no match) on to the DB. If your duplicates are within the data from Excel itself, use tUniqueRow to eliminate them first.
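To make the idea concrete, here is one common set-based pattern that combines bulk loading with duplicate filtering: bulk-load everything into an unconstrained staging table, then insert only the new rows into the target in a single statement. This is a sketch in T-SQL with hypothetical table and column names (`dbo.target_table`, `dbo.staging_table`, `record_key`), not your actual schema:

```sql
-- Hypothetical names for illustration; replace with your real table and key columns.

-- Step 1 (outside this script): bulk-load ALL rows, duplicates included, into
-- dbo.staging_table, e.g. with Talend's bulk-exec output or SQL Server BULK INSERT.
-- The staging table has no unique constraint, so the bulk load never rejects rows.

-- Step 2: make the existence check cheap with an index on the key column(s).
CREATE INDEX ix_target_record_key ON dbo.target_table (record_key);

-- Step 3: insert only rows not already present, in one set-based statement
-- instead of a row-by-row "insert if not exists" check.
INSERT INTO dbo.target_table (record_key, col1, col2)
SELECT s.record_key, s.col1, s.col2
FROM dbo.staging_table AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.target_table AS t
    WHERE t.record_key = s.record_key
);

-- Step 4: clear the staging table for the next batch of files.
TRUNCATE TABLE dbo.staging_table;
```

The win is that the duplicate check runs once as a single indexed anti-join over the whole batch, rather than as a separate query per row, so it doesn't degrade the way per-row "Insert if not exists" does as the target table grows.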