Read one file in parallel

One Star

Read one file in parallel

Experts, could you please help me implement a solution to read a file in parallel? For example, I have a 10 GB file and I want multiple partitions to read that file at the same time. Is that possible?

Moderator

Re: Read one file in parallel

Hi,
You could use a sequence in tMap to break up your file into smaller chunks. What kind of data do you have in this file?
Do you want to load your big file into DB? Could you please give us more information about your current job situation?
Best regards
Sabrina
One Star

Re: Read one file in parallel

I receive full-refresh files from my source team that contain 160M records. Because each file is a full refresh, I have to read it, compare it with the previously loaded data, identify inserts, updates and deletes, and apply the delta to the DB table. As an example, take the data below, where customer_id is the PK.

Today's file contains
customer_id   customer_name
100              Sam
102              Alex
105              David

Previously loaded table
customer_id   customer_name
100              Sam
102              Alexy
104              John

With the data above, I need to mark customer_id 102 for update, 105 for insert and 104 for delete. In other words, the final table must end up with the content of the latest file.
I don't want to truncate and reload the table because clients use it almost all the time. I can implement the delta logic with tMap, but the problem is processing 160M records, which takes a lot of time. Sample file content is posted below.
In the file below, the first two columns are the PK.

6014|A26904c676|0.0186370|61
6014|A27da32789|0.0154096|55
6014|A287f20d2c|0.0219631|55
6014|A2dfe8c97e|0.0408455|61
6014|A3b52342f8|0.0243586|61
6014|A3e7ac480f|0.0260668|61
6014|A5abde4f3b|0.0398880|55
6014|A5c54eed1b|0.0293591|55
6014|A5e4e4d111|0.0312439|61
6014|X14b34ecd508|0.0263314|61
6014|X14b34ecd529|0.0263314|61
6014|X14b34ecd53c|0.0263314|61
6014|X14b34ecd594|0.0464095|61
6014|X14b3f396fa8|0.0163314|58
6014|X14b53d31504|0.0207230|58
6014|X14c174dc981|0.0311294|55
6014|X14c174dc9f6|0.0224165|55
6014|X14c2be79613|0.0270148|55

 
Ten Stars

Re: Read one file in parallel

The way to do this is to load the records into another table and carry out the comparison processing in the database. With your requirement to find deleted and new records, you will need to carry out two lookups using a tMap. Doing a lookup comparison like that, with that many records in a tMap is going to be slow even with a really powerful system. Java is nowhere near as fast as a database for comparisons. 
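To make the three outcomes concrete, here is a minimal in-memory sketch of the classification, using the sample customer rows from the earlier post. It is an illustration only: the class and method names are made up for this example, and at 160M rows the same comparison belongs in the database, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class DeltaSketch {

    /**
     * Compares the previously loaded rows with the rows of the latest file,
     * both keyed by primary key. Returns INSERT, UPDATE or DELETE per key;
     * unchanged keys are left out of the result.
     */
    public static Map<Integer, String> classify(Map<Integer, String> previous,
                                                Map<Integer, String> current) {
        Map<Integer, String> actions = new TreeMap<>();
        // First pass: keys in the new file are either new or possibly changed.
        for (Map.Entry<Integer, String> row : current.entrySet()) {
            if (!previous.containsKey(row.getKey())) {
                actions.put(row.getKey(), "INSERT");
            } else if (!row.getValue().equals(previous.get(row.getKey()))) {
                actions.put(row.getKey(), "UPDATE");
            }
        }
        // Second pass: keys that disappeared from the new file are deletes.
        for (Integer key : previous.keySet()) {
            if (!current.containsKey(key)) {
                actions.put(key, "DELETE");
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<Integer, String> previous = new LinkedHashMap<>();
        previous.put(100, "Sam");
        previous.put(102, "Alexy");
        previous.put(104, "John");

        Map<Integer, String> current = new LinkedHashMap<>();
        current.put(100, "Sam");
        current.put(102, "Alex");
        current.put(105, "David");

        // prints {102=UPDATE, 104=DELETE, 105=INSERT}
        System.out.println(classify(previous, current));
    }
}
```

The two passes correspond to the two lookups mentioned above: one from the new file into the existing data, and one from the existing data into the new file.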
One Star

Re: Read one file in parallel

But my main problem is reading 160M records from the file into a table. How can I parallelize that? For comparison, another ETL tool, Informatica, has the concept of partitions: it splits a big file into logical partitions and reads them in parallel. Do we have something like that in Talend?
Ten Stars

Re: Read one file in parallel

With Talend you are not limited to only what Talend provides. You can also make use of third-party Java APIs and command-line functionality. So, if you are working in a Linux environment, you can use the split command (http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts). If you are not (or if you don't want to use split), you can use a bit of Java to split the file (http://stackoverflow.com/questions/19177994/java-read-file-and-split-into-multiple-files). 
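As a rough sketch of the "bit of Java" option, the class below splits a text file into chunks of a fixed number of lines, so that no record is cut in half. The class name, the chunk file naming (chunk_0000.txt and so on) and the UTF-8 encoding are assumptions made for this example, not anything Talend prescribes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileSplitSketch {

    /** Splits input into files of at most linesPerChunk lines and returns their paths. */
    public static List<Path> splitByLines(Path input, Path outputDir, long linesPerChunk)
            throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> chunks = new ArrayList<>();
        BufferedWriter writer = null;
        long linesInChunk = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Start a new chunk file on the first line and whenever the current one is full.
                if (writer == null || linesInChunk == linesPerChunk) {
                    if (writer != null) {
                        writer.close();
                    }
                    Path chunk = outputDir.resolve(String.format("chunk_%04d.txt", chunks.size()));
                    chunks.add(chunk);
                    writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    linesInChunk = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInChunk++;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FileSplitSketch <inputFile> <outputDir> <linesPerChunk>
        List<Path> chunks = splitByLines(Paths.get(args[0]), Paths.get(args[1]),
                Long.parseLong(args[2]));
        System.out.println("Wrote " + chunks.size() + " chunk file(s)");
    }
}
```

The file is read line by line, so memory use stays flat regardless of the input size. On Linux, the split command does the same job without any code.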

Processing in parallel may be a problem if you do not have the Enterprise Edition. That is one of the "paid for" features, but it doesn't stop you from doing this in parallel in the Open Source Edition. You can simply create a job which reads a file (name supplied by a context variable) and then run it as many times concurrently as your system can handle. This won't be the elegant solution that you get with the Enterprise Edition, but since the aim is simply to get the data loaded (I am assuming), it shouldn't matter.
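The "run it as many times as your system will handle" idea can be sketched in plain Java with a fixed-size thread pool. This is only an illustration of the concurrency pattern: processChunk is a hypothetical stand-in that just counts lines, where a real setup would launch the Talend job with the chunk file name passed as a context variable. The class name and the pool size of 4 are likewise assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

public class ParallelChunkSketch {

    /** Hypothetical stand-in for one job run: counts the records in a chunk file. */
    public static long processChunk(Path chunk) throws IOException {
        try (Stream<String> lines = Files.lines(chunk, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    /** Processes all chunks with at most maxConcurrent running at once; returns total records. */
    public static long processAll(List<Path> chunks, int maxConcurrent)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path chunk : chunks) {
                Callable<Long> task = () -> processChunk(chunk);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that chunk has finished
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ParallelChunkSketch <chunkFile1> <chunkFile2> ...
        List<Path> chunks = new ArrayList<>();
        for (String arg : args) {
            chunks.add(Paths.get(arg));
        }
        System.out.println("Records processed: " + processAll(chunks, 4));
    }
}
```

The pool size caps how many chunks are handled at the same time, which plays the role of "as many as your system will handle". A suitable value depends on CPU, disk and database capacity and has to be found by testing.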
One Star

Re: Read one file in parallel

Sure, thanks for your help.
Four Stars

Re: Read one file in parallel

Hi bibintjohn1,

You can do this with the Enterprise Edition; otherwise, the other option is to do it manually. You can split your file using one job and then execute multiple jobs in parallel, each on a different file.

Thanks,
Saurabh.
