CSV file or buffer memory: which is better for storing intermediate data in a Job?

Hi,

 

Which is the best way to store intermediate data in a Job: a CSV file or buffer memory (tHashOutput)?

In my scenario, I receive 4.4 million records from the source and need to perform some operations on them. I store the data partway through the Job because the Job contains multiple subJobs.

I am weighing several factors, such as performance, storage space, and the risk of memory issues.

Please suggest the best method to use.

 

Thanks in advance.


Accepted Solutions
TRF

Re: CSV file or buffer memory: which is better for storing intermediate data in a Job?

Hi,

Given the number of records, splitting the data into multiple intermediate files may help if you can parallelize the operations you need to perform on these records.
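To illustrate the idea outside Talend, here is a minimal Python sketch of splitting a record stream into several intermediate CSV files and processing them concurrently. The function names, the chunk size, and the row-counting "operation" are all hypothetical placeholders; in a Job the same pattern would be built with file output/input components and parallel subJobs.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_to_csv_chunks(rows, chunk_size, out_dir):
    """Write rows to numbered CSV files holding at most chunk_size rows each."""
    paths = []
    handle = None
    writer = None
    for i, row in enumerate(rows):
        if i % chunk_size == 0:
            # Start a new intermediate file every chunk_size rows.
            if handle:
                handle.close()
            path = os.path.join(out_dir, f"chunk_{len(paths):04d}.csv")
            handle = open(path, "w", newline="")
            writer = csv.writer(handle)
            paths.append(path)
        writer.writerow(row)
    if handle:
        handle.close()
    return paths


def count_rows(path):
    """Placeholder for the real per-chunk operation: it only counts rows."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


def process_in_parallel(paths, workers=4):
    """Run the per-chunk operation on every file; results keep file order."""
    # Threads suit I/O-bound work; CPU-heavy operations would more likely
    # call for ProcessPoolExecutor instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_rows, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        records = ([n, f"name{n}"] for n in range(10))
        chunk_paths = split_to_csv_chunks(records, 4, tmp)
        print(process_in_parallel(chunk_paths))
```

The split only pays off when the chunks can be processed independently; if every operation needs the full data set at once, a single intermediate file is simpler.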

Otherwise, holding all the records in memory can cause memory issues, but that depends more on the total data size than on the number of records (are the records long or short?) and, of course, on the available physical memory.
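A back-of-the-envelope estimate makes the point about data size concrete. The sketch below is only an approximation: the average record sizes and the `overhead_factor` (a guess at how much larger records become once held as in-memory objects) are assumptions, not measured values for any particular Job.

```python
def estimate_in_memory_mb(record_count, avg_record_bytes, overhead_factor=3.0):
    """Rough size, in MiB, of holding every record in memory at once.

    overhead_factor is an assumed multiplier for object overhead on top of
    the raw bytes; measure your own data before relying on it.
    """
    return record_count * avg_record_bytes * overhead_factor / (1024 * 1024)


# Same record count, very different footprints (sizes are illustrative).
short_records = estimate_in_memory_mb(4_400_000, avg_record_bytes=100)
long_records = estimate_in_memory_mb(4_400_000, avg_record_bytes=2_000)
print(f"short records: ~{short_records:.0f} MiB")
print(f"long records:  ~{long_records:.0f} MiB")
```

Comparing the result with the memory actually available to the Job gives a first indication of whether the in-memory option is realistic.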

Also, text (or CSV) files are processed very quickly by the standard tFileInputDelimited or tFileInputFullRow components, so you don't really have to worry about response time when using them (in my opinion, unless you want to gain a few seconds, but I don't think that is the primary concern in your case).
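One likely reason delimited files hold up well at this volume is that they can be read row by row, so memory use does not grow with the file. The Python sketch below shows that general streaming pattern; it is not how the Talend components are implemented, and the function name, the `;` delimiter, and the column layout are assumptions made for the example.

```python
import csv
import io


def sum_column(lines, column, delimiter=";"):
    """Stream a delimited source row by row, keeping only running totals."""
    total = 0.0
    rows = 0
    for record in csv.reader(lines, delimiter=delimiter):
        # Only the current row is held in memory at any time.
        total += float(record[column])
        rows += 1
    return rows, total


# Works the same with an open file handle in place of the in-memory sample.
sample = io.StringIO("a;1.5\nb;2.5\n")
row_count, amount = sum_column(sample, column=1)
print(row_count, amount)
```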

Hope this helps.


TRF
