tFileOutputDelimited - split output in multiple files and zip at the same time

Highlighted
Four Stars

tFileOutputDelimited - split output in multiple files and zip at the same time

Hi,

 

I'm trying to unload a table with ~1.5 to 2 billion entries. The expected behaviour is: the data is going to be exported, the result is splitted in multiple CSV files AND the resulting files are going to be compressed at the same time in order to save disk space.

The tFileOutputDelimited has the feature to compress the resulting file OR split the result into mutliple files, but not both at the same time.

So my question is: how am I able to achieve splitting and compressing a file at the same time. Is there a way I can trigger another process at the moment one subset of rows has been written to a file?

 

Best

Arne


Accepted Solutions
Highlighted
Eight Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Hi Arne,

 

About the only practical way I can think to do this, would be to get the records from your source database in batches the size of each files you want to output, as follows:

 

BatchedZippedCSVOutput.png

 

A quick query to get the record count:

 

tMySQLInput.png

 

And store this in a global variable for later:

 

tSetGlobalVar.png

 

Add a context variable for the batch size, as it's used in two components, and so should be maintained in one place:

 

Context.png

 

The tLoop in "for" mode allows us to iterate and get the necessary offset for each batch of records:

 

tLoop.png

 

In the database input component, we build a query with the correct LIMIT and OFFSET (or whatever's appropriate for your DBMS):

 

tMySQLInput2.png

 

And then we simply output using a tFileOutputDelimited, with zip compression enabled, and a dynamic filename, in my case based on the record offset to keep it simple:

 

tFileOutputDelimited.png

 

Giving us our zipped CSV files:

 

CompressedCSVFiles.png

 

Regards,

 

 

Chris

View solution in original post


All Replies
Highlighted
Eight Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Hi Arne,

 

About the only practical way I can think to do this, would be to get the records from your source database in batches the size of each files you want to output, as follows:

 

BatchedZippedCSVOutput.png

 

A quick query to get the record count:

 

tMySQLInput.png

 

And store this in a global variable for later:

 

tSetGlobalVar.png

 

Add a context variable for the batch size, as it's used in two components, and so should be maintained in one place:

 

Context.png

 

The tLoop in "for" mode allows us to iterate and get the necessary offset for each batch of records:

 

tLoop.png

 

In the database input component, we build a query with the correct LIMIT and OFFSET (or whatever's appropriate for your DBMS):

 

tMySQLInput2.png

 

And then we simply output using a tFileOutputDelimited, with zip compression enabled, and a dynamic filename, in my case based on the record offset to keep it simple:

 

tFileOutputDelimited.png

 

Giving us our zipped CSV files:

 

CompressedCSVFiles.png

 

Regards,

 

 

Chris

View solution in original post

Highlighted
Four Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Thanks a lot for this solution. I will give it a try. Unfortunately I have to retrieve the data from a old data sink which does not have any indices at all (and its a data view). So maybe the counting the objects takes a while. 

2019 GARTNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

Best Practices for Using Context Variables with Talend – Part 1

Learn how to do cool things with Context Variables

Blog

Migrate Data from one Database to another with one Job using the Dynamic Schema

Find out how to migrate from one database to another using the Dynamic schema

Blog

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog