Four Stars

tFileOutputDelimited - split output in multiple files and zip at the same time

Hi,

 

I'm trying to unload a table with ~1.5 to 2 billion entries. The expected behaviour is: the data is going to be exported, the result is splitted in multiple CSV files AND the resulting files are going to be compressed at the same time in order to save disk space.

The tFileOutputDelimited has the feature to compress the resulting file OR split the result into mutliple files, but not both at the same time.

So my question is: how am I able to achieve splitting and compressing a file at the same time. Is there a way I can trigger another process at the moment one subset of rows has been written to a file?

 

Best

Arne

1 ACCEPTED SOLUTION

Accepted Solutions
Six Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Hi Arne,

 

About the only practical way I can think to do this, would be to get the records from your source database in batches the size of each files you want to output, as follows:

 

BatchedZippedCSVOutput.png

 

A quick query to get the record count:

 

tMySQLInput.png

 

And store this in a global variable for later:

 

tSetGlobalVar.png

 

Add a context variable for the batch size, as it's used in two components, and so should be maintained in one place:

 

Context.png

 

The tLoop in "for" mode allows us to iterate and get the necessary offset for each batch of records:

 

tLoop.png

 

In the database input component, we build a query with the correct LIMIT and OFFSET (or whatever's appropriate for your DBMS):

 

tMySQLInput2.png

 

And then we simply output using a tFileOutputDelimited, with zip compression enabled, and a dynamic filename, in my case based on the record offset to keep it simple:

 

tFileOutputDelimited.png

 

Giving us our zipped CSV files:

 

CompressedCSVFiles.png

 

Regards,

 

 

Chris

2 REPLIES
Six Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Hi Arne,

 

About the only practical way I can think to do this, would be to get the records from your source database in batches the size of each files you want to output, as follows:

 

BatchedZippedCSVOutput.png

 

A quick query to get the record count:

 

tMySQLInput.png

 

And store this in a global variable for later:

 

tSetGlobalVar.png

 

Add a context variable for the batch size, as it's used in two components, and so should be maintained in one place:

 

Context.png

 

The tLoop in "for" mode allows us to iterate and get the necessary offset for each batch of records:

 

tLoop.png

 

In the database input component, we build a query with the correct LIMIT and OFFSET (or whatever's appropriate for your DBMS):

 

tMySQLInput2.png

 

And then we simply output using a tFileOutputDelimited, with zip compression enabled, and a dynamic filename, in my case based on the record offset to keep it simple:

 

tFileOutputDelimited.png

 

Giving us our zipped CSV files:

 

CompressedCSVFiles.png

 

Regards,

 

 

Chris

Four Stars

Re: tFileOutputDelimited - split output in multiple files and zip at the same time

Thanks a lot for this solution. I will give it a try. Unfortunately I have to retrieve the data from a old data sink which does not have any indices at all (and its a data view). So maybe the counting the objects takes a while.