Collating row counts in multiple subjobs.

Five Stars

I have a job similar to the one below, with two inputs and a filter on each input. I want to write the number of rows coming out of the filters to an output file that looks something like:

row2|0 rows

row3|100 rows

row5|5 rows

row6|5 rows

 

The file layout doesn't matter too much, I can tweak that as necessary. I just need the values.

 

RowCountJob.JPG

 

If I add a separate tFlowMeterCatcher that writes to a CSV file, only the second set of values ends up in the file. I believe the first set IS written, but is then overwritten regardless of the "append to file" setting. In any case, I do not want to append to the file, because I want the stats to be refreshed entirely each time I run the job.

 

I have also tried to output the results via the Stats & Logs option in the Job tab, but this also only outputs the results from the latest subjob, having overwritten the results from the first one.

 

How can I make all four stats appear in the same file? Is it possible to have more than one tFlowMeterCatcher, and specify which tFlowMeter(s) it refers to?


Accepted Solutions
Five Stars

Re: Collating row counts in multiple subjobs.

I have managed to resolve it, but I am surprised at the complexity of my solution, and I would love to know if there is a better way.

 

Since tFlowMeterCatcher overwrites the output from the first subjob with that of the second, my aim was to merge the two subjobs into one. The only way I found to do this was to put a tMap immediately after each tRowGenerator and create a new field that identifies the source. This is just a simple string field which can contain "Source1" or "Source2" (or anything more meaningful).
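For anyone following along, the tagging step can be sketched in plain Java. This is only an illustration of the idea, not the code Talend generates: the class `SourceTagSketch`, the `TaggedRow` type and the `tag` method are hypothetical names, and a single integer column stands in for whatever the real inputs contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SourceTagSketch {

    /** One row: the original payload plus the extra field naming its source. */
    static final class TaggedRow {
        final String source;
        final int value;

        TaggedRow(String source, int value) {
            this.source = source;
            this.value = value;
        }

        @Override
        public String toString() {
            return source + ":" + value;
        }
    }

    /** Plays the role of the tMap after one input: copies each value and adds the source label. */
    static List<TaggedRow> tag(String source, List<Integer> values) {
        List<TaggedRow> out = new ArrayList<>();
        for (int v : values) {
            out.add(new TaggedRow(source, v));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two made-up inputs standing in for the two tRowGenerator components.
        List<TaggedRow> first = tag("Source1", Arrays.asList(10, 20, 30));
        List<TaggedRow> second = tag("Source2", Arrays.asList(40, 50));
        System.out.println(first);
        System.out.println(second);
    }
}
```

Once every row carries its source label, the two flows can be combined without losing track of where each row came from.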

 

I then used a tUnite to bring the two tMaps together, immediately followed by a tReplicate. Each output from the tReplicate is connected to a tFilter that filters on the source field, and from there I used the original tFilters, which give me the values I need, all within the same job.
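The unite, split by source and count steps can be sketched in plain Java as follows. Again this is an illustration rather than Talend's generated code: `RowCountSketch`, `TaggedRow`, `countFiltered` and `format` are hypothetical names, the filter conditions are made up, and I am assuming the four statistics are the pass and reject counts of the two filters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

public class RowCountSketch {

    /** One row: a payload value plus the field naming its source. */
    static final class TaggedRow {
        final String source;
        final int value;

        TaggedRow(String source, int value) {
            this.source = source;
            this.value = value;
        }
    }

    /**
     * Walks the united flow once. Each row is routed by its source field to that
     * source's filter, and the pass/reject counters for that source are updated.
     */
    static Map<String, Integer> countFiltered(List<TaggedRow> united,
                                              Map<String, IntPredicate> filters) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Pre-fill with zeros so a filter that sees no rows still gets a line.
        for (String source : filters.keySet()) {
            counts.put(source + "_pass", 0);
            counts.put(source + "_reject", 0);
        }
        for (TaggedRow row : united) {
            IntPredicate filter = filters.get(row.source);
            if (filter == null) {
                continue; // unknown source: ignored in this sketch
            }
            String key = row.source + (filter.test(row.value) ? "_pass" : "_reject");
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Renders the counters in the "label|N rows" layout from the question. */
    static List<String> format(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add(e.getKey() + "|" + e.getValue() + " rows");
        }
        return lines;
    }

    public static void main(String[] args) {
        // The united flow: rows from both sources in one list (the tUnite step).
        List<TaggedRow> united = new ArrayList<>();
        united.add(new TaggedRow("Source1", 5));
        united.add(new TaggedRow("Source1", 15));
        united.add(new TaggedRow("Source2", 7));

        // Made-up filter conditions, one per source.
        Map<String, IntPredicate> filters = new LinkedHashMap<>();
        filters.put("Source1", v -> v > 10);
        filters.put("Source2", v -> v % 2 == 0);

        for (String line : format(countFiltered(united, filters))) {
            System.out.println(line);
        }
    }
}
```

Because all four counters are produced in a single pass, they can be written to the file together, which is the point of merging everything into one flow.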

 

Incidentally, this initially failed with an error saying the generated Java code was too large, so I split the job into two sections using a tHashOutput and a tHashInput.

 

To me this whole thing seems unnecessarily complex, but I could find no other way to resolve it. Does anyone have a better, more efficient solution?

Five Stars

Re: Collating row counts in multiple subjobs.

Here is an image of my solution, with one small refinement: a single tMap filters on the sources, instead of a tReplicate and multiple tFilters.

 

RowCountJob_Solution.JPG

