Collating row counts in multiple subjobs.

I have a job similar to the one below, with two inputs and a filter on each input. I want to write the number of rows coming out of the filters to a single output file that looks something like:

row2|0 rows
row3|100 rows
row5|5 rows
row6|5 rows

The file layout doesn't matter too much; I can tweak that as necessary. I just need the values.

[Image: RowCountJob.JPG]

If I add a separate tFlowMeterCatcher that writes to a CSV file, only the second set of values appears in the file. I believe the first set is written, but is then overwritten regardless of the "append to file" setting. In any case, I do not want to append to the file, because I want the stats to be refreshed entirely each time I run the job.

I have also tried to output the results via the Stats & Logs option in the Job tab, but this too outputs only the results from the latest subjob, having overwritten the results from the first one.

How can I make all four stats appear in the same file? Is it possible to have more than one tFlowMeterCatcher, and specify which tFlowMeter(s) it refers to?

Accepted Solutions

Re: Collating row counts in multiple subjobs.

I have managed to resolve it, but I am surprised at the complexity of my solution, and I would love to know if there is a better way.

Since tFlowMeterCatcher overwrites the output from the first subjob with that of the second, my aim was to merge the two subjobs into one. The only way I found to do this was to put a tMap immediately after each tRowGenerator and create a new field that identifies the source. This is just a simple string field which can contain "Source1" or "Source2" (or anything more meaningful).
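
For anyone who prefers to see the idea in code, here is a minimal plain-Java sketch of what the extra tMap does to each row. It is only an analogy, not the code Talend generates: the class name, the TaggedRow type, and the single integer column are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SourceTagSketch {
    // Hypothetical row type: one original column plus the added source field.
    static final class TaggedRow {
        final String source;
        final int value;

        TaggedRow(String source, int value) {
            this.source = source;
            this.value = value;
        }
    }

    // Plays the role of the tMap placed after each input: copy every row and
    // stamp it with a constant label naming where it came from.
    static List<TaggedRow> tag(String source, List<Integer> values) {
        List<TaggedRow> tagged = new ArrayList<>();
        for (int v : values) {
            tagged.add(new TaggedRow(source, v));
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<TaggedRow> first = tag("Source1", List.of(10, 20, 30));
        List<TaggedRow> second = tag("Source2", List.of(7));
        System.out.println(first.size() + " rows tagged " + first.get(0).source);
        System.out.println(second.size() + " rows tagged " + second.get(0).source);
    }
}
```

In the real job the label would simply be a constant string expression on the new output column of each tMap.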

I then used a tUnite to bring the two tMaps together, immediately followed by a tReplicate. Each output from the tReplicate connects to a tFilter that filters on the source field, and from there I used the original tFilter, which gives me the values I need, all within the same job.
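
The flow described above can be sketched in plain Java as follows. Again this is an illustration under assumptions rather than Talend code: the "value >= 50" condition stands in for whatever the original tFilter tests, and the "_match" and "_reject" labels are invented names for the two outputs of each filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowCountSketch {
    // Hypothetical row type carrying the added source field.
    static final class TaggedRow {
        final String source;
        final int value;

        TaggedRow(String source, int value) {
            this.source = source;
            this.value = value;
        }
    }

    // Unite both tagged inputs, then route each row by its source field and by
    // a stand-in filter condition, counting the rows that reach each output.
    static Map<String, Integer> countRows(List<TaggedRow> first, List<TaggedRow> second) {
        List<TaggedRow> united = new ArrayList<>(first); // tUnite equivalent
        united.addAll(second);

        // Pre-fill the four expected outputs so that empty ones report 0.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String source : List.of("Source1", "Source2")) {
            counts.put(source + "_match", 0);
            counts.put(source + "_reject", 0);
        }
        for (TaggedRow row : united) {
            boolean matches = row.value >= 50; // assumed filter condition
            counts.merge(row.source + (matches ? "_match" : "_reject"), 1, Integer::sum);
        }
        return counts;
    }

    // Render each count as "label|N rows", the layout asked for in the question.
    static List<String> format(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add(e.getKey() + "|" + e.getValue() + " rows");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<TaggedRow> first = List.of(new TaggedRow("Source1", 10), new TaggedRow("Source1", 60));
        List<TaggedRow> second = List.of(new TaggedRow("Source2", 5));
        format(countRows(first, second)).forEach(System.out::println);
    }
}
```

Because all four counts are collected in one pass over one flow, they end up together and nothing is overwritten, which is the property the merged subjob is after.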

Incidentally, this initially caused an error saying that the size of the generated Java code was too large, so I split the job into two sections using a tHashOutput and a tHashInput.

To me this whole thing seems unnecessarily complex, but I could find no other way to resolve it. Does anyone have a better, more efficient solution?

Re: Collating row counts in multiple subjobs.

Here is an image of my solution, with one small refinement: a tMap now filters on the sources, instead of a tReplicate and multiple tFilters.

[Image: RowCountJob_Solution.JPG]
