One Star

[resolved] tStatCatcher/tFlowMeter for file iterations

Hi,
If I use the stat catcher and/or flowmeter while iterating through a set of input files, will the counts be available for each file, or will the count be the combined total for all the files that are looped through?
Thanks in advance,
Community Manager

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hello
The 'count' column gives the number of rows for each file, not a combined total.
Best regards
shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hi,
I have a similar situation, and I'd like to store the filename in addition to the row count in the stat table. Is there a way to add custom fields to the default FlowMeter schema? E.g., add a "filename" field in addition to "moment", "pid", "count", etc. Or is there another way to achieve the same result?
Thanks,
Chris
Community Manager

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hello
You can add a custom field in a tMap component, e.g.:
tFlowMeterCatcher--tMap--tMysqlOutput
In the tMap, add the custom field to the output schema and set its value.
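A minimal sketch of what the value of such a custom "filename" field might look like. It assumes the files are listed by a component named tFileList_1 (the name that appears in the stack traces later in this thread) and that its current file name is read from globalMap under the key "tFileList_1_CURRENT_FILE". The class and helper names are hypothetical; they only wrap the one-line expression that would go into the tMap output column, with a plain HashMap standing in for the job's globalMap.

```java
import java.util.HashMap;
import java.util.Map;

final class FileNameField {

    // Hypothetical helper: the body is the expression one would type
    // into the tMap output column for the custom "filename" field.
    static String currentFileName(Map<String, Object> globalMap) {
        return (String) globalMap.get("tFileList_1_CURRENT_FILE");
    }

    public static void main(String[] args) {
        // Stand-in for the globalMap that the generated job maintains.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileList_1_CURRENT_FILE", "tickets_001.bin");

        System.out.println(currentFileName(globalMap));
    }
}
```

The expression is evaluated when the tMap row is processed, so the value depends on when the tFlowMeterCatcher flow actually fires.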
Best regards

shong
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

OK, it works.
I only had to trigger the tFlowMeterCatcher step after each file iteration; otherwise it would trigger only at the end of processing and store only the last filename processed.
Thanks a lot!
Chris
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

OK, this is not so simple after all: I now have a problem with multi-threading. Here is a description of the situation:
I'm using TOS 3.2.0M1, developing under Windows then executing on CentOS 5.3 64-bit.
The job I'm running is visible here. Basically, the goal is to open a number of binary files (Browse_Files) from different directories (Browse_Dirs), process them through a custom Java step (Extract_Raw_Tickets), create CSV files from the output (Fill_CSV), then load these CSV files into an Oracle database using sqlldr (Load_CSV). One CSV file is filled with the output from all the files in one directory, then loaded in one shot. The processing is done in parallel (cf. the "Iterate x5" on Browse_Dirs). So if I have 10 directories, I have 10 CSV files processed by 5 concurrent threads.
This was all working fine until I added the statistics collection. I would like to have a table in the database updated with the list of files processed and the number of lines in each one of them. So I added "Count_Rows" to count the lines produced by each source file; once the file is processed, "Get_Stats" is triggered, the filename is added with "Get_Filename", and a summary row is inserted into the table with "Upload_Stats". (I'm not sure whether this is the best way to do it, but it seemed to work on a small sample.)
However I have encountered two exceptions when running this job:
The first one is non-fatal and occurs after a few hundred iterations:
Exception in component tFlowMeterCatcher_1
java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:449)
at java.util.AbstractList$Itr.next(AbstractList.java:420)
at routines.system.MetterCatcherUtils.getMessages(MetterCatcherUtils.java:160)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFlowMeterCatcher_1Process(PP2_Call_Extract.java:4463)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFileList_1Process(PP2_Call_Extract.java:3680)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFileDelete_1Process(PP2_Call_Extract.java:793)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract$1tJava_1Thread.run(PP2_Call_Extract.java:595)
at routines.system.ThreadPoolWorker.runIt(TalendThreadPool.java:159)
at routines.system.ThreadPoolWorker.runWork(TalendThreadPool.java:150)
at routines.system.ThreadPoolWorker.access$0(TalendThreadPool.java:145)
at routines.system.ThreadPoolWorker$1.run(TalendThreadPool.java:122)
at java.lang.Thread.run(Thread.java:595)

The second one is fatal and occurs a few dozen iterations later:
Exception in component tOracleBulkExec_1
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:474)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tOracleBulkExec_1Process(PP2_Call_Extract.java:4864)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFileExist_1Process(PP2_Call_Extract.java:4696)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFileList_1Process(PP2_Call_Extract.java:3704)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract.tFileDelete_1Process(PP2_Call_Extract.java:793)
at ticketloader.pp2_call_extract_5_0.PP2_Call_Extract$1tJava_1Thread.run(PP2_Call_Extract.java:595)
at routines.system.ThreadPoolWorker.runIt(TalendThreadPool.java:159)
at routines.system.ThreadPoolWorker.runWork(TalendThreadPool.java:150)
at routines.system.ThreadPoolWorker.access$0(TalendThreadPool.java:145)
at routines.system.ThreadPoolWorker$1.run(TalendThreadPool.java:122)
at java.lang.Thread.run(Thread.java:595)

So it looks like the statistics collection is done in a thread-unsafe manner. Or is there a flaw in my design itself?
Thanks a lot for your help,
Chris
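For readers unfamiliar with the first exception, here is a minimal sketch of how a ConcurrentModificationException arises from an ArrayList iterator, plus one common way to avoid it. This is not Talend's MetterCatcherUtils code, whose internals are not shown in this thread; the MeterBuffer class and its methods are hypothetical. The failure is reproduced in a single thread so that it is deterministic. In the job above, the modification would presumably come from another worker thread adding messages while the catcher iterates.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical buffer of meter messages shared between worker threads.
final class MeterBuffer {
    private final List<String> messages = new ArrayList<>();

    // Writers add messages under the buffer's lock.
    synchronized void add(String message) {
        messages.add(message);
    }

    // The reader takes a snapshot and clears the buffer under the same lock,
    // so it never iterates over a list that a writer is changing.
    synchronized List<String> drain() {
        List<String> snapshot = new ArrayList<>(messages);
        messages.clear();
        return snapshot;
    }

    // Reproduces the failure: the list is structurally modified while an
    // iterator over it is still in use, so the next call to next() throws.
    static boolean unsafeIterationFails() {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.add("c");
        try {
            for (String ignored : list) {
                list.add("x"); // modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe iteration fails: " + unsafeIterationFails());

        MeterBuffer buffer = new MeterBuffer();
        buffer.add("file1 100");
        buffer.add("file2 200");
        System.out.println("drained: " + buffer.drain());
    }
}
```

Because the exception is only a best-effort check, an unsynchronized list shared between threads can also fail silently, which would be consistent with the problem appearing only after a few hundred iterations.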
Community Manager

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hello Chris
I can't open your image. Please upload images to our forum directly instead of adding a link.
Best regards

shong
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

OK, I just did it; I only had to scale the image down for the forum to accept it.
Chris
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hi there,
Any idea about this one?
Thanks,
Chris
Community Manager

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hello
There are two problems in your job design:
1) Delete the 'oncomponentok' link; see picture 1.
2) The tOracleOutputBulk and tOracleBulkExec components must be used together; see picture 2.
Best regards
shong
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hi Shong,
thanks for your reply.
1) Delete the 'oncomponentok' link, see picture 1.

That's the way I did it at first, but then the tFlowMeterCatcher fires only after the subjob "Browse Files" finishes. As a consequence, the "Get Filename" mapper reads the file name as it is after the subjob terminates, i.e., the last file that was processed, and outputs this single filename for every line of statistics. So, for instance, instead of getting
file1 100 lines
file2 200 lines
file3 300 lines
I get
file3 100 lines
file3 200 lines
file3 300 lines
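The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: counts are buffered until a "catcher" drains them, and the file name is read from a shared variable at drain time. The class, fields, and methods are hypothetical and do not reflect Talend's generated code.

```java
import java.util.ArrayList;
import java.util.List;

final class CatcherTiming {
    // Shared "current file" value, standing in for what the mapper reads.
    static String currentFile;

    // Meter counts waiting to be picked up by the catcher.
    static final List<Integer> pending = new ArrayList<>();

    // Drain the buffer, tagging each count with the file name visible now.
    static List<String> flush() {
        List<String> rows = new ArrayList<>();
        for (int count : pending) {
            rows.add(currentFile + " " + count + " lines");
        }
        pending.clear();
        return rows;
    }

    // Process three files; flush after each file, or only once at the end.
    static List<String> run(boolean flushPerFile) {
        String[] files = {"file1", "file2", "file3"};
        int[] counts = {100, 200, 300};
        List<String> out = new ArrayList<>();
        pending.clear();
        for (int i = 0; i < files.length; i++) {
            currentFile = files[i];
            pending.add(counts[i]);
            if (flushPerFile) {
                out.addAll(flush());
            }
        }
        out.addAll(flush()); // catcher firing when the subjob ends
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [file3 100 lines, file3 200 lines, file3 300 lines]
        System.out.println(run(true));  // [file1 100 lines, file2 200 lines, file3 300 lines]
    }
}
```

Flushing once per iteration keeps each count paired with the file name that was current when the count was produced.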
2) The tOracleOutputBulk and tOracleBulkExec components must be used together. see picture 2.

I don't understand this one too well. Wouldn't adding another OracleOutputBulk step introduce an additional, redundant, intermediate CSV file? Currently the job fills a CSV file with the output from several input binary files, then loads the CSV file in one go using sqlldr. This seems to work quite well as long as I don't introduce the statistics steps. Is there a better way to do that in Talend? Are the multithreading issues related to this architecture?
Thanks a lot for your help,
Chris
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hi again Shong,
do you have any input regarding the points above? Your help is definitely appreciated!
Thanks,
Chris
One Star

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hi there again,
Sorry to insist, but does anyone have any insight into the issue at hand? Is there a flaw in TOS multithreading support, or should the job be designed differently?
Thanks to all for your help,
Chris
Community Manager

Re: [resolved] tStatCatcher/tFlowMeter for file iterations

Hello Chris
Sorry, I missed this topic yesterday.
The processing is done in parallel (cf the "Iterate x5" on Browse_Dirs

As far as I know, there was a bug in the parallel iterate link. So uncheck the 'enable parallel execution' option and see whether the problem still exists.
as a consequence, the "Get Filename" mapper reads the file name as it is after subjob termination,

Yes, use the 'oncomponentok' link after tOracleOutput if you want to get the current file name in the tMap and output it.
I don't understand this one too well. Wouldn't adding another OracleOutputBulk step introduce an additional, redundant, intermediate csv file?

Yes, you are right. If the file exists, you just need a tOracleBulkExec to read it and bulk insert records into db.
Best regards

shong