tStatCatcher improvements

One Star

- the tStatCatcher component does not log the number of rows processed
- the tStatCatcher component does not log actual time spent in a component, just the total wallclock time from start to finish (which makes it much harder to debug which component needs optimization)
- the tStatCatcher component does not allow the timestamp in UTC
- And one more thing - do you think it would be a good idea to modify tStatCatcher so that it can "poll" data after a certain amount of time and report the status of the job? Currently, once a long-running component starts, there is no way to see the progress of the job. This could be a configurable parameter in the IDE.
Or maybe there are some other components that can help us to do this?
Employee

Re: tStatCatcher improvements

- the tStatCatcher component does not log the number of rows processed

That's right, this is the tFlowMeter + tFlowMeterCatcher job
- the tStatCatcher component does not log actual time spent in a component, just the total wallclock time from start to finish (which makes it much harder to debug which component needs optimization)

Yes, we know, it's due to our code generation model, where parts of different components are combined. We have added tChronometerStart/tChronometerStop to calculate the duration for a given component. I've posted a screenshot in this post to give an example of how to do this.
The result is:
Starting job topic5996 at 14:00 31/03/2009.
tPerlFlex_1 duration 770 ms (0 seconds), 100000 runs, average : 7 microseconds, min : 6 microseconds, max: 1796 microseconds, speed: 129870 rows/second
tMysqlOutput_1 10721 ms (10 seconds), 100000 runs, average : 107 microseconds, min : 95 microseconds, max: 10112 microseconds, speed: 9327 rows/second
===
execution time: 28309 milliseconds
===
Job topic5996 ended at 14:00 31/03/2009.

and if I activate "extended inserts" in tMysqlOutput_1:
Starting job topic5996 at 14:10 31/03/2009.
tPerlFlex_1 duration 705 ms (0 seconds), 100000 runs, average : 7 microseconds, min : 5 microseconds, max: 3548 microseconds, speed: 141843 rows/second
tMysqlOutput_1 1877 ms (1 second), 100000 runs, average : 18 microseconds, min : 6 microseconds, max: 13739 microseconds, speed: 53276 rows/second
===
execution time: 18152 milliseconds
===
Job topic5996 ended at 14:11 31/03/2009.

So it means that tRowGenerator takes approximately 16 seconds.
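
The 16 seconds figure is what remains after subtracting the two measured durations from the total execution time. A quick sketch of that arithmetic, using only the numbers from the two runs above (Python is used purely for illustration, and attributing the whole remainder to tRowGenerator assumes there is no other significant overhead such as job startup):

```python
# Durations in milliseconds, copied from the two runs above.
runs = {
    "without extended inserts": {"total": 28309, "tPerlFlex_1": 770, "tMysqlOutput_1": 10721},
    "with extended inserts": {"total": 18152, "tPerlFlex_1": 705, "tMysqlOutput_1": 1877},
}


def unaccounted_ms(run):
    """Total execution time minus the time measured by the chronometers."""
    measured = sum(ms for name, ms in run.items() if name != "total")
    return run["total"] - measured


residuals = {label: unaccounted_ms(run) for label, run in runs.items()}
for label, ms in residuals.items():
    print(f"{label}: {ms} ms not covered by the chronometers")
```

Both runs leave roughly the same remainder, which is why the time is attributed to the component that did not change between them.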

- the tStatCatcher component does not allow the timestamp in UTC

You mean you want the timestamp to be timezone independent? I think you are perfectly right (should I say we made a design mistake?). What I can propose is a "localtimeToUTCtime" routine to use right after tStatCatcher.
The problem with such a change is the existing data that has already been generated.
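
A minimal sketch of what such a conversion routine could do, written in Python only for illustration (an actual Talend routine would be written in Java or Perl). The function name `localtime_to_utc` and the fixed-offset parameter are assumptions; a production routine would look up the offset in a time zone database so that daylight saving time is handled correctly. The +120 minute offset in the example is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def localtime_to_utc(local_naive, utc_offset_minutes):
    """Convert a naive local timestamp to UTC, given the local UTC offset.

    Assumption: the caller knows the offset that applied at that moment.
    """
    local_tz = timezone(timedelta(minutes=utc_offset_minutes))
    return local_naive.replace(tzinfo=local_tz).astimezone(timezone.utc)


# Hypothetical example: a job started at 14:00 local time on a server at UTC+2.
moment = datetime(2009, 3, 31, 14, 0)
utc_moment = localtime_to_utc(moment, 120)
print(utc_moment.isoformat())
```

Applying the conversion after tStatCatcher, rather than inside it, leaves the rows that have already been logged untouched.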
- And one more thing - do you think it would be a good idea to modify tStatCatcher so that it can "poll" data after a certain amount of time and report the status of the job? Currently, once a long-running component starts, there is no way to see the progress of the job. This could be a configurable parameter in the IDE.

We have a feature that looks a bit like this with "statistics" in the Run view. For each row link, every 2 seconds, we send a message over a socket saying how many rows have already been processed. A long time ago, there was a thread (and not a fork) that ran in parallel with the main processing and sent a message over a socket every second. This was highly time consuming because there were many shared variables.
Anyway, your request would be perfectly possible. I mean adding a "running" status in addition to "begin" and "end". The problem would be in the reader. Here at Talend we offer two tools to read the data generated by tStatCatcher (Activity Monitoring Console and Talend Integration Suite Dashboard), and we would have to check that this feature doesn't break these tools.
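
To make the idea concrete, here is a minimal sketch of a "running" status emitted between "begin" and "end". It is not how tStatCatcher or the Run view statistics are implemented: `run_with_status`, `emit` and the injectable `clock` are hypothetical names, and Python is used only for illustration. The elapsed-time check is done inside the main loop instead of in a parallel thread, so no variables are shared between threads.

```python
import itertools


def run_with_status(rows, process, emit, clock, interval=2.0):
    """Process rows and emit 'begin', periodic 'running', and 'end' records."""
    emit("begin", 0)
    last_report = clock()
    count = 0
    for row in rows:
        process(row)
        count += 1
        now = clock()
        if now - last_report >= interval:
            emit("running", count)  # rows processed so far
            last_report = now
    emit("end", count)


# Demonstration with a fake clock that advances 0.5 s per call,
# so the output does not depend on real elapsed time.
ticks = itertools.count()
events = []
run_with_status(
    rows=range(10),
    process=lambda row: None,  # stand-in for the real per-row work
    emit=lambda status, n: events.append((status, n)),
    clock=lambda: next(ticks) * 0.5,
)
print(events)
```

A reader that only knows "begin" and "end" would also have to tolerate or ignore the extra "running" records, which is the compatibility concern mentioned above.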
One Star

Re: tStatCatcher improvements

Hi Mister Penguin,
It's really great that you're talking about the Chronometer components, because I am having difficulty using them as you describe.
I don't want to duplicate posts (6027), but in your example you have a tChronometerStart following a tGenerator, and I am not able to create such a connection between these two components in my TOS version; my TOS simply doesn't allow it.
I'm using TOS 3.0.3 (Build id: r21383-20090126-2207) under Windows Vista, in Java mode. (You are apparently in Perl, maybe that's the key...)
Another point: I don't get such detailed information from tChronometerStop (I don't have min, max, average, record rate...)
Talend spirits, please take the time to answer this post; my 3 previous posts seem to be invisible on this forum :(
Regards,
One Star

Re: tStatCatcher improvements

The link is wrong in my previous post. This is the first time I'm creating one, so I'll try like this: 6027
Regards.
One Star

Re: tStatCatcher improvements

That's right, this is the tFlowMeter + tFlowMeterCatcher job
...

Thank you for your detailed reply.
I was just wondering - why do we have so many components when we could have all the functionality in one component? I want times, number of rows processed and polling all in one, so that once I add a tStatCatcher component I know I'll have all the information I need for debugging and fine-tuning my application.
As for the timer component, it's good for tuning the application while you develop it, but the behavior of the script might change when the size of the input/output data grows. So I pretty much need times for each and every component in my graph. Can we add timer functionality to all components? Or is it hard to implement?