Parallel processing of time series data

Hello, I have a task that requires me to take a large chunk of time-varying data, apply some very custom processing (e.g. operations that require looking back at earlier rows), and output a single row per item that summarizes its time series. For this I use a tJavaFlex component, since I need to track state along the way and reference prior rows. (As a side note, I wonder whether another component is recommended for this kind of thing; I am fairly sure tJavaRow and tJava would be no better, if they could do it at all.)
So the job starts with a database query that feeds the tJavaFlex component via a main row connection, which in turn writes out the output file.
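A minimal sketch of the look-back pattern described above, written as plain Java rather than as actual tJavaFlex code. It assumes the rows arrive sorted by item_id and time_period, and it uses a made-up summary (row count and largest rise between consecutive periods) because the real processing is not specified. The class, field, and method names are all hypothetical. In a tJavaFlex component the state variables would roughly correspond to the start code, the loop body to the main code, and the final flush to the end code.

```java
import java.util.ArrayList;
import java.util.List;

final class SeriesSummarizer {
    /** Hypothetical input row: one observation of one item in one period. */
    static final class Row {
        final long itemId;
        final int timePeriod;
        final double value;
        Row(long itemId, int timePeriod, double value) {
            this.itemId = itemId;
            this.timePeriod = timePeriod;
            this.value = value;
        }
    }

    /** Hypothetical output row: one summary per item. */
    static final class Summary {
        final long itemId;
        final int rowCount;
        final double maxIncrease; // largest rise between consecutive rows, 0 if none
        Summary(long itemId, int rowCount, double maxIncrease) {
            this.itemId = itemId;
            this.rowCount = rowCount;
            this.maxIncrease = maxIncrease;
        }
    }

    /** Expects rows sorted by itemId, then timePeriod. */
    static List<Summary> summarize(Iterable<Row> sortedRows) {
        List<Summary> out = new ArrayList<Summary>();
        boolean open = false;       // true once a group is in progress
        long currentId = 0;
        int count = 0;
        double previous = 0.0;      // value of the prior row in the same group
        double maxIncrease = 0.0;

        for (Row row : sortedRows) {
            if (open && row.itemId != currentId) {
                // item_id changed: emit the finished series
                out.add(new Summary(currentId, count, maxIncrease));
                open = false;
            }
            if (!open) {
                // first row of a new series: reset the tracked state
                open = true;
                currentId = row.itemId;
                count = 0;
                maxIncrease = 0.0;
            } else {
                // look back at the prior row of the same series
                maxIncrease = Math.max(maxIncrease, row.value - previous);
            }
            previous = row.value;
            count++;
        }
        if (open) {
            // flush the last series after the input ends
            out.add(new Summary(currentId, count, maxIncrease));
        }
        return out;
    }
}
```

Because a summary is emitted only when item_id changes, this pattern is what makes it essential that no single item's series is split across processing units.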
This all works fine, but I would like to know the following. I have a large table of time series data (let's say 1 billion+ rows). Ideally the query would run in parallel from the start rather than as one giant chunk; this would produce several output files, which is OK. I also have one particular constraint: there are several groups of time series, so the data cannot be broken up arbitrarily. All rows of the time series for a given item must stay in the same processing unit.
In the absence of anything else, I could manually change my query and create physically separate jobs. That is, instead of one job that starts with the query "select * from time_series_data order by item_id, time_period", I would create several jobs that each start with the query "select * from time_series_data where item_id > X and item_id < Y order by item_id, time_period", changing X and Y for each job. Is there a way for Open Studio 3.0.4 to do this without my manually creating new jobs?
Many thanks,
DomF

Re: Parallel processing of time series data

Hi DomF,
As far as I can see, the only thing Talend can do here is let you use parameters (context variables) in your SQL statement, so that you can execute the job in parallel with different settings. These runs must be separate processes, though (one will not know about the others).
Bye
Volker

Re: Parallel processing of time series data

Volker,
Thank you for the reply. Different processes would be OK. I just have to be sure that the processes do not overlap in data, and that no individual time series is split across processes.
Could you provide an example of how to do what you suggest?
Many thanks,
DomF

Re: Parallel processing of time series data

Hi DomF,
Starting with your example query "select * from time_series_data where item_id > X and item_id < Y order by item_id, time_period", I would suggest the following:
1) Define two context variables (int?) named X and Y.
2) Use tInput with the following SQL:
"select * from time_series_data where item_id > " + context.X + " and item_id < " + context.Y + " order by item_id, time_period"

3) Design the rest of the job according to your needs.
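One detail to watch when choosing X and Y: with a strict "item_id > X and item_id < Y" on both ends, an item whose id equals a boundary shared by two neighbouring runs is read by neither. A minimal sketch of one way around this is to use half-open ranges (lower bound inclusive, upper bound exclusive). This is a variation on the query above, not what the steps prescribe, and the class and method names are hypothetical. It also assumes the minimum and maximum item_id are already known, for example from a separate min/max query.

```java
import java.util.ArrayList;
import java.util.List;

final class ItemRangePlanner {
    /** Half-open id range [lo, hi): lo is included, hi is excluded. */
    static final class Range {
        final long lo;
        final long hi;
        Range(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }
    }

    /** Splits the inclusive id span [minId, maxId] into at most `parts` contiguous ranges. */
    static List<Range> split(long minId, long maxId, int parts) {
        if (parts < 1 || maxId < minId) {
            throw new IllegalArgumentException("need parts >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;   // number of distinct id values in the span
        List<Range> ranges = new ArrayList<Range>();
        long lo = minId;
        for (int i = 0; i < parts && lo <= maxId; i++) {
            // spread the remainder over the first ranges so sizes differ by at most 1
            long size = total / parts + (i < total % parts ? 1 : 0);
            ranges.add(new Range(lo, lo + size));
            lo += size;
        }
        return ranges;
    }

    /** Same query as in the thread, but with ">=" on the lower bound (an assumption). */
    static String query(Range r) {
        return "select * from time_series_data where item_id >= " + r.lo
                + " and item_id < " + r.hi + " order by item_id, time_period";
    }
}
```

Since every range is cut on item_id values, each item's whole series lands in exactly one range. The split is by id value, not by row count, so runs can still be uneven if some id ranges hold far more rows than others.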
To execute your job in parallel, you need to export it from Talend. You can then run the exported job multiple times in parallel. For how to set the context values for each instance of your job, see 1615.
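As a rough sketch of the parallel launch, the small Java program below starts one process per X/Y pair and waits for all of them. It assumes the exported launcher script accepts "--context_param name=value" arguments, which is worth verifying against the script generated by your version. The script path, class name, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParallelJobLauncher {
    /** Builds the command line for one run; the --context_param syntax is an assumption. */
    static List<String> command(String script, long x, long y) {
        return Arrays.asList(
                script,
                "--context_param", "X=" + x,
                "--context_param", "Y=" + y);
    }

    /**
     * Starts one process per {X, Y} pair, then waits for all of them.
     * Returns the number of runs that exited with a non-zero status.
     */
    static int runAll(String script, long[][] bounds) throws Exception {
        List<Process> running = new ArrayList<Process>();
        for (long[] b : bounds) {
            ProcessBuilder pb = new ProcessBuilder(command(script, b[0], b[1]));
            pb.inheritIO();            // show each job's output on this console
            running.add(pb.start());   // start() returns at once, so the runs overlap
        }
        int failures = 0;
        for (Process p : running) {
            if (p.waitFor() != 0) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script name and bounds, for illustration only.
        long[][] bounds = { {0L, 1000L}, {1000L, 2000L} };
        int failed = runAll("./time_series_job_run.sh", bounds);
        System.out.println(failed + " run(s) failed");
    }
}
```

Each run is a separate operating-system process with its own context values, so the runs share nothing and cannot coordinate; keeping the X/Y pairs disjoint is entirely up to the caller.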
Bye
Volker

Re: Parallel processing of time series data

Thanks Volker - I played with it a little more and came upon basically the same answer. Much appreciated.