One Star

Aggregate and denormalize

Hi,
I'm trying to execute a transformation that is simple but I cannot figure out how to do it.
Let's consider the following data flow:
TICKET|DATE|USER|TIME
24437|2008/09/15|User1|20|
24433|2008/09/15|User1|10|
24468|2008/09/15|User1|15|
...
I want to transform this into
24437, 24433, 24468|2008/09/15|User1|45|
The transformation consists of aggregating the entries based on the fields DATE and USER (GROUP BY) and with the following operations:
- TICKET: concatenate with "," as separator
- TIME: sum all aggregated entries
The tAggregateRow component works for the SUM operation but doesn't support any CONCAT operation.
I also tried to the tDenormalize component which is supposed to denormalize rows and concatenate the specified field. However, I have no idea how to use it. I can only specify the field "column to denormalize" and the "item separator". I'm expecting at least other fields for choosing the "GROUP BY" conditions. And what about if I want to denormalize several fields at the same time?
Would the latest version (tos3.0.0) solve these issues?
Which components should I use to execute the desired transformation?
Thank you for your help!
Tags (1)
7 REPLIES
One Star

Re: Aggregate and denormalize

Hi,

you can use an tJavaRow :
Create 4 global variable
- put input_Row.Date and input_Row.USER in a global variable IF different of two global variable,
- initialyze of two other global variable
ELSE
- these one is concataned input_Row.TICKETS and the other sumed input_Row.TIME.

after use a tUniqueRow

sorry for my poor english


Jeremie
One Star

Re: Aggregate and denormalize

Thank you for this quick answer but I cannot use a tJavaRow because I'm using a PERL-project.
I cannot switch to a JAVA project because perl is a requirement for my project.
Employee

Re: Aggregate and denormalize

I cannot switch to a JAVA project because perl is a requirement for my project.

Perl is perfectly OK, let's see how you can do in pictures.
Here are the issues you have to know in order to understand the job:
1. there is a bug in tDenormalize : the column to denormalize has to be the last one (can you please create a bugtracker issue related to this?)
2. tAggregateRow has no "concat" (or "concat distinct") operation, but you can do the same in tDenormalize (in a much more configurable way). You can create a feature request in the bugtracker if you think it would be smarter to have "concat " in tAggregateRow.
One Star

Re: Aggregate and denormalize

Thank you for the answer.
I understand your approach but unfortunately it doesn't work very well because:
- in your example, the data is read from cvs files. It's then simple to replicate the input.
- in my case, the data is the result of several other transformations, mappings.
If I want to apply your approach, I would have to duplicate my transformation.
The final result would then be very complex and not very nice any more.
Initially, I thought that I could just use the tReplicate component to avoid this duplication. Unfortunately, because Talend doesn't support cycles, I cannot merge the two data flows back into one later on.
One Star

Re: Aggregate and denormalize

Ok. I could fix it by duplicating the tMySqlOutput component and using "update or insert" for "action on data".
The first branch is responsible of denormalizing the entries (CONCAT).
The second branch is responsible of aggregating the entries (COUNT).
At the end, the second branch updates the entries previously inserted by the first branch.
I just fear that collisions could happen. That means that both branches will insert new entries (rather than updating) if they are executed concurrently. To avoid that, I added a tSleep component. But I'm not sure that it works in any case.
Employee

Re: Aggregate and denormalize

I understand your approach but unfortunately it doesn't work very well because:
- in your example, the data is read from cvs files. It's then simple to replicate the input.
- in my case, the data is the result of several other transformations, mappings.
If I want to apply your approach, I would have to duplicate my transformation.
The final result would then be very complex and not very nice any more.

OK, use memory buffering instead. Go in and add tArray/tArrayIn components. Store your complex transformations result in a tArray, and then use tArrayIn to read it as many times as needed. See screenshots for job design.
Employee

Re: Aggregate and denormalize

Ok. I could fix it by duplicating the tMySqlOutput component and using "update or insert" for "action on data".
The first branch is responsible of denormalizing the entries (CONCAT).
The second branch is responsible of aggregating the entries (COUNT).
At the end, the second branch updates the entries previously inserted by the first branch.

Very smart alternative to the solution I propose, the advantage over my solution is the reduced memory requirement (but I still prefer my solution if you don't have huge data sets).
I just fear that collisions could happen. That means that both branches will insert new entries (rather than updating) if they are executed concurrently. To avoid that, I added a tSleep component. But I'm not sure that it works in any case.

Don't fear collisions at this point. tReplicate doesn't executes its output branches in parallel, but in the defined order. In your screenshot, it means that row12 is executed before row14.
In your solution, my advice is to have a single tMysqlConnection that your 2 tMysqlOutput are using, so that you don't have concurrent transactions.
tSleep is absolutely not a solution, because it performs the "sleep" only once at the end of all rows and because we don't have parallel executions with tReplicate.