One Star

How to normalize a row into multiple rows from the names of columns?

I have an input csv with a format like the following:
id, service1, service2, service3
x, 123, 234, 345
y, 456, 567, 678

I would like to produce an output with the following format:
id, type, qty
x, service1, 123
x, service2, 234
x, service4, 345
y, service1, 456
y, service2, 567
y, service3, 678

In other words, I need to take a row with many attributes and normalize it into multiple rows, where the value of the "type" column in the output is equal to the name of the column in the input, and the value of the "quantity" column in the output is equal to the value of the name of the column in the input. I would prefer to do this by introspecting the schema of the input to determine the value of the "type" attribute in the output schema. Each input row is normalized into multiple output rows.
This is my current solution:
tFileInputDelimited -> tJavaRow -> tNormalize -> tExtractDelimitedFields -> tLogrow
The tJavaRow has the following schema (id, line) and its code is as such:
output_row.id = input_row.id;
output_row.line = "service1:" + input_row.service1 + ";service2:" + input_row.service2 + ";service3:" + input_row.service3;

The tFileInputDelimited has the schema (id, service1, service2, service3), as noted above.
The tNormalizeRow normalizes the column "line" using the item separator ";".
The tExtractDelimitedFields has the following schema (id, type, quantity), the field to split is "line", and the field separator is ":".
This works, but I don't like doing it this way. If a new service, such as "service4" is added, I will have to change the schema of the tFileInputDelimited, and change the code in tJavaRow to account for the -- ";service4:" + input_row.service4; -- part. If a service is removed, I would have to remove it in both spots. In reality there are many services, so the tJavaRow will get very long.
I would prefer a solution that could introspect the schema of the tFileInputDelimited, and assume that any column other than "id" (or additional, specified attributes) is a service that should be normalized into the "type" column.
What is the best way to accomplish this using Talend?
3 REPLIES
Four Stars

Re: How to normalize a row into multiple rows from the names of columns?

Hi,
Separate your flow in two steps. First separate your type fields by ';' operator using
tfileinputrow-->String handling (replace all occurances of ',' by ';' except first)-->tfileoutput
Your data would look like
x, 123; 234; 345
y, 456; 567; 678
use following flow to generate service1,service2..etc

Following code was used in tJavarow
//Code generated according to input schema and output schema
output_row.id = input_row.id;


if (id_tMemorizeRows_1== id_tMemorizeRows_1)
{
    context.cnt = context.cnt+1;
    output_row.Services = "Services"+context.cnt.toString();


}
else if (id_tMemorizeRows_1!= id_tMemorizeRows_1)
{
    context.cnt =  1;
    output_row.Services = "Services"+context.cnt.toString();
   
}
output_row.Type = input_row.Type;

Thanks
vaibhav
One Star

Re: How to normalize a row into multiple rows from the names of columns?

Hi Vaibhav,
Thanks for your answer. However, the service names are not actually "service1", "service2"; they do not increment by one each time. As an example, the services could be named "Paris","Red","January","Samsung","Obama". The logic to increment a count variable by one and append it to the string "service" would not work in this case. Do you know of a way to introspect the schema at run time such that an array could be formed dynamically based on the schema? In that case, your count method would still work as it could access the array for the correct column.
Thanks,
Matthew Moisen
One Star

Re: How to normalize a row into multiple rows from the names of columns?

Hi,
Separate your flow in two steps. First separate your type fields by ';' operator using
tfileinputrow-->String handling (replace all occurances of ',' by ';' except first)-->tfileoutput
Your data would look like
x, 123; 234; 345
y, 456; 567; 678
use following flow to generate service1,service2..etc

Following code was used in tJavarow
//Code generated according to input schema and output schema
output_row.id = input_row.id;


if (id_tMemorizeRows_1== id_tMemorizeRows_1)
{
    context.cnt = context.cnt+1;
    output_row.Services = "Services"+context.cnt.toString();


}
else if (id_tMemorizeRows_1!= id_tMemorizeRows_1)
{
    context.cnt =  1;
    output_row.Services = "Services"+context.cnt.toString();
   
}
output_row.Type = input_row.Type;

Thanks
vaibhav