Normalizing more than one column without using a custom routine

Highlighted
Six Stars

Normalizing more than one column without using a custom routine

Hi,

 

tNormalize normalize only one row, I want to normalize 3 columns which are related to each other:

 

Sample data:

 

 

idaddressemailphone
1add1~add2~add3a@a.com~a@b.com12345678~8765432~12345
2add1~add2b@b.com~c@c.com12345678~8765432
3add1~add2~add3a@a.com~a@b.com~d@d.com12345678~8765432~12345
4add1~add2b@b.com123456~8765
5add1~add2~add3  
6add1b@b.com8765432
7add1~add2~add3b@b.com~c@c.com8765432

 

I want to normalize this table in a way which

  • first phone and first email go to the first address
  • the second phone and the second email go to the second address if they exist.
  • and so on...

 

Desired output for sample data

 

 

idaddressemailphone
1add1a@a.com12345678
1add2a@b.com8765432
1add3 12345
2add1b@b.com12345678
2add2c@c.com8765432
3add1a@a.com12345678
3add2a@b.com12345678
3add3d@d.com12345
4add1b@b.com123456
4add2 8765
5add1  
5add2  
5add3  
6add1b@b.com8765432
7add1b@b.com8765432
7add2c@c.com 
7add3  



I there any other component that I can use? 

Or any java thought that I can put in a tjava?

 

Thank you.

 

 


Accepted Solutions
Nine Stars

Re: Normalizing more than one column without using a custom routine

Hi,

 

I see 2 options :

 

1) use a tJavaFlex. Assuming you are confident that you'll never have more that 3 (for example) address, you can use a tJavaFlex followed by a tFilterRow to produce the desired result:

<data_source> --(main)--> tJavaFlex --(main)--> tFilterRow

The idea is to create 3 output rows for each input rows and in each output row, put the nth item of the input row if it exists. The tFilterRow will simply discard rows with no data at all (or no address or whatever).

 

The tJavaFlex code will contain something like this :
BEGIN (replace 3 with the maximum number of data you can have) :

for (int i=0; i<3; i++) {

MAIN (row1 is the input row, row2 the output):

	String[] add;
	String[] email;
	String[] phone;
	
	row2.id = row1.id;
	
	add=row1.add.split("~", -1);
	if (add != null && add.length > i)
		row2.add=add[i];
	else
		row2.add="";
		
	email=row1.email.split("~", -1);
	if (email != null && email.length > i)
		row2.email=email[i];
	else
		row2.email="";		
		
	phone=row1.phone.split("~", -1);
	if (phone != null && phone.length > i)
		row2.phone=phone[i];
	else
		row2.phone="";

Please note that it's certainly possible to optimize this code! Take this as a guideline Smiley Happy

The idea is to split each input data and take the nth item if it exists.

 

2) Another option would be to first split your data into 3 table (with a tMap for example) :
id,address / id,email / id,phone
You then normlize each table separatly and add a sequence for each "row" of the same id:
id,row,address / id,row,email / id,row,phone
with row being something such as Numeric.sequence("address_"+id,1,1)
And at the end, you join all 3 datasets back.

I think it should also work assuming the order is preserved will going through the tNormalize.

 

In doubt, I would probably prefer the first option: it has more java code but it's probably safer and more efficient.

 

Regards.


All Replies
Nine Stars

Re: Normalizing more than one column without using a custom routine

Hi,

 

I see 2 options :

 

1) use a tJavaFlex. Assuming you are confident that you'll never have more that 3 (for example) address, you can use a tJavaFlex followed by a tFilterRow to produce the desired result:

<data_source> --(main)--> tJavaFlex --(main)--> tFilterRow

The idea is to create 3 output rows for each input rows and in each output row, put the nth item of the input row if it exists. The tFilterRow will simply discard rows with no data at all (or no address or whatever).

 

The tJavaFlex code will contain something like this :
BEGIN (replace 3 with the maximum number of data you can have) :

for (int i=0; i<3; i++) {

MAIN (row1 is the input row, row2 the output):

	String[] add;
	String[] email;
	String[] phone;
	
	row2.id = row1.id;
	
	add=row1.add.split("~", -1);
	if (add != null && add.length > i)
		row2.add=add[i];
	else
		row2.add="";
		
	email=row1.email.split("~", -1);
	if (email != null && email.length > i)
		row2.email=email[i];
	else
		row2.email="";		
		
	phone=row1.phone.split("~", -1);
	if (phone != null && phone.length > i)
		row2.phone=phone[i];
	else
		row2.phone="";

Please note that it's certainly possible to optimize this code! Take this as a guideline Smiley Happy

The idea is to split each input data and take the nth item if it exists.

 

2) Another option would be to first split your data into 3 table (with a tMap for example) :
id,address / id,email / id,phone
You then normlize each table separatly and add a sequence for each "row" of the same id:
id,row,address / id,row,email / id,row,phone
with row being something such as Numeric.sequence("address_"+id,1,1)
And at the end, you join all 3 datasets back.

I think it should also work assuming the order is preserved will going through the tNormalize.

 

In doubt, I would probably prefer the first option: it has more java code but it's probably safer and more efficient.

 

Regards.

2019 GARNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

What’s New for Talend Summer ’19

Watch the recorded webinar!

Watch Now

Best Practices for Using Context Variables with Talend – Part 1

Learn how to do cool things with Context Variables

Blog

Migrate Data from one Database to another with one Job using the Dynamic Schema

Find out how to migrate from one database to another using the Dynamic schema

Blog

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog