Merging multirecord file using tHash-tSort-tUnite

Seven Stars


Hi,

 

I have a multi-record file to which I have to make several changes: for each record type I have a tMap where I make my changes. After that I have to generate a single file by sorting the entire output on a specific field. At the moment I write a file for each record type with the append option, but the output is not sorted. So, I am trying to collect the record types with tHashOutput/tHashInput and then link them to tUnite and tSortRow, but the generated file is not correct. If I use tFileOutputDelimited, the .txt file is sorted correctly but there is no padding between columns/fields. If I use tFileOutputPositional instead, I cannot pad each field properly because of the tUnite, so each record is staggered with the wrong padding. My job is below:

 

Does anyone have any idea? 

Attachments: JOB part1.png, JOB part2.png

Community Manager

Re: Merging multirecord file using tHash-tSort-tUnite

I'm afraid you need to explain this more and maybe give examples of the data. Alternatively send us your job and your file so we can see what is happening. 

 

However, I am guessing I know what your problem is. You have several record types in your file, with a different number of columns and different padding. You extract these in different flows so that you can keep the record structure intact. I am assuming that your rows need to go out in exactly the same order as they come in? If so, you can try this.

 

When you read the data in the 5 read flows, add a new column to each row as a "count" column. Do that with a tMap and THEN filter your records. This will give you a recordset at the end of the process with a column for the record number at the beginning or end of each record. You can use the

routines.Numeric.sequence(seqName, startValue, step) method to generate the sequences.
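
For example, the expression for the new Integer column in the tMap could look like this (the sequence name "order_seq" is just an example):

Numeric.sequence("order_seq", 1, 1)

Each call with the same sequence name returns the next value (1, 2, 3, ...), so every incoming row gets its position in the file.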

 

My assumptions are these.

1) Your file must be input and output in the same order

2) Each of your records is used only once

 

Assuming your data looks similar to below....

AAAAPPPPQQQQUUUUUIIIIII -----> {4,4,4,5,6} Type A

BBNNNMMMMHHHHH -----> {2,3,4,5} Type B

FFFFFGGGGGJJJJJTTTTT -----> {5,5,5,5} Type C

BBBBCCCCTTTTLLLLLRRRRRR -----> {4,4,4,5,6} Type A

PPKKKVVVVDDDDD -----> {2,3,4,5} Type B

AAAAAWWWWWQQQQQYYYYY -----> {5,5,5,5} Type C

 

Your records would come out like this....

1,AAAA,PPPP,QQQQ,UUUUU,IIIIII

2,BB,NNN,MMMM,HHHHH

3,FFFFF,GGGGG,JJJJJ,TTTTT

4,BBBB,CCCC,TTTT,LLLLL,RRRRRR

5,PP,KKK,VVVV,DDDDD

6,AAAAA,WWWWW,QQQQQ,YYYYY

 

As such you have an order. Now all you need to do is keep that order column until you have finished processing and then order by it before outputting to your new version of the file (omitting the order column of course).
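
To make the record structure explicit, here is a minimal plain-Java sketch of the fixed-width split behind the example above (FixedWidthSplit is purely illustrative; in the job, the tFileInputPositional components do this split for you):

import java.util.ArrayList;
import java.util.List;

public class FixedWidthSplit {

    // Split a fixed-width record into fields using the per-type column widths.
    static List<String> split(String record, int[] widths) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int w : widths) {
            fields.add(record.substring(pos, pos + w));
            pos += w;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Type A widths {4,4,4,5,6} applied to the first example record:
        System.out.println(split("AAAAPPPPQQQQUUUUUIIIIII", new int[]{4, 4, 4, 5, 6}));
        // -> [AAAA, PPPP, QQQQ, UUUUU, IIIIII]
    }
}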

 

Seven Stars

Re: Merging multirecord file using tHash-tSort-tUnite

I am attaching myJob, myInputFile and myOutputFile. 

I am trying to explain the issue a bit better. I have 5 input flows. I create flows 3 and 4 from flow 2. Each flow has a sequence named "progressivo", but flows 3 and 4 have their own. This is because, when they exist, I have to put flows 3 and 4 after flow 2. For instance:

 

             progressivo (seq)

flow1            1

flow2            2

flow3            2

....

....

flow1           1

flow2           2

flow2           3

flow2           4

flow3           4

 

Flows 3 and 4 may not exist. This is the reason I use the filters.

It seems that when I link tFileOutputPositional to tHashOutput something goes wrong (some fields' padding).

Community Manager

Re: Merging multirecord file using tHash-tSort-tUnite

You've sent the wrong job. Can you package it up and send it again?

I still think my suggested solution would work, given your example though.

Seven Stars

Re: Merging multirecord file using tHash-tSort-tUnite

I am sorry. I updated the package.

I made a change and linked the tHashOutput components in order to have only one tHashInput.

Community Manager

Re: Merging multirecord file using tHash-tSort-tUnite

OK, I see your error. You are expecting the tFileOutputPositional files to carry out the padding for you and for that data then to be sent to the tHashOutput components. That will not happen. You also cannot guarantee the order in which the data will arrive in the tFileOutputPositional files or the tHashOutput components, because you essentially have 5 independent flows processing your data. You also need to ensure that all of your tHashOutput components have the same schema, since you have joined them.

 

This will take a lot of work to fix, but I can give you a method that will work.

 

1) Extract your data as you are at the moment (as far as the tFileInputPositional components go). Then, join those to separate tMap components. Work out the maximum number of columns you will have in the biggest data set. If this is 60, set the output of every tMap to have 60 String columns and 1 Integer column. If your tFileInputPositional just has 5 columns, join those to the first 5 columns of the 60+1 column output. If it has 60, join them in order to the first 60 columns of the tMap output. Do this for every one of the 5 input streams. Then, for your Integer column, use the routines.Numeric.sequence(seqName, startValue, step) method to get your row order. Since your tFileInputPositional components read from the top of the file to the bottom, your sequence will be the same for every stream. Now connect to your tFilterRow and continue. A sketch of this step follows below.
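
As a plain-Java illustration of this step (NormalizeDemo, the widen method, the 60-column width and the counter are all hypothetical stand-ins; in the job, the tMaps and routines.Numeric.sequence do this):

import java.util.Arrays;

public class NormalizeDemo {

    static int seq = 0; // stand-in for routines.Numeric.sequence

    // Widen any record type to the same 60+1 column shape: the order number
    // first, then the original fields, then empty Strings as filler.
    static String[] widen(String[] fields) {
        String[] out = new String[61];
        Arrays.fill(out, "");
        out[0] = String.valueOf(++seq);
        System.arraycopy(fields, 0, out, 1, fields.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(widen(new String[]{"AAAA", "PPPP", "QQQQ"})));
        System.out.println(Arrays.toString(widen(new String[]{"BB", "NNN"})));
    }
}

Every stream produces the same 61-column shape, which is what lets the tHashOutput components share one schema.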

 

2) Sort out your data mapping/alterations according to the new schema. Remember that this schema is being used JUST TEMPORARILY to help with your processing and ordering. I suspect that you can get rid of the tFileOutputPositional components now. However, you will need to deal with the column padding manually. You can do this with either.....

routines.StringHandling.LPAD(first_string, length, second_string)

....or.....

routines.StringHandling.RPAD(first_string, length, second_string)

....check them out in the documentation. These can be set in the tMaps where you process your data. REMEMBER that columns you know will be empty for a given row still need to be set: pad an empty String ("") to the column's full width rather than leaving it null. This pre-padded data needs to go to the tHashOutput components (keep them linked).
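
For example, assuming the signatures above, the tMap output expressions might look like this (row1.field1 and the width 10 are placeholders):

StringHandling.RPAD(row1.field1, 10, " ")

....and for a column that you know is empty for a given record type, pad an empty String to the field's full width....

StringHandling.RPAD("", 10, " ")

This way, every record reaches the tHashOutput components at the same fixed overall width.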

 

3) At the bottom of your job, where you are writing out your data, you simply need to order by the sequence column from earlier. Then connect to a tMap and concatenate all of the columns, outputting to 1 column. Then connect to a tFileOutputDelimited with one column and the column separator set to an empty String ("").
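
The concatenation expression in that final tMap can be as simple as this (the column names are placeholders):

row1.col1 + row1.col2 + row1.col3 /* ...and so on through... */ + row1.col60

Because every column was already padded to its fixed width in step 2, the concatenated String is a correctly aligned positional record, and the tFileOutputDelimited with an empty separator writes it out unchanged.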

 

What you will have should be the result you want. It will take a while for you to build this, but check it along the way to make sure that it looks right. Use tLogRow components to do this.

 

Seven Stars

Re: Merging multirecord file using tHash-tSort-tUnite

Hi,

 

thank you for your solution! I will try it! It seems a laborious procedure: I have to pad each field and add additional columns for each dataset. What should the length of the additional columns in the datasets be?

 

For now I have solved it this way: I export the file, with my "progressivo" (seq) as the first field of each dataset, appending each output with the several tFileOutputPositional components. Then I import the generated file and order the records using my "progressivo". Finally, I export the final file with tFileOutputDelimited and the work is done.

Seven Stars

Re: Merging multirecord file using tHash-tSort-tUnite

@rhall_2_0 

 

Hi,

I am trying to apply your solution to another case. I have an input file which I split into Detail_1 and Detail_2. The two Detail input flows have different schemas, so I followed these steps:

 

- input the same number of columns: tMap_2 and tMap_4 have the same number of columns, as shown in the attached screenshot;

- I created a sequence in order to keep together the records that must go together;

- I filtered the rows I needed;

- I linked tHashOutput_2 to tHashOutput_1.

 

Then I tried to create a single record by concatenating all the columns, as shown in tMap_5. But right now I obtain the following result:

 

rec1: 0100200000000010000000000000024539970049914068 null
rec2: 0100200000000010000000000000035256103270096532 null
rec3: 0100200000000010000000000000045264308870139646 001null
rec4: 164469600000001X100000000000000450D0E7776A14F12F000000000001
rec5: 0100200000000010000000000000055264308870139646 null
rec6: 0100200000000010000000000000065264308870139646 001null
rec7: 164469600000001X1000000000000006F79102C95FFF691D000000000002

 

This result is the same as what I already have in my input file.

 

Instead, my expected result is:

 

rec1: 0100200000000010000000000000024539970049914068 
rec2: 0100200000000010000000000000035256103270096532 

 

rec3: 0100200000000010000000000000045264308870139646 001164469600000001X100000000000000450D0E7776A14F12F000000000001

 

rec4: 0100200000000010000000000000055264308870139646 

 

rec5: 0100200000000010000000000000065264308870139646 001164469600000001X1000000000000006F79102C95FFF691D000000000002

 

So, records 3 and 4, and records 6 and 7, must go together as a single record.

 

What am I doing wrong? Can you help me, please?

I hope I explained the issue well.

 

 

 

 
