how to create a sequence code for a group of records with the same id

One Star

Hello,
Here is the task I am trying to implement in Talend.
I have a fixed-width text file with consumer names, addresses, and some other fields. Every unique household (defined as the combination of the address and last-name fields) has a household identifier (HHID), so every record in the file has an HHID, and all records with the same address and last name share the same HHID.
What I need to do is assign a sequence code starting at 1 for the first record in each group with the same HHID, 2 for the second record in the group, and so on.
My understanding is that I should first sort the data file by HHID and then do a group-by in a calculation step, but I am puzzled about how to get back down to the record level and generate the sequence.
Employee

Re: how to create a sequence code for a group of records with the same id

Hello,
You can do this by creating a variable inside tMap and assigning it the following value:
var.my_seq=Numeric.sequence(row1.adress+row1.lastname,1,1)
This generates one sequence for each distinct value of (row1.adress+row1.lastname).
See screenshot below

Re: how to create a sequence code for a group of records with the same id

Use the built-in sequence function. The first argument is the "name" of the sequence; use your HHID as the name and you will get a separate sequence for each HHID group.
In a tMap:
Perl:
sequence($row,1,1)
Java:
Numeric.sequence(row1.HHID,1,1)
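To make the idea of a named sequence concrete, here is a minimal standalone Java sketch. It is not Talend's actual implementation: the class `KeyedSequence` and its `next` method are hypothetical names, and the sketch simply assumes that one counter is stored per sequence name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a named-sequence helper; not Talend's actual code. */
public class KeyedSequence {
    // One counter is kept per distinct sequence name.
    private static final Map<String, Integer> counters = new HashMap<>();

    /** Returns start on the first call for a name, then adds step on each later call. */
    public static int next(String name, int start, int step) {
        Integer current = counters.get(name);
        int value = (current == null) ? start : current + step;
        counters.put(name, value);
        return value;
    }

    public static void main(String[] args) {
        // The input does not need to be sorted: each HHID has its own counter.
        String[] hhids = {"H1", "H1", "H2", "H1", "H2"};
        for (String hhid : hhids) {
            System.out.println(hhid + " -> " + next(hhid, 1, 1));
        }
        // Prints: H1 -> 1, H1 -> 2, H2 -> 1, H1 -> 3, H2 -> 2 (one per line)
    }
}
```

In this sketch the map holds one entry per distinct name, so its memory use grows with the number of distinct HHIDs rather than with the number of rows.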
One Star

Re: how to create a sequence code for a group of records with the same id

Thank you for the prompt responses, guys! I tried that and it works on a small file. When I ran it on an 8-million-record file, the process failed after a few minutes with an out-of-memory error.
That is not very nice if you want to use it for production jobs. I noticed that Talend consumed more and more memory until it had allocated all the memory available in the OS, including the page file, and then crashed. Any ideas how to overcome that?
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
disconnected
at demo5min.boristest_0_1.boristest.tFileInputPositional_1Process(boristest.java:1913)
at demo5min.boristest_0_1.boristest.runJobInTOS(boristest.java:2090)
at demo5min.boristest_0_1.boristest.main(boristest.java:1962)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at org.talend.fileprocess.delimited.RowParser.readRecord(RowParser.java:156)
at demo5min.boristest_0_1.boristest.tFileInputPositional_1Process(boristest.java:1267)
Employee

Re: how to create a sequence code for a group of records with the same id

Post a full screenshot of your job. We'll help you identify where the problem is coming from.
One Star

Re: how to create a sequence code for a group of records with the same id

here you go...I appreciate your help, guys
Seven Stars

Re: how to create a sequence code for a group of records with the same id

The suggested approach creates a new variable for each HHID, so you will run out of memory if your data contains a large number of unique HHIDs.
A better way might be to replace your tMap with a tFilterColumns to reduce your flow to just (hhid, city, state) before you sort it, and then use a tJavaRow to create the sequence along these lines:
// Rows must arrive sorted by hhid. PreviousHHID is null on the first row,
// so call equals() on the current value to avoid a NullPointerException.
if (input_row.hhid.equals(globalMap.get("PreviousHHID"))) {
    // Same household as the previous row: increment the counter.
    globalMap.put("SeqNum", (Integer) globalMap.get("SeqNum") + 1);
} else {
    // New household: restart the counter at 1.
    globalMap.put("SeqNum", 1);
}
output_row.hhid = input_row.hhid;
output_row.seqnum = (Integer) globalMap.get("SeqNum");
output_row.city = input_row.city;
output_row.state = input_row.state;
// Remember this row's HHID for the comparison on the next row.
globalMap.put("PreviousHHID", input_row.hhid);
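The same previous-key comparison can be tried outside Talend. The following is a minimal standalone sketch, assuming the keys are already sorted; the class `GroupSequence` and its `assign` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupSequence {
    /** Assigns 1, 2, 3... within each run of equal keys; the input must be sorted by key. */
    public static List<Integer> assign(List<String> sortedKeys) {
        List<Integer> result = new ArrayList<>();
        String previous = null; // key of the previous row; null before the first row
        int seq = 0;
        for (String key : sortedKeys) {
            // Increment within a run of equal keys, restart at 1 when the key changes.
            seq = key.equals(previous) ? seq + 1 : 1;
            previous = key;
            result.add(seq);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hhids = Arrays.asList("100", "100", "200", "300", "300", "300");
        System.out.println(assign(hhids)); // prints [1, 2, 1, 1, 2, 3]
    }
}
```

The result list exists only to make the demo printable. When rows are streamed one at a time, the only state that has to be kept is the previous key and the current counter, regardless of how many distinct HHIDs there are.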
Employee

Re: how to create a sequence code for a group of records with the same id

I don't know how many rows you're processing, but try the "sort on disk" option of the tSort component.
Six Stars

Re: how to create a sequence code for a group of records with the same id

FYI, a component in the new versions, tMemorizeRows, was created to encapsulate and simplify the behaviour of the tJavaRow logic above (i.e. incrementing on field change).
One Star

Re: how to create a sequence code for a group of records with the same id

My sample file is a bit over 4 million records. I have 2 GB of RAM on my desktop, and the process crashes consistently when it reaches 400,000-something records.
I just tried the sort-on-disk option and observed interesting results. I moved the sort component before the tMap and also reduced the number of columns passed using tFilterColumns, as suggested by emaxt6.
After that, I set the tSort buffer to 600,000 bytes and the process crashed again with an out-of-memory error. Then I set it to 1,000,000 bytes and this time it crashed with a different error (compete for multiple threads failed). Then I set it to 300,000 bytes; it ran beyond my crash point, but it was so terribly slow that I had to kill the process.
Based on what I see, I cannot get past sorting, so I cannot try the other things suggested here by alevy and emaxt6.
To be honest, I do not like the tJavaRow approach because it looks to me like hardcoding, which our company tries to avoid by any means because of the huge number of issues that hardcoded processes caused in the past.
I also failed to find any documentation on tMemorizeRows; it is simply missing from the help file.
So I guess my question is how to sort the data first: my file is over 4 million rows and the job crashes every time it reaches a bit over 400,000 records.
Six Stars

Re: how to create a sequence code for a group of records with the same id

Do you need sorted *output*? If not, have you tried removing the sort step? It does not seem necessary if you only need the sequence function to assign a counter per ID.
If you need sorted output, have you tried writing the data with the generated IDs to a temporary embedded file database such as HSQLDB (embedded in Talend), perhaps creating an index beforehand on the column you want to sort by?
Otherwise, you should provide a sample of test data so that we can reproduce the problem.
Seven Stars

Re: how to create a sequence code for a group of records with the same id

tMemorizeRows is not documented yet because it is only in v4.1.0, which has yet to be released.
As for having 2 GB of RAM on your desktop: frankly, you should have at least 4 GB, and preferably 8 GB, to work with large data sets.
Six Stars

Re: how to create a sequence code for a group of records with the same id

Yes, if under the hood Talend cannot rely on SQL cursors alone and is forced to hold data in memory (as in the case of sorting), you will certainly need a large amount of RAM. For every parsed line there is the raw data plus object overhead, TIMES the number of rows... go figure.
Alevy is right: if you need such volumes in production, do yourself a favour and use a 64-bit platform with a 64-bit JVM.
Bye
One Star

Re: how to create a sequence code for a group of records with the same id

This simple job is my way of quickly evaluating an ETL tool: take a file with at least 4-5 million records, sort it, build a group sequence code, and write it back to a file.
As you said, sorting and sequencing are very heavy operations, so you can easily see how the product behaves under stress load.
Since Talend is now positioned as a professional, enterprise-grade product, I think it must behave better, for example by using file buffers to do the job when it runs out of RAM (much more slowly, of course, but without crashing on users).
Do not get me wrong: Talend is a fantastic and, I would say, unique product. Talend's team does a great job and has made a huge leap in functions and features in the last 2 years, but I think it still needs some polishing.
Regarding RAM requirements: I am going to try to run this job on our 64-bit, 4-core server with 4 GB and will report back how it went. However, I have installed quite a few ETL tools on my desktop (neither free nor open source, though) and they all did the job without crashing; some were terribly slow, some were significantly better.
I must also say that the source file for my job is not that scary big or wide: the record length is a bit over 1,000 bytes and there are 4,000,000 records.
That is nothing nowadays, when you have to deal with terabytes of data.
Thank you all for your help!
Six Stars

Re: how to create a sequence code for a group of records with the same id

Here I certainly agree with you. Talend is a very flexible tool, but it needs more control and refinement regarding memory management. Too many things are delegated to plain Java without additional layers, and Talend is quite naive in this matter (i.e. no use of NIO memory-mapped files, memory compression, etc.).
Bye
One Star

Re: how to create a sequence code for a group of records with the same id

Understood, and as I said, I like Talend and I want to make it work. But if I have these challenges with a simple job, I do not even want to think about how our production jobs would be handled (a typical job of ours deals with 100-200 million records, 10 lookup tables, and 50-70 data processing steps such as conversion, flagging, and deduping).
I am going to try to run it on our server, but do you have any tips on how to manage memory better in Talend?
So far I have understood that:
1) I need to limit the number of fields passed (I already tried that by reducing the field set to the HHID field and a few flags only; it did not help much)
2) Turn on the sort-on-disk option and play with the buffer size
3) Change the JVM settings?
Anything else I can try?
Thanks a bunch!
Employee

Re: how to create a sequence code for a group of records with the same id

You have the tExternalSortedRow component available, which uses the popular sort binary (see the GNU website). It should solve your memory issue, but your job will need an external resource in order to run.
Six Stars

Re: how to create a sequence code for a group of records with the same id

For production purposes I definitely suggest a 64-bit platform with memory sized to the data being handled.
That said, your test job MUST NOT crash on your test case, even with limited RAM on a 32-bit platform, if you are using the sort-on-disk option, because a chunk-based sort algorithm is designed to work within limited memory. It does crash, so this may be a bug; you should post test data and the job, and open a ticket.
To Talend: please verify the "buffer size of external sort" option or rename it, because it can be read as a number of bytes (and the documentation states this), but in the generated code it is the SIZE of an array of objects, so it does not translate directly to memory usage.
Sorting is a very common requirement, so it should be a first-class component in every ETL tool. By delegating such a basic task to an external process like Unix sort you lose control and portability; it looks like a hacked-in solution, and ultimately it is quite an admission of defeat for a tool designed to handle data.
If I may offer some advice, take a look at http://brie.di.unipi.it/smalltext/, a pure Java library that implements external sorting with mergesort on text data.
Hope it helps.
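To illustrate what a chunk-based external sort does, here is a minimal standalone Java sketch. It is neither Talend's tSort code nor the library linked above; the class `ExternalSort` and its methods are hypothetical names. It assumes one record per line, sorts whole lines lexicographically (which matches numeric order only for fixed-width, zero-padded keys at the start of the line), and keeps at most `chunkSize` lines in memory at once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    /** Sorts the lines of input into output, buffering at most chunkSize lines at a time. */
    public static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= chunkSize) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        merge(chunks, output);
        for (Path chunk : chunks) {
            Files.deleteIfExists(chunk);
        }
    }

    /** Sorts one in-memory chunk and spills it to a temporary file. */
    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("sort-chunk-", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    /** An open reader on one chunk file together with its current line. */
    private static final class Cursor {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
    }

    /** K-way merge: repeatedly emits the smallest current line among all chunks. */
    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                Cursor c = new Cursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
                if (c.line != null) heap.add(c); else c.reader.close();
            }
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                out.write(c.line);
                out.newLine();
                c.line = c.reader.readLine();
                if (c.line != null) heap.add(c); else c.reader.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Usage: java ExternalSort <input> <output>");
            return;
        }
        // The chunk size is an arbitrary illustrative value, counted in lines, not bytes.
        sort(Paths.get(args[0]), Paths.get(args[1]), 500000);
    }
}
```

With this design the peak memory use is governed by the chunk size and the number of chunk files open during the merge, not by the total size of the input. The sketch leaves out cleanup of readers and temporary files when an exception occurs.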
One Star

Re: how to create a sequence code for a group of records with the same id

emaxt6, you are a real asset on this forum!
Well, this morning I tried to sort my sample file on one of our data servers (Xeon 2.33 GHz, 1 CPU x 4 cores, 4 GB RAM, 64-bit Windows Server 2003 R2).
Guess what? I am getting exactly the same errors, just a lot faster :)
The weird thing is that there is still 1 GB of RAM available when it crashes with the heap out-of-memory error. I tried sort in memory and sort on disk, and I changed the sort buffer parameter; the result is the same.
A few times I saw this error though:
Exception in thread "main" java.util.ConcurrentModificationException
at java.util.LinkedList$ListItr.checkForComodification(Unknown Source)
at java.util.LinkedList$ListItr.next(Unknown Source)
at routines.system.RunStat.sendMessages(RunStat.java:244)
at routines.system.RunStat.stopThreadStat(RunStat.java:228)
at boristest.achsort_0_1.achsort.tFileInputPositional_1Process(achsort.java:1890)
at boristest.achsort_0_1.achsort.runJobInTOS(achsort.java:2066)
at boristest.achsort_0_1.achsort.main(achsort.java:1940)
I even significantly reduced the number of fields passed to tSort: it is now only 4 fields with a total record length of 35 bytes. The field used for sorting is a 13-byte number.
I also did a quick search on this forum, and it seems I am not the only one who has issues with sorting...
One Star

Re: how to create a sequence code for a group of records with the same id

Thank you for the tip, but if I need to use external components for something as essential in the ETL world as sorting, I just do not see much value in this product. Besides, I am really concerned about the crashes: a sort algorithm can run slowly when it reaches the limits of RAM, CPU, etc., but it should not crash, at least not if you position your product for commercial applications.
Six Stars

Re: how to create a sequence code for a group of records with the same id

@boris
You should really post the test data and job definition if you want the problem solved, so that it can be reviewed more precisely; it could be a subtle bug somewhere in the stack or a memory leak.
See this benchmark, which sorts a 3.3-billion-row, 415 GB dataset:
http://blogs.sun.com/aja/entry/talend_s_new_data_processing
@talend
Please correct the documentation regarding the buffer size of tSort; it is misleading.
Thanks
Seventeen Stars

Re: how to create a sequence code for a group of records with the same id

Hi all,
I hope I have understood what you're looking for.
First, read the input file, filter the columns to keep only HHID, sort it, and keep one instance of each HHID with tUniqRow.
Then, in a tMap, assign a sequence for each HHID (this is my lookup).
Main flow: read the whole input file again (sic), do an inner join on HHID in the tMap, and write the result to a file.
The bad thing is that the whole input is read twice ;) but I have not found another solution so far!
I hope this helps.
PS:
my sequence test
aa;jdk;aaadr1;adr1
abc;hd;abcadr2;adr2
aa;idf;aaadr1;adr1
cc;djf;ccadr1;adr1

number of row : nearly 5 000 000
sort on disk
One Star

Re: how to create a sequence code for a group of records with the same id

Hi kzone, thank you for taking the time and effort to help me out!
You got that right and your flow looks great, but what I cannot do is get past sorting. Even if I filter the columns at the start of the flow and keep only HHID, the process still crashes with an out-of-memory error.
HHID is a 13-byte number in my case: all digits, no letters.

Re: how to create a sequence code for a group of records with the same id

Can you post your job code so we can take a look?
Right-click on the job name in the client and select "export items".
In the dialog, select "archive file" and define the output location and name.
Upload the zip file here (it should be very small).
One Star

Re: how to create a sequence code for a group of records with the same id

Here you go:
http://dl.dropbox.com/u/1351927/testsort_0.1.zip
I removed the sequencing and the tMap since I could not get past tSort. It crashes every time it reaches about 450,000 records. I would send you my test file, but I cannot since it contains some proprietary info.
Employee

Re: how to create a sequence code for a group of records with the same id

Hello boris, could you provide an "export items" archive of your job, please? testsort_0.1.zip is an exported job script file.
One Star

Re: how to create a sequence code for a group of records with the same id

Hi gatigossou, sorry about that. Please see here:
http://dl.dropbox.com/u/1351927/testsort.zip
Employee

Re: how to create a sequence code for a group of records with the same id

Hello boris,
This job shows an example configuration for sorting about 5,000,000 records.
The JVM arguments are set to -Xms256M -Xmx1548M and the buffer size is set to 500000.
You can download my job here:
http://www.talendforge.org/exchange/tos/extension_view.php?eid=312

Best regards,
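The Java heap is capped by the -Xmx argument rather than by the machine's physical RAM, so a job can fail with "Java heap space" while the OS still reports free memory. A quick way to check which limit a JVM actually picked up is a small standalone program like the sketch below; the class name `HeapCheck` is hypothetical, and only the standard `Runtime.maxMemory()` call is used.

```java
public class HeapCheck {
    /** Returns the maximum heap this JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with the same arguments as the job, e.g. java -Xms256M -Xmx1548M HeapCheck
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

The reported figure can be slightly lower than the -Xmx value, depending on the JVM and garbage collector in use.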
One Star

Re: how to create a sequence code for a group of records with the same id

Hurray! I adjusted the tSort settings on my test job as you said, and it finally completed without an out-of-memory error. It took a while on my desktop, though: 18 minutes to sort 6.3 million records. I would still be concerned about using this for production jobs; having to fine-tune the settings every time and guess whether the job will fail does not work that well, especially when you are talking about 200-million-record files (a typical size for US nationwide consumer listing files, for example).
Thank you for your help! Now I need to go back and see whether my original task (creating the sequence number) works.
Six Stars

Re: how to create a sequence code for a group of records with the same id

Yes, sorting is such an important requirement that building some kind of heuristic into it (i.e. auto-tuning the buffer, falling back to disk when memory pressure is high) would be a very welcome addition from Talend.