[resolved] How to duplicate lines dynamically in a csv file

One Star

Does anyone have a solution in TOS for duplicating lines other than tFlowToIterate or a tMap join, which are not very efficient for large volumes of data?
Example:

input:

key1, 3
key2, 2
key3, 5
...

output:

key1
key1
key1
key2
key2
key3
key3
key3
key3
key3
...
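
For reference, the expected transformation can be pinned down in plain Java, outside of any Talend component. This is only a minimal sketch: the class and method names (`LineDuplicator`, `expand`) are hypothetical, and it assumes every non-blank line has the form "key, count" with a valid integer count. It streams line by line instead of loading the file, which matters for large inputs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

public class LineDuplicator {

    /**
     * Reads "key, count" lines and writes each key `count` times, one per line.
     * Blank lines are skipped; malformed lines are NOT handled (assumption).
     * Returns the number of lines written.
     */
    static long expand(BufferedReader in, BufferedWriter out) throws IOException {
        long written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(",", 2);
            String key = parts[0].trim();
            int count = Integer.parseInt(parts[1].trim());
            for (int i = 0; i < count; i++) {
                out.write(key);
                out.write('\n');
                written++;
            }
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo using the sample input from the question.
        StringWriter result = new StringWriter();
        try (BufferedReader in = new BufferedReader(new StringReader("key1, 3\nkey2, 2\nkey3, 5\n"));
             BufferedWriter out = new BufferedWriter(result)) {
            expand(in, out);
        }
        System.out.print(result);
    }
}
```

For a real file, the same method could be fed a reader and writer over the input and output files rather than the in-memory strings used in `main`.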

I have 4 million lines in input.
Thanks!
Seven Stars

Re: [resolved] How to duplicate lines dynamically in a csv file

I presume tReplicate also doesn't suit? What exactly do you want to do?
Employee

Re: [resolved] How to duplicate lines dynamically in a csv file

Hi hzhang2,

Did you try this kind of design?
I'm not sure you'll find a much quicker solution than tFlowToIterate--Loop--tIterateToFlow for this issue.

Don't forget to turn off statistics when executing the job, as stats on iterations make the job really slow in the Studio.

Does your output need to be sorted? If not, you could add some splitting/parallelization to improve performance.
One Star

Re: [resolved] How to duplicate lines dynamically in a csv file

Hi csonnefraud,

Thanks for your response. What I did was something similar in multithreaded mode: tFlowToIterate--tRowGenerator--tIterateToFlow. I think it's the same principle as yours. What hurts performance is the use of Iterate: the speed is less than 100 lines/s. So with 4 million input lines, it would take me at least 10 hours! The result is better when I use tMap for a join, 200 lines/s, but that is still far from enough. I wonder if it's possible to do this by creating a Java component. I have never done that.

Thanks again!
Employee

Re: [resolved] How to duplicate lines dynamically in a csv file

the speed is less than 100 lines/s

Is that a value that you calculated, or one given by the stats?

Because as I said, iteration stats may really decrease the performance of your job.
Have you tried running your job with the 'Statistics' option disabled and 'Exec Time' enabled in the Advanced Settings tab?
Seven Stars

Re: [resolved] How to duplicate lines dynamically in a csv file

Sorry, didn't actually look at your data. Have a look at this post, which was doing the same thing and getting much better performance than you're suggesting.

Of course, writing to a database would be slower than writing to a file. I would suggest using a bulk insert to overcome that.
One Star

Re: [resolved] How to duplicate lines dynamically in a csv file

Hi,

You can do that with one tJavaFlex, like this:

initial code :

System.out.println("## START\n#");
int iVAL; // number of copies requested for the current row
int j;    // loop counter

main code :

// keyval holds the repeat count as text
iVAL = Integer.parseInt(row1.keyval);

// Build all copies into a single output field, separated by newlines
j = 1;
row2.keyno = row1.keyno + ";" + row1.keyval;
while (j < iVAL) {
    row2.keyno = row2.keyno + "\n" + row1.keyno + ";" + row1.keyval;
    j++;
    //System.out.println(row1.keyno+","+row1.keyval);
}


final code :

System.out.println("## END\n#");

input is :
key1,2
key2,6
key3,4
key4,3

and Output is :

key1;2
key1;22
key2;6
key2;6
key2;6
key2;6
key2;6
key2;66
key3;4
key3;4
key3;4
key3;44
key4;3
key4;3
key4;33

As you just need the first row, it sounds good, no?
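
The main-code loop above can also be checked outside the Studio. The following is a rough standalone sketch, not Talend code: `RepeatField` and `buildField` are hypothetical names, and the plain `String` parameters stand in for `row1.keyno` and `row1.keyval`. It swaps the repeated string concatenation for a `StringBuilder`, which avoids recopying the growing field on every pass when counts are large. Like the original loop, it always produces at least one copy.

```java
public class RepeatField {

    /**
     * Builds the multi-line value that the main code assigns to row2.keyno:
     * "keyno;keyval" repeated keyval times, separated by newlines.
     * A count of 0 or 1 still yields one copy, matching the original loop.
     */
    static String buildField(String keyno, String keyval) {
        int copies = Integer.parseInt(keyval);
        String record = keyno + ";" + keyval;
        StringBuilder sb = new StringBuilder(record);
        for (int j = 1; j < copies; j++) {
            sb.append('\n').append(record);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample rows taken from the input shown in the post.
        System.out.println(buildField("key1", "2"));
        System.out.println(buildField("key4", "3"));
    }
}
```

Inside a tJavaFlex, the same loop body would go in the main code with the result assigned to `row2.keyno`.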
One Star

Re: [resolved] How to duplicate lines dynamically in a csv file

Hi csonnefraud ,
As you said, it's the statistics that take the time. Once I turned them off, the iterate approach ran faster than the join. I got a run time of 6 minutes for 1,000,000 input lines.
Thanks a lot!

Hi Alevy,
I found the same performance whether the filter is applied after the Cartesian product or as a condition on the lookup branch in the join.
