I have created a Big Data Spark job.
I am reading rows from a file.
I am generating an ID for each record in the file in tMap using the following code:
Numeric.sequence("IDGen", 1000000, 1)
I checked the file and found that duplicate IDs had been generated.
Why is this happening?
Please note that this is a Big Data Spark job and I am running it on a Spark cluster.
Is there a workaround for this issue?
Can you share your job?
How are the duplicates distributed over the records?
Why do you initialize the sequence with 1,000,000?
Out of 10,000 numbers generated by the Talend sequence function, 5 are duplicated.
What I mean is that 5 different numbers each have a count of two.
There is no special reason for initializing with 1,000,000;
it is just an example to help people understand.
The real duplicate values are different.
Anyway, the point is that I am getting duplicates.
This is how the duplicates are distributed:
ID Count of ID generated in the output file
It looks very strange.
Can you share your job design and the configuration of any component where the sequence is calculated?
Also, how are the duplicates identified? (with details)
Is there any value that has more than 2 duplicates?
Here is the job design.
The job reads from a source file and attaches an ID to every record in tMap. The new records from tMap are then stored in another file.
The subjob checks for duplicates: it groups all records in the output file of the previous subjob by ID and stores each ID along with its count in an output file.
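The duplicate check described here (group by ID, then count) can be sketched in plain Java. This is only an illustration outside Talend; the class name, the method name and the sample IDs are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateIdCheck {

    // Returns every ID that occurs more than once, mapped to its count.
    public static Map<Long, Integer> findDuplicates(List<Long> ids) {
        Map<Long, Integer> counts = new LinkedHashMap<>();
        for (Long id : ids) {
            // Start at 1 for a new ID, otherwise add 1 to the existing count.
            counts.merge(id, 1, Integer::sum);
        }
        // Keep only the IDs that were seen at least twice.
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample IDs, not the values from the real output file.
        List<Long> ids = Arrays.asList(1000001L, 1000002L, 1000002L, 1000003L);
        System.out.println(findDuplicates(ids));
    }
}
```

An empty result means that every generated ID is unique.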
No ID has more than 2 duplicates.
The following are the duplicates, as I shared earlier:
There is nothing strange in the job design, which is very simple.
Last question: for the duplicate records, apart from the UniqueID, are the other fields duplicated or not?
No, the other fields are not duplicated. I just compared two rows whose IDs are the same.
Is this a drawback of Big Data Spark jobs?
Did you create a Standard job or a Big Data job?
We have to create a Big Data Spark job and then run it on a Spark cluster.
I am using Talend Big Data Platform 6.2.
First of all, sorry for my English.
This also happened to me in a Talend Data Integration job, with a tMap row number assigner.
It is very strange: the job has been running for many days and has worked many times.
When counting 500 rows, it produced 20 duplicated row numbers.
What workaround did you find?
For me the error occurred in a Talend Big Data job.
I found a workaround using tSqlRow.
The tSqlRow is connected to two input rows. The first row carries the incoming records.
The other row contains the last generated maximum sequence value.
In tSqlRow I used row_number() to generate a unique number for each incoming row:
select (row_number() over()) + row2.lastseq from row1,row2
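The idea behind this query, adding a per-row number to the last stored sequence value, can be sketched in plain Java. This is only an illustration of the arithmetic, not Talend or Spark code; the class name, the method name and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetRowNumber {

    // Assigns the IDs lastSeq + 1, lastSeq + 2, ... to rowCount incoming rows,
    // mirroring (row_number() over()) + row2.lastseq from the query.
    public static List<Long> assignIds(int rowCount, long lastSeq) {
        List<Long> ids = new ArrayList<>(rowCount);
        for (int rowNumber = 1; rowNumber <= rowCount; rowNumber++) {
            ids.add(lastSeq + rowNumber);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical values: 3 incoming rows, last stored sequence 1000000.
        System.out.println(assignIds(3, 1000000L));
    }
}
```

Row numbers within one run are distinct and every new ID is greater than lastseq, so the new IDs cannot collide with each other or with earlier ones, provided lastseq really holds the highest ID issued so far.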