Six Stars

Numeric Sequence Generation function giving duplicate numbers

Hi,

 

I have created a Big Data Spark job that reads rows from a file.

In a tMap, I generate an ID for each record in the file using the following code:

Numeric.sequence("IDGen", 1000000, 1)

I checked the file and found that duplicate IDs had been generated.

Why is this happening?

Please note that this is a Big Data Spark job and I am running it on a Spark cluster.

Is there a workaround for this issue?

 

Thanks 

  • Big Data
  • Data Integration
14 REPLIES
Nine Stars TRF

Re: Numeric Sequence Generation function giving duplicate numbers

Can you share your job?

How are the duplicates distributed over the records?

Why do you initialize the sequence with 1,000,000?


TRF
Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

Out of 10,000 numbers generated by the Talend sequence function, 5 are duplicated.

What I mean is that 5 different numbers each appear twice. For example:

ID/number           duplication_count
123456              2
324532              2

There is no special reason for initializing with 1,000,000.

Ten Stars

Re: Numeric Sequence Generation function giving duplicate numbers

If you initialize to 1,000,000 how are you getting values less than 1,000,000?
Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

That was just an example to illustrate the problem.

The real duplicates are different.

Anyway, the point is that I am getting duplicates.

Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

OK, I will share how the duplicates are distributed shortly.

Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

This is how the duplicates are distributed:

ID                      Count of ID generated in the output file

1000815               2
1006072               2
1005490               2
1005905               2
1000889               2
1007748               2
1000246               2

Nine Stars TRF

Re: Numeric Sequence Generation function giving duplicate numbers

Hi,

It looks very strange.

Can you share your job design and the configuration of any component where the sequence is calculated?

Also, how are the duplicates identified (with details)?

Is there any value that occurs more than twice?


TRF
Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

Here is the job design:

sequencegen.jpg

Basically, the job reads from a source file and attaches an ID to every record in a tMap. The new records from the tMap are then stored in another file.

tmapseq.jpg

The subjob checks for duplicates: it groups all records in the output file of the previous subjob by ID, and stores each ID along with its count in an output file.

No ID occurs more than twice.

These are the duplicates, as I shared earlier:


ID                   Count

1000815            2
1006072            2
1005490            2
1005905            2
1000889            2
1007748            2
1000246            2
 

Nine Stars TRF

Re: Numeric Sequence Generation function giving duplicate numbers

There is nothing strange in the job design, which is very simple.

Last question: for the duplicate records, are the other fields duplicated as well, or only the UniqueID?


TRF
Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

No, the other fields are not duplicated. I just compared two rows whose IDs are the same.

Is this a drawback of a Big Data Spark job?

Nine Stars TRF

Re: Numeric Sequence Generation function giving duplicate numbers

I've just tried the same job with 50,000 records and got no duplicate records.

I use TDI 6.3.1.


TRF
Six Stars

Re: Numeric Sequence Generation function giving duplicate numbers

Did you create a standard job or a Big Data job?

We have to create a Big Data Spark job and then run it on a Spark cluster.

I am using Talend Big Data Platform 6.2.

Nine Stars TRF

Re: Numeric Sequence Generation function giving duplicate numbers

No, I just used a standard job.
Maybe the problem is due to cluster mode.
I don't know exactly how it works, but if the job execution is distributed over many nodes, I suppose that could be a problem for the sequence calculation, which is an in-memory operation (probably not shared between nodes).
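
To make that suspicion concrete, here is a minimal Java sketch. It is not Talend's actual code: `LocalSequence` is a hypothetical stand-in for an in-memory counter that lives separately on each node, and `generate` simply simulates several nodes that each start the "IDGen" sequence at 1000000. Under that assumption, every node hands out the same values again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PerNodeSequenceDemo {

    // Hypothetical stand-in for an in-memory sequence store that exists once per node (JVM).
    static final class LocalSequence {
        private final Map<String, Integer> counters = new HashMap<>();

        // Returns the start value on the first call, then the previous value plus step.
        int next(String name, int start, int step) {
            Integer current = counters.get(name);
            int value = (current == null) ? start : current + step;
            counters.put(name, value);
            return value;
        }
    }

    // Simulates `nodes` independent workers, each generating IDs for `recordsPerNode` records.
    static List<Integer> generate(int nodes, int recordsPerNode) {
        List<Integer> ids = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            LocalSequence seq = new LocalSequence(); // state is NOT shared between nodes
            for (int r = 0; r < recordsPerNode; r++) {
                ids.add(seq.next("IDGen", 1000000, 1));
            }
        }
        return ids;
    }

    // Counts how many generated IDs repeat a value that was already seen.
    static int countDuplicates(List<Integer> ids) {
        Set<Integer> seen = new HashSet<>();
        int duplicates = 0;
        for (int id : ids) {
            if (!seen.add(id)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(countDuplicates(generate(1, 5))); // prints 0
        System.out.println(countDuplicates(generate(2, 5))); // prints 5
    }
}
```

With a single node there are no duplicates, which would match what I saw in the standard job. How many duplicates a real cluster produces would depend on how the records are split across nodes, which this sketch does not model.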

TRF
Ten Stars

Re: Numeric Sequence Generation function giving duplicate numbers

Agreed, this is probably a threading issue, and the Numeric routines likely aren't thread-safe. You may have to create your own sequence.
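
One way to build your own sequence, sketched below in plain Java, is to derive each ID from the partition a record belongs to, so that no shared counter is needed. The names `uniqueId`, `partitionIndex`, `numPartitions`, and `recordIndex` are illustrative assumptions, not Talend or Spark API; how you obtain the partition index inside a Talend Spark job is something you would have to check for your version. The scheme is similar in spirit to Spark's `RDD.zipWithUniqueId`, where items in the k-th of n partitions get the IDs k, n+k, 2n+k, and so on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class PartitionedIdDemo {

    // ID for the record at position `recordIndex` (0-based) within partition `partitionIndex`.
    // Each partition only uses values congruent to its own index modulo numPartitions,
    // so two partitions can never produce the same ID.
    static long uniqueId(long start, int partitionIndex, int numPartitions, long recordIndex) {
        return start + recordIndex * numPartitions + partitionIndex;
    }

    // Simulates ID generation across all partitions without any shared state.
    static List<Long> generate(long start, int numPartitions, int recordsPerPartition) {
        List<Long> ids = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            for (long r = 0; r < recordsPerPartition; r++) {
                ids.add(uniqueId(start, p, numPartitions, r));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Long> ids = generate(1000000L, 4, 2500);
        System.out.println(ids.size());                // prints 10000
        System.out.println(new HashSet<>(ids).size()); // prints 10000, so all IDs are distinct
    }
}
```

The trade-off is that the IDs are unique but not necessarily consecutive when partitions hold different numbers of records. If gap-free numbering matters, something like Spark's `RDD.zipWithIndex` may be a better fit, at the cost of an extra pass over the data.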