I'm currently using CRC32 to generate unique check sum values. I have learnt that, theoretically CRC32 could (in very rare cases à 1 in 2^32 cases) could produce a duplicate. Is there any better algorithm alternative to CRC32 that I could use?
@kishore_p , you can try with MD5 or SHA-256 ,please check the below link to know more about these.
Could you please check below links? Apart from MD5, there are some other interesting options.
Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved
All hash algorithms can produce duplicates. You are taking something with thousands or potentially millions of bytes, and squashing it down to a very small number of bytes, four to be precise. Those four bytes can only contain just over 4 billion different values, so if you hash more than four billion different documents then you are guaranteed to have a duplicate. If you hash a million documents, you have a 1 in 4,000 chance of a duplicate.
The only way to reduce the likelihood of a clash is to use a longer hash. CRC32 generates a 32 bit hash, MD5 generates a 128 bits. You might think that that only increases the range by a factor of four, but in fact it is much, much, much bigger than that. You go from a 1 in 4 billion chance of a clash to a 1 in 340 million million million million million million. So long as your source value is more than 128 bits, you would need to generate a million million million hashes per second for the entire existence of the universe so far before you had a noticeable chance of a clash.
CRC64 may be enough for you, that would reduce the chance of a clash to 1 in a million million million.
Note that I said "reduce". The very nature of hashing means that a clash is possible, it's just a case of making it sufficiently unlikely.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Part 2 of a series on Context Variables
Learn how to do cool things with Context Variables
Read about some useful Context Variable ideas