Is there a better alogorithm alternative to CRC32 ?

Four Stars

Is there a better alogorithm alternative to CRC32 ?

I'm currently using CRC32 to generate unique check sum values. I have learnt that, theoretically CRC32 could (in very rare cases à 1 in 2^32 cases) could produce a duplicate. Is there any better algorithm alternative to CRC32 that I could use?

Highlighted
Moderator

Re: Is there a better alogorithm alternative to CRC32 ?

@kishore_p , you can try with MD5 or SHA-256 ,please check the below link to know more about these.

 

https://www.codejava.net/coding/how-to-calculate-md5-and-sha-hash-values-in-java

 

Manohar B
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
Highlighted
Employee

Re: Is there a better alogorithm alternative to CRC32 ?

Hi,

 

   Could you please check below links? Apart from MD5, there are some other interesting options.

 

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-un...

https://stackoverflow.com/questions/16122067/md5-vs-crc32-which-ones-better-for-common-use/16125537

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

Highlighted
Eight Stars

Re: Is there a better alogorithm alternative to CRC32 ?

All hash algorithms can produce duplicates. You are taking something with thousands or potentially millions of bytes, and squashing it down to a very small number of bytes, four to be precise. Those four bytes can only contain just over 4 billion different values, so if you hash more than four billion different documents then you are guaranteed to have a duplicate. If you hash a million documents, you have a 1 in 4,000 chance of a duplicate.

 

The only way to reduce the likelihood of a clash is to use a longer hash. CRC32 generates a 32 bit hash, MD5 generates a 128 bits. You might think that that only increases the range by a factor of four, but in fact it is much, much, much bigger than that. You go from a 1 in 4 billion chance of a clash to a 1 in 340 million million million million million million. So long as your source value is more than 128 bits, you would need to generate a million million million hashes per second for the entire existence of the universe so far before you had a noticeable chance of a clash.

 

CRC64 may be enough for you, that would reduce the chance of a clash to 1 in a million million million.

 

Note that I said "reduce". The very nature of hashing means that a clash is possible, it's just a case of making it sufficiently unlikely.

2019 GARTNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

Best Practices for Using Context Variables with Talend – Part 2

Part 2 of a series on Context Variables

Blog

Best Practices for Using Context Variables with Talend – Part 1

Learn how to do cool things with Context Variables

Blog

Best Practices for Using Context Variables with Talend – Part 3

Read about some useful Context Variable ideas

Blog