How to detect encoding of a text file

Hello,
Is there a way to detect the encoding of a text file automatically?
I need to read various text files, but sometimes the encoding changes without notice.
This may go unnoticed, so corrupted data may be stored in the database, which I want to avoid.
If the encoding cannot be detected, is there a way to be notified that it has changed?
Regards,
Aya

Re: How to detect encoding of a text file

Hello,
I don't think there is a direct way to do this in Talend. Since you have many files to load into the database, you don't know their encoding, and even a known encoding can change without warning, you could write a Talend routine that uses one of the following libraries:
http://sourceforge.net/projects/jchardet/
http://code.google.com/p/juniversalchardet/
You could, for example, default to UTF-8, or always save the encoding of the last processed file and compare it against that of the newest one.
Please let me know how you solved this requirement in your project.
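If you only need to be notified when a file no longer matches the expected encoding, a strict decode may already be enough. Here is a minimal sketch using only the JDK. It is not what jchardet or juniversalchardet do (they guess the charset from the byte content), and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper that could live in a Talend routine; names are illustrative.
final class EncodingGuard {

    /**
     * Returns true if the whole file decodes cleanly in the expected charset,
     * false if any byte sequence is malformed or unmappable.
     */
    static boolean decodesAs(String path, String expectedCharset) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        try {
            Charset.forName(expectedCharset)
                   .newDecoder()
                   // REPORT makes the decoder throw instead of silently replacing bad bytes.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A job could call `EncodingGuard.decodesAs(path, "UTF-8")` before loading each file and reject the file when it returns false. Keep in mind that ISO-8859-1 accepts any byte sequence, so this check is only informative when the expected encoding is a strict one such as UTF-8.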
Best regards,
Ladislav

Re: How to detect encoding of a text file

Hello,
Thank you for your reply.
I was hoping that Talend might have the feature, but I will follow your advice and try using juniversalchardet.
This requirement was already set before we decided to use Talend, and since the files are sent by various customers, it is difficult to change.
Thank you again for your help.
Regards,
Aya

Re: How to detect encoding of a text file

If you have time, writing your own component is not as difficult as it seems. I may have time over the weekend, so I will also take a look at this.
My proposed behavior is as follows: the component will have one input parameter, <path to file>, and one output parameter of type string that holds the detected file encoding.
Let me know if you have another idea of how this should work.
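As a rough illustration of that interface (path in, encoding name out), here is a JDK-only sketch. It only looks for a byte order mark and then tries a strict UTF-8 decode, which is far simpler than the detection in juniversalchardet; the class name, the method name and the extra `fallback` parameter are my own assumptions, not part of any existing component:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch of the proposed behavior: file path in, encoding name out.
final class EncodingSniffer {

    /**
     * Returns "UTF-8", "UTF-16BE" or "UTF-16LE" when a BOM or a clean UTF-8
     * decode identifies the file; otherwise returns the caller's fallback.
     */
    static String detectEncoding(String path, String fallback) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(path));

        // 1. Byte order marks. Note: FF FE can also start a UTF-32LE file,
        //    which this sketch does not try to distinguish.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. No BOM: accept UTF-8 only if every byte sequence is well formed.
        //    Pure ASCII and empty files also end up here and report "UTF-8".
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // 3. Cannot tell legacy single-byte encodings apart; let the caller decide.
            return fallback;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The returned string could then be stored and compared with the value from the previously processed file, so that a change is noticed before the data reaches the database.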
Best regards,
Ladislav

Re: How to detect encoding of a text file

Any update on this? I would love to have such a component. I couldn't manage to implement juniversalchardet in Talend.