How to clean a file of wrongly encoded characters?

Six Stars

Hello,

I have been given a .txt file containing lines that look like this:

 testDjoWrongEncodedCharacters.JPG 

Here, for example, we have 4 lines with 5 columns each. In the 4th column, some characters are not recognized as UTF-8 characters.

What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so that they can be read correctly.

 

I tried to use a regex in a tMap component in order to erase or replace the wrong characters.

tMapWrongEncodedCharacters.JPG

 

But it didn't work! The wrong characters stay the same.

jobTestWrongEncodedCharacters.JPG 

 

I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that did not work: the characters don't revert to what they should be. So using a routine to change the encoding of my file is not really an option either.

I am starting to run out of ideas and options. Does anyone have a good idea to share?

PS: I have attached my test file in case anyone wants to run some tests.


All Replies
Ten Stars

Re: how to clean a file from wrong encoded characters ?

Try WINDOWS-1252 / CP-1252.
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
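
One way to test this suggestion without relying on an editor is to decode the file's raw bytes strictly with each candidate charset and see which ones are rejected. The sketch below is only an illustration in plain Java, the language Talend routines are written in. The class `EncodingProbe` and the method `decodeStrict` are hypothetical names, not part of Talend.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: checks whether raw bytes are valid in a given charset. */
final class EncodingProbe {

    /**
     * Decodes the bytes strictly. Returns the decoded text, or null when the
     * bytes contain a sequence that is malformed or unmappable in the charset.
     */
    static String decodeStrict(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

For example, `EncodingProbe.decodeStrict(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8)` returns null if the file (here a placeholder path) contains bytes that are not valid UTF-8. A successful decode does not prove the charset is the right one: ISO-8859-1, for instance, accepts every byte value, so the decoded text still has to be checked by eye.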

 

Six Stars

Re: how to clean a file from wrong encoded characters ?

Hello @Dijke

 

Thank you for your quick answer, but I tried that and it didn't work.

See:

jobTestWrongEncodedCharacters2.JPG

 

Even if I knew the native encoding, I think reverting the file to that encoding would still be impossible.

Is there any other way to capture those characters in order to erase them? I think it might be simpler. I tried a regex that allows only alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture, change, or erase the wrong characters. Did I miss something here?
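
For readers trying the same erase-or-replace route, here is a minimal sketch of what such a cleanup could look like as a Java routine. It is not taken from the job in the screenshots: the class `CharacterCleaner` and both method names are hypothetical, and the choice of "printable ASCII only" is an assumption about what counts as a valid character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical routine: strips or replaces characters outside a whitelist. */
final class CharacterCleaner {

    // Matches anything outside printable ASCII (space through tilde),
    // which includes U+FFFD, the Unicode replacement character.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    /** Replaces every non-printable-ASCII character; pass "" to erase instead. */
    static String replaceNonAscii(String value, String replacement) {
        if (value == null) {
            return null;
        }
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return NON_PRINTABLE_ASCII.matcher(value)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    /** Keeps only unaccented letters and digits, as in the regex quoted above. */
    static String keepAlphanumeric(String value) {
        return value == null ? null : value.replaceAll("[^a-zA-Z0-9]", "");
    }
}
```

In a tMap output expression this would be called along the lines of `CharacterCleaner.replaceNonAscii(row1.col4, " ")`, where `row1.col4` is a placeholder for the real input column. Java strings are immutable, so the cleaned text is the return value of the call and the input column itself is never changed in place. Note that this approach also removes correctly encoded accented letters such as é.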

 

 

Ten Stars

Re: how to clean a file from wrong encoded characters ?

You're looking at the character representation; you need to look at the byte representation.
Example (made up): \u0001232 = A in Unicode / UTF-8, but a different encoding may render the same bytes as a character that looks like ╗, or, when no character is mapped to them, as <?>.

I would still look for the original encoding, one that can represent all the diacritics your (French) language needs. Because of the conversion problem, the bytes were mapped wrongly and a false character is shown. With the original encoding / collation you will probably be working with the correct byte ranges.
Example: Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.

You need to map the bytes back, which means finding out which range of bytes is malformed. You can use a regex with byte ranges.
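
To make "look at the bytes" and "map it back" concrete, here is a rough sketch in plain Java. The class `ByteInspector` and its methods are hypothetical names. The remapping step assumes one specific failure mode, namely text that was written in one charset and then decoded with another, which may or may not be what happened to the file in this thread.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for inspecting bytes and undoing a wrong decode. */
final class ByteInspector {

    /** Renders bytes as space-separated uppercase hex pairs. */
    static String toHex(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /**
     * Re-encodes the garbled text with the charset that was wrongly used to
     * decode it, then decodes those bytes with the charset assumed to be the
     * real one. Only works if no byte was lost in the wrong decode.
     */
    static String remap(String garbled, Charset wronglyAssumed, Charset actual) {
        return new String(garbled.getBytes(wronglyAssumed), actual);
    }
}
```

Dumping a suspicious column with `toHex` shows which byte values are actually in the file: é is `C3 A9` in UTF-8 but `E9` in ISO-8859-1 and windows-1252. If the text turns out to be UTF-8 that was read as windows-1252, `remap(value, Charset.forName("windows-1252"), StandardCharsets.UTF_8)` may recover it. Where a byte has already been replaced by `?` or U+FFFD, the original value is gone and cannot be mapped back.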


Six Stars

Re: how to clean a file from wrong encoded characters ?

OK, now I get it. Even though I would have preferred a quicker solution, I will try it that way to reach a durable solution.

Thank you for your input and all the explanations, @Dijke !
