How to clean a file of wrongly encoded characters?

Five Stars

Hello,

I have been given a .txt file containing lines that look like this:

 testDjoWrongEncodedCharacters.JPG 

Here, for example, we have 4 lines with 5 columns each. In the 4th column, some characters are not recognized as UTF-8 characters.

What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so that they can be read correctly.

 

I tried to use a regex in a tMap component in order to erase or replace the wrong characters.

tMapWrongEncodedCharacters.JPG

 

But it didn't work! The wrong characters stay the same...

jobTestWrongEncodedCharacters.JPG 

 

I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that did not work: the characters don't revert to what they should be. So using a routine to change the encoding of my file is not really an option either.

 

I am starting to run out of ideas and options. Does anyone have a good idea to share?

 

PS: I am attaching my test file in case anyone wants to run some tests.


Accepted Solutions
Ten Stars

Re: how to clean a file from wrong encoded characters ?

You're looking at the character representation; you need to look at the byte representation.
Example (made up): \u0001232 = A under Unicode/UTF-8, but a different encoding may render the same bytes as a character like ╗, and when no character is mapped to them you may see <?>.
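
To make the byte-versus-character point concrete, here is a minimal Java sketch (Java because Talend jobs and tMap expressions are Java). It dumps the bytes of a sample string as hex and decodes the same bytes under two charsets. The class name `ByteInspector` and the sample text are assumptions for illustration; windows-1252 is only a guess at the producer's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteInspector {

    // Render each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the producer wrote "caf" + e-acute using windows-1252.
        Charset assumedSource = Charset.forName("windows-1252");
        byte[] raw = "caf\u00e9".getBytes(assumedSource);

        System.out.println(toHex(raw)); // 63 61 66 E9

        // Same bytes, decoded with the charset they were written in.
        System.out.println(new String(raw, assumedSource));

        // Same bytes, decoded as UTF-8: the lone 0xE9 is malformed,
        // so the decoder substitutes U+FFFD (the replacement character).
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```

Running the same kind of dump on a few bytes of the real file around a broken character shows what is actually stored, independently of how an editor chooses to display it.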

I would still look for the original encoding, one that can represent all the diacritics needed in your (French) language. Because of conversion problems, the byte mapping was wrong and shows the wrong character. With the original encoding/collation you will probably be working with the correct byte ranges.
For example, Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.
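
Once there is a candidate for the original encoding, one way to "recover" the text is to decode strictly as UTF-8 and fall back to the candidate charset only when the bytes are not valid UTF-8. The sketch below is one possible approach, not a confirmed fix for this file; `EncodingRecovery`, `decodeWithFallback`, and the choice of windows-1252 as the fallback are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingRecovery {

    // Decode as strict UTF-8. If any byte sequence is malformed,
    // decode the whole input with the fallback charset instead.
    static String decodeWithFallback(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        // Assumption: the legacy data was written as windows-1252.
        Charset assumed = Charset.forName("windows-1252");
        byte[] legacy = {0x63, 0x61, 0x66, (byte) 0xE9}; // "caf" + 0xE9
        System.out.println(decodeWithFallback(legacy, assumed));
    }
}
```

If the file mixes correctly and incorrectly encoded lines, applying this per line (on the raw bytes of each line) rather than on the whole file may give better results.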

You need to map the bytes back, and you need to find out which range of bytes is malformed. You can use a regex with byte ranges.
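
A regex over byte ranges can be approximated in Java by decoding the raw bytes as ISO-8859-1, which maps every byte 0x00-0xFF to the char with the same value, so each char in the string stands for exactly one byte. The sketch below replaces every byte in 0x80-0xFF; that range is an assumption and should be narrowed to whatever the hex dump of the real file shows. `ByteRangeCleaner` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

final class ByteRangeCleaner {

    // Replace every byte in the range 0x80-0xFF with the given replacement.
    // Note: this also hits bytes that belong to valid UTF-8 multi-byte
    // sequences, so it is only suitable when all non-ASCII bytes are unwanted.
    static byte[] replaceHighBytes(byte[] bytes, String replacement) {
        // One char per byte: ISO-8859-1 is a 1:1 mapping for 0x00-0xFF.
        String oneCharPerByte = new String(bytes, StandardCharsets.ISO_8859_1);
        String cleaned = oneCharPerByte.replaceAll("[\\x80-\\xFF]", replacement);
        return cleaned.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = {0x63, 0x61, 0x66, (byte) 0xE9, 0x20, 0x61, 0x75};
        byte[] out = replaceHighBytes(in, " ");
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```

Passing an empty string as the replacement erases the bytes instead of replacing them with a space.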

All Replies
Ten Stars

Re: how to clean a file from wrong encoded characters ?

Try WINDOWS-1252 / CP-1252.
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
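
Outside of a Talend component setting, trying an encoding can also be done with a few lines of plain Java: read the raw bytes, decode them with the charset under test, and write the result back as UTF-8. This is a minimal sketch; `FileTranscoder` and its command-line arguments are hypothetical, and windows-1252 is only the candidate suggested above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class FileTranscoder {

    // Read the input as raw bytes, decode with the source charset,
    // and write the text back out encoded as UTF-8.
    static void transcode(Path input, Charset source, Path output) throws IOException {
        byte[] raw = Files.readAllBytes(input);
        String text = new String(raw, source);
        Files.write(output, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: FileTranscoder <input> <output>");
            return;
        }
        // Assumption: the file was originally written as windows-1252.
        transcode(Paths.get(args[0]), Charset.forName("windows-1252"), Paths.get(args[1]));
    }
}
```

This only helps if the file still contains the original bytes. If it was already decoded with the wrong charset and saved again, the original bytes may be gone.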

 

Five Stars

Re: how to clean a file from wrong encoded characters ?

Hello @Dijke

 

Thank you for your quick answer, but I tried that and it didn't work.

See:

jobTestWrongEncodedCharacters2.JPG

 

Even if I knew the native encoding, I think reverting the file to that encoding would still be impossible.

 

Is there any other way to capture those characters in order to erase them? I think that might be simpler. I tried a regex that allows only alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture, change, or erase the wrong characters. Did I miss something here?
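
For reference, a character-level cleanup of the kind attempted here could look like the sketch below, which could be called from a tMap expression through a routine. It assumes the wrong characters arrive in the job as U+FFFD (the replacement character a decoder typically substitutes for undecodable bytes), which is not confirmed for this file. `FieldCleaner` and its method names are hypothetical. Note that a class such as `[^a-zA-Z0-9]` also matches spaces, punctuation, and accented letters, so a Unicode-aware whitelist is used instead.

```java
final class FieldCleaner {

    // Option 1: erase U+FFFD characters from a field.
    static String stripReplacementChar(String field) {
        return field == null ? null : field.replace("\uFFFD", "");
    }

    // Option 2: replace anything that is not a letter, digit,
    // punctuation mark, or whitespace with a single space.
    static String whitelist(String field) {
        return field == null ? null : field.replaceAll("[^\\p{L}\\p{N}\\p{P}\\s]", " ");
    }

    public static void main(String[] args) {
        String damaged = "caf\uFFFD au lait";
        System.out.println(stripReplacementChar(damaged));
        System.out.println(whitelist(damaged));
    }
}
```

Java strings are immutable, so both methods return a new string; the result has to be assigned to the output column for the change to be visible.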

 

 

Five Stars

Re: how to clean a file from wrong encoded characters ?

OK, now I get it. Even though I would have preferred a quicker solution, I will try it that way to reach a durable solution.

Thank you for your input and all the explanations, @Dijke !