I have been given a .txt file containing lines that look like this:
Here, for example, we have 4 lines with 5 columns each, and in the 4th column some characters are not recognized as valid UTF-8.
What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so they can be read correctly.
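Since tMap expressions run on Java, the three options roughly map onto what `java.nio.charset.CharsetDecoder` already offers: `CodingErrorAction.IGNORE` drops the bad bytes (option 1) and `CodingErrorAction.REPLACE` substitutes them (option 2, here with a space). This is only a sketch of the idea, not the actual job:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeOptions {
    // Decode raw bytes as UTF-8, choosing what happens to invalid sequences:
    // IGNORE silently drops them, REPLACE substitutes the string given to replaceWith().
    static String decode(byte[] raw, CodingErrorAction action) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .replaceWith(" ");
        try {
            return dec.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is not a valid UTF-8 sequence in this position.
        byte[] raw = {0x61, (byte) 0xE9, 0x62};
        System.out.println(decode(raw, CodingErrorAction.IGNORE));  // option 1: "ab"
        System.out.println(decode(raw, CodingErrorAction.REPLACE)); // option 2: "a b"
    }
}
```

Option 3 (recovering the characters) is different: it needs the file's real source encoding rather than an error action.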
I tried using a regex in a tMap component to erase or replace the wrong characters, but it didn't work: the wrong characters stayed exactly the same.
I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that doesn't work: the characters don't revert to what they should be. So using a routine to change the encoding of my file isn't really an option either.
I am starting to run out of ideas and options. Does anyone have a good idea to share?
PS: I've attached my test file in case anyone wants to run some tests.
Try WINDOWS-1252 / CP-1252
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
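If Windows-1252 does turn out to be the right source encoding, the cleanest fix is your option 3: decode the raw bytes with that charset instead of UTF-8. A minimal Java sketch (the byte values are just an illustration, not taken from your file):

```java
import java.nio.charset.Charset;

public class Cp1252Decode {
    // Decode raw bytes as Windows-1252 instead of UTF-8.
    static String decode(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in Windows-1252 but an invalid lone byte in UTF-8.
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(raw)); // prints "café"
    }
}
```

In a Talend job the same effect is normally achieved by setting the input component's encoding to the file's real charset rather than decoding by hand.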
Thank you for your quick answer, but I tried that and it didn't work.
Even if I knew the native encoding, I think converting the file back to that encoding would still be impossible.
Is there any other way to capture those characters so I can erase them? I think that might be simpler. I tried a regex allowing only alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture/change/erase the wrong characters. Did I miss something here?
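I can't say for sure why `[^a-zA-Z0-9]` didn't catch them in your flow, but one common situation is that by the time the data reaches tMap, Java has already turned each undecodable byte into U+FFFD (the replacement character '�'), so you can target that character directly. A minimal sketch of the idea:

```java
public class StripBadChars {
    // Undecodable bytes usually surface in a Java String as U+FFFD;
    // replacing that character with a space covers option 2.
    static String clean(String s) {
        return s.replace('\uFFFD', ' ');
    }

    public static void main(String[] args) {
        System.out.println(clean("caf\uFFFD")); // prints "caf "
    }
}
```

The same expression can be dropped into a tMap output column as `row1.col4.replace('\uFFFD', ' ')` (the column name is hypothetical).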
OK, now I get it. Even though I would have preferred a quicker fix, I will try it that way to reach a durable solution.
Thank you for your input and all the explanations, @Dijke !