One Star

[resolved] Encoding issue with tFileOutputDelimited

I use a tFileOutputDelimited with the encoding set to UTF-8 (the default) in the advanced settings.
Nevertheless, the produced file has a UTF-16LE BOM (FF FE) and UTF-16LE character encoding.
I tried to pipe a tChangeFileEncoding (UTF-16 -> UTF-8), with and without a custom input encoding.
Both tests failed; I'm stuck with UTF-16.
Any idea?
Frankie.
BTW: I use TOS Version: 4.1.2
Build id: r53616-20110106-0635
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] Encoding issue with tFileOutputDelimited

Finally... the produced file is UTF-8, as expected, but without a BOM.
My bad: the default configuration of my tool (UltraEdit32) was to convert a file to UTF-16 when recognized as such, showing the wrong BOM in my case (and adding it when saved).
I'll mark this post as resolved.
4 REPLIES
Community Manager

Re: [resolved] Encoding issue with tFileOutputDelimited

Hi
Can you send me an example file for testing?
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: [resolved] Encoding issue with tFileOutputDelimited

It seems to be related to a "Will Not Fix" Java bug:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
A quick and dirty workaround (in a tJava):
// reading as UTF-16LE
FileInputStream fis = new FileInputStream("inputfile.txt");
BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF-16LE"));
// writing as UTF-8
FileOutputStream fos = new FileOutputStream("outputfile.txt");
Writer w = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"));
// copy data line by line
String s;
while ((s = r.readLine()) != null) {
    w.write(s + System.getProperty("line.separator"));
}
w.flush();
// closing streams
w.close();
r.close();

The BOM is still wrong, but the encoding is right.
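If a downstream consumer really requires a UTF-8 BOM (EF BB BF), one option is to write those three bytes explicitly before the re-encoded payload. A minimal sketch, not part of the original workaround; the class name is made up, and the sample string is the two-character Ä° example from this thread:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8BomSketch {

    // Encode the text as UTF-8, preceded by an explicit UTF-8 BOM (EF BB BF).
    static byte[] withUtf8Bom(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}); // BOM bytes
        Writer w = new OutputStreamWriter(out, "UTF-8");
        w.write(text);
        w.flush();
        return out.toByteArray();
    }

    // Render a byte array as spaced uppercase hex, for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // "\u00C4\u00B0" is the Ä° sample discussed in the thread
        System.out.println(hex(withUtf8Bom("\u00C4\u00B0")));
    }
}
```

In a real job, the resulting byte array would be written to disk with a FileOutputStream before the delimited rows; writing the character \uFEFF through the UTF-8 Writer first has the same effect.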
I did not find a convenient way to put binary files online, so here is a small example of what I mean:
- Actual input data (readable): Ä°
(LATIN CAPITAL LETTER A WITH DIAERESIS + DEGREE SIGN)
- Correct UTF-16LE (hex): FF FE C4 00 B0 00
as written by Talend in my case (supposed to be UTF-8)
- Actual output file (hex, mixed): FF FE C3 84 C2 B0
after the above quick-and-dirty conversion
- Expected output (hex, UTF-8): EF BB BF C3 84 C2 B0
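The payload bytes in the comparison above can be double-checked with a scratch program. Note that String.getBytes with "UTF-16LE" emits no BOM, so only the character bytes appear; the class name here is made up for illustration:

```java
import java.io.UnsupportedEncodingException;

public class EncodingDump {

    // Return the bytes of s in the given encoding as spaced uppercase hex.
    static String hex(String s, String charsetName) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charsetName)) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LATIN CAPITAL LETTER A WITH DIAERESIS + DEGREE SIGN
        String sample = "\u00C4\u00B0";
        System.out.println("UTF-16LE: " + hex(sample, "UTF-16LE")); // C4 00 B0 00
        System.out.println("UTF-8   : " + hex(sample, "UTF-8"));    // C3 84 C2 B0
    }
}
```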
Edit: Oops, it seems that UltraEdit converts automatically to UTF-16 when opening. Trying with a proper binary viewer/editor now.
Community Manager

Re: [resolved] Encoding issue with tFileOutputDelimited

Hi
Glad to see that you found the cause! Maybe you can try the component tWriteHeaderLineToFileWithBOM to output the records with a BOM.

Best regards
Shong
