One Star

Not able to preview UTF-16LE (w/BOM) encoded CSV file

Hi,
I am using TOS 2.1.2, with Java 1.5.0_12 jvm.
I have searched through this forum a few times but there is no answer.
My CSV file is UTF-16 encoded (little endian) on Windows with Byte Order Mark (0xFEFF) set.
It consists of lines like this
AB_00001.001,AB_00001,1919991,2001100,AB01
AB_00001.002,AB_00001,2001101,2007890,AB01
AB_00001.003,AB_00001,2007891,2010920,AB01
When previewing step 2 to step 3, the dreaded
"org.xml.sax.SAXParseExpcetion: Content is not allowed in prolog."
appears.
I have tried to copy the sample .csv file to c:\, nope, doesn't work.
Initially my JVM was defaulted to Java 6 update 2, which didn't work.
So I went to Windows -> Preferences -> Talend and switched the Java interpreter to c:\java\jdk.1.5.0_12\bin\java.exe
and changed the Java -> Installed JREs to the same library at c:\java\jdk.1.5.0_12
and it still didn't work for me.
Anyone has a better idea on what I can do to solve this?
Thank you very much!!!
t.
Tags (1)
16 REPLIES
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

I'm not sure this information is helpful to you, but the BOM of UTF-16 (little endian) is 0xFFFE. 0xFEFF is big endian.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

I'm not sure this information is helpful to you, but the BOM of UTF-16 (little endian) is 0xFFFE. 0xFEFF is big endian.

Uhhh... what I can get from the source file that generate the CSV is this:
   open( FILE, ">:raw:encoding(UTF16-le):crlf:utf8", $filename ) or
die "Cannot open input file " . $filename . " for writing: $!\n";
print FILE "\x{FEFF}";

Is that right?
Guess I'd will take a copy out for my own testing. ("file" in cygwin and Ultraedit and Notepad++ does say it is UTF-16 little endian though)
Hmm...
Thanks for the response!
Test time!
t^2.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Just to workaround the issue, I tried 2 more methods...
In my Perl project, I created a new Job with just 2 components:
tFileInputCSV -> tLogRow
I manually configured the metadata (ie. using "Built-in" property and schema).
I tried different files Encoding Types such as 'UTF16', 'UTF-16LE', 'UTF16-LE', 'UTF16'.
The output from tLogRow seems to indicate that the "unicode" mode was not enabled. There is a black square in the output window which means that my CSV file was not interpreted as a UTF16 at all.
So I think I can tentatively conclude that the problem has nothing to do with the preview feature of the Metadata repository for the project.
...
Next, I downloaded and tried TOS-Win32-V2.2.0M1. Although the encoding is also correctly interpreted by Talend as "UTF16-LE" during preview, it is still not working (ie. preview of the delimited data is no available). However, the error dialog box was not shown (the SAXParseException is not shown).
I think the Unicode support for UTF16 is not really there yet? Does anyone reading this forum have a trouble-free experience with their UTF16 files, do let me know Smiley Happy
....
I have generated a UTF-16 Little Endian file with the latest version of Ultraedit for testing, just in case my source files are encoded wrongly. Notepad++ can only generate UCS2-LE files, but they are ok for my purposes, but no joy.
By the way, I have tried both workarounds on Windows XP Pro and Vista Business edition (also Java 5 and Java 6). No luck still... not my day, I guess... Smiley Happy
...
Anything I can do to make the testing more exhaustive in order to pinpoint the issue? Smiley Tongue
Can someone point me to the appropriate file in Talend's sourcecode (I have checked out the entire 700MB+ *whew* trunk via subversion) that handles the "interpretation" of UTF16 encoded files? I can try poking in there with my meagre skills. Smiley Happy Or can I paste the code generated by Talend here to help with identifying the issue (if that would help)?
Do share your experience with UTF16 encoded files and/or any other relevant pointers that you might have Smiley Happy
Thank you!!!
t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Ok, after sleeping on this, I think my observation is flawed.
I checked the generated code and tLogRow uses the Text::ASCIITable module, which, I think is not able to handle Unicode output.
So I will now try to pump the data into a MSSQL 2005 table instead (MySQL still does not handle UTF16 for now).
Will update again on how it goes Smiley Happy
t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

May be this can help you: In older times I've used an odbc-connection to read the data out of a csv-file.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

May be this can help you: In older times I've used an odbc-connection to read the data out of a csv-file.

Ah, this is a good idea for a workaround! Smiley Happy Using ODBC is a little awkward for me, but maybe it can solve the problem, worth a try Smiley Happy
I have resorted to learning to write my own component (modeled after the tFileInputCSV) to see if that should solve the issue... Smiley Tongue The tutorial on the wiki is a little outdated, but I'm getting the hang of the "structure" now... I'm not giving up on this tool yet!!! Smiley Very Happy
Thanks!!! Smiley Very Happy Hopefully this thread will help another poor soul stuck with the same problem as me in the future Smiley Very Happy

t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

=== Testing with my own component ===
I have copied tFileInputCSV to my own tUnicodeFileInputCSV component using talend_duplicate_component.sh (cygwin on Win XP).
Modifications that I have made to the original tFileInputCSV_begin.perljet file:
1. Added a "use Encoding;" to the top of the file, below use "Text::CSV_XS;"
2. On line 59 of the original file, the line
open($input_FH_<%=cid%>, '<', $desc_<%=cid%>{filename})
has been changed to
open($input_FH_<%=cid%>, '<:raw:encoding(UTF-16)', $desc_<%=cid%>{filename})
I admit that it is bad practice (ie. hardcoded), but I just want to get to the bottom of this! Smiley Tongue
I think the above changes work and I am able to open UTF-16le encoded files with no problems, HOWEVER, the output window in TOS tells me that there is an error with parsing the data.
It displays the data from my CSV file (yes, there is no more square boxes every other character), and a warning about a parsing error. The error is caused by line 137 in fFileInputCSV_begin.perljet
    if (not $csv_<%=cid%>->parse($_<%=cid%>)) {

Yup, so I guess the limitation of the tFileInputCSV comes from Text::CSV_XS... alternative 1 is to customize my own perl module (huh??) or I have to find something else as a replacement.. or?
Hmm... any ideas? Smiley Very Happy
t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

So what do I do with the deadline looming and with limited resources (ie. only myself)? Find another tool that can do ETL (ie. $$$), outsource the problem, or what?
I created my own tUnicodeFileInputDelimited from tFileInputDelimited.
Made the same changes as above (use Encoding and the "<:raw:encoding(UTF16)")...
AND TAAADAAAAHH!!! YESSSS!!!
tLogRow can now output my data!!!
And now, onwards to loading it into the db via tMSSQLOutput! (i hope it works) Smiley Very Happy
t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Hi t,
I think It would be a good idea to open a feature (or bug?) and submit your changes to the Talend-Team as input.
Bye
Volker
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Hi VB (hope you dun mind me calling you VB)
Err.. the problem is not solved yet Smiley Happy Also, why I post here is that I hope the development team will give some comments (if it's important to them, I hope/think they will).
The FileInput components are only 1 stage in the process. I have to check that the other components are also UTF-16 ready.
Currently, I realized that tLogRow is not working if the charset is not Latin-based. Smiley Happy I tried to print some data based on Mandarin/Chinese out and it flops Smiley Happy
I am now working on the following:
- Seems like setting the encoding type to UTF-16LE on the tFileInputDelimited component does not have any impact (other than the default UTF-8 and iso-8859 encodings).
- I have customized a tUnicodeLogRow component that sets the binmode on STDOUT to ensure that the wide character in print warnings do not come out.
I was able to put something into MSSQL2005, but I'm not sure if it is correct, or what Collations should I be using and if "BIN2" is the right way to go, etc... Smiley Happy Once I am done, I hope to provide a small conclusion/sumary on my findings here because it seems that information/knowledge on UTF-16 is really sparse and pathetic on the web.
Smiley Happy back to work... (warcraft: "werk werk, more werk??!!")
t.
Employee

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Interesting topic, sorry to come so late, the subject is a bit... complicated (often the case with encoding issues). Currently (2.1.x), encoding is very lightly supported in TOS. Even if the "Encoding" property is present in many input/output components, it has effect only on a few of them.
In the past, I've written two blog posts about encoding issues with Perl when working with Oracle, files and STDOUT. See Oracle to file in UTF-8, with Perl and Oracle to standard output in UTF-8, with Perl for details.
My current view of the problem is that:
- the input components should tell Perl code what is the encoding of input data (setting UTF-8 in a tFileInput* means that data are stored in UTF-8)
- the output components should tell Perl code what is the encoding of output data (setting UTF-8 in a tFileOutput* means that we want data to be written in UTF-8)
- user must be warned when the encoding of the input is wider than the encoding of the output (all UTF-8 characters can't be stored in ISO-8859-1)
- encoding does only concern input and output components (a tMap doesn't need to know)
Concerning your problem with tFileInputCSV (and other tFileInput*), the encoding property should be used when opening the filehandle. Same for tLogRow. nrousseau and I have designed the "ENCODING_TYPE" property so that it can have a list of developer defined encodings and a "CUSTOM" option. When "CUSTOM" is selected, a text box appear next to the encoding defined list. In your case, ':raw:encoding(UTF-16)' should be the solution. (it doesn't do anything right now, it's a proposal).
What's your opinion?
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Hello,
your "current view" is exactly what I think the components would (or should) do.
Additional one option for 'binary' data (?) and often mentioned: Give detail information about converting problems (e.g. which line, which char and so on).
In your example I see the problem that you need detail information in which way the component is coded to decide what CUSTOM means. But this is a general (problem /design decision) of TOS.
By
Volker
(BTW: Move this thread to suggestions?)
Employee

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

(BTW: Move this thread to suggestions?)

Good advice :-)
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Quick update:
(I was out of office for a couple of days, just came back and got it working).
I compile DBD:Smiley SurprisedDBC v1.14 (recently released) on Windows XP and got the MSSQLOutput component working!
More updates tomorrow... don't have time now to do more thorough testing Smiley Happy
YAY! It works!! I can see the multibyte characters in MS SQL 2005 mgmt studio Smiley Happy
t.
One Star

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Hi tlittle,
I have been banging my head on this same problem for the past 3 weeks (nmd i am a total noop to programming)
Im not sure if our problems are exactly the same, i guess it never hurts to explain it. I am trying to change the column delimiters on a csv file so that i can import the report into another application. At first i was getting all these square characters, which i didnt understand where they coming from. Until recently i was told that the csv file is actually a TSV in UTF-16LE (or UTF-16) with BOM.
Here is the Perl code i had for changing the delimiters (I am using Stand. input and output)
foreach (<STDIN>) {
#copy this line:
my $line = $_;
# modify the delimiters
# replace every tab with ^
while ($line =~ s/"\t"/^ / ) { }
# remove double quotation marks
while ($line =~ s/"//) { }

print $line
}
Can someone help me modify this code to handle the encoding first then run the text manipulation
Employee

Re: Not able to preview UTF-16LE (w/BOM) encoded CSV file

Salem, are you using Talend Open Studio (in such a case, you should not need to write this piece of code, Talend Open Studio does it for you).
I see that you read directly from standard input and ouput directly to standard output. I would rather open a file with an explicit encoding, the same for the output.