One Star

tFileInputDelimited, text enclosure option, and numerical data

I've scoured the forums about this problem and didn't find anything, so forgive me if this is already addressed somewhere.
This is about TOS 5.4.1 r111943-20131212-1133, which I downloaded and installed in February of this year.
I'm using a tFileInputDelimited component to read a tab-delimited file. I have no control over the format of the file delivered to me. The rows in the file contain both textual and numerical fields, so I've set up the schema accordingly: some columns are String, and some are BigDecimal for the numbers.
I've examined the file and noticed that some of the text fields were surrounded by double quotes (interestingly not because they had tabs in them but commas), so I also set up the 'CSV options' so that Escape Char and Text Enclosure are both "\"".
The file parses fine, and out of about 100K rows I get 5 rejects. On examination, each of these rejected rows turns out to have a number that contains a comma (thousands separator) and is surrounded by double quotes.
Further experimentation and examination of the generated code has convinced me that the escape char and text enclosure settings only apply to fields that are read into a String, and are not applied to numerical columns.
The most telling evidence is that if I remove the reject row and let the component die on error, a NumberFormatException thrown from the BigDecimal class is revealed.
I then used the following minimal data set to try to isolate the problem (I've used <TAB> to denote a tab char):
"Text"<TAB>12
Text<TAB>"12"
Reading this file with a tFileInputDelimited (two columns, first = String, second = Integer, delim = tab, CSV options both set to "\"") results in row one being accepted and row two being rejected. Therefore, it is not a problem with the BigDecimal class and it is not related to the thousands separator: the enclosure works for text data and fails for numeric data.
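To illustrate what the rejects look like at the Java level, here is a minimal sketch in plain Java. It is not Talend's generated code; the class and method names are made up for this example. It only assumes that the raw field text, quotes included, reaches the number parser unchanged.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of Talend.
class QuotedNumberRepro {
    // Returns true if the raw field text parses as an int.
    static boolean parsesAsInteger(String raw) {
        try {
            Integer.parseInt(raw);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Returns true if the raw field text parses as a BigDecimal.
    static boolean parsesAsBigDecimal(String raw) {
        try {
            new BigDecimal(raw);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsInteger("12"));               // true
        System.out.println(parsesAsInteger("\"12\""));           // false: quotes still present
        System.out.println(parsesAsBigDecimal("\"1,234.56\""));  // false: quotes and comma
        System.out.println(parsesAsBigDecimal("1,234.56"));      // false: comma alone is enough
    }
}
```

Note the last case: even with the quotes stripped, a thousands separator would still make the BigDecimal constructor throw, so the real file has two separate issues in the rejected rows.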
What do I need to do to make this work?
4 REPLIES
One Star

Re: tFileInputDelimited, text enclosure option, and numerical data

Seems I misread the post. Could you please paste the exact error message?
One Star

Re: tFileInputDelimited, text enclosure option, and numerical data

I used a tJavaFlex component to read a CSV file line by line. Instead of trying to figure out conversions within the Java code, I set the schema on the CSV file so everything was a String. I then used a tConvertType component to convert the columns to their final destination types. I just set it to Auto Cast and it seemed to do OK converting strings to Float, BigDecimal, and Integer.
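The same "strings first, convert later" idea can be sketched in plain Java. This is only an illustration of the approach, not what tJavaFlex or tConvertType do internally; the class and method names are hypothetical. It assumes no field contains an embedded tab or an escaped quote, so a naive split is enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Talend.
class StringFirstReader {
    // Remove one pair of enclosing double quotes, if present.
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1);
        }
        return field;
    }

    // Split a tab-delimited line into unquoted String fields.
    // The -1 limit keeps trailing empty fields.
    static List<String> readFields(String line) {
        List<String> fields = new ArrayList<>();
        for (String raw : line.split("\t", -1)) {
            fields.add(unquote(raw));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Every column is read as a String first...
        List<String> row = readFields("Text\t\"12\"");
        // ...and only then converted to its destination type.
        int number = Integer.parseInt(row.get(1));
        System.out.println(row.get(0) + " / " + number);
    }
}
```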
Four Stars

Re: tFileInputDelimited, text enclosure option, and numerical data

"Text"<TAB>12
Text<TAB>"12"
The first row would not cause an error, but the second would.
Use StringHandling.EREPLACE to remove the double quotes from the integer field and try again. In this case, though, you will have to read the column as a String and then parse it to an Integer if you need a number. A plain cast such as (Integer)yourcolumn will not convert a String, so use Integer.parseInt(yourcolumn) instead.
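A rough sketch of that clean-then-parse step in plain Java follows. It uses String.replace rather than the Talend StringHandling routine, and the class and method names are invented for the example. It assumes the comma is always a thousands separator and the dot is the decimal separator, which may not hold for every locale.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of Talend.
class NumericCleanup {
    // Strip double quotes and thousands separators from a raw field.
    static String clean(String raw) {
        return raw.trim().replace("\"", "").replace(",", "");
    }

    // Parse a possibly quoted field such as "1,234.56" into a BigDecimal.
    static BigDecimal toBigDecimal(String raw) {
        return new BigDecimal(clean(raw));
    }

    // Parse a possibly quoted field such as "12" into an Integer.
    static Integer toInteger(String raw) {
        return Integer.valueOf(clean(raw));
    }

    public static void main(String[] args) {
        System.out.println(toInteger("\"12\""));
        System.out.println(toBigDecimal("\"1,234.56\""));
    }
}
```

In a job, an expression along these lines could sit in a tMap or tJavaRow after reading the column as a String.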
Vaibhav
One Star

Re: tFileInputDelimited, text enclosure option, and numerical data

Replies so far seem to confirm that the text enclosure character (double quote in my case) is only applied by Talend to String columns. This is not the way delimited files normally work: surrounding a numeric column with quotes is completely valid. At least, it is for CSV files, which is where this behavior is copied from.
I'd like an official confirmation of Talend's behavior on this. If this is really how Talend treats the enclosure character, I think I need to file a bug report.
I know I can work around this by simply reading everything as strings and then converting with additional components, but I consider that a last-resort approach.