[resolved] Collect rejects from tFileInputDelimited

One Star

Hello Team,
I need to process a delimited file and collect all the rejects.
To do so I use tFileInputDelimited -> Rejects -> tFileOutputDelimited, where tFileOutputDelimited is configured to write to a .txt file. Unfortunately, this output file contains partially parsed rejected lines along with an error message. What I really need in this file is the original line(s) from the input file that were rejected, without any additional information.
Here is an example. In the input file I have the following line:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
This line will be rejected since it is not properly formed according to the CSV schema that I defined. In the output file, instead of the full line above, I only get the following:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|||||||||||||For input string: "Softcover" - Line: 0
Is there any way that I can collect original input lines?
Thank you!
Svetlana

All Replies
Moderator

Re: [resolved] Collect rejects from tFileInputDelimited

Hi,
What does your expected result look like? How did you define the CSV schema?
Have you tried the tSchemaComplianceCheck component, which validates all input rows against a reference schema or checks types, nullability, and length of rows against reference values, to see if it works?
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi Sabrina,
My expected result will contain the full line from the input file that was rejected.
For example, this line will throw an exception:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
In this case, my expected result (the file with rejects) will contain this line in full:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
My schema definition is attached as a screenshot.
I did not try to use tSchemaComplianceCheck component, so I am going to take a look at it.
Thank you!
Svetlana
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi Sabrina,
tSchemaComplianceCheck doesn't seem to help. The line that I want in the rejects file fails because its number of columns differs from what is defined in the schema, so it always fails in tFileInputDelimited without ever reaching the tSchemaComplianceCheck component (see attached screenshot).
Thank you!
Svetlana
Moderator

Re: [resolved] Collect rejects from tFileInputDelimited

Hi,
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|||||||||||||For input string: "Softcover" - Line: 0

This field contains string values such as "Softcover", but you are using the "Long" data type to read it. Try reading this column with the String data type and validating the input rows against a reference schema with tSchemaComplianceCheck to see if it works.
Best regards
Sabrina
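
For context, the text 'For input string: "Softcover"' in the rejected row matches the message that Java's Long.parseLong puts on a NumberFormatException. The sketch below is plain Java, not Talend-generated code; the class RejectMessageDemo and the method parseError are hypothetical names used only for illustration, and the sketch assumes the column is parsed as a Java long.

```java
class RejectMessageDemo {
    /** Returns the parse error message for a value, or null if it parses as a long. */
    static String parseError(String value) {
        try {
            Long.parseLong(value);
            return null;
        } catch (NumberFormatException e) {
            // The exception message quotes the offending input string.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseError("Softcover")); // a text value in a long column
        System.out.println(parseError("1024118"));   // a well-formed long
    }
}
```

The first call reports the offending value and the second returns null, which suggests why the reject row carries only the error for the first field that failed rather than the original line.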
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi Sabrina,
You are absolutely right: this line is deliberately incorrect. If I remove the "AAA|BBB" part, it is processed without any issues. My goal here is to collect all original lines from the input file that may throw an exception. Our program will process an input CSV file from our client application, and there is absolutely no guarantee that all the lines in the file will be well formed. We need to process what we can and collect the rest in a rejects file to send back to the client, so that they can fix these rejects and resubmit. So I need to send back each original line as it came from their input file, meaning I need this full incorrect line in my rejects file:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
Thank you,
Svetlana
Moderator

Re: [resolved] Collect rejects from tFileInputDelimited

Hi,
Could you please try to read this column with the String data type (actually, there is no check for String in Talend) and validate the input rows against a reference schema using tSchemaComplianceCheck to see if it works?
Best regards
Sabrina
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi Sabrina,
I am not sure what column you are referring to, but I tried the following:
1. If I modify the schema and read all columns according to their types (first two columns as strings), then yes, everything works fine: all lines reach the tSchemaComplianceCheck component and can be validated.
2. When I modify the input so that it does not match the schema (in this case, one extra string column at the beginning of the line, which is read as a string), please see the attached screenshot. As you can see, the rejects happen in tFileInputDelimited; they do not even reach tSchemaComplianceCheck for schema validation. The rejects happen because one of the fields of this line was expected to be a long but turned out to be a String.

3. I also tried to read the entire line as one big "input" column of type String and pass it to tSchemaComplianceCheck, hoping it would figure out how to parse it, but it did not, and it gave me an "input cannot be resolved or not a field" error:


The only workaround that I see so far is to read each line of the input file as one long string and pass it to tExtractDelimitedFields for parsing. Then, when it fails to parse a line, I can use tJavaRow to collect the value of the row that was passed to tExtractDelimitedFields. But I am facing a strange problem here: for whatever reason, it does not parse my input line correctly. It splits every letter into a separate field (see snapshot below). If you could help me figure this one out, I can use this workaround to collect the info that I need. The configuration of tExtractDelimitedFields is attached.



Thank you!
Svetlana
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

I figured it out. As the field separator, instead of "|" I have to use "\\|". It works now.
Thank you.
Svetlana
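
For readers who want to see the idea outside of Talend, here is a minimal plain-Java sketch of the workaround described above: keep each line as one raw string, try to parse it, and store the untouched original when parsing fails. It is not the code Talend generates. The class RejectCollector, its methods, and the three-column schema (title as String, itemId as long, quantity as int) are assumptions made up for illustration, and the naive split does not honor quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RejectCollector {
    // Hypothetical schema for illustration: title (String), itemId (long), quantity (int).
    static final int EXPECTED_COLUMNS = 3;

    /** Returns true when the raw line parses cleanly against the hypothetical schema. */
    static boolean parses(String rawLine) {
        // The pipe is escaped because split() takes a regex; -1 keeps trailing empty fields.
        String[] fields = rawLine.split("\\|", -1);
        if (fields.length != EXPECTED_COLUMNS) {
            return false;
        }
        try {
            Long.parseLong(fields[1]);
            Integer.parseInt(fields[2]);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Collects the unmodified original text of every line that fails to parse. */
    static List<String> collectRejects(List<String> rawLines) {
        List<String> rejects = new ArrayList<>();
        for (String line : rawLines) {
            if (!parses(line)) {
                rejects.add(line);
            }
        }
        return rejects;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "Literacy 9 Student Book A|1024118|31",
            "\"AAA|BBB\"|Literacy 9 Student Book A|Softcover|31");
        // Only the second, malformed line is printed, exactly as it was read.
        collectRejects(input).forEach(System.out::println);
    }
}
```

Because the raw string is never modified, the rejects list can be written straight to the rejects file and sent back to the client as-is.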
Four Stars

Re: [resolved] Collect rejects from tFileInputDelimited

What is the difference between "|" and "\\|"?
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Take a look at this documentation: https://help.talend.com/search/all?query=tExtractDelimitedFields&content-lang=en
This is what it says about field Separator:
Since this component uses regex to split a field and the regex syntax uses special characters as operators, make sure to precede the regex operator you use as a field separator by a double backslash. For example, you have to use "\\|" instead of "|".
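
The same behavior can be reproduced in plain Java with String.split, which also takes a regex. This is only a sketch to illustrate the quoted documentation, assuming Java 8 or later; the class SeparatorDemo and its method names are hypothetical. An unescaped "|" is a regex alternation between two empty patterns, so it matches between every pair of characters, which is consistent with every letter ending up in its own field.

```java
import java.util.Arrays;

class SeparatorDemo {
    /** Splits on the bare pipe, which the regex engine reads as "empty OR empty". */
    static String[] splitUnescaped(String line) {
        return line.split("|");
    }

    /** Splits on a literal pipe character. */
    static String[] splitEscaped(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        String line = "AB|C";
        System.out.println(Arrays.toString(splitUnescaped(line))); // [A, B, |, C]
        System.out.println(Arrays.toString(splitEscaped(line)));   // [AB, C]
    }
}
```

In Java source, "\\|" is the two-character regex \| (a backslash followed by a pipe), which is why the double backslash is needed.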
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi,
Did you try joining the input data file with the reject file in a separate subjob after the first one has finished?
A little transformation plus playing with separators should work.
Regards,
TRF
One Star

Re: [resolved] Collect rejects from tFileInputDelimited

Hi TRF,
My problem was getting the rejects file. The rejected lines returned by tFileInputDelimited were not the complete original lines. It would return partially processed lines, so if parsing failed on the very first column of a line, I would not have anything that I could compare with the input data. My backup plan, however, was to compare the input data with the successfully processed data to find out what was rejected. The solution with tExtractDelimitedFields works fine for my purposes.
Cheers,
Svetlana