[resolved] Automatically trim non-String fields?

Seven Stars

[resolved] Automatically trim non-String fields?

Just wanted to put a thought out there before I submit a feature request:
One of our newbies was having difficulty because the input file he was trying to use in a job had unnecessary spaces trailing the data for a column defined in the schema as Integer, which caused every row to be rejected. It seems to me that it would be appropriate for all input components to automatically trim any numeric column before attempting to parse it to the correct data type, regardless of the advanced trim settings.
Comments, please?

Accepted Solutions
Seven Stars

Re: [resolved] Automatically trim non-String fields?

While I certainly understand janhess's point of view, I also look at SQL Server which happily equates a number to a string with leading or trailing spaces i.e. 1 = ' 1 '.
However, there is an easy solution (if you understand the problem in the first place). Also, making a change like this to tFileInput suggests, for consistency, tSchemaComplianceCheck and tConvertType should also be changed...
So, for now, I've left well-enough alone. (I've already submitted enough feature requests Smiley Happy! Someone else can have a go if they feel more strongly about it.)

All Replies
Community Manager

Re: [resolved] Automatically trim non-String fields?

Hi
There will be a lot of work to do, we need update the javajet file of all input components, This is unacceptable changes. ;(
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
Six Stars

Re: [resolved] Automatically trim non-String fields?

Just wanted to put a thought out there before I submit a feature request:
One of our newbies was having difficulty because the input file he was trying to use in a job had unnecessary spaces trailing the data for a column defined in the schema as Integer, which caused every row to be rejected. It seems to me that it would be appropriate for all input components to automatically trim any numeric column before attempting to parse it to the correct data type, regardless of the advanced trim settings.
Comments, please?

Yes it sound reasonable to me... the user declared clearly its intent to get an Integer from the flow, so it should ignore the "spaces/tabs" or at least produce a very clear exception during parsing about the reason.
Not such big show stopper, but it will improve the learning curve of the product ( and many times a lot of time with Talend is spent in debugging this little glitches in the flow .... )
One Star

Re: [resolved] Automatically trim non-String fields?

The data should be rejected as it is an invalid integer. It would be if it was validated against the schema. This sort of data should be defined as string and trimmed. It can be converted to integer later if this is necessary.

Re: [resolved] Automatically trim non-String fields?

Shong is right here... this would require changes to all the input components.
at least produce a very clear exception during parsing about the reason.

This might be just as broad-- the components would have to figure out what the exception was caused by -- it could be a value of 'aaa' or ' 123 '
the only (not so elegant) easy solution to this would be to read all values as strings, trim everything and then convert it all to the types you want.
I use the following code to apply the same transform to all columns-- it could be used to apply trim in this instance:
//grab fields with reflection magic
java.lang.reflect.Field[] input_fields = input_row.getClass().getDeclaredFields();
//now we go through the fields
for( java.lang.reflect.Field field : input_fields ) {
//set accessable so we dont explode with exception.
field.setAccessible(true);
//skip the silly byte arrays
if(field.getName() == "commonByteArrayLock" || field.getName() == "commonByteArray" ) {
continue;
}
//retrieve the field so we can play with it
Object col_value = field.get(input_row);


//Now we populate output schema
//this should feel familiar :)
java.lang.reflect.Field[] output_fields = output_row.getClass().getDeclaredFields();
for ( java.lang.reflect.Field output_field : output_fields ) {
output_field.setAccessible(true);

//apply the transform
col_value = routines.StringHandling.TRIM(((String) col_value));

//if these are matching schema columns, set the value of output to our changed value.
if(field.getName() == output_field.getName()) {
output_field.set(output_row, col_value);
}
}
}
Seven Stars

Re: [resolved] Automatically trim non-String fields?

I should perhaps clarify that simply selecting that field to be trimmed using the advanced settings solved the problem i.e. the value was trimmed as a String before it was cast to an Integer. I just thought Talend should perhaps always take this approach for numerics, even if the field is not set to be trimmed using the advanced settings.
Seven Stars

Re: [resolved] Automatically trim non-String fields?

While I certainly understand janhess's point of view, I also look at SQL Server which happily equates a number to a string with leading or trailing spaces i.e. 1 = ' 1 '.
However, there is an easy solution (if you understand the problem in the first place). Also, making a change like this to tFileInput suggests, for consistency, tSchemaComplianceCheck and tConvertType should also be changed...
So, for now, I've left well-enough alone. (I've already submitted enough feature requests Smiley Happy! Someone else can have a go if they feel more strongly about it.)
Seven Stars

Re: [resolved] Automatically trim non-String fields?

Shong is right here... this would require changes to all the input components.

I don't believe this is true. All input components should use Talend's ParserUtils so only these need be modified and not every component e.g. becoming
public static int parseTo_int(String s) {
if (s!=null) {
s = s.trim();
}
return Integer.parseInt(s);
}
One Star

Re: [resolved] Automatically trim non-String fields?

I would like some help. I want to know how to read csv files and make calculations as grouped together every 12 lines and calculate their average use over some places the rows to replace a string by another and end insert these values into a table in a mysql data base
tFileList -> tFileInputDelemited -> tjava -> tmap -> tLogRow 
here I managed to recover and play mas info I do not manage to use the replace and consolidate every 12 line together and calculated the average
thx