[resolved] Type conversions - best practices

I am interested in what the best practices are for converting data types, and at which point of the ETL process you would apply them. Below is a list of possible conversion options with their pros and cons:
- conversion at the db layer (CAST in the SQL query): best performance, but difficult to manage, and the conversion is hidden from the Talend developer
- implicit conversion by defining a schema on the input component (e.g. db type = varchar, java type = Integer): easier to maintain, but difficult to catch exceptions
- java conversion methods (applicable in tMap, tJava, ...): a minimal sketch follows this list
- tConvertType: probably the best option for converting types, since it supports handling of rejected rows. The con is that there is no way to convert only a subset of the schema attributes; you have to list all the conversion mappings.
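To make the java-methods option concrete, here is a minimal sketch of the kind of defensive conversion helper you might keep in a Talend routine and call from a tMap expression. The class and method names are hypothetical, not from any real job:

// Minimal sketch (plain Java): a defensive String-to-Integer conversion
// helper. All names are hypothetical.
public class ConvertRoutines {

    // Returns null for null/blank input instead of throwing.
    public static Integer toInteger(String s) {
        if (s == null || s.trim().isEmpty()) {
            return null; // treat blank as SQL NULL
        }
        return Integer.valueOf(s.trim()); // still throws NumberFormatException on bad data
    }

    public static void main(String[] args) {
        System.out.println(toInteger(" 42 ")); // 42
        System.out.println(toInteger(""));     // null
    }
}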
ACCEPTED SOLUTION

Re: [resolved] Type conversions - best practices

- conversion at the db layer
I'm not sure I agree this is the best performance option, especially since joining on CAST-ed columns can wreak havoc on indexes and hurt performance. In Java, an upcast is free; it is downcasts that cost you (a quick sketch below).
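A minimal sketch of that Java point: the upcast compiles to nothing extra, while the downcast carries a runtime check that can fail.

// Upcast vs. downcast in Java (minimal sketch).
public class CastDemo {
    public static void main(String[] args) {
        Object o = "hello";             // upcast String -> Object: free, always safe
        String s = (String) o;          // downcast: runtime checkcast, may throw
        System.out.println(s.length()); // 5

        Object n = Integer.valueOf(1);
        if (n instanceof String) {              // guard the downcast
            System.out.println((String) n);
        } else {
            System.out.println("not a String"); // this branch runs
        }
    }
}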
- implicit conversion with defining a schema for the input component
As you pointed out, this can be dangerous.
- java conversion methods
I do use this when casting subsets of the data, or just a few numeric types, or when I'm doing something crazy in a tJavaRow (like working with Object types); a rough sketch follows.
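As a rough illustration of the Object-type case (plain Java, all names hypothetical, not Talend-generated code), the conversion usually ends up as a runtime type check plus a fallback parse:

// Hedged sketch: converting an untyped Object field, as you might
// inside a tJavaRow. The helper name is hypothetical.
public class ObjectDemo {
    public static Long toLong(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Number) {
            return ((Number) value).longValue(); // numeric types: direct conversion
        }
        return Long.valueOf(value.toString().trim()); // fall back to parsing
    }

    public static void main(String[] args) {
        System.out.println(toLong(42));    // 42
        System.out.println(toLong("17"));  // 17
        System.out.println(toLong(3.9d));  // 3 (truncates)
    }
}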
- tConvertType
My preferred method (and this matters much more when using files as a source, as opposed to a DB) is to create a generic schema that reads all fields as a Java String. This is my permissive schema; it lets me avoid parsing errors and capture any rows that are not in compliance with the target schema (the rejects are the most interesting part!). After reading the data into the permissive schema, I run it through a tSchemaComplianceCheck against the stated/actual schema, then proceed through a tConvertType (which should be safe, since we have checked every row) to auto-cast to the stated/actual types. A plain-Java sketch of the pattern follows.
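The flow above is built from Talend components, but here is a plain-Java sketch of the same permissive-schema idea (all data and names hypothetical): ingest everything as String, check compliance first, convert second, and capture rejects instead of crashing.

import java.util.*;

// Plain-Java sketch of the permissive-schema pattern: read all fields as
// String, run a compliance check, then convert. In the Talend job this is
// roughly tFileInputDelimited -> tSchemaComplianceCheck -> tConvertType.
public class PermissiveLoad {
    public static void main(String[] args) {
        List<String[]> raw = Arrays.asList(          // every field read as String
            new String[]{"1", "2024-01-01"},
            new String[]{"oops", "2024-01-02"});     // non-compliant id

        List<String[]> rejects = new ArrayList<>();
        for (String[] row : raw) {
            if (!row[0].matches("-?\\d+")) {         // compliance check on the id field
                rejects.add(row);                    // capture the row, don't crash
                continue;
            }
            int id = Integer.parseInt(row[0]);       // conversion is now safe
            System.out.println("loaded id=" + id + ", date=" + row[1]);
        }
        System.out.println(rejects.size() + " reject(s) captured");
    }
}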