One Star

[resolved] ETL including Chinese Characters

Hey folks. I'm having a problem loading data from an MS SQL database into a MySQL database when the string contains Chinese characters. According the to the trace that I enabled for the job it looks like this is a problem when using tMSSQLOutput. In the trace it shows the string at '???' and then it passed NULL to MySQL. This is a big problem for me as our company is going to be doing more and more work with Asian character sets in our databases. How do I go about making this work?
- Peter
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] ETL including Chinese Characters

Here is an update and resolution to this issue...
1) I attempted to resolve this issue by installing the latest version of TOS. No improvement.
2) I set the default client character set to UTF8 in MySQL. No Improvement.
3) I set an additional JDBC parameter in the connection to set the character set to utf8. No improvmenet.
4) I recreated the data set in MySQL so I would transfer data from MySQL to MySQL, thus to remove the MS SQL input component as a possible issue. Had the same problem so we know that this is an issue with the job as a whole and not the MS SQL input component.
Resolution = Used SSIS.
Due to this issue we have no choice but to recreate all of our jobs using SSIS, which isn't a huge problem since the jobs that we currently have are not that complex. However it's my understanding that SSIS does not have a connector for Infobright, which was a major benefit to using TOS. It is possible to transfer data from a MySQL server to an infobright server by using a BASH script that utilizes the 'load data infile' command on the infobright instance and the 'select into outfile' command on the mysql instance with the data to be extracted, but it would be a lot easier to create and manage that process if we had a development studio such as TOS or SSIS for the jobs. Since we're using international data with Asian characters we simply can't use TOS as it appears that it cannot transfer this data.
I'm hoping that in the future someone may be able to shed some light on this or that this issue may be resolved in a future version of TOS. If so please post to this thread.
- Peter
9 REPLIES
Community Manager

Re: [resolved] ETL including Chinese Characters

Hi
There must be something wrong on tMap, the data are not transferred to output table. Can you upload a screenshot of tMap?
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: [resolved] ETL including Chinese Characters

Shong,
I did find the problem with the tMap. I didn't have anything transferring across, so I fixed that but the situation hasn't gotten any better. When I run the job now, it detects the Chinese characters are '???' and then passes in a series of question marks to MySQL as the output from the tMap. Please see the attached pictures.
So now I'm getting data, but it's all question marks. Is there something I'm missing? Is there something I need to install for TOS to operate in Unicode or something similar?
- Peter S
Hi
There must be something wrong on tMap, the data are not transferred to output table. Can you upload a screenshot of tMap?
Best regards
Shong
One Star

Re: [resolved] ETL including Chinese Characters

I have the same issue. Even if i put -Dfile.encoding=utf8 in Java run/debug parameter. It doesn't work. If I switch to chinese interface, the interface display correctly but the content doesn't display correctly. I am sure my source is UTF8 but it just doesn't show the characters. Please help
One Star

Re: [resolved] ETL including Chinese Characters

Hey guys. Are there any answers on this? This is something that is going to be absolutely necessary if we want to use TOS for our ETL jobs. Please let me know if I'm doing something wrong here.
Also, just as an updated, I did check the MS SQL input component and the MySQL output component to verify that the encoding is UTF-8, but that has not helped. I have also tried using the MS SQL connection over ODBC and that also has not helped. Also, I'm currently using version 4.0.3, is this something that's been addressed in later versions?
One Star

Re: [resolved] ETL including Chinese Characters

Hey guys, just checking again if anyone has an answer to this. Anyone?... Anyone?.... Bueller?
One Star

Re: [resolved] ETL including Chinese Characters

Here is an update and resolution to this issue...
1) I attempted to resolve this issue by installing the latest version of TOS. No improvement.
2) I set the default client character set to UTF8 in MySQL. No Improvement.
3) I set an additional JDBC parameter in the connection to set the character set to utf8. No improvmenet.
4) I recreated the data set in MySQL so I would transfer data from MySQL to MySQL, thus to remove the MS SQL input component as a possible issue. Had the same problem so we know that this is an issue with the job as a whole and not the MS SQL input component.
Resolution = Used SSIS.
Due to this issue we have no choice but to recreate all of our jobs using SSIS, which isn't a huge problem since the jobs that we currently have are not that complex. However it's my understanding that SSIS does not have a connector for Infobright, which was a major benefit to using TOS. It is possible to transfer data from a MySQL server to an infobright server by using a BASH script that utilizes the 'load data infile' command on the infobright instance and the 'select into outfile' command on the mysql instance with the data to be extracted, but it would be a lot easier to create and manage that process if we had a development studio such as TOS or SSIS for the jobs. Since we're using international data with Asian characters we simply can't use TOS as it appears that it cannot transfer this data.
I'm hoping that in the future someone may be able to shed some light on this or that this issue may be resolved in a future version of TOS. If so please post to this thread.
- Peter
One Star

Re: [resolved] ETL including Chinese Characters

I had a chance to take the Java that the job was creating and get it over to one of the java developers that I work with and we were able to isolate the problem. I plan on updating this thread with the resolution to the problem within the next 48 hours.
One Star

Re: [resolved] ETL including Chinese Characters

As promised, here is the solution to the problem that I found. Though I'll note that this is a little involved. In order to make this work you will need to be create a Java job in TOS and have another package on your system that allows you to write and compile java. On my system I'm using netbeans working with that latest JDK.
1) Create the job in TOS.
2) If you are outputting to a MySQL component, make sure that in the advanced settings of the component you have the following string as part of your optional JDBC parameter.
&characterEncoding=UTF-8
In my job, my optional JDBC parameters look like this....
"noDatetimeStringSync=true&characterEncoding=UTF-8"
I have not been able to test this to my satisfaction as of yet, but my server has a default client character set setting of UTF-8, so I don't know why I have to specify this as an optional JDBC parameter, but I do know that regardless of the system server variable being set, if I don't have this optional JDBC parameter in there it doesn't work.
3) Run the job to make sure there are no logic errors. Essentially you are just checking to make sure that the job is moving data from point A to point B. At this point if it's not moving non-latin characters it's not an issue, you just want to make sure that the only problem with your job is the inability to move non-latin characters.
4) Export the job. In your repository, under Job design, right click on your job and select ?Export Job?. When you do the job export MAKE SURE you have ?Source Files? checked as this is what you are going to need.
5) The job typically exports in a zip compressed format, though you can choose to decompress this as the job is exported. Regardless, if you haven't decompressed the job file, do so now.
6) Now find the file that has the source code for the job that you just created. You'll find it in a path SIMILAR to this in the decompressed data....
\<job name>\<job name>\src\<project name>\<job name>\<job name>.java.
So for example, if my job name was ?transfer?, version 0.1, and was part of a project called ?Company?, and the job's files were extracted to c:\temp, you would find the file here....
C:\Temp\transfer_0.1\transfer\src\Company\transfer_0_1\transfer.java.
Find this file and open it in your java development studio of choice.
7) Look for the readString function. In my code it looks like this....
private String readString(ObjectInputStream dis) throws IOException {
String strReturn = null;
int length = 0;
length = dis.readInt();
if (length == -1) {
strReturn = null;
} else {
if (length > commonByteArray.length) {
if (length < 1024 && commonByteArray.length == 0) {
commonByteArray = new byte;
} else {
commonByteArray = new byte;
}
}
dis.readFully(commonByteArray, 0, length);
strReturn = new String(commonByteArray, 0, length, utf8Charset);
}
return strReturn;
}
delete this and replace it with this....
private String readString(ObjectInputStream dis) throws IOException {
String strReturn = null;
int length = 0;
length = dis.readInt();
if (length == -1) {
strReturn = null;
} else {
strReturn = dis.readUTF();
/*
if (length > commonByteArray.length) {
if (length < 1024 && commonByteArray.length == 0) {
commonByteArray = new byte;
} else {
commonByteArray = new byte;
}
}
dis.readFully(commonByteArray, 0, length);
strReturn = new String(commonByteArray, 0, length, utf8Charset);
*/
}
return strReturn;
}
The biggest thing that we're changing here is just commenting out some un-needed code.
Bare in mind that this function will occur several times in the code. I believe there is at LEAST 1 for every input component that you have in your job, so make sure you get ALL of them.
8) Last, there is a try catch for getting context. You should be able to find this by searching for the following string in your code....
// call job/subjob with an existing context, like:
// --context=production. if without this parameter, there will use
// the default context instead.
Just below it you will find the following code....
java.io.InputStream inContext = stateprov.class.getClassLoader()
.getResourceAsStream(
"ideallife/stateprov_0_1/contexts/" + contextStr
+ ".properties");
This loads data from the context properties file that is exported along with the job. What we want to do is change this to just point it directly to the file instead of having the code try to figure out where the file is. Going back to the example that we used earlier when finding the java source code, you will be able to find the file in the following location....
C:\Temp\transfer_0.1\transfer\Company\transfer_0_1\contexts\Default.properties
So in this case we would want to revise the code to look like this....
java.io.InputStream inContext = new java.io.FileInputStream("C:\\Temp\\transfer_0.1\\transfer\\Company\\transfer_0_1\\contexts\\Default.properties");
//stateprov.class.getClassLoader()
// .getResourceAsStream(
// "ideallife/stateprov_0_1/contexts/" + contextStr
// + ".properties");
What we've done here is set the inContext to look to the file. We use double backslashes because the backslash is an escape character and needs to be noted twice to be handled as a litteral backslash. The four lines that followed have been commented out as they are no longer needed.
One you have made the revisions, compile and run the code. You will find it will run EXACTLY like the job you made using the GUI, only now it will actually transfer non-latin characters.
And boom goes the dynamite.
- Peter
PS. Devs are complete prima donnas, so I have to give credit to the dev that found the work around. His name is Shlomi Rosenzweig and he rocks. Unfortunately there is not enough space on this thread page for his ego, so this will have to suffice.
One Star

Re: [resolved] ETL including Chinese Characters

PeterTheDBA
You made a great job writing about this thing...
But maybe I could ask you how you describing what is the main class for example in net beans because it wont work without it (I have in mind exported job) Smiley Wink