Japanese characters


Hi, I am using Talend 1.1.1 to move data from a SQL Server database to a PostgreSQL database.
Both databases use UTF-8 as their character encoding.
This project involves Japanese translations, so the SQL Server database contains some Japanese characters.
They display properly when I browse the database with the SQL Server studio, and I have checked that Japanese characters display correctly within the Talend IDE (I copied and pasted some characters from amazon.jp into a text field).
When I execute a job that extracts data to a file, or when I use the SQLBuilderDialog, Japanese characters are replaced with '?' (one for each character).
I have put 'UTF-8' everywhere I saw an encoding field.
What should I do to properly handle my Japanese characters?
Thanks in advance
David
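
One '?' per character is the typical signature of a lossy conversion into a charset that has no Japanese characters, rather than of a display problem. A minimal Python sketch of that effect follows; the sample string and the choice of ISO-8859-1 as the lossy charset are assumptions for illustration, not details taken from the job.

```python
# Assumed sample text; the actual content of the SQL Server table is not shown in the post.
text = "日本語"

# A single-byte Western charset cannot represent these characters.
# With errors="replace", Python substitutes one '?' per unencodable character.
lossy = text.encode("iso-8859-1", errors="replace")
print(lossy)  # b'???'

# UTF-8 can represent them, using three bytes per character for this sample.
utf8 = text.encode("utf-8")
print(len(utf8))  # 9

# The UTF-8 round trip gives back the original text unchanged.
print(utf8.decode("utf-8") == text)  # True
```

Once the characters have been replaced this way, the original text cannot be recovered downstream, so the conversion step itself has to be found and fixed.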

Re: Japanese characters

Just to say that I have a better diagnosis: I queried the source database with the native SQL Server JDBC driver and got my data with the correct encoding, but when I execute the same snippet of JDBC code with the JDBC-ODBC driver, I get the encoding problem.
My ODBC source is configured with ANSI identifiers off, and I unchecked "translate character data".
I tried modifying the generated Perl to add the following two statements:
use utf8;
use encoding "utf8";
but it didn't change the job's execution.
I think that using a "nightly build" of the upcoming TOS (with generated Java code) could help me, because I would be able to use the native SQL Server JDBC driver.
I have checked the wiki but I didn't find any step-by-step tutorial on how to build the latest TOS version.
If someone can point me to a topic in this forum or to a related document, it would be great.
David
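
When comparing two drivers like this, it can help to look at code points instead of rendered text, to tell whether the characters are already literal question marks when they arrive. The sketch below is only an illustration: `code_points` and `already_lost` are hypothetical helper names, and both sample values are assumed, not captured from either driver.

```python
def code_points(s):
    """Return the Unicode code point of each character, e.g. 'U+65E5'."""
    return ["U+%04X" % ord(ch) for ch in s]


def already_lost(s):
    """True if the string is non-empty and made only of literal '?' (U+003F)."""
    return len(s) > 0 and all(ch == "?" for ch in s)


intact = "日本"   # assumed value from a driver that preserves Unicode
damaged = "??"    # assumed value after a lossy conversion somewhere upstream

print(code_points(intact))   # ['U+65E5', 'U+672C']
print(code_points(damaged))  # ['U+003F', 'U+003F']
print(already_lost(damaged)) # True
```

If the values fetched through one driver are already U+003F, the loss happened in or before that driver, and no encoding setting on the output side can repair it.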

Re: Japanese characters

I had a similar problem in the past when migrating from Oracle to SQL Server using a JDBC driver. Some characters were coming out as "?".
I needed to turn off the Unicode translation and force the correct charset.
I managed to solve that by passing some parameters to the JDBC driver, like "sendStringParametersAsUnicode=false" and "charset=ISO-8859-1".
I know that this is not the solution for you, but I think it could be related to charsets and collation; in my case I found that Unicode was not the right option, so I had to turn it off.
Regards,
Luiz Filipe
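
A short sketch of why a Latin-1 charset can fix Western text yet cannot carry Japanese: the check below tries to encode a string in a given charset and reports whether it fits. `can_encode` is a hypothetical helper and both sample strings are assumed for illustration.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


western = "\u00e0"   # 'à', assumed sample of Western accented text
japanese = "日本語"   # assumed sample of Japanese text

for charset in ("iso-8859-1", "utf-8"):
    # Print, per charset, whether each sample can be represented without loss.
    print(charset, can_encode(western, charset), can_encode(japanese, charset))
```

Under these assumptions, forcing ISO-8859-1 is only safe when all the data is Western; Japanese text needs a Unicode path end to end.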

Re: Japanese characters

I've run some tests with 2.0.0M2 and succeeded in migrating some Japanese characters from MS SQL Server to PostgreSQL. Here is how I did it.
1. create a table in your MS SQL Server database and fill it with Japanese characters (copy/paste from amazon.jp, as suggested)
$ isql MSSQL root *****
SQL> create table topg (string varchar(20));
SQLRowCount returns -1
SQL> insert into topg (string) values ('??');
SQL> insert into topg (string) values ('???');
SQL> insert into topg (string) values ('?');
SQL> insert into topg (string) values ('à');
SQL> insert into topg (string) values ('e');
SQL> select string, len(string) from topg;
+---------------------+------------+
| string              |            |
+---------------------+------------+
| ??                  | 6          |
| ???                 | 9          |
| ?                   | 3          |
| à                   | 2          |
| e                   | 1          |
+---------------------+------------+
SQLRowCount returns 5
5 rows fetched

2. create a Talend Open Studio job that reads the MS SQL Server "topg" (to PostgreSQL) table, prints the current row to STDOUT and inserts it into a PostgreSQL table. See attached screenshots. My output is:
Starting job topic306 at 17:21 15/03/2007.
??
???
?
à
e
Job topic306 ended at 17:21 15/03/2007.

3. check what has been inserted in the PostgreSQL table
$ psql -U root -W talend
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.
talend=> \d fromms
         Table "public.fromms"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 string | character varying(20) | not null
talend=> \encoding
UTF8
talend=> select string, length(string) from fromms;
 string | length
--------+--------
 ??     |      2
 ???    |      3
 ?      |      1
 à      |      1
 e      |      1
(5 rows)
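
The lengths differ between step 1 (6, 9, 3, 2, 1) and step 3 (2, 3, 1, 1, 1). That pattern is consistent with the first query counting UTF-8 bytes and the second counting characters; this is an interpretation of the output, not something verified against the servers. The Python sketch below reproduces both sets of numbers, using assumed stand-in strings since the original Japanese characters are not preserved here.

```python
# Assumed stand-ins: two, three and one Japanese characters, then 'à' and 'e'.
samples = ["日本", "日本語", "日", "\u00e0", "e"]

# Number of bytes each string occupies once encoded as UTF-8.
byte_lengths = [len(s.encode("utf-8")) for s in samples]

# Number of characters (code points) in each string.
char_lengths = [len(s) for s in samples]

print(byte_lengths)  # [6, 9, 3, 2, 1]
print(char_lengths)  # [2, 3, 1, 1, 1]
```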

Note: in tDBInput, the "Encoding" parameter is not used in the generated Perl code, but it is used in tPostgresqlOutput.