Japanese characters

Hi, I am using Talend 1.1.1 to move data from a SQL Server database to a PostgreSQL database.
Both databases use UTF-8 for character encoding.
This project involves Japanese translations, so the SQL Server database contains some Japanese characters.
They display properly when I browse the database with SQL Server Studio, and I have checked that Japanese characters display correctly within the Talend IDE (I copied and pasted some characters from amazon.jp into a text field).
When I execute a job that extracts data to a file, or when I use the SQLBuilderDialog, Japanese characters are replaced with '?' (one for each character).
I have put 'UTF-8' everywhere I saw an encoding field.
What should I do to handle my Japanese characters properly?
Thanks in advance,
David

Re: Japanese characters

Just to say that I have a better diagnosis: I queried the source database with the native SQL Server JDBC driver and got my data with the correct encoding, but when I execute the same snippet of JDBC code with the JDBC-ODBC driver I get the encoding problem.
My ODBC source is configured with ANSI identifiers off, and I unchecked "translate characters data".
I tried to modify the generated Perl to add the following two statements:
use utf8;
use encoding "utf8";
but it didn't change the job's execution.
I think that using a "nightly build" of the upcoming TOS (with generated Java code) could help me, because I would be able to use the native SQL Server JDBC driver.
I have checked the wiki but I didn't find any step-by-step tutorial on how to build the latest TOS version.
If someone can point me to a topic in this forum or to a related document, it would be great.
David
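
One '?' per character is the usual signature of a lossy charset conversion: when text passes through a layer that encodes it into a single-byte code page, every character that code page cannot represent is replaced by a substitution character. The Java sketch below reproduces that symptom in isolation. It assumes ISO-8859-1 as the intermediate code page and uses a hypothetical sample string; it does not reproduce the JDBC-ODBC bridge itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyCharsetDemo {
    // Encode text into the given charset and decode it back, which is
    // roughly what a non-Unicode driver layer does to string data.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: three Japanese characters, written as
        // Unicode escapes so the source file encoding does not matter.
        String japanese = "\u65E5\u672C\u8A9E";

        // ISO-8859-1 cannot represent these characters, so each one
        // is replaced by '?'.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1)); // ???

        // UTF-8 can represent them, so the round trip is lossless.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese)); // true
    }
}
```

If the same pattern shows up in a job, the conversion step that narrows the text to a non-Unicode code page is the place to look, rather than the encoding declared on the output side.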

Re: Japanese characters

I had a similar problem in the past, when migrating from Oracle to SQL Server using a JDBC driver. Some characters were coming out as "?".
I needed to turn off the Unicode translation and force the correct charset.
I managed to solve that by passing some parameters to the JDBC driver, like "sendStringParametersAsUnicode=false" and "charset=ISO-8859-1".
I know that this is not the solution for you, but I think it could be related to charsets and collation. In my case I found that Unicode was not the right option, so I had to turn it off.
Regards,
Luiz Filipe
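
As a rough illustration of passing such parameters, the sketch below appends them to a JDBC URL as semicolon-separated properties. This assumes a driver that accepts properties in that form (the jTDS URL syntax is used here as an example; the post does not say which driver was involved). The host name `dbhost`, the database name `mydb` and the helper `withProperties` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JdbcUrlSketch {
    // Append ";name=value" pairs to a base JDBC URL. This assumes the
    // driver reads connection properties from the URL in this form.
    static String withProperties(String baseUrl, Map<String, String> props) {
        StringBuilder url = new StringBuilder(baseUrl);
        for (Map.Entry<String, String> e : props.entrySet()) {
            url.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("sendStringParametersAsUnicode", "false");
        props.put("charset", "ISO-8859-1");

        // Hypothetical host and database names.
        System.out.println(withProperties("jdbc:jtds:sqlserver://dbhost:1433/mydb", props));
    }
}
```

Turning Unicode off suited the Oracle to SQL Server case described above; for Japanese data the opposite setting is more likely to be appropriate, so the values here are only placeholders.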
Employee

Re: Japanese characters

I've run some tests with 2.0.0M2 and succeeded in migrating some Japanese characters from MS SQL Server to PostgreSQL. Here is how I did it.
1. Create a table in your MS SQL Server database and fill it with Japanese characters (copied and pasted from amazon.jp, as suggested):
$ isql MSSQL root *****
SQL> create table topg (string varchar(20));
SQLRowCount returns -1
SQL> insert into topg (string) values ('??');
SQL> insert into topg (string) values ('???');
SQL> insert into topg (string) values ('?');
SQL> insert into topg (string) values ('à');
SQL> insert into topg (string) values ('e');
SQL> select string, len(string) from topg;
+---------------------+------------+
| string | |
+---------------------+------------+
| ?? | 6 |
| ??? | 9 |
| ? | 3 |
| à | 2 |
| e | 1 |
+---------------------+------------+
SQLRowCount returns 5
5 rows fetched

2. Create a Talend Open Studio job that reads the MS SQL Server "topg" (to PostgreSQL) table, prints the current row to STDOUT and inserts it into a PostgreSQL table. See attached screenshots. My output is:
Starting job topic306 at 17:21 15/03/2007.
??
???
?
à
e
Job topic306 ended at 17:21 15/03/2007.

3. Check what has been inserted into the PostgreSQL table:
$ psql -U root -W talend
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.
talend=> \d fromms
Table "public.fromms"
Column | Type | Modifiers
--------+-----------------------+-----------
string | character varying(20) | not null
talend=> \encoding
UTF8
talend=> select string, length(string) from fromms;
string | length
--------+--------
?? | 2
??? | 3
? | 1
à | 1
e | 1
(5 rows)

Note: in tDBInput, the "Encoding" parameter is not used in the Perl code, but it is used in tPostgresqlOutput.
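
The two length columns above differ in a way that is consistent with the SQL Server varchar column holding raw UTF-8 bytes (6, 9, 3, 2, 1) while PostgreSQL counts characters (2, 3, 1, 1, 1). That reading is an inference from the numbers, not something the session output states. The Java sketch below shows the same byte and character counts for hypothetical sample strings of matching lengths (the original Japanese strings are not recoverable from the listing).

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode characters (code points) in the string.
    static int characters(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Hypothetical samples: 2, 3 and 1 Japanese characters,
        // then "a with grave accent" and plain "e".
        String[] samples = {
            "\u65E5\u672C", "\u65E5\u672C\u8A9E", "\u65E5", "\u00E0", "e"
        };
        for (String s : samples) {
            System.out.println(utf8Bytes(s) + " bytes, " + characters(s) + " characters");
        }
    }
}
```

Each Japanese character in these samples takes three bytes in UTF-8, the accented letter takes two and the ASCII letter takes one, which matches the 6/9/3/2/1 versus 2/3/1/1/1 pattern.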
