One Star

The problem of Analyses is use russian ALPHABET in patterns

The problem occurs when Analyses use the Russian alphabet in patterns.
1. I create a pattern with Russian characters.
FIRST_TWO_RUS_CHARS = '^(?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?|?){2} *'
The pattern (analog , but for Russian chars) doesn't work. I don't know why.
2. I create an analysis DOC_ANALYSIS for the column DOC in my table.
3. I select the FIRST_TWO_RUS_CHARS pattern to analyse the column.
4. Run profiling. The result is good for me.
5. Close TDQ.
6. Look in the file workspace\PROFILING\TDQ_Data Profiling\Analyses\ ... .ana

I see that the Russian characters are intact in this file.

7. Open TDQ.
8. Look in the file workspace\PROFILING\TDQ_Data Profiling\Analyses\ ... .ana

I see that the Russian characters are no longer intact in this file! The encoding was broken!
But when I create a pattern such as '^(?|?|???|???|?|?|???|???)$' (Russian gender), TDQ works OK.

Why doesn't the pattern work?
Why were the Russian characters broken in the .ana file when I opened TDQ?
What do you think about this?

Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Hi mkovtun, what is the system character encoding? UTF-8?
You can check it with this code:
String encoding = System.getProperty("file.encoding");
System.out.println("Default System Encoding:" + encoding);
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Hi,
Can you create a zip with those two files, please? The first is the file after you close TDQ, the second is the file after you open it with TDQ.
We need you to zip them to avoid any encoding conversion during the transfer.
Can you attach this zip to 8923, please?
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

mkovtun,
you can find your OS encoding in TDQ:
Click the "Help" menu and choose "About Talend Data Quality".
This opens a small window.
Then click "Configuration details".
You can copy the information to the clipboard and paste it here.
The important information is "file.encoding" but we are also interested in the other information.
One Star

Re: The problem of Analyses is use russian ALPHABET in patterns

All configuration details (Configuration details.txt) are in the bug tracker.
This is part of them:
....
eclipse.startTime=1252325365906
eclipse.vm=C:\Program Files\Java\jre1.6.0_07\bin\client\jvm.dll
eclipse.vmargs=-Xms40m
-Xmx500m
-XX:MaxPermSize=256m
-Djava.class.path=C:\TDQ_EE-All-r26090-V3.1.3\TDQ_EE-All-r26090-V3.1.3\plugins\org.eclipse.equinox.launcher_1.0.100.v20080509-1800.jar
file.encoding=Cp1251
file.encoding.pkg=sun.io
file.separator=\
java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment
java.awt.printerjob=sun.awt.windows.WPrinterJob
java.class.path=C:\TDQ_EE-All-r26090-V3.1.3\TDQ_EE-All-r26090-V3.1.3\plugins\org.eclipse.equinox.launcher_1.0.100.v20080509-1800.jar
java.class.version=50.0
....
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

mkovtun,
Could you try editing your "TalendDataQuality-win32-x86.ini" file and adding "-Dfile.encoding=UTF-8" at the end?
You should have something like:
-nl
en_EN
-vmargs
-Xms40m
-Xmx500m
-XX:MaxPermSize=256m
-Dfile.encoding=UTF-8
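
To illustrate why the default encoding matters here, below is a minimal Java sketch of how Cyrillic text can get damaged when bytes written in one charset are read back in another. It is only an illustration of the general mechanism, not a confirmed account of what TDQ does internally with the .ana file. The class name, the roundTrip helper and the sample letters are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encode the text with one charset, then decode the resulting bytes with another.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // "windows-1251" is the charset that Java reports as Cp1251.
        Charset cp1251 = Charset.forName("windows-1251");

        // Two Cyrillic letters, written as Unicode escapes so that the
        // encoding of this source file cannot interfere with the demo.
        String sample = "\u041C\u0416";

        String same = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(sample, StandardCharsets.UTF_8, cp1251);

        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("written and read as UTF-8, intact: " + sample.equals(same));
        System.out.println("written as UTF-8, read as Cp1251, intact: " + sample.equals(mixed));
    }
}
```

When the writer and the reader agree on the charset, the text survives. When they disagree, each two-byte UTF-8 letter is decoded as two unrelated Cp1251 characters. Setting -Dfile.encoding=UTF-8 makes the JVM default agree with UTF-8 files.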
One Star

Re: The problem of Analyses is use russian ALPHABET in patterns

It works.
I think the problem with Russian comments in the table has the same cause...

But the pattern (analog , but for Russian chars) still doesn't work.
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Good! First problem resolved :-)
Now let's tackle the second issue. What's the problem with the pattern? What do you mean by "doesn't work"? Is it still a file encoding issue or is it another problem?
One Star

Re: The problem of Analyses is use russian ALPHABET in patterns

Open the Pattern Test View.
Enter the '^$' pattern. In the Test Area, enter S. Click the Test button. Result: Matches (good).
In the Test Area, enter ?. Click the Test button. Result: Non-matches (good).
Enter the '^$' pattern. In the Test Area, enter ?. Click the Test button. Result: Non-matches (not good, it must be Matches).
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Is '?' a single character?
With my configuration, MySQL sees it as two distinct characters, because the following query works
SELECT  '?' REGEXP '^.$'  AS OK;

In my opinion, this is due to the fact that MySQL regular expressions may not work properly with UTF8; see http://dev.mysql.com/doc/refman/5.1/en/regexp.html where it is said:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
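
The byte-wise behaviour described in the quote can be imitated in Java to see its effect. This is only a simulation under stated assumptions: it does not run MySQL's regex engine. It maps each UTF-8 byte to one char via ISO-8859-1, so that java.util.regex sees bytes instead of characters. The class and method names are made up for this sketch, and the regex itself must be ASCII-only for the comparison to be fair.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class ByteWiseRegexDemo {

    // Character-wise matching: the engine sees Unicode characters.
    static boolean matchesByChar(String regex, String text) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(text).find();
    }

    // Simulated byte-wise matching: every UTF-8 byte of the text becomes
    // exactly one char (ISO-8859-1 maps bytes 0x00-0xFF one to one).
    // DOTALL keeps '.' from skipping bytes that look like line terminators.
    static boolean matchesByByte(String regex, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String bytesAsChars = new String(utf8, StandardCharsets.ISO_8859_1);
        return Pattern.compile(regex, Pattern.DOTALL).matcher(bytesAsChars).find();
    }

    public static void main(String[] args) {
        String cyrillic = "\u042F"; // one Cyrillic letter, two bytes in UTF-8

        System.out.println("char-wise '^.$'  : " + matchesByChar("^.$", cyrillic));
        System.out.println("byte-wise '^.$'  : " + matchesByByte("^.$", cyrillic));
        System.out.println("byte-wise '^..$' : " + matchesByByte("^..$", cyrillic));
        // The two-dot pattern also accepts two ASCII characters.
        System.out.println("byte-wise '^..$' on \"ab\": " + matchesByByte("^..$", "ab"));
    }
}
```

In this simulation a single Cyrillic letter only matches when the pattern allows for two units. The same widened pattern then also accepts two single-byte characters, so it no longer means "exactly one character".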
One Star

Re: The problem of Analyses is use russian ALPHABET in patterns

'?' as '?' is a single character.
I want to change the character set in MySQL to CP1251 and try again ...
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Yes, that's what I would try. But it's really a pity that we cannot work with the UTF8 encoding.
Let us know if it works.
Thank you.
One Star

Re: The problem of Analyses is use russian ALPHABET in patterns

What I have:
In MySQL the encoding is UTF-8.
The connection in TDQ has the encoding parameter set to Cp1251.
The pattern '^.$' works.
Thank you.
Employee

Re: The problem of Analyses is use russian ALPHABET in patterns

Be careful, I have added a dot in the pattern. '^.$' is not the same as what you wanted at the beginning because it means that you match 2 characters.
Your initial pattern was '^$' which should match a single character.
The pattern matching runs on the database with a SQL query, so I'm not sure that changing the connection parameters is enough. If MySQL encoding is UTF-8, it's a multi-byte encoding and MySQL may not work as expected.
Please check that your pattern is the correct one.
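
One way to check what the pattern is meant to accept, independently of the database, is to test it with a character-aware engine first. The sketch below uses Java's regex with the Unicode script class \p{IsCyrillic} as a stand-in for the explicit list of letters in FIRST_TWO_RUS_CHARS. That substitution is an assumption: the script class covers more than the Russian alphabet. The class and method names are made up, and MySQL REGEXP may still behave differently, as noted above.

```java
import java.util.regex.Pattern;

public class CyrillicPatternCheck {

    // Hypothetical stand-in for FIRST_TWO_RUS_CHARS: two Cyrillic letters at
    // the start, followed by any number of spaces. \p{IsCyrillic} matches the
    // whole Cyrillic script, which is broader than the Russian alphabet.
    static final Pattern FIRST_TWO_CYRILLIC = Pattern.compile("^\\p{IsCyrillic}{2} *");

    static boolean startsWithTwoCyrillic(String value) {
        return FIRST_TWO_CYRILLIC.matcher(value).find();
    }

    public static void main(String[] args) {
        // Sample values use Unicode escapes so the source encoding does not matter.
        String cyrillicPrefix = "\u0410\u0411 123"; // two Cyrillic letters, space, digits
        String latinPrefix = "AB 123";               // two Latin letters, space, digits
        String mixedPrefix = "\u0410B 123";          // one Cyrillic letter, one Latin letter

        System.out.println("Cyrillic prefix: " + startsWithTwoCyrillic(cyrillicPrefix));
        System.out.println("Latin prefix:    " + startsWithTwoCyrillic(latinPrefix));
        System.out.println("Mixed prefix:    " + startsWithTwoCyrillic(mixedPrefix));
    }
}
```

If the values behave as expected here but not in the analysis, the difference most likely comes from how the database evaluates the pattern rather than from the pattern itself.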