By default, the pattern frequency table indicator, available in column analysis, only supports Latin characters.
Some characters, such as Eastern Asian characters, are not addressed by this indicator. Since version 6.1, an indicator called East Asia Pattern Frequency Table is introduced in the Studio, and it allows you to extract patterns with Japanese, Chinese, and Korean characters. You can use this indicator only with the Java engine.
A more efficient solution is to use custom SQL statements in a User Defined Indicator editor, but not all databases provide functions to work with regular expressions. This article provides an example of how to create such an indicator that can run on PostgreSQL databases.
This procedure applies to version 6.1.1 and later of the Studio.
Suppose that you want to support Japanese character pattern mappings with the SQL engine in a PostgreSQL database, and use this indicator in a column analysis.
PostgreSQL natively supports the function of regular expression replacement. What you need to do is to create a UDI (User Defined Function) in the DQ Repository tree view and design the SQL template appropriately.
In the editor, set the indicator metadata and click the plus button in the Indicator Definition section to add the definition of the indicator specific to PostgreSQL.
From the database list, select PostgreSQL and click the Edit button to open the Expression editor.
Click the Full SQL Template tab in the editor and enter the following SQL template.
SELECT REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_REPLACE(<%=__COLUMN_NAMES__%>, '[\u0061-\u007A|\uFF41-\uFF5A|\u00E0-\u00F6|\u00F8-\u00FF]', 'a','g'), '[\u0041-\u005A|\uFF21-\uFF3A|\u00C0-\u00D6|\u00D8-\u00DF]','A','g'), '[0-9|\uFF10-\uFF19]','9','g'), '[\uAC00-\uD7AF]','G','g'), '[\u3041-\u3096|\u3099-\u309F]','H','g'), '[\u3041,\u3043,\u3045,\u3047,\u3049,\u3063,\u3083,\u3085,\u3087,\u308E,\u3095,\u3096]','h','g'), '[\u4E00-\u9FFF]','C','g'), '[\u30A1-\u30FA|\uFF66-\uFF9F]','K','g'), '[\u31F0-\u31FF|\uFF67-\uFF6F|\u30A1|\u30A3|\u30A5|\u30A7|\u30A9|\u30C3|\u30E3|\u30E5|\u30E7|\u30EE|\u30F5|\u30F6]','k','g') AS rp, COUNT(*) FROM <%=__TABLE_NAME__%> GROUP BY rp
Note: For databases, such as Oracle, SQL Server, and MySQL that don't currently support regular expression functions such as REGEX_REPLACE or the use of Unicode characters encoded with the hexadecimal encoding value of the form \xxxx, other solutions should be considered.
TDQ-11088 Create a UDI of category pattern frequency count which support Eastern Asian chars replacement for SQL engine - CLOSED