How to create a custom pattern frequency indicator for Japanese characters

Overview

By default, the pattern frequency table indicator, available in column analysis, only supports Latin characters.

 

Some characters, such as Eastern Asian characters, are not addressed by this indicator. Since version 6.1, an indicator called East Asia Pattern Frequency Table is introduced in the Studio, and it allows you to extract patterns with Japanese, Chinese, and Korean characters. You can use this indicator only with the Java engine.

 

A more efficient solution is to use custom SQL statements in a User Defined Indicator editor, but not all databases provide functions to work with regular expressions. This article provides an example of how to create such an indicator that can run on PostgreSQL databases.

 

Environment

This procedure applies to version 6.1.1 and later of the Studio.

 

Description

Suppose that you want to support Japanese character pattern mappings with the SQL engine in a PostgreSQL database, and use this indicator in a column analysis.

 

Resolution

PostgreSQL natively supports the function of regular expression replacement. What you need to do is to create a UDI (User Defined Function) in the DQ Repository tree view and design the SQL template appropriately.

  1. In the Profiling perspective, right-click the User Defined Indicators folder and select New Indicator.
  2. In the editor, set the indicator metadata and click the plus button in the Indicator Definition section to add the definition of the indicator specific to PostgreSQL.

    custompatern_japchar.png

     

  3. From the database list, select PostgreSQL and click the Edit button to open the Expression editor.

    custompatern_japchar2.png

     

  4. Click the Full SQL Template tab in the editor and enter the following SQL template.

    SELECT
        REGEXP_REPLACE(
          REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_REPLACE(<%=__COLUMN_NAMES__%>,
                  '[\u0061-\u007A|\uFF41-\uFF5A|\u00E0-\u00F6|\u00F8-\u00FF]',
                  'a','g'),
              '[\u0041-\u005A|\uFF21-\uFF3A|\u00C0-\u00D6|\u00D8-\u00DF]','A','g'),
            '[0-9|\uFF10-\uFF19]','9','g'),
              '[\uAC00-\uD7AF]','G','g'),
            '[\u3041-\u3096|\u3099-\u309F]','H','g'),
          '[\u3041,\u3043,\u3045,\u3047,\u3049,\u3063,\u3083,\u3085,\u3087,\u308E,\u3095,\u3096]','h','g'),
        '[\u4E00-\u9FFF]','C','g'),
          '[\u30A1-\u30FA|\uFF66-\uFF9F]','K','g'),
        '[\u31F0-\u31FF|\uFF67-\uFF6F|\u30A1|\u30A3|\u30A5|\u30A7|\u30A9|\u30C3|\u30E3|\u30E5|\u30E7|\u30EE|\u30F5|\u30F6]','k','g')
    AS rp,
    COUNT(*)
    FROM <%=__TABLE_NAME__%> GROUP BY rp
  5. Click OK to close the editor and create the indicator. The new indicator is listed under the User Defined Indicators folder in the DQ Repository tree view.
  6. Use this indicator to compute pattern frequencies for both Latin and Eastern Asian characters on PostgreSQL.

 

Note: For databases, such as Oracle, SQL Server, and MySQL that don't currently support regular expression functions such as REGEX_REPLACE or the use of Unicode characters encoded with the hexadecimal encoding value of the form \xxxx, other solutions should be considered.

 

Related Jira Issues

TDQ-11088 Create a UDI of category pattern frequency count which support Eastern Asian chars replacement for SQL engine - CLOSED

Version history
Revision #:
2 of 2
Last update:
‎06-30-2017 10:51 AM
Updated by:
 
Labels (1)