Memory issues when profiling large data sets using indicators

Overview

When you use the Java engine to run column analyses or column set analyses with indicators on large data sets, the Studio may run out of memory and the analysis may fail with a Java heap space error.

 

Environment

Talend Open Studio for Data Quality and all platform Studios with data quality.

 

Description

You may get a Java heap space error when you run column or column set analyses with certain indicators on large data sets. The Java engine retrieves all data and stores it locally (MapDB mode), which consumes both disk space and physical memory. The more data you profile, the more memory is needed, which can lead to an error such as the following:

eclipse.buildId=unknown
java.version=1.7.0_15
java.vendor=Oracle Corporation
BootLoader constants: OS=win32, ARCH=x86_64, WS=win32, NL=en_US
Framework arguments:  -talendReload false
Command-line arguments:  -talendReload false -os win32 -ws win32 -arch x86_64
Error
Tue Nov 04 10:32:02 CET 2014
An internal error occurred during: "Run Analysis".
java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2220)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2044)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3549)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:489)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3240)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2411)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2834)
...

The indicators that may cause memory issues with the Java engine are:

  • Unique count, duplicate count, distinct count
  • Median and quartiles indicators
  • Pattern (Low) Frequency table indicator
  • Soundex (Low) Frequency table
  • All advanced statistics indicators (frequency tables), depending on the aggregation level (e.g. a year frequency table will fit in memory if the analyzed period does not contain millions of years)
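To see why these indicators are memory-hungry, compare a distinct count with a plain row count. The following is a minimal sketch, not Talend Studio code: the class and method names are hypothetical, and an in-memory HashSet stands in for whatever storage the Studio actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not how the Studio is implemented.
class DistinctCountSketch {

    // Counts distinct values by remembering every value seen so far.
    // The set grows with the number of distinct values, so a
    // high-cardinality column on a large data set needs a lot of heap.
    static int distinctCount(Iterable<String> columnValues) {
        Set<String> seen = new HashSet<>();
        for (String value : columnValues) {
            seen.add(value);
        }
        return seen.size();
    }

    // A row count, by contrast, needs only a single counter,
    // whatever the size of the data set.
    static long rowCount(Iterable<String> columnValues) {
        long rows = 0;
        for (String ignored : columnValues) {
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println("rows=" + rowCount(column));
        System.out.println("distinct=" + distinctCount(column));
    }
}
```

Frequency tables behave like the distinct count: one entry is kept per distinct key, which is why a coarse aggregation level (such as years) stays small while a fine-grained one may not.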

 

Resolution

It is advisable to define a maximum memory size threshold for column and column set analyses used on large data sets and executed with the Java engine.

A detailed procedure on how to define a memory threshold is outlined in Defining the maximum memory size threshold.
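The idea behind such a threshold can be sketched as a comparison between the heap currently in use and a configured limit. This is an illustration under stated assumptions, not the Studio's actual mechanism: the class name, the method names and the 1024 MB value are all hypothetical.

```java
// Hypothetical illustration of a memory threshold check;
// not taken from Talend Studio.
class MemoryThresholdSketch {

    static final long BYTES_PER_MB = 1024L * 1024L;

    // True when the used heap (in bytes) is above the threshold (in MB).
    static boolean exceedsThreshold(long usedBytes, long thresholdMb) {
        return usedBytes > thresholdMb * BYTES_PER_MB;
    }

    // Heap currently in use by this JVM: allocated heap minus its free part.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long thresholdMb = 1024; // assumed value, for illustration only
        long used = usedHeapBytes();
        System.out.println("used heap (MB): " + used / BYTES_PER_MB);
        System.out.println("over threshold: " + exceedsThreshold(used, thresholdMb));
    }
}
```

A threshold of this kind is only useful when it is lower than the maximum heap size (-Xmx) given to the JVM.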

However, if this does not solve the memory problem, you must first change the default Java Virtual Machine (JVM) parameters by setting new values in the platform.ini file of your Studio. You can then go back and define the memory size threshold from the Studio Preferences page.

The new JVM values you must set depend on the size of the data set you want to profile, and on the available memory on your machine. Below is an example of these parameters:

 

-vmargs
-Xms512m
-Xmx1536m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8
-Dtalend.mapdb.cacheSize=1024
-Dtalend.mapdb.mmapFileEnable=true
-Dtalend.mapdb.closeDelayTime=300000
-Dtalend.mapdb.valuesOutsideNodesEnable=false
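To check how a JVM interprets heap settings like these, a small standalone program can print the maximum heap and the arguments it was started with. This is a sketch with hypothetical class and method names. It reports on the JVM it runs in, not on a running Studio, so launch it with the same -Xms and -Xmx values you want to verify. The reported maximum may be slightly lower than the -Xmx value.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper; prints the heap settings of the JVM that runs it.
class HeapSettingsCheck {

    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Returns the value following the given prefix (for example "-Xmx"),
    // or null when no argument starts with that prefix.
    static String findFlag(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, such as -Xms512m and -Xmx1536m.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("-Xms flag: " + findFlag(jvmArgs, "-Xms"));
        System.out.println("-Xmx flag: " + findFlag(jvmArgs, "-Xmx"));
        System.out.println("max heap (MB): " + toMb(Runtime.getRuntime().maxMemory()));
    }
}
```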

 

Note: As the 64-bit JDK 1.7 needs more memory than the 64-bit JDK 1.6, you must assign even more memory to the Xmx value when you use the 64-bit JDK 1.7 with your Studio. For example:

-Xmx3072m

Version history

Revision #: 2 of 2
Last update: 06-07-2017 06:15 PM