Four Stars

Guess schema: problem with big tables

Hi,

 

at the moment I'm trying to unload the data from an old sql server to our new datalake. I experienced an issue while trying to read the schema info from our tables.

 

When guessing the schema of a table Talend starts a mock job in the background, which fetches the data from the database. It seems that the guessing process is based on the result set. This mock job can be found within the workspace directory (Mock_job_for_Guess_schema.java) and is generated in the moment you're pressing the "guess schema" button. The query which is going to be generated is something like "select a, b, c, d ... from table). So ALL data is going to be fetched only for guessing the schema.

 

Unfortunately most of our datasets have between 500 Million and 1.5 billion entries and between 50 an 100 columns (an we are talking about ~100 tables). So setting up the schema manually would a huge **** in the ****.

 

Does anyone has an idea how to either limit the amount of rows fetched by the "guess schema" process or a workaround for this problem?

 

Best

Arne

3 REPLIES
Seven Stars

Re: Guess schema: problem with big tables

If your requirement is only about the retrieving the schema for tables, why don't you  create a connection in Metadata and use Retrieve Schema option. 

 

dadadadad.png

Four Stars

Re: Guess schema: problem with big tables

When I'm starting the schema retrieval process, all column names are fetched correctly from the database BUT the Db Types are not set and the Typ is always String

 

Bildschirmfoto 2017-10-20 um 15.12.29.png

 

For a correct mapping I have to start the "guess schema process" again by pressing the according button which results in the process I described above. It would literally last forever to fetch all data from the server in order to guess the correct Db- and Java-types

Four Stars

Re: Guess schema: problem with big tables

This is the process which is started, when retrieving the correct DB Schema

/usr/bin/java -Xms1024M -Xmx2048M -Dfile.encoding=UTF-8 -cp /Applications/TalendStudio-6.4.1/studio/workspace/.Java/target/classes:.:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/dom4j-1.6.1.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/jtds-1.3.1-patch.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/log4j-1.2.16.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talend_DB_mssqlUtil.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talend_file_enhanced_20070724.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talendcsv.jar: test_ci.mock_job_for_guess_schema_0_1.Mock_job_for_Guess_schema --context=Default