at the moment I'm trying to unload the data from an old sql server to our new datalake. I experienced an issue while trying to read the schema info from our tables.
When guessing the schema of a table Talend starts a mock job in the background, which fetches the data from the database. It seems that the guessing process is based on the result set. This mock job can be found within the workspace directory (Mock_job_for_Guess_schema.java) and is generated in the moment you're pressing the "guess schema" button. The query which is going to be generated is something like "select a, b, c, d ... from table). So ALL data is going to be fetched only for guessing the schema.
Unfortunately most of our datasets have between 500 Million and 1.5 billion entries and between 50 an 100 columns (an we are talking about ~100 tables). So setting up the schema manually would a huge **** in the ****.
Does anyone has an idea how to either limit the amount of rows fetched by the "guess schema" process or a workaround for this problem?
If your requirement is only about the retrieving the schema for tables, why don't you create a connection in Metadata and use Retrieve Schema option.
When I'm starting the schema retrieval process, all column names are fetched correctly from the database BUT the Db Types are not set and the Typ is always String
For a correct mapping I have to start the "guess schema process" again by pressing the according button which results in the process I described above. It would literally last forever to fetch all data from the server in order to guess the correct Db- and Java-types
This is the process which is started, when retrieving the correct DB Schema
/usr/bin/java -Xms1024M -Xmx2048M -Dfile.encoding=UTF-8 -cp /Applications/TalendStudio-6.4.1/studio/workspace/.Java/target/classes:.:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/dom4j-1.6.1.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/jtds-1.3.1-patch.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/log4j-1.2.16.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talend_DB_mssqlUtil.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talend_file_enhanced_20070724.jar:/Applications/TalendStudio-6.4.1/studio/workspace/.Java/lib/talendcsv.jar: test_ci.mock_job_for_guess_schema_0_1.Mock_job_for_Guess_schema --context=Default
Watch the recorded webinar!
Introduction to Talend Open Studio for Data Integration.
Test drive Talend's enterprise products.
Practical steps to developing your data integration strategy.