
Talend Bigdata -POC-Use case Help

Hello Experts,
We are planning a POC covering around 15 use cases with the Talend Big Data open source edition; if it is successful, we plan to replace our existing commercial ETL tool with the Talend Enterprise Big Data edition.
Could someone please help me implement the following use case in Talend?
One of our SQL Server source tables is updated frequently (about every 2 hours) with transactional data through a front-end application, and we need to load that data into an HDFS file in our big data environment.
We need to load the data from the SQL Server table to the HDFS file every 2 hours, and on each run we should extract only new or modified rows instead of reloading the whole table (to avoid wasting space).
There is a 'Load_date_Time' column in the source table, but we can't trust it. Hence, on each extract we need to compare the data with what was loaded in the previous cycle and load only new or changed rows into the target HDFS file. Also, we have no control over the source tables beyond extracting the data.
The Talend job should also be automated to run every 2 hours.
How do we achieve these two scenarios? Any help would be appreciated!
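For the delta-detection requirement (the 'Load_date_Time' column can't be trusted, so the current extract has to be compared with the previous load), one common pattern is to keep a snapshot of per-row hashes keyed by primary key from the last cycle and emit only rows whose hash is absent or changed. Here is a minimal sketch in Python; the single-column primary key and the sample values are assumptions for illustration, not from the original post:

```python
import hashlib

def row_hash(values):
    # Hash all non-key columns so any change in any column is detected.
    return hashlib.md5("|".join(str(v) for v in values).encode()).hexdigest()

def delta(current_rows, previous_hashes):
    """Return (rows to load, updated hash snapshot).

    current_rows:    list of tuples whose first element is the primary key
                     (an assumption -- adapt to the real key columns).
    previous_hashes: dict {pk: hash} saved from the previous cycle.
    """
    new_hashes = {}
    to_load = []
    for row in current_rows:
        pk, rest = row[0], row[1:]
        h = row_hash(rest)
        new_hashes[pk] = h
        if previous_hashes.get(pk) != h:   # new or modified row
            to_load.append(row)
    return to_load, new_hashes

# First cycle: no snapshot yet, so everything loads.
_, snapshot = delta([(1, "a"), (2, "b")], {})
# Second cycle: row 2 changed, row 3 is new, row 1 is unchanged and skipped.
changed, snapshot = delta([(1, "a"), (2, "B"), (3, "c")], snapshot)
# changed -> [(2, "B"), (3, "c")]
```

In a Talend job the same idea maps to computing a hash per row (e.g. in a tMap or tJavaRow) and joining against the snapshot saved from the previous run before writing the delta to HDFS. For the 2-hour schedule, the exported job script can be triggered from cron, e.g. `0 */2 * * *`.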
Thanks in advance!
Abhi
2 REPLIES
Moderator

Re: Talend Bigdata -POC-Use case Help

Hi Abhikriti,
Thanks for posting your job requirement here.
We have forwarded your requirement to our big data experts and will get back to you as soon as we can.
Best regards
Sabrina
Employee

Re: Talend Bigdata -POC-Use case Help

Abhikriti,
What DB are you using? MS SQL Server? What does the data look like?
We can offload the data from the DB to HDFS using the tSqoop component and then do some post-processing with Hive, MapReduce, or Spark. (Please note that MapReduce and Spark are only available in Talend Enterprise, which you can download and try.)
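For reference, the underlying Sqoop call that a tSqoop-based job wraps looks roughly like the sketch below. The host, database, credentials, table, and paths are placeholders, not details from the original post:

```shell
# Command sketch only -- requires a Hadoop cluster with Sqoop installed.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username etl_user --password-file /user/etl/.pw \
  --table transactions \
  --target-dir /data/raw/transactions \
  --incremental lastmodified \
  --check-column Load_date_Time \
  --last-value "2014-01-01 00:00:00" \
  -m 4
```

One caveat: `--incremental lastmodified` relies on the check column being accurate, and the original post says the 'Load_date_Time' column can't be trusted. In that case you would import the full extract and filter out unchanged rows downstream (e.g. by comparing row hashes against the previous load) rather than relying on Sqoop's incremental mode.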
The tELTHiveXXX components can be used for the post-processing, creating Hive tables and writing the results.
Best regards,