One Star

Need suggestion on best practice: Big Data Batch Job (with Spark) or Standard Job with tSqoop (with MR)?

Hi experts,

First of all, I haven't seen a main topic for Big Data discussion like in your old forum, only the BD sandbox that is currently available, so I decided to ask here.

Can any of you give me an idea of which one to choose when we want to do data ingestion from an RDBMS to HDFS/Hive?
I've been thinking of these two approaches; please tell me which one is better (or suggest a better alternative):

1. In a Standard Job: tSqoopImport --component ok--> tHiveLoad
OR
2. In a Big Data Batch Job (Spark): tXXXInput (RDBMS, e.g. Oracle/MSSQL/etc.) --main flow--> tFileOutputDelimited (to write to HDFS), then load into Hive from HDFS (see the sketch after this list).
Or maybe one of you has an even better solution?
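
To make option 2 concrete, here is a minimal PySpark sketch of that flow. This is not the code Talend generates, just the shape of the job, and every connection detail, path, and table name below is a hypothetical placeholder:

```python
from pyspark.sql import SparkSession

# Rough equivalent of option 2; all connection details are placeholders.
spark = (SparkSession.builder
         .appName("rdbms_to_hive")
         .enableHiveSupport()
         .getOrCreate())

# tXXXInput: read the source table over JDBC.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@dbhost:1521:ORCL")
      .option("dbtable", "CUSTOMERS")
      .option("user", "etl_user")
      .option("password", "secret")  # placeholder
      .load())

# tFileOutputDelimited: write delimited files to HDFS ...
(df.write.mode("overwrite")
   .option("sep", "|")
   .csv("hdfs:///staging/customers"))

# ... then load them into an existing Hive table whose layout matches.
# Writing with df.write.saveAsTable(...) would skip the intermediate
# files entirely.
spark.sql("LOAD DATA INPATH 'hdfs:///staging/customers' "
          "OVERWRITE INTO TABLE staging.customers")
```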

Huge thanks

  • Big Data
2 REPLIES
Moderator

Re: Need suggestion on best practice: Big Data Batch Job (with Spark) or Standard Job with tSqoop (with MR)?

Hi,

You can import data from an RDBMS into Hadoop using Sqoop without using tHiveLoad; Sqoop's --hive-import option loads the data straight into Hive.
Please take a look at the related scenario in the component reference: TalendHelpCenter:tSqoopImport
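
For what it's worth, here is a minimal sketch of the kind of Sqoop call tSqoopImport issues, shown from Python just for concreteness; the connection string, credentials, and table names are all hypothetical placeholders:

```python
import subprocess

# Sketch of a direct RDBMS-to-Hive import with Sqoop; the --hive-import
# flag loads the data straight into Hive, so no separate tHiveLoad step
# is needed. All connection details are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@dbhost:1521:ORCL",
        "--username", "etl_user",
        "--password", "secret",   # placeholder; prefer --password-file
        "--table", "CUSTOMERS",
        "--hive-import",          # create/load the Hive table directly
        "--hive-table", "customers",
    ],
    check=True,  # raise CalledProcessError if the import fails
)
```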
Best regards
Sabrina

--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: Need suggestion on best practice: Big Data Batch Job (with Spark) or Standard Job with tSqoop (with MR)?

Hi @xdshi, thanks for the reply.

Yes, currently I'm using tSqoopImport to bring the data into HDFS, and since the destination is Hive, I then use tHiveLoad.

My main confusion is that, for the same scenario (RDBMS source to Hive), it is also possible, if I'm not mistaken, to use a Big Data Batch Job, which performs the task on the Spark framework. The component flow would be more or less like the one in point 2 above.

So back to the question: which would be faster (or which is best practice)?
Using a Standard Job for ingestion from the RDBMS,
or a Big Data Batch Job on the Spark framework?
Correct me if I'm wrong.

Thanks