Iteration in Spark Job



Hi all,

We have a requirement to read multiple HDFS files and convert them to Parquet. The input files are located in different directories, including nested subdirectories that need to be traversed recursively.

We want to iterate over all the files and pass each one to the output file component. Is there a component that can iterate over the files and hold the current file name as a global variable?
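For reference, a pure Spark job can cover this requirement without a global variable: a recursive read picks up nested directories, and the built-in input_file_name() function tags every row with its source file. A minimal Java sketch, assuming Spark 3.x and hypothetical HDFS paths:

```java
import static org.apache.spark.sql.functions.input_file_name;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-parquet")
                .getOrCreate();

        // recursiveFileLookup (Spark 3.0+) picks up files in nested directories;
        // on older versions a glob pattern such as "/landing/*/*.csv" works instead.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("recursiveFileLookup", "true")
                .csv("hdfs:///landing/csv")                    // hypothetical input root
                .withColumn("source_file", input_file_name()); // source file kept per row

        df.write().mode("overwrite").parquet("hdfs:///curated/parquet"); // hypothetical output
        spark.stop();
    }
}
```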


Re: Iteration in Spark Job

Hi,

You can handle all the control logic in a DI (standard) job and trigger the Big Data job through tRunJob, with the option to run the child job as an independent process switched on.
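For example, if the parent DI job iterates with tHDFSList and calls the Big Data job through tRunJob, the current file path is typically handed over as a context parameter via globalMap. The component label below is an assumption; the actual key depends on how your component is named:

```java
// In the tRunJob "Context Param" table of the hypothetical parent DI job,
// the iterating component exposes the current file through globalMap:
String currentFile = (String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH"); // assumed label tHDFSList_1
```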

 

Warm Regards,

Nikhil Thampi


Re: Iteration in Spark Job

Thanks Nikhil. We designed our job with the same logic, but we are seeing slow processing when we use a standard job.

 

We are running the following operations in the master job:

1. Download files from S3 to local disk and copy them to HDFS
2. Convert the CSV files to Parquet on HDFS
3. Copy the Parquet output from HDFS to local disk and upload it to S3

 

Currently we are not able to run more than 10 parallel flows; the Job Server is an 8-CPU machine and accepts only 8 tRunJob flows at a time. Is there any solution to increase the number of parallel threads?
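One general observation, not Talend-specific: the copy steps are mostly I/O-bound, so the useful degree of parallelism is not inherently capped at the CPU count. A plain Java sketch of that idea, with a hypothetical pool size and file list:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFlows {
    public static void main(String[] args) throws InterruptedException {
        // For I/O-bound work (S3/HDFS copies) a pool larger than the core count
        // is usually fine; 20 is a hypothetical value to tune, not a recommendation.
        ExecutorService pool = Executors.newFixedThreadPool(20);
        List<String> files = List.of("a.csv", "b.csv", "c.csv"); // hypothetical file list

        for (String file : files) {
            pool.submit(() -> System.out.println("processing " + file)); // stand-in for one flow
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```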

 

Since we are seeing this slowness, we have decided to move to pure Big Data jobs.
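A pure Spark job can also collapse steps 1 and 3 above by reading from and writing to S3 directly over the s3a connector, avoiding the local and HDFS staging copies. A minimal sketch, assuming hadoop-aws is configured on the cluster and using hypothetical bucket prefixes:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class S3CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3-csv-to-parquet")
                .getOrCreate();

        // Only step 2 remains: Spark reads the CSVs from S3 and writes Parquet
        // back to S3 (credentials and endpoint are assumed to be configured).
        Dataset<Row> csv = spark.read()
                .option("header", "true")
                .csv("s3a://my-bucket/incoming/");            // hypothetical input prefix

        csv.write().mode("overwrite")
                .parquet("s3a://my-bucket/curated/parquet/"); // hypothetical output prefix

        spark.stop();
    }
}
```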
