We have a requirement to read multiple HDFS files and convert them to Parquet. The input files are spread across different directories and have to be picked up recursively.
We want to iterate over all the files and pass each one to the output file component. Is there a component that can iterate over files and hold the file name as a global variable?
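To make the requirement concrete, here is a minimal Python sketch of the iteration being asked for: walk a directory tree recursively and pair each input file with the Parquet path it should be written to. It is not Talend code. It uses the local filesystem as a stand-in for HDFS, and the function name `plan_conversions`, the `*.csv` pattern, and the mirrored output layout are all assumptions made for illustration.

```python
from pathlib import Path


def plan_conversions(input_root, output_root, pattern="*.csv"):
    """Return (source, target) pairs for every matching file under input_root.

    Hypothetical helper: the local filesystem stands in for HDFS here.
    The target path mirrors the source's sub-directory layout under
    output_root, with the extension changed to .parquet.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    pairs = []
    # rglob descends into every sub-directory, so nested inputs are found too.
    for src in sorted(input_root.rglob(pattern)):
        if not src.is_file():
            continue
        relative = src.relative_to(input_root)
        pairs.append((src, output_root / relative.with_suffix(".parquet")))
    return pairs
```

Each pair plays the role of the "current file name" variable: the loop body of the job would read `src` and write `target`.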
You can do all the control logic in a DI job and trigger the BD job from it with the independent child process option turned on.
Thanks, Nikhil. We designed our job with the same logic, but processing is slow when we use a standard job.
We use the following operations in the master job:
1. Download the file from S3 to local and copy it to HDFS
2. Convert the CSV file to Parquet on HDFS
3. Copy the HDFS file to local and upload it to S3
Currently we are not able to run more than 10 parallel flows. The job server is an 8-CPU machine and accepts only 8 tRunJob flows. Is there any way to increase the number of parallel threads?
Because of this slowness, we decided to use pure big data jobs.
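For readers weighing the parallelism question, the sketch below shows the general pattern of running one flow per file on a bounded worker pool. It is plain Python, not Talend, and says nothing about why the job server stops at 8 tRunJob flows. The names `run_flows` and `process_one` and the default of 16 workers are hypothetical. It assumes the per-file steps are mostly I/O-bound (downloads, copies, uploads), in which case a pool larger than the CPU count can still help; the Parquet conversion itself may be CPU-heavy, so that assumption may not hold for step 2.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_flows(file_names, process_one, max_workers=16):
    """Run process_one(name) for each file on a bounded thread pool.

    Hypothetical helper. Returns (results, failures): results maps each
    file name to the value process_one returned, failures maps each file
    name to the exception it raised. One failing file does not stop the rest.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file up front; the pool caps how many run at once.
        futures = {pool.submit(process_one, name): name for name in file_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                failures[name] = exc
    return results, failures
```

Here `process_one` would wrap the three steps listed above for a single file. Collecting failures separately lets a rerun target only the files that did not complete.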