I have a spark streaming job which is consuming messages from MapR stream. I am trying to put the messages into HDFS location and from there I am trying to process it using a batch process.
The problem is every batch (I have set a 2 minutes batch) in the streaming job is creating a separate directory in HDFS with a timestamp value. I am not sure how to merge all the files for a particular day and feed the merged file to my end of day batch job to be processed.
Can anybody please help here?
You can use tHiveoutput component to append to the same directory, which invokes df.append method in partitioned manner.One way i could think of.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Watch the recorded webinar!
Learn how to make your data more available, reduce costs and cut your build time
Read about OTTO's experiences with Big Data and Personalized Experiences
Take a look at this video about Talend Integration with Databricks