I have a Spark Streaming job that consumes messages from a MapR stream. I am writing the messages to an HDFS location and then processing them from there with a batch job.
The problem is that every batch in the streaming job (I have set a 2-minute batch interval) creates a separate directory in HDFS named with a timestamp value. I am not sure how to merge all the files for a particular day and feed the merged output to my end-of-day batch job.
Can anybody please help here?
One approach I can think of: use the tHiveOutput component to append to the same directory in a partitioned manner; under the hood it invokes the DataFrame writer's append mode, so each micro-batch adds files to an existing location instead of creating a new timestamped directory.
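To make the append land in one folder per day, you can derive the output path from the batch timestamp rather than letting each batch create its own timestamped directory. A minimal sketch of that idea (the function name `daily_output_dir` and the base path are hypothetical; in the real job you would pass the resulting path to the DataFrame writer with append mode):

```python
from datetime import datetime, timezone

def daily_output_dir(base_dir: str, batch_time_ms: int) -> str:
    """Map a streaming batch timestamp (epoch milliseconds) to a
    per-day HDFS directory, so every 2-minute batch from the same
    day appends to one location instead of a new timestamped folder."""
    day = datetime.fromtimestamp(batch_time_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/dt={day.strftime('%Y-%m-%d')}"

# Two batches 2 minutes apart on the same day resolve to the same directory:
first_batch = daily_output_dir("/data/stream_out", 1_700_000_000_000)
second_batch = daily_output_dir("/data/stream_out", 1_700_000_120_000)
```

The end-of-day batch job can then read the single `dt=<date>` directory for the day it is processing, with no separate merge step needed.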