Four Stars

Spark Streaming creating multiple HDFS directories

Hi All,


I have a Spark Streaming job that consumes messages from a MapR stream. I am writing the messages to an HDFS location, and from there I process them with a batch job.

The problem is that every batch (I have set a 2-minute batch interval) in the streaming job creates a separate directory in HDFS, named with a timestamp value. I am not sure how to merge all the files for a particular day and feed the merged file to my end-of-day batch job.
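For reference, the merge step described here can be sketched as below. This is a minimal local-filesystem stand-in (on a real cluster the same concatenation is typically done with `hdfs dfs -getmerge` or an HDFS client library); the directory layout, with epoch-millisecond timestamps as batch directory names, is an assumption about the job's output, not something confirmed in the post.

```python
import glob
import os
from datetime import datetime, timezone

def merge_day(stream_output_dir, day, merged_path):
    """Concatenate all part files from batch directories whose
    timestamp (epoch millis, assumed to be the directory name)
    falls on the given day (YYYY-MM-DD, UTC).
    Assumed layout: <stream_output_dir>/<epoch_millis>/part-*"""
    with open(merged_path, "wb") as out:
        for batch_dir in sorted(glob.glob(os.path.join(stream_output_dir, "*"))):
            name = os.path.basename(batch_dir)
            if not name.isdigit():
                continue  # skip anything that is not a timestamped batch dir
            batch_day = datetime.fromtimestamp(
                int(name) / 1000, tz=timezone.utc
            ).date()
            if batch_day.isoformat() != day:
                continue
            for part in sorted(glob.glob(os.path.join(batch_dir, "part-*"))):
                with open(part, "rb") as f:
                    out.write(f.read())
```

On an actual HDFS deployment, `hdfs dfs -getmerge <dir> <local_file>` performs the same per-directory concatenation in a single command, which the end-of-day job could invoke before processing.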


Can anybody please help here?


Six Stars

Re: Spark Streaming creating multiple HDFS directories

One way I can think of: you can use the tHiveOutput component to append to the same directory. It invokes the DataFrame writer's append mode on a partitioned table, so each micro-batch adds files to an existing partition instead of creating a new top-level directory.
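To make the append-to-one-partitioned-location idea concrete, here is a minimal local-filesystem sketch. The `date=YYYY-MM-DD` directory convention mirrors what Spark/Hive partitioned writes produce; the helper name and file naming are assumptions for illustration, not the component's actual behavior.

```python
import os
from datetime import datetime, timezone

def append_batch(base_dir, batch_time_ms, records):
    """Write one micro-batch into a per-day partition directory
    (date=YYYY-MM-DD), mimicking an append-mode partitioned write.
    Each batch adds a new part file inside the day's partition
    rather than creating a new top-level timestamped directory,
    so the end-of-day job can read base_dir/date=<day> in one pass."""
    day = datetime.fromtimestamp(
        batch_time_ms / 1000, tz=timezone.utc
    ).date().isoformat()
    part_dir = os.path.join(base_dir, f"date={day}")
    os.makedirs(part_dir, exist_ok=True)
    part_file = os.path.join(part_dir, f"part-{batch_time_ms}")
    with open(part_file, "w") as f:
        f.writelines(line + "\n" for line in records)
    return part_file
```

With this layout, the daily batch job no longer needs a merge step at all: it simply points at the single `date=<day>` directory.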