Spark Streaming creating multiple HDFS directories


Hi All,

 

I have a Spark Streaming job that consumes messages from a MapR stream. I am trying to write the messages to an HDFS location and then process them from there with a batch job.

The problem is that every batch in the streaming job (I have set a 2-minute batch interval) creates a separate directory in HDFS, named with a timestamp value. I am not sure how to merge all the files for a particular day and feed the merged file to my end-of-day batch job for processing.

 

Can anybody please help here?

 


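Re: Spark Streaming creating multiple HDFS directories

One approach I can think of: use the tHiveOutput component to append to the same directory. It appends the DataFrame output to the existing location in a partitioned manner, so each batch adds files to one directory instead of creating a new one.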
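For anyone writing the job by hand rather than through a Talend component, the same append-and-partition idea might look roughly like the PySpark sketch below. This is only an illustration under assumptions, not what tHiveOutput generates: the function names `day_partition` and `append_batch`, the `dt` partition column, the Parquet output format, and the use of UTC for the day boundary are all hypothetical choices made for the example.

```python
from datetime import datetime, timezone


def day_partition(batch_time_ms: int) -> str:
    """Map a batch timestamp (epoch milliseconds) to a 'YYYY-MM-DD' value in UTC."""
    seconds = batch_time_ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def append_batch(df, batch_time_ms: int, base_path: str) -> None:
    """Append one micro-batch DataFrame under base_path/dt=YYYY-MM-DD/.

    `df` is assumed to be a pyspark.sql.DataFrame holding the messages
    of a single streaming batch.
    """
    # Imported lazily so day_partition() stays usable without Spark installed.
    from pyspark.sql.functions import lit

    (
        df.withColumn("dt", lit(day_partition(batch_time_ms)))
        .write.mode("append")   # add files to the existing location, do not overwrite
        .partitionBy("dt")      # one sub-directory per calendar day
        .parquet(base_path)     # Parquet is an assumption; use the format your batch job expects
    )
```

With a layout like this, every 2-minute batch of a given day lands in the same `dt=YYYY-MM-DD` sub-directory, so the end-of-day job can read that single directory rather than merging timestamped directories first. It still produces many small files per day, which the batch job may want to compact when it reads them.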

 
