
Read big bz2 in Spark (Big Data Batch)

I'm trying to read a big bz2 file in a Spark Batch job (the file is in HDFS). I noticed that the Spark job
is not splitting the file and is using only one executor to read the whole thing, which takes more than an hour.
The component I'm using to read the file is tFileInputDelimited in a Big Data Batch job.
I analyzed the generated code and found that the minPartitions argument of ctx.hadoopRDD
is not being used.
I'm wondering if there is any way to specify the number of partitions, so that several executors can be
used and the time to read the bz2 file is reduced. A sketch of the kind of thing I mean follows.
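For reference, here is a minimal sketch in plain Spark Java (outside Talend) of what I'm after. The HDFS path and the partition count are placeholders, not values from my job. Since bzip2 is a splittable compression codec in Hadoop, a minPartitions hint should let Spark create several input splits even for a single file, and repartition() is a fallback if the input still arrives as one partition:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadBz2WithPartitions {
    public static void main(String[] args) {
        JavaSparkContext ctx =
            new JavaSparkContext(new SparkConf().setAppName("ReadBz2WithPartitions"));

        // bzip2 is a splittable codec in Hadoop, so a minPartitions hint
        // lets Spark build several input splits for a single .bz2 file.
        int minPartitions = 64; // placeholder; tune to the cluster size
        JavaRDD<String> lines =
            ctx.textFile("hdfs:///data/big_file.bz2", minPartitions); // placeholder path

        // Fallback if the read still yields one partition: repartition()
        // forces a shuffle that spreads the data across executors.
        JavaRDD<String> spread = lines.repartition(minPartitions);

        System.out.println("partitions: " + spread.getNumPartitions());
        ctx.stop();
    }
}

Note that repartition() costs a full shuffle, so getting the split to happen at read time via minPartitions would be preferable when the codec allows it.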
Thanks.
1 REPLY
Moderator

Re: Read big bz2 in Spark (Big Data Batch)

Hi,
Could you please tell us which build version you are using? What does your Spark job look like? Could you post a screenshot of your workflow to the forum?
Best regards
Sabrina