
Avoid multiple header rows?

I have a couple of CSV files that I load into Data Prep all at once (I only specify a directory in "Add Dataset", not individual files). So far, so good.

 

All files have the same structure, and the first line of each file is the header.

 

 

Is there a way to globally set the first row as header for all files? I know there is this "Row" -> "Make as header..." feature, but what happens in my case is:

 

file1.csv:

Firstname;Lastname;Age

Felix;Kjellberg;23

Julian;Ilett;43

 

file2.csv:

Firstname;Lastname;Age

Ben;Heck;58

Dave;Jones;48

 

The result in Data Prep is:

Firstname|Lastname|Age

Ben Heck 58

Dave Jones 48

Firstname Lastname Age

Felix Kjellberg 23

Julian Ilett 43

 

So even if I set the first "Firstname Lastname Age" line as the header, the second one stays in the data as a regular row. Is there a way to avoid this?

 

Employee

Re: Avoid multiple header rows?

Hi,

 

Out of curiosity, can you confirm the following?

  • You are using Data Prep 2.0.
  • The CSV files are on HDFS.

 

To answer your question: there is no dedicated dataset parameter or function to remove subsequent occurrences of the header, but you can do it in a single preparation step. Set a filter on the first column with the column header as the filter value (so filter on "Firstname" in your example above), then use the function "delete filtered rows".
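
If it helps to see the logic spelled out, here is a minimal sketch of the same idea outside Data Prep, in plain Python with the standard csv module. It is not Data Prep's API: the function name drop_repeated_headers is hypothetical, and the sample data is simply the two files from the question in the order Data Prep showed them.

```python
import csv
import io


def drop_repeated_headers(rows):
    """Use the first row as the header and drop every later row whose
    first cell equals the header's first cell (hypothetical helper)."""
    rows = list(rows)
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    # Comparing one-element slices avoids an IndexError on empty rows.
    kept = [row for row in body if row[:1] != header[:1]]
    return header, kept


# The two files from the question, concatenated as Data Prep displayed them.
combined = (
    "Firstname;Lastname;Age\n"
    "Ben;Heck;58\n"
    "Dave;Jones;48\n"
    "Firstname;Lastname;Age\n"
    "Felix;Kjellberg;23\n"
    "Julian;Ilett;43\n"
)

header, data = drop_repeated_headers(
    csv.reader(io.StringIO(combined), delimiter=";")
)
print(header)
print(data)
```

Note that, like the filter in Data Prep, this matches on the first column only, so a genuine record whose first name is literally "Firstname" would be removed as well.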

 

Regards,

 

Gwendal