Remove header dynamically for variable rows header

Four Stars

Remove header dynamically for variable rows header

Hi,

I have a use case where a file comes to my system daily has multiple headers.

The no. of header counts are not same for each file. They can be 1,5,10, etc but there is an identification that says where the header ends.

 

E.g Inside the file, here in the below example the record begins with "Cards_14Q1_1" then headers, end of header(EOH), actual data header followed by actual data elements

 

Cards_14Q1_1
20180331
Month-End
45678098
Feed Name| Load Status|Date
TN-MT-34671| WAITING|
TN-MT-34871| WAITING|
NR-PP-14678|WAITING|
TN-MT-34872| WAITING|

EOH

AccId|Branch Address|Legal Entity|Cross Border

XB124|920|100|N

67894|920|110|N

B1000|857|100|Y

 

Expected output after removing the headers:

XB124|920|100|N

67894|920|110|N

B1000|857|100|Y

 

I don't want to hard code the header record "AccId|Branch Address|Legal Entity|Cross Border" to find the start of data records, as the same framework can be used for different files having same format with different header record.

 

The solution I am thinking is as follows.

 

1. Assign Record Number to each Record.

2. Find out the Record Number with "EOH"  Say EOHRECNUM (Any other record having EOH in between field will not qualify, eg GEOHYDRO)

3. Write an expression to filter record number less than equals to EOHRECNUM + 1

 

Thanks for your help.

 

Best,

Dhiren


Accepted Solutions
Sixteen Stars

Re: Remove header dynamically for variable rows header

Your idea would work. However, I think you *may* have missed an important deviation from the rule you have indicated. You say that there can be 1, 5, 10, etc header rows and that an identifier ("EOH") tells you when the header has ended. You then have your data column header after that. Great, but that doesn't work for 1 header row, given the example you have included. Your example shows 10 header rows and EOH is the 10th row. If EOH has its own row, does that mean you could have data like below for the 1 header row example.....

 

EOH

AccId|Branch Address|Legal Entity|Cross Border

XB124|920|100|N

67894|920|110|N

B1000|857|100|Y


....if so, this will work. Otherwise you may need to tidy up the rules a bit. However, I will go with the assumption that the rule stands and my example above is legitimate.

Here is a list of steps I would take....

 

1) Set your fileinput schema up to be of 4 columns (all would have to be String). Also configure your separators, etc. If your row does not have a separator, Talend will assume it only has one column. Be careful if header rows contain more than 4 separators for whatever reason. If that is the case, you will need to set up that many columns.

2) Use a tMap with a tMap boolean variable (for example, "header_found"). With every row that arrives. When the first column holds "EOH", set the tMap boolean variable to true. Otherwise set it to itself. This will have the effect of a switch as soon as the EOH record is found. 

3) In a tMap variable below the "header_found" variable, create an integer variable called "count". Use an expression similar to below to increment a count once "header_found" is true.....

 

Var.header_found==true ? Var.count== null ? 1 : Var.count + 1 :  0

That will apply this logic "If the header_found is true and count is null, set count to 1. If the header_found is true and count is not null, add 1 to count. If header_found is not true, set count to 0". Since "header_found" will remain true once set, this will create an incrementing count from the EOH record.

4) In your tMap output, simply set a filter based on the "count" value. The filter should read....

Var.count > 2

This will not output the EOH record or your column header records, but will output everything afterwards.

 


All Replies
Sixteen Stars

Re: Remove header dynamically for variable rows header

Your idea would work. However, I think you *may* have missed an important deviation from the rule you have indicated. You say that there can be 1, 5, 10, etc header rows and that an identifier ("EOH") tells you when the header has ended. You then have your data column header after that. Great, but that doesn't work for 1 header row, given the example you have included. Your example shows 10 header rows and EOH is the 10th row. If EOH has its own row, does that mean you could have data like below for the 1 header row example.....

 

EOH

AccId|Branch Address|Legal Entity|Cross Border

XB124|920|100|N

67894|920|110|N

B1000|857|100|Y


....if so, this will work. Otherwise you may need to tidy up the rules a bit. However, I will go with the assumption that the rule stands and my example above is legitimate.

Here is a list of steps I would take....

 

1) Set your fileinput schema up to be of 4 columns (all would have to be String). Also configure your separators, etc. If your row does not have a separator, Talend will assume it only has one column. Be careful if header rows contain more than 4 separators for whatever reason. If that is the case, you will need to set up that many columns.

2) Use a tMap with a tMap boolean variable (for example, "header_found"). With every row that arrives. When the first column holds "EOH", set the tMap boolean variable to true. Otherwise set it to itself. This will have the effect of a switch as soon as the EOH record is found. 

3) In a tMap variable below the "header_found" variable, create an integer variable called "count". Use an expression similar to below to increment a count once "header_found" is true.....

 

Var.header_found==true ? Var.count== null ? 1 : Var.count + 1 :  0

That will apply this logic "If the header_found is true and count is null, set count to 1. If the header_found is true and count is not null, add 1 to count. If header_found is not true, set count to 0". Since "header_found" will remain true once set, this will create an incrementing count from the EOH record.

4) In your tMap output, simply set a filter based on the "count" value. The filter should read....

Var.count > 2

This will not output the EOH record or your column header records, but will output everything afterwards.