Best way to count number of commas in a row?

Highlighted
Nine Stars

Best way to count number of commas in a row?

I have a large comma delimited file which has a single column which sometimes contains commas in the data, unfortunately there aren't double quotes around this column.

 

What is the best way to count the number of commas in a row and add double quotes around the column which needs them?

 

For example:

Col1,Col2,Col3,Col4

junk,junk,junk,junk

junk,junk,j,u,n,k,junk

junk,junk,junk,junk

junk,junk,ju,nk,junk


Accepted Solutions
Nine Stars

Re: Best way to count number of commas in a row?

I disagree because I know the following:

1. How many commas are supposed to be in a good row

2. I know which specific column has the commas in the data

 

This is what I am doing:

1. Filter the rows which have too many columns into file1

2. Reverse the data in the rows from file1

3. Put a double quote to close the column which has commas in the data

4. Reverse the data in the corrected rows, put in file2

5. Put a double quote before the start of the column which has commas in the data, put in file3

6. Insert the good rows from the original file into file3

 


All Replies
Eight Stars

Re: Best way to count number of commas in a row?

Hi @talendtester ,

 

You can read the file with tFileInputFullRow and in tMap for the fulle input string use StringUtils.countMatches(input_row,',').

 

Whenever this count+1 is greater than your number of columns, bifurcate those rows to some other files as those will contain extra commas and correct and re-process them.

 

You cannot correct them automatically as comma is a delimiter as well as it is appearing in the file.

 

Thanks and Regards,

Subhadip

Nine Stars

Re: Best way to count number of commas in a row?

I probably need to do one pass which counts from the left side and then I need to do another pass which counts the commas starting from the right side of the row because I have a variable number of commas in the column which doesn't have double quotes around it.

Employee

Re: Best way to count number of commas in a row?

Hi,

 

    You do not have to count two times one from left and another from right. Either way, the number of commas will be more which means the records is not as per standard.

 

     Its always better to agree with source flow during starting of the project to have robust data segregation and schema definition. Else you never know when the issue will hit you and it may even lead to data privacy law suits (For example, due to schema miscalculation, the data from person A went to person B).

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 

 

Nine Stars

Re: Best way to count number of commas in a row?

The source of the data isn't going to change how they format the data.

Yes, I know there will extra commas.

 

If I split the good columns file1 and everything else file2, how do I clean up file2 when there is a variable amount of extra commas?

 

Is there a way to reverse all the data in a row in file2 and then count the number of commas before splitting the column with the extra commas into file3?

 

Employee

Re: Best way to count number of commas in a row?

Hi,

 

   You will have to send the data back to source to get the right data. 

     If you are trying to do the correction, it may (I would rather say will) lead to unwanted data errors.

 

Lets take your example. You got below data with extra columns. How do you know how to merge the data to which column?

 

 junk,junk,j,u,n,k,junk

 junk,junk,j,u,n,k,junk

 junk,junk,j,u,n,k,junk

 junk,junk,j,u,n,k,junk

 

You may merge it as below.

 

 junk,junk,junk,junk

 junk,junk,junk,junk

 junk,junk,junk,junk

 junk,junk,junk,junk

 

But source might have expected the data in below format.

 junk,junkju,nk,junk

 junk,junkj,unk,junk

 junk,junkjun,k,junk

 junk,junk,junk,junk

 

So you never know about a data other than taking some assumptions. So ideal way is to send back to the source or send the data to a Data Steward Team for validation. I hope the answer cleared the fog :-)

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

 

Nine Stars

Re: Best way to count number of commas in a row?

I disagree because I know the following:

1. How many commas are supposed to be in a good row

2. I know which specific column has the commas in the data

 

This is what I am doing:

1. Filter the rows which have too many columns into file1

2. Reverse the data in the rows from file1

3. Put a double quote to close the column which has commas in the data

4. Reverse the data in the corrected rows, put in file2

5. Put a double quote before the start of the column which has commas in the data, put in file3

6. Insert the good rows from the original file into file3

 

Calling Talend Open Studio Users

The first 100 community members completing the Open Studio survey win a $10 gift voucher.

Start the survey

2019 GARNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

What’s New for Talend Summer ’19

Watch the recorded webinar!

Watch Now

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog

How Media Organizations Achieved Success with Data Integration

Learn how media organizations have achieved success with Data Integration

Read

Definitive Guide to Data Quality

Create systems and workflow to manage clean data ingestion and data transformation.

Download