How can I find keywords in each row on a excel column?

Highlighted
Six Stars

How can I find keywords in each row on a excel column?

Hi,

 

I have an excel source with 10 columns, I need to read the column user_comment and try to find one or more keywords from a list, and input in a new column called comment_keywords. The keywords list can be any type. I have both excel and csv with comma separator key words. How would be the best approach to do that? 

 

Thanks in advance!

Highlighted
Employee

Re: How can I find keywords in each row on a excel column?

Hi,

 

   Could you please share some sample records for your scenario to get better clarity about your use case?

 

    Please add the sample records as text or csv data instead of image so that we can copy and try to do some analysis.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

Highlighted
Community Manager

Re: How can I find keywords in each row on a excel column?

OK, this was a fun problem to look at. I should point out that the method I am demonstrating is not the only way of achieving this, but it works. I am demonstrating without Excel or CSV files (just to make it easier for me to build), but you can add these in. The job that I have created looks like below...

 

Screenshot 2019-12-19 at 14.55.59.png

Your Excel is represented by tFixedFlowInput_1 and your list of Strings is represented by tFixedFlowInput_2.

 

What I am doing is using a tMap with no joins between the Main input and the Lookup. This means that every Main row will be tested against every Lookup row.  I then use some logic in tMap variables to identify which Main row is being processed and when the Main row changes, to wipe the list of matched Strings I have found. I also use a Routine (essentially a Java method) to assess whether each word appears in the String. The tMap looks like below....

 

Screenshot 2019-12-19 at 14.56.26.png

The order of the tMap variables is very important. The variables are processed from top to bottom and they can keep their values between rows. Each of the variables is described below....

Expression  Type Nullable Variable Description
Var.ThisText  String Yes LastText This is set to the value of the ThisText variable. For the first row, it will be null as it will be set before ThisText is set. This is used to help identify when we have finished checking for all of the Lookup Words.
row1.Text  String No ThisText This is set by your incoming Text
Var.LastText==null || Var.LastText.compareToIgnoreCase(Var.ThisText)!=0   Boolean No NewText This is a check to see whether we are now checking a new piece of text. If so, it tells us how to deal with the next variable.
Var.NewText ? routines.TestRoutine.findWordsInString(row1.Text, row2.Words,",") : Var.WordsList + routines.TestRoutine.findWordsInString(row1.Text, row2.Words,",") String Yes WordsList This is used to create the list of words found. Pay attention to the Var.NextText check. If it is a new Text, it will overwrite the WordList variable. Otherwise it just adds to it. This uses a Routine that I will show below.

 

The routine that I have written for this covers several things. When I was initially testing this, I found that words such as "at" could be found in words like "what", "that", etc. So in order to make sure this doesn't happen, I wrote this to strip out all punctuation and text the words of the sentence one by one. The whole word. The routine method can be seen below. I added it to an existing "TestRoutine" Routine. You will need to create one and add the following code....

 

	public static String findWordsInString(String data, String word, String separator) {
		String[] wordsInData = data.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
		String returnVal = "";
		
		for(int i=0; i<wordsInData.length; i++) {
			if(wordsInData[i].compareToIgnoreCase(word.toLowerCase())==0) {
				returnVal = word+separator;
				break;
			}
		}
		
		return returnVal;
	}

Once you have this, the last step is to tidy up the results. If you have 4 Main rows and 10 Lookup rows, this job will return 40 rows. To prevent this I have added a tAggregateRow component which is grouped on "Text" and the function on WordsList is set to Last. This essentially returns the last row for each group of Text.

 

Screenshot 2019-12-19 at 14.56.39.png

 

Let me know how you get on.

2019 GARTNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

Best Practices for Using Context Variables with Talend – Part 2

Part 2 of a series on Context Variables

Blog

Best Practices for Using Context Variables with Talend – Part 1

Learn how to do cool things with Context Variables

Blog

Migrate Data from one Database to another with one Job using the Dynamic Schema

Find out how to migrate from one database to another using the Dynamic schema

Blog