Seven Stars

LIKE operator on same dataset

Hi all,

 

Just for an example, consider the following data

[Image: like.PNG, sample data with NAME, CITY, STATE and ZIP columns]

 

I want to apply a LIKE operator on the NAME and CITY columns and an equality operator on the STATE and ZIP columns, so that I get the following output:

 

Monroe Township, NJ  |  Monroe Township  |  NJ  |  08831   (i.e. the first occurrence only)

 

I've tried the tFilterRow component but don't know how to apply it to this requirement. Which component, steps, or function should I use to get this result?

 

Thanks !

 

3 REPLIES
Ten Stars

Re: LIKE operator on same dataset

This sounds like a job for the tFuzzyMatch component, but only IF you cannot define some logic that would make this A LOT quicker and more accurate.

 

For example, you gave the following:

 

Monroe Township, NJ  |  Monroe Township  |  NJ  |

 

Now, can you say that everything before the comma should match the second column? If so, just use String manipulation. It makes sense to consider applying some pre-processing rules before going down the tFuzzyMatch route.
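As a rough illustration of the "everything before the comma" rule, here is a minimal Java sketch. The class and method names (CommaPrefixMatch, nameMatchesCity) are made up for this example and are not part of Talend; the logic could live in a routine and be called from a tMap expression, assuming the rule really holds for your data.

```java
// Hypothetical helper, not a Talend API.
final class CommaPrefixMatch {

    // True if the text before the first comma in 'name' equals 'city',
    // ignoring case and surrounding whitespace.
    // If 'name' has no comma, the whole value is compared.
    static boolean nameMatchesCity(String name, String city) {
        if (name == null || city == null) {
            return false;
        }
        int comma = name.indexOf(',');
        String prefix = comma >= 0 ? name.substring(0, comma) : name;
        return prefix.trim().equalsIgnoreCase(city.trim());
    }
}
```

With the row quoted above, nameMatchesCity("Monroe Township, NJ", "Monroe Township") returns true.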

Rilhia Solutions
Seven Stars

Re: LIKE operator on same dataset

No, it's not a case of everything before a comma.
I can use tUniqueRow, but it works only on an equality condition, not on a LIKE kind of condition.
The same goes for tAggregateRow.

I simply want to group the table based on a LIKE condition instead of an equality condition.

Can we achieve this using tMap? I'm going to give it a try.


Ten Stars

Re: LIKE operator on same dataset

The problem you have is that you are assuming the "world knowledge" of a human. Computers can't work like that (here is where my AI degree comes into play :-) ).

 

Consider the numbers 1 and 11. They are not "like" each other to us or a computer....they are very different. Now consider 1111111111111111111 and 11111111111111111111. To us (on first inspection) they look "like" each other...until we actually count the 1s. To a computer, they are different. They are massively different. Now if we change numbers to text, our brains automatically spot patterns. So the following text is seen as "the same".....

Hello my name is Richard

Hello my name si Richard

 

We autocorrect (which is both good and bad), a computer won't. To a computer that is just a series of bits without a context. That is why "Like" is such a difficult task.

 

It has been solved by many mechanisms, but they are not always very efficient or easy to implement in Data Integration. What I was suggesting was that you look for rules to apply. For example, if you make the Strings uppercase, remove leading and trailing spaces, etc. Once you have done that, then you *might* be able to use Java String functionality like "indexOf" (https://docs.oracle.com/javase/7/docs/api/java/lang/String.html).
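To make the rule-based idea concrete, here is a minimal Java sketch, not a definitive implementation. It assumes "LIKE" can be approximated as "one normalized value contains the other", that each row is a String array in the order NAME, CITY, STATE, ZIP, and that STATE and ZIP are never null. The class and method names (LikeDedup, normalize, like, firstOccurrences) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Talend API.
final class LikeDedup {

    // Pre-processing rule: trim and uppercase; null becomes "".
    static String normalize(String s) {
        return s == null ? "" : s.trim().toUpperCase(Locale.ROOT);
    }

    // Assumed meaning of "LIKE": after normalization,
    // one value contains the other (checked with indexOf).
    static boolean like(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        if (x.isEmpty() || y.isEmpty()) {
            return x.equals(y); // empty only matches empty
        }
        return x.indexOf(y) >= 0 || y.indexOf(x) >= 0;
    }

    // Keeps the first row of every group where NAME and CITY are "like"
    // and STATE and ZIP are equal. Row layout: {NAME, CITY, STATE, ZIP}.
    static List<String[]> firstOccurrences(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            boolean duplicate = false;
            for (String[] k : kept) {
                if (k[2].equals(row[2]) && k[3].equals(row[3])
                        && like(k[0], row[0]) && like(k[1], row[1])) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Each row is compared against every row already kept, so the cost grows quadratically with the number of distinct groups. That is fine for small lookups but worth keeping in mind for large datasets.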

 

However, if you cannot apply these rules you may have to use Fuzzy Matching. This is a clever mechanism, but requires a lot of work to get it perfect...if you can get it "perfect" at all.
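For a feel of what fuzzy matching measures, here is a minimal Java sketch of the Levenshtein edit distance, one common measure used in fuzzy matching. It is shown only as an illustration and is not a description of what tFuzzyMatch does internally; the class name EditDistance is made up for the example.

```java
// Hypothetical helper, not a Talend API.
final class EditDistance {

    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn 'a' into 'b'.
    // Uses two rolling rows instead of a full matrix.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of 'a'
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of 'b'
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For the two sentences above, levenshtein("Hello my name is Richard", "Hello my name si Richard") is 2, because the swapped letters count as two substitutions. The hard part is choosing the threshold: a distance of 2 is a typo in a long sentence but a completely different value in a ZIP code.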

Rilhia Solutions