Six Stars

Customer Data Cleansing including data pre-processing/standardization

Apologies if this sounds like a stupid question. I have been searching relentlessly for a high-level answer. I am working on customer personal identifiers/information. Over a period of years, customer information such as name, address, email and phone has not been standardized or cleansed.

 

I am looking to first:

1) Standardize the data, i.e. remove any whitespace characters, ensure email addresses are valid, and so on.

2) Thereafter, de-duplicate the data based on a weighted matching algorithm, for example:

 

  • Same Address or Payment Card: 0.3
  • Last Name SoundEx: 0.25
  • First Name SoundEx: 0.1
  • Title: 0.05
  • Email, Telephone, or Visitor Id : 0.3

The weights above add up to 1. I would then define upper and lower thresholds on the total score to decide what counts as a match.
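For illustration only, the weighted score described above could be sketched in Python as follows. This is not how Talend implements it; the record field names (`address`, `card`, `email`, `phone`, `visitor_id`, and so on) and the helper names are assumptions made for the example, and `soundex` is a plain implementation of standard American Soundex.

```python
# Weights taken from the list above; the key names are hypothetical.
WEIGHTS = {
    "address_or_card": 0.30,
    "last_name": 0.25,
    "first_name": 0.10,
    "title": 0.05,
    "contact": 0.30,  # email, telephone or visitor id
}

_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code, or "" for empty input."""
    letters = [c for c in (name or "").upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def _same(a, b, field):
    """True when both records have the same non-empty value for field."""
    return bool(a.get(field)) and a.get(field) == b.get(field)


def _sounds_alike(a, b, field):
    code = soundex(a.get(field))
    return bool(code) and code == soundex(b.get(field))


def match_score(a, b):
    """Sum the weights of every rule on which the two records agree."""
    score = 0.0
    if _same(a, b, "address") or _same(a, b, "card"):
        score += WEIGHTS["address_or_card"]
    if _sounds_alike(a, b, "last_name"):
        score += WEIGHTS["last_name"]
    if _sounds_alike(a, b, "first_name"):
        score += WEIGHTS["first_name"]
    if _same(a, b, "title"):
        score += WEIGHTS["title"]
    if any(_same(a, b, f) for f in ("email", "phone", "visitor_id")):
        score += WEIGHTS["contact"]
    return round(score, 2)


# Made-up sample records, already standardized.
rec_a = {"first_name": "Jon", "last_name": "Smith",
         "address": "1 High St", "email": "js@example.com"}
rec_b = {"first_name": "John", "last_name": "Smyth",
         "address": "1 High St", "email": "js@example.com"}
print(match_score(rec_a, rec_b))  # 0.95
```

The score is then compared against the upper and lower thresholds, which are left open here because no values were given.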

 

So I need to compare every record against all other records, row by row, and come up with something like this:

Customer Key 1   Customer Key 2   Match Type
A                B                Same Customer
A                C                Same Customer
D                                 NoMatch

 

 

In the above case, for example, A and B are two separate records, but they have the same payment card, the same address and the same email, so they are the same customer.

Records A and C also match, as they have the same first name, last name and address. After that, within each matched group, I will create a golden record.
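As a rough sketch of the pairwise step described above (in Python, not a Talend component), each pair score can be classified against two thresholds and the confirmed matches merged into groups with a small union-find. The threshold values, the "Possible Match" label for the middle band, and the pair scores in the example are all placeholders invented for illustration.

```python
from itertools import combinations

UPPER = 0.6  # hypothetical: at or above this, treat the pair as the same customer
LOWER = 0.3  # hypothetical: below this, treat the pair as no match


def classify(score):
    """Map a pair score to a match type using the two thresholds."""
    if score >= UPPER:
        return "Same Customer"
    if score >= LOWER:
        return "Possible Match"  # middle band, e.g. for manual review
    return "NoMatch"


def group_records(keys, scores):
    """Group record keys whose pair score classifies as 'Same Customer'.

    scores maps (key1, key2) tuples to a score; missing pairs count as 0.
    """
    parent = {k: k for k in keys}

    def find(k):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in combinations(keys, 2):
        if classify(scores.get((a, b), 0.0)) == "Same Customer":
            parent[find(b)] = find(a)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())


# Illustrative scores only: A-B and A-C are confident matches, D matches nothing.
keys = ["A", "B", "C", "D"]
scores = {("A", "B"): 0.6, ("A", "C"): 0.65, ("B", "C"): 0.3}
groups = group_records(keys, scores)
print(groups)  # [['A', 'B', 'C'], ['D']]
```

Each resulting group would then be merged into one golden record using whatever survivorship rule is chosen. Note that comparing every record with every other grows quadratically with the number of records, so large data sets normally need a blocking key to limit the pairs compared.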

 

I have looked at and tried Talend for Data Quality, but it seems to do data profiling only, not actual transformation. It gives you statistics on how good or bad your data is.

 

I have also seen Talend for Data Preparation. Here I can load a file, apply my basic preparations (remove whitespace, etc.) and use this preparation in a job.
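To make the kind of basic preparation meant here concrete, the following is a minimal Python sketch, independent of Talend. The field names and function names are assumptions for the example, and the email pattern is a deliberately loose shape check, not full address validation.

```python
import re

# Loose shape check only: something@something.something, with no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_text(value):
    """Trim the value and collapse internal runs of whitespace to one space."""
    return " ".join((value or "").split())


def clean_email(value):
    """Strip all whitespace, lower-case, and return None if the shape is wrong."""
    email = "".join((value or "").split()).lower()
    return email if EMAIL_RE.match(email) else None


def clean_phone(value):
    """Keep digits only; return None when nothing is left."""
    digits = re.sub(r"\D", "", value or "")
    return digits or None


def standardize(record):
    """Return a cleaned copy of a customer record (hypothetical field names)."""
    return {
        "first_name": clean_text(record.get("first_name")),
        "last_name": clean_text(record.get("last_name")),
        "email": clean_email(record.get("email")),
        "phone": clean_phone(record.get("phone")),
    }


raw = {"first_name": "  john ", "last_name": "Smith  ",
       "email": " John.Smith@Example.COM ", "phone": "(020) 7946 0000"}
print(standardize(raw))
```

Running the same cleaning on every record before matching means that differences in spacing or letter case do not hide genuine duplicates.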

 

My fundamental question is: where can I find a component in which I can define my weights and match thresholds, and then decide which records are my customer golden records?

 

I seem to have got lost. 

I am looking to standardise, cleanse and merge the data into a golden customer record.

 

Any pointers will be greatly appreciated.

 

Please refer to this video:

https://youtu.be/sozxWzAXLBM?list=PLZrVWXgbuqT5OEM_QwwgopJHlUHAZzp2i&t=1477

 

Here, in this step of the Talend walkthrough, each matching key is given a weight.

 

 

Thanks

 

  • Data Quality
4 REPLIES
Employee

Re: Customer Data Cleansing including data pre-processing/standardization

Hi,

 

What you see in the video are the Data Quality components that can be leveraged in a Talend job (namely tMatchGroup here), which address your deduplication use case. These components are only available in the commercial version of Talend Data Quality, not in Talend Open Studio for Data Quality. See the feature matrix at https://www.talend.com/products/data-quality for more details.

 

Let me know if you need additional details.

 

Regards,

 

Gwendal

Six Stars

Re: Customer Data Cleansing including data pre-processing/standardization

Thanks for your reply.

 

Now I understand; the component label has been renamed in the demo. We have a licensed version of Talend Open Studio for Big Data. I can see that the palette does have all the required Data Quality components; it would be great if you could confirm this.

 

Many thanks for your quick reply.

 

 

Community Manager

Re: Customer Data Cleansing including data pre-processing/standardization

Hi Ashish

Talend Open Studio for Data Quality is the free, open-source studio; it does not contain the cleansing components you're looking for, such as tMatchGroup.
If you can find tMatchGroup in your palette, then you're on a subscription-based product.
HTH
Elisa
Six Stars

Re: Customer Data Cleansing including data pre-processing/standardization

Thanks very much for all your responses.