Four Stars

Identify duplicates and merge records

Hi,  I am fairly new to talend and first time using this forum, so apologies in advance.

I use Talend open studio Big Data (free version) and as part of my job I have to carry certain integrations.

I have searched and tried several solutions mentioned by other users, but unfortunately it does not serve the purpose of my need

Currently I am working on a Project where I need to merge records where if duplication occurs, however, if the master and child record contain same data then no merge is required, this is based on the personal data.

We are using maria DB and our database contain few  hundred tables and from those table, we are only interested in few tables that will help identify duplicated record. The concept of this to eliminate duplication.

 

So, we have a person table(personID, firstname, suranme), which links to the person address table (personaddid, personID, town, city, county etc), person email table (personemailID, personID, email, emailtype(home, private, work) and person telephone number (personTelephoneID, personID, telephone, telephonetype(home, private, work). 

 

Based on these tables, I first need to identify if the record is duplicated, if so, only merge missing data to master record.

For example,

Person Table

 

 

PersonID

Firstname

Surname

1

Jo

Blog

2

Jo

Blog

 

 

 

Person Telephone

 

 

PersonTelephoneID

PersonID

Telephone

1

1

(+44)01234567890

2

1

01234567890

1

2

(+44) 01234567890

2

2

(+44)01234567891

3

2

01234567890

 

These two candidate are classed as duplicate because candidate 2 contains all data same as candidate 1 except with one unique number
So, I want to be able to identify these two candidates are duplicates and then based on I want to merge the candidate 2 data into the candidate 1, but that also needs to take in consideration the regular expressions, for example, candidate 1=(+44)01234567890 is same as candidate 2= 01234567890, therefore do not insert this number into candidate 1 (candidate 2 telephone number will read "merged" or even additional rows deleted)

 

Outcome candidate 1

 

PersonTelephoneID

PersonID

Telephone

1

1

(+44)01234567890

New id will be created when inserted in table

2

(+44)01234567891

 

 

 

Outcome candidate 2

 

 

PersonTelephoneID

PersonID

Telephone

1

2

"merged"

5 REPLIES
Moderator

Re: Identify duplicates and merge records

Hello,

If we understand your requirement very well, are you trying to insert in your database only the new data?  If so, you can compare your tables by using tMap and set the "Catch lookup inner join reject" as true to get the changed data.

Feel free to correct us if we misunderstand your requirement.

Best regards

Sabrina

--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
Four Stars

Re: Identify duplicates and merge records

@xdshi Thank you for taking your time to reply.

Yes, you are correct to a certain degree. I want to only insert new data in the person telephone table if the duplicated candidate contains unique values. So, I will first need to identify the duplicate records based on telephone numbers, then from this duplication list I will need to establish which record from the duplication will be master and which will be child- depending on various scenarios, but as an example, let say latest activity against the record is classed as master. If the record contains same value in this case telephone number, then skip insert, otherwise insert new data in Master record (person telephone table), then remove child data or perhaps overwrite one of the values to "merged" in order to carry out deletion at later stage.

In order to identify duplicates I would have written query in the talend, however, I do not think this will help, as this will only pull through records where the numbers are same, I want duplication but also to see additional numbers against the record.

select
if(act.actDate >b.duplicateDate,'Master',
if(b.duplicateDate > act.actDate,'Master','Child')) Classification
,if(act.actDate >b.duplicateDate,a.PersonID,
if(b.duplicateDate > act.actDate,b.duplicateID,'Unsure')) ClassificationID
,concat(p.firstname,' ',p.surname) name
,a.PersonID
,a.CleanedTelephoneNumber
,b.duplicateName

,b.duplicateID

,b.duplicateTele
, act.actDate
,b.duplicateDate

from person_telephone a
join person p on p.PersonID = a.PersonID
join candidate c on c.Candidate_ID = p.PersonID
join
(
select
t.PersonID duplicateID,CleanedTelephoneNumber duplicateTele,concat(p.firstname,' ',p.surname) duplicateName,p.firstname,p.surname, act.actDate duplicateDate
from person_telephone t
join person p on p.PersonID = t.PersonID
join candidate c on c.Candidate_ID = p.PersonID
left join
(
select
j.JournalItemId
,j.ItemCreationDate actDate
,mx.can
from journal_item j
join
(
select max(je.journalitemid) ID
,je.actionid can
from journal_entity je
join journal_item j on j.JournalItemId = je.JournalItemId
where EntityType = 1
#and je.ActionId = 87009
group by je.ActionId
) mx on mx.id = j.JournalItemId
) act on act.can = c.candidate_id

) b
on b.duplicateTele = a.TelephoneNumber and b.duplicateID <> a.PersonID and b.surname = p.Surname and b.firstname = p.FirstName

left join
(
select
j.JournalItemId
,j.ItemCreationDate actDate
,mx.can
from journal_item j
join
(
select max(je.journalitemid) ID
,je.actionid can
from journal_entity je
join journal_item j on j.JournalItemId = je.JournalItemId
where EntityType = 1
#and je.ActionId = 87009
group by je.ActionId
) mx on mx.id = j.JournalItemId
) act on act.can = c.candidate_id


limit 10
;

Now from that I want only the unique phone number from child to go into master record in the person telephone table.

 

**Note: all our tables have auto incremented PK therefore the two duplicate records  will contain different ID**

I hope this makes sense.

Five Stars

Re: Identify duplicates and merge records

why dont you do a second query as a lookup to tmap 

SELECT col1, col2, col3, col4

FROM table

GROUP BY col1, col2, col3, col4

HAVING COUNT(*) > 1

and in tMap have the condition row1.phoneumber != lookup.phonenumber 

Five Stars

Re: Identify duplicates and merge records

the above will wipe out all the duplicates but it gives you a path to follow or a way to cleanse your database

Four Stars

Re: Identify duplicates and merge records

@jcruie Thank you for your reply.

@jcruie and @xdshi  are both saying once I have written a script to identify the duplicate, I write the same query again and then use the tmap to do inner join , then make outcome action true? I am not sure i understand how that can be achieved as i need to insert all numbers that are unique from he duplicate record.

there is a merge query written in mysql,however this merges all the data rather than ignoring duplicate and inserting only the unique records. 

//please not this code is not mine, it was written by ex employee

BEGIN

DECLARE iscont BIGINT default null;
DEClARE candloc BIGINT default null;
DECLARE candjob BIGINT default null;
DECLARE emptycursor BIGINT default 0;

DECLARE contcheck cursor for (select count(clientcontactid) from client_contact_relationship ccr
where ccr.clientcontactid=pchild and relationshipstatus = 8653);

DECLARE curloc cursor for (select candidate_preferred_location.location
from candidate_preferred_location
where candidate_preferred_location.CandidateiD=pchild
and candidate_preferred_location.location not in
(select location from candidate_preferred_location where candidate_preferred_location.candidateid=pmaster)
);

DECLARE jobcat cursor for (select candidate_job_categories.jobcategory
from candidate_job_categories
where candidate_job_categories.candidateid=pchild
and candidate_job_categories.jobcategory not in
(select jobcategory from candidate_job_categories where candidate_job_categories.candidateid=pmaster));

DECLARE EXIT HANDLER FOR SQLEXCEPTION ROLLBACK;
DECLARE EXIT HANDLER FOR SQLWARNING ROLLBACK;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET emptycursor = 1;


open curloc;

loc_loop: LOOP
FETCH curloc into candloc;
IF emptycursor = 1 THEN
LEAVE loc_loop;
else
update candidate_preferred_location set candidateid=pmaster where candidateid=pchild and candidate_preferred_location.location=candloc;

END IF;
END LOOP;

close curloc;

open jobcat;

job_loop:LOOP
set emptycursor = 0;
FETCH jobcat into candjob;
IF emptycursor = 1 THEN
LEAVE job_loop;
else
update candidate_job_categories set candidateid=pmaster where candidateid=pchild and jobcategory=candjob;
END IF;
END LOOP;

close jobcat;

START TRANSACTION;

open contcheck;
fetch contcheck into iscont;
if iscont > 0 then

select 'This Candidate is a live Contact, cannot be merged.';

else

update journal_entity set actionid=pmaster where actionid=pchild;
update arc_journal_entity set actionid=pmaster where actionid=pchild;
update requirement_candidate_application set Candidateid=pmaster where candidateid=pchild;
update candidate_assignment set candidateid=pmaster where candidateid=pchild;
update candidate_industries set PersonID=pmaster where PersonID=pchild;
update candidate_certificates set CandidateID=pmaster where CandidateID=pchild;
update candidate_companies_no_approach set CandidateID=pmaster where CandidateID=pchild;
update candidate_education set PersonID=pmaster where PersonID=pchild;
update candidate_reference_web_sites set CandidateID=pmaster where CandidateID=pchild;
update candidate_references set CandidateID=pmaster where CandidateID=pchild;
update candidate_experience set CandidateID=pmaster where CandidateID=pchild;
update candidate_lanquages_spoken set CandidateID=pmaster where CandidateID=pchild;
update candidate_payroll_detail set Candidate_ID=pmaster where Candidate_ID=pchild;
update candidate_ppe_issued set CandidateID=pmaster where CandidateID=pchild;
update candidate_qualifications set CandidateID=pmaster where CandidateID=pchild;
update candidate_unspent_convictions set CandidateID=pmaster where CandidateID=pchild;
update person_email set PersonID=pmaster where PersonID=pchild;
update person_address set PersonD=pmaster where PersonD=pchild;
update person_emergency_contact set PersonID=pmaster where PersonID=pchild;
update person_skills set PersonID=pmaster where PersonID=pchild;
update person_telephone set PersonID=pmaster where PersonID=pchild;
update document_attached_entities set entityid=pmaster where entityid=pchild and entityType_id=1;
update user_favourite set entityID=pmaster where entityID=pchild and entity_type=1;
update search_entry set entityId = pmaster where entityId=pchild and entityType=1;
update candidate set candidatestatus=9338,
timestamp=NOW()
where candidate_id=pchild;

insert into
merged_entities (entityid,entitytype, date_merged)
values
((select candidate_id from candidate where candidate.candidate_id=pchild),
'1',
NOW());

delete from selected_results where entityid=pchild and entityType=1;
delete from search_entry where entityid=pchild and entityType=1;
delete from sb_candidate_lookup where candidateid=pchild;
delete from candidate_permissions where candidateid=pchild;
delete from candidate_preferred_location where candidateid=pchild;
delete from candidate_job_categories where candidateid=pchild;
delete from user_favourite where entityid=pchild and entity_Type=1;

COMMIT;

END if;

 

if its easy could I put this script in talend and enter parameters then talend steps through the code and only inserts the unique record?