MDM Version Upgrade Methodology Part 1

The Problem

 

A new version of Talend MDM has been released, and you wish to migrate from an older version of Talend MDM to this new version. Talend MDM is one of the few products in the Talend suite that contains both user-developed artifacts (such as models and roles) and actual business data. The other products where parallels can be drawn in terms of migration requirements are Data Preparation and Data Stewardship, but these are considerably simpler applications, and thus the migration process for these products is also simpler.

 

There has been considerable misunderstanding and misinformation on the topic of MDM migrations in the past. This document outlines in detail the Talend-supported approaches to MDM migration. Obviously, MDM is part of a wider platform of products: Data Integration (DI), Data Quality (DQ), Enterprise Service Bus (ESB), and Big Data. We may have migration actions to perform with all of these platform components, but the scope of this article is limited to MDM.

 

Assessing the Situation

 

An MDM migration will usually involve migrating your MDM objects: Models, Roles, Views, and so on. However, assuming that:

  • these objects have been properly version controlled
  • you have a tightly controlled approach to deployment of these objects to the MDM Server

then deploying these objects onto the new version of the MDM server is simply a matter of performing the deployment with either the new version of Studio or CommandLine (for an automated deployment)[1]. You must always deploy from a Studio that is the same Talend release version as the MDM server. However, for environments where maintaining the state of the hub is important, you may also wish to migrate the contents of the hub: not just the business data in your MDM model(s), but also things like:

  • Users
  • AutoIncrement counters
  • Journal Entries
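
The rule stated earlier in this section, that Studio and the MDM server must be on the same Talend release, lends itself to a simple pre-deployment check. The following is only a minimal sketch: the function names are hypothetical, the version strings are assumed to be plain dotted numbers, and "same release" is assumed to mean that every numeric component matches. It does not call any Talend API; how you obtain the two version strings is left open.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn a dotted release string such as '6.4.1' into (6, 4, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def same_release(studio_version: str, server_version: str) -> bool:
    """Return True only when every numeric component matches (an assumption)."""
    return parse_release(studio_version) == parse_release(server_version)


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    if not same_release("6.4.1", "6.3.1"):
        print("Studio and MDM server releases differ: do not deploy.")
```

A check of this kind could be run as the first step of an automated deployment, so that a mismatched Studio or CommandLine installation is caught before anything is pushed to the server.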

 

Background

 

To understand the problem further, some history is required. In v5.1 and earlier, this data was stored in an XML database—Qizx or eXist.

In v5.2, Talend switched the storage of the following MDM Containers to a relational database:

  • MASTER: one or more databases[2] that contain the business entities that you are mastering in your MDM projects
  • STAGING: in v5.2, a database used in some scenarios as a load mechanism (External sources -> MASTER). Later versions utilise staging only for Integrated Matching.
  • UpdateReport: the MDM Journal. Depending on the datasources.xml configuration, this may be in the same database as MASTER.
  • CrossReferencing: rarely used, but it must still exist. Again, it may optionally be placed in the same physical database as MASTER, depending on how MDM is configured.

The key point is that the MDM models that you build and deploy now map to a physical, normalised, human-readable schema in the relational database, rather than to schema-less documents in a document database. How this mapping is achieved is out of scope for this article.

 

The remaining internal data stores/containers, collectively known as the SYSTEM database, continued to be stored in an XML database in v5.2. In v5.3, the SYSTEM database migrated to relational storage as well.
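
The storage history described above can be summarised as a small lookup. This is an illustrative sketch only: the function and constant names are hypothetical, versions are assumed to be given as (major, minor) pairs, any container that is not one of the four listed above is assumed to belong to SYSTEM, and releases after v5.3 are assumed to remain fully relational.

```python
from typing import Tuple

# Containers that moved to relational storage in v5.2, as listed above.
USER_CONTAINERS = {"MASTER", "STAGING", "UpdateReport", "CrossReferencing"}


def storage_backend(version: Tuple[int, int], container: str) -> str:
    """Return 'xml' or 'relational' for a container at a (major, minor) version."""
    if version <= (5, 1):
        # v5.1 and earlier: everything lives in an XML database (Qizx or eXist).
        return "xml"
    if version == (5, 2):
        # v5.2: the listed containers are relational, SYSTEM is still XML.
        return "relational" if container in USER_CONTAINERS else "xml"
    # v5.3 onwards: SYSTEM is relational as well.
    return "relational"


if __name__ == "__main__":
    for release in [(5, 1), (5, 2), (5, 3)]:
        print(release, storage_backend(release, "MASTER"),
              storage_backend(release, "SYSTEM"))
```

Reading the table this way makes the awkward case easy to spot: a v5.2 installation has its business data in a relational database while its SYSTEM data is still in XML.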

 

In v5.2 and v5.3, a mechanism was required for migrating from the XML database to the relational databases. This was provided in the form of the MDM DB migration tool. It is this tool, and the process of migrating from XML to relational storage, that is documented in the MDM section of the official migration guide; see "Automatically migrating from Talend XML database to a relational database or between two relational ....". However, the documented process can be considered incomplete at this time.

 

Migration Challenges

 

So, given that MDM uses a relational database for its storage, why can you not simply point the new MDM server at the existing databases (or a clone of the existing databases) during an upgrade? This is absolutely not a supported approach for customers to take. On rare occasions, Talend Professional Services may use this approach with the prior approval of Talend R&D, but the circumstances where this is both possible and necessary are extremely limited. The reasons for this are as follows:

  1. The manifestation of the physical MDM databases may change between releases. Talend reserves the right to change anything in both the system and user databases. For example:

    [Figure: comparison.png]

     

    Given that Talend reserves the right to change the schema, migration must occur through the application layer rather than directly within the physical storage. Because the storage is a third normal form manifestation of the MDM model, it would be nearly impossible for the tool to alter the database ‘on the fly’. Instead, you must go through the application layer, because the entity definition within the model (and therefore the XML representation of an instance of an entity), as used by the application layer, remains identical between versions. The application layer has no concept of how the entity is physically stored; it understands only model definitions, XML documents, and relations between XML documents (foreign key relationships in the model).

  2. The full-text search indexes (Lucene indexes) provide the contains operations and their variations within MDM, that is, fast, non-exact searching. They are created as records are inserted into MDM through the application layer. It is possible to ‘re-generate’ the Lucene indexes in an emergency (there is an API for this). However, this is potentially a time-consuming process. For example, a full index regeneration for 75 million records in MDM could take 18 hours or more. Thus, combined with the issue described in point 1, this is not usually viable or desirable in a migration scenario.
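
To put the index regeneration figure from point 2 into perspective, the quoted example (75 million records, 18 hours) can be turned into a rough throughput and scaled to other hub sizes. This is a back-of-the-envelope sketch with hypothetical function names. It assumes that regeneration time grows linearly with record count, which the article does not state, and because the quoted figure is "18 hours or more", the results are best read as lower bounds.

```python
# Reference point quoted in the text: 75 million records, 18 hours.
REFERENCE_RECORDS = 75_000_000
REFERENCE_HOURS = 18.0


def reference_throughput_per_second() -> float:
    """Records indexed per second implied by the reference figures."""
    return REFERENCE_RECORDS / (REFERENCE_HOURS * 3600)


def estimate_reindex_hours(record_count: int) -> float:
    """Scale the reference figure linearly to another record count (an assumption)."""
    return record_count * REFERENCE_HOURS / REFERENCE_RECORDS


if __name__ == "__main__":
    print(f"Implied throughput: {reference_throughput_per_second():.0f} records/s")
    for count in (10_000_000, 25_000_000, 75_000_000):
        print(f"{count:>11,d} records: at least {estimate_reindex_hours(count):.1f} h")
```

The reference figures imply roughly 1,157 records per second, so under the linear assumption a hub holding 25 million records would need about 6 hours. Actual times depend on hardware, model complexity, and configuration.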

These instructions continue in MDM Version Upgrade Methodology Part 2, which defines the assumptions, prerequisites, and supported approaches to MDM migration.

 

[1] MDM does not currently have the concept of an MDM artifact (binary) that can be properly versioned in a Nexus repository in line with the other developed artifacts (such as Jobs and routes) that can be built using Talend. There is a feature request for this (https://jira.talendforge.org/browse/PMMDM-261), but it is not a critical one because an MDM publish event is much less common than a DI or ESB publish event.

[2] This article uses the term database in the sense used by most database products, except in sections dealing specifically with Oracle, where the term schema is used instead.

Version history: revision 8 of 8, last updated 09-19-2017 04:40 PM