Dear Talend Community.
I recently found Talend and started using it and I am still very new to the whole Data Warehousing concept so please bear with me..
Suppose I owned two companies that sell different goods. Both got their own databases, which both look slightly different.
To get to the point quickly, I got these two tables from both of my databases which are called product group:
The product groups for one database are:
The product groups for the other database are:
Each product in each of the databases correspond to exactly one productgroup. (Bread = Bred and buns; Chair = Furniture)
I also got my data warehouse, which is just another database which I created based on a star schema that can hold all data from both databases.
This data warehouse also got a table called product group.
I would like to load the data from both databases into the data warehouse.
So I should have 12 columns after loading the data into the warehouse, with Id's 1 to 12..?
How can I create such an ETL process? I tried around with tMap's but could not get the desired 12 columns as result.
A follow-up question would then be: what happens if both databases add one more product group?
Suppose DB1 adds "Paintings" and DB2 adds "Care Product".
What happens to the data in the data warehouse? DB1 will have 4. Paintings which will be a duplicate key because 4 will
already exist as bread and buns, which was added from an earlier ETL Process?
Am I understanding something wrong, or thinking wrong about it? I just dont get it how to do it, please help
Thanks for helping
Since there has been no answer to my question yet, I'll go ahead and answer with what I found to be working. (Still no idea if this is a correct use of a data warehouse or not though).
I made two database inputs, which extract the table "Product" from each of my two databases.
I then mapped each output individually and created an output that corresponds to the data warehouse table where the data should get loaded into.
For the problem with the Id's of the products, I came up with the idea, that I could just add +1000 to the id of the first database table and +2000 to the second database table. Which solves the whole "which id belongs to which product" issue. It also gets rid of the overlapping Id's if one of the databases adds more products, since you can have 1000 products, until the id's will overflow to 2000 where a duplicate key would happen again.
I then put a tUnite after both maps, to merge the data together into one table. The merged (or should I say united?) table will then get loaded into my data warehouse.
And et voilà the problem is solved.
As I already mentioned, I don’t know if this is the correct approach to data warehousing.
I hope someone can clarify, if what I found out is actually correct and good to do or if there is another method of doing some manual ETL process.
For now, this will be a first prototype that actually works very well.
One problem that is not solved, or rather came up now, is that if both databases have the same product. For example, DB 1 got chairs and DB 2 got chairs as well, the same exact chairs. Those could be stored under one id in the database. But they are duplicate entries with different keys right now.
Well that’s everything for now
Thanks for reading and for any answers to questions that came up in this thread.
Can you publish a screenshot of your job to discuss and be found a more optimized solution? because I too am inert at this imigration of data.
Watch the recorded webinar!
Accelerate your data lake projects with an agile approach
Create systems and workflow to manage clean data ingestion and data transformation.
Introduction to Talend Open Studio for Data Integration.