First of all, i warn you, this topic is quite long and may take some time to read and understand so if you run out of time, you might prefer to come back later . Good luck others, stay strong !
So as the title says, I am currently trying to design a quite complex flow in talend with multiple SQL statements that need to be run all along the flow. I have already designed the whole job, which is working, but I was wondering if it could be done in a better way (with or without talend). First, let me explain you the flow step by step and its goal.
I - Goal
The goal is to migrate datas from an online application (datas can be reached by api rest calls so let's name this app "api") to a MySQL database of another application (let's name this app "crm" because ... this is a CRM, what a surprise).
We do not develop these two applications so unfortunately we can't do whatever we want with both databases...
As I said, api datas in input are reached with api rest calls, that means we do not even have access to the database. However we do have access to the crm database in output so we can insert datas in it but we can't alter the schema because it is used by the crm application.
II - Context
It is all about schools, courses and learners. Schools are subscribed to courses, learners are part of a school and are also subscribed to courses. All schools and courses are already migrated, now we only need to migrate learners. This is a job which may be run multiple times in order to synchronise both applications, so we need to update learners that have already been migrated.
III - Learners migration step by step
1) First we need to execute a query in crm database in order to get all existing courses because we can only get learners by course with api rest request. Of course ids are different so we query a specific table (crm_course_id, api_course_id) which gives us matching between both app ids.
2) For each course, we run a rest request in order to get learners (and school in which they are registred).
3) Now for each learner, we need to insert or update it in crm database. But this is not that easy :
3.1) We do have api_learner_id so now we need to get crm_learner_id by querying another specific table (crm_learner_id, crm_school_id, api_learner_id). If it exists, we have crm_learner_id and also crm_school_id (where learner is registered) and we are ready for update. If it doesn't exist we need to insert it.
3.2) If we are going to insert, we need crm_school_id. We are going to get it by querying another table (crm_school_id, api_school_id).
3.3) We proceed to update or insert
IV - Drawbacks
Create this whole flow with talend is quite annoying for two reasons :
1) We can't (or I don't know how) simply run queries that can return no result directly inside our flow. tDBInput doesn't allow entry, so it can't be placed along the job. tDBRow can do the job but if it returns no row then the job stops here. This is annoying when you search for an existing crm learner id because the result may be null meaning you need to insert the entity.
The only way I found to go through this is to build a child job that take as parameter api_learner_id, execute query to search for crm_learner_id and store result in global with tSetGlobalVar. Then it runs a tJava component and tBufferOutput which returns to the main job crm_learner_id if found, null otherwise.
I don't know if there is another way to do it but this is a very simple need quite difficult to build in talend ... One simple select statement that will return either an id or null is done with a whole childjob, (tDBInput -> tsetGlobalVar) -> onSubjobOk -> tJava -> tBufferOutput ...
2) Performance are poor ... Why ? If we set aside the fact that we may do lots of api calls because we can't get learners in another way ..., we are still running lots of SQL queries ... it would have been way better if we could store all our existing learners ids in an array at the beginning of the job. This way, we could have looked for existing ids there instead of running queries. I don't know if I can manage such a thing with talend...
V - Python tests
In comparison I have build this same flow with a Python script (Pandas & SQL Alchemy) and it is way easier if you know the basics of development (python or not, that was my first attempt with python). Moreover, execution process went faster because I could store temporary datas in DataFrame instead of running tons of queries. However, a python script is much less user friendly and maintainable than a talend job. Finally, all our flows are made with talend and I can't mix talend and python. I can't decide to build a talend job one time, and a python script another time... things must be homogeneous and that's why I was wondering if I could improve this talend job with your help & advice.
Thank you very much for reading,
Note : I tried to simplify the flow as much as I could for this topic. In fact, there is more than one query to run if the learner has not been found in crm database. There is some others ids that need to be retrieved before inserting a new learner.
Any advice or whatever ?
My most needed answer here can be "How to retrieve data from database directly inside a job (with components before & components after) when result can only be null or unique ?". Currently, the only way I found to handle this is to use a child job which seems quite complex for such a simple need in an ETL flow...
Then I am interesting in knowing if it is possible to store query results in array in order to re-use them later in the job. The goal here is to retrieve different datas at different times, always from the same table without being forced to query the database (which is slow).
If needed I can provide some simplified screenshots in order to better understand this topic & my questions Just let me know.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Find out how to migrate from one database to another using the Dynamic schema
Learn how to use an API-First Approach to Modernize your Applications
Pick up some tips and tricks with Context Variables