Hello everybody !
I have an SQL input, I want to process the data by batches of 50 to give to a webservice. I would prefer to avoid running multiple queries. Is there an easy way to do that?
So far the only way I found was TMSSQLInput, then TFlowToIterate, then TFixedFlowInput for 50 rows (where I have to map field by field again), then the web service component. I think it should work, but before finishing it (the FixedFlowInput is a bit long to fill if there is a lot of columns !), I wanted to make sure there was not faster solution ; )
Thank you !
Your method will not work I'm afraid. The tFlowToIterate will iterate for every row and will not group the data into 50 rows.
What is wrong with running multiple queries? The easiest and most efficient way of achieving this will be to work out how many batches of 50 will be required, then call the TMSSQLInput that many times returning 50 rows each time. This needs to be driven by a tFlowToIterate component. So, the layout will be something like this....
(Component(s) to calculate number of iterations required) ---row--> tFlowToIterate ---iterate---> tMSSqlInput ----> (the rest of the job)--->
Indeed,the flowIterate will not be enought, that's the TFixedFlow component that will make it.
I prefer to avoid running too many queries on the database if I know that I need ALL the lines. The query is a bit big, so it's impacting the DB quite a lot. And I checked, the query is as good as it can be more or less, no index problem. The batch processing should be in the hand of Talend, this way I don't use any resources on the DB. We know how to loop on multiple queries, it's indeed working fine.
The other solution we will probably use it to dump all the content of the query in a temporary table, and run the queries on this table. it's much faster than to run the query on multiple tables.
As always, several solutions to the same problem Being a beginner with talend, I'm sometimes surprised by how easy it can be to do something with the right components. So I'm making sure I'm not missing something obvious by asking here now. Thanks for your feedback !
Unfortunately the tFixedFlow component doesn't work as you think it does. Given what you have described, you would end up with however many rows which are returned in your query being multiplied by 50 and being sent in batches of 50 of the same row to your web service. This is because the tFlowToIterate component returns only one row at a time. If you passed that to the tFixedFlow component, only that row would be processed until the end of the subjob, then the next row would be used by the iterator.
If your query really is the monster you say it is, your suggestion to load the data to a temporary table and then firing multiple queries at that might be the best solution. Doing it that way you could use a Talend job to move the data with your massive query AND pre-batch it (give it a group key of some sort). The you could dynamically switch your WHERE clause in your iterating query to accommodate that group key. This would be relatively easy in Talend.
The other way you could do this (if you wanted to just use Talend to handle this) would be to store the data rows in a Java collection of collections. This would allow you to hold your data entirely in memory and iterate over batches of 50 records at a time. This would involve using a fair amount of Java. I actually demonstrate something VERY similar in this tutorial (https://www.rilhia.com/tutorials/load-xml-files-batches-records). I wouldn't necessarily jump at this solution, it was really something I did to show it could be done. I would opt for using the database to batch your data.
Introduction to Talend Open Studio for Data Integration.
Practical steps to developing your data integration strategy.
Create systems and workflow to manage clean data ingestion and data transformation.