I would like a sanity check as I am very new to Talend DI.
I am using version 6.4.1 of Talend DI, the 'free' edition.
I need to retrieve a large amount of data from Elasticsearch over REST (as JSON or XML), about 1 GB of JSON per hour.
I am planning to use tRESTClient and tREST in Open DI.
If I were to write a standalone Java program (just for the sake of example), I would first post a request to Elasticsearch to obtain a scroll_id, which is conceptually like an SQL open-cursor statement in a database. In this initial scroll set-up request I would set the maximum batch size to be returned (for example, 500 'rows').
Next I would use the returned scroll_id value in a loop to make repeated calls to Elasticsearch to get the next batch of data, returned as a JSON/XML document.
I would repeat this call until the end-of-data condition is reached, and inside the loop I would store each returned JSON/XML payload in a database or a file.
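For reference, the scroll loop described above can be sketched in plain Java roughly as follows. This is only a minimal sketch under stated assumptions: `ScrollPage`, `ScrollTransport`, `ScrollSketch` and `inMemory` are hypothetical names made up for illustration, the HTTP calls are hidden behind an interface (the Elasticsearch scroll endpoints are noted in comments), and an in-memory stand-in replaces the real cluster so the loop logic can be run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** One scroll response: the scroll_id for the next call plus the hits in this batch. */
final class ScrollPage {
    final String scrollId;
    final List<String> hits;

    ScrollPage(String scrollId, List<String> hits) {
        this.scrollId = scrollId;
        this.hits = hits;
    }
}

/** Hypothetical transport; a real one would issue the HTTP calls noted below. */
interface ScrollTransport {
    // e.g. POST /<index>/_search?scroll=1m with body {"size": 500, "query": {...}}
    ScrollPage open(int batchSize);

    // e.g. POST /_search/scroll with body {"scroll": "1m", "scroll_id": "<id>"}
    ScrollPage next(String scrollId);

    // e.g. DELETE /_search/scroll with body {"scroll_id": "<id>"}
    void close(String scrollId);
}

final class ScrollSketch {
    /** Drains the scroll, handing each batch to the sink; returns the number of hits seen. */
    static long drain(ScrollTransport transport, int batchSize, Consumer<List<String>> sink) {
        ScrollPage page = transport.open(batchSize);
        String scrollId = page.scrollId;
        long total = 0;
        try {
            while (!page.hits.isEmpty()) {      // an empty batch signals end of data
                sink.accept(page.hits);         // store the batch in a file or database here
                total += page.hits.size();
                page = transport.next(scrollId);
                scrollId = page.scrollId;       // always use the most recent scroll_id
            }
        } finally {
            transport.close(scrollId);          // release the server-side scroll context
        }
        return total;
    }

    /** In-memory stand-in for Elasticsearch, for illustration only. */
    static ScrollTransport inMemory(final List<String> docs) {
        return new ScrollTransport() {
            private int position = 0;
            private int size = 0;

            public ScrollPage open(int batchSize) {
                size = batchSize;
                return slice();
            }

            public ScrollPage next(String scrollId) {
                return slice();
            }

            public void close(String scrollId) {
                // nothing to release in the in-memory version
            }

            private ScrollPage slice() {
                int end = Math.min(position + size, docs.size());
                List<String> batch = new ArrayList<>(docs.subList(position, end));
                position = end;
                return new ScrollPage("scroll-" + position, batch);
            }
        };
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("{\"id\":1}", "{\"id\":2}", "{\"id\":3}");
        long total = drain(inMemory(docs), 2, batch -> System.out.println(batch.size() + " docs"));
        System.out.println("total " + total);
    }
}
```

Because each batch is handed to the sink and then dropped, memory use depends on the batch size rather than on the 1 GB total.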
In Talend I intend to use tRESTClient to set up the scroll and return the scroll_id.
Then in a loop I plan to use tREST to pass the scroll_id and return the next batch of payload in JSON/XML form.
In the same loop I would also map the data and store it in a database/file.
Is this going to work?
Is this a good use case for Talend DI, or am I better off just writing a Java program without using Talend?
Will this perform well with a large amount of data in Talend DI?
I will be retrieving about 1 GB of JSON/XML data every hour via the above tREST loop calls.
If there is a better solution using the free version of Talend DI, please advise.
Many thanks in advance
The tREST component will be calling a service. How long will it take to download 1 GB of JSON? What will be the payload per request? Remember that a web service is just like a web page: there is a timeout setting. If your payload is too big and a single request takes too long to download, the service may time out.
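To make the timeout point concrete, here is a minimal Java sketch of setting explicit client-side timeouts on one scroll request with the standard `HttpURLConnection` class. The class name `ScrollRequestTimeouts`, the URL `http://localhost:9200/_search/scroll` and the timeout values are assumptions chosen for illustration, and the request is only prepared, not sent. Server-side timeouts and the scroll keep-alive are configured separately.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class ScrollRequestTimeouts {
    /** Prepares (but does not send) a JSON POST with explicit connect and read timeouts. */
    static HttpURLConnection prepare(String url, int connectTimeoutMs, int readTimeoutMs) {
        try {
            // openConnection() only creates the object; no network traffic happens yet
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setConnectTimeout(connectTimeoutMs); // max time to establish the connection
            conn.setReadTimeout(readTimeoutMs);       // max time to wait for data while reading
            conn.setDoOutput(true);                   // the scroll_id is sent in the request body
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed local endpoint and illustrative timeout values
        HttpURLConnection conn = prepare("http://localhost:9200/_search/scroll", 5000, 60000);
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
        System.out.println("read timeout ms: " + conn.getReadTimeout());
    }
}
```

A smaller batch size per request keeps each response well inside the read timeout, at the cost of more round trips.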
Also, would you be running 1 instance of this logic or multiple instances of this logic on multiple servers to parallelise?
Talend Big Data (paid version) has ElasticSearch components. That may simplify your need. You can try the Big Data Sandbox, get a trial license, and test it out.
If you have Groovy doing the same logic, I am sure you can reproduce it in Java. However, there is no ElasticSearch component in the open source version, so you may end up writing some code. You will also need to test it to find out whether Java performs the same as Groovy. That is the relevant comparison, since Talend just generates Java code.