Seven Stars

tREST and/or tRESTclient - how to loop to retrieve data?

Hello everyone

 

I would like a sanity check as I am very new to Talend DI.

I am using version 6.4.1 of Talend DI, the 'free' edition.

 

I need to retrieve a large amount of data from Elasticsearch via tREST (as JSON or XML), about 1 GB of JSON per hour.

 

I am planning to use tRESTclient and tREST in Open DI.

 

If I were to write a standalone Java program (just for the sake of example), I would first post a request to Elasticsearch to obtain a scroll_id, which is conceptually like an SQL open-cursor statement. In this initial scroll set-up request I would set the maximum batch size to be returned (for example 500 'rows').

 

Next I would use the returned scroll_id value in a loop to make repeated calls to Elasticsearch, each returning the next batch of data as a JSON/XML document.

I would repeat this call until the end-of-data condition is reached, and inside the loop I would store each returned JSON/XML payload in a database or a file.
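For illustration, the loop described above could look roughly like the sketch below. It is not tested against a real cluster: the class name `ScrollLoopSketch`, the `post` function (path and body in, response JSON out) and the regex-based parsing are all assumptions made for the example, and the request paths and body shapes assume an Elasticsearch version that accepts `scroll` and `scroll_id` in a JSON body. A real program would plug in an HTTP client and a proper JSON parser.

```java
import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a scroll loop; the HTTP transport is passed in as a function.
final class ScrollLoopSketch {
    // Naive regex parsing, used only to keep the sketch dependency-free.
    private static final Pattern SCROLL_ID =
            Pattern.compile("\"_scroll_id\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern EMPTY_HITS =
            Pattern.compile("\"hits\"\\s*:\\s*\\[\\s*\\]");

    // Extracts the scroll id from a response, or fails if it is missing.
    static String scrollId(String json) {
        Matcher m = SCROLL_ID.matcher(json);
        if (!m.find()) {
            throw new IllegalStateException("no _scroll_id in response");
        }
        return m.group(1);
    }

    // End-of-data condition: the response carries an empty hits array.
    static boolean isLastPage(String json) {
        return EMPTY_HITS.matcher(json).find();
    }

    // Opens the scroll, then keeps fetching batches until one comes back empty.
    // Each non-empty batch is handed to sink (a file or database writer).
    // Returns the number of batches stored.
    static int drain(BiFunction<String, String, String> post,
                     String index, int batchSize, Consumer<String> sink) {
        String page = post.apply("/" + index + "/_search?scroll=1m",
                "{\"size\":" + batchSize + "}");
        int batches = 0;
        while (!isLastPage(page)) {
            sink.accept(page);
            batches++;
            String id = scrollId(page);
            page = post.apply("/_search/scroll",
                    "{\"scroll\":\"1m\",\"scroll_id\":\"" + id + "\"}");
        }
        return batches;
    }
}
```

Keeping the transport behind a function means the loop logic can be tried with canned responses before pointing it at the real service.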

 

 

In Talend I intend to use tRESTclient to set up and return the scroll_id.

Then in a loop I plan to use tREST to pass the scroll_id and return the next batch of payload in JSON/XML form.

Also in the same loop I would map the data and store it in a database/file.

 

Is this going to work?

Is this a good use case for Talend DI, or am I better off just writing a Java program without using Talend?

Will this perform well with a large amount of data in Talend DI?

I will be retrieving about 1 GB of JSON/XML data every hour via the above tREST loop calls.

If there is a better solution using the free version of Talend DI, please advise.

 

Many thanks in advance

 

 

Accepted Solutions
Seven Stars

Re: tREST and/or tRESTclient - how to loop to retrieve data?

I think I solved it now

 

In place of a literal "string" I use a global variable, like so: ((String) globalMap.get("variable_name"))

I can use this in the HTTP Body and in the Relative Path for tREST and tRESTclient.
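As a rough illustration of that approach, the sketch below builds the two strings using the same `globalMap.get(...)` cast. Inside a Talend job only the expression after `return` would be typed into the component field; the wrapper class `GlobalMapSketch`, the keys `"scroll_id"` and `"index_name"`, and the JSON body shape are assumptions for the example, and the value is assumed to have been stored earlier in the job (for instance with `globalMap.put(...)`).

```java
import java.util.Map;

// Hypothetical wrapper so the expressions can be compiled outside a Talend job.
final class GlobalMapSketch {
    // Expression for the HTTP Body field: a scroll request carrying the stored id.
    static String scrollBody(Map<String, Object> globalMap) {
        return "{\"scroll\":\"1m\",\"scroll_id\":\""
                + ((String) globalMap.get("scroll_id")) + "\"}";
    }

    // Expression for the Relative Path field: the path is assembled at run time.
    static String relativePath(Map<String, Object> globalMap) {
        return "/" + ((String) globalMap.get("index_name")) + "/_search?scroll=1m";
    }
}
```

If the key has not been set, `globalMap.get` returns null and the string "null" ends up in the request, so it is worth checking the value before the first call in the loop.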

 

 

5 REPLIES
Employee

Re: tREST and/or tRESTclient - how to loop to retrieve data?

The tREST component will be calling a service. How long will it take to download 1 GB of JSON? What will be the payload per request? Remember that a web service is just like a web page: there is a timeout setting. If your payload is too big and a single request takes too long to download, the service may time out.
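For comparison, in a hand-written Java client the client-side counterpart to that concern is the connect and read timeout on the connection. The sketch below only configures a `java.net.HttpURLConnection` and does not send anything; the class name `TimeoutSketch`, the URL and the timeout values are placeholders, not recommendations.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: prepares a POST connection with explicit timeouts.
final class TimeoutSketch {
    static HttpURLConnection open(String url, int connectMs, int readMs) {
        try {
            // openConnection() only creates the object; no network traffic yet.
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(connectMs); // max time to establish the connection
            conn.setReadTimeout(readMs);       // max time to wait for data on a read
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);            // a request body will be written
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A smaller batch size per request keeps each response well inside whatever read timeout is chosen.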

Also, would you be running 1 instance of this logic or multiple instances of this logic on multiple servers to parallelise?  

 

Talend Big Data (paid version) has ElasticSearch components. That may simplify your need.  You can try the Big Data Sandbox, get a trial license, and test it out. 

 

Seven Stars

Re: tREST and/or tRESTclient - how to loop to retrieve data?

many thanks for the reply!
I probably did not supply enough information, my apologies.

1 GB of data per hour is the total, not in a single tREST call.
After the tRESTclient call obtains the scroll_id, that value is passed in the tREST HTTP body for the loop of calls.
So the tREST call will loop perhaps 50-100 times, retrieving the next chunk of data on each call: say ~20 MB per call * 50-100 calls = 1-2 GB in total each hour.
Currently there is a Groovy program which does this job running in a single instance, not parallel.
We are replacing Groovy (and adding some more functionality) with either Talend DI or pure Java.
Based on the Groovy program's performance with Elasticsearch, I figure that I probably will not need to run multiple/parallel tasks in Java or Talend.
I hope this provides enough information for you to give me further guidance.
Many thanks again!
Employee

Re: tREST and/or tRESTclient - how to loop to retrieve data?

If you have Groovy doing the same logic, I am sure you can reproduce the same in Java. However, there is no ElasticSearch component in the open source version, so you may end up writing some code. You will also need to test it to find out whether Java performs the same as Groovy. That is the comparison to make, since Talend just generates Java code.

 

Seven Stars

Re: tREST and/or tRESTclient - how to loop to retrieve data?

thanks,

(1) Can I use tREST in a loop in a Talend DI job and pass the value for its HTTP Body at run-time (rather than statically, as in the Basic Settings UI)? The scroll_id value will need to be placed in the HTTP Body at run-time.
Which Talend document will tell me how?

(2) Can I set the tRESTclient "Relative Path" value (as seen in the Basic Settings UI) at run-time? Once again, this must be done dynamically, i.e. using some sort of 'variable' to set the value of Relative Path. Which Talend document will tell me how?

I am willing to write some Java helper code to be called inside Talend DI to do this, assuming it saves effort overall compared to writing everything in Java myself.

thanks!