Six Stars

Extract more than 10k records from tHttpRequest component

I am using Talend Big Data 6.4 and have a scenario that requires your expertise.
Here is the scenario:
I am using the tHttpRequest component (GET method) to extract data hosted on a Kinvey server. Due to a restriction at the source, if a table has more than 10k records, only the first 10k records are returned and the remaining records are not sent in that request.
I need a workaround that lets me extract all of the records.

While investigating, I learned that pagination can be used to solve this problem, but I don't know how to configure pagination in Talend or which other components to use for it.

It would be very helpful if you could share some ideas on how to get this working, along with a screenshot of the components used in the job.

Any other way to accomplish this is also welcome (I heard that another way is to use the tLoop component). Kindly share a screenshot of the components used, along with any Java code written for this purpose.
  • Big Data
  • Data Integration
19 REPLIES
Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

OK, this depends on how pagination is enabled in your service, but I have written a tutorial for retrieving Spotify listening history that makes use of a type of this functionality. The tutorial is here https://www.rilhia.com/tutorials/using-talend-get-your-spotify-listening-history-facebook

You will need to search for "The GetMySpotifyListeningHistory Job" and look at steps 4, 5, 6, 7 and 8. It's not the easiest of things to get conceptually, but hopefully you can extrapolate from that to solve your problem.

Essentially the process is....
1) Connect your tHttpRequest to a tLoop
2) Run your tHttpRequest using a globalMap variable holding the URL
3) Retrieve the data, process it (or store it) and retrieve the new URL (for the next batch of data). Store it in the globalMap
4) Perform logic to enable the loop to run again
......and so on.
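
The steps above can be sketched in plain Java, outside Talend. This is only an illustration under assumptions: the `Page` class, the `server` map (an in-memory substitute for real HTTP calls) and the `"url"` key are hypothetical names, and the `HashMap` merely mimics Talend's globalMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NextUrlPagination {
    // Hypothetical stand-in for one HTTP response:
    // a batch of records plus the URL of the next batch (null on the last page).
    static final class Page {
        final List<String> records;
        final String nextUrl;

        Page(List<String> records, String nextUrl) {
            this.records = records;
            this.nextUrl = nextUrl;
        }
    }

    // Follows "next" URLs until none is returned, collecting every record.
    static List<String> fetchAll(String firstUrl, Map<String, Page> server) {
        Map<String, Object> globalMap = new HashMap<>(); // mimics Talend's globalMap
        globalMap.put("url", firstUrl);
        List<String> all = new ArrayList<>();

        while (globalMap.get("url") != null) {          // step 4: should the loop run again?
            String url = (String) globalMap.get("url"); // step 2: URL comes from the globalMap
            Page page = server.get(url);                // stands in for the tHttpRequest call
            if (page == null) {
                break;                                  // unknown URL: stop instead of looping forever
            }
            all.addAll(page.records);                   // step 3: process or store the data
            globalMap.put("url", page.nextUrl);         // step 3: store the URL of the next batch
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, Page> server = new HashMap<>();
        server.put("page1", new Page(Arrays.asList("a", "b"), "page2"));
        server.put("page2", new Page(Arrays.asList("c"), null));
        System.out.println(fetchAll("page1", server));
    }
}
```

In a real job the loop would be a tLoop, and the lookup in `server` would be the tHttpRequest call followed by whatever parsing extracts the next URL from the response.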

Hope that helps :-)

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Your tutorial was one of the first places I went to; I tried to replicate it on my side and tailored it to my needs.

I did exactly what you described, but unfortunately nothing happens. Can you check why it is not working? Please note that the endpoint value shown here is a dummy; the actual endpoint I use is correct and works on its own through the tHttpRequest component. Below is a screenshot of the job.

[Screenshot: the endpoint has a valid value; the one shown here is a dummy]
[Screenshot: tBigQueryOutput has all valid values and can connect to BigQuery with no problem]

Community Manager

Re: Extract more than 10k records from thttpsrequest component

Hello 

I am working on a task with a similar problem: the REST API returns only a limited number of records per call, but it accepts limit N and offset N parameters so that all records can be read by calling the API multiple times. I am using a tLoop to iterate in the job; see the screenshots.

The URL looks like:

"https://......?q=SELECT * from messages limit 1000 offset "+((Integer)globalMap.get("tLoop_1_CURRENT_VALUE"))
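
To see which URLs such an expression produces, here is a small standalone Java sketch. It assumes the tLoop counts from 0 in steps equal to the limit, so that its current value can serve directly as the offset; the base URL, the `OffsetUrls` class and the `total` parameter are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetUrls {
    // Builds one URL per loop value. The variable 'offset' plays the role of
    // tLoop_1_CURRENT_VALUE, assuming the loop steps by 'limit' starting at 0.
    static List<String> build(String base, int limit, int total) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        List<String> urls = new ArrayList<>();
        for (int offset = 0; offset < total; offset += limit) {
            urls.add(base + "?q=SELECT * from messages limit " + limit + " offset " + offset);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder endpoint; 2500 records at 1000 per call need three calls.
        for (String url : build("https://example.invalid/api", 1000, 2500)) {
            System.out.println(url);
        }
    }
}
```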

[Screenshots: 1.png, 2.png]

Hope it helps you.

 

Regards

Shong

----------------------------------------------------------
Talend | Data Agility for Modern Business
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Great, shong.
You are the man. Can you send me that job so that I can tailor it to my needs? Kindly share.
Community Manager

Re: Extract more than 10k records from thttpsrequest component

I exported the job items from v6.4.0, so you should use the same or a higher version to import the job into your Studio.

Six Stars

Re: Extract more than 10k records from thttpsrequest component

Hi Shong,
Thank you very much for the help.
However, I see that you get the count with a query in the URI. In my case, the JSON contains a column named count that holds the total number of records for that load, and every row has the same value for that load.
So if you run my job with 20k input records, the count variable is set to 20000 once per row, that is, 20000 times. Is it a problem that this assignment happens 20k times? Can it be changed to set the variable only once?
I also see that the iteration is set to 3. Does this mean the block will run at most 3 times? If so, will it load at most 3000 records, given that you set the limit to 1000?
Please clarify and advise on these questions.

Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

Can you show us the JSON response you get? My approach will only work if the JSON returns a new URL for the next set of records (as implemented by Facebook/Spotify). @shong's solution will only work if you are asked to set a numeric parameter at the end of the URL.

In order to help you we will need to see what we are working with :-)

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Hi Rhall_2_0,

What you say is the case. To extract the first 10k records, I have to append "?query={}&limit=10000&skip=0" to the existing URI; to extract the second 10k records, the URI has to be changed to URI + "?query={}&limit=10000&skip=10000", and so on. How exactly do we use a loop with this varying URI to get all of the records?

Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

OK, you can keep most of your job structure (I would use @shong's example for this). What you have to remember is that the last part of your URL will change with every call. The code you will need to change is below.....

 

//Set the limit value (records per request)
int limit = 1000;
//Set the skip value: limit x (current iteration of the loop - 1)
int skip = limit * (((Integer)globalMap.get("tLoop_1_CURRENT_ITERATION")).intValue() - 1);

//Set the query value
String query = "?query={}&limit=" + limit + "&skip=" + skip;

//Assign the query value to the query globalMap variable
globalMap.put("query", query);

You will then need to append ((String)globalMap.get("query")) to your URL.
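
As a rough illustration of that append step, the snippet below runs the same logic outside Talend. The `HashMap` is only a stand-in for the globalMap that Talend provides, and the base URI and the `QueryAppendSketch` class are hypothetical; inside a job, the expression `baseUri + ((String)globalMap.get("query"))` would go in the URI field of the tHttpRequest.

```java
import java.util.HashMap;
import java.util.Map;

class QueryAppendSketch {
    // Computes the query for the current iteration, stores it in the map
    // (as the tJava would) and returns the full URL (as the tHttpRequest would use it).
    static String buildUrl(String baseUri, Map<String, Object> globalMap, int limit) {
        int iteration = ((Integer) globalMap.get("tLoop_1_CURRENT_ITERATION")).intValue();
        int skip = limit * (iteration - 1); // iteration 1 -> skip 0
        String query = "?query={}&limit=" + limit + "&skip=" + skip;
        globalMap.put("query", query);
        return baseUri + ((String) globalMap.get("query"));
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>(); // stand-in for Talend's globalMap
        String baseUri = "https://example.invalid/appdata/collection"; // placeholder endpoint

        for (int i = 1; i <= 2; i++) {
            globalMap.put("tLoop_1_CURRENT_ITERATION", Integer.valueOf(i)); // normally set by the tLoop
            System.out.println(buildUrl(baseUri, globalMap, 10000));
        }
    }
}
```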

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

I am not sure which component you want me to make this change in. Do you want me to create a tJava component? If possible, can you send me the modified job (shong's job is available in this thread)? I am not an expert in Java, and a job shared the way you did in your previous post would be easier to understand and would help me, and anybody who has a similar issue in future, to extend this feature.

Kindly assist.



Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

I'm afraid I cannot do this for you since I do not have the time to reconfigure my system to do this. However, I have given you the bulk of the code you will need for this. You are right that this will need to be done in a tJava component. 

 

The best way for you to get better at this is to struggle through this. You have all of the information you need, you just now need to work out how to implement it. Most of the work is done, you just need to think about variable names, etc

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Does your code require finding the count of the records before entering the loop (for the iteration)?

Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

The code uses the number of iterations of the loop to calculate the record numbers....

//Set the limit value (records per request)
int limit = 1000;
//Set the skip value: limit x (current iteration of the loop - 1)
int skip = limit * (((Integer)globalMap.get("tLoop_1_CURRENT_ITERATION")).intValue() - 1);

//Set the query value
String query = "?query={}&limit=" + limit + "&skip=" + skip;

//Assign the query value to the query globalMap variable
globalMap.put("query", query);

If we assume that the first iteration is iteration 1, then the query string will be....

"?query={}&limit=1000&skip=0"

The second iteration it will be....

"?query={}&limit=1000&skip=1000"

 

The third iteration it will be....

"?query={}&limit=1000&skip=2000"
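
The query itself does not need the record count, but the loop still has to know when to stop. If the total is known up front (the thread mentions a count value in the JSON), one possible way to size the loop is a ceiling division, sketched here in standalone Java. The `PageCountSketch` class and its method names are hypothetical.

```java
class PageCountSketch {
    // Number of requests needed to cover totalRecords at 'limit' records per request.
    static int pagesNeeded(int totalRecords, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return (totalRecords + limit - 1) / limit; // integer ceiling division
    }

    // Skip value for a 1-based iteration number, matching the code in the reply above.
    static int skipFor(int iteration, int limit) {
        return limit * (iteration - 1);
    }

    public static void main(String[] args) {
        int total = 20000; // example total taken from the thread
        int limit = 10000;
        int pages = pagesNeeded(total, limit);
        for (int i = 1; i <= pages; i++) {
            System.out.println("?query={}&limit=" + limit + "&skip=" + skipFor(i, limit));
        }
    }
}
```

With 20000 records and a limit of 10000 this yields two requests, with skip values 0 and 10000.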

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Hi rhall_2_0,

I am attaching screenshots of the job I have built, but I could not connect a tLoop directly to a tJavaRow. I am also not sure what the schema on the tJava should be, since the HTTP component can only have responsecode as an output. Should I use query as the schema, with query as the input to the tHttpRequest component and responsecode as the output?

Can you check whether I have built all the parts correctly, and tell me how I should connect the tLoop component to the tJavaRow component?

Let me know if I missed anything. Kindly assist.

[Attachments: count_job_details.jpg, main_loop_job.jpg]

Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

Use a tJava component instead of a tJavaRow. tJavaRow components cannot be connected using "iterate" links, they need "Main" rows. Since the tLoop only provides an "Iterate" link, this needs to be considered. Link the tJava to the tHttpRequest using an "Iterate" link. 

It should look something like this.....

 

tLoop --iterate--> tJava ---iterate--> tHttpRequest
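
The chain above can be mimicked in plain Java to check the logic before building the job. This sketch is an assumption-laden illustration: `fetchPage` stands in for the tHttpRequest call against an in-memory list, and the stop rule (end the loop when a page comes back with fewer records than the limit) is not from this thread; it is one common alternative to knowing the total count in advance.

```java
import java.util.ArrayList;
import java.util.List;

class LoopChainSketch {
    // Stand-in for the remote service: returns at most 'limit' records starting at 'skip'.
    static List<Integer> fetchPage(List<Integer> source, int limit, int skip) {
        int from = Math.min(skip, source.size());
        int to = Math.min(skip + limit, source.size());
        return new ArrayList<>(source.subList(from, to));
    }

    static List<Integer> fetchAll(List<Integer> source, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        List<Integer> all = new ArrayList<>();
        int iteration = 1;                                       // tLoop: iteration assumed to start at 1
        while (true) {
            int skip = limit * (iteration - 1);                  // tJava: compute the skip value
            List<Integer> page = fetchPage(source, limit, skip); // tHttpRequest stand-in
            all.addAll(page);
            if (page.size() < limit) {
                break;                                           // short page: nothing left to fetch
            }
            iteration++;
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            source.add(i);
        }
        System.out.println(fetchAll(source, 10).size());
    }
}
```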

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Thanks for the hint.
But this loads only records 10k to 20k into the target (the source has 20k records) and skips the first 10k records.
May I know what I did wrong?
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Problem fixed. Some problem with my settings. Kindly ignore my previous posts.

Huge thanks to rhall_2_0 and shong for helping me out and helping me learn more about Talend and its capabilities.

Please continue your service to the community.

Eleven Stars

Re: Extract more than 10k records from thttpsrequest component

Glad you got it sorted!

I felt a little bad about appearing to try to make you struggle, but I was a little busy and do think struggling a little massively increases the amount you learn :-)

Rilhia Solutions
Six Stars

Re: Extract more than 10k records from thttpsrequest component

Hi rhall_2_0,

Never feel bad. Given my timeline, I thought it could not be done on time, but with your help I was able to finish it in time. Thanks anyway; it did make me learn more about the implementation.

 

I am also getting a WARNING message right now. Although it does not affect the outcome of the job, I would like to know how it can be fixed. Let me know if you can think of any reason why this WARNING appears on every execution. Below is the URL of the topic I posted:

https://community.talend.com/t5/Design-and-Development/Unable-to-find-mime-types-file-in-classpath-w...