Four Stars

[resolved] Salesforce Bulk Queries for Large data sets

I want to query all of the contacts in my org using the Salesforce Bulk API. The largest batch size is 10K, so I would have expected Talend to send multiple batches. How can I achieve this?
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] Salesforce Bulk Queries for Large data sets

I still have this problem where Talend DI will only send one batch request.
I'm trying to track down where the batching logic should live, but I can only find it for the upload path, in SalesforceBulkAPI.java:
    private List<BatchInfo> createBatchesFromCSVFile() throws IOException, AsyncApiException {
        List<BatchInfo> batchInfos = new ArrayList<BatchInfo>();
        BufferedReader rdr = new BufferedReader(new InputStreamReader(new FileInputStream(bulkFileName), FILE_ENCODING));
        // read the CSV header row
        byte[] headerBytes = (rdr.readLine() + "\n").getBytes("UTF-8");
        int headerBytesLength = headerBytes.length;
        File tmpFile = File.createTempFile("sforceBulkAPI", ".csv");
        // Split the CSV file into multiple batches
        try {
            FileOutputStream tmpOut = new FileOutputStream(tmpFile);
            int currentBytes = 0;
            int currentLines = 0;
            String nextLine;
            boolean needStart = true;
            boolean needEnds = true;
            while ((nextLine = rdr.readLine()) != null) {
                int num = countQuotes(nextLine);
                // nextLine is header or footer of the record
                if (num % 2 == 1) {
                    if (!needStart) {
                        needEnds = false;
                    } else {
                        needStart = false;
                    }
                } else {
                    // nextLine is a whole record or middle of the record
                    if (needEnds && needStart) {
                        needEnds = false;
                        needStart = false;
                    }
                }
                byte[] bytes = (nextLine + "\n").getBytes("UTF-8");
                // Create a new batch when our batch size limit is reached
                if (currentBytes + bytes.length > maxBytesPerBatch || currentLines > maxRowsPerBatch) {
                    createBatch(tmpOut, tmpFile, batchInfos);
                    currentBytes = 0;
                    currentLines = 0;
                }
                if (currentBytes == 0) {
                    tmpOut = new FileOutputStream(tmpFile);
                    tmpOut.write(headerBytes);
                    currentBytes = headerBytesLength;
                    currentLines = 1;
                }
                tmpOut.write(bytes);
                currentBytes += bytes.length;
                if (!needStart && !needEnds) {
                    currentLines++;
                    needStart = true;
                    needEnds = true;
                }
            }
            // Finished processing all rows
            // Create a final batch for any remaining data
            if (currentLines > 1) {
                createBatch(tmpOut, tmpFile, batchInfos);
            }
        } finally {
            tmpFile.delete();
        }
        return batchInfos;
    }
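For comparison, the query side would need a different mechanism: a bulk query does not split a CSV into upload batches; instead the SOQL text itself is submitted as the batch, and Salesforce writes the results to one or more result files. A minimal sketch of that job-creation step, using the Salesforce Java async client (com.sforce.async) rather than Talend's component code, might look like the following; the class name, session id, REST endpoint, object and SOQL string are placeholders for illustration only:

    // Minimal sketch, not Talend's implementation: submit a SOQL query as a bulk job
    // with the Salesforce Java async client (com.sforce.async). The session id,
    // REST endpoint and SOQL string are placeholders.
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    import com.sforce.async.AsyncApiException;
    import com.sforce.async.BatchInfo;
    import com.sforce.async.BulkConnection;
    import com.sforce.async.ContentType;
    import com.sforce.async.JobInfo;
    import com.sforce.async.OperationEnum;
    import com.sforce.ws.ConnectorConfig;

    public class BulkQueryJobSketch {

        public static BatchInfo submitBulkQuery(String sessionId, String restEndpoint, String soql)
                throws AsyncApiException {
            // Reuse an existing session; the REST endpoint looks like
            // https://<instance>.salesforce.com/services/async/<apiVersion>
            ConnectorConfig config = new ConnectorConfig();
            config.setSessionId(sessionId);
            config.setRestEndpoint(restEndpoint);
            BulkConnection bulk = new BulkConnection(config);

            // A bulk query job uses the "query" operation and, here, CSV results
            JobInfo job = new JobInfo();
            job.setObject("Contact");
            job.setOperation(OperationEnum.query);
            job.setContentType(ContentType.CSV);
            job = bulk.createJob(job);

            // The SOQL text is the payload of the (single) batch
            ByteArrayInputStream soqlStream =
                    new ByteArrayInputStream(soql.getBytes(StandardCharsets.UTF_8));
            BatchInfo batch = bulk.createBatchFromStream(job, soqlStream);
            bulk.closeJob(job.getId());
            return batch;
        }
    }

The batch then has to be polled until it reaches the Completed state, after which the result files can be downloaded (see the download sketch later in this thread).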
12 REPLIES
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

I want the same ability you get with the BulkExec import. I want to be able to query for all 7M contacts in one go and have Salesforce output the file back to Talend.
One Star

Re: [resolved] Salesforce Bulk Queries for Large data sets

Hi,
I couldn't find a 'query' action on the tSalesforceBulkExec component, so I ended up with the same question you raised.
Did you find any workaround for this?
Thanks
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

You have to use the tSalesforceConnection + tSalesforceInput components. Configure the two to work together, making sure you select "For salesforce bulk component" inside the tSalesforceConnection component; then output to CSV as the last step.
The issue I have is that this configuration only lets you export up to the Salesforce maximum batch size of 10,000 records.
I was hoping the component would automatically create (n) batches if required, but it looks like I have to build that logic myself. Waiting on Talend.
One Star

Re: [resolved] Salesforce Bulk Queries for Large data sets

Hi, thanks a lot for the reply. I just found that my Talend version, 5.1, was outdated, as bulk query is only available in 5.2.x.
Referring to this guideline: http://www.salesforce.com/us/developer/docs/api_asynch/index.htm
- A bulk query is limited to a maximum of 10 result files, each up to 1 GB in size.
- If the query needs to return more than 10 files, it should be filtered to return less data. Bulk batch sizes are not used for bulk queries.
So I am a bit confused by the 10,000 limitation. I would think that is the batch size limit for uploading data to Salesforce, not the other way round, as in a bulk query.
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

Thanks for replying. I understand the limitations of the Salesforce API, but I'm still struggling to configure Talend to query large data volumes (millions of records) using Bulk. Non-bulk is not an option.
If I fire the same query using Jitterbit/Workbench, those apps simply create additional batches automatically and return all the records in the query. In my case 7M+ records are returned in a single file within 10-15 minutes.
Similarly, if I am importing using the Bulk API, Talend just keeps sending batches of 10K records until it hits the limit of 2,000 batches in 24 hrs.
Any ideas?
One Star

Re: [resolved] Salesforce Bulk Queries for Large data sets

Hi,
Thanks a lot for pointing out that the limitation is something 'implemented' by Talend rather than Salesforce; I'm clear on what you mean now.
I think the existing Talend component, with its 10K batches, may never meet your requirement to get 7M+ records.
Is it possible to filter your data by certain criteria, e.g. LastModifiedDate?
I am working on a flow something like this (a minimal sketch follows below):
get lastSyncDate from the local DB -> add it as a filter criterion in the query condition -> populate the data -> update lastSyncDate in the local DB.
Not too sure if this flow is applicable to your scenario ...
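A rough Java sketch of that filter step, assuming the Contact object and a LastModifiedDate filter (the field list and class name are illustrative, and lastSyncDate would be read from the local DB before the query and updated afterwards):

    // Illustrative sketch of the incremental filter described above, not an
    // existing Talend component; the object and field names are assumptions.
    import java.time.Instant;
    import java.time.format.DateTimeFormatter;
    import java.time.temporal.ChronoUnit;

    public class IncrementalSoqlSketch {

        public static String buildIncrementalQuery(Instant lastSyncDate) {
            // SOQL datetime literals are ISO-8601 UTC, e.g. 2013-06-01T00:00:00Z
            String since = DateTimeFormatter.ISO_INSTANT
                    .format(lastSyncDate.truncatedTo(ChronoUnit.SECONDS));
            // Only records modified since the last successful sync are requested,
            // which keeps each bulk query well under the result-size limits.
            return "SELECT Id, FirstName, LastName, Email, LastModifiedDate"
                    + " FROM Contact"
                    + " WHERE LastModifiedDate > " + since;
        }
    }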
Again, thanks a lot for your info.
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

That's one solution. Thanks for sharing.
I still feel this should be handled natively by the Talend component. I will raise an enhancement request.
I am planning on building a process that first bulk-exports using Workbench and then uses Talend to process the file. It's not a critical process, only something we run each time we refresh a sandbox, so it doesn't really matter if it has a few more steps.
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

I'm now working with another large salesforce.com client and the latest version of Talend, 5.3.0r101800.
Same as before, we are attempting to perform large bulk queries and hitting the 10K record limit. I would expect Talend to be able to spool multiple batches out to CSV. The full data set is being queried (see attachment), but Talend cannot process the output.
Any advice / pointers?
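For reference, spooling the full output means iterating over every result file Salesforce produces for the query batch, not just the first one. Below is a rough sketch of that download step, again using the Salesforce Java async client and not Talend's component; the BulkConnection, job id and batch id are assumed to come from a job-creation step like the earlier sketch, and the class name is a placeholder:

    // Sketch only: copy every result file of a completed bulk query batch into one
    // CSV, keeping the header row from the first file only.
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    import com.sforce.async.AsyncApiException;
    import com.sforce.async.BulkConnection;
    import com.sforce.async.QueryResultList;

    public class BulkQueryDownloadSketch {

        public static void downloadAllResults(BulkConnection bulk, String jobId, String batchId,
                Path targetCsv) throws AsyncApiException, IOException {
            // A completed bulk query batch can point at several result files
            // (up to 10 files of up to 1 GB each), so every result id must be fetched.
            QueryResultList results = bulk.getQueryResultList(jobId, batchId);
            try (BufferedWriter out = Files.newBufferedWriter(targetCsv, StandardCharsets.UTF_8)) {
                boolean firstFile = true;
                for (String resultId : results.getResult()) {
                    try (BufferedReader in = new BufferedReader(new InputStreamReader(
                            bulk.getQueryResultStream(jobId, batchId, resultId), StandardCharsets.UTF_8))) {
                        String line = in.readLine();    // CSV header of this result file
                        if (firstFile && line != null) {
                            out.write(line);            // keep the header only once
                            out.newLine();
                        }
                        while ((line = in.readLine()) != null) {
                            out.write(line);            // append data rows from every file
                            out.newLine();
                        }
                    }
                    firstFile = false;
                }
            }
        }
    }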
Community Manager

Re: [resolved] Salesforce Bulk Queries for Large data sets

Hi agentgill,
I've seen that the same issue was reported on JIRA:
https://jira.talendforge.org/browse/TDI-7779
and it has been fixed since v4.0.4. I don't know whether tSalesforceBulkExec has been updated in the latest version, but I would suggest downloading v4.1.0, which is closest to v4.0.4, and testing whether it works there. You can download v4.1.0 from this link:
http://sourceforge.net/projects/talend-studio/files/Talend%20Open%20Studio/4.1.0/
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
Four Stars

Re: [resolved] Salesforce Bulk Queries for Large data sets

Thanks for the response. That is a really old version; I'm using version 5.3.0, and uploading large batches is working fine. My issue is with querying large record sets in bulk. Any ideas on that?
Community Manager

Re: [resolved] Salesforce Bulk Queries for Large data sets

This problem may still exist in the latest version, 5.3.0. I suggest you reopen TDI-7779 or create a new issue in our bug tracker so that our developers from the R&D team can see it.
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business