Slow Insertion in Amazon Redshift

One Star

Slow Insertion in Amazon Redshift

Hi,
We have just created a simple job that fetches data from a MySQL table (both a local database and Amazon RDS) containing 300,000 rows and inserts those rows into Redshift. It took more than 4 hours to complete.
1. Why is it so slow to fetch data from a single table and insert it into Amazon Redshift using Talend Open Studio for Big Data?
2. Is there a way to perform a faster insertion, ideally in less than 5 minutes?
Please see the attached screenshots for details.
Thanks!
Moderator

Re: Slow Insertion in Amazon Redshift

Hi,
Have you set the "Commit every" option in tRedshiftOutput, and is there any complicated SQL query in your input component? What is your current rate?
Best regards
Sabrina
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
Yes, "Commit every" is set to 10,000. There is no complicated query; it is the rather simple query given below:
"select
`dim_ipdata_id` ,
`ipdata_ip` ,
`ipdata_isp` ,
`ipdata_org` ,
`ipdata_country` ,
`ipdata_city` ,
`ipdata_postal_code` ,
`ipdata_longitude` ,
`ipdata_latitude` ,
`ipdata_area_code` ,
`ipdata_metro_code` ,
`ipdata_category`
from dim_ipdata"
The current rate is 8 rows per second. Any idea what could be going wrong?
Moderator

Re: Slow Insertion in Amazon Redshift

Hi,
8 rows per second is not a normal rate. I have seen your screenshot and found that the tMap component is only used to map data, with no other action. For large data sets, the tMap component consumes a lot of memory. How about removing it, so that the workflow becomes tAmazonMysqlInput-->tRedshiftOutput, or configuring tMap to store its data on disk instead of in memory?
Best regards
Sabrina
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
Thanks for the valuable suggestions.
I removed tMap so the job is now tAmazonMysqlInput-->tRedshiftOutput, but even that didn't help. Regarding your second suggestion of storing data on disk, wouldn't I still have to use tMap for that, or is there another alternative?
Thanks!
Best Regards,
Ilyas
Moderator

Re: Slow Insertion in Amazon Redshift

Hi,
I don't think the second suggestion will work for you: since removing tMap didn't help, tMap is not the bottleneck.
For the "Commit every" option, is there any improvement if you change the value from 10,000? The best value depends on your database, since each commit consumes DB server resources.
In addition, resources differ between database servers, so there is no fixed standard.
Best regards
Sabrina
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
Yes, but even if removing tMap helped, it wouldn't be an option in the long run, because that's the component that allows us to manipulate strings, URLs, joins, variables, etc. For now we are just testing Talend with Redshift by applying the simplest possible data transformation.
I've changed "Commit every" from 10,000 to 1,000 and still no luck!
I used tRedshiftConnection to set up the connection and then set tRedshiftInput to "Use existing connection", with the tRedshiftConnection component as the reference. But that kept giving me a NullPointerException, so I had to provide all the connection details inside tRedshiftInput and stop using "Use existing connection". Could that be a problem?
Best Regards,
Ilyas
Moderator

Re: Slow Insertion in Amazon Redshift

Hi,
To be honest, this is a very new component, and I'm building a testing environment to see if I can reproduce your issue. I'll come back to you as soon as possible; sorry for the inconvenience.
Best regards
Sabrina
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
I've got exactly the same problem. Is there any solution? Inserting into Redshift is too slow.
Best Regards,
~Sergejs
One Star

Re: Slow Insertion in Amazon Redshift

Hi there,
We have the same problem: 4 to 8 rows per second.
We tested different sources, such as MySQL and PostgreSQL, but the problem is the same.
Next, we tried Talend Open Studio for Data Integration and Talend for Big Data, but still hit the problem.
We also tried a PostgreSQL bulk insert, but it failed with an error like this:
"Exception in component tPostgresqlOutputBulkExec_1_tPBE
org.postgresql.util.PSQLException: ERROR: COPY CSV is not supported"
Any help, please?
Thanks!
Leo
One Star

Re: Slow Insertion in Amazon Redshift

Hello,
We are facing the same problem. Our MySQL database is installed on Amazon EC2 (in the same region as our Redshift instance).
I have set the "Commit every" option to 10,000 in the tRedshiftOutput component and am not using any tMap component. It is also a plain select statement from MySQL.
For 10,300 rows (the table is only about 10 MB in MySQL) it took about 7-8 minutes, and for 440,000 rows (about 50 MB) it took about 7 hours.
I have tried the JDBC output component as well, but it didn't make any difference.
Is there any solution for increasing performance while using the Redshift component?
Right now the best approach I have found is writing the output to a flat file, uploading it to an S3 bucket, and using the COPY command to load it into Redshift. This takes less than a minute for the whole thing, but it is not very convenient and requires an external script.
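For reference, a minimal sketch of the COPY statement we run once the file is on S3 (the table, bucket, file name, and credentials below are placeholders):

copy target_table                  -- placeholder table name
from 's3://my-bucket/extract.csv'  -- placeholder bucket and file
credentials 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'  -- placeholder keys
delimiter ',';

The external script only has to dump the rows to a delimited file and upload it to the bucket before this statement runs.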
Thanks
Aditya
Moderator

Re: Slow Insertion in Amazon Redshift

Hi Aditya,
It would be appreciated if you could open a JIRA issue in the Talend DI project of the JIRA bugtracker. Our developers will check whether it is a bug and provide a solution.
Please post the JIRA issue link on the forum so that other community users can follow it.
Best regards
Sabrina
Employee

Re: Slow Insertion in Amazon Redshift

All,
I have reported this behaviour on JIRA; our R&D team will investigate.
The issue url is: https://jira.talendforge.org/browse/TDI-26155
Regards,
Moderator

Re: Slow Insertion in Amazon Redshift

Hi All,
Please vote for the JIRA issue https://jira.talendforge.org/browse/TDI-26155 created by adiallo and add your comments to it.
Best regards
Sabrina
Employee

Re: Slow Insertion in Amazon Redshift

Hi,
The current component uses a single INSERT statement per row to write into Redshift, which is completely inefficient according to the Redshift documentation and best practices.
There are several ways to fix this. One is the COPY command, which loads data files located on S3 or DynamoDB; you could run this command with the tRedshiftRow component. Another is the multi-row INSERT, which is going to be implemented by R&D in TDI-26155.
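To illustrate the multi-row INSERT (the table, columns, and values here are hypothetical, for illustration only): instead of sending one statement per row, many rows are batched into a single statement:

-- hypothetical table and values, for illustration only
insert into my_table (id, name) values
  (1, 'a'),
  (2, 'b'),
  (3, 'c');

For the COPY approach, see the statement Aditya posted above. Redshift's COPY loads from S3 or DynamoDB, which is presumably also why the PostgreSQL-style "COPY ... CSV" issued by tPostgresqlOutputBulkExec fails with "COPY CSV is not supported".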
Rémy.
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
Have any improvements been made to the Redshift components in the new version of Talend BD 5.4?
BR!
One Star

Re: Slow Insertion in Amazon Redshift

Any news on this? I am interested in using Talend to ETL from MySQL into Redshift.
I have gotten much faster performance by using Talend to pump out files to S3 and then using Amazon tools to pipe them into Redshift. The issue was that large files still took a while, with lots of I/O going to file and then up to the cloud. One could use Amazon's Data Pipeline, I suppose, but then we lose the rich features of Talend transformations.
Five Stars

Re: Slow Insertion in Amazon Redshift

I think these connectors don't have the bulk feature: on the input you're not able to set a cursor size, and on the output you're not able to set a batch size. Try the regular MySQL/PostgreSQL components, which do have these features.
We had something similar with Greenplum.
One Star

Re: Slow Insertion in Amazon Redshift

Now that the bulk feature exists, how do we use it?
One Star

Re: Slow Insertion in Amazon Redshift

Hi, did anyone find a solution for this? I am facing the same problem: reading data from MySQL and loading it into Redshift, but the jobs are too slow.
Moderator

Re: Slow Insertion in Amazon Redshift

Hi naveed_aq,
What does your job design look like? Could you please upload screenshots of your job design to the forum?
Best regards
Sabrina
One Star

Re: Slow Insertion in Amazon Redshift

I have attached the snapshot. tMap does not do anything fancy; it is just a one-to-one column mapping.
I am running Talend on an r3.8xlarge EC2 machine with a 2-compute-node structure, but the insertion speed is just a few hundred rows per second, mostly around 300.
Regards.
One Star

Re: Slow Insertion in Amazon Redshift

Hi,
Did anyone have any luck increasing the performance of loading data into Redshift?
Regards,
Shankar
Five Stars

Re: Slow Insertion in Amazon Redshift

Has anyone figured out the fastest way to move data from MSSQL (or MySQL) to Redshift?
Thanks,

Alok