One Star

S3 to Redshift issues.

I am trying to move data from several files in an S3 bucket into Redshift tables. I have tried several options and would like to know the best approach to use.
Note that I can copy the files from S3 to Redshift tables without Talend and it works great. However, I want to use Talend because I will be using this for incremental loads (without dropping or clearing the tables).
I have tried the following options.
1. tS3List --> tRedshiftBulkExec    : fails with a "Delimiter not found" error. I am still trying to identify why this occurs, since the same COPY works when I run it manually (see the sketch below this list).
- tRedshiftBulkExec requires me to re-enter my S3 credentials. This is kind of crazy, as I already have an S3 connection, and it would be great if the component could just reuse an existing connection.
2. tRedshiftRow    : I enter the COPY command here and it works. However, this is not ideal when using an ETL tool; I could just as well use basic scripting. Ideally, option 1 should work.
3. S3 --> tS3Get (writes to local disk, ugh!!) --> tFileInput (from local) --> tRedshiftOutput    (defeats the purpose of loading directly from S3 into Redshift).
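For reference, the manual COPY that works for me looks roughly like this (table, bucket, and key values are placeholders):

COPY my_table
FROM 's3://my-bucket/incoming/data.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ',';

If the delimiter configured on the component does not match the one actually used in the files, that alone can produce the "Delimiter not found" error.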
What is the best way to accomplish this using Talend Open Studio for Big Data version 6.1.1?
2 REPLIES
One Star

Re: S3 to Redshift issues.

tRedshiftRow is the best way. While it is "not ideal when using an ETL tool", it is ideal for Redshift. Routing the files through Talend adds an unnecessary step, and you lose the parallelism of a direct S3-to-Redshift COPY.
Also keep in mind that a Redshift update is internally a delete plus an insert, and the deleted rows stick around until you vacuum the table. Consider adding a tRedshiftRow step for the VACUUM to avoid performance degradation.
A complete process may look like:
S3 -> staging -> update -> vacuum
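As a rough sketch in Redshift SQL (table, column, and bucket names are placeholders; each statement could be its own tRedshiftRow, or tRedshiftBulkExec for the COPY):

-- load the incremental files into a staging table
CREATE TEMP TABLE stage (LIKE target_table);

COPY stage
FROM 's3://my-bucket/incremental/'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ',';

-- the "update": delete the old versions of the rows, then insert the new ones
BEGIN;
DELETE FROM target_table USING stage WHERE target_table.id = stage.id;
INSERT INTO target_table SELECT * FROM stage;
COMMIT;

-- reclaim the space left by the deleted rows (must run outside a transaction)
VACUUM target_table;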
If you haven't already, I suggest testing whether a truncate / insert process is faster than an update, as the Redshift architecture is optimized for bulk processing.
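If you do test that, the truncate / insert variant is just a full reload (same placeholders as above):

TRUNCATE target_table;

COPY target_table
FROM 's3://my-bucket/full/'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER ',';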