I have successfully created an S3 connection and two datasets. One points to a CSV file and the other points to a Parquet file. When I place them in a pipeline, I get errors and warnings.
Hello! Just for info -- there is a community forum dedicated to Data Streams here: https://community.talend.com/t5/Data-Streams/bd-p/Data-Streams
Are you using the Data Streams AMI? We have fixed some problems with sample latency since the AMI preview -- in the meantime, you can usually get things running by fetching the sample inside the dataset form (i.e. open the S3 dataset, and click on the Get Sample button, and make sure that it returns a result). After this step, the pipeline designer should work without the error.
Another technique is to try running your job to see if it produces output -- the logs might help determine whether this is a configuration issue with the component or a bug in the preview feature.
Let me know if this helps, Ryan
Thanks for your response. I originally posted this in the Data Streams forum, and it was moved to this forum by xdshi.
I am using the Data Streams for AWS.
When I tried to sample the data in the dataset itself, I got the error "Sample Not Available".
I have confirmed that the file is still in its location.
Thanks again for your response. Here is what I tried next (with no success):
Yes, the connection works when testing. I get a green pop-up box with "Connected" in it.
When I create the dataset, after putting in the connection and the region, when I type in the bucket, the list of all matching buckets appears, so I know it can see the folders and files.
I tried both the folder-level and the file-level options, but I usually have it set to a specific file to ensure it is going to the right location. The documentation does not say to put a leading slash on the file, but I tried both ways without success.
There were no spaces in the path or file names.
I tried running a pipeline as suggested with just a small sample set of data in a sample folder. I get the same error saying that it can't preview the data. When I run the full pipeline, I still get errors and no visible data through the process. I am attaching the log file.
Great! Thanks for checking all this!
It's very curious because the error message is com.talend.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: XXXXXXX -- the Error Code doesn't match any of the strings in https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList. (It repeats the HTTP status code instead.)
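To spot this kind of mismatch in a longer log, here is a minimal Python sketch that pulls the fields out of an AWS SDK exception message and flags an Error Code that only repeats the HTTP status. The function names are made up for illustration, and the sample string is copied from the message quoted above, with the request ID left redacted.

```python
import re

# Message copied from the log excerpt above; the request ID stays redacted.
MESSAGE = (
    "com.talend.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: "
    "Bad Request (Service: Amazon S3; Status Code: 400; "
    "Error Code: 400 Bad Request; Request ID: XXXXXXX"
)

# The "Key: value" pairs that the AWS SDK puts inside the parentheses.
FIELD_PATTERN = re.compile(
    r"(Service|Status Code|Error Code|Request ID):\s*([^;)]+)"
)


def parse_s3_exception(message):
    """Return the 'Key: value' fields of an AWS SDK exception message as a dict."""
    return {key: value.strip() for key, value in FIELD_PATTERN.findall(message)}


def error_code_repeats_status(fields):
    """True when the Error Code merely starts with the numeric HTTP status."""
    status = fields.get("Status Code")
    code = fields.get("Error Code", "")
    return status is not None and code.startswith(status)


if __name__ == "__main__":
    fields = parse_s3_exception(MESSAGE)
    print(fields)
    print("Error Code repeats HTTP status:", error_code_repeats_status(fields))
```

A named S3 error such as AccessDenied would make the second function return False, which is what separates it from the generic case seen here.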
I typically use the leading slash in path names, but it isn't mandatory. I believe both are acceptable.
I can't seem to duplicate your error unfortunately -- I've tried with a number of configurations without problems.
Can you try accessing a public S3 file? Create a new S3 Dataset from your connection with:
Path: /trip data/yellow_tripdata_2018-01.csv
You can leave all the other parameters as the default. You should be able to get a sample from this taxi dataset!
If this works, it might be some very subtle issue with S3 authentication -- we've hit some odd cases in the past with some quite old credentials that were flaky on some buckets, but didn't find any pattern. Would you be willing to try generating new S3 credentials and giving it another try?
I was able to access the public file, thank you.
I tried recreating the security credentials as well as re-linking the role to the user, but I'm still getting the "Sample Not Available" error when I try to preview.
Is there possibly something in the EC2 Security group that would cause this issue? Right now it is limited to one IP inbound and all IPs outbound.
Once again, thanks for giving my suggestions a try! There seems to be something subtle going on -- if you can access the public dataset *and* fetch the S3 buckets associated with your credentials, that normally should mean that everything is in place for an S3 file read to succeed! It looks like your security group should be OK.
I'm afraid I don't have any additional ideas... is there anything different about your S3 bucket or file that might explain this error? Do you have any server-side encryption turned on, or anything else particular? I'd be keen to find any leads for this problem so we can reproduce!
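If it helps to check the encryption question outside Data Streams, here is a rough Python sketch. It assumes the third-party boto3 library is installed and configured with the same credentials. The helper names are invented for illustration, and the bucket and key in the usage note are placeholders. It relies on the ServerSideEncryption and SSEKMSKeyId fields that an S3 HeadObject response carries when an object is stored with server-side encryption.

```python
def describe_encryption(head_response):
    """Summarise the server-side encryption fields of a HeadObject response dict."""
    sse = head_response.get("ServerSideEncryption")
    if sse is None:
        return "no server-side encryption reported"
    kms_key = head_response.get("SSEKMSKeyId")
    if kms_key:
        return f"encrypted with {sse} (KMS key {kms_key})"
    return f"encrypted with {sse}"


def fetch_head(bucket, key):
    """Call S3 HeadObject for one object (needs boto3, credentials, and network)."""
    import boto3  # third-party; imported here so the helper above works without it

    return boto3.client("s3").head_object(Bucket=bucket, Key=key)
```

A call might look like `print(describe_encryption(fetch_head("my-bucket", "sample/file.csv")))`, where both names are placeholders for your own bucket and object key.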
All my best regards, Ryan
Yes, the files were originally encrypted. However, I created a new bucket with no encryption, placed a small unencrypted sample file in it, and still got the same errors. I'll play around with it some more to see if there is anything I can do to make it recognize the files in my own buckets, and I'll let you know if I find anything that works. Thanks again for your help.