Data Stream S3 connection Not Reading Files

Five Stars


I have successfully created an S3 connection and two datasets.  One points to a CSV file and the other points to a Parquet file.  When placed in the pipeline, I get errors and warnings:

 

CSV: Error: The sample is empty. Can't compute the preview. Cannot fetch sample from dataset

Parquet: Warning: Sample is not available
 
I have confirmed that the files are not empty (they are very small, but not empty) and I have also confirmed that the user ID for the connection has Read/Write access to S3.
Employee

Re: Data Stream S3 connection Not Reading Files

Hello!  Just for info -- there is a community forum dedicated to Data Streams here: https://community.talend.com/t5/Data-Streams/bd-p/Data-Streams

 

Are you using the Data Streams AMI?  We have fixed some problems with sample latency since the AMI preview -- in the meantime, you can usually get things running by fetching the sample inside the dataset form (i.e. open the S3 dataset, and click on the Get Sample button, and make sure that it returns a result).  After this step, the pipeline designer should work without the error.

 

Another technique is to try running your job to see if it produces output -- the logs might help determine whether this is a configuration issue with the component or a bug in the preview feature.

 

Let me know if this helps, Ryan

Five Stars

Re: Data Stream S3 connection Not Reading Files

Thanks for your response.  I originally posted this in the Data Streams forum, and it was moved to this forum by xdshi.

 

I am using the Data Streams for AWS.

 

When I tried to sample the data in the dataset itself, I got the error "Sample Not Available".

 

I have confirmed that the file is still in its location.

Employee

Re: Data Stream S3 connection Not Reading Files

Hello! My apologies for the forum shuffling -- since Data Streams is fairly new, it's sometimes not entirely clear where a question "belongs"! We can try to debug this here of course.

It's hard to tell exactly what the problem is, but if it doesn't work in the dataset sample, we can rule out a bug with the preview feature!

Focusing on the CSV use case first, can you check:

Does the "Test Connection" button in the associated S3 Connection work?

Assuming that you've verified the bucket and path for the S3 Dataset -- is it set to use a path (/mydataset/input/) or a single file (/mydataset/input/myfile.csv)? Both should work, but if it's a path, can you try a single file to see if it's accessible?

Can you verify that there aren't any unwanted spaces around any values? When you copy/paste paths from the Amazon console, it's pretty easy to have "accidental" inserts.

Would it be possible to run a pipeline with the CSV S3 Dataset as an input? The run logs can provide some extra information to help debug the problem. (Quick tip: you can create a Custom Data connection and Custom Data dataset and use it as an output. It will allow the job to be run but discard all data).
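
The whitespace check above can also be done outside the UI. The following is only a minimal sketch: the function name `find_whitespace_issues` and the field names are hypothetical, and it assumes you paste in the same bucket and path strings you typed into the S3 Dataset form. Embedded ordinary spaces are not flagged, since they can be legitimate in S3 paths.

```python
def find_whitespace_issues(fields):
    """Return {field_name: description} for values that contain stray whitespace.

    `fields` maps a label (for example "bucket" or "path") to the string
    that was pasted into the dataset form.
    """
    issues = {}
    for name, value in fields.items():
        if value != value.strip():
            # str.strip() also removes non-breaking spaces at either end
            issues[name] = f"leading or trailing whitespace: {value!r}"
        elif any(ch in value for ch in ("\t", "\n", "\r", "\u00a0")):
            issues[name] = f"embedded tab, newline, or non-breaking space: {value!r}"
    return issues


# Hypothetical values, for illustration only
example = {"bucket": "my-bucket ", "path": "/mydataset/input/myfile.csv"}
for field, problem in find_whitespace_issues(example).items():
    print(field, "->", problem)
```

An empty result only means that no stray whitespace was found; it says nothing about whether the bucket or path actually exists.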

I hope we can figure this out. Even if it's a simple configuration error, we should be providing better feedback to the user!
Five Stars

Re: Data Stream S3 connection Not Reading Files

Thanks again for your response.  Here is what I tried next (with no success):

 

Yes, the connection works when testing.  I get a green pop up box with "Connected" in it. 

 

When I create the dataset, after putting in the connection and the region, when I type in the bucket, the list of all matching buckets appears, so I know it can see the folders and files. 

 

I tried both the folder-level and file-level options, but I usually have it set to a specific file to ensure it is going to the right location.  The documentation does not say to put a leading slash on the file, but I tried both ways without success. 


There were no spaces in the path or file names.

I tried running a pipeline as suggested with just a small sample set of data in a sample folder.  I get the same error that it can't preview the data.  When I run the full pipeline, I still get errors and no visible data through the process.  I am attaching the log file.

 

Employee

Re: Data Stream S3 connection Not Reading Files

Great! Thanks for checking all this!

 

It's very curious because the error message is com.talend.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: XXXXXXX -- the Error Code doesn't match any of the strings in https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList. (It repeats the HTTP status code instead.)
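
To make that comparison concrete, here is a rough sketch that pulls the Status Code and Error Code fields out of an exception message shaped like the one in the log and checks the code against a few documented S3 error codes. The function names are hypothetical, and `KNOWN_S3_ERROR_CODES` is a small, deliberately incomplete subset of the list on the page linked above.

```python
import re

# A small, deliberately incomplete subset of the documented S3 error codes.
KNOWN_S3_ERROR_CODES = {
    "AccessDenied",
    "AuthorizationHeaderMalformed",
    "InvalidAccessKeyId",
    "InvalidRequest",
    "NoSuchBucket",
    "NoSuchKey",
    "PermanentRedirect",
    "SignatureDoesNotMatch",
}


def parse_s3_exception(message):
    """Extract the 'Status Code' and 'Error Code' fields from the message text."""
    status = re.search(r"Status Code:\s*(\d+)", message)
    code = re.search(r"Error Code:\s*([^;)]+)", message)
    return {
        "status_code": int(status.group(1)) if status else None,
        "error_code": code.group(1).strip() if code else None,
    }


def is_documented_code(error_code):
    """True if the code is in the (partial) list of documented S3 error codes."""
    return error_code in KNOWN_S3_ERROR_CODES


log_line = (
    "com.talend.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: "
    "Bad Request (Service: Amazon S3; Status Code: 400; "
    "Error Code: 400 Bad Request; Request ID: XXXXXXX"
)
parsed = parse_s3_exception(log_line)
print(parsed, is_documented_code(parsed["error_code"]))
```

For the message above, the extracted error code is the string "400 Bad Request", which is not one of the documented codes.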

 

I typically use the leading slash in path names, but it isn't mandatory.  I believe both are acceptable.

 

I can't seem to duplicate your error unfortunately -- I've tried with a number of configurations without problems.

 

Can you try accessing a public S3 file? Create a new S3 Dataset from your connection with:

 

Bucket: nyc-tlc

 

Path: /trip data/yellow_tripdata_2018-01.csv

 

You can leave all the other parameters as the default. You should be able to get a sample from this taxi dataset!
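
If you want to check the same public file from outside Data Streams, a sketch along these lines builds the object URL and requests only the first few bytes. It rests on several assumptions: that the bucket allows anonymous HTTP reads (which may not hold), that the leading slash in the dataset path is not part of the S3 object key, and that the virtual-hosted-style URL form applies. The function names are made up for the example.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def public_s3_url(bucket, path):
    """Build a virtual-hosted-style URL for an object in a public bucket."""
    key = path.lstrip("/")  # assumption: the leading slash is not part of the key
    return f"https://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


def fetch_first_bytes(url, count=1024):
    """Request only the first `count` bytes so a large CSV is not downloaded in full."""
    request = Request(url, headers={"Range": f"bytes=0-{count - 1}"})
    with urlopen(request, timeout=30) as response:
        return response.read()


url = public_s3_url("nyc-tlc", "/trip data/yellow_tripdata_2018-01.csv")
print(url)  # the space in "trip data" is percent-encoded as %20
# fetch_first_bytes(url) performs the actual network request; it is left
# uncalled here because it depends on the bucket's current access settings.
```

Note the space in "trip data": it is part of the path and has to be kept exactly as written in the dataset form.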

 

If this works, it might be some very subtle issue with S3 authentication -- we've hit some odd cases in the past with some quite old credentials that were flaky on some buckets, but didn't find any pattern. Would you be willing to try generating new S3 credentials and giving it another try?

Five Stars

Re: Data Stream S3 connection Not Reading Files

I was able to access the public file, thank you.

 

I tried recreating the security credentials as well as re-linking the role to the user, but I'm still getting the "Sample Not Available" error when I try to preview.

 

Is there possibly something in the EC2 Security group that would cause this issue?  Right now it is limited to one IP inbound and all IPs outbound.

Employee

Re: Data Stream S3 connection Not Reading Files

Once again, thanks for giving my suggestions a try!  There seems to be something subtle going on -- if you can access the public dataset *and* fetch the S3 buckets associated with your credentials, that normally should mean that everything is in place for an S3 file read to succeed!  It looks like your security group should be OK.

 

I'm afraid I don't have any additional ideas... is there anything different about your S3 bucket or file that might explain this error?  Do you have any server-side encryption turned on, or anything else particular?  I'd be keen to find any leads for this problem so we can reproduce!
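
One way to answer the encryption question is to look at the object's metadata. This is a sketch only, assuming the third-party boto3 library is installed and the same credentials are configured locally; `describe_encryption` and `check_object` are hypothetical helper names. It reads the `ServerSideEncryption` and `SSEKMSKeyId` fields that boto3's `head_object` response includes when an object is encrypted server-side.

```python
def describe_encryption(head_response):
    """Summarise the server-side encryption fields of a head_object response dict."""
    sse = head_response.get("ServerSideEncryption")
    if sse is None:
        return "no server-side encryption reported"
    kms_key = head_response.get("SSEKMSKeyId")
    if kms_key:
        return f"server-side encryption: {sse} (KMS key {kms_key})"
    return f"server-side encryption: {sse}"


def check_object(bucket, key):
    """Fetch the object's metadata with boto3 and describe its encryption."""
    import boto3  # third-party; imported here so the helper above works without it

    s3 = boto3.client("s3")
    return describe_encryption(s3.head_object(Bucket=bucket, Key=key))


# Offline illustration with a hand-written response dict (not real output)
print(describe_encryption({"ServerSideEncryption": "AES256"}))
```

Calling `check_object("my-bucket", "input/myfile.csv")` (placeholder names) would run the real lookup; a failure at that step would itself be a useful lead.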

 

All my best regards, Ryan

Five Stars

Re: Data Stream S3 connection Not Reading Files

Yes, the files were originally encrypted; however, I created a new bucket with no encryption and placed a small sample file in the bucket and ensured there was no encryption and I still got the same errors.  I'll play around with it some more to see if there is anything that I can do to make it recognize the files in my own buckets and I'll let you know if I find anything that works.  Thanks again for your help.
