DataStream & Bigquery

Six Stars


NOTE: Talend Data Streams has been replaced with Pipeline Designer. More information about this new Talend application is available in the Pipeline Designer introduction.




I'm trying to create a connection in Data Streams, but I can't establish a successful connection.


How do I fill in the Service Account File and Temp Location fields?


I tried a link to my p12 file and a GS bucket for the temp location, but I always get "Internal Error" when I check the connection.




Re: DataStream & Bigquery

Hi pierre,


Thanks for using Talend Data Streams and welcome to our community!


To be able to use our Google BigQuery connector, you will have to connect to the EC2 instance using SSH and upload your service_account.json into /opt/data-streams/data/extras.


Once the service_account.json file has been uploaded to the AMI, you need to point to /opt/data-streams/data/extras/service_account.json in the Google BigQuery connector parameters.


The temporary storage should be a path to one of your GS buckets that will be used for temporary files when submitting the jobs.
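Putting the two previous points together, the connector fields would end up looking something like this (the bucket name below is a placeholder for illustration, not a value from this thread):

```
Service account file : /opt/data-streams/data/extras/service_account.json
Temporary storage    : gs://my-temp-bucket/tmp
```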





Six Stars

Re: DataStream & Bigquery

Hello Cyril, I'll try tomorrow; I deleted my instance.


I have another question, I'll post it soon ;)




Re: DataStream & Bigquery

No problem Pierre. Don't hesitate to use our forum should you have any other questions!

Six Stars

Re: DataStream & Bigquery



The connection works with the JSON file in that directory, but the dataset wasn't available.


(screenshots attached)

Everything seems to be OK; do you have any idea why it doesn't work?


Re: DataStream & Bigquery

Hello! It's a bit tricky to see why this is happening -- obviously we should be bringing better error messages up to the front to help debug these issues.

As a quick check, can you make sure that your Data Location and GS bucket for the temp location are in the same region (I can see that the BigQuery table is in the US)? And of course that the service account has write permissions to the temp location!

If these are OK, would it be possible to share your BigQuery schema?

Hopefully we can get this working for you quickly!

Six Stars

Re: DataStream & Bigquery

Hello Ryan,


It works! It wasn't clear that the temp location had to be cloud storage; before, I was trying /tmp/ on the EC2 instance ;)


(screenshot attached)

That's better, I have the Avro file now.


The preview doesn't work, and it's complicated to manipulate the structure.


(screenshot attached)


I save my data to a file in S3; why can't I choose the destination file's name? The path in the dataset seems to be a directory.


(screenshot attached)




Re: DataStream & Bigquery

You successfully got a sample from the BigQuery dataset? In general, exporting from BigQuery is a really slow operation (sometimes minutes). Once you have a sample, you should see pretty good performance on the preview tabs.

Is it possible that the dataset was modified in between clicking the Get Sample button and going to the job to preview? This would cause the preview to be slow while waiting for the sample to be re-fetched. For the datasets with really high latency (like BigQuery), I recommend getting the sample explicitly from the dataset form whenever you make a modification and before using it in a pipeline. We are already working on improving this experience!

I will check on the S3 input path for you, but it seems to me that you should be able to specify a single *input* file! I know that this will have "unexpected" consequences if you use the dataset as an *output*, however (like creating a `/user/output/my_file.avro/part-r-00000` as the output file).
Six Stars

Re: DataStream & Bigquery

At the moment, the preview in the pipeline works after a few tries :)


(screenshot attached)


I have another issue: the data preview was not propagated.


(screenshot attached)


If I add a column in Python, I can't manipulate it (in Aggregate, for example).

(screenshot attached)



Can you explain more about "I recommend getting the sample explicitly from the dataset form"? Can I define the dataset structure explicitly?


To modify data, is Python code the only option? Do you support only the Apache Beam Python SDK, not Java?


I'll go test the other connectors; if necessary, I can help you with Google Cloud.



Re: DataStream & Bigquery

Thanks for the detail in your response! It helps quite a bit in debugging.


It looks like you have your PythonRow set to FLATMAP when you really want MAP. Try changing it and see if it fills out your preview better! If you use FLATMAP without adding records to outputList in your user-defined code, it will just filter out all of the inputs.


As a reminder: for FLATMAP, set the Python variable outputList to an array containing 0..n output records (i.e., Python objects) for each input.


For MAP, set the Python variable output in your user-defined code; there should be exactly one output record per input.
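To make the contract concrete, here is a rough sketch of the two modes written as plain Python functions. This is an illustration only: the actual PythonRow editor exposes `input`, `output`, and `outputList` as variables in the script body, not as function arguments, and the field names (`qty`, `price`, `total`) are made up.

```python
# Sketch of PythonRow MAP vs FLATMAP semantics, assuming dict-shaped records.

# MAP: produce exactly one output record per input record.
def map_style(input):
    output = dict(input)                              # copy the incoming record
    output['total'] = input['qty'] * input['price']   # add a derived column
    return output

# FLATMAP: produce 0..n output records per input record.
# Appending nothing to outputList silently drops the record, which is
# why a FLATMAP that never populates outputList filters out every row.
def flatmap_style(input):
    outputList = []
    if input['qty'] > 0:
        outputList.append(dict(input))
    return outputList
```

A FLATMAP that never appends to outputList produces an empty preview, which matches the behavior described above; switching to MAP (or populating outputList) restores the output.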


For your questions, we're using the Java SDK and the PythonRow is implemented with Jython.  Upcoming features in Beam should help mix languages in the same pipeline, but it's a long-term feature not yet available in Beam.


"I recommend getting the sample explicitly from the dataset form" --> My apologies, I just meant clicking the Get Sample button manually after a dataset change to make sure that the sample has been correctly retrieved.  You can't set a schema in the dataset, except for those datasets that specify their query.


Finally, when you create a new column in PythonRow, it won't (yet) show up in the autocomplete box in the next Aggregate.  You can still use it, but you have to enter it manually!


Please be assured that we're paying attention to questions and feedback -- the schema autocomplete should be fixed in the next release, and we've already opened a discussion for improving the user experience of PythonRow. I haven't had a chance to look at your previous question about S3 yet.


Thanks!  Ryan

