DataStream & BigQuery

Six Stars

DataStream & BigQuery

Hi,

 

I'm trying to create a connection in Data Streams, but I can't get a successful connection.

 

How do I fill in the Service Account File and Temp Location fields?

 

I tried a path to my .p12 file and a GS bucket for the temp location, but I always get an "Internal Error" when I check the connection.

 

Thanks

Employee

Re: DataStream & BigQuery

Hi Pierre,

 

Thanks for using Talend Data Streams and welcome to our community!

 

To use our Google BigQuery connector, you will have to connect to the EC2 instance using SSH (https://help.talend.com/reader/U~NvWT4juBbI~~2xEeN56g/uATvEWVxSYcttokqpc1MsA) and upload your service_account.json (https://cloud.google.com/compute/docs/access/service-accounts) into /opt/data-streams/data/extras.
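If you prefer to script the upload instead of using scp, a quick sketch with the paramiko library might look like this (the hostname, user, and key file below are placeholders):

```python
# Upload service_account.json to the Data Streams AMI over SSH/SFTP.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("ec2-xx-xx-xx-xx.compute.amazonaws.com",  # placeholder hostname
            username="ec2-user",                      # placeholder user
            key_filename="my-ec2-key.pem")            # placeholder key pair

sftp = ssh.open_sftp()
sftp.put("service_account.json",
         "/opt/data-streams/data/extras/service_account.json")
sftp.close()
ssh.close()
```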

 

Once the service_account.json file has been uploaded to the AMI, you need to point to /opt/data-streams/data/extras/service_account.json in the Google BigQuery connector parameters.
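As a quick sanity check after the upload, you can parse the key file on the instance to make sure it survived the transfer intact, for example:

```python
# Run on the AMI: verify the uploaded key file is valid JSON.
import json

with open("/opt/data-streams/data/extras/service_account.json") as f:
    key = json.load(f)  # raises ValueError if the file was corrupted

# client_email is a standard field of a GCP service-account key.
print("The connector will authenticate as:", key["client_email"])
```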

 

The temporary storage should be a path to one of your GS buckets (for example, gs://your-bucket/tmp); it will be used for temporary files when submitting jobs.

 

Cheers,

 

Cyril.

Six Stars

Re: DataStream & BigQuery

Hello Cyril, I'll try tomorrow; I deleted my instance.

 

I have another question, I'll post it soon ;)

 

Thanks

Employee

Re: DataStream & BigQuery

No problem, Pierre. Don't hesitate to use our forum should you have any other questions!

Six Stars

Re: DataStream & BigQuery

Hello,

 

The connection works with the JSON file in that directory, but the dataset isn't available.

 

[Screenshots: Image 2329.png, Image 2328.png, Image 2330.png]

Everything seems to be OK; do you have any idea why it doesn't work?

Employee

Re: DataStream & BigQuery

Hello! It's a bit tricky to see why this is happening -- obviously we should be bringing better error messages up to the front to help debug these issues.

As a quick check, can you make sure that your Data Location and GS bucket for temp location are in the same region (I can see that the BigQuery table is in the US)? And of course that the service account has write permissions to the temp location!
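If you want to verify both programmatically, here is a quick sketch using the Google Cloud client libraries (the dataset and bucket names are placeholders to replace with your own):

```python
# Compare the BigQuery dataset location with the temp bucket location.
from google.cloud import bigquery, storage

KEY = "service_account.json"  # the same key file used by the connector

bq = bigquery.Client.from_service_account_json(KEY)
gcs = storage.Client.from_service_account_json(KEY)

dataset = bq.get_dataset("my-project.my_dataset")  # placeholder dataset
bucket = gcs.get_bucket("my-temp-bucket")          # placeholder bucket

print("BigQuery dataset location:", dataset.location)  # e.g. "US"
print("Temp bucket location:", bucket.location)        # should match
```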

If these are OK, would it be possible to share your BigQuery schema?

Hopefully we can get this working for you quickly!

Ryan
Six Stars

Re: DataStream & BigQuery

Hello Ryan,

 

It works! It wasn't clear that the temp location had to be Cloud Storage; before, I had tried /tmp/ on the EC2 instance ;)

 

[Screenshot: Image 2348.png]

It's better, I have Avro files now.

 

Preview doesn't work, and it's complicated to manipulate the structure.

 

[Screenshot: Image 2349.png]

 

I save my data to S3; why can't I choose the destination file's name? The path in the dataset seems to be a directory.

 

[Screenshot: Image 2350.png]

 

 

Employee

Re: DataStream & BigQuery

You successfully got a sample from the BigQuery dataset? In general, exporting from BigQuery is a really slow operation (sometimes minutes). Once you have a sample, you should see pretty good performance on the preview tabs.

Is it possible that the dataset was modified in between clicking the Get Sample button and going to the job to preview? This would cause the preview to be slow while waiting for the sample to be re-fetched. For the datasets with really high latency (like BigQuery), I recommend getting the sample explicitly from the dataset form whenever you make a modification and before using it in a pipeline. We are already working on improving this experience!

I will check on the S3 input path for you, but it seems to me that you should be able to specify a single *input* file! I know that this will have "unexpected" consequences if you use the dataset as an *output*, however (like creating `/user/output/my_file.avro/part-r-00000` as the output file).
Six Stars

Re: DataStream & BigQuery

This time, the preview in the pipeline works after a few tries :)

 

[Screenshot: Image 2351.png]

 

I have another issue: the data preview is not propagated.

 

[Screenshot: Image 2352.png]

 

If I add a column in Python, I can't manipulate it (in Aggregate, for example).

[Screenshot: Image 2355.png]

 

 

Can you explain more about "I recommend getting the sample explicitly from the dataset form"? Can I define the dataset structure explicitly?

 

To modify data, is Python code the only option? Do you support only the Apache Beam Python SDK, not Java?

 

I'll go test the other connectors; if necessary, I can help you with Google Cloud.

 

Employee

Re: DataStream & BigQuery

Thanks for the detail in your response! It helps quite a bit in debugging.

 

It looks like you have your PythonRow set to FLATMAP when you really want MAP. Try changing it and see if it fills out your preview better! If you use FLATMAP without adding records to outputList in your user-defined code, it will simply filter out all of the inputs.

 

As a reminder: for FLATMAP, set the Python variable outputList to an array containing 0..n output records (i.e., Python objects) for each input.

 

For MAP, set the Python variable output in your user-defined code; there should be exactly one output record per input.
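To make that concrete, here is a minimal sketch of the two modes, assuming the incoming record is exposed to your script as input (as in the processor's default template); the field names are invented for the example:

```python
# MAP mode: assign exactly one record to `output` per input record.
output = dict(input)                           # start from a copy of the input
output['greeting'] = 'Hello ' + input['name']  # invented field names

# FLATMAP mode: assign a list of 0..n records to `outputList` instead.
outputList = []
for item in input['items']:                    # invented repeated field
    outputList.append({'id': input['id'], 'item': item})
# If outputList stays empty, the input record is filtered out entirely.
```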

 

To answer your questions: we're using the Java SDK, and PythonRow is implemented with Jython. Upcoming features in Beam should help mix languages in the same pipeline, but that's a long-term feature not yet available in Beam.

 

"I recommend getting the sample explicitly from the dataset form" --> My apologies, I just meant clicking the Get Sample button manually after a dataset change to make sure that the sample has been correctly retrieved.  You can't set a schema in the dataset, except for those datasets that specify their query.

 

Finally, when you create a new column in PythonRow, it won't (yet) show up in the autocomplete box in the next Aggregate.  You can still use it, but you have to enter it manually!

 

Please be assured that we're paying attention to questions and feedback -- the schema autocomplete should be fixed in the next release, and we've already opened a discussion for improving the user experience of PythonRow. I haven't had a chance to look at your previous question about S3.

 

Thanks!  Ryan