Connect to PostgreSQL DB from Talend Spark batch job

One Star

Connect to PostgreSQL DB from Talend Spark batch job

Hello,
I have a Spark job whose input is a CSV file and whose output is an Avro file.
For the source and target folder_name and file_name, I want to use context variables whose values come from Postgres according to the current workflow.
Is there a way to connect to Postgres from a Spark batch job? The palette for this job type is very limited.
Thanks
Moderator

Re: Connect to PostgreSQL DB from Talend Spark batch job

Hi,

For RDS, you can use a Spark component to achieve this. Have you tried tJDBC (the generic JDBC component), a Spark component, to connect to your Postgres RDS instance?
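For reference, a generic JDBC read in a Spark batch job boils down to something like the hand-written sketch below. This is not Talend-generated code; the host, database, table, and credentials are placeholders, and it assumes the PostgreSQL JDBC driver is available on the job's classpath.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PostgresContextRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("postgres-context-read")
                .getOrCreate();

        // All connection details below are placeholders for your own Postgres/RDS endpoint.
        Dataset<Row> contextDf = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://my-rds-host:5432/etl_config")
                .option("dbtable", "workflow_context")
                .option("user", "etl_user")
                .option("password", "secret")
                .option("driver", "org.postgresql.Driver")
                .load();

        contextDf.show();
        spark.stop();
    }
}

From such a read you could collect the row holding folder_name and file_name for the current workflow and use those strings when reading the CSV source and writing the Avro target.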


Best regards


Sabrina

Thirteen Stars

Re: Connect to PostgreSQL DB from Talend Spark batch job

It looks like genuinely missing functionality:

JDBC can be used only for lookups
no tContextLoad
no possibility to load context in Job properties (as there is for all DI Jobs)

You can only hardcode context values in the Contexts tab, not load them.
Seven Stars

Re: Connect to PostgreSQL DB from Talend Spark batch job

I've been told by Talend R&D that it is their intent to keep tContextLoad out of Big Data batch jobs. There is a previous post by me about this if you search for it. You have to do your context loading in a standard job, then have tRunJob call the Big Data batch job. Then use a buffer if you need to send something back. A rough non-Talend sketch of that orchestration step is below.
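In plain Java terms (this is only an illustration, not Talend-generated code; the connection details, table, and column names are placeholders), the parent job's role is to look the values up in Postgres outside of Spark and then hand them to the Spark job, which is what tContextLoad plus tRunJob do for you.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ContextOrchestrator {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and query.
        String url = "jdbc:postgresql://my-rds-host:5432/etl_config";
        String sql = "SELECT folder_name, file_name FROM workflow_context WHERE workflow = ?";

        try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "my_workflow"); // placeholder workflow name
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    String folderName = rs.getString("folder_name");
                    String fileName = rs.getString("file_name");
                    // These two strings are what the Spark batch job would
                    // receive as context values from the parent job.
                    System.out.println(folderName + "/" + fileName);
                }
            }
        }
    }
}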
Thirteen Stars

Re: Connect to PostgreSQL DB from Talend Spark batch job

That is a workable variant, thanks.
But why, if they allow a context environment at all, do we need these clumsy extra steps just to load the values? :)
Question for Talend Team!
Seven Stars

Re: Connect to PostgreSQL DB from Talend Spark batch job

I agree. I found the more detailed answer from Talend that I referenced earlier:
"So to confirm, tContextLoad is unavailable as a component for Big Data Batch jobs.
But you can still use tContextLoad with a Big Data Batch job by gluing a DI job to a tRunJob that calls the Spark job.
The DI job does the tContextLoad as part of the preparation and sends the values to the Spark job.
The reasoning behind this is that, because of the way big data jobs are structured, the context can't be modified in a Spark component.
Just like you can't use the global map in a Spark job, both the global map and a "modifiable" context are a kind of "job state" that can't be distributed.
(R&D says there ARE techniques to do this, but it's almost always a bad idea!)
R&D says they think of DI jobs as "orchestrators" that set up the Spark job to be run."
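
To make the "job state can't be distributed" point concrete, here is a small hand-written Spark sketch in plain Java (nothing Talend-specific, and the class and variable names are my own): mutations made inside a distributed operation land on serialized copies of the driver's objects, not on the originals, which is why a modifiable context has no sensible meaning inside a Spark component.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MutableStatePitfall {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("mutable-state-pitfall")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> collected = new ArrayList<>(); // driver-side "context"

            sc.parallelize(Arrays.asList("a", "b", "c"))
              .foreach(collected::add);   // executed on executors against copies

            // On a cluster this prints 0: the additions happen on copies of the
            // list that Spark serialized out with each task, and they never come
            // back to the driver. (Even in local mode the result is not guaranteed.)
            System.out.println("driver-side size: " + collected.size());
        }
    }
}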

This might make sense if they want to keep Spark jobs modular, functional, and narrowly scoped, which would fit a true Scala/Spark application, but it doesn't fit Talend's component-based, ease-of-development model well.