tContextLoad component in Talend Big Data

Six Stars

tContextLoad component in Talend Big Data

Hi All,

 

I am using Talend Big Data 6.3.1 Enterprise edition. In the repository we have two job designs: 1) Standard and 2) Big Data Batch.

We designed our jobs as Big Data Batch jobs, but we are not able to use the tContextLoad component.

 

Is it that this component cannot be used here? When I place the component in the workspace it shows as missing. Please find the screenshot attached.

We are able to use the tContextLoad component in a Standard job, but not in a Big Data Batch job. We are using Spark.

 

Thanks,

Lmit

 

 


Accepted Solutions
Six Stars

Re: tContextLoad component in Talend Big Data

Thanks for your reply Irshad.

 

Thanks,

lmit


All Replies
Employee

Re: tContextLoad component in Talend Big Data

Hi,

tContextLoad is available in Standard ETL only.  

 

In Spark Batch and Spark Streaming, you need to think and design differently because of the parallel, distributed processing of these frameworks. You should use context variables only to initialise the job, and that initialisation should happen as soon as the job begins, meaning the variable values should be passed to the job from the TAC. This is because a Spark Batch or Spark Streaming job runs on many nodes and has many executor contexts, so you cannot rely on a context variable: it is no longer global across all executor contexts. Once some Spark logic starts executing in an executor context, any attempt to manipulate the context variable is local to that executor context only, not to the whole job. That is why we do not provide these components in Spark Batch and Spark Streaming: to keep Talend developers from using an anti-pattern.
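To make the executor-context point concrete, here is a minimal sketch using the plain Spark Java API (outside Talend; the class name and values are invented for illustration). It shows that a value mutated inside a Spark action does not come back to the driver on a cluster:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorLocalCopies {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("executor-local-copies")
                .setMaster("local[2]");   // local master only so the sketch runs standalone

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for a context-style value the job starts with.
            final int[] runningTotal = {0};

            // This closure is serialized and shipped to the executors. On a cluster,
            // each executor mutates its own deserialized copy of runningTotal; the
            // driver's copy is never touched. (In local mode the result can differ
            // because everything runs in one JVM.)
            sc.parallelize(Arrays.asList(1, 2, 3, 4))
              .foreach(x -> runningTotal[0] += x);

            // On a cluster this still prints 0 - updates made inside executors
            // never flow back, which is why mid-job context manipulation is unsafe.
            System.out.println("driver's copy after foreach: " + runningTotal[0]);
        }
    }
}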

 

You need to work out how to read such variables yourself. One option is to read them from a database and load the values into an RDD. You can then access that RDD from all your nodes; it will be in memory and fast. From there you can read the values and apply them to the globalMap, etc. We have not abstracted this logic yet because it creates an RDD, and since RDDs are immutable, each time you update a variable value you create another RDD.
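As a rough illustration of that approach (plain Spark Java API rather than Talend components; the JOB_PARAMS table, the connection details and the target_dir key are all made-up placeholders), you could read the key/value pairs once at the start of the job and then share them with every executor. Here a broadcast variable is used, which is the usual Spark idiom for small read-only lookup data:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SharedJobParameters {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("shared-job-parameters")
                .master("local[*]")            // local master only so the sketch runs standalone
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Hypothetical table JOB_PARAMS(PARAM_KEY, PARAM_VALUE) holding the values
        // you would otherwise keep in a context file; connection details are placeholders.
        Dataset<Row> params = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://dbhost:3306/etl_config")
                .option("dbtable", "JOB_PARAMS")
                .option("user", "etl")
                .option("password", "etl")
                .load();

        // The parameter set is tiny, so collect it on the driver and broadcast it:
        // every executor then reads the same in-memory copy instead of re-querying
        // the database or creating a new RDD per lookup.
        JavaPairRDD<String, String> kv = params.javaRDD()
                .mapToPair(r -> new Tuple2<>(r.getString(0), r.getString(1)));
        Map<String, String> paramMap = new HashMap<>(kv.collectAsMap());
        Broadcast<Map<String, String>> jobParams = jsc.broadcast(paramMap);

        // Inside any transformation, executors look values up from the broadcast copy
        // (the equivalent of reading context/globalMap values in a standard job).
        spark.range(5).javaRDD()
             .map(i -> "row " + i + " -> target_dir=" + jobParams.value().get("target_dir"))
             .collect()
             .forEach(System.out::println);

        spark.stop();
    }
}

A broadcast variable sidesteps the RDD-immutability issue mentioned above for values that only need to be read, not updated, during the run.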

 

Along the same lines, you cannot use the AMC (Activity Monitoring Console) in Big Data Batch/Streaming jobs.

 

Hope that helps.

Irshad


Twelve Stars

Re: tContextLoad component in Talend Big Data


@iburtally wrote:

Once some Spark logic starts executing in an executor context, any attempt to manipulate the context variable is local to that executor context only, not to the whole job. That is why we do not provide these components in Spark Batch and Spark Streaming: to keep Talend developers from using an anti-pattern.

 

 


Of course this is all just opinion, but I think your (Talend's) position is also not 100% correct:

1) You use context variables in all your own examples as well (both for batch and for real-time streaming).
2) Most of the time the rest of us also use context only to initialise the job at start: database connection information, IP addresses, folders, login information (99% of context usage).
3) The TAC stores context variables individually per job.

In this case you also push developers towards bad anti-patterns:
- either keep a pre-defined PRODUCTION context group in the job alongside DEV, and store PRODUCTION values in the DEV environment,
- or set the values individually on the PRODUCTION server after deployment.

Why not allow the context to be initialised the same way as for ETL?
I understand what you mean: sometimes in ETL, developers use context.variable as a working variable in the middle of a job.

But a much bigger number of them have DEV/QUAL/PROD context groups (I have seen up to 7) defined in the Repository, and then half of them store all the values together, while the other half set the values manually after deployment.

Maybe I do not understand the logic, but I have not found this anywhere in the Talend docs or threads.
Say we have 100+ jobs deployed in the TAC, and for security reasons we decide to change the database connection password.
For ETL jobs it is all simple: change it in the CSV on the server (or in a table).
But what about:
- routes
- batch jobs
- streaming jobs?

Yes, it might require a kill and restart to pick up the new values, but that is still much easier than editing everything in the TAC or redeploying all the jobs.

Just think about this side of the problem: simple examples with a simple context group and a hadoop/hadoop password do not always work in real life :-)

-----------