We have a use case where our product is generic and used by multiple clients, but one client is on AWS and another is on the Cloudera stack.
Is it possible to build a generic Talend map that runs on both platforms?
This question is a bit confusing. AWS is a cloud computing platform that Cloudera actually makes use of in its stack. I am also not sure what you mean by a "generic Talend map". Talend can work on AWS and will work with Cloudera. Can you elaborate on your requirement? It is likely possible to achieve what you want, but I don't want to say yes to a question that could be interpreted as asking whether Talend can do the impossible (that is, have one job that works for any input/output schema, environment, and technology).
Thank you for the response. My requirement is quite simple. We are a product-based company. A few of our clients are on AWS and a few others are on-premises using CDH. We want to be able to deploy the same Talend maps in both environments. Right now, for AWS we use S3 and for CDH we use HDFS for storage.
Can you suggest a way to do that, please? Please note these are all Spark jobs that we have built.
One of the massive benefits of Talend is context variables. They allow you to work dynamically with different source and target database connections, to give a very basic example of what they can do, but they also allow your job to alter the way it processes based on the context variables supplied to it. This is not always straightforward, but it can be implemented in a logical and consistent way.
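In the Java code Talend generates, context variables are exposed as fields on a `context` object (for example, `context.environment`). As a rough illustration of the idea only, and not Talend's actual generated code, the following sketch shows how a context variable supplied at run time can drive which storage URI a job uses (the names `environment` and `storagePath` are made up for this example):

```java
// Hypothetical sketch: a context object whose values are supplied per
// environment at run time, and logic that adapts based on them.
class Context {
    String environment;  // e.g. "AWS" or "CDH"
    String storagePath;  // e.g. a bucket/key or an HDFS path
}

class StorageResolver {
    // Pick a storage URI scheme based on the environment context variable
    static String resolve(Context ctx) {
        if ("AWS".equals(ctx.environment)) {
            return "s3a://" + ctx.storagePath;
        }
        return "hdfs://" + ctx.storagePath;
    }
}
```

The same compiled job can then behave differently per client simply by loading a different context group at execution time.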
Obviously, in your use case you are dealing with very different sources. This can be handled, but there are many ways to do it and not all of them are logical. I would approach this as follows:
1) Create context variables to help you identify your environment and the configuration of your environment
2) Create child jobs to handle each environment. So, for AWS have a job to deal with your S3 bucket information and for Cloudera have a job to handle HDFS. Pass the data from these jobs to either a standard job that can handle the next stage for all environments or to child jobs that are specific to your environment.
3) These child jobs will be used inside a parent job. This parent job essentially orchestrates the path your job takes. Depending on the context variables, different child jobs will be run. A basic example of how to do this would be to use RunIf links between your child jobs.
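The steps above can be sketched in plain Java. In Talend, a RunIf link's condition is a Java boolean expression such as `context.environment.equals("AWS")`; the sketch below mimics that orchestration, with placeholder child-job bodies standing in for the real S3 and HDFS extraction jobs:

```java
// Minimal sketch of the parent/child orchestration: the parent job checks
// a context variable and runs only the matching child job, the way RunIf
// links gate child jobs in Talend. Child bodies are placeholders.
class ParentJob {
    static String runChildAwsExtract() { return "read from S3"; }
    static String runChildCdhExtract() { return "read from HDFS"; }

    static String run(String environment) {
        // Equivalent of a RunIf condition: context.environment.equals("AWS")
        if ("AWS".equals(environment)) {
            return runChildAwsExtract();
        }
        return runChildCdhExtract();
    }
}
```

From the client's point of view there is still just one deployable job; the branching happens inside it.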
This is a very high level description of the process I would use. I know it might sound like multiple jobs, but in reality you would have one job that you could use in your different environments as far as your clients are concerned.
Thank you for the recommendation.
The thought of having child jobs under a parent job did cross my mind, but it does not satisfy the code reusability requirement. For example, if a change has to be made in the extraction process, we will have to perform the same update on each of the child extraction jobs (AWS and CDH).
You can reuse components of the process as long as they work with a consistent schema and tool. For example, do the files in your S3 bucket have the same schema as those in HDFS storage? If so, you need separate jobs to retrieve the files, and then a single job can read them for both environments. When it comes to writing the data elsewhere, you will need different processes if you are writing to different locations, no matter how consistent the schemas are.
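Put differently, only the retrieval step needs to be environment-specific; the processing logic can live in one shared job, so a change to it is made once. A sketch of that split, with all names and data purely illustrative:

```java
// Sketch of the reuse pattern: environment-specific fetch steps (which
// would be separate Talend child jobs) hand the same-schema data to a
// single shared processing step.
class Pipeline {
    // Environment-specific retrieval (placeholders for S3 / HDFS jobs)
    static String fetchFromS3()  { return "id,amount\n1,10"; }
    static String fetchFromHdfs() { return "id,amount\n2,20"; }

    // Shared processing step, reused by both environments; an extraction
    // change would be made here, once
    static int countRecords(String csv) {
        return csv.split("\n").length - 1;  // exclude the header row
    }
}
```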