Talend 6.4.1 supports CDH 5.10 but not CDH 5.11. CDH 5.12 will be supported in the Talend Winter ’18 release, which leaves a gap in the officially compatible CDH versions: in Studio 6.5 this winter you will find CDH 5.12 in the list of distributions, but not CDH 5.11.
CDH 5.12 offers numerous features beyond CDH 5.11. Talend is working with all Hadoop platform providers on a generic way to connect to any version of a cluster, which will address these skipped versions and more.
The test environment is a new single-node Cloudera Hadoop cluster, installed on a Red Hat 7.4 AWS AMI using Cloudera Manager 5.11.
If you need to install an old version of CDH, see this link: Installing Older version of CDH.
The cluster includes the Spark core distribution, Kafka 0.10.0, and Spark 2.1 (Spark 1.6 is installed by default).
In Talend Studio, from the Repository, right-click the Hadoop cluster and select Create a Hadoop Cluster.
The best practice when you want to connect to an unsupported version of a cluster is to find the closest supported version, and then manually specify the Hadoop services.
Select Custom - unsupported as the distribution option, then choose Cloudera as the base distribution and CDH5.10 (YARN mode) as the version.
From there, you can replace the different libraries with the appropriate version of the jar.
You can find a great video tutorial, How to add an unsupported Hadoop distribution to Talend Studio, that describes how to find and replace a jar (it is made for HDP, but the approach applies equally to CDH).
You may not need the custom unsupported connection at all: you can often select the CDH5.10 distribution, use the Cloudera Manager service to retrieve the cluster configuration, and run with the CDH5.10 libraries.
Next, paste the hostname of your cluster and fetch Hadoop components.
Specify a user name and check the services.
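Before letting Studio fetch the configuration, it can help to confirm that Cloudera Manager is reachable and that the expected services are up. A minimal sketch using the Cloudera Manager REST API (hostname, credentials, and cluster name below are assumptions, not values from this setup):

```shell
# Hypothetical Cloudera Manager host and default admin credentials -- replace with your own.
CM_HOST=cm.example.internal
# List the clusters Cloudera Manager knows about (API v12 ships with CM 5.11).
curl -s -u admin:admin "http://${CM_HOST}:7180/api/v12/clusters"
# List the services of a cluster named "cluster" to confirm HDFS, YARN, HBase, and so on.
curl -s -u admin:admin "http://${CM_HOST}:7180/api/v12/clusters/cluster/services"
```

If these calls return the expected JSON, Studio should be able to retrieve the same configuration through the wizard.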
In the metadata of the repository, under the Hadoop cluster, you should see the connection you just created.
Testing started with a very simple Job that puts data into HDFS using a tHDFSConnection component followed by a tHDFSPut component; the Job executed successfully.
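You can double-check the result of such a Job from outside Studio with the HDFS shell. A quick sketch (the paths and file name are hypothetical):

```shell
# Hypothetical target directory and file -- adjust to your cluster and data set.
hdfs dfs -mkdir -p /user/talend/demo
hdfs dfs -put customers.csv /user/talend/demo/
# Confirm the file landed and peek at the first few rows.
hdfs dfs -ls /user/talend/demo
hdfs dfs -cat /user/talend/demo/customers.csv | head -n 5
```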
Another Job used the same file and loaded it into two HBase tables: one storing the raw data, and another in which blank values were replaced with null in order to leverage HBase's sparse-data capabilities.
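The two target tables can be created ahead of time from the HBase shell. A minimal sketch, assuming hypothetical table names and a single column family:

```shell
# Hypothetical table names; 'cf' is a single column family for all columns.
echo "create 'customers_raw', 'cf'
create 'customers_sparse', 'cf'
list" | hbase shell -n
```

With null values omitted in the second table, HBase stores nothing at all for those cells, which is what makes the sparse layout pay off.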
Testing of Big Data Batch Jobs started with a very simple one, with one map and one reduce task that filter the data set and then aggregate the result of the filtering.
A more complex Job was tested, using a tMap component.
Spark 2.1 must be installed manually using Cloudera Manager and CSD parcels in order to test Spark 1.6 and 2.1 Jobs.
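The Spark 2 installation on CDH boils down to dropping the CSD jar where Cloudera Manager can find it and then distributing the parcel. A sketch of the steps (the exact jar name and version are assumptions; check Cloudera's download page for the file matching your CM release):

```shell
# Hypothetical CSD jar version -- verify the exact file name on Cloudera's site.
sudo wget -P /opt/cloudera/csd \
  http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
sudo chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
# Restart Cloudera Manager so it picks up the new CSD.
sudo service cloudera-scm-server restart
# Then, in the Cloudera Manager UI: download, distribute, and activate the
# SPARK2 parcel, and add the "Spark 2" service to the cluster.
```

Spark 1.6 stays in place, so both versions remain selectable from the Job's Spark configuration.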
Set up the Spark configuration of the Job to use the machine where Talend Studio is installed as a Spark driver (the Spark driver machine should be in the same private network as the cluster machine in order to resolve private IP within the VPC).
A more complex Job involved reading from an HBase table and applying more complex aggregations.
Last, the Spark machine learning library (MLlib) was tested using the Naïve Bayes classification algorithm against the well-known Iris data set.
Creating the model with the training dataset:
Scoring the test dataset with the predictive model:
The accuracy of the predictive model is pretty good; only one of the predictions was wrong:
Spark Streaming Jobs were tested using Kafka: one job produces and sends messages to a topic, and another Job consumes those messages.
Note: Be sure that you installed a version of Kafka that is supported by Talend Studio 6.4.1, in this case version 0.10.0.
The following Job produces and sends a message to a Kafka topic:
The following Job consumes those messages and shows them in a log table:
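You can create the topic and verify the producer Job's output independently with the Kafka 0.10 command-line tools shipped with CDH. A sketch, assuming hypothetical broker and ZooKeeper hosts and a hypothetical topic name:

```shell
# Hypothetical hosts and topic name -- replace with your cluster's values.
BROKER=broker1.example.internal:9092
ZK=zk1.example.internal:2181
# Create the topic the two Jobs will use (Kafka 0.10 topic tooling talks to ZooKeeper).
kafka-topics --create --zookeeper "$ZK" --replication-factor 1 \
  --partitions 1 --topic talend_demo
# Independently check that the producer Job's messages arrived.
kafka-console-consumer --bootstrap-server "$BROKER" \
  --topic talend_demo --from-beginning
```

Seeing the same messages in the console consumer and in the consumer Job's log table confirms the whole pipeline end to end.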
Everything worked well.
Even though CDH 5.11 is not officially supported in Talend Studio 6.4.1, everything went well and the configuration was straightforward, so you can still use your favorite data integration tool for most critical tasks and more. Not every single detail was tested, but overall no compatibility issues arose.
You are more than welcome to share your experiences, and address the issues you might be facing with this setup in the comment section.