Fifteen Stars

The same old problems, new technology. Talend explain yourself!

I am trying to connect to an Apache Hadoop 2.7.0 distribution. I know it is not supported, but the supported versions are so far behind where we are now, I am not inclined to go there. I have been using Talend's software for years and am more than proficient in the DI and ESB applications. I therefore thought that I would stand a pretty good chance of using the "Import Custom Definition" option in "Hadoop Cluster Connection" settings. I have seen a few posts where this option has been suggested, but that is as far as it goes. An example is this. There is no information online as to what is meant by a "custom definition file" and all we are being hinted at is that it is a zip. There is no zip file in the Hadoop distribution, so I simply zipped my distribution and tried that. I got this...


Can you please explain exactly what this zip file is, where I can get one, or how I can build one (i.e. what data I need)?
Talend's software is brilliant. I have been using it for years and have built my business around both the Open Source and Enterprise Editions. However Talend fails significantly when it comes to its documentation. It is beyond appalling. If I were to recommend a single thing that would elevate Talend to a position where it could truly compete with the closed source big boys, it would be to focus on documenting. Focus on explaining what the cryptic selection/option boxes do. Focus on spoon feeding answers to people so that a single answer on a place like this CAN serve hundreds of people. At the moment I am only sticking with this because I know that when I get this working (and when I discover and make personal notes on the "features"), I will find it very useful. If I were new to Talend, I would have dropped the BigData tool by now as not even Sherlock Holmes could piece together the obfuscated "instructions". 
Rilhia Solutions
Employee

Re: The same old problems, new technology. Talend explain yourself!

Hello,
The way to use a custom distribution is the following:
1 - Click on the browse button when the custom mode is selected.
2 - A window pops up and asks you to cancel or to choose one of the two available options: importing from an existing distribution, or importing from a zip file.
3 - Assuming you have nothing locally, select "import from an existing distribution" and choose the distribution that is closest to Hadoop 2.7.
4 - You will notice that a list of tabs appears (HDFS, Hive, HBase, etc.), each with a list of jars. These are the dependencies that the Talend components require to connect to your cluster for HDFS, Hive, HBase, etc.
5 - As you may have understood already, you now have to replace each jar in the list with the corresponding jar from your cluster. For example, under HDFS you will find a hadoop-hdfs jar in version 2.3.0-cdh5.1.3, say; you will have to replace it with the hadoop-hdfs-2.7.0.jar from your cluster.
6 - Then you'll be able to execute your jobs.
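Step 5 can be tedious by hand when the list is long. Below is a minimal Python sketch (not a Talend tool) that, for each jar name shown in the Studio list, looks for jars with the same artifact name under a local copy of your cluster's Hadoop directory. The function names, the sample jar list and the /opt/hadoop-2.7.0 path are assumptions for illustration. It only suggests candidates and does not check that the versions are actually compatible.

```python
import re
from pathlib import Path

# Matches "<artifact>-<version>.jar", where the version starts with a digit,
# e.g. "hadoop-hdfs-2.3.0-cdh5.1.3.jar" -> artifact "hadoop-hdfs".
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def artifact_of(jar_name):
    """Return the artifact part of a jar file name, or None if it has no version."""
    match = JAR_NAME.match(jar_name)
    return match.group("artifact") if match else None


def find_replacements(listed_jars, cluster_root):
    """Map each listed jar name to same-artifact jars found under cluster_root."""
    by_artifact = {}
    for path in Path(cluster_root).rglob("*.jar"):
        artifact = artifact_of(path.name)
        if artifact:
            by_artifact.setdefault(artifact, []).append(path)
    # Variants such as "-tests" jars share the artifact name and are listed too.
    return {jar: sorted(by_artifact.get(artifact_of(jar), [])) for jar in listed_jars}


if __name__ == "__main__":
    # Hypothetical input: jar names copied from the HDFS tab, and a hypothetical path.
    listed = ["hadoop-hdfs-2.3.0-cdh5.1.3.jar", "hadoop-common-2.3.0-cdh5.1.3.jar"]
    for jar, candidates in find_replacements(listed, "/opt/hadoop-2.7.0").items():
        found = ", ".join(str(c) for c in candidates) or "no candidate found"
        print(f"{jar} -> {found}")
```

Jars with no candidate are the ones to look at first, since they usually mean the component depends on something your distribution packages differently.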
Please let me know if you have any questions, as this mechanism is not straightforward.
Best regards,
Rémy.
Fifteen Stars

Re: The same old problems, new technology. Talend explain yourself!

Thanks Remy,
It took a while, but I managed to figure out most of what you said by tinkering. However, I haven't been able to get a connection to my Hadoop cluster, as it appears to be too different from the latest Apache Hadoop version supported by Talend 5.6 and 6.0 RC. A simple swap of jars isn't working. Has anyone at Talend got it working with Hadoop 2.7.0, or with any Hadoop 2 release? It seems strange that Talend only supports Apache Hadoop up to 1.0.0. When will Hadoop 2 be supported? At the moment I am still getting to know Hadoop, so it is probably better that I code my stuff manually. But it would be nice to be able to use Talend with Hadoop, and I am not inclined to learn an old version of Hadoop just so that I can use it with Talend.
Regards
Richard
Rilhia Solutions
Community Manager

Re: The same old problems, new technology. Talend explain yourself!

@rhall_2.0
Some documentation does exist in the terms described by Remy at: Handling_jobs-custom_hadoop
But as mentioned by Remy, the process isn't straightforward, so we'll try to add more information in the documentation in a future release.
Feel free to open JIRAs to the Talend documentation team when you can't find information in our documentation: https://jira.talendforge.org/secure/CreateIssue!default.jspa
Many thanks
Elisa
Fifteen Stars

Re: The same old problems, new technology. Talend explain yourself!

Thanks Elisa.
I did look through Talend's website, but the search mechanism is not very useful. It is actually easier to exit Talend, point Google at help.talend.com and then search. But that didn't find what I needed. As I said, I managed to work out what I needed, but could not get Hadoop 2.7.0 to connect to Talend. The page you suggested actually says that Talend Exchange has definition zip files from the community to try. It doesn't, or at least they are not easily found.
If anyone at Talend has managed to hook 2.7.0 or any other version 2 release of Apache Hadoop to Talend, it would benefit a lot of people if they could share their definitions on the Exchange.
Rilhia Solutions
Community Manager

Re: The same old problems, new technology. Talend explain yourself!

We'll improve the details provided in the documentation page for sure.
Just out of curiosity, was your first entry point the search field on the talend.com website?
Thanks for your feedback
Elisa
Fifteen Stars

Re: The same old problems, new technology. Talend explain yourself!

I searched the Talend website first, but I regularly find that it doesn't perform very well unless you know exactly what you are searching for. When you are searching for something where you don't know the exact name, it makes it very difficult to find based on approximate knowledge. It is much easier to search via Google if that is the case.
Rilhia Solutions
Community Manager

Re: The same old problems, new technology. Talend explain yourself!

Thanks a lot for this feedback.
Do you ever use the search field directly on help.talend.com (rather than the talend.com search, which is different)?
Cheers
Elisa
Fifteen Stars

Re: The same old problems, new technology. Talend explain yourself!

The help.talend.com search gets better information, but the problem here is that the pertinent information is not available. For configuring a "Custom Unsupported Distribution" there should be a detailed document explaining the relationship between the jars being imported and the distribution, how the index.xml file corresponds to that, and what descriptions must be used in the index.xml (if any), ideally with an explained example. In its current form, it is a bit like trying to translate an ancient Egyptian text without the benefit of the Rosetta Stone.
I have thankfully managed to get it to work by hacking away at this and modifying a default Cloudera distribution. I am sure there will be issues with this custom distribution, but I will sort those as I find them. My complaint here is that even basic documentation based on the use case/user story the developer was given to build the functionality would suffice here. Maybe even developer notes. But instead I was left in a position where I knew what I wanted to do was achievable and I was teased with the possibility of examples being on Talend Exchange, but there was absolutely nothing that helped me get the answer. I know that there is a lot to document and there is a lot that has been documented that is very good. But there is also a lot that doesn't get documented or updated for a long time.
Rilhia Solutions
Community Manager

Re: The same old problems, new technology. Talend explain yourself!

Thanks for your feedback. However, there would need to be as many Rosetta Stones as there are versions and distributions, and we can't document each of them in detail. Remember that those are unsupported, so we don't always have them at hand, or the opportunity to test them and the time to document them. That said, I take your point. We will try to improve this by showing the principle of debugging the import of an unsupported Hadoop distribution in a future release (DOCT-5104).
If you notice something is missing or not detailed enough, it would be great to report it to the Talend doc team on JIRA (Documentation project): https://jira.talendforge.org/secure/CreateIssue!default.jspa
Many thanks again for taking the time to give us your feedback.
Elisa
One Star

Re: The same old problems, new technology. Talend explain yourself!

Has this been fixed in the latest Big Data 6.0.1 version?
I downloaded and checked it, and for the Apache distribution it still only supports version 1.0.0.
BTW, the YouTube video titled "How-to add an unsupported Hadoop to Talend Studio" mentioned in https://jira.talendforge.org/browse/DOCT-5104 is also not available on YouTube.
Please assist.
Fifteen Stars

Re: The same old problems, new technology. Talend explain yourself!

Hi mints, you can connect to a standard Apache Hadoop cluster by using a Cloudera version (based on Apache Hadoop 2). It is a pain and requires you to dig around for configuration settings, but it does work. I have a 4-node Apache Hadoop 2.7 cluster working. I found most of the time was spent configuring the cluster. Once that was done (and after all of the reading it took), I found that selecting the latest Cloudera config with adjusted URLs and ports worked for me.
I think that Apache Hadoop 2 needs to be supported as an Apache distribution though. I am obviously not the only person who took the approach that if learning Big Data, it is best to start with the base open source Hadoop distribution.
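Since most of the effort goes into getting the URLs and ports right, it can help to confirm that each endpoint is reachable from the Studio machine before blaming the jars. A minimal Python sketch, assuming hypothetical host names and the stock Apache Hadoop 2 default ports (8020 for the NameNode, 8032 for the ResourceManager, 8030 for its scheduler, 10020 for the job history server). Take the real values from your own core-site.xml, yarn-site.xml and mapred-site.xml.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints; replace with the values from your cluster's config.
    endpoints = {
        "NameNode (fs.defaultFS)": ("namenode.example.local", 8020),
        "ResourceManager": ("resourcemanager.example.local", 8032),
        "ResourceManager scheduler": ("resourcemanager.example.local", 8030),
        "Job history server": ("historyserver.example.local", 10020),
    }
    for name, (host, port) in endpoints.items():
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{name}: {host}:{port} {status}")
```

A reachable port only shows that something is listening there. It does not prove that the service behind it is compatible with the jars in the custom distribution.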
Rilhia Solutions