One Star

FTP to HDFS

Hi,
I need to copy CSV files to an FTP server, do some manipulation on them and then upload them to HDFS. I've been browsing the forums and the internet for some time but haven't found an answer on how to accomplish this.
The closest I got is this Help Center link: https://help.talend.com/search/all?query=Setting+up+an+FTP+connection&content-lang=en
However, the Metadata node in the Repository tree view is not available when I open Open Studio for Big Data.
Can you point me in the right direction please? Which product do I need to accomplish this / how do I enable the Metadata node?
7 REPLIES
One Star

Re: FTP to HDFS

Where does the CSV manipulation take place? And by whom?
One Star

Re: FTP to HDFS

Thanks for the quick reply.
The CSV manipulation takes place on the FTP server, which runs Ubuntu. It's done by a cron job that runs a small C++ program on each of the CSV files.
The final goal is to integrate this manipulation and the whole process into a workflow within Talend; however, as a first step I'm curious how to schedule a regular upload from FTP to HDFS in Open Studio for Big Data.
One Star

Re: FTP to HDFS

Pfff... just some thoughts:
1. You can use the tSSH component to run the C++ job which is normally triggered by cron on your server (see the sketch at the end of this reply).
2. But to be honest: use Talend to do the manipulation.
I'm not sure exactly what your question is now. How to build a job from scratch?
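For reference, this is roughly what tSSH does under the hood. A minimal sketch using the JSch library; the host, credentials and command path below are placeholders, not your actual setup:

```java
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RunRemoteHashJob {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        // placeholder user/host -- replace with your Ubuntu "FTP server"
        Session session = jsch.getSession("ubuntu", "ftp-server.example.com", 22);
        session.setPassword("secret");                    // or jsch.addIdentity(...) for key-based auth
        session.setConfig("StrictHostKeyChecking", "no"); // fine for a quick test, not for production
        session.connect();

        // run the C++ program that cron normally triggers (path and arguments are placeholders)
        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("/opt/hashing/hash_csv /data/incoming /data/hashed");
        channel.connect();

        // wait until the remote command has finished and report its exit status
        while (!channel.isClosed()) {
            Thread.sleep(500);
        }
        System.out.println("Remote hashing finished with exit status " + channel.getExitStatus());

        channel.disconnect();
        session.disconnect();
    }
}
```

In a Talend job you would simply configure the same host, credentials and command in the tSSH component instead of writing this yourself.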
One Star

Re: FTP to HDFS

My question is:
How do I build a job that connects from the FTP server to HDFS and pushes the CSV files?
I couldn't do that following the Help Center instructions.
One Star

Re: FTP to HDFS

Please describe your job flow in more detail. What should the job do?
1. Connect to the FTP server?
2. Download files from that server?
3. Put those downloaded files on HDFS?
Again: more details, please.
One Star

Re: FTP to HDFS

The whole job flow:
1. Export a table from Oracle 11g to CSV text files, split into smaller chunks, each chunk being a 100 MB CSV file. Let's call the machine where Oracle is installed the "Oracle server". It runs Windows Server 2008 R2.
2. Transfer these text files from the "Oracle server" via SFTP to a specific folder on an Ubuntu 13.10 server. Let's call this the "FTP server".
3. Take all files in the folder and run the C++ program on them. The program takes a CSV file as input, hashes some of the information in it, and outputs a modified copy to a different folder, still on the "FTP server".
4. Upload these hashed CSV files to a Hadoop cluster by connecting to the main node and pushing them directly into HDFS. Let's call this the "Hadoop server". The cluster runs HDP 2.0; the main node runs CentOS 6.5.
5. Create an external Hive table over the uploaded files by specifying the folder in HDFS where the CSV files reside (see the sketch at the end of this post).
Direct load from Oracle to HDFS is unfortunately not an option for us. We have to do this hashing on the "FTP server".
I do realize I didn't provide enough detail earlier, so thanks for your time and effort, Rogier! Let me know if there are other important aspects I should cover.
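For what it's worth, step 5 on our side would be something like the following. Just a sketch over JDBC against HiveServer2; the host, table and column names are placeholders, not our real schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver class (assuming HiveServer2 is running on the main node)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hadoop-server.example.com:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // external table pointing at the HDFS folder that holds the hashed CSV files
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS hashed_records ("
                + "  id STRING, hashed_value STRING, amount DOUBLE"
                + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                + "STORED AS TEXTFILE "
                + "LOCATION '/data/hashed_csv'");
        }
    }
}
```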
One Star

Re: FTP to HDFS

No problem; I suggest that you simply start working with Talend. You will see that WHAT you want isn't that difficult. Try the following:
1. Create a new job.
2. Drag in a tOracleInput component and fill in the query that selects the rows.
3. Connect it to a tFileOutputDelimited component; under the advanced settings you can specify splitting the output into files of x records each.
4. Now you have a couple of CSV files containing your records on the computer that is running the Talend job.
5. Upload them to the FTP server with a tFTPPut component.
6. Now comes the moment when the Ubuntu machine "does something" with your CSV files and saves the changed files.
7. Then use tFTPGet to download those changed files again...
8. ...and put them on the cluster with a tHDFSPut component (see the sketch at the end of this reply).
9. etc..
10. etc.
11. etc.
So, I would suggest you just start clicking around in Talend :-). Read some tutorials, etc.
Things to consider:
- You can also save the records straight from the Oracle database to the FTP server (instead of writing them locally first).
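And in case it helps to see what steps 7 and 8 boil down to: here is a minimal sketch in plain Java of an FTP download followed by an HDFS put, using Apache Commons Net and the Hadoop FileSystem API. Host names, credentials and paths are placeholders, and it assumes plain FTP (for SFTP you would use an SFTP library such as JSch instead); the tFTPGet and tHDFSPut components do the equivalent for you:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FtpToHdfs {
    public static void main(String[] args) throws Exception {
        // 1. download one hashed CSV from the FTP server to a local temp file
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp-server.example.com");
        ftp.login("ubuntu", "secret");
        ftp.enterLocalPassiveMode();
        File local = new File("/tmp/part_0001_hashed.csv");
        try (OutputStream out = new FileOutputStream(local)) {
            ftp.retrieveFile("/data/hashed/part_0001_hashed.csv", out);
        }
        ftp.logout();
        ftp.disconnect();

        // 2. push the local file into HDFS (fs.defaultFS points at the NameNode of the HDP cluster)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop-server.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path(local.getAbsolutePath()),
                                 new Path("/data/hashed_csv/part_0001_hashed.csv"));
        }
    }
}
```

In Talend you only configure the same connection details in the component settings; I'm just showing the sketch so you can see there is no magic involved.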