Synchronize files from unix to hdfs

One Star

I want to set up a job that gets a list of files from a remote Unix server and compares it with the files listed in HDFS. If a file does not exist in HDFS, I want to get it from the Unix server and put it into HDFS. Can anyone point me in the right direction to get started? I am using version 6.1, the latest, of Talend Big Data Studio.
Community Manager

Re: Synchronize files from unix to hdfs

Hi 
Which protocol do you want to use to access the remote Unix server and get the files: FTP, SCP, or HTTP? And do you want to compare only the file names, or the file content as well?
Regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: Synchronize files from unix to hdfs

I have access to scp and ftp. I just want to compare the file names since they are unique per day.
Community Manager

Re: Synchronize files from unix to hdfs

Hi
You can use the tXXXList components (here tFTPFileList and tHDFSList) to get all the file names from the remote server and from HDFS, do an inner join in tMap between the remote files and the HDFS files, and output the unmatched records (the inner join rejects), e.g.:
tFTPFileList--iterate--tFixedFlowInput1--main--tUnite--main--tMap--out1-->
                                                              |
                                                            lookup
                                                              |
tHDFSList--iterate--tFixedFlowInput2--main--tUnite------------+
tFixedFlowInput1: define one column and set its value as:
((String)globalMap.get("tFTPFileList_1_CURRENT_FILE"))
tFixedFlowInput2: define one column and set its value as:
((String)globalMap.get("tHDFSList_1_CURRENT_FILE"))
Refer to this KB article:
https://help.talend.com/pages/viewpage.action?pageId=190513450
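Outside the Studio, the comparison that the tMap performs can be sketched in plain Java. This is only an illustration of the join logic, not Talend-generated code: the class name MissingFiles and the method missingInHdfs are hypothetical, and the two lists stand in for the file names collected by the two list components.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: mirrors the "inner join, keep the rejects" step of the tMap.
final class MissingFiles {
    private MissingFiles() {}

    /** Returns the remote file names that have no match in HDFS, in remote order. */
    static List<String> missingInHdfs(List<String> remoteNames, List<String> hdfsNames) {
        // The HDFS names play the role of the lookup flow.
        Set<String> present = new HashSet<>(hdfsNames);
        List<String> missing = new ArrayList<>();
        for (String name : remoteNames) {
            // A main row with no matching lookup row is an inner join reject.
            if (!present.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

The match is on the exact file name only, which fits the case where names are unique per day; comparing content would need a different key, such as a checksum.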
Regards
Shong
One Star

Re: Synchronize files from unix to hdfs

Thank you for that!
I created the flow as you suggested. I am not sure why the tUnite components are needed, but I have kept them in anyway.

After this, do I run a tFTPGet to fetch the file locally and then a tHDFSPut? Is there any way to send the file directly from the remote Unix server to HDFS?
Community Manager

Re: Synchronize files from unix to hdfs

Hi 
The tUnite component is needed in this job to merge all the file names into a single flow before doing the join.
You asked: "After this, do I run a tFTPGet to fetch the file locally and then a tHDFSPut? Is there any way to send the file directly from the remote Unix server to HDFS?"

There is no direct way to move the file between the remote server and HDFS; you have to get it to the local system first and then put it into HDFS.
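The two-step transfer can be sketched in plain Java as a loop that stages each file locally and then uploads it. This is a rough outline under stated assumptions, not a Talend or Hadoop API: Downloader and Uploader are hypothetical interfaces standing in for the tFTPGet and tHDFSPut steps, and the file names are assumed to be plain names without directory separators.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class StagedTransfer {
    private StagedTransfer() {}

    /** Hypothetical stand-in for the download step (tFTPGet in the job). */
    interface Downloader {
        void fetch(String fileName, Path localTarget) throws IOException;
    }

    /** Hypothetical stand-in for the upload step (tHDFSPut in the job). */
    interface Uploader {
        void put(Path localSource, String fileName) throws IOException;
    }

    /** Stages each file in a temporary local directory, uploads it, and returns the number uploaded. */
    static int transferAll(List<String> names, Downloader down, Uploader up) throws IOException {
        Path staging = Files.createTempDirectory("hdfs-staging");
        int count = 0;
        try {
            for (String name : names) {
                Path local = staging.resolve(name);
                try {
                    down.fetch(name, local); // step 1: remote server -> local disk
                    up.put(local, name);     // step 2: local disk -> HDFS
                    count++;
                } finally {
                    Files.deleteIfExists(local); // do not leave staged copies behind
                }
            }
        } finally {
            Files.deleteIfExists(staging);
        }
        return count;
    }
}
```

Deleting each staged copy right after its upload keeps local disk usage to one file at a time, which matters if the daily files are large.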
Regards
Shong
One Star

Re: Synchronize files from unix to hdfs

Thanks for all of your help!