I have a master job that builds a list of files from an HDFS directory and stores them in different arrays. These arrays are then passed to a child job, which works on those files.
However, I would like to run the master job in parallel on different machines, and this creates an issue: the same list of files will be given to each child job on each machine.
It would be great if I could give each child job a unique set of files to process.
May I ask how you are running the same instance of the parent job on multiple machines at the same time, and how you control them? I know how it can be done, but I want to know your method so that I can suggest accordingly.
Nevertheless, if the primary focus is for job1-serv1, job2-serv2 and job3-serv3 to each pick and handle unique files, so as to avoid reprocessing and clashes, then you can use this workaround.
1. I am assuming you already know the machine details beforehand.
2. Job A: list all files in HDFS and insert the entries with a randomly assigned machine name (from three machine names, say). Two fields: the first is the filename, the second is the machine name.
3. In the child job instances, make sure you set up selective picking by server/hostname when listing/processing files.
4. This way every job+server combination picks only its tagged files, with no duplication and no clashes.
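
To make the tagging idea concrete, here is a minimal sketch in Python (not Talend code). The host names, file paths and the function names `tag_files` and `files_for` are all made up for illustration; in a real job the tagged pairs would live in a table or file that every child instance can read.

```python
import random

def tag_files(filenames, machines, seed=None):
    """Return one (filename, machine) pair per file, with a randomly chosen machine."""
    rng = random.Random(seed)
    return [(name, rng.choice(machines)) for name in filenames]

def files_for(host, tagged):
    """Return only the filenames tagged for the given host."""
    return [name for name, machine in tagged if machine == host]

# Hypothetical machine names and HDFS paths, for illustration only.
machines = ["serv1", "serv2", "serv3"]
filenames = [f"/data/in/part-{i:04d}.csv" for i in range(10)]

# The listing job runs once and tags every file with exactly one machine.
tagged = tag_files(filenames, machines, seed=42)

# Each child instance then picks only the files tagged with its own host name.
for host in machines:
    print(host, files_for(host, tagged))
```

Because every file gets exactly one tag, no two hosts can pick the same file. Random tagging can leave the load uneven; assigning machines in rotation would be a simple alternative if balance matters.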
@jatinderjawanda, that's why I mentioned separating the file-listing activity earlier, rather than including it in the child job, if that makes sense.
I guess there are a few standard methods.
Coordinate the instances' activity through a database table that is set up to allow only one user at a time. An instance picks a file from the directory and checks this table to see whether anyone has already picked it. If no one has, it writes to the table that it has the file and then releases the table.
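
A rough sketch of the database approach in Python, under some assumptions: SQLite stands in for whatever shared database all instances can reach, and the table name `claims` and the functions `open_claims` and `try_claim` are made up. Note that it relies on a primary-key constraint rather than an explicit table lock, so the check and the write happen in a single insert.

```python
import sqlite3

def open_claims(path=":memory:"):
    """Open the claims database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        "filename TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    conn.commit()
    return conn

def try_claim(conn, filename, owner):
    """Return True if this owner claimed the file, False if someone already had it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO claims (filename, owner) VALUES (?, ?)",
                (filename, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejects a second row for the same filename.
        return False
```

An instance would call `try_claim` for each file it sees and process only those for which it returns `True`.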
You can do something similar with a file; let's call it a control file. Only one of them is allowed to exist, and only one user is allowed to have it open. If an instance succeeds in getting control of the file, it is its turn to pick a file, which it promptly renames with a prefix marking it as belonging to that specific instance (or you simply move it locally to be processed), and then it releases the control file. Obviously, if the other instances detect that they cannot take control of the control file, they go to sleep for a little while.
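
The control-file idea could look roughly like this in Python on a local filesystem. The `__` prefix separator, the retry settings and the function names are assumptions for illustration, and the control file is assumed to live outside the directory being scanned. On HDFS you would need an equivalent create-if-absent call through whatever client you use.

```python
import os
import time

def acquire_control(path, retries=50, delay=0.1):
    """Try to create the control file exclusively; sleep and retry while it exists."""
    for _ in range(retries):
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            time.sleep(delay)  # another instance holds control, so wait a little
    return False

def release_control(path):
    """Give up control so another instance can take its turn."""
    os.remove(path)

def claim_next(directory, control_path, instance):
    """While holding control, rename the first unclaimed file with this instance's prefix."""
    if not acquire_control(control_path):
        return None
    try:
        for name in sorted(os.listdir(directory)):
            if "__" not in name:  # no prefix yet, so nobody has claimed it
                claimed = f"{instance}__{name}"
                os.rename(os.path.join(directory, name),
                          os.path.join(directory, claimed))
                return claimed
        return None
    finally:
        release_control(control_path)
```

One weakness to keep in mind: if an instance dies while holding the control file, the others wait until someone removes it, so a real setup would probably want a timeout or a cleanup step.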
Depending on your OS, you can also simply rename the file, and the OS will ensure that the file is not processed by two entities simultaneously. Once the renaming is done, you can move it at the instance's leisure.
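
As a sketch of claim-by-rename in Python: this assumes a filesystem where a rename is atomic (true for a rename within one POSIX filesystem; check the guarantees of your own storage layer before relying on it). The `__` prefix separator and the function name `claim_by_rename` are made up for illustration.

```python
import os

def claim_by_rename(directory, name, instance):
    """Try to claim a file by renaming it with this instance's prefix.

    Returns the new path on success, or None if another instance got there first.
    """
    src = os.path.join(directory, name)
    dst = os.path.join(directory, f"{instance}__{name}")
    try:
        os.rename(src, dst)
        return dst
    except FileNotFoundError:
        # The original name is gone, so someone else already renamed it.
        return None
```

Only one rename of a given source name can succeed, so whichever instance gets a path back owns the file and can process or move it later.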
If you are not sure that your OS will prevent concurrent manipulation, you can also pick a file, rename it with a prefix, and do nothing with it for a few milliseconds. Then check, based on the core name (not including the prefix), whether someone else has tried to take control of it. If no one did, it's yours; if someone did, remove your prefix and let the other instance have it. (You may get a duplicate warning, in which case you delete that one.)
These are all old-school contention-breaking methods.