Hi Community,

I have many CSV files spread across several directories, and some file names are duplicated across those directories. I want to read each file name only once: if the same name occurs in more than one directory, only one of the copies should be read.

Example:
D:\test\a\abc.csv, 123.csv, yud.csv
D:\test\b\rd.csv, xy.csv
D:\test\abc.csv, fty.csv

As you can see above, abc.csv exists in 2 locations. I want to read only one of these two files. Any help would be appreciated.

Thanks,
Sravanth
You need to store the names of the files you have already processed. Where you store them (memory, a file, or a database) depends on whether or not you want this de-duplication to persist across runs of your Job. A database table of processed files is probably the sensible option: insert each file name once the file has been processed successfully, and check the table each time you pick up a new file. If you don't have a database to hand, SQLite works well; I always use it for this type of activity.
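To make the idea concrete, here is a rough sketch in Python using the standard-library sqlite3 module. It is not tied to any particular tool, and the table name processed_files and the helpers open_registry, already_processed, mark_processed and process_once are names made up for illustration. It also assumes that two files are "the same" when their names match regardless of case, which seems reasonable for Windows paths but is an assumption.

```python
import os
import sqlite3


def open_registry(db_path=":memory:"):
    """Open (or create) the registry of processed file names.

    Pass a real file path instead of ":memory:" if the de-duplication
    should persist across runs.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files (file_name TEXT PRIMARY KEY)"
    )
    return conn


def already_processed(conn, file_name):
    """Return True if this file name has been recorded before."""
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE file_name = ?", (file_name,)
    ).fetchone()
    return row is not None


def mark_processed(conn, file_name):
    """Record a file name after it has been processed successfully."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO processed_files (file_name) VALUES (?)",
            (file_name,),
        )


def process_once(conn, root_dir, handle_csv):
    """Walk root_dir and call handle_csv(path) once per distinct CSV file name."""
    processed = []
    for dir_path, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            if not name.lower().endswith(".csv"):
                continue
            key = name.lower()  # assumption: names compare case-insensitively
            if already_processed(conn, key):
                continue  # a file with this name was read earlier
            full_path = os.path.join(dir_path, name)
            handle_csv(full_path)
            mark_processed(conn, key)  # only reached if handle_csv did not raise
            processed.append(full_path)
    return processed
```

Usage would look something like process_once(open_registry("processed.db"), r"D:\test", my_reader), where my_reader is whatever function reads one CSV. Which copy of a duplicated name gets read depends on the order of the directory walk, so if you need a specific copy (for example the newest), you would have to sort the candidates yourself before processing.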