I want to perform a lookup operation (defined below) in a Spark batch job.
My main input flow file, say 'ABC', has columns U, V, W.
I have another (lookup) file, say 'DEF', with columns X, Y, Z.
My logic should check: if (X == "APPLE") then take the 'Y' value and populate it into "V", else populate Null.
This is explained in more detail in the link:
In a Data Integration job I stored the lookup data in a HashMap and wrote a custom Java function to fetch values by key. Now, considering Spark's distributed framework, what is the best way to implement this lookup operation (HashMap, RDD, pair RDD, DataFrame, etc.)? Also, if possible, please elaborate on why it is the best option.
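To make the question concrete, here is a minimal, framework-free sketch of the lookup semantics described above. The join key is an assumption on my part (I take ABC.U to be matched against DEF.X, since the post does not name the key), and all variable names (`def_rows`, `abc_rows`, `lookup`) are hypothetical:

```python
# Lookup file 'DEF' with columns X, Y, Z, reduced to an X -> Y map.
def_rows = [
    {"X": "APPLE", "Y": "fruit", "Z": 1},
    {"X": "CARROT", "Y": "vegetable", "Z": 2},
]
lookup = {row["X"]: row["Y"] for row in def_rows}

# Main flow 'ABC' with columns U, V, W.
abc_rows = [
    {"U": "APPLE", "V": None, "W": 10},
    {"U": "BANANA", "V": None, "W": 20},
]

# Populate V from the lookup; keys with no match yield None (Null).
for row in abc_rows:
    row["V"] = lookup.get(row["U"])

print([row["V"] for row in abc_rows])  # -> ['fruit', None]
```

In an actual Spark batch job, the usual choices for this pattern are (a) broadcasting the small lookup map to the executors (`sparkContext.broadcast(lookup)`) and applying it inside a `map` or UDF, which avoids a shuffle when DEF is small, or (b) a DataFrame left join with a broadcast hint on the lookup side, which keeps the logic declarative and lets the optimizer plan it. This is a sketch of the semantics, not a claim about which option is best for your data sizes.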
Appreciate your suggestion/help.
tCacheIn and tCacheOut are available in the Spark Batch and Spark Streaming Job frameworks.