I have an input file of 700k rows and a lookup file of 63k rows. I'm trying to match each of the input rows with one and only one of the lookup rows. Unfortunately this is not an exact match on several fields, instead there are three criteria:
1. Input.string must start with all characters of lookup.string. This string can be a variable number of characters long per lookup row. Some have 2, some have 10.
2. Input.date1 must be later than or equal to lookup.date1.
3. (Input.date2 must be less than or equal to lookup.date2) or (lookup.date2 is null)
I have sorted my lookup file so that the strings are in descending order and the dates are sorted so that the first row that matches the criteria above should be the proper lookup row, and set the tMap to Inner Join.
As none of these are direct, the only way I know to do this match is to use a tMap with an expression filter only. If anyone knows of a better component to try please let me know. This unfortunately means that it is a Cartesian join with a lot of rows. Also with a tMap join with expression filter only, it appears you cannot restrict to first match.
As I'm already stuck with a Cartesian join with a large amount of data, I'm wondering if there is any way to stop the join process after it finds the first match. At the very least I'd like to avoid all the extra processing that I can.
My current solution is to include the line number of the input as part of that row, and have a tjavarow at the end which updates a global variable. When a row makes it through the filter, the global variable is then updated to that line number. In the expression filter no row can match if the line number is equal to that global variable. Basically only one line is getting through the expression filter. However all the remaining lookup rows are checked anyway. This works, but I'm lucky if it processes a hundred rows a second.
I'm not sure I fully understand your requirement, but it sounds like you may be able to make use of the "Reload at Each Row" functionality of the lookup and a bit of Java code. It would be nicer if your lookup data was coming from a DB as you can add a dynamic where clause to your query for every input row. Obiously 700k DB queries is a lot, but I'd wager it would perform better than 700k x 63k potential comparisons in Java.
Since you are currently storing your data in a tHash component, I'm assuming that you want to hold this data in memory. This does mean that filtering it may require a bit of Java. I have written two tutorials which cover what I am suggesting, although they cover the theory (with few similarities to what you are actually wanting.
The first tutorial looks at storing data in a bespoke collection. I am suggestion this as I believe, from what you have described, that you may be able to use this with your filtering of records in the lookup. It is a little complicated (I was actually recreating the Connect By Prior functionality in Oracle.....just to see if I could), but should give you some ideas on how you can make use of a bespoke collection.
The second tutorial is on row multiplication. I use a technique of retrieving a value from the Main flow of a tMap and sending that to the Lookup component. The lookup component returns data to the tMap based on that value. This could be used with your bespoke collection to allow you to filter your in-memory data.
With your understanding of the requirements (and a bit of Java knowledge) it should be relatively straight forward to build something that will allow you to achieve your performance goal.
If all the data were in a DB, then an inner join with a "Like" and a first match would handle everything quite nicely. Unfortunately I only have access to data files. The only thing I am storing is the lookup values, and those are stored because they might come from a variety of file types and I don't know another way to make the lookups to a tMap conditional.
Above this subjob I basically have an IF that triggers one of several different read functions depending on file type, as tFileInputExcel cannot handle both xls and xlsx at the same time. That data is then read into a pre-instantiated empty hash, and then I use that hash as the lookup. My only other thought would be to write them to a temporary delimited file instead. I actually don't want to store any other data in memory, which is why I'm writing the matched rows out to a file.
The articles you've listed are interesting, and I'll have to check them out. However at this time they don't seem pertinent to my situation. I'm not looking to filter anything. I'm actually looking to join the data, based on the criteria I listed above, along with a first match only requirement. For instance, this is my Expression Filter for the lookup in the tMap, which in SQL would be the join condition:
(row3.Call_Date.compareTo(row4.Rate_Begin_Date) >= 0) &&
(row3.Call_Date.compareTo(row4.Rate_End_Date) <= 0))
The first line of it just prevents any match after the first from flowing. After a row processes the global variable is set to be equal to that row, preventing further matches. I'd like to just be able to stop the join at that point, as it is wasted processing.
With just an expression filter, there is no option in tMap for First or Unique, only All Rows. The end goal is not filtering anything. It is joining.
You misunderstood me. You are looking to filter your Lookup data IF that will reduce the number of rows returned to match on. Your suggestion about a "Like" in SQL can be performed on your Lookup dataset before it enters the tMap achieving the functionality you want.
Having said that, I've just realised that in Talend 6.4 there is an update to the tMap which I believe provides the functionality you require. You can pre-filter inputs AND use the options you mentioned (First, Unique or All Rows). Take a look here for the Talend documentation: https://help.talend.com/reader/j~SccKoZW8kEeYb0teesxw/6evSw6fOmjwGpoLcD~869Q
Look for the green plus with the white arrow pointing right on both the Main input and lookup.
I believe you may be able to achieve the logic you require from this new addition. If not, I still think you can achieve it using my first suggestion.
I think I indeed might be misunderstanding you. I'm not sure how I could filter my lookup data without the input data. The relevance of the lookup data depends on that input data. There are 63k lookup rows, and maybe only one of all those will match, or maybe all of them will have several matches. How could you pre-filter the lookup rows without using the input rows? You won't know what lookup rows are relevant until you try to match the input rows to them. The only way to filter them would be to try to join them in the other direction, which has exactly the same limitations except that the larger dataset would be the lookup (which has issues on its own with overhead).
The conditions of the join (rephrased, not sure how I can make it more clear?)
1. Input.Called_Number.startsWith(Lookup.Dialed_Digits) and
2. Input.Start_Date.compareTo(Lookup.Start_Date) >= 0 and
3. (Lookup.End_Date is null) or (Input.End_Date.compareTo(Lookup.End_Date) <= 0
The second article is about the opposite of what I am trying to do. I'm trying to reduce row duplication, not induce it.
Talend 6.4 might be a solution. My company has primarily been working with Open Studio for Data Integration 6.1.1, so I'll see if I can check out the newer versions. We are trying to keep everyone in this group using the same version as there could be problems with compatibility otherwise.
If you use "Reload at Each Row" you can use your Main data to filter your Lookup data. If your Main row column Called_Number is "12345" you can send that to your Lookup data source in the globalMap in order to reduce the rows sent back. So in your case, if your Dialed_Digits records are ......
.... you could use a bespoke collection with a method which would use "12345" (from the Main row) as its input and only return the rows with....
....to your tMap for further processing.
If you want you can use it to perform all of your logic and only ever send one row at a time back to the tMap. That was the method I was talking about first and the method the two tutorials related to. This way you are not carrying out a cartesian join and comparing every Main row against every Lookup row in the tMap.
Clearly doing this (and in the example I used above) there will be the possibility for duplicates. If they occur then it is just a matter of working out how you want to remove duplicates (which of 1, 12, or 123 would be the best match for 12345....assuming the dates are the same for each) and use something like a tSortRow and a tAggregateRow to group by your duplicate key and return the first or last values for each group.
Watch the recorded webinar!
Accelerate your data lake projects with an agile approach
Create systems and workflow to manage clean data ingestion and data transformation.
Introduction to Talend Open Studio for Data Integration.