I know how to do traditional joins, where you join on two values being equal, but what about if you want to join on the value that is closest to the one you have?
I have a function in MySQL which returns me a value from the row with the closest timestamp to the one I provide to it.
We are interested in moving some functions into talends logic and I was wondering if it was possible to do this kind of thing?
I'm not sure if there are enterprise edition tools which do this, but you can approximate this with a tIntervalMatch component:
https://help.talend.com/display/TalendComponentsReferenceGuide61EN/tIntervalMatch
tIntervalMatch requires a range, though. It won't compare to a single column and find the closest value.
There is also tFuzzyMatch, but I'm not sure if that works with date columns.
I appreciate the suggestions but in the given scenario I will not know the interval, this is why I need "closest".
OK, there is more you need to take into consideration here. For example, the example I gave is one thing. Here are few more....
1) Are values made unavailable to other records once they've been matched? So for example, if you have 3 dates to match and 6 dates in the lookup table, once you have matched one of the 3 dates is the date in the lookup table it was matched to removed from the set?
2) What if a date is directly between two dates which are exactly the same distance away either side?
3) If you can only match 1 date to each lookup date, you will need to pay some special attention to the order in which you carry out the matches. What if the first row has a date that matches a lookup date to 2 days, then the last row matches that same lookup date to 1 day?
This is quite a complex problem to solve here. Possible, but you need to refine the requirements
1) No the matches are not removed, if two "main" rows match to the same "lookup" row, that's fine.
2) If there are two equidistant matches, I wouldn't be bothered which one was picked.
3) --
I can't think of what else I can say to help narrow down the requirement, ideally there would be a component that takes a main table and a lookup table, and retrieves info from the nearest lookup row for a particular column.
OK, the way I would look at this initially would be to convert your dates to milliseconds. You will need to use a tMap with your main flow and your lookup unconnected. You will need to use a series of tMap variables to keep track of your data and the most recent shortest distance between dates. Take a look at this tutorial because it does something similar (https://www.rilhia.com/quicktips/quick-tip-compare-row-value-against-value-previous-row).
The thing you need to be aware of is that tMap variables keep their values between rows. So you can keep track of the row which has the nearest match if you keep it as a series of variables. When the main row has gone through all of the lookup rows (this will be a cartesian join) you will need to be able to identify this to release the values from your tMap variables to the output flow.
What will happen is that if you have 100 input rows and 50 lookup rows, you will pass 5000 rows out of the tMap. The next component will be a tAggregate row. You will use this to group by whatever you have used for your main row key. Then, as you know that the last row of the group will contain your nearest match, you will set the aggregate function on all of the columns to be "LAST". This will reduce your data back to 100 rows and you will have joined to the nearest date in your lookup.
Keep in mind that the difference between dates could be positive or negative. If you don't care which way you go, you will need to make sure that you convert negative date differences to be positive.
Sorry it's a bit of a brain dump, but I think you should be able to do it with that.
How do we know for the tAggregateRow that the "LAST" row will be the one with the closest value?
Because of the logic used to set the tmap variables that hold row of the nearest match.
Lets assume we have the following variables; date_difference (long), main_row_key (String), Value1 (String) and Value2 (String).
When your main row key column (must be unique) is changed, then you need to set the main_row_key variable and set the date_difference variable to the first difference. After that is done, for every row where the main_row_key is the same, you check the calculated date difference and when it is lower than the held date_difference variable, you update the variables that will hold your lookup values.
Main Row
ID (String): A123
Date (milliseconds): 1494577714032
Lookup Row 1
Date (milliseconds): 1494577710032
Value1: Hello
Value2: World
Lookup Row 2
Date (milliseconds): 1494577713032
Value1: Goodbye
Value2: My Friend
Lookup Row 3
Date (milliseconds): 1494577716032
Value1: Hey
Value2: Whats Up
When the main row arrives in the tMap it will have a key of "A123". If we have a variable called "main_row_key", we check the current held value of this. If it is different to the key coming in (as it will be) we set the "main_row_key" to "A123" and also calculate the date difference between the Date fields (for the first lookup row this will be 4000). This will be our first value so naturally it will be considered the current "smallest gap". We therefore set some tmap variables to hold the Value1 and Value2 values for this lookup row. These get passed to the output flow.
The next lookup row that arrives will cause the main row to be duplicated which will have the same key as the currently held "main_row_key". Therefore we know we are still calculating the same main row. We therefore need to test the difference between the held date_difference value and the calculated date difference. If the calculated date difference is less than the date_difference variable value, the date_difference variable value needs to take that value AND the Value1 and Value2 tmap variables need to take the values of the incoming lookup row. For lookup row 2 the date difference is 1000. This is lower than 4000, therefore we know the dates are closer. The Value1 and Value2 variables are now set to "Goodbye" and "My Friend". These values are passed to the output flow.
The next lookup is carried out the same way but this time the date difference is greater than the held date_difference value (2000 is greater than 1000). Therefore nothing held as tMap variables changes. The previous values remain and are passed to the output flow for this row.
So, in a long winded and clumsy way (sorry, I find it tough describing the level of detail needed here in a shorthand manner), we have seen that the lookup row details of the row with the smallest date difference are always held until the last row of the main row group. As such the tAggregateRow can guarantee to return the correct details if set the aggregate logic to LAST for all of the rows that are grouped by your main row key.
I think I explain this in the tutorial I gave you. At least I give you screenshots which should make it a bit clearer how to achieve this.
Good luck. This is quite fiddly, but once you have figured this out, it is a very useful technique to help solve lots of different problems. One thing I forgot to mention is that the tMap variables are processed in order (from top to bottom). This logic (if you follow the tutorial) is based around comparing the value of the next tMap variable in the computation of the tmap variable above. This is how you can take a peak back at the last row value. Once the variable above has been set, the variable you were looking back at is updated to be the current row variable ready for the next row's "look back". It is quite tricky to explain (hence the long winded explanation) but really useful and not many people are aware of it.
Thank you for all your help, I think I managed to get this working.
Unfortunately, it seems to be slower than doing things directly in the MySQL, so it is not useful to me.
That's what this was about, though; an experiment to see what can be done in Talend, so I have what I came for. Thanks again.
Performing logic like this inside a database will always be quicker. If all your data is in the same database and you want performance to be optimal, I would always advocate carrying out logic like this in the database. Otherwise you will incur the latency of retrieving the data from the database, network performance, then processing the data in memory in Java. I assumed that the data was coming from different sources when I answered this. It was also an interesting problem to look at :-)