I need to calculate YTD totals for this year and previous year. Fiscal year runs from october 2010 -->september 2011. I created a SQL view to calculate YTD totals which works fine but due to performance issues, I am trying to convert it into a ETL. SQL is similar to as below: select a.yr, a.Dt, a.x1, a.x2, sum(b.cnt) as ytd_total from Fact_tbl a join Fact_tbl b on a.yr = b.yr and a.x1 =b.x1 and a.x2=b.x2 and a.dt >= b.dt join time_dim c on a.yr=c.yr and a.dt=c.date and (a.dt between c.year_start_date and c.year_end_date) group by a.yr, a.Dt, a.x1, a.x2 ) So, in my ETL I am using a tmap and doing a inner join between 2 fact_tbl which have data for current year and previous year. Also, I am doing a inner join between the lookup flow of fact_tbl and time_dim. After this I am having a hard time implementing the below: 1. condition to check if Dt is between the year start & end dates. I have got a filter on the input file in tmap which verifies that the Dt field is a date after the Year_start_date and before Year_end_date. I feel this condition is not working. 2. Calculating ytd_total. I am using this calculation for ytd_total.. row9.Dt.after(row8.Dt)|| row9.Dt.equals(row8.Dt)? (row8.cnt+ row9.cnt): 0 I dont think this is the right way to calculate the totals as I need to do a group by too. Please let me know if you have any thoughts on doing this? Regards,
Attached are the screen shots of the flow. Image 1 is the ETL flow Image 2 is the tmap inputs in which I verify that the Dt field is a date after the Year_start_date and before Year_end_date Image 3 is the tmap outputs in which I am doing a calculation for ytd_total in it. I tried using tAggregateRow too, but it just gives me the total for that day but not Year to date totals.
Difference between your SQL and tMap are: SQL tMap 1. Joined op1 to op2 on YR, X1 and X2 Joined on all of YR, X1, X2, X3 and most importantly DT 2. Condition for tm is on op1.DT Condition is on op2.DT 3. Between tm.YEAR_START_DATE and Condition is exclusive of tm.YEAR_START_DATE and tm.YEAR_END_DATE is inclusive tm.YEAR_END_DATE 4. Summing op2.cnt Summing op1.cnt+op2.cnt (For efficiency, you could move the comparison between op1.DT and op2.DT to be a filter on op2 and use either !op1.DT.before(op2.DT) or op1.DT.compareTo(op2.DT)>=0.) Then you need tAggregateRow to group by op4.YR, op4.DT, op4.X1 and op4.X2 and sum op4.cnt.
Hello Anut, you should also be careful when calculating YTD with joins. Normally you group by year and month of the transaction, but say you have a transaction for January (amount = 100) and one for April (amount = 150). Your query would give you : YTD January = 100 YTD April = 250 All the other YTD values (feb, mar) are not calculated and eventually assumed as zero, which is wrong. There is a SQL workaround for this, which is to create a cross product join with a table with the 12 moth, but performances are badly affected as you can imagine. For this reason in ETL different techniques must be applied. We developed a component to do that, but it's not released to the public yet, it is based on a java library (which we produced) that does not use joins to do the calculations, instead it uses a memory buffer and it scans all the periods. See in the attach, as an example month 2007 08 did not have values, still YTD is calculated BR Francesco
Thanks for the reply. In the ETL before, I am doing a cross join between fact & time dimension table. Doing so, for each "group by" combination I am creating a record for every date. so my data looks similart to this. yr dt x1 x2 cnt 2010 10/1/2009 toy1 ca 1 2010 10/2/2009 toy1 ca 8 2010 10/3/2009 toy1 ca 0 2010 10/4/2009 toy1 ca 0 ' ' ' ' ' ' ' ' ' 2010 9/28/2010 toy1 ca 0 2010 9/29/2010 toy1 ca 0 2010 9/30/2010 toy1 ca 5 On this data, I am doing a self join. Using SQL, I get the values as below: yr dt x1 x2 cnt 2010 10/1/2009 toy1 ca 1 2010 10/2/2009 toy1 ca 9 2010 10/3/2009 toy1 ca 9 2010 10/4/2009 toy1 ca 9 ' ' ' ' ' ' 2010 9/28/2010 toy1 ca 9 2010 9/29/2010 toy1 ca 9 2010 9/30/2010 toy1 ca 14 Using taggregateRow, I get the values as in input only which is my issue. yr dt x1 x2 cnt 2010 10/1/2009 toy1 ca 1 2010 10/2/2009 toy1 ca 8 2010 10/3/2009 toy1 ca 0 2010 10/4/2009 toy1 ca 0 ' ' ' ' ' ' 2010 9/28/2010 toy1 ca 0 2010 9/29/2010 toy1 ca 0 2010 9/30/2010 toy1 ca 5
As per the example screen shot, I understand the new component will help me in acheiving what I want to. Not sure, if this can be waited until the availability of the component. Can you provide me the techniques I can use?? Please let me know. Thanks again for your time.
Have you fixed your tMap to correct the difference I pointed out? Put up a new screenshot and one of tAggregateRow. Are skipped dates a problem with your data? It's easy enough to solve this with existing Talend components if need be.
attached are the screenshots with modifications of differences you pointed out. skipped dates is not my problem. as mentioned in earlier example, issue is data from taggregateRow is similar to my input data . I am not getting a upto date total. Thanks for your time.
You still have joined op1 to op2 on all of YR, X1, X2, X3 and DT. Of course you are not getting a total across multiple op2 DTs since you are joining on the date i.e. op1.DT=op2.DT. You also still have your condition for tm excluding YEAR_START_DATE and YEAR_END_DATE, whereas in your SQL "Between" would include them.
Hi Alevy, One last thing.. In the ETL, before aggregation I did a cross join with time dimension table and created a record for each date. This created a huge data set and I am getting a java memory error when I run it. So I decided to aggregate initially and then include the missing /skipped dates. I cannot get the logic that will get the YTD total from previous date for a skipped date. Any suggestions?? Thanks.
I suggest using tLoop --> tIterateToFlow to create a flow of all the dates you need (in ascending order). Then do a left-join lookup of the values you have by date, giving you a flow of all dates but with the values for some dates being null. Follow that with a tJavaRow with code like the following: output_row.DT = input_row.DT; if (input_row.cnt!=null) output_row.cnt = input_row.cnt; That will pass through the values that you do have but where you are missing a value, the previous row's value will be re-used.
Alevy, I used the code that you mentioned and is working great. But the issue I am having is, I have multiple groups of records as mentioned below and 2 fiscal years for each group. In each group, a missing value should be obtained from previous row until the end of fiscal year. At the begining of fiscal year, if the value is 0 it should remain 0 until there is a value. yr dt x1 x2 cnt 2010 10/1/2009 toy1 ca 1 2010 10/2/2009 toy1 ca 8 ----- ----- --- -- 2010 9/30/2010 toy1 ca 40 2009 10/1/2008 toy2 vt 1 2009 10/2/2008 toy2 vt 1 2009 10/3/2008 toy2 vt 5 - - - - - - - - - - - - - -
But using the above code, for few groups previous row value is being passed to another group or the same group if the value at beginning of fiscal year is 0. I have included the condition to check if the date is in between the fiscal year start date & fiscal year end date. I have been trying couple of things, but it is not working as I expect. Please let me know if you have any thoughts.
Try modifying the tJavaRow as follows: output_row.DT = input_row.DT; if (input_row.cnt!=null) output_row.cnt = input_row.cnt; else if (<current record's fiscal year end date>.after(context.FiscalYear)) output_row.cnt = 0; context.FiscalYear = <current record's fiscal year end date>; where context.FiscalYear is a Date, initialised with any date that is before the end of the earliest fiscal year e.g. 01/01/1900. That will reset the output to zero each time the fiscal year changes (including on the first record) and then carry that zero forward until you have an incoming value.