Hello, I am trying to implement the following SELECT statement in TOS 2.2.4:

SELECT 2_mil.id_key, mi.col_1, mi.col_2, mi.col_3
FROM 8_million_records 8_mil, 2_million_records 2_mil, 500_records me, tmp_file_data mi
WHERE 8_mil.id_key_alt = mi.SecurityId
AND 8_mil.id_typ_alt_key = 'I'
AND 8_mil.id_key = 2_mil.id_key
AND me.id_ctry_std IN ('A', 'B', 'C', 'D', 'E')
AND me.id_exchangekey = 2_mil.id_exch_pr
AND 2_mil.id_replaced = 'N'
AND 2_mil.id_del_grd = '0'
AND me.id_del_grd = '0'
When I tried to join all the tables in one tMap I never received a result, so I split the SELECT into several sub-joins. Can you please confirm that this is the right way to implement this SELECT in Talend, or is there perhaps a better option in TOS 2.2.4 or TOS 2.3.0? The inputs use the following queries:

8_million_records: "select id_key, rtrim(ltrim(id_key_alt)) id_key_alt from 8_million_records where id_typ_key_mil = 'I'"
2_million_records: "select id_key, id_exch_pr from 2_million_records where id_replaced = 'N' and id_del_grd = '0'"
500_records: "select id_exchangekey from 500_records where id_ctry_std IN ('A', 'B', 'C', 'D', 'E') and id_del_grd = '0'"
The input file has about 1,500 records. I see your point, but how will it affect performance? Talend seems to be quite quick at loading a large number of records from a database table in one batch and then performing a hash-map join in memory against a small input file. However, if a big table is used as an input, then Talend will presumably not be able to fetch records from the database in one batch, since it has to verify each record against a lookup file before fetching the next one. Wouldn't the database operation become a bottleneck in that case? Anyway, thanks for the answer - I will try it. I know Talend introduced a tJoin component in version 2.3.0, which seems to be dedicated to working with large amounts of data. Do you think some combination of the tJoin and tMap components could give an advantage for this SELECT?
In fact, all data loaded into the lookups of a tMap is implicitly loaded into an in-memory hash, so don't worry about performance on this point. The drawback is that we can't load a very large amount of data into a lookup; the limit depends on the number of columns, the types of the columns and the length of the data. We are working on a file-based hash; I hope this option will be available in 2.4 M1 to get round this limitation. In your case, the choice of the Main flow depends on whether all 8 million rows must be fetched or it is just an existence check, but I advise you to use your 8_million_records as the Main of the tMap if you get a heap memory error. Don't forget to trim your lookup keys where possible. You can set 2_million_records, tmp_file_data and 500_records as lookups in the same tMap.
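To illustrate the idea described above (this is a toy Python sketch of a hash join, not Talend's actual generated Java code; all table and column names are placeholders): the lookup flow is materialized into an in-memory hash keyed on the trimmed join column, and the main flow is streamed against it, costing one hash probe per row.

```python
# Sketch of how a tMap lookup behaves: load the small (lookup) flow into a
# hash, then stream the large (main) flow against it. Illustrative names only.

def hash_join(main_rows, lookup_rows, main_key, lookup_key):
    # Build the in-memory hash from the lookup flow (trim keys, as advised).
    lookup = {}
    for row in lookup_rows:
        lookup[row[lookup_key].strip()] = row
    # Stream the main flow; each row costs one hash probe, not a DB round-trip.
    for row in main_rows:
        match = lookup.get(row[main_key].strip())
        if match is not None:          # inner join: drop non-matching rows
            yield {**row, **match}

main = [{"id_key": "1", "a": "x"}, {"id_key": "2", "a": "y"}]
look = [{"id_key": " 1 ", "b": "z"}]
print(list(hash_join(main, look, "id_key", "id_key")))  # one joined row
```

Trimming the keys (as with the rtrim(ltrim(...)) in your 8_million_records query) matters here: " 1 " and "1" would otherwise never match.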
Hi, I have tried to use the 8_million_records table as the Main row and compared it with the results I got when I used the file as the Main row. Regarding performance, the results are basically the same:

design_001: Starting job test_big_join at 11:47 04/02/2008. Job test_big_join ended at 11:50 04/02/2008.
design_002: Starting job test_big_join_2 at 11:38 04/02/2008. Job test_big_join_2 ended at 11:41 04/02/2008.
old_design_001: Starting job test_big_join_3 at 11:56 04/02/2008. Job test_big_join_3 ended at 11:59 04/02/2008.

However, this test raised other questions: design_001 has 241 records in the final output (as indicated by the Talend statistics), which is consistent with the number of records I received in design_002. However, in old_design_001 I received 227 records. All queries were run against the same record set, they use the same SELECTs, and all corresponding components received the same number of records from the database tables and the file in all design diagrams. The Talend statistics show that tMap_1 returns a different number of records for design_001 and old_design_001. However, the only difference between these two designs is the order in which the two inputs have been joined (as a matter of fact they were created through the Talend copy-paste option for components), after which I only adjusted the tMap_1 component. I have double-checked the tMap_1 component for design_001 and old_design_001. For both diagrams tMap_1 looks the same to me, apart from the input order. Can you please advise why I received a different number of records when I changed the join order in the tMap_1 component? Thanks Michael
Be careful, tMap does not work like a SQL query: its behavior depends on each lookup's configuration AND on the order of the input tables, which is essential:
- if you set your lookup to "UNIQUE MATCH", "FIRST MATCH" or "LAST MATCH", then the lookup is considered a filter for all relative tables (Main or Lookups) positioned above this table in the tMap. Hence it will select ONLY ONE match per row (1 main row x 1 lookup row = 1 output row max).
- if you set your lookup to "ALL MATCHES", then the lookup is considered a multiplier of the rows of the relative tables (Main or Lookups) positioned above this table. Thus it will select only THE MATCHING rows, which will accordingly multiply the results in the outputs (1 main row x N lookup rows = N output rows max).
- if you set your lookup to "ALL ROWS", the lookup will be considered a multiplier of the rows of the relative tables (Main or Lookups). Thus it will select ALL rows, which will accordingly multiply the results in the outputs (1 main row x ALL lookup rows = ALL output rows max).
For nearly all these lookup modes you can set additional filters (in the Filter area). Actually, you must first think about prioritizing your input data flows according to the output you want to get, and not think like in SQL, where order has no effect. I hope these indications will help you. A note on tJoin: this component currently seems to be slower than tMap; it will be improved soon.
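The filter-versus-multiplier distinction between the match modes described above can be sketched in a few lines of Python (a toy simulation under the semantics stated in this post, not Talend's generated code):

```python
from collections import defaultdict

# Toy simulation of the tMap match modes described above (inner join only).
def tmap_join(main_rows, lookup_rows, key, mode):
    buckets = defaultdict(list)
    for row in lookup_rows:          # lookup rows are loaded first, in order
        buckets[row[key]].append(row)
    out = []
    for m in main_rows:
        matches = buckets.get(m[key], [])
        if not matches:
            continue                 # inner join: reject non-matching main rows
        if mode == "FIRST_MATCH":
            out.append((m, matches[0]))          # 1 x 1 -> 1 row max
        elif mode in ("LAST_MATCH", "UNIQUE_MATCH"):
            out.append((m, matches[-1]))         # keeps the last loaded key
        elif mode == "ALL_MATCHES":
            out.extend((m, l) for l in matches)  # 1 x N -> N rows
    return out

main = [{"k": 1}]
look = [{"k": 1, "v": "a"}, {"k": 1, "v": "b"}]
print(len(tmap_join(main, look, "k", "ALL_MATCHES")))         # 2
print(tmap_join(main, look, "k", "UNIQUE_MATCH")[0][1]["v"])  # b
```

With duplicate keys in the lookup, the single-match modes keep the output at one row per main row, while "ALL MATCHES" multiplies it, which is exactly why changing the lookup configuration or input order changes the record counts.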
Thanks for the reply. "ALL MATCHES" works the same way a SELECT does. Can you give a simple example with two record sets (main and lookup) and the outputs that should be produced for "Inner Join" and "UNIQUE MATCH", to demonstrate how it works? Can you also provide more details and/or examples to illustrate what you mean when you say: "Actually, you must think first about prioritizing your input data flows according to the output you want to get"? What do you mean by prioritizing data according to the output I want to get?
Console:

WARNING: UNIQUE MATCH is configured for the lookup 'row2' and it contains more than one result for keys: row2.ID_LOOKUP1 = '1'
REJECT INNER JOIN|1|MAIN ligne1|1|LOOKUP ligne6
WARNING: UNIQUE MATCH is configured for the lookup 'row2' and it contains more than one result for keys: row2.ID_LOOKUP1 = '2'
REJECT|2|MAIN ligne2|2|LOOKUP ligne8
REJECT INNER JOIN|3|MAIN ligne3||
REJECT|4|MAIN ligne4|4|LOOKUP ligne4
REJECT INNER JOIN|5|MAIN ligne5||
REJECT INNER JOIN|6|MAIN ligne6||
WARNING: UNIQUE MATCH is configured for the lookup 'row2' and it contains more than one result for keys: row2.ID_LOOKUP1 = '7'
OUT|7|MAIN ligne7|7|LOOKUP ligne10
REJECT INNER JOIN|8|MAIN ligne8||
OUT|9|MAIN ligne9|9|LOOKUP ligne9

Comments on the console output: "Unique match" mode checks the uniqueness of the matching keys, so more than one identical key triggers a warning message such as the ones visible above. NOTE: a bug with "Unique match" shows the warning only from the third duplicate on; this bug is fixed and the fix will be available in 2.3.1.
-------------------------------------------------------------
Output rows for OUT:

OUT;7;MAIN ligne7;7;LOOKUP ligne10
OUT;9;MAIN ligne9;9;LOOKUP ligne9

Comments on the output rows for OUT: "Unique match" implies that only one row per matching key will match an input row; more precisely, it works the same as "Last match" mode, which retains only the last loaded row among identical keys. In this example the lookup line "7;LOOKUP ligne7" is loaded before the last row with the identical key, so it will never match. Only rows with row1.ID_MAIN > 4 are valid, and rows 5, 6 and 8 are rejected because they do not exist in the lookup. So rows 7 and 9 remain.
-------------------------------------------------------------
Output rows for REJECT:

REJECT;2;MAIN ligne2;2;LOOKUP ligne8
REJECT;4;MAIN ligne4;4;LOOKUP ligne4

Comments on the output rows for REJECT: Here, rows 2 and 4 are rejected by the filter on OUT.
-------------------------------------------------------------
Output rows for REJECT INNER JOIN:

REJECT INNER JOIN;1;MAIN ligne1;1;LOOKUP ligne6
REJECT INNER JOIN;3;MAIN ligne3;;
REJECT INNER JOIN;5;MAIN ligne5;;
REJECT INNER JOIN;6;MAIN ligne6;;
REJECT INNER JOIN;8;MAIN ligne8;;

Comments on the output rows for REJECT INNER JOIN: Here, row 1 is rejected by a filter in the lookup; the other rows are rejected because they do not exist in the lookup.
For your second question:
Take as an example a table "CUSTOMERS" and a table "ORDERS". To map all orders to their customers, the right way is to put the "ORDERS" table as the main input table and the "CUSTOMERS" table as the lookup table, if you select one of the single-match lookup modes. That will be the most efficient architecture. As you say, you can also invert the tables and use "All matches" mode, but much more data will be loaded in memory, which may trigger a Java heap memory exception. The choice depends on you and your machine's capabilities.
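To make the memory argument concrete, here is a rough sketch (Python, illustrative names) of why only the lookup side must fit in memory while the main side can be streamed row by row:

```python
# Why the main/lookup choice matters for memory: only the lookup flow
# is materialized in a hash; the main flow is consumed one row at a time.
def join_orders_to_customers(orders_stream, customers):
    # CUSTOMERS as lookup: one dict entry per customer, however many orders.
    by_id = {c["customer_id"]: c for c in customers}  # memory ~ len(customers)
    for order in orders_stream:                       # memory ~ 1 row at a time
        cust = by_id.get(order["customer_id"])
        if cust is not None:
            yield {**order, "customer_name": cust["name"]}

customers = [{"customer_id": 1, "name": "ACME"}]
orders = ({"order_id": i, "customer_id": 1} for i in range(3))  # generator: streamed
print(sum(1 for _ in join_orders_to_customers(orders, customers)))  # 3
```

Inverting the two (ORDERS as lookup) would force the whole, typically larger, ORDERS set into the hash, which is the heap-memory risk mentioned above.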
Ok, thanks, I see the purpose of "Unique match".

"Take as an example a table "CUSTOMERS" and a table "ORDERS". To map all orders to their customers, the right way is to put the "ORDERS" table as the main input table and the "CUSTOMERS" table as the lookup table, if you select one of the single-match lookup modes. That will be the most efficient architecture."

I agree that such a design is usually more efficient for an ETL tool when "CUSTOMERS" and "ORDERS" form a classical star schema, where ORDERS is a fact table and CUSTOMERS is a dimension. However, I have a different scenario: 1) our database is an OLTP system, not a data warehouse. 2) Even though the file, which has to be loaded into the database, has fewer records than the lookup table, logically it has the same meaning as ORDERS.

"As you say, you can also invert the tables and use "All matches" mode, but much more data will be loaded in memory, which may trigger a Java heap memory exception. The choice depends on you and your machine's capabilities."
I am not sure I follow you. If I use a slowly changing dimension type 2 table as a lookup, then possibly I need the latest record from that table, because most likely it will be the current/active record. In that case I agree with you. However: 1) you've recommended using the data flow that I need to load into the database as a lookup, because it has fewer records than the lookup table. That's ok. However, if I now use "Unique match" to join my original data flow (which has become the lookup) with the lookup table (which has become the main flow), Talend will exclude all new records from my original data flow for a given ID except the last one. I.e. for a given ID, Talend will load into the database only the record that happened to be the last record in the file for that ID, ignoring all other records with that ID. Obviously that is not what I want. As far as I understand, if I am not using slowly changing dimensions type 2, "Unique match" will lead to an error. Can you advise if I have missed anything? 2) If I have two Sybase inputs for, let's say, ORDERS and CUSTOMERS, which I join in a tMap through the "Unique match" option, how will Talend load data from these tables: will it load everything from CUSTOMERS into memory, then everything from ORDERS, and then join them in memory producing a new output, excluding duplicated records? Or will Talend keep in memory only the last records for CUSTOMERS? If Talend loads and keeps all required data from CUSTOMERS and ORDERS in order to produce both a normal output and a rejected output, then I am not sure how "Unique match" can be more efficient from a memory perspective.