One Star

Joining large data sets

Could you please provide some technical insight into Talend's capabilities in handling large data set joins?
Specifically, I am interested in the following two scenarios:
1) One large table joins several small data sources (lookups).
- Is there a limit to the number of sources you can use in tMap?
- Can I explicitly set the join expression to perform complex joins?
Example: (Src1.FieldA=Src2.FieldA AND Src1.FieldB > Src2.FieldB AND Src1.FieldB < Src2.FieldC)
2) One large table joins with another large table.
- Ideally, I'm looking for a way to do row-by-row processing on pre-sorted data.
For context, let's say we're trying to join a 100 million row table from Database A to a 100 million row table in Database B.
- From what I can tell, the tMap component builds one large in-memory hash to join the data sets. Obviously, at some point you run out of system resources with this component. Is there another component available, or in the works, to handle this scenario?
Thanks,
Aaron
11 REPLIES
Community Manager

Re: Joining large data sets

Hi
- Is there a limit to the number of sources you can use in tMap?

No, there is no limitation.
- Can I explicitly set the join expression to perform complex joins?
Example: (Src1.FieldA=Src2.FieldA AND Src1.FieldB > Src2.FieldB AND Src1.FieldB < Src2.FieldC)

Yes, you can set complex joins in the filter fields of tMap: Src1.FieldA==Src2.FieldA && Src1.FieldB > Src2.FieldB && Src1.FieldB < Src2.FieldC
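For illustration, that filter expression behaves like an ordinary Java boolean check applied to each candidate row pair. A minimal sketch (the row classes are hypothetical stand-ins, not Talend-generated code):

```java
// Hypothetical row holders standing in for tMap input rows; not Talend-generated code.
class Src1Row { int fieldA; int fieldB; Src1Row(int a, int b) { fieldA = a; fieldB = b; } }
class Src2Row { int fieldA; int fieldB; int fieldC; Src2Row(int a, int b, int c) { fieldA = a; fieldB = b; fieldC = c; } }

public class JoinFilter {
    // The tMap filter expression, written as a plain boolean method.
    static boolean matches(Src1Row s1, Src2Row s2) {
        return s1.fieldA == s2.fieldA && s1.fieldB > s2.fieldB && s1.fieldB < s2.fieldC;
    }

    public static void main(String[] args) {
        Src1Row a = new Src1Row(1, 5);
        Src2Row b = new Src2Row(1, 3, 9); // same key, and 3 < 5 < 9 -> match
        Src2Row c = new Src2Row(1, 6, 9); // 5 is not > 6 -> no match
        System.out.println(matches(a, b)); // true
        System.out.println(matches(a, c)); // false
    }
}
```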
From what I can tell it appears that the tMap component does one large in-memory hash to join the data sets.

From TOS 2.3.0M1 on, you can allocate more memory to a job: go to Window --> Preferences --> Talend --> Run/Debug and change the VM arguments (e.g. -Xmx1024m).
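As background (my own illustrative sketch, not Talend source code): a hash-based lookup join loads the whole lookup table into a map keyed by the join field, while the main flow is streamed. That is why memory use grows with the lookup size, not the main-flow size:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashLookupJoin {
    // Inner-joins each main row against a lookup table on an integer key.
    // The entire lookup table lives in memory; the main flow is read row by row.
    static List<String> join(List<int[]> mainRows, Map<Integer, String> lookup) {
        List<String> out = new ArrayList<>();
        for (int[] row : mainRows) {            // streamed: one main row at a time
            String match = lookup.get(row[0]);  // O(1) probe into the in-memory hash
            if (match != null) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> lookup = new HashMap<>();
        lookup.put(1, "one");
        lookup.put(2, "two");
        List<int[]> main = new ArrayList<>();
        main.add(new int[]{1, 10});
        main.add(new int[]{3, 30}); // no lookup match: dropped (inner join)
        System.out.println(join(main, lookup)); // [1,10,one]
    }
}
```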
Best regards

shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: Joining large data sets

Thanks for the reply.
Unfortunately, large data sets like the ones in my example can easily exceed memory resources. Although configuring the memory allocation is useful, preloading all the data results in a severe limitation of the product.
What we really need is the ability to pull pre-sorted data from a buffer for row-by-row processing. This would let us work with data sets of unlimited size, a requirement in most of my ETL projects.
Any thoughts on when this may be included would be much appreciated.
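For reference, the row-by-row technique being asked for is a classic sort-merge join: with both inputs pre-sorted on the join key, only the current row from each side needs to be held in memory. A minimal sketch (my own illustration, not a Talend component; it assumes keys are unique on the right side, so duplicate key groups would need a small buffer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class MergeJoin {
    // Inner-joins two row streams that are pre-sorted ascending on their int key (index 0).
    // Only one row per side is in memory at a time, so input size is effectively unbounded.
    static List<String> join(Iterator<int[]> left, Iterator<int[]> right) {
        List<String> out = new ArrayList<>();
        int[] l = left.hasNext() ? left.next() : null;
        int[] r = right.hasNext() ? right.next() : null;
        while (l != null && r != null) {
            if (l[0] < r[0]) {
                l = left.hasNext() ? left.next() : null;   // advance the smaller side
            } else if (l[0] > r[0]) {
                r = right.hasNext() ? right.next() : null;
            } else {
                out.add(l[0] + ":" + l[1] + "+" + r[1]);   // keys match: emit joined row
                l = left.hasNext() ? left.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> a = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{4, 40});
        List<int[]> b = Arrays.asList(new int[]{2, 200}, new int[]{3, 300}, new int[]{4, 400});
        System.out.println(join(a.iterator(), b.iterator())); // [2:20+200, 4:40+400]
    }
}
```

Because each iterator is consumed forward only, the two sides can be cursors over pre-sorted database result sets rather than in-memory lists.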
- Aaron
One Star

Re: Joining large data sets

Hello
I'm jumping into this topic because a while ago I posted 1912, asking for Berkeley DB support. It is very efficient when it comes to dealing with millions of records (100M+ in my case, including lookups). So far I use Talend to develop the general infrastructure, but the database accesses and lookups are coded in Perl.
If this idea fits your need, maybe you could support this feature request, which in turn could get higher priority.
Thanks
Employee

Re: Joining large data sets

Hi,
We are currently working on a new tJoin component with a persistent hash lookup implementation: 2539. Your feedback is welcome :-)
Best Regards
One Star

Re: Joining large data sets

What about a Java component?
Employee

Re: Joining large data sets

Hello,
What about a Java component?

Done in 2692
Regards,
One Star

Re: Joining large data sets

I downloaded version 2.3.0M2 but I can't find this component. Where can I download 2.3.0RC1?
Employee

Re: Joining large data sets

I downloaded version 2.3.0M2 but I can't find this component. Where can I download 2.3.0RC1?

download page
Employee

Re: Joining large data sets

Hi,
The new tJoin component does not yet support BerkeleyDB files. It still uses an in-memory Perl hash to perform the lookup. This component is lighter than tMap and thus requires fewer resources.
The next version of tJoin will support an in-memory BerkeleyDB hash, which needs about three times less memory than a Perl hash. BerkeleyDB database files will also be supported for bigger lookup tables.
Hope it helps.
One Star

Re: Joining large data sets

Hello,
Is there a bug tracker entry for the next version (supporting Berkeley DB)?
What is the schedule for this second version?
Thanks in advance
Best regards
Philippe
Employee

Re: Joining large data sets

Hi,
Feature request 1912 relates to Berkeley DB input/output components. We have postponed the development of tJoin with BDB support to the 2.4 release because it depends on 963.
I will create the bug tracker feature request as soon as 963 is fixed.
Best Regards.