tPartitioner, tCollector & tRecollector

Seven Stars

I'm using TIS 5.2 EE. This version provides a set of components, tPartitioner, tCollector and tRecollector, which are used to partition data on a key and process it in parallel. Looking at them more closely, I found that tPartitioner has an option called "Number of child threads". I believe this option decides how many threads (in other words, processors) the data is partitioned into and processed on.
My question: is there any option to process the partitioned data on different job servers and recollect it at the end? Even if my data is processed on different processors, it is still eating up the hardware of my only server. I'd also like to know exactly how this works internally. This is really important for me in designing a process for my project.
Community Manager

Re: tPartitioner, tCollector & tRecollector

Hi
I'm afraid this is a feature that has not been developed.
I suggest that you open a new "feature request" in our JIRA bugtracker, if you don't mind: https://jira.talendforge.org/TDI
Many thanks
Elisa
Employee

Re: tPartitioner, tCollector & tRecollector

nishadvjoshi -
There is no automatic way to send the data partitioned by the tPartitioner to different servers. The tPartitioner partitions the data and starts N threads on the machine the job is running on. I typically set the number of threads to the number of cores - 1.
- tPartitioner: breaks up the input dataflow into "buckets" or partitions, one for each thread defined.
- tCollector: start of the thread. It reads data from its assigned bucket.
- tDepartitioner: end point of a thread's dataflow. Notice there can be more than one. It ends the thread and creates an output flow to be picked up by a tRecollector.
- tRecollector: new thread that reads from the associated tDepartitioner dataflow.
The Recollector and Departitioner allow the dataflows from the threads to be combined back into a single dataflow. Notice that the Partitioner "starts" the Recollector. The Recollector is one thread that reads the internal memory queues that each of the threads dumps its data into (think of the Departitioner as an internal memory queue, one per thread).
You do not have to set a partition key. That is optional and depends on your job/data. You do not have to use the tDepartitioner/tRecollector either; they are only needed if you need to unite your data for further processing in the job.
You can tune the job based on the hardware you have the job running on by setting the number of threads, the buffer size and setting a partition hash key. However, these components were designed to process data in parallel on a single machine.
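To make the flow described above more concrete, here is a rough plain-Java analogy. It is not Talend's generated code or its internal implementation; the class name `PartitionSketch`, the methods `suggestedThreads` and `run`, the end-of-stream marker and the choice of hashing the whole row as the partition key are all made up for illustration. It only shows the general shape: one bounded bucket per thread, one worker per bucket, one output queue per worker, and a single recollecting thread, all on one machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the partition / collect / departition / recollect flow.
class PartitionSketch {
    // End-of-stream marker, compared by identity so no real row can collide with it.
    private static final String EOF = new String("<EOF>");

    // The "number of cores - 1" rule of thumb mentioned in the post above.
    static int suggestedThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
    }

    static List<String> run(List<String> rows, int threads, int bufferSize,
                            UnaryOperator<String> work) {
        List<BlockingQueue<String>> buckets = new ArrayList<>(); // partitioner side
        List<BlockingQueue<String>> outputs = new ArrayList<>(); // departitioner side
        for (int i = 0; i < threads; i++) {
            buckets.add(new ArrayBlockingQueue<>(bufferSize)); // bounded buffer per thread
            outputs.add(new LinkedBlockingQueue<>());          // unbounded in-memory queue
        }

        // One worker per bucket: reads its bucket (collector role), processes each row,
        // and writes to its own output queue (departitioner role).
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            BlockingQueue<String> in = buckets.get(i);
            BlockingQueue<String> out = outputs.get(i);
            Thread t = new Thread(() -> {
                try {
                    for (String row = in.take(); row != EOF; row = in.take()) {
                        out.put(work.apply(row));
                    }
                    out.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }

        // A single recollector thread drains every output queue into one result list.
        List<String> result = new ArrayList<>();
        Thread recollector = new Thread(() -> {
            try {
                for (BlockingQueue<String> out : outputs) {
                    for (String row = out.take(); row != EOF; row = out.take()) {
                        result.add(row);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        recollector.start();

        try {
            // Partition step: hash each row to a bucket (the whole row acts as the key here).
            for (String row : rows) {
                buckets.get(Math.floorMod(row.hashCode(), threads)).put(row);
            }
            for (BlockingQueue<String> bucket : buckets) {
                bucket.put(EOF);
            }
            for (Thread t : workers) {
                t.join();
            }
            recollector.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        }
        return result;
    }
}
```

Calling `PartitionSketch.run(rows, PartitionSketch.suggestedThreads(), 1000, String::toUpperCase)` would return every input row in upper case. In this sketch the recombined rows come back grouped by worker rather than in their original order, and everything still runs inside one JVM on one machine, which is the point made above.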
Seven Stars

Re: tPartitioner, tCollector & tRecollector

Hi wfox,
This is really useful information, thank you very much for that insight. I'm planning to design a package to process data in parallel. While looking at the TAC console, I noticed that we can add virtual servers by grouping two or more servers. In that case, will the virtual server be treated as one whole server by these components when they partition and process data, or not?
One Star

Re: tPartitioner, tCollector & tRecollector

In Partition Column, enter the name, without any quotation marks, of the partition column of the Hive table you want to write data in.
In Partition Value, enter the value you want to use, in single quotation marks, for its corresponding partition column.
This does not work. Why?

Bruno