Working with Bigdata Pig

I tried to use the Talend tPig* components with Hadoop, but I ran into a few problems I could not solve. Maybe there is a way I have not found? If anyone has a workaround or any positive experience working with "Talend Pig", I'd like to hear about it.
The problems are listed below.

UNION is absent.
Pig example:
-- The table '/calls' contains phone calls.
-- I want a single table that combines incoming and outgoing calls.
call = LOAD '/calls' USING PigStorage(';') AS (subsfrom: chararray, substo: chararray, date: chararray);
C1 = FOREACH call GENERATE subsfrom, date, 'OUT' as direction:chararray;
C2 = FOREACH call GENERATE substo, date, 'IN' as direction:chararray;
U = UNION C1, C2;
STORE U INTO '/call1' USING PigStorage(';');
I didn't find any way to implement this example with the Talend tPig* components.

tPigLoad has only one output.
The example is the same as above. Even if there were a "union" component, I would have to use two identical tPigLoad components, or more than two if I needed the table more than twice.

tPigAggregate has limited functionality.
See the example above, where I need to generate a simple constant column, "direction". I would like to do this with tPigAggregate, but the only way I found was to use tPigCode.

tPigJoin has only one input.
Pig example:
phone = LOAD '/phone' USING PigStorage(';') AS (subs: chararray, churn: int);
call = LOAD '/calls' USING PigStorage(';') AS (subsfrom: chararray, substo: chararray, date: chararray, type: chararray);
-- Suppose we want the list of SMS from customers that have the flag phone.churn = 0
fphone = FILTER phone BY (churn==0);
fcall = FILTER call BY (type=='sms');
J = JOIN fphone BY subs, fcall BY subsfrom;
But tPigJoin accepts only one filtered table as input, not two (or three, four, etc.).

tPigCross has only one input.
The example is similar to the previous one.

There is no way to use SPLIT.
SPLIT churn INTO churnvoice IF type=='voice', churnsms IF type=='sms', churnmms IF type=='mms', churngprs IF type=='gprs', churnussd IF type=='ussd', churn0 OTHERWISE;
Of course we could try to use tPigFilter instead, but because tPig components have only one output, this works only with six separate chains of tPigLoad + tPigFilter components.
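For comparison, here is a rough sketch of what the FILTER version looks like in plain Pig. Like the SPLIT line above, it assumes that churn is an already loaded relation with a chararray field named type. Only three of the branches are shown; the others follow the same pattern.

```pig
-- Assumption: 'churn' is an existing relation with a chararray field 'type'.
-- Every FILTER reads the same relation, so the data is loaded only once.
churnvoice = FILTER churn BY type == 'voice';
churnsms   = FILTER churn BY type == 'sms';
churnmms   = FILTER churn BY type == 'mms';
```

In a script, one relation can feed any number of statements. That is exactly what a component with a single output cannot express.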

Where is LIMIT?
I expected to find it in the tPigStoreResult component, but it is not there.
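To be clear about what I mean, this is the plain Pig equivalent. The row count 10 and the output path '/calls_sample' are placeholders I made up for illustration.

```pig
call = LOAD '/calls' USING PigStorage(';') AS (subsfrom: chararray, substo: chararray, date: chararray);
-- Keep at most 10 rows; without a preceding ORDER BY, which rows are kept is not guaranteed.
sample10 = LIMIT call 10;
-- '/calls_sample' is a hypothetical output path.
STORE sample10 INTO '/calls_sample' USING PigStorage(';');
```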

And what about authentication?
The tHDFS* components (for example tHDFSConnection) have a username field, but the tPig* components have nothing similar. As a result, I can work with Hadoop only when authentication is switched off there.

-- Dmitriy.

