One Star

tExtractRegexFields Usage

Hi All,
I am fairly new to Talend and am still learning my way around. I'm trying to get data from a database, parse each record, then write the data to a target. I've already created a job that can read data from the source and write to a target, but I'm having a little trouble with parsing the data records. I also need to use a Regular Expression to locate the text in each record.
There is a new component in Talend v 2.4 called tExtractRegexFields that I believe is exactly what I'm looking for, but there is no documentation or examples on how to use it. My questions are: has anyone used this component, and are there any examples of its usage? One other note: I am using a Java project in an attempt to get this component to work.
Thanks
Community Manager

Re: tExtractRegexFields Usage

Hello
The tExtractRegexFields component is used to split one column into multiple columns. Here is a Java scenario.
Input file:

1,2,3, meeting with John ,5
1,2,3, go to the gim ,5
1,2,3, meeting with Mary ,5
1,2,3, go to training ,5
1,2,3, diner with kids ,5
1,2,3, diner with parents ,5
1,2,3, go fishing with friends ,5

Result:
Starting job feature3713_tExtractRegexFields at 13:30 23/07/2008.
.---------+-------------------------+--------+-------------.
|                        tLogRow_1                         |
|=--------+-------------------------+--------+------------=|
|day      |due                      |priority|place        |
|=--------+-------------------------+--------+------------=|
|Monday   | meeting with John       |high    |office       |
|Tuesday  | go to the gim           |low     |SuperGim Club|
|Wednesday| meeting with Mary       |low     |office       |
|Thursday | go to training          |high    |IT center    |
|Friday   | diner with kids         |highest |home         |
|Saturday | diner with parents      |highest |Dad's home   |
|Sunday   | go fishing with friends |high    |home         |
'---------+-------------------------+--------+-------------'
Job feature3713_tExtractRegexFields ended at 13:30 23/07/2008.

Best regards
shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: tExtractRegexFields Usage

Thanks, your example worked perfectly! I did not think to create an input file as a test to figure out how this component worked.
After experimenting with the example a few times, I tried to apply what I learned to my own data, with no success. So I went back to the example and changed things a little at a time until I broke the component.
What I learned is that:
1) The input and output names used in the tExtractRegexFields cannot be the same (see image below).
2) The tExtractRegexFields component is fragile.

1) The input and output names used in the tExtractRegexFields cannot be the same.
I set up all the components just as you have in your example. Then I started over and defined a new tExtractRegexFields component, then used the same input and output name (see image below).
The output looks like this:
.--+--+--+--.
| tLogRow_1 |
|=-+--+--+-=|
|c1|c2|c3|c4|
|=-+--+--+-=|
|1 |2 |3 |null|
|1 |2 |3 |null|
|1 |2 |3 |null|
|1 |2 |3 |null|
|1 |2 |3 |null|
|1 |2 |3 |null|
|1 |2 |3 |null|
'--+--+--+--'
If the output names are different (I named the columns w,x,y,z), the output works as in your example.
2) The component is fragile. I changed the Regular Expression to:
"^\\(+)\\\\s*\\]+)\\]"
I got the following error message:
Exception in component tExtractRegexFields_1
java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Unknown Source)
at test2.pfr_0_1.PFR.tFileInputDelimited_1Process(PFR.java:810)
at test2.pfr_0_1.PFR.runJobInTOS(PFR.java:1040)
at test2.pfr_0_1.PFR.main(PFR.java:954)
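For context, the "No match found" message in the trace above is what java.util.regex.Matcher.group throws when it is called after a match attempt that failed. The sketch below reproduces that outside Talend. The pattern, class name, and sample strings are made up for illustration; they are not the expression from this job and not the component's generated code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoMatchDemo {
    // Hypothetical pattern: a word, a colon, then the rest of the line.
    static final Pattern P = Pattern.compile("^(\\w+):\\s*(.+)$");

    // Ignores the result of matches() and reads the group anyway.
    static String unguarded(String s) {
        Matcher m = P.matcher(s);
        m.matches();
        return m.group(1); // throws IllegalStateException if the match failed
    }

    // Checks the match result first and returns null when there is no match.
    static String guarded(String s) {
        Matcher m = P.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(guarded("Monday: meeting with John")); // Monday
        System.out.println(guarded("no separator here"));         // null
        try {
            unguarded("no separator here");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());      // caught: No match found
        }
    }
}
```

The only difference between the two methods is whether the boolean returned by matches() is checked before group() is called.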
If I do nothing to the Regular Expression and change the input file to just have a single record beginning with Monday, it works:
Input file:
1,2,3, meeting with John ,5
Output
.------+-------------------+--------+------.
|                tLogRow_1                 |
|=-----+-------------------+--------+-----=|
|day   |due                |priority|place |
|=-----+-------------------+--------+-----=|
|Monday| meeting with John |high    |office|
'------+-------------------+--------+------'
or if I change all of the records to match my Regular Expression:
Input File:
1,2,3, meeting with John ,5
1,2,3, go to the gim ,5
1,2,3, meeting with Mary ,5
1,2,3, go to training ,5
1,2,3, diner with kids ,5
1,2,3, diner with parents ,5
1,2,3, go fishing with friends ,5
Output:
.------+-------------------------+--------+-------------.
|                       tLogRow_1                       |
|=-----+-------------------------+--------+------------=|
|day   |due                      |priority|place        |
|=-----+-------------------------+--------+------------=|
|Monday| meeting with John       |high    |office       |
|Monday| go to the gim           |low     |SuperGim Club|
|Monday| meeting with Mary       |low     |office       |
|Monday| go to training          |high    |IT center    |
|Monday| diner with kids         |highest |home         |
|Monday| diner with parents      |highest |Dad's home   |
|Monday| go fishing with friends |high    |home         |
'------+-------------------------+--------+-------------'

Is there any way to configure the component to return null when there is no match?
Thanks again,
Kevin
Community Manager

Re: tExtractRegexFields Usage

Hello
"Then I started over and defined a new tExtractRegexFields component, then used the same input and output name (see image below)."

Note that if you want to split the c4 column, the output column names must be different from c4; otherwise, the c4 column will always be null.
"Is there any way to configure the component to return null when there is no match?"

It is impossible, as we have coded it to throw this exception if there is no match.
Best regards

shong
One Star

Re: tExtractRegexFields Usage

Hi,
"It is impossible, as we have coded it to throw this exception if there is no match."
So, if I am trying to split a column using the tExtractRegexFields component, every record has to match exactly or it will throw an exception? I'm working with data where I cannot tell in advance whether every record will have a column that matches. For example, let's say my data source has 500 records. Records 1 - 20 match the pattern, but record 21 does not, so the entire job breaks. Other than going through every record and hand editing each column to make sure I get a match, is there a way to work around the exception? I mean, is there a way to provide a catch-all or secondary RegEx in case the primary RegEx does not match?
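To make the "primary plus secondary RegEx" idea concrete, here is a minimal plain-Java sketch of what such a fallback could look like if it were written by hand, for example in a custom routine. It is not a feature of tExtractRegexFields. Both patterns, the class name, and the sample strings are hypothetical; the only assumption is that every pattern defines at least as many groups as there are output fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FallbackExtract {
    // Hypothetical patterns: "word: text" first, then "word - text" as a fallback.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("^(\\w+):\\s*(.+)$"),
            Pattern.compile("^(\\w+)\\s+-\\s+(.+)$"));

    // Tries each pattern in order; returns an all-null array if none matches.
    static String[] extract(String input, int fieldCount) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(input);
            if (m.matches() && m.groupCount() >= fieldCount) {
                String[] out = new String[fieldCount];
                for (int i = 0; i < fieldCount; i++) {
                    out[i] = m.group(i + 1); // groups are numbered from 1
                }
                return out;
            }
        }
        return new String[fieldCount]; // new String[] elements default to null
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("Monday: meeting with John", 2)));
        System.out.println(Arrays.toString(extract("Tuesday - go to the gym", 2)));
        System.out.println(Arrays.toString(extract("unparseable", 2)));
    }
}
```

With this shape, a record that matches neither pattern produces null fields instead of stopping the run, and further fallback patterns can be added to the list without touching the loop.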

Thanks again,
Kevin
Community Manager

Re: tExtractRegexFields Usage

"Other than going through every record and hand editing each column to make sure I get a match, is there a way to work around the exception?"

Yes, I agree with you. Can you report this on our bugtracker as a request to add a 'Die on error' option to this component, so that the job keeps running even when some rows do not match?
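As a rough illustration of the behaviour such an option might have, the sketch below processes a list of rows with a dieOnError flag: when the flag is set, the first non-matching row aborts the run; when it is not, non-matching rows are collected in a reject list and processing continues. This is only an assumption about how the requested option could behave, written as stand-alone Java. The pattern, names, and reject list are hypothetical and do not describe the component's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DieOnErrorSketch {
    // Hypothetical pattern: a word, a colon, then the rest of the line.
    static final Pattern P = Pattern.compile("^(\\w+):\\s*(.+)$");

    // Returns the extracted fields of matching rows; non-matching rows either
    // abort the run (dieOnError) or are appended to the rejects list.
    static List<String[]> process(List<String> rows, boolean dieOnError, List<String> rejects) {
        List<String[]> out = new ArrayList<>();
        for (String row : rows) {
            Matcher m = P.matcher(row);
            if (m.matches()) {
                out.add(new String[] { m.group(1), m.group(2) });
            } else if (dieOnError) {
                throw new IllegalStateException("No match found for row: " + row);
            } else {
                rejects.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("Monday: office", "bad row", "Friday: home");
        List<String> rejects = new ArrayList<>();
        List<String[]> ok = process(rows, false, rejects);
        System.out.println(ok.size() + " rows extracted, " + rejects.size() + " rejected");
    }
}
```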
Thanks for your support!
Best regards

shong