Five Stars

How to extract lines/rows that have a regex in the line

I have a file, and I need to process the lines that match a regex with one tJavaRow and the rejected lines with another tJavaRow. I tried tFileInputRegex, but it picks up only the matched text (not the whole line), and the rejected row has nothing in it. For example, if I try this:

tFileInputRegex ----success---> tJavaRow1

    |____________reject _________>tJavaRow2

 

and print the row values in 1 and 2, 1 has the matched text and 2 has null. So if my regex is 'sri', lines containing it, like 'sriisnice', produce 'sri' in the row, and lines like 'talendisgreat' produce null. I need to get the actual line. How do I do that?

I tried a few file and processing components as well but could not find one that can do it. I might be able to make the extract regex fields component or some other processing component do it, with a tJavaRow putting the line in a context variable, but no luck so far. I'll keep you posted if I find a way.

1 ACCEPTED SOLUTION

Accepted Solutions
Five Stars

Re: How to extract lines/rows that have a regex in the line

Dijke,

 

I am using the right regex to match the whole string; please look at my earlier reply. I looked further at the code, and if there is a match the component takes the first matching group. That should have been the whole string, but I get nothing. Giving only the string I am looking for returns just that string out of the full line, so tFileInputRegex, as documented, splits the line by the regex but does not give you the complete match. I therefore worked around it by holding the line in a context variable and reading it back afterwards on both the match and the reject outputs. So the flow is:

.. tJavaRow (put line into a context variable) ---> tExtractRegexFields ---success---> tJavaRow (get the line from the context variable)

    |____________reject _________> tJavaRow (get the line from the context variable)

 

This worked. 
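The routing idea behind this workaround can be sketched in plain Java, outside Talend. This is only an illustration under assumptions: the class `LineRouter`, its fields, and the local variable `held` (standing in for the context variable) are hypothetical names, and nothing here is the code Talend generates for these components.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: keep the original line, then route it by regex.
final class LineRouter {
    final List<String> matched = new ArrayList<>();
    final List<String> rejected = new ArrayList<>();

    // Sends each full line to "matched" or "rejected"; the line itself is never altered.
    static LineRouter route(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        LineRouter router = new LineRouter();
        for (String line : lines) {
            String held = line; // plays the role of the context variable
            if (pattern.matcher(held).find()) {
                router.matched.add(held);   // success branch gets the whole line
            } else {
                router.rejected.add(held);  // reject branch also gets the whole line
            }
        }
        return router;
    }
}
```

With the regex `sri`, the line `sriisnice` would land in `matched` and `talendhasgreatcommunity` in `rejected`, each as the complete line.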

                                                                                                                                                                                   

6 REPLIES
Nine Stars

Re: How to extract lines/rows that have a regex in the line

If possible please provide source and expected target output.

 

Regards,

Veeru Boppudi
Five Stars

Re: How to extract lines/rows that have a regex in the line

My regex: "sri"

 

If my string is "sriisnice", I want the success output to be "sriisnice" (not "sri"), and if the string is "talendhasgreatcommunity", I expect the reject output to be "talendhasgreatcommunity". If I try my regular expression ".*sri.*" on any regex testing site it shows a full match, but I don't get anything!
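One possible reason for getting nothing, offered as a guess rather than a confirmed diagnosis: a consumer that reads capturing group 1 finds no such group in ".*sri.*", even though the pattern matches the whole line. The plain Java sketch below shows the difference between the full match and group 1; the class and method names are hypothetical, and whether the component behaves this way is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: full match (group 0) versus first capturing group (group 1).
final class GroupDemo {
    // Returns the whole matched string, or null if the input does not fully match.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group() : null;
    }

    // Returns capturing group 1, or null if there is no match or no such group.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches() || m.groupCount() < 1) {
            return null;
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(".*sri.*", "sriisnice"));   // sriisnice
        System.out.println(firstGroup(".*sri.*", "sriisnice"));   // null
        System.out.println(firstGroup("(.*sri.*)", "sriisnice")); // sriisnice
    }
}
```

If the component does work from capturing groups, wrapping the expression in parentheses, as in "(.*sri.*)", might be worth trying.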

Nine Stars

Re: How to extract lines/rows that have a regex in the line

Are you looking for the following output?

[Attached screenshot: Regex.PNG]

Regards,

Veeru Boppudi
Nine Stars

Re: How to extract lines/rows that have a regex in the line

I think I know what the problem is.
In regex pattern matching there are different search/match options.
In Java and Talend, some functions require the regex to match the whole string, for example:
- "hahahBOhahaha" doesn't match the regex "BO", but ".+BO.+" does. (Note that ".+" needs at least one character on each side; ".*BO.*" also covers lines that start or end with "BO".)
I think this is the case with the regexMatch function.

You need to make sure you match the whole string. Or use a different function,
Look into this documentation:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
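The difference between matching the whole string and searching inside it can be sketched with `java.util.regex` directly. The class and method names are hypothetical; `Matcher.matches()` and `Matcher.find()` are the standard library calls.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: whole-string matching versus substring search.
final class MatchVsFind {
    // True only if the regex matches the entire input.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches anywhere inside the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "hahahBOhahaha";
        System.out.println(wholeMatch("BO", line));      // false
        System.out.println(wholeMatch(".+BO.+", line));  // true
        System.out.println(containsMatch("BO", line));   // true
    }
}
```

A function built on `find()` accepts the bare substring, while one built on `matches()` needs a pattern that covers the full line.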


Five Stars

Re: How to extract lines/rows that have a regex in the line

Vboppudi, while I can't see your tJavaRow code, I bet you are using a regex match there. The problem with that is that I can't do 'if else' routing that way, so I had to use the regex components. I was able to do it by holding the current line in a context variable and using the extract regex component.

 

 
