regex parsing of apache access log

One Star

Hey Guys,
Does anyone know if this can be done? I use a regex in my monster Perl ETL script (attached at the bottom) that goes through my access logs and pulls out the GET variables in every HTTP request. It builds a hash whose keys are the GET variable names and whose values are the GET variable values. This is nice because it lets me handle GET requests that don't all have the same number of parameters: the missing ones simply evaluate to NULL.
Can I do this with Talend? I read up on "Setting up a File Regex schema" on p. 65 of the user guide but I am still not sure.
Thanks.
-----
if ($_ =~ m/\,([^,]*),(\/slacker\.jpg|\/slacker\.gif),([^,]*)/) {
    $arg = $10;
    while ($arg =~ m/(\w+)=([^&]*)/g) {
        $args{$1} = $2;
    }
}
Example input:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
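
For readers who don't use Perl, the inner loop of the snippet above can be sketched in Python. This is only an illustration of the key/value idea, not the poster's actual script: the function name parse_get_vars and the exact pattern (\w+)=([^&]*) are assumptions.

```python
import re

# Assumed pattern: a word-like name, '=', then everything up to the next '&'.
PAIR = re.compile(r"(\w+)=([^&]*)")

def parse_get_vars(query):
    """Return a dict mapping each GET variable name to its value."""
    return {name: value for name, value in PAIR.findall(query)}

args = parse_get_vars("var1=x&var2=f5c99a6c552c032123&var3=anything")
print(args.get("var3"))  # anything

# A request without var3 simply has no such key.
print(parse_get_vars("var1=x&var2=6374451b48368cf558").get("var3"))  # None
```

Because the result is a dictionary, neither the order nor the number of variables in the request matters.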
One Star

Re: regex parsing of apache access log

Sorry, I just noticed my input lines were off (because I stripped out some fields for anonymity) - but the basic idea is the same! :)
Employee

Re: regex parsing of apache access log

Is that really an Apache access log?
I've implemented a tApacheLogInput for TOS 2.4 (available for TOS 2.3 in the ecosystem) that deals with "standard" Apache access log lines.
You gave an example of the input lines; can you also give the corresponding expected output lines?
One Star

Re: regex parsing of apache access log

It's actually a lighttpd log; I modified the conf file so that it only logs the data fields I need.
The key point is that I need to break up the variables in the GET request into a CSV file that I then ETL through my own custom process (which I want to replace with Talend).
Here are the lines:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
Here would be the output:
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,6374451b48368cf558,,400
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,f5c99a6c552c032123,anything,400
Notice: Line 1 has only 2 GET variables (var1 and var2), while Line 2 has 3 GET variables (var1, var2, var3). In the output, even though Line 1 has only 2 variables, a placeholder is inserted for var3. Another thing to keep in mind is that I simplified the naming of these GET variables for the example, but my users can name them anything (e.g. var1 OR variable1 OR myvariable1).
That's why I need the regexp - to grab all variables and drop to a hash array, where I only perform operations on the variables I expect, and toss the rest of the junk.
Thanks!!
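
To make the expected output concrete, here is a rough Python sketch of the transformation described above. It is written under several assumptions: the list EXPECTED and the function to_csv_row are hypothetical names, each line has exactly four comma-separated fields, and no value contains a comma. The sample lines are used exactly as they appear in this thread, so the timestamp field comes out empty.

```python
import re

EXPECTED = ["var1", "var2", "var3"]  # hypothetical list of variables to keep

def to_csv_row(line, expected=EXPECTED):
    """Expand the GET-variable field of one log line into fixed columns."""
    timestamp, ip, query, status = line.split(",")
    # Collect every name=value pair, whatever its name or position.
    found = dict(re.findall(r"(\w+)=([^&]*)", query))
    # Keep only the expected variables; missing ones become empty placeholders.
    values = [found.get(name, "") for name in expected]
    return ",".join([timestamp, ip, *values, status])

print(to_csv_row(",24.24.24.24,var1=x&var2=6374451b48368cf558,400"))
# ,24.24.24.24,x,6374451b48368cf558,,400
```

Variables that are not in EXPECTED are silently dropped, which matches the "toss the rest of the junk" behaviour.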
Employee

Re: regex parsing of apache access log

In the tFileInputRegex, the regex is:
'
^\,
([^,]+)
,
(?:var1=([^&,]+))?
(?:&?var2=([^&,]+))?
(?:&?var3=([^&,]+))?
,(\d+)
$'

The input of my job is:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400

The limitation is that the variables must appear in the order var1, var2, var3. Any or all of them can be missing, but those that are present must keep that order.
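
The behaviour of this kind of pattern, including the ordering limitation, can be checked outside Talend with a small Python sketch. The character classes [^,]+ and [^&,]+ used here are an assumption about what the pattern intends, and parse_ordered is a hypothetical helper name.

```python
import re

# One optional group per variable, in a fixed order (assumed character classes).
ORDERED = re.compile(
    r"^,([^,]+),"
    r"(?:var1=([^&,]+))?"
    r"(?:&?var2=([^&,]+))?"
    r"(?:&?var3=([^&,]+))?"
    r",(\d+)$"
)

def parse_ordered(line):
    """Return the captured groups, or None if the line does not match."""
    m = ORDERED.match(line)
    return m.groups() if m else None

# A missing variable yields None for its group.
print(parse_ordered(",24.24.24.24,var1=x&var3=foo,400"))
# ('24.24.24.24', 'x', None, 'foo', '400')

# Variables out of order make the whole line fail to match.
print(parse_ordered(",1.2.3.4,var3=foo&var1=barfoo,400"))
# None
```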
One Star

Re: regex parsing of apache access log

I must say... that's pretty cool :)
One Star

Re: regex parsing of apache access log

Ok, so does tFileInputRegex not support looping? Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence.
Employee

Re: regex parsing of apache access log

"Ok, so does tFileInputRegex not support looping?"

No. Well, not with my current knowledge of regular expressions :-)
"Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence."

Here is another solution that does not depend on the order of the vars. I think it is slower than the first one I gave you. Also be warned that the day you have a var4 and a var5, you will have to modify the tMap. This "problem" is not related to regex but to our static way of defining schemas.
My new input is (the last 2 lines are new):
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400
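
One way to get order-independent extraction with a fixed schema is to search for each expected variable separately. The Python sketch below illustrates that idea only; it is not the Talend job described in this post, and SCHEMA, lookup and parse_any_order are hypothetical names. It also assumes four comma-separated fields per line and no commas inside values.

```python
import re

SCHEMA = ["var1", "var2", "var3"]  # hypothetical fixed output schema

def lookup(query, name):
    """Find one variable anywhere in the query string, or return ''."""
    m = re.search(r"(?:^|&)" + re.escape(name) + r"=([^&]*)", query)
    return m.group(1) if m else ""

def parse_any_order(line):
    """Return [ip, var1, var2, var3, status] regardless of variable order."""
    _, ip, query, status = line.split(",")
    return [ip] + [lookup(query, name) for name in SCHEMA] + [status]

print(parse_any_order(",1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400"))
# ['1.2.3.4', 'barfoo', 'hithere', 'foo', '400']
```

As with the static schema mentioned above, supporting a var4 or var5 means editing SCHEMA; the search itself does not change.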