Tokenizing Log files

One Star

Hi,
I would like to process log files that come in a McAfee format. Before I can do anything meaningful with them, I have to tokenize the records, but I don't know how to tokenize such log files. Any ideas?
Here is a (deliberately simplified) example of two records:

192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"
127.0.0.1:12345 10.3.211.3 4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"


As you can see:

The fields are delimited by a single space character.
Strings are normally enclosed in "" and can contain space characters (and theoretically even " characters).
The string containing the TCP status is the exception: it is a string, but it is not enclosed in "".
Dates are enclosed in [] and can contain space characters, too.
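
The rules above can be sketched as a small hand-written scanner in plain Java (for example in a custom routine called from a tJavaRow; that placement is only a suggestion). This is a minimal sketch, not a Talend component: the class name LogTokenizer and the method tokenize are made up for illustration, and it assumes that a " inside a quoted string is escaped as \", which the post does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class LogTokenizer {

    /**
     * Splits one record on single spaces, except inside "..." or [...].
     * The enclosing quotes or brackets are dropped from the field value.
     * Assumption: an embedded quote is written as \" inside a quoted string.
     */
    public static List<String> tokenize(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean inBrackets = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '\\' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');   // escaped quote becomes a literal quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;      // closing quote
                } else {
                    current.append(c);     // spaces are kept inside quotes
                }
            } else if (inBrackets) {
                if (c == ']') {
                    inBrackets = false;    // closing bracket of a date field
                } else {
                    current.append(c);
                }
            } else if (c == '"' && current.length() == 0) {
                inQuotes = true;           // a quote only opens at the start of a field
            } else if (c == '[' && current.length() == 0) {
                inBrackets = true;         // same for a bracketed date
            } else if (c == ' ') {
                fields.add(current.toString());  // two spaces in a row yield an empty field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());    // last field has no trailing space
        return fields;
    }

    public static void main(String[] args) {
        String record = "192.168.1.12 10.2.33.12  123 4711 \"www.google.com\" TCP_MISS \"something else\"";
        List<String> fields = tokenize(record);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println((i + 1) + ": " + fields.get(i));
        }
    }
}
```

Because every single space ends a field, the double space in the first example record produces an empty third field. If empty fields are not wanted, they can be filtered out after tokenizing.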

How can I tokenize such Log files in Talend? The regular CSV import component is too simple for this, I believe. Any ideas?
Thanks
Matt
Moderator

Re: Tokenizing Log files

Hi,
Can you please give us an example with the expected result?
Best regards
Sabrina
One Star

Re: Tokenizing Log files

Hi,
I did provide 2 example lines in my post.
The outcome would be a schema in which each field is separated. So for the two examples above:
192.168.1.12 10.2.33.12 123 4711 "www.google.com" TCP_MISS "something else"
1. 192.168.1.12
2. 10.2.33.12
3. (empty)
4. 123
5. 4711
6. www.google.com
7. TCP_MISS
8. "something else"
And the second line:
127.0.0.1:12345 10.3.211.3 4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"
1. 127.0.0.1:12345
2. 10.3.211.3
3. (empty)
4. 4321
5. 53344
6. www.domedomain.com/bla/xyz.php?x=1,y=2
7. TCP_MISS
8. "more text"
How can that be done? That's an excerpt from a standard log file, so I assume that Talend must have some means to process these file types?
Thanks
Matt
One Star

Re: Tokenizing Log files

So I have experimented with Talend a bit and did not come across any feature that could help process this type of file.
Does Talend really have nothing for log files?
I experimented a bit with regular expressions and came up with this:
^(+)\s(+)\s"(.*?)"\s\\s(.*?)\s(\d+)\s(\d+)\s(\d+)\s(.{34})\s(.*?)\s(\d+)\s(.*?)\s(\d+)\s(.*?)\s(.*?)\s(\w+\b)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(+)$

That is for a slightly different log format, though:
192.246.180.238 112.225.107.58 "Undefined"  "GET http://something-bea.xyz.de.zz.com:7215/ExternalInformationServices/SAMCS?WSDL HTTP/1.1" 407 343 3793 "Apache-HttpClient/4.1.1 (java 1.5)" "-" 81 "-" 3 "-/-" "" "TCP_MISS" "-" "Authenticate Offer NTLM" "jhgjhg876g87-test" "-" 0.0.0.0

But when I try this in Talend, I get an exception (messages translated from German):

Exception in thread "main" java.lang.Error: Unresolved compilation problem:
Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )

at csa_pilot.regex_test_0_1.regex_test.tFileInputRegex_1Process(regex_test.java:529)
at csa_pilot.regex_test_0_1.regex_test.runJobInTOS(regex_test.java:1043)
at csa_pilot.regex_test_0_1.regex_test.main(regex_test.java:900)

Any idea?
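
The "invalid escape sequence" message is a Java compile error, and the stack trace points into the generated regex_test.java. That suggests the pattern ends up inside a Java string literal, where \s, \d and \w are not valid escapes. Assuming that is the cause, every backslash in the pattern has to be doubled and every " has to be written as \". The sketch below shows this for a shortened pattern that only covers the first three fields of the second log format; the class name RegexEscapeDemo and the method firstThreeFields are made up for illustration, and the pattern is not the full expression from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEscapeDemo {

    // The regex as typed in a regex tester:   ^(\S+)\s(\S+)\s"(.*?)"
    // The same regex as a Java string literal: backslashes doubled, quotes escaped.
    static final Pattern PREFIX = Pattern.compile("^(\\S+)\\s(\\S+)\\s\"(.*?)\"");

    /** Returns the first three fields of a record, or null if the line does not match. */
    public static String[] firstThreeFields(String line) {
        Matcher m = PREFIX.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String record = "192.246.180.238 112.225.107.58 \"Undefined\"  \"GET http://example.invalid/ HTTP/1.1\" 407";
        String[] fields = firstThreeFields(record);
        if (fields != null) {
            for (String field : fields) {
                System.out.println(field);
            }
        }
    }
}
```

The same doubling would apply to each \s, \d, \w and \b in the full expression before it is pasted into the component's regex field.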