Five Stars

Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

My input log file looks like this:

2017-05-09 10:18:52.743 INFO  (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.u.p.LogUpdateProcessorFactory [UIMATestCollection1]  webapp=/solr path=/update params={}{} 0 66
2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)

I am using the tFileInputRegex component. The regex to parse the file is shown here:

"^"+
"([0-9]{4}\\-[0-9]{2}\\-[0-9]{2})"+" "+
"([0-9]{2}\\:[0-9]{2}\\:[0-9]{2}\\.[0-9]{3})"+" "+
"(.*?)"+" "+
"\\((.*)\\)"+" "+
"\\[(.*)\\]"+" "+
"(.*)"

I am getting only partial output, as shown below:

.----------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+-------------------------------------------.
|                                                                                               tLogRow_1                                                                                               |
|=---------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+------------------------------------------=|
|Date      |Time        |Log_Level|App_Thread      |Collection                                                                                              |Message                                    |
|=---------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+------------------------------------------=|
|2017-05-09|10:18:52.743|INFO     |qtp1543727556-22|   x:UIMATestCollection1] o.a.s.u.p.LogUpdateProcessorFactory [UIMATestCollection1                      | webapp=/solr path=/update params={}{} 0 66|
|2017-05-09|10:18:52.745|ERROR    |qtp1543727556-22|   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1|unknown field 'sentence'                   |
'----------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+-------------------------------------------'

[Image: tFileInputRegex configuration]

But I want tFileInputRegex to ignore the row separator ("\n") when parsing the above input file, so that the stack trace lines that follow an error line are included in the last column of that row. Please suggest a solution if there is one.
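
The Collection column in the tLogRow output above also contains the logger name, because the "(.*)" groups inside the parentheses and brackets are greedy and run to the last closing delimiter on the line. Below is a minimal sketch of a tighter variant in plain Java, assuming the thread and collection fields never contain ")" or "]" themselves. The class name SolrLogLineRegex and the parse helper are made up for illustration, and the sketch does not address the multi-line problem itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SolrLogLineRegex {
    // Same six groups as the regex in the question. Only the level, thread and
    // collection groups are tightened so they stop at the first delimiter.
    static final Pattern LINE = Pattern.compile(
        "^"
        + "([0-9]{4}-[0-9]{2}-[0-9]{2}) "
        + "([0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) "
        + "(\\S+)\\s+"          // log level, without trailing padding
        + "\\(([^)]*)\\) "      // thread name, up to the first ')'
        + "\\[([^\\]]*)\\] "    // collection, up to the first ']'
        + "(.*)");              // rest of the line as the message

    // Returns the six captured fields, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }
}
```

With the ERROR line from the sample log, this variant should leave the logger name and the "[doc=1]" part in the Message column instead of the Collection column.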

6 REPLIES
Community Manager

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Hello,

tFileInputRegex reads the file line by line, and each line is parsed with the regex. As a workaround, read the whole file content as a string, replace every newline followed by "at" with a special character, and write the string to a temporary file before parsing it with the regex. After parsing the file, replace the special characters with a newline followed by "at" again if needed. For example:

tFileInputRaw--main--tJavaRow1--main--tFileOutputDelimited
   |
OnSubjobOk
   |
tFileInputRegex--main--tJavaRow2--main--tLogRow

tFileInputRegex: reads the new file generated by tFileOutputDelimited.


On tJavaRow1:

output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");

On tJavaRow2:

output_row.Date = input_row.Date;

//...other columns....

// put back the line break + " at" that tJavaRow1 replaced with "@"
output_row.Message = input_row.Message.replaceAll("@","\r\n at");


Regards

Shong

----------------------------------------------------------
Talend | Data Agility for Modern Business
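
The three steps of the workaround above can be sketched in plain Java, outside Talend. This is only an illustration under stated assumptions: the class and method names are hypothetical, "@" is assumed not to occur anywhere else in the log, and the pattern is loosened (optional "\r", any whitespace before "at") so that it tolerates both the CRLF-plus-space layout from the reply and a tab-indented stack trace.

```java
import java.util.ArrayList;
import java.util.List;

final class StackTraceFolding {
    // Placeholder standing in for "line break + at"; assumed absent from the log.
    static final String MARK = "@";

    // Step 1 (tJavaRow1 in the job): fold each stack trace line onto the line above.
    // \r?\n accepts Windows and Unix line endings, \s+ accepts a tab or spaces.
    static String fold(String wholeFile) {
        return wholeFile.replaceAll("\\r?\\n\\s+at ", MARK + " ");
    }

    // Step 2 (what tFileInputRegex then sees): one log entry per physical line.
    static List<String> entries(String folded) {
        List<String> out = new ArrayList<>();
        for (String line : folded.split("\\r?\\n")) {
            if (!line.isEmpty()) {
                out.add(line);
            }
        }
        return out;
    }

    // Step 3 (tJavaRow2 in the job): restore the line breaks inside one message.
    static String unfold(String message) {
        return message.replace(MARK + " ", "\n\tat ");
    }
}
```

If "@" can appear in real messages, a rarer placeholder would be a safer choice, since unfold cannot tell an original "@ " from an inserted one.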
Five Stars

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Thanks for your support. It really helped a lot.

I am working on it, but I got stuck on a small error.

tfileinputRaw--main--tJavaRow1--main--tFileOutputDelimited

This is my tJavaRow1:
output_row.content = (input_row.content.toString()).replaceAll("\n\tat","@");

Below is my input file:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

 

The output in tFileOutputDelimited is:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

If I use tJavaRow2 with replaceAll("\n@","@"), it does not work; I get the same output as above.

The tLogRow output is:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
[statistics] disconnected

Now I want to remove the \n before each @ in my output file.

My expected output is:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

If I put the same multi-line text in Eclipse and use val b:String = a.replaceAll("\n@", "@"); in Scala, the output comes out on a single line.

Can you please suggest something on this?

Thanks in advance.

Community Manager

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Hi,
This is my tJavaRow1:
output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");
It generates only one line in the output file; it seems you are not using the same code on tJavaRow1.

Regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
Five Stars

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Thanks for the reply and support. I tried your tJavaRow code:

output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");

It does not make any changes.

My input file contains first \n after \r and at. Maybe that is why.

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

With your tJavaRow code I get the same output as the input after executing, as shown below:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

So after trying yours, I changed it to:

output_row.content = (input_row.content.toString()).replaceAll("\n\tat","@");

which gives:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

Now I want to get the above output on a single line.

For that I used tJavaRow2 with:

output_row.content = (input_row.content.toString()).replaceAll("\n@","@");

But I still get the above output with no changes, which means the \n is not removed:

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

 

I have put the exported Talend job (archive file to import) and the input file at the links below.

Can you please check them if possible?

https://drive.google.com/open?id=0B-hwVI6s7kodd0dWWFUtVWZHRTg

https://drive.google.com/open?id=0B-hwVI6s7kodSlVSMXNKbmNYeDg
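
When a replaceAll pattern appears to have no effect, it can help to look at which invisible characters actually separate the lines. A minimal sketch, assuming plain Java; the helper name visible is made up for illustration, and in a Talend job it would have to live in a user routine or be inlined:

```java
final class ShowControlChars {
    // Returns the text with tab, carriage return and line feed
    // spelled out as the two-character sequences \t, \r and \n.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\t': sb.append("\\t"); break;
                case '\r': sb.append("\\r"); break;
                case '\n': sb.append("\\n"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Printing visible(...) of the content before and after each replacement shows whether the text contains "\r\n", a bare "\n", a tab or spaces before "at", and so which pattern has a chance of matching.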

 

 

Twelve Stars

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Try replacing "\n" with "\\n", as "\" is a special character in regex.

output_row.content = (input_row.content.toString()).replaceAll("\\n@","@");

TRF
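
The suggested replacement can be tried in isolation with a small plain-Java sketch. The class and method names are hypothetical, and the optional "\\r" is an addition that is not part of the reply; it is there on the assumption that the file might have Windows line endings.

```java
final class NewlineBeforeMarker {
    // Removes a line break that directly precedes the "@" placeholder.
    // "\\n" reaches the regex engine as the escape \n; the optional \\r
    // also covers content written with CRLF line endings.
    static String joinLines(String content) {
        return content.replaceAll("\\r?\\n@", "@");
    }
}
```

For example, joinLines("a\n@ b\r\n@ c") yields "a@ b@ c". In a standalone Java program the literals "\n@" and "\\n@" both produce a pattern that matches a line feed followed by "@", so if neither spelling changes the output, a leftover carriage return or other whitespace before the "@" is one possible cause worth checking.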
Five Stars

Re: Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

Thanks a lot, it worked for me.

Thanks for the support.