Six Stars

Clean accented character and white space in column

I have a workflow as follows. In the column 'summary', I want to:

1. remove question marks (?)
2. remove white space from the text
3. replace accented letters with their English equivalents, for example é with e.

Capture.JPG

Input

?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager

Output

at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion |
Jinanhaolu N manager

For the accented letters, the above is just a sample: they can be anything, and I do not have a finite list to give as an example.

 

Thanks in advance!!



1 ACCEPTED SOLUTION

Accepted Solutions
Six Stars

Re: Clean accented character and white space in column

Hi,

The following steps might help you.

Step 1: Change the file read encoding.

1.PNG

 

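Step 2: Create a new routine, stripAccents, with the script below.

package routines;

import java.text.Normalizer;

public class stripAccents {

    // Returns s with accents removed (for example "é" becomes "e"); null stays null.
    public static String stripAccents(String s) {
        if (s == null) {
            return null;
        }
        // NFD splits each accented letter into a base letter plus combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Delete the combining marks, leaving only the base letters.
        s = s.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
        return s;
    }
}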

 

2.PNG

 

Create the job: src --> tMap --> tLogRow

3.PNG

COL is the input column in the source, and row1.COL is the input in tMap. COL is also the output column in tMap.

 

output COL --> stripAccents.stripAccents(row1.COL).replaceAll("[?]", "").replaceAll("^ ", "") 

 

Input Data:

?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager
aaaéééàààçççbbbb
Shenzhen WenTong electronic co.Ltd Ñ power adapter

 

Output Data:

4.PNG
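Since the output above is only available as a screenshot, here is a minimal standalone sketch of the same chain (accent stripping, then the two replaceAll calls) applied to the sample input. The class name CleanSummaryDemo and the helper clean are hypothetical and not part of the Talend job; the accented letters are written as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.text.Normalizer;

class CleanSummaryDemo {

    // Hypothetical helper that combines the routine and the tMap expression from this post.
    static String clean(String s) {
        if (s == null) {
            return null;
        }
        // Split accented letters into base letter + combining mark, then drop the marks.
        String noAccents = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}", "");
        // Remove literal question marks, then a single leading space if one is left.
        return noAccents.replaceAll("[?]", "").replaceAll("^ ", "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "?? at Shenzhen Xingjiexun Electronics Co.Ltd",
            "Designer at FabUnion | ????????",
            "Jinanhaolu \u00D1 manager",                          // N with tilde
            "aaa\u00E9\u00E9\u00E9\u00E0\u00E0\u00E0\u00E7\u00E7\u00E7bbbb" // e acute, a grave, c cedilla
        };
        for (String sample : samples) {
            // Brackets make any remaining leading or trailing space visible.
            System.out.println("[" + clean(sample) + "]");
        }
    }
}
```

Note that the second sample keeps a trailing space after the "|", because the expression only removes a space at the start of the string.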

Hope this helps!

Regards,

Veeranjaneyulu Boppudi
18 REPLIES
Six Stars

Re: Clean accented character and white space in column

Hi,

 

Please provide some sample data and expected output.

 

Regards,

Veeranjaneyulu Boppudi
Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

Hi,

Here is an example of how to do it:

Capture.PNG

First, load the commons-lang3-3.4.jar file and import org.apache.commons.lang3.StringUtils.

For that, in the tLibraryLoad Basic settings select "commons-lang3-3.4.jar", then in the Advanced settings enter import "org.apache.commons.lang3.StringUtils;" in the Import field.

In tJavaRow, enter the following (maybe something similar in tMap depending on your use case):

output_row.line = StringUtils.stripAccents(input_row.line);

tFixedFlowInput is here to generate data for the flow ("aaaéééàààçççbbbb" for my example), and the result is:

aaaeeeaaacccbbbb

Hope this helps,

 


TRF
Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

Sorry, I forgot "?" and space.

Just replace:

output_row.line = StringUtils.stripAccents(input_row.line);

with:

output_row.line = StringUtils.stripAccents(input_row.line).replaceAll("[? ]", "");

 

That's all.

 


TRF
Six Stars

Re: Clean accented character and white space in column

How should I connect tLibraryLoad and tJavaRow in my workflow?

Should it be as follows? Please suggest whether I should arrange these components differently.

 

tMap -> tLibraryLoad -> tJavaRow -> tFileOutputDelimited

Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

Well, if you just want to remove leading white space (not all of it), just use:

output_row.line = StringUtils.stripAccents(input_row.line).replaceAll("[?]", "").replaceAll("^ ", "");

There may be a shorter form, but this works:

Capture.PNG
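To make the difference between the two expressions concrete, here is a small plain-Java sketch. The class and method names are hypothetical, and the third variant using trim() is an assumption added for comparison, not something proposed in the thread.

```java
class SpaceRemovalDemo {

    // Earlier version: removes every '?' and every space.
    static String removeAllSpaces(String s) {
        return s.replaceAll("[? ]", "");
    }

    // Version from this post: removes every '?', then one space at the very start.
    static String removeLeadingSpace(String s) {
        return s.replaceAll("[?]", "").replaceAll("^ ", "");
    }

    // Assumed variant: removes every '?', then all leading and trailing whitespace.
    static String removeOuterWhitespace(String s) {
        return s.replaceAll("[?]", "").trim();
    }

    public static void main(String[] args) {
        String sample = "Designer at FabUnion | ????????";
        // Brackets make leading and trailing spaces visible.
        System.out.println("[" + removeAllSpaces(sample) + "]");
        System.out.println("[" + removeLeadingSpace(sample) + "]");
        System.out.println("[" + removeOuterWhitespace(sample) + "]");
    }
}
```

The "^ " pattern matches only a single space at the start, so text that ends with question marks keeps a trailing space unless something like trim() is applied as well.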

Regards,


TRF
Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

Usually, you place the tLibraryLoad at the beginning of the job.
In my example, because there is nothing else in the job, it's the first component, and the following tFixedFlowInput is connected to it with an onSubjobOk (or onComponentOk) trigger.

Don't forget to mark the topic as solved (if it is). Kudos are also welcome ;)

TRF
Six Stars

Re: Clean accented character and white space in column

I downloaded the jar file from http://book2s.com/java/jar/c/commons-lang3/download-commons-lang3-3.4.jar.html, tried the suggested solution, and made tLibraryLoad the first component. Below is how tLibraryLoad is configured:

[Screenshots: Basic settings, Advanced settings]

And this is how tJavaRow is configured. I added the column name 'summary' after output_row and input_row in the code, as follows:

 

Capture.JPG

However, I am getting this error:

Execution failed : Job compile errors 
At least job "Test2_Copy" has a compile errors, please fix and export again.
Error Line: 49
Detail Message: Syntax error on token ""org.apache.commons.lang3.StringUtils;"", delete this token
There may be some other errors caused by JVM compatibility. Make sure your JVM setup is similar to the studio.

 

Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

You must load the library first: tLibraryLoad - onSubjobOk -> tFileList

Also verify the Advanced settings of tLibraryLoad. The Import field must contain import org.apache.commons.lang3.StringUtils;

Edit: OK, forget that. Just remove both " characters in the Import field (it is Java code, not just a string).

 

 

 


TRF
Six Stars

Re: Clean accented character and white space in column

I inserted import org.apache.commons.lang3.StringUtils; in the Advanced settings field and it ran without any error. However, the output is not what I need: it simply replaces the accented Ñ with a question mark (?).

 

Shenzhen WenTong electronic co.Ltd Ñ power adapter

 

is converted into 

 

Shenzhen WenTong electronic co.Ltd ? power adapter

 

 

Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

What's the encoding of the tFileInputDelimited?

TRF
Six Stars

Re: Clean accented character and white space in column

UTF-8


TRF wrote:
What's the encoding of the tFileInputDelimited?

 

Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

But is your file encoded as utf8?
I just tested on my side and it works fine.

TRF
Six Stars

Re: Clean accented character and white space in column

@TRF can you post a screenshot? @vboppudi the file is in UTF-8 format, and if I change the encoding on the input, the file is not read properly. I faced this issue before and it took me a week to understand the reason; after I switched to UTF-8, the data was read properly.

Six Stars

Re: Clean accented character and white space in column

Hi,
If I change the encoding to UTF-8, I am not able to read the data properly. I get output like the following:
|at Shenzhen Xingjiexun Electronics Co.Ltd |
|Designer at FabUnion | |
|Jinanhaolu � manager |
|aaa���������bbbb |
|Shenzhen WenTong electronic co.Ltd � power adapter
Regards,
Veeranjaneyulu Boppudi
Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

Here is the job with the tFileInputDelimited:

Capture.PNG

The Advanced settings tab of the tFileInputDelimited:

Capture.PNG

The input file with the Encoding menu (from Notepad++):

Capture.PNG

Finally, the result:

Capture.PNG

@Enthusiast, let us know the encoding of your file.

 

Regards,


TRF
Six Stars

Re: Clean accented character and white space in column

It appears as ANSI when I open it in Notepad++.

Twelve Stars TRF
Twelve Stars

Re: Clean accented character and white space in column

So just select ISO-8859-15 as the encoding system in the Advanced settings tab.

It works (I've tried).
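As a rough illustration of why the encoding setting matters, the plain-Java sketch below decodes the same bytes twice: once as UTF-8 and once as ISO-8859-15. The class name EncodingMismatchDemo and its members are hypothetical, and it assumes the JDK in use provides the ISO-8859-15 charset; it does not reproduce what tFileInputDelimited does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Charset suggested in the thread; assumed to be available in the running JDK.
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // Decodes raw file bytes with the given charset, similar to picking an encoding for the input.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // U+00D1 (N with tilde) is stored as the single byte 0xD1 in ISO-8859-15.
        String original = "Jinanhaolu \u00D1 manager";
        byte[] raw = original.getBytes(LATIN9);

        String wrong = decode(raw, StandardCharsets.UTF_8);
        String right = decode(raw, LATIN9);

        // A lone 0xD1 byte is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println("UTF-8 decode lost the letter: " + wrong.contains("\uFFFD"));
        System.out.println("ISO-8859-15 decode matches:   " + right.equals(original));
    }
}
```

Once the letter has been replaced during decoding, no accent-stripping step can recover it, which is why the encoding has to match the file before stripAccents is applied.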

 


TRF