Open Studio for DQ can not handle special characters in CSV-File encoded as utf-8

Four Stars

Hi, 

 

I'm using Talend Open Studio for Data Quality version 6.5.1 to analyze the quality of data in a CSV file encoded in UTF-8. If I select the indicator 'Soundex Frequency' for a column whose values contain special characters like "ü" and "é" and run the analysis, I get the following error message:

 

 

2018-05-04 17:14:20,232 ERROR org.talend.dq.analysis.AnalysisExecutor  - java.lang.IllegalArgumentException: The character is not mapped: Ü
java.lang.IllegalArgumentException: The character is not mapped: Ü
	at org.apache.commons.codec.language.Soundex.map(Soundex.java:226)
	at org.apache.commons.codec.language.Soundex.getMappingCode(Soundex.java:180)
	at org.apache.commons.codec.language.Soundex.soundex(Soundex.java:264)
	at org.talend.dataquality.indicators.impl.SoundexFreqIndicatorImpl.handle(SoundexFreqIndicatorImpl.java:283)
	at org.talend.dq.indicators.DelimitedFileIndicatorEvaluator.handleByARow(DelimitedFileIndicatorEvaluator.java:335)
	at org.talend.dq.indicators.DelimitedFileIndicatorEvaluator.useCsvReader(DelimitedFileIndicatorEvaluator.java:257)
	at org.talend.dq.indicators.DelimitedFileIndicatorEvaluator.executeSqlQuery(DelimitedFileIndicatorEvaluator.java:115)
	at org.talend.dq.indicators.Evaluator.evaluateIndicators(Evaluator.java:146)
	at org.talend.dq.indicators.Evaluator.evaluateIndicators(Evaluator.java:207)
	at org.talend.dq.analysis.DelimitedFileAnalysisExecutor.runAnalysis(DelimitedFileAnalysisExecutor.java:70)
	at org.talend.dq.analysis.AnalysisExecutor.execute(AnalysisExecutor.java:146)
	at org.talend.dq.analysis.AnalysisExecutorSelector.executeAnalysis(AnalysisExecutorSelector.java:171)
	at org.talend.dataprofiler.core.ui.action.actions.RunAnalysisAction$1.runInWorkspace(RunAnalysisAction.java:222)
	at org.eclipse.core.internal.resources.InternalWorkspaceJob.run(InternalWorkspaceJob.java:38)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)

 

I've already tried the solution from this post: https://community.talend.com/t5/Design-and-Development/Handling-special-characters/m-p/25169#M4268

I also checked "Allow specific characters (UTF8,...) for columns of schemas" under Window / Preferences / Talend / Specific Settings.

Neither of these worked for me.

 

Is there any workaround to solve the problem?

 

Thanks in advance

Frank

 

Moderator

Re: Open Studio for DQ can not handle special characters in CSV-File encoded as utf-8

Hello,

Have you tried adding -Dfile.encoding=utf8 to the .ini (config file) and restarting your studio to see if it works?

Best regards

Sabrina

Employee

Re: Open Studio for DQ can not handle special characters in CSV-File encoded as utf-8

Hi,
We don't support running the 'Soundex Frequency' indicator on a column whose values contain special characters like "ü" and "é", or Chinese/Japanese characters.
Getting this error is expected behavior, and we will not fix it.
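
Since the indicator will not be changed, one possible workaround (not confirmed anywhere in this thread) is to clean the column before profiling it, so that only the letters A-Z reach the Soundex encoder. The stack trace above shows Soundex.map rejecting "Ü", so stripping diacritics first should avoid that exception. The sketch below is only an illustration: the class name SoundexSafe and the helper toAsciiLetters are hypothetical, not part of Talend or Commons Codec, and it uses only java.text.Normalizer from the JDK. Where and how you apply it (for example in a preprocessing job that writes a cleaned copy of the CSV) is left open.

```java
import java.text.Normalizer;

public class SoundexSafe {

    // Hypothetical helper (not a Talend or Commons Codec API):
    // reduces a value to plain ASCII letters before Soundex is applied.
    public static String toAsciiLetters(String value) {
        if (value == null) {
            return null;
        }
        // NFD splits an accented letter into base letter + combining mark,
        // e.g. u-umlaut becomes 'u' followed by U+0308.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Drop everything that is still not a plain ASCII letter
        // (digits, punctuation, CJK characters, sharp s, ...).
        return withoutMarks.replaceAll("[^A-Za-z]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(toAsciiLetters("M\u00fcller")); // u-umlaut
        System.out.println(toAsciiLetters("Andr\u00e9"));  // e-acute
    }
}
```

Note that this changes the data being profiled: letters without a decomposition (such as "ß") and all Chinese/Japanese characters are simply removed, so some values may end up shortened or empty. Whether that is acceptable depends on what you want the Soundex frequencies to tell you.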
