Reading Special Characters like Trade Marks and Register Marks

Four Stars

Reading Special Characters like Trade Marks and Register Marks

Hi,

 

My source has text data with all kinds of special characters, such as trademark and registered marks. While moving it to the target as text, I am getting some special characters before the registered marks and ? marks before single quotes. I have tried UTF-8 encoding and so on, but it did not help. I am not seeing the trademark sign (TM); I see ¢ instead.

 Cash+â„¢ Signature®

Is there any alternative solution to this?

Any thoughts?

All Replies
Eleven Stars

Re: Reading Special Characters like Trade Marks and Register Marks

What is the encoding of the source file?

It looks like the source is in an encoding that cannot be read correctly as UTF-8.
Regards
Abhishek KUMAR
Four Stars

Re: Reading Special Characters like Trade Marks and Register Marks

Yes, it is very tricky. I am actually reading from Hive and writing into Hive. Within the Talend server (the same Linux box) everything looks good, but when writing to HDFS on a different box I see all kinds of junk. Even tLogRow prints junk characters. Both boxes have the same character set.
Eleven Stars

Re: Reading Special Characters like Trade Marks and Register Marks

Please check the file encoding in HDFS with the file command. Is it UTF-8? If not, convert the file to UTF-8 with the iconv command and see whether everything looks good.
Regards
Abhishek KUMAR
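
The check-then-convert idea above can also be sketched in Python, for cases where the bytes are already in hand. This is only an illustration: the helper names are hypothetical, and cp1252 is an assumed source encoding, not something confirmed in this thread. Note that a validity check only shows the bytes can be decoded as UTF-8; text that was double-encoded is still valid UTF-8, so it will pass this check.

```python
# Sketch: check whether raw bytes decode as UTF-8 and, if not, transcode them.
# Assumption: the real source encoding is known (cp1252 is used as an example).

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Example: a registered mark stored as a single cp1252 byte (0xAE).
sample = "Signature\u00ae".encode("cp1252")
converted = sample if is_valid_utf8(sample) else to_utf8(sample, "cp1252")
```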
Seven Stars

Re: Reading Special Characters like Trade Marks and Register Marks

Strange, "Cash+â„¢ Signature®" does look like UTF-8 that has been misread. It might be going through a double conversion: UTF-8 data being re-encoded into UTF-8 as if it were not already UTF-8. Do you have any intermediate steps where the data is written in one format (e.g. UTF-8) but read as another (e.g. ISO-8859)?
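
The double conversion described here can be reproduced in a few lines. This is a minimal sketch in Python, assuming the wrong decoder in the middle is Windows-1252 (cp1252); the encoding actually involved in the job may differ, and the sample string is only for illustration.

```python
# Sketch: reproduce and undo "double conversion" mojibake.
# Assumption: the intermediate step wrongly reads UTF-8 bytes as cp1252.

original = "Cash+\u2122 Signature\u00ae"  # trademark and registered signs

# Step 1: the text is correctly written as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: a later step reads those bytes with the wrong character set.
# The three UTF-8 bytes of the trademark sign become three separate characters.
garbled = utf8_bytes.decode("cp1252")

# Step 3: writing the garbled text out as UTF-8 again locks in the damage.
double_encoded = garbled.encode("utf-8")

# Repair: reverse the wrong step (encode as cp1252, then decode as UTF-8).
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")

print(repaired == original)
```

The repair only works while every garbled character still maps back to a cp1252 byte, so fixing the step that reads with the wrong encoding is the more reliable option.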

Four Stars

Re: Reading Special Characters like Trade Marks and Register Marks

Thank you for all the support. Everyone helped me find a solution, so I accept all comments as solutions.

Yes, the file was going through a double conversion; the person who gave me the file to figure out the issue had messed it up. Going through all your comments helped me identify the issue. The final fix is that the output file is now written as UTF-8 instead of the default encoding (ISO-8859-1 or Windows-1252, something like that).

After changing that, everything lined up and loaded into Hive without any issues.

Thank you all
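
For readers who hit the same problem outside Talend, the fix amounts to naming the output encoding explicitly instead of relying on a platform default. The following is a rough Python sketch of that idea; the function names and the temporary file are hypothetical and are not taken from the job discussed in this thread.

```python
# Sketch: write text with an explicit encoding instead of the platform default.
# Assumption: the target (here a temporary file) should contain UTF-8 bytes.
import os
import tempfile


def write_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Write text to path using the given encoding, never the default one."""
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


def read_bytes(path: str) -> bytes:
    """Return the raw bytes of the file so the encoding can be inspected."""
    with open(path, "rb") as handle:
        return handle.read()


def demo() -> bytes:
    """Write a sample string as UTF-8 and return the bytes that hit the disk."""
    text = "Cash+\u2122 Signature\u00ae"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_text(path, text)
        return read_bytes(path)
    finally:
        os.remove(path)
```

Calling demo() returns the bytes as written, which makes it easy to confirm that the trademark sign was stored as the three UTF-8 bytes E2 84 A2 rather than as a single default-encoding byte.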
