One Star

Can Data Quality analyse unstructured data, such as data in a CSV file?

Hi,
I would like to use Data Quality (DQ) to analyse/validate data in CSV files, i.e. highlight invalid data based on rules/constraints predefined by the user.
I have read the Data Quality documentation. Talend Open Studio for DQ provides a powerful data profiling tool, with great UX design, for analysing database tables, rows and columns. However, I could not find any content that describes how to analyse unstructured data, such as the content of a CSV file.
If DQ does not provide functionality to validate data in CSV files, do you have any suggestions for how to approach my data validation goal? Since it is an open source project, is it possible to extend it to read text files and then reuse the existing data profiling components (define rules/constraints + validate + highlight invalid data)?
Is this trunk the right place to look? http://www.talendforge.org/trac/top/browser/trunk

Thank you in advance.
Yukun
2 REPLIES
Employee

Re: Can Data Quality analyse unstructured data, such as data in a CSV file?

Hello Yukun,
The studio can analyze CSV files. If your CSV fields contain unstructured text and you want to dig into that text, I suggest having a look at how to create your own Java indicator: https://help.talend.com/pages/viewpage.action?pageId=20824858#Raa27234
You could then share your indicators with the community by uploading them to the Talend Exchange website.
In the enterprise version of the studio, we provide a component that parses text and extracts data from it based on parser rules: https://help.talend.com/search/all?query=tStandardizeRow&content-lang=en
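
For illustration only, here is a rough sketch in plain Java of the kind of rule-based check described in the question: each column gets a predicate, and every cell that fails its rule is reported. This is not the studio's Java indicator API and it does not use any Talend classes. The class name `CsvRuleCheck`, the method `findInvalid`, the sample columns and the rules are all made up for the example, and it assumes a header row and simple comma-separated values with no quoted fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CsvRuleCheck {

    // Returns one message per cell that fails the rule registered for its column.
    // Assumption: the first line is a header and fields never contain commas or quotes.
    public static List<String> findInvalid(List<String> lines,
                                           Map<String, Predicate<String>> rules) {
        List<String> problems = new ArrayList<>();
        if (lines.isEmpty()) {
            return problems;
        }
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] header = lines.get(0).split(",", -1);
        for (int row = 1; row < lines.size(); row++) {
            String[] cells = lines.get(row).split(",", -1);
            for (int col = 0; col < header.length; col++) {
                String column = header[col].trim();
                Predicate<String> rule = rules.get(column);
                if (rule == null) {
                    continue; // no rule defined for this column
                }
                // A missing cell is treated as an empty value.
                String value = col < cells.length ? cells[col].trim() : "";
                if (!rule.test(value)) {
                    problems.add("line " + (row + 1) + ", column " + column + ": " + value);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical rules: age must be digits only, email must contain '@'.
        Map<String, Predicate<String>> rules = new LinkedHashMap<>();
        rules.put("age", v -> v.matches("\\d+"));
        rules.put("email", v -> v.contains("@"));

        List<String> lines = List.of(
                "name,age,email",
                "ann,31,ann@example.com",
                "bob,abc,bob.example.com");

        findInvalid(lines, rules).forEach(System.out::println);
    }
}
```

Inside the studio, the equivalent checks would normally be configured as analyses or indicators on the file connection rather than written by hand like this.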
One Star

Re: Can Data Quality analyse unstructured data, such as data in a CSV file?

Hi Scorreia, thank you for your reply. I have now found the fileDelimited connection option in DQ, so I am able to analyse my CSV files.
Cheers
Yukun