I'm designing a framework for migrating large datasets to cloud. As part of the process, I had to design an engine to handle data quality issues, particularly removing unwanted characters, such as "Esc" character, carriage return and also certain unicode characters that are reserved for private use (e.g. \u0092).
Currently I'm able to remove such characters simply by using something like:
currentRow = currentRow.replace("\n", "");
currentRow = currentRow.replace("\r", "");
currentRow = currentRow.replace("\u0092", "");
What I'm willing to do is to store these in a SQL database. Storing values such as \n, \r or \u0092 in a SQL table won't do the trick. Can you please describe how I can store these characters in the database and retrieve them properly so Talend can interpret them correctly?
Any advise would be appreciated.
I found out that I can store \u0092 in SQL by storing what gets returned as a result of SELECT CHAR(146). However, this doesn't seem to work when I'm feeding that into the same "replace" command. Any thoughts?
The replaceAll() method can take a regular expression and replace all instances that match that expression in a String. The regex is usually just provided as a string:
If you want to maintain the set of characters in a database, you can store their string values and build a process that takes the list of characters returned by a query and turns it into your regex string. As a rough example, you could connect your database input component to a FlowToIterate to a tJava component. The tJava component would add the string for the current iteration to a globalMap variable:
globalMap.put("regexVar",((String)globalMap.get("regexVar")) + row1.invalidChars)
You'll want to initialize the globalMap variable to avoid null pointer exceptions, probably with "[" to start off the regex set, and add a "]" after all iterations to close the set. Once the regex string is built, you would use it in a later component as: