Address Standardization, possibly using tExtractRegexFields
Hi, I have an enterprise version of Talend for Data Services. I am trying to standardize address data for a large data set, but the Google components don't work for this because (1) they are extremely slow and (2) I run out of available queries, as the data set is over 40,000 records.

The addresses can arrive in two ways.

Way 1, with 5 separate columns: Address Line, City, State, Zip, Country
Example: 100 Main St | New York | New York | 90909 | US

Way 2, in 1 column: Address
Example: 100 Main St, New York, New York, 90909, US

I need to have the data separated like this: Address Number, Street Name, City, State, Zip Code, Country
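To make the target concrete, here is a rough sketch in plain Java (outside Talend) of the split I am after. The class name, method names, and the pattern itself are placeholders I made up for illustration. It assumes the single-column form always has exactly five comma-separated parts and that the address line starts with a house number, so it would not cope with something like "100 Main St, Apt 4, ...".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressSplitter {
    // Hypothetical pattern: a leading house number (digits plus an optional
    // letter, e.g. "100" or "12B"), then everything else as the street name.
    private static final Pattern LINE =
            Pattern.compile("^\\s*(\\d+[A-Za-z]?)\\s+(.+?)\\s*$");

    // Way 2: one column. Returns null when the input does not have
    // exactly five comma-separated parts.
    public static String[] split(String address) {
        String[] parts = address.split("\\s*,\\s*");
        if (parts.length != 5) {
            return null;
        }
        return fromColumns(parts[0], parts[1], parts[2], parts[3], parts[4]);
    }

    // Way 1: five columns. Only the address line needs the regex.
    // Result order: number, street, city, state, zip, country.
    public static String[] fromColumns(String line, String city, String state,
                                       String zip, String country) {
        String number = "";          // left empty when no house number is found
        String street = line.trim();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            number = m.group(1);
            street = m.group(2);
        }
        return new String[] {
            number, street, city.trim(), state.trim(), zip.trim(), country.trim()
        };
    }

    public static void main(String[] args) {
        String[] out = split("100 Main St, New York, New York, 90909, US");
        System.out.println(String.join(" | ", out));
    }
}
```

Presumably the same pattern could go into tExtractRegexFields or a tJavaRow, but I have not confirmed how to wire that up.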
I am having trouble getting the regex correct, as I am new to Java and to the Talend way of doing things. Is there a better way to do this, or can anyone offer input on how to set up the regex?

The job flow is currently:

FTPget----tFileInputDelimited----tMap (modifying columns)----tSplitRow (being used to pivot certain items)----tHashOutput

Somewhere within there I need to separate the address fields. Any help with this would be great! Thank you.