Address Standardization, possibly using tExtractRegexFields


Hi, I have an enterprise version of Talend for Data Services.
I am trying to standardize address data for a large data set, but the Google components don't work for me because (1) they are extremely slow and (2) I run out of available queries, since the data set has over 40,000 records.
The addresses can arrive in two formats.
Way 1, with 5 separate columns:
Address Line, City, State, Zip, Country
Example: 100 Main St | New York | New York | 90909 | US
Way 2, 1 column:
Address
Example: 100 Main St, New York, New York, 90909, US
I need to have the data separated like this:
Address Number, Street Name, City, State, Zip Code, Country

I am having trouble getting the regex right, as I am new to Java and to the way Talend works. Is there a better way to do this? Or can anyone offer advice on how to set up the regex?
The job currently runs as follows:
FTPget --> tFileInputDelimited --> tMap (modifying columns) --> tSplitRow (being used to pivot certain items) --> tHashOutput
Somewhere in that flow I need to separate the address fields. Any help with this would be great! Thank you.

Re: Address Standardization, possibly using tExtractRegexFields

Hi mw629,
Have you tried tExtractDelimitedFields (documented in the Talend Help Center)? It generates multiple columns from a single delimited column, which should cover the one-column address format.
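Splitting on the delimiter still leaves the house number and street name together in the first field, so a small regex is needed for that part. Below is a minimal sketch in plain Java, not a tested Talend job. The class name AddressSplitter and its method names are hypothetical, and it assumes that every one-column address has exactly five comma-separated parts with no commas inside a part, and that the house number, when present, comes first in the address line. Similar logic could be placed in a user routine and called from tMap, or the pattern adapted for tExtractRegexFields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative, not part of Talend.
public class AddressSplitter {

    // Group 1: leading house number (digits plus an optional letter, e.g. "12B").
    // Group 2: the rest of the line, taken as the street name.
    private static final Pattern ADDRESS_LINE =
            Pattern.compile("^\\s*(\\d+[A-Za-z]?)\\s+(.+?)\\s*$");

    // Splits "100 Main St" into {"100", "Main St"}.
    // If no leading number is found, the number is returned as an empty string.
    public static String[] splitAddressLine(String addressLine) {
        if (addressLine == null) {
            return new String[] {"", ""};
        }
        Matcher m = ADDRESS_LINE.matcher(addressLine);
        if (m.matches()) {
            return new String[] {m.group(1), m.group(2)};
        }
        return new String[] {"", addressLine.trim()};
    }

    // Handles the one-column format: "line, city, state, zip, country".
    // Returns {number, street, city, state, zip, country}.
    public static String[] splitFullAddress(String address) {
        // Assumption: exactly five comma-separated parts, no embedded commas.
        String[] parts = address.trim().split("\\s*,\\s*", -1);
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 comma-separated parts but got " + parts.length);
        }
        String[] line = splitAddressLine(parts[0]);
        return new String[] {line[0], line[1], parts[1], parts[2], parts[3], parts[4]};
    }

    public static void main(String[] args) {
        String[] fields = splitFullAddress("100 Main St, New York, New York, 90909, US");
        // prints: 100 | Main St | New York | New York | 90909 | US
        System.out.println(String.join(" | ", fields));
    }
}
```

For the five-column format, only splitAddressLine is needed, applied to the Address Line column. Rows that do not match the assumptions (for example an apartment number after a comma) would need to be routed to a reject flow and handled separately.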

Best regards
Sabrina