Four Stars

tExtractDelimitedFields on variable location field

I have a location column that I'm processing through tExtractDelimitedFields to break out into city, region, country

However, the input isn't consistent, sometimes just containing a country, sometimes a region and country, sometimes a city, region, and country:

 

 

|=--------------------------------=|
|location                          |
|=--------------------------------=| |Bangkok, Thailand | |Sydney, New South Wales, Australia| |Atlanta, Georgia, United States | |Zurich, Switzerland | |Peru | |=--------------------------------=|

Which is obviously and correctly split by tExtractDelimitedFields to:

|=----------------+-----------------------------+--------------=|
|city             |region                       |country        |
|=----------------+-----------------------------+--------------=|
|Bangkok          | Thailand                    |               |
|Sydney           | New South Wales             | Australia     |
|Atlanta          | Georgia                     | United States |
|Zurich           | Switzerland                 |               |
|Peru             |                             |               |
|=----------------|-----------------------------|--------------=|

Now, I'd like to correct the locations of the different elements of the location field. My thought was to extract the fields into variables, and shift them to the right until the country field is no longer blank. But it occurred to me that

  1. This might be a solved problem that I'm unable to find by simple looking for "location" or "tExtractDelimitedFields" in the community
  2. I'm also looking for some pointers on how to do the shifting if that's the solution
  3. While I'm at it, I'm seeing the whitespace after the split in the output. I can easily use tMap to apply a trim. I used tMap immediately before the tExtractDelimitedFields to drop two columns. Is it better to carry the two rows and use tMap just once to drop and trim at the same time, or to use it twice, to drop unneeded columns early, then trim later?

 

 

 

  • Data Integration
1 ACCEPTED SOLUTION

Accepted Solutions
Seven Stars TRF
Seven Stars

Re: tExtractDelimitedFields on variable location field

Here is the solution I propose as a more elegant way (let you decide this point!)

Use a tFileInputFullRow to read the file as it is (schema is one field called "line").
Use tMap to left pad the line field with the missing "," if any:
Capture.PNG
Here is the magic, get a substring from ",," starting a position depending of the number of "," in the line, then add the line to this substring which contains 0, 1 or 2 "," (for ex: ",Bangkok, Thailand" or ",,Peru").
Now you can use tExtractDelimitedFields as you want (just have to change the Field separator to "," and to define the desired output schema):
Capture.PNG

Check the "Trim Column" option on Advanced setting tabs.

And the result as expected with just 1 line of Java code:
Capture.PNG
Hope this helps.


TRF
4 REPLIES
Four Stars

Re: tExtractDelimitedFields on variable location field

I assume I brute-forced it with tJavaRow:

 

if ("" == input_row.country && "" == input_row.region) {
	output_row.city = "";
	output_row.region = "";
	output_row.country = input_row.city;
} else if ("" == input_row.country) {
	output_row.city = "";
	output_row.region = input_row.city;
	output_row.country = input_row.region;
} else {
	output_row.city = input_row.city;
	output_row.region = input_row.region;
	output_row.country = input_row.country;
}

If there's a more elegant way to do it, please let me know!

Seven Stars TRF
Seven Stars

Re: tExtractDelimitedFields on variable location field

Here is the solution I propose as a more elegant way (let you decide this point!)

Use a tFileInputFullRow to read the file as it is (schema is one field called "line").
Use tMap to left pad the line field with the missing "," if any:
Capture.PNG
Here is the magic, get a substring from ",," starting a position depending of the number of "," in the line, then add the line to this substring which contains 0, 1 or 2 "," (for ex: ",Bangkok, Thailand" or ",,Peru").
Now you can use tExtractDelimitedFields as you want (just have to change the Field separator to "," and to define the desired output schema):
Capture.PNG

Check the "Trim Column" option on Advanced setting tabs.

And the result as expected with just 1 line of Java code:
Capture.PNG
Hope this helps.


TRF
Seven Stars TRF
Seven Stars

Re: tExtractDelimitedFields on variable location field

@jwhite_tc, did my proposition help you to solve your case?


TRF
Four Stars

Re: tExtractDelimitedFields on variable location field

It did indeed! Basically pad the front of the location field with commas to force it to three total. And the trim in the advanced settings. Lovely!