Five Stars

How to Denormalize a JSON source

Hi,

 

I have a JSON source, with a lot of nested levels, and would have them flatted to denormalized rows.

Since tExtractJSONFields permits me just to loop on an array field, I have to use more than one tExtractJSONFields cascade, to loop on any nested array I have.

Any tExtractJSONFields then will extract some fields that I don't know how to temporary store, to re-join them together in denormalized rows.

 

 

From:

{"level1":[

  {"value1":111, "level2":[

      {"value2":222, "level3":[

           {"value3":"333-1"},

           {"value3":"333-2"}

        ]}

   ]}

]

 

To (2 records):

111|222|"333-1"

111|222|"333-2"

 

My only idea, and I don't like it, is to use tHashOutput in the middle of the tExtractJSONFields chain, between one and the next one, to fill 2 hash buffers (with "value1" and "value2") and make them available (by tHashInput) at the end of the chain, together with "value3" items.

 

Any better idea?

 

Regards,

Lorenzo

  • Data Integration
Tags (1)
7 REPLIES
Ten Stars

Re: How to Denormalize a JSON source

It sounds like you essentially have the right idea (using multiple tExtractJSONFields). To store your temporary data you can use tHashOutput components and read it back in with tHashInput components. Remember that you will need to also output a "key" field for your looped data so that you can join it back.

Rilhia Solutions
Five Stars

Re: How to Denormalize a JSON source

... and Key is the (big) problem, now.

 

I have no any key at any level of the JSON structure, and I don't know how to generate it in the way that will be available to all tHashOutput, along the tExtractJSONFields chain.

Any idea?

 

(it's a pity that there is no any Talend component that convert a JSON structure in a multi-dimensional Java Hash. I should use a tJava component and write the code to convert... :-( )

 

 

Ten Stars

Re: How to Denormalize a JSON source

Sorry, I didn't read your original post properly. But your key problem doesn't look too hard to solve. All you need is to pass a unique identifier to the looped rows. In your example above that could be the "value1" and "value2" fields for their respective loops. If they turn out to not be unique, you could actually use a sequence generated at the outer level and provided to the inner level. Both work and I use them all the time when working with the Open Source product.

 

With regard to your question as to why Talend do not provide a better mechanism, they do....with the Enterprise Edition. It is called Talend Data Mapper. But to be honest this provides its own issues in usage and documentation.

Rilhia Solutions
Five Stars

Re: How to Denormalize a JSON source

How to pass a generated sequence through tExtractJSONFields components? tExtractJSONFields does not permit to add any external field to the parsed result.

The sequence you mention has to be passed ahead through the chain, right? 

Ten Stars

Re: How to Denormalize a JSON source

If you add a tmap between the tExtractJSONFields components, you can manipulate your data that way. To pass a field "through" the tExtractJSONFields component, just supply the column in the schema for the tExtractJSONFields component, but do nothing with it in the component. The data you supply will be passed through untouched.

Rilhia Solutions
Five Stars

Re: How to Denormalize a JSON source

Oh, yes, I did not know that you can pass fields through the JSON component.

But... the tExtractJSONFields components run over nested loops, that mean the sequence should be generated in the most nested loop (to be available in the last tHashOutput component) and propagated back to the previous loops (for tHashOutput components purpose).

 

Consider three nested loops: the first loop with 2 items, the second one with 2 items, the third one with 3 items.

We have to generate 12 key values. Where and how generate them and propagate them depending on all loops state?

 

Thank you,

Lorenzo

Ten Stars

Re: How to Denormalize a JSON source

You can do this with the following (high level) structure....

 

tExtractJSONFields1 --> tMap --> tExtractJSONFields --> tMap --> tExtractJSONFields

 

As each row is released by the tExtractJSONFields into the tMap components, a sequence number can be generated and passed "through" the tExtractJSONFields component. These sequence numbers can then be used to identify related records.

 

3 nested loops will require 3 sequences. While this might generate lots of actual ids, you only need to worry about the 3 sequences. Provide a column for each (e.g. key1, key2, key3) and then you are just matching like you would in any other scenario.

Rilhia Solutions