how to design component that adds data based on main flow

Hello,
I've just written my first component, and now that it works, I want to learn how to do it right.
My component connects via RMI to a proprietary server interface. It is not meant for generating the main data flow but for getting some additional data on the side. The additional data is subsequently used to decide how to process the main data. My solution for now is to write the additional data to the globalMap under defined keys and tell my job developers to extract it from there. This requires the job developers to have some basic understanding of Java coding. I would prefer to do it in a more intuitive, graphical way, but couldn't find a solution that worked.
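For readers unfamiliar with the globalMap workaround described above, here is a minimal sketch of what it amounts to. It uses a plain `HashMap` as a stand-in for the job-wide globalMap; the key naming scheme (`componentId + "_" + field`) and the helper names `publish` and `readCodeConversion` are assumptions made for illustration, not part of the original component.

```java
import java.util.HashMap;
import java.util.Map;

final class GlobalMapSideChannel {
    // Stand-in for the job-wide globalMap (assumption: a simple String -> Object map).
    static final Map<String, Object> globalMap = new HashMap<>();

    // What the component would do after an RMI call: store the result under an agreed key.
    // The key layout is hypothetical.
    static void publish(String componentId, String field, Object value) {
        globalMap.put(componentId + "_" + field, value);
    }

    // What every job developer has to write by hand to get the value back out.
    static String readCodeConversion(String componentId) {
        return (String) globalMap.get(componentId + "_CodeConversion");
    }

    public static void main(String[] args) {
        publish("tSuppleInfo_1", "CodeConversion", "YesPleaseDoConvert");
        System.out.println(readCodeConversion("tSuppleInfo_1"));
    }
}
```

The cast and the string key in `readCodeConversion` are exactly the kind of Java knowledge the job developers would need, which is the drawback the post describes.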
I have looked into what tMap lookups do, but I need to access the main flow data for my RMI calls. As far as I could see none of the tMap lookup options provides me access to the main flow. Is there a way to access the main data flow from a component that is connected to a tMap as a lookup? Or any other component supporting lookups?
The other approach I looked into was to add columns to the main flow. Is this possible? I think it should be somehow, but couldn't find an example, and my own guesswork implementation didn't come to anything. Besides, it feels wrong to add columns in a component, because the number of columns is usually assigned by the job developer. Is there an example of a component that adds columns on its own?
Thanks to anybody who helps me along, or just tells me to forget about it.
Greetings,
Florian.

Re: how to design component that adds data based on main flow

Hi Florian,
Could you explain in more detail what you want to look up in the RMI call's return value, and what you need from the main flow for this lookup? It seems to me that this should be possible; I just don't really get what it is you want.
Regards,
Arno

Re: how to design component that adds data based on main flow

Hi Arno,
thanks for trying to understand. I have different use cases. All should be handled by the same Talend component.
A typical use case looks like this:
The Talend job reads an input file line by line from the file system, breaks each line up into different columns, and writes the result to an Oracle database.
The main flow consists of daily updates about financial instruments. My new component should be able to call another application via RMI and look up the value of a field "CodeConversion" for this instrument.
If the field "CodeConversion" contains something like "YesPleaseDoConvert", my new component will have to make another RMI call. This time it should look up the internal code "75a" that corresponds to the external code "42042" coming from the input file.
A third case would be to look up whether there is a transaction pending for this instrument, because the Talend job has to produce an additional output file (= start a subjob) in this case.
Maybe the unusual thing here is that I want to do everything with the same component. The reason behind this is that the Talend job will eventually run on a customer system. I have only limited access to this system, and I want the flexibility to add new functions without having to install new components.
Does this help to understand the intended scenario?
As far as I see, with the tMap lookup I could only execute the same RMI call each time, not provide the instrument's ID or anything else from the main flow as a call parameter.
In the meantime, I have found "tAddLocationFromIP". This component indeed adds a column to the main flow. I have copied this approach, and multiplied it to 15 additional columns. Also, I have been pointed to tOracleConnection which does something similar in error cases.
So I have a working solution now, and a limited set of precedents. Still it feels iffy. I would very much appreciate the opinion of a more experienced component designer.
Greetings,
Florian

Re: how to design component that adds data based on main flow

Hi Florian,
I think I'm getting the point a little bit more. Still I'm not sure if I understand why you want all this functionality inside one component.
Updating functionality in a component takes a fresh install of the job, just as much as updating functionality in the job itself.
Besides that, I think you should start by dividing the problem into several smaller pieces, and only after completing the job try to wrap things together and "package" them into single joblets or components. Using joblets and/or subjobs makes it easy to re-use certain functionality.
Especially because all the functionality you describe sounds pretty easy to do using standard Talend components (and some custom ones for your external systems).
Best regards,
Arno

Re: how to design component that adds data based on main flow

Hi Arno,
now I see what I failed to mention.
> all functionality you describe sound like pretty easy to do when using standard Talend components
No, it's not easy.
1. All those fields, codes and transactions to look up exist only in the working memory of the other application, and the only available interface is the RMI calls.
2. I re-use the API design and documentation of an already existing, proprietary scripting language.
3. I am not the one who writes the Talend jobs. The Talend jobs will be written by experts who know the format and the business context of the incoming or outgoing data. Some of them know the proprietary scripting language.
4. The component will be used in many different ways. I do not control how the jobs will be organized, which Talend environment is used etc.
> Updating functionality in one component takes a fresh install of the job, just as much as updating functionality in a job it self.
On the server side, the application I connect to has its own development and update cycles. On the client side, the Talend job developer has his own development and update cycle. My component which provides the RMI connection is designed to de-couple those competing cycles. I do not want to update the component every time a Talend job developer needs new functionality.
I'm not sure where this explanation leads, but it should help to understand why I chose a one-size-fits-all approach. That said, things would become simpler, in a way, if I split off some specialized components, but the basic question of how to integrate my component(s) into the main flow stays the same.
Greetings,
Florian

Re: how to design component that adds data based on main flow

Hello Florian, the task is still a bit "fuzzy" to me, so I am not 100% sure I understood it completely.
However, as I understand it, you may have two main steps:
1) The component getting the data
2) An optional decode which could be done in tMap, but unfortunately the logic behind it may be too complex and you need to rely on external proprietary APIs
For point 2, one option you may have is to use routines.
I still don't quite see why you cannot decode in the component itself, though I guess there must be a good reason. In any case, a decode function is a typical task for a static method, which can still use an external API if needed.

Re: how to design component that adds data based on main flow

Sigh. I can't get it explained properly.
I'll give it another try: The job will import some data, and use my component to get some supplementary information. The input for my tSuppleInfo comes from the main flow record. Where should the output go?
--
Why I prefer a component over routines.
Thousands of different jobs will use my component to get this supplementary information. Well, not thousands, but perhaps dozens. The jobs will not be written by me, but by people with different backgrounds. Some are programmers, others are bankers, most are somewhere in between. I have to keep it as simple as possible. They should drag the component into the job, fill out some parameters, and connect the arrows. You can't drag and connect routines with the mouse. Of course I may try to tell the job developers to jump through hoops, but then they may prefer not to, walk around the hoops, and not use the component.
--
Why I need to make RMI calls.
No, neither the optional decode nor any of the other functions can be done in tMap, or in my component itself. It can only be done in the external application because only the external application has the data. Data based on a customer-driven, historically aware, context-sensitive rule system evaluated on the spot.
--
My question is not whether I should use a component, or whether I need to make RMI calls, but how I should design the component. What's the best way to integrate it into the job? Which connectors to use? Are there good examples to follow? The input parameters are easy - the component parameters/settings are intuitive and ready to use. But the output of a component like mine seems a bit tricky. (I think that's because Talend doesn't want to fool around with side flows that magically re-integrate into the main flow.) I'll go with the additional columns, for version 1.0, but still have some doubts. Maybe somewhere out there somebody can understand me, has had a similar problem, found an astonishingly elegant, mind-blowingly simple solution, and would like to share it. It might be possible with a tMap and a neat trick.

Re: how to design component that adds data based on main flow

You seem to keep saying the same thing.
A picture may help showing the data flows and how your component should interact with other components.

Re: how to design component that adds data based on main flow

Yeah, really, either you are making the story sound far more complex than it is, or you are not giving us the full picture :)
Anyhow, I understand your basic need is to decode something; it does not matter whether the decoding is done inside Talend or the result is fetched externally:
you basically have 1 input field (plus possibly other fields that are not used in the process) and 1+1 fields in output (the added field is the decoded one).
For some reason you love components. Good, that's fine with me too.
In that case you need:
a schema input (normally inherited from the preceding component, via the data connection that links it to your component) and a schema output, which in most cases will be equal to the incoming schema PLUS the decoded field.
In your parameters you will need all the info required to connect to the RMI service, and you will most likely use it in the _begin section of your component, where you will initialize the RMI connection.
Two additional parameters will be 1) the input field and 2) the output field.
Your component will probably be set to DATA_AUTO_PROPAGATE, which saves you from iterating over all the metadata to copy it to the output connection.
In the _main section of your component you will read the input column, call the RMI, get the response and populate the output column.
In the _end section you will clean up whatever is needed with your RMI or other stuff that might need de-init attention.
Assuming that for each input record you ALWAYS have one and only one output record, then that's pretty much it.
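The begin / main / end split described above can be sketched in plain Java, outside of any Talend template. Everything named here is hypothetical: `LimitService` stands in for the proprietary RMI interface, `FakeLimitService` replaces the real server so the sketch runs on its own, and rows are modelled as simple string arrays rather than generated row structs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EnrichLifecycleSketch {

    /** Hypothetical stand-in for the remote interface reached over RMI. */
    interface LimitService extends AutoCloseable {
        double getLimit(String ident, String timestamp);

        @Override
        void close(); // narrowed so that callers need no checked-exception handling
    }

    /** Fake service so the sketch runs without a server; the rule is made up. */
    static final class FakeLimitService implements LimitService {
        boolean closed = false;

        @Override
        public double getLimit(String ident, String timestamp) {
            return ident.length() * 1000.0;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Each input row {ident, timestamp} yields exactly one output row {ident, timestamp, limit}. */
    static List<String[]> run(List<String[]> input, LimitService service) {
        List<String[]> output = new ArrayList<>();
        // "begin": the connection is assumed to be open when the service is handed in
        try (LimitService s = service) {
            // "main": executed once per record
            for (String[] row : input) {
                double limit = s.getLimit(row[0], row[1]);
                output.add(new String[] {row[0], row[1], String.valueOf(limit)});
            }
        } // "end": try-with-resources closes the connection
        return output;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] {"ABC", "2024-01-01"},
                new String[] {"DE", "2024-01-02"});
        for (String[] row : run(input, new FakeLimitService())) {
            System.out.println(String.join(";", row));
        }
    }
}
```

In a real component the three phases live in separate template files, so the connection object would be a variable shared between them rather than a try-with-resources block.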
Hope it helps,
Francesco

Re: how to design component that adds data based on main flow

Hello Francesco,
OK, pictures. I can do that. Eh, I can give it a try. Not sure about these Image Upload Slots. The preview doesn't show anything.
The real jobs are way more complex, of course, but to illustrate my points I've simplified a job to something basic like this:
1. Read input file
2. Do something
3. Write output file

The job is part of an order process, and "2. Do something" involves checking the limit for individual orders to avoid costly consequences in case of typos, program errors, and rogue traders.

Now I don't want to have the same limit for all types of orders, but a customized limit our users can set for customized classes of instruments. The customizing is done in the external application, and tSuppleInfo is the interface to connect to the external application. It calls a function called GET_LIMIT over RMI, provides the ident and timestamp to identify the instrument, and gets the limit. Conceptually this is supplementary info I look up, I would like to do it as a lookup in Talend:


But this doesn't work. My lookup source needs to know the ident and timestamp, but it can't access the main flow. (Without forcing every job developer to provide something via globalMap or similar.)
The next best solution is to add columns to the main flow.

It works. It is clumsy. I can have more than one output value, and afaik I have to provide columns for all potential values at design time. Every job developer has to wonder why, and live with or get rid of the unwanted, technically named extra columns. The job developer might have more than one tSuppleInfo. What then? But it's the best I've come up with.
Greetings,
Florian

Re: how to design component that adds data based on main flow

Ooook, it's starting to become a bit clearer now :)
Something is still not totally clear: you are going to process a full batch of records, not a single record each time, correct?
I mean (I suppose) you will have a set of records, each one with ident and timestamp.
In that case they have to stay with your main flow until you calculate your limit (which you could put at the beginning of the process, so that you don't need to carry the other columns around after that if you don't need them).
The issue with this approach is that you will have one RMI call per record, and performance might not be acceptable (that actually depends a lot on the performance of the application you are calling).
If your remote application can accept batches (and performs better with them), then you could split the process up: 1) get the distinct values of ident and tstamp, send them to the remote app, get the result and cache it locally (most likely in globalMap); 2) process all the records and update the limit(s); 3) output all the records.
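A rough sketch of the batched variant, assuming the remote application offers some call that takes a set of keys and returns one limit per key. That batch call is modelled here as a `Function`; its shape, the `ident|tstamp` key format and the class name are assumptions, and the cache is a local map rather than globalMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BatchedLimitLookup {

    // Composite cache key; the separator is an arbitrary choice.
    static String key(String ident, String tstamp) {
        return ident + "|" + tstamp;
    }

    static List<Double> limitsFor(List<String[]> rows,
                                  Function<Set<String>, Map<String, Double>> batchLookup) {
        // 1) collect the distinct (ident, tstamp) pairs
        Set<String> distinctKeys = new LinkedHashSet<>();
        for (String[] row : rows) {
            distinctKeys.add(key(row[0], row[1]));
        }
        // one remote round trip for the whole batch; the result acts as the local cache
        Map<String, Double> cache = batchLookup.apply(distinctKeys);

        // 2) and 3) walk the records again and attach the cached limit to each one
        List<Double> limits = new ArrayList<>();
        for (String[] row : rows) {
            limits.add(cache.get(key(row[0], row[1])));
        }
        return limits;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"A", "t1"},
                new String[] {"B", "t1"},
                new String[] {"A", "t1"});
        // Fake remote call with made-up limits.
        Function<Set<String>, Map<String, Double>> fakeRemote = keys -> {
            Map<String, Double> result = new HashMap<>();
            for (String k : keys) {
                result.put(k, k.startsWith("A") ? 100.0 : 200.0);
            }
            return result;
        };
        System.out.println(limitsFor(rows, fakeRemote));
    }
}
```

Note that all rows have to be available before the remote call is made, so this approach carries the same memory cost as any other buffering solution.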
There is an undocumented feature called "virtual components" that can help you with this. The issue is that you will need to store everything in memory, which might not be feasible if you can have lots of records.
Virtual components are basically two components tied together, where the first one (input) reads the data, does whatever it needs with it and finally stores it in a global buffer.
The second component (output) reads the buffer, does whatever it needs with it and spits it out into a data flow.
Technically (to the user) they appear as a single component (I think tSort is one of them... there are a few standard TOS components that are "virtual").
To keep it simple and reduce the amount of allocated memory, you could use a standard component that performs an RMI call "on the fly" in the main section. That is, it has an input connection that contains the ident and the tstamp, and two parameters prompt the user to specify which column holds the ident and which one holds the tstamp (you could provide a default, enabled with a checkbox, which lets you identify them automatically using standard column names).
Then you have your row input connection:
Ident_<%=cid%> = <%=InrowName%>.<%=IdentName%>;
TStamp_<%=cid%> = <%=InrowName%>.<%=TstampName%>;
limits_<%=cid%> = MyRMICallAndWhateverINeedWithItLikeLoopsAndStuff_<%=cid%>.doThings(Ident_<%=cid%>, TStamp_<%=cid%>);
<%=OutrowName%>.<%=limitsName%> = limits_<%=cid%>;
Obviously here you can also perform other actions like filtering the row or setting a flag to reject it if conditions are not met :
if (limits_<%=cid%> < <%=InrowName%>.<%=PriceName%> * <%=InrowName%>.<%=QuantityName%>)
    <%=OutrowName%>.<%=tagForRejectColumn%> = true;
// or <%=OutrowName%> = null; to filter out the row. In that case remember to re-create the object at the beginning of the main section with:
<%=OutrowName%> = new <%=OutrowName%>Struct();
About having multiple columns for the limits :
You would need dynamic schemas, which are only available in TIS. In TOS you could use a string field and store a list (e.g. comma-separated) with all the values.
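As an illustration of the single-string-column idea, the following sketch packs several limits into one comma-separated value and unpacks them again. The class and method names are made up for this example. It relies on `Double.toString`, which always uses a dot as the decimal separator, so the comma stays unambiguous as the list separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LimitListColumn {

    /** Joins all limits into one comma-separated string suitable for a single String column. */
    static String pack(List<Double> limits) {
        StringBuilder sb = new StringBuilder();
        for (Double limit : limits) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(limit);
        }
        return sb.toString();
    }

    /** Splits the column value back into numbers; an empty or null column gives an empty list. */
    static List<Double> unpack(String column) {
        List<Double> limits = new ArrayList<>();
        if (column == null || column.isEmpty()) {
            return limits;
        }
        for (String part : column.split(",")) {
            limits.add(Double.valueOf(part.trim()));
        }
        return limits;
    }

    public static void main(String[] args) {
        String column = pack(Arrays.asList(1000.0, 2500.5));
        System.out.println(column);
        System.out.println(unpack(column));
    }
}
```

The price of this approach is that every downstream consumer has to parse the string again, which moves some Java back into the jobs.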
Again you will need a parameter to identify the name of the column used to store the limits and also in this case you can provide a default value enabled via a checkbox (hide the column selection with a SHOW_IF in the XML descriptor).
In the _begin section of your component you will create an instance of the RMI interface class
MyRMICallAndWhateverINeedWithItLikeLoopsAndStuff_<%=cid%> = new myPackage.mySubPackage.MyRMICallAndWhateverINeedWithItLikeLoopsAndStuff(<connection params, I guess>);
and in the _end section you will close the connection, dispose objects etc.
MyRMICallAndWhateverINeedWithItLikeLoopsAndStuff_<%=cid%>.closeconnection();
Finally, if you can standardize the input and output schema, then you can set them "green" in your XML descriptor and avoid potential mistakes from the end user

Re: how to design component that adds data based on main flow

Good evening :)
I'm following this thread, as I'm in a similar boat. The virtual component feature really seems to be the solution to all my problems! But it's still quite difficult to understand how to build a simple pair of virtual components, as Sort and Aggregate are definitely difficult to reverse-engineer.
@sabuto,
perhaps you could help us by providing a bit more information on this nice undocumented feature? :)
best regards,
gabriele

Re: how to design component that adds data based on main flow

I did a few experiments with virtual components, but to be honest I did not really like the overall idea.
It works if you don't have a huge data set to transfer; otherwise it might get tricky.
Why?
Basically, as far as I was able to understand (by reverse engineering existing components), a virtual component is created from 3 components:
1) a "real" input component
2) a "real" output component
3) a "virtual" component that has only an icon, properties and an XML descriptor
In the virtual one, in the CODEGENERATION part of the descriptor, you have something like this:
 <CODEGENERATION>
<TEMPLATES INPUT="BufOut" OUTPUT="BufIn">
<TEMPLATE NAME="BufOut" COMPONENT="tBufferTestOut">
<LINK_TO NAME="BufIn" CTYPE="ROWS_END" />
</TEMPLATE>
<TEMPLATE NAME="BufIn" COMPONENT="tBufferTestIn" />
<TEMPLATE_PARAM SOURCE="self.SCHEMA" TARGET="BufIn.SCHEMA" />
<TEMPLATE_PARAM SOURCE="self.SCHEMA"
TARGET="BufOut.SCHEMA" />
<TEMPLATE_PARAM SOURCE="self.UNIQUE_NAME"
TARGET="BufOut.DESTINATION" />
<TEMPLATE_PARAM SOURCE="self.UNIQUE_NAME"
TARGET="BufIn.ORIGIN" />
</TEMPLATES>
</CODEGENERATION>

In my example the components are :
tBufferTestOut
tBufferTestIn
tBufferTest (the virtual one)
this xml part here is used to assign the input and output real components.

<TEMPLATES INPUT="BufOut" OUTPUT="BufIn">
<TEMPLATE NAME="BufOut" COMPONENT="tBufferTestOut">
<LINK_TO NAME="BufIn" CTYPE="ROWS_END" />
</TEMPLATE>
<TEMPLATE NAME="BufIn" COMPONENT="tBufferTestIn" />

You will notice that BufOut is assigned as input and BufIn as output... it's not a typo, I really wanted it that way.
Why?
Your virtual component needs to get (input) data from something, and this something needs to be able to provide (output) data to it, so it's basically an OUTPUT component.
Quite confusing eh?
I might add a tutorial on this one...
Anyhow, in a typical TOS data flow, the begin section is executed once, then the main section is executed once per record.
If you have component A and component B connected together, the main sections of A and B are executed one after the other for each record.
This requires holding only one record in memory at a time.
This rule does not apply to virtual components. Instead, component A will read all the records and keep them in memory, do whatever it needs with them and finally post the result in a memory buffer.
When it is done, control goes to component B, which will read this data from the memory buffer, do whatever it needs with the records and finally output them to a data flow.
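The difference in execution order can be made visible with a small trace. This is only a model of the behaviour described above, not Talend's generated code: "A" and "B" are the two components, and each trace entry records which component handled which row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StreamingVsBuffered {

    /** Regular flow: A and B handle each record in turn, so only one record is held at a time. */
    static List<String> streaming(List<String> rows) {
        List<String> trace = new ArrayList<>();
        for (String row : rows) {
            trace.add("A:" + row);
            trace.add("B:" + row);
        }
        return trace;
    }

    /** Virtual-component style: A consumes everything into a buffer before B starts. */
    static List<String> buffered(List<String> rows) {
        List<String> trace = new ArrayList<>();
        List<String> buffer = new ArrayList<>(); // grows with the size of the data set
        for (String row : rows) {
            trace.add("A:" + row);
            buffer.add(row);
        }
        for (String row : buffer) {
            trace.add("B:" + row);
        }
        return trace;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2");
        System.out.println(streaming(rows)); // [A:r1, B:r1, A:r2, B:r2]
        System.out.println(buffered(rows));  // [A:r1, A:r2, B:r1, B:r2]
    }
}
```

The `buffer` list is where the memory concern comes from: its size is proportional to the number of records, whereas the streaming variant never holds more than one.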
these declarations :
<TEMPLATE_PARAM SOURCE="self.SCHEMA" TARGET="BufIn.SCHEMA" />
<TEMPLATE_PARAM SOURCE="self.SCHEMA"
TARGET="BufOut.SCHEMA" />
<TEMPLATE_PARAM SOURCE="self.UNIQUE_NAME"
TARGET="BufOut.DESTINATION" />
<TEMPLATE_PARAM SOURCE="self.UNIQUE_NAME"
TARGET="BufIn.ORIGIN" />

are there so that each component can refer to the other one.
Practically, let's take the last declaration
<TEMPLATE_PARAM SOURCE="self.UNIQUE_NAME"
TARGET="BufIn.ORIGIN" />
It basically instructs the virtual component to COPY the value "UNIQUE_NAME" into the BufIn.ORIGIN parameter which is defined in the descriptor of the tBufferTestIn component, XML descriptor :
<PARAMETER NAME="ORIGIN" FIELD="TEXT" NUM_ROW="10"
REQUIRED="true">
<DEFAULT>tBufferTest_1</DEFAULT>
This is because when you deal with virtual components, the "real" ones (the input and the output one) DO NOT expose parameters.
Users will be able to set only the parameters of the virtual component.
However, there is no real Java code in the virtual component, so those parameters would not be accessible anywhere.
For this reason the SOURCE / TARGET declaration in the virtual one is used to transfer values to the parameters of the sub-components.
I know it's a bit messy... told you I did not like them a lot.
Finally, in the begin template of the testIn component I have :
String origin = ElementParameterParser.getValue(node, "__ORIGIN__");
String rowName = "";
// Find the paired tBufferTestOut node and take the name of its incoming connection
for (INode pNode : node.getProcess().getNodesOfType("tBufferTestOut")) {
    if (!pNode.getUniqueName().equals(origin + "_BufOut")) continue;
    for (IConnection conn : pNode.getIncomingConnections()) {
        rowName = conn.getName();
        break;
    }
}

As you can see I can get "origin" the usual way, then I can get the matching bufOut part.
Similarly, in the bufOut component (begin section) I CAN do something like this:
String destination = ElementParameterParser.getValue(node, "__DESTINATION__");
// Name of the incoming row, with a fallback when nothing is connected
String rowName = "";
if ((node.getIncomingConnections() != null) && (node.getIncomingConnections().size() > 0)) {
    rowName = node.getIncomingConnections().get(0).getName();
} else {
    rowName = "defaultRow";
}
// Find the paired tBufferTestIn node and take the name of its outgoing connection
String outrowName = "";
for (INode pNode : node.getProcess().getNodesOfType("tBufferTestIn")) {
    if (!pNode.getUniqueName().equals(destination + "_BufIn")) continue;
    for (IConnection conn : pNode.getOutgoingConnections()) {
        outrowName = conn.getName();
        break;
    }
}

Hope it helps..

Re: how to design component that adds data based on main flow

I finally did it! I was able to build a virtual component with your advice, precious as always :)
I still have to resolve some minor issues, but the whole process actually works like a charm.
One last question: how do I propagate an input flow to the output?
Let me explain...
My component takes a schema as input, then calls a remote service that adds some columns to the output schema.
Let's say i have an incoming schema with two columns and three rows:
1 A
2 B
3 C
I propagated the entire rowStruct collection into a globalMap buffer and passed it to the "in" stage of the virtual component.
Then I pass a vector (let's say, the second column) to my remote service, which returns something like this:
A foo
B bar
C gaz
Now i need to make a join to have a final output dataflow like:
1 A foo
2 B bar
3 C gaz
Of course, there are tons of pure-Java solutions for this join. However, it definitely looks like a regular Talend hash lookup reference, so I think that part of the code could already be available. As I said, I would like to do it in a pure Talend way (because of performance, because of code cleanliness, because I'm lazy).
So I was wondering: how can I use the tHash constructs to do an implicit join, in the "in" stage of the virtual component, between the buffer coming from the "out" stage and the columns added by the web service call?
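For reference, the pure-Java fallback mentioned above is short. The sketch below is a plain `HashMap` join and deliberately does not use Talend's tHash classes, whose internals are not shown in this thread. The class name, the use of string arrays for rows and the choice to emit `null` for unmatched keys are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BufferJoinSketch {

    /**
     * Joins buffered rows {id, key} with the service result (key -> added value).
     * Rows without a match keep a null in the added column (left-outer behaviour).
     */
    static List<String[]> join(List<String[]> buffered, Map<String, String> serviceResult) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : buffered) {
            String added = serviceResult.get(row[1]); // hash lookup on the second column
            out.add(new String[] {row[0], row[1], added});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> buffered = Arrays.asList(
                new String[] {"1", "A"},
                new String[] {"2", "B"},
                new String[] {"3", "C"});
        Map<String, String> serviceResult = new HashMap<>();
        serviceResult.put("A", "foo");
        serviceResult.put("B", "bar");
        serviceResult.put("C", "gaz");
        for (String[] row : join(buffered, serviceResult)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Since the rows are already buffered in the "in" stage, the join adds only one map lookup per row on top of the memory that the virtual component holds anyway.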
As always, @sabuto, thanks in advance!
gabriele