Job Design Best Practices

Hi all,
On 12/12 I'm giving an intro to TOS DI talk at the OCJUG (RSVP here if you can make it). I'd like to offer a few slides on job design best practices and would love some feedback/discussion on this. I searched, but couldn't find much that wasn't what I'd call marketing material. As my audience is mostly Java programmers, I'm looking for specific practices for developers, not guidelines for managers.
Here's what I have so far: tips and gotchas that I've discovered by creating jobs and putting them in production. Please let me know if you agree or disagree, and add what's worked well for you.
If you're embedding jobs, name them in capitalized camel case, e.g. TalendJob, as the job name becomes the class name.
Otherwise, you may want to prefix them with a letter/number combo indicating phase (ETL) and order of execution (idea stolen from a previous forum thread).
For transactional behavior with rollback:
o Start the job with a DB connection component
o Check "Use an existing connection" in all relevant components
o Check "Die on error" in all relevant components
o End the job with a commit component
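In plain JDBC terms, the four steps above amount to roughly the following. This is only a sketch of the pattern, not the code Talend generates; the class TxSketch, the SqlWork interface, and the method name are made up for illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper (not Talend-generated code): one shared connection,
// stop on the first error, commit once at the end.
class TxSketch {

    // Illustrative callback type standing in for the job's DB components.
    interface SqlWork {
        void run(Connection conn) throws SQLException;
    }

    // Returns true if the work was committed, false if it was rolled back.
    static boolean runInTransaction(Connection conn, SqlWork work) throws SQLException {
        conn.setAutoCommit(false);   // the connection opened at the start of the job
        try {
            work.run(conn);          // every step reuses the same connection
            conn.commit();           // the commit at the end of the job
            return true;
        } catch (SQLException | RuntimeException e) {
            conn.rollback();         // any failure discards all uncommitted changes
            return false;
        }
    }
}
```

The point of sharing one connection is that a failure in any step leaves nothing half-written, because no step commits on its own.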
Whenever appropriate (esp. for inserting data), use the schema from the repository for DB objects.
o Remember that when connecting a component, propagating changes to a DB component will change it to use a built-in schema, which won't get updated when you update the schema in the repository.
On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Failure handling subjob:
o It's an unconnected subjob (no triggers point to it)
o Use tLogCatcher to catch and record component failures.
o Record failure in DB, file, email, etc.
o Add rollback component to undo DB changes if necessary. May need to do this in the job if strategic placement is needed.
If the job has an input file, upon completion move the file to a success or failure directory as appropriate. Consider appending a timestamp to the file name.
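As a rough illustration of the move-and-timestamp idea in plain Java (inside a job you would use file components instead), something like the following could work. The class name, directory names, and timestamp pattern are my own assumptions, not anything Talend prescribes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: names and layout are illustrative only.
class InputFileArchiver {

    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Moves the processed file into <baseDir>/success or <baseDir>/failure and
    // appends a timestamp so a later run does not collide with an earlier file.
    static Path archive(Path input, Path baseDir, boolean succeeded, LocalDateTime when) throws IOException {
        Path targetDir = baseDir.resolve(succeeded ? "success" : "failure");
        Files.createDirectories(targetDir);
        String stamped = input.getFileName() + "." + when.format(STAMP);
        return Files.move(input, targetDir.resolve(stamped));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("job-demo");
        Path input = Files.createFile(base.resolve("orders.csv"));
        System.out.println(archive(input, base, true, LocalDateTime.now()));
    }
}
```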
Use a context for job variables.
o Note you can specify type for variables.
o You can load values from a file or database, or pass in a context if the job is embedded in a Java application.
o For multi-host deployment:
-- Export the job with a "bootstrap" context that has all variables, but populates only a context config location that is the same for all machines.
-- The context config file has all values required for that host, e.g. test DB connection for test machine.
-- On Windows, a leading slash is resolved against the drive of the current working directory (usually the main system drive), so "/Data/" will normally translate to C:\Data\
-- Be mindful of file permissions for sensitive context data (e.g., DB password)
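To make the bootstrap idea concrete, here is a minimal plain-Java sketch of reading host-specific values from one fixed location. It assumes a simple key=value properties file; the path, the key names, and the class name are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical loader: the bootstrap context supplies only configLocation,
// everything else comes from the file found there on each host.
class HostContextLoader {

    static Properties load(Path configLocation) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(configLocation, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Assumed path; the point is that it is identical on every machine.
        Path configLocation = Paths.get(args.length > 0 ? args[0] : "/Data/config/job.properties");
        if (Files.exists(configLocation)) {
            Properties ctx = load(configLocation);
            System.out.println("db_host=" + ctx.getProperty("db_host"));
        } else {
            System.out.println("No config file at " + configLocation.toAbsolutePath());
        }
    }
}
```

The same exported job can then run unchanged on test and production, since only the file contents differ per host.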
When using Java in expressions/filters/etc., use methods - not operators - whenever possible. For example, concat(String) instead of the dot operator, equals(Object) instead of ==.
Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
Use "Bulk" output components when possible.
----
I'd like to hear your tips, esp. for things like recording job statistics/progress, and anything I might have missed.
Thanks,
Philip
http://philip.yurchuk.com/

Re: Job Design Best Practices

Hi,
Thanks for your interest in Talend.
Best regards
Sabrina

Re: Job Design Best Practices

When you use Java expressions in dialog boxes, be aware that they are inserted into the emitted code verbatim. Enclose Java expressions in (round) parentheses to make sure they are evaluated as a unit in the resulting expression. You can't count on future versions of Talend generating code in the same way as the current version.
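A tiny plain-Java sketch of why the parentheses matter. Suppose the dialog box contains a + b and the surrounding code (purely hypothetical here, not what any Talend version actually emits) multiplies whatever was typed by 10:

```java
// Illustration only: the "* 10" stands in for whatever code happens to
// surround the expression after it has been pasted in verbatim.
class VerbatimExpressionDemo {

    // The expression "a + b" pasted as typed: precedence binds b * 10 first.
    static int pastedAsTyped(int a, int b) {
        return a + b * 10;
    }

    // The expression "(a + b)" pasted as typed: evaluated as a unit.
    static int pastedInParens(int a, int b) {
        return (a + b) * 10;
    }

    public static void main(String[] args) {
        System.out.println(pastedAsTyped(1, 2));   // 21
        System.out.println(pastedInParens(1, 2));  // 30
    }
}
```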
If you use the versioning feature (Job/Version; it may be professional edition only?), save a new version to be the development version, and deploy the next-to-most-recent version, which has a known and now unchangeable state, in TAC.
Use the "View", "Label Format" feature frequently - leaving the default _UNIQUE_NAME_ is tantamount to Java comments like "//assignment" or "for loop".
Don't use context variables to hold runtime information. Or, depending on your shop policy, do use context variables to hold runtime data, and don't use the globalMap. But have a policy. Doing "pass whole context" in tRunJob would be an exceptional case. (Context variables have the huge advantage of being typed and working with autocompletion.)
I'm interested in your context config file - did you arrive at that opinion independently of the discussion at: http://www.talendbyexample.com/talend-reusable-context-load-job.html ?
Gotcha: Be aware that if you have an arrangement like:
Subjob1 -> On Subjob Okay -> Subjob2 -> On Subjob Okay -> Subjob3
If Subjob2 is disabled, the overall job is now split into two jobs that run in whatever order they happen to run in, which can be confusing. This comes up in practice when Subjob2 is just something like a tJava with a println.
Have a policy about when/whether to use the global variables that are automatically created by components (tFileInputPositional_1_NB_LINE) - in particular, should components that read the files created by other components refer to those files with the auto-created global map variables, or should the file writing components write to files named by context variables and the file reading components read from files named by context variables? I think Java devs are likely to instinctively prefer the under-my-control named context variables, but the GUI support for the autogenerated names is a bright spot in Studio.
I have found that variables in component dialog boxes getting evaluated at start-of-subjob time, instead of when the component occurs in the flow, is a significant source of confusion, so shops might consider having a prescribed way to use and name tFixedFlowInput to force iteration when needed to achieve run-time evaluation: http://www.talendforge.org/forum/viewtopic.php?id=28990
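To show the trade-off in code, here is a small stand-in sketch. It assumes the globalMap behaves like a java.util.Map<String, Object> keyed by component name plus a suffix, and it uses a hand-written Context class in place of the generated one; both are simplifications for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins only: a plain HashMap plays the globalMap and a hand-written
// class plays the typed context.
class GlobalMapVsContext {

    // Typed: the compiler and autocompletion know what each field is.
    static class Context {
        String inputFile;
        Integer expectedRows;
    }

    // Untyped: the caller must know the exact key and cast the value,
    // and the entry is absent until the component has actually run.
    static int linesRead(Map<String, Object> globalMap, String componentName) {
        Object value = globalMap.get(componentName + "_NB_LINE");
        return value == null ? 0 : (Integer) value;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileInputPositional_1_NB_LINE", 42);
        System.out.println(linesRead(globalMap, "tFileInputPositional_1")); // 42

        Context context = new Context();
        context.inputFile = "/Data/in/orders.txt"; // hypothetical path
        System.out.println(context.inputFile);
    }
}
```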

Re: Job Design Best Practices

This doesn't sound right to me:
"When using Java in expressions/filters/etc., use methods - not operators - whenever possible. For example, concat(String) instead of the dot operator,"

Can you give an example of what you are warning against? I think things like row6.InputFile.toLowerCase().contains("\\\\backup\\\\") are acceptable in Talend jobs.
"equals(Object) instead of ==."

That part strikes me more as a Java training thing. Which reminds me of something else that's not really "design" in your sense, but still maybe relevant when talking guidelines - I've found even seasoned devs don't understand BigDecimal vs. machine floating point types. I think shops should make sure designers have a cheat sheet of how to use BigDecimal (actually maybe that could get elaborated in this forum). I suspect that the majority of ETL shops will never encounter an appropriate place to use double or float, and I even wonder if the Talend GUI should do something to call those out as suspect types.
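As a possible starting point for such a cheat sheet, a few of the usual BigDecimal points in runnable form. The money helper and its choice of scale 2 with HALF_EVEN rounding are just one reasonable convention, not a recommendation from Talend.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalCheatSheet {

    // Build from a String, not from a double literal, so no binary
    // rounding error sneaks in before the value is even stored.
    static BigDecimal money(String amount) {
        return new BigDecimal(amount).setScale(2, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        // Binary floating point cannot represent 0.1 or 0.2 exactly.
        System.out.println(0.1 + 0.2);                          // 0.30000000000000004
        System.out.println(money("0.10").add(money("0.20")));   // 0.30

        // equals() compares value and scale; compareTo() compares value only.
        BigDecimal a = new BigDecimal("2.0");
        BigDecimal b = new BigDecimal("2.00");
        System.out.println(a.equals(b));          // false
        System.out.println(a.compareTo(b) == 0);  // true

        // Division needs an explicit scale and rounding mode; without them
        // a non-terminating result throws ArithmeticException.
        System.out.println(money("10.00").divide(new BigDecimal("3"), 2, RoundingMode.HALF_EVEN)); // 3.33
    }
}
```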
Re: Job Design Best Practices

Thanks Levin! Greatly appreciate your feedback and tips.
Regarding methods vs. operators, I believe my issue may have been caused by not using parentheses in an expression filter, as you caution for dialogs. In an expression filter in a tMap, I was doing a String compare using == that would have worked in a normal Java app, but did not evaluate to true when it should have. When I switched to the equals method (e.g., foo.equals(bar) where both are Strings) it evaluated/filtered correctly. Next time I will try wrapping it in parentheses to see if that fixes it.
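For anyone following along, here is a minimal plain-Java sketch of the difference between the two comparisons. On Strings, == compares object references while equals compares characters, so == can be false for identical text when one value was built at run time (as row data typically is). That suggests the parentheses may not be the deciding factor here, though I have not verified this inside a tMap.

```java
class StringCompareDemo {

    // True only when both variables point at the very same String object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True when both Strings contain the same characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "abc";
        String built = new String("abc"); // stands in for a value read at run time
        System.out.println(sameReference(literal, built)); // false
        System.out.println(sameContent(literal, built));   // true
    }
}
```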
The free version of TOS DI has versioning, and that's a good point that versions can only be increased.
Sounds like you've got a lot of experience creating interactive jobs, whereas my experience is all batch. I also need to get better at decomposing jobs using tRunJob. I'll review the docs and the example you reference. Sounds like there are perhaps non-obvious issues when using a context in conjunction with tRunJob.
In my talk I demonstrate the need for OnSubjobOk or OnComponentOk to chain jobs; otherwise execution order is not guaranteed. But thanks for pointing out the disabled component/subjob issue; I wasn't aware of that. Prior to deployment, I remove anything unnecessary for production, but I can see that disabled subjobs would cause issues during development and local testing.
Excellent point about float vs. BigDecimal. I wish Talend would default to BigDecimal instead of float for a DB decimal type, as I work in ecommerce. But yes, I've seen enterprise ecommerce systems running big name brands that use float to represent money! As an aside, this is a good article on representing money in Java:
http://www.javapractices.com/topic/TopicAction.do?Id=13
I'll concede that BigDecimal is not nearly as intuitive as float or double, but it's a must for those who represent money in Java.
Thanks again!