I made a boo-boo the other day. I promoted some code to production that had tLogRow components active. Not only is this inefficient, but it bloats the log file and could even have written personal data to log files, which would be hard to justify under GDPR.
So, I wrote a Python script to scan a Talend export file for active tLogRow components and print them out. If you are exporting a dozen jobs with 50 components on each, and you use tLogRow for debugging, this could be a useful safety net.
    import re
    import sys
    import zipfile
    from xml.dom import minidom

    print("=====================")
    print("Talend Export Scanner")
    print("=====================")

    if len(sys.argv) < 2:
        print("Insufficient args")
        sys.exit(1)

    # The export zip is the first command-line argument
    zipfilename = sys.argv[1]

    with zipfile.ZipFile(zipfilename, 'r') as zf:
        for name in zf.namelist():
            # Skip directories and anything that is not a job item file
            if name.endswith('/'):
                continue
            if not name.endswith('.item'):
                continue
            if not re.match(r"^\w+/process/", name):
                continue
            #print("Parsing " + name)
            f = zf.open(name)
            xdoc = minidom.parse(f)
            nodes = xdoc.getElementsByTagName('node')
            #print("There are " + str(len(nodes)) + " nodes")
            node_problems = 0  # counted per job so the "Job:" header prints once
            for node in nodes:
                if node.getAttribute("componentName") == "tLogRow":
                    ePs = node.getElementsByTagName('elementParameter')
                    #print("There are " + str(len(ePs)) + " elementParameters")
                    active = True
                    unique_name = ""
                    label = ""
                    for eP in ePs:
                        pname = eP.getAttribute('name')
                        pvalue = eP.getAttribute('value')
                        if pname == 'ACTIVATE' and pvalue == 'false':
                            active = False
                        if pname == 'UNIQUE_NAME':
                            unique_name = pvalue
                        if pname == 'LABEL':
                            label = " (" + pvalue + ")"
                    if active:
                        if node_problems == 0:
                            print("")
                            print("Job: " + name)
                        node_problems += 1
                        print("Component " + unique_name + label + " is active")
    print("")
It should be picking up the label, the on-screen name of the component, but it isn't. I'll post a fix when I get time to look at it. Suggestions welcome.
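One way to track down the missing label might be to dump every elementParameter of a tLogRow node and see which name actually carries the on-screen text. The sketch below is only a debugging aid, not a fix: `dump_parameters` is a made-up helper, and `SAMPLE` is a hand-written fragment rather than a real Talend .item file, so the parameter names in a real export may differ.

```python
from xml.dom import minidom


def dump_parameters(xml_text, component="tLogRow"):
    """Return (name, value) pairs for every elementParameter of matching nodes."""
    xdoc = minidom.parseString(xml_text)
    found = []
    for node in xdoc.getElementsByTagName("node"):
        if node.getAttribute("componentName") != component:
            continue
        for ep in node.getElementsByTagName("elementParameter"):
            found.append((ep.getAttribute("name"), ep.getAttribute("value")))
    return found


# Hypothetical, hand-written fragment; a real .item file has many more
# attributes and parameters than this.
SAMPLE = """<process>
  <node componentName="tLogRow">
    <elementParameter name="UNIQUE_NAME" value="tLogRow_1"/>
    <elementParameter name="ACTIVATE" value="true"/>
  </node>
  <node componentName="tMap">
    <elementParameter name="UNIQUE_NAME" value="tMap_1"/>
  </node>
</process>"""

for pname, pvalue in dump_parameters(SAMPLE):
    print(pname + " = " + pvalue)
```

Running this against the XML of a job where the label is known should show whether the text appears under LABEL or under some other parameter name.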
Also welcome are suggestions for anything more that could be checked. I'm considering checking for a couple of connection objects that differ between production and development, and alerting if those are accidentally included in the export, but those checks would be hard-coded to the names of the objects that we use in our environment. Maybe I could push the names into a config file.
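The config file idea could look something like the sketch below. Everything in it is an assumption for illustration: the JSON layout, the `forbidden_names` key, the example object names, and the archive paths are all invented, and the match is a simple prefix test on the file name part of each zip entry.

```python
import json


def load_forbidden(config_text):
    """Parse a JSON config and return the set of names that must not be exported."""
    config = json.loads(config_text)
    return set(config.get("forbidden_names", []))


def find_forbidden(archive_names, forbidden):
    """Return archive entries whose file name starts with a forbidden name."""
    hits = []
    for entry in archive_names:
        basename = entry.rsplit("/", 1)[-1]
        for bad in forbidden:
            if basename.startswith(bad):
                hits.append(entry)
                break
    return hits


# Hypothetical config content; in practice this would be read from a file.
CONFIG = '{"forbidden_names": ["DEV_Oracle", "DEV_Salesforce"]}'

# Hypothetical entries, standing in for zipfile.ZipFile(...).namelist().
NAMES = [
    "MYPROJECT/metadata/connections/DEV_Oracle_0.1.item",
    "MYPROJECT/process/load_customers_0.1.item",
]

for hit in find_forbidden(NAMES, load_forbidden(CONFIG)):
    print("Should not be exported: " + hit)
```

Keeping the names in a config file means the script itself stays generic, and each environment only needs its own list.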
Yes, that other thread gives another reason to do something similar. I created a different script to do that, basically a cut-down version of this one. If I had the time, I'd make it into a single generic script with command-line switches, a bit like Unix find, with switches like --jobname. I will post back here if I ever do!
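A generic version with switches could start from Python's standard argparse module. This is a rough sketch only: `--jobname` is the switch mentioned above, while `--component`, the glob-style matching, and the helper names are assumptions made for the example.

```python
import argparse
import fnmatch


def build_parser():
    """Build the command-line parser for a hypothetical generic scanner."""
    parser = argparse.ArgumentParser(description="Scan a Talend export zip")
    parser.add_argument("zipfile", help="path to the exported zip file")
    parser.add_argument("--jobname", default="*",
                        help="glob pattern for job file names, e.g. 'load_*'")
    parser.add_argument("--component", default="tLogRow",
                        help="component name to report when active")
    return parser


def job_matches(item_path, pattern):
    """True if the file name part of item_path matches the glob pattern."""
    return fnmatch.fnmatchcase(item_path.rsplit("/", 1)[-1], pattern)


# Parse an explicit argument list here so the example runs without a shell;
# a real script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(["export.zip", "--jobname", "load_*"])
print(args.zipfile, args.jobname, args.component)
print(job_matches("MYPROJECT/process/load_customers_0.1.item", args.jobname))
```

The scanning loop would then skip any item for which `job_matches` is false and compare componentName against `args.component` instead of a hard-coded "tLogRow".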