Creating machine learning Jobs using anonymized real patient data


This article shows you how to create a machine learning (ML) Job using real, but anonymized data to build a predictive model. The example shows you how to use Talend Machine Learning components to build such predictive models. While these models could cover a wide range of use cases in several industries, the basic principles are the same.


For more information, see Machine Learning in the Talend Help Center.


Use case

The source data used in this article is from a university hospital study on the outcome of patients who suffered a subarachnoid hemorrhage. The patients were admitted to the hospital, where their condition was monitored for a couple of weeks during treatment. Blood tests were performed every few days, and the samples were tested for specific blood markers. Each patient was classified with a likely outcome, a survivability score known as a Hunt and Hess score. That data was collected alongside demographic data, such as sex and age, together with clinical features such as the predicted outcome for that patient.


The goal is to build a model that predicts the clinical outcome for patients based on various parameters, and then to compare the model's output with the actual predicted outcome recorded in the data to test and verify the Talend model. The ML Job uses the following specifications:

  • Use the anonymized real patient data from the university hospital to build a predictive model of the expected outcome (survivability score) for each patient
  • Build a model, train it with this data, then use it to make predictions
  • Test the model to measure its accuracy


Apache Spark MLlib overview

Spark MLlib is a fast, powerful, distributed machine learning (ML) framework on top of Spark Core. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, and these simplify large scale machine learning pipelines.


MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level building blocks such as basic statistics, linear models (SVMs, logistic regression, linear regression), feature extraction and transformation, and optimization.


Many Talend ML components allow you to call and utilize these MLlib algorithms to process and build ML Jobs.


Talend Machine Learning components

Talend ML components are grouped into four categories: Classification, Clustering, Recommendation, and Regression. This article focuses on the tPredict and tRandomForestModel components in the Classification category.

(Image: Talend Machine Learning components)


tPredict component

The tPredict component uses a given classification, clustering, or regression model to analyze datasets incoming from its preceding component.


tRandomForestModel component

This example uses a random forest model. Random forests work by constructing a multitude of decision trees at training time and then outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.


A good analogy to consider is that you're walking blindly through a forest with your arms outstretched. Each time you hit a tree, you are deflected in a different direction.


The following diagram illustrates the architectural process involved.


The tRandomForestModel component is parameterized by the number of trees in the forest. In the diagram above, this is shown as Tree 1 through Tree B.
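In code terms, the voting a trained random forest performs at prediction time can be sketched in a few lines of Python. This is a hypothetical illustration of the majority-vote idea, not Talend or MLlib code:

```python
from collections import Counter

def forest_predict(tree_predictions):
    # For classification, the forest outputs the mode (majority vote)
    # of the classes predicted by the individual trees.
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical example: five trees vote on a patient's Hunt and Hess score.
print(forest_predict([2, 3, 2, 2, 4]))  # majority class is 2
```

For regression, the individual trees' numeric predictions would instead be averaged.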


Talend Big Data & Machine Learning Sandbox

To replicate these Jobs or to build your own predictive model, based on your own use cases, download a free trial of Talend Big Data & Machine Learning Sandbox.


The sandbox has everything you need to build and run ML Jobs, without having to install all the components yourself. It also comes complete with sample, ready-to-use Jobs.


Anonymized real patient data

The patient data, in spreadsheet format, is used to build a predictive model to predict the Hunt and Hess score based on specific data in the file.



To build your model, use the following process:

  1. Understand the data – what do all the terms mean?
  2. Look for patterns and dependencies – can we find what is dependent on what? What will we model?
  3. Select the variables that may be linked and examine those relationships.
  4. Try and build a model.
  5. Test and tune that model.
  6. Make predictions, then test their accuracy.


Step 2 is the key: Whatever your use case is, you need to understand these relationships and how they are linked. If there are dependencies between certain variables, then you can model these using certain MLlib functions.


This use case established that there is a relationship between a patient's survivability score (the Hunt and Hess score), their age (the older you are, the less likely you are to survive) and the results of specific markers in blood tests (these are excellent indicators of your clinical outcome). This data is used to build a predictive model.
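One simple way to examine such a relationship numerically is the Pearson correlation coefficient. The sketch below uses hypothetical toy values, not the study's real data, to show how a strong age-to-score relationship would appear:

```python
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical toy values: patient ages and Hunt and Hess scores.
ages = [35, 42, 51, 60, 68, 74]
scores = [1, 1, 2, 3, 3, 4]
print(round(pearson(ages, scores), 2))  # close to 1.0: strong relationship
```

A coefficient near 1 (or -1) suggests a variable worth including in the model; values near 0 suggest no linear relationship.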


You can use certain Talend ML components to help identify these relationships in the data. For more information on Talend ML components, see Machine Learning in the Talend Help Center.


Talend machine learning Jobs

This section shows you how to create the predictive model by building Standard Jobs and Big Data Batch Jobs:

  • Standard Job
    • Set up the environment
  • Big Data Jobs
    • Build a cluster analysis model
    • Build a predictive model
    • Train a predictive model
    • Test the model
    • Display the results


Building a Standard Job

Set up the environment by building a Standard Job that takes in the raw data, then filters the data to select only valid data. For example, you may only want data from patients in a specific age range.


The Job creates three datasets using the following components:

  • A tFilterRow component creates the dataset used purely to train your model. This data is filtered, because you want the training data to be as good as possible.
  • A tReplicate component creates two more datasets, exact copies of each other, used to test and demo the model. This data is unfiltered. It is not necessary to filter the data, but the option is available if you want to do so.

    Note: If you only want a sample of the data for demo purposes, you can use a tSampleRow component.


This Job uses the following components:

  • tFileInputDelimited – gets the data from the HDFS filesystem, in this case, the data is stored as a CSV file
  • tFilterRow – filters on the data if required
  • tReplicate – makes two exact copies
  • tSampleRow – selects a sample of the data for demo purposes
  • tHDFSOutput (three instances) – writes the output data to the HDFS filesystem as three files, called test.csv, training.csv, and demo.csv
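As a rough illustration of what this Standard Job does, the filter, replicate, and sample steps can be sketched in plain Python. The rows and the age range below are hypothetical:

```python
import random

# Hypothetical raw rows: patient id, age, and Hunt and Hess score.
raw_rows = [
    {"id": 1, "age": 34, "hunt_hess": 2},
    {"id": 2, "age": 71, "hunt_hess": 4},
    {"id": 3, "age": 55, "hunt_hess": 3},
    {"id": 4, "age": 12, "hunt_hess": 1},  # outside the age range, filtered out
]

# tFilterRow: keep only patients in a chosen age range for training.
training = [r for r in raw_rows if 18 <= r["age"] <= 90]

# tReplicate: two exact copies of the unfiltered data, for test and demo.
test_data = list(raw_rows)
demo_pool = list(raw_rows)

# tSampleRow: take a small sample of the demo copy.
random.seed(0)
demo = random.sample(demo_pool, k=2)

print(len(training), len(test_data), len(demo))
```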



Building Big Data Jobs


Training the model

This section details the components used to build a Big Data Job to train the model. The Job takes the training data produced in the first Job, passes it through a tModelEncoder component, and then uses the result to train a random forest model.



This Job uses the following components:

  • tFileInputDelimited

    This component takes the training data from the Hadoop file system. The only configuration is to specify the location of that file, that is, the same place where the setup Job wrote the training data.

  • tModelEncoder

    This component performs operations that transform data into the format expected by the model-training components. These operations are processing algorithms that transform the given columns of the data and send the result to the model-training component that follows, which eventually trains and creates a predictive model.



    By default, this component contains a set of four different transformations; this example uses the RFormula transformation.


    RFormula is used to implement the transformations required to fit data against an R model formula. Within the function is a small set of R operators that can be used to describe the required transformation.


    This model uses the relationship between the Hunt and Hess score and the patient's age plus the results of various markers in blood tests. This example uses only two of these blood tests. Thus, the required transformation is defined as:


    Hunt_Hess ~ Cyt_B + D_Loop + Age


    Where Cyt_B and D_Loop are two of those blood tests.
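Conceptually, this formula tells the encoder to treat Hunt_Hess as the label and the three columns on the right as features. A hypothetical Python sketch of that mapping (not the actual RFormula implementation) looks like this:

```python
def apply_formula(row):
    # Hunt_Hess ~ Cyt_B + D_Loop + Age
    # Returns (label, feature_vector) in the shape a model-training
    # step expects: the label to predict and the input features.
    label = row["Hunt_Hess"]
    features = [row["Cyt_B"], row["D_Loop"], row["Age"]]
    return label, features

# Hypothetical patient record.
sample = {"Hunt_Hess": 3, "Cyt_B": 0.42, "D_Loop": 1.8, "Age": 64}
label, features = apply_formula(sample)
print(label, features)
```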


  • tRandomForestModel

    This component analyzes feature variables, which are usually pre-processed by the tModelEncoder component, to generate a classifier model that the tPredict component uses to classify given elements. It analyzes incoming datasets by applying the Random Forest algorithm, generates a classification model from this analysis, and writes the model either to memory or to a given file system.

    Note: It is necessary to configure the following settings:

    • Model location: where the model resides
    • Random forest hyper-parameters:

      • Number of trees in the forest: 128
      • Maximum depth of each tree in the forest: 10

    This is done for you in the configuration settings, as shown below.
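These two hyper-parameters bound the size of the forest: a binary tree of depth d has at most 2**d leaves, so 128 trees of depth 10 allow at most 128 × 1024 leaves in total. A quick sanity check of that arithmetic:

```python
num_trees = 128   # Number of trees in the forest
max_depth = 10    # Maximum depth of each tree

# A binary tree of depth d has at most 2**d leaf nodes, so the whole
# forest holds at most num_trees * 2**max_depth leaves in total.
max_leaves = num_trees * 2 ** max_depth
print(max_leaves)  # 131072
```

Deeper trees fit the training data more closely but take longer to train and risk overfitting, which is why both values are capped.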



After the Job is built and configured, you can run it. The Job runs through several stages, which are dependent upon the configuration (this is defined by the depth and the number of trees in the model). Running the Job produces the following output:



You are now ready to use the model to predict results.


Predicting results

This section details the components used to build a Big Data Job to predict results.



This Job uses the following components:

  • tFileInputDelimited

    This component takes the demo data produced in the Standard Job and uses it as input for the predictions.

  • tPredict

    This component takes the input data and applies the model built to make predictions about a patient's survivability score (Hunt and Hess score) based on the variables previously defined; the patient's age and the results of certain blood tests.

    Note: It is necessary to configure the following settings:

    • Model location: where the model resides
    • Model Type: Random Forest Model

    This is done for you in the configuration settings, as shown below.



  • tFileOutputDelimited

    This component outputs the results of the predictive model to a file on the Hadoop filesystem.

After the Job is built, run it. The Job runs through the stages again before finishing.


Testing the predictions

This section details the components used to build a Job to test the results of the model. You can compare the predictions the model made, against the test data you already have, and you can compare the predicted Hunt and Hess score against the actual Hunt and Hess score.



This Job contains the following components:

  • tFileInputDelimited

    This component takes the test data as your input.

  • tPredict

    This component runs the test data through the model and sends the output to the next component.

  • tAggregateRow

    This component takes the predicted Hunt and Hess score and compares it to the actual score. The component configuration is shown below:



  • tLogRow

    This component outputs the results.
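The comparison these components perform can be sketched in plain Python, using hypothetical (actual, predicted) score pairs:

```python
from collections import Counter

# Hypothetical (actual, predicted) Hunt and Hess scores from a test run.
results = [(2, 2), (3, 3), (4, 3), (2, 2), (1, 2), (3, 3)]

# tAggregateRow-style grouping: count how often each (actual, predicted)
# pair occurs, then tally correct versus incorrect predictions.
pair_counts = Counter(results)
correct = sum(n for (actual, pred), n in pair_counts.items() if actual == pred)
wrong = sum(pair_counts.values()) - correct

print(dict(pair_counts))
print(correct, wrong)
```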


Examining the results

This section examines the model to see how accurately it performs. For the comparison, construct what is known as a Confusion Matrix or an Error Matrix, as shown below:



The matrix uses the following four categories as a straightforward, graphical way to display the results:

  • True Positives (TP): The model predicted a positive outcome, and the prediction was correct.
  • True Negatives (TN): The model predicted a negative outcome, and the prediction was correct.
  • False Positives (FP): The model predicted a positive outcome, but the actual outcome was negative (also known as a Type I error).
  • False Negatives (FN): The model predicted a negative outcome, but the actual outcome was positive (also known as a Type II error).

To determine the accuracy of the model, add the True Positives and True Negatives, then divide that sum by the total number of data points.


In this case, the test data had 73 usable data points. The resulting scores are:

  • TP = 60
  • TN = 3
  • FP = 6
  • FN = 4

Overall, how accurate is your model? If you do the calculations, you'll find that you get the following result:

(TP + TN) / Total = (60 + 3) / 73 ≈ 0.863, or approximately 86% accurate


The data had a total of 90 data points. However, only 73 of these could be used, because not all of the data is complete. This is not a great deal of data, but building a model with an accuracy of 86% from it is quite good.
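The accuracy calculation can be checked in a few lines of Python, using the counts reported above:

```python
TP, TN, FP, FN = 60, 3, 6, 4   # counts from the 73 usable test data points
total = TP + TN + FP + FN

# Accuracy is the fraction of predictions that were correct.
accuracy = (TP + TN) / total
print(round(accuracy, 3))  # 0.863
```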


Having more data could improve the model and increase accuracy.



This article showed you how to use real-life data to build a predictive model with Talend Machine Learning components. You can use it as an example when constructing machine learning Jobs for your own use cases, with your own data.

Version history
Revision #: 11 of 11
Last update: 07-23-2019 02:09 PM