Using the Spark MLlib RFormula transformations in the tModelEncoder machine learning component

Overview

This article shows you how to use the RFormula transformation to build and train a machine learning model. The Job uses a tModelEncoder component, which applies Spark MLlib feature transformations to the input data, together with a model training component that builds and trains the ML model.

 

For more information on building a machine learning Job using real data, see the Talend Community Knowledge Base article, Creating machine learning Jobs using anonymized real patient data.

 

Apache Spark MLlib

Spark MLlib provides a large number of machine learning tools, such as common ML algorithms, ML pipeline tools, and utilities for data handling and statistics.

 

Some of the algorithms in MLlib extract, transform, and select features in data. Using these algorithms, you can extract features from raw data; scale, modify, and convert features; select a subset of features from a larger set; and combine feature transformation with other algorithms.

 

This article examines one of the most useful of these, the RFormula transformation.

 

For more information on the Spark MLlib and RFormula, see the Apache Spark Machine Learning Library (MLlib) Guide.

 

Talend tModelEncoder component

The tModelEncoder component transforms input data into the format expected by the model training components, such as the tKMeansModel component. After a model training component receives the data, it applies a learning algorithm to train and create your predictive model.

 

The tModelEncoder component offers several Spark MLlib transformations. The transformations available in the Transformation column vary depending on the type of the input schema columns to be processed.

 

Using RFormula within the tModelEncoder component

This example shows you how to use a tModelEncoder component in a machine learning Job to train your machine learning model.

 

Data pulled from a file on HDFS is passed to the tModelEncoder component. On the Basic settings tab, RFormula (Spark 1.5+) is selected from the Transformation pull-down menu. The data is encoded, and then a model is built; in this case, a Random Forest model.

[Image: rformula1.png]

 

RFormula works by implementing the transforms required for fitting a dataset against an R model formula. It supports a small set of R operators that describe the required transformation. If you want to build a model for training and making predictions, you use the formula to state which variables in the data the model should relate to one another. The formula does not specify the exact relationship; rather, it tells the following modeling component which terms to use to build the model. The form of that relationship is worked out by that following component when it builds the model.

 

RFormula allows you to use the following operators on the data:

  • ~ separates the target and the terms
  • + concatenates terms ("+ 0" removes the intercept)
  • - removes a term ("- 1" removes the intercept)
  • : interaction (multiplication for numeric values, or binarized categorical values)
  • . all columns except the target
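
To make the operators above concrete, here is a minimal pure-Python sketch that splits a formula into its target, its terms, and an intercept flag. It is an illustration only, not Spark's parser: the function name parse_formula is hypothetical, interaction terms such as a:b are kept as a single string, and column names are assumed not to contain "+", "-", or "~".

```python
def parse_formula(formula, columns):
    """Split an R-style formula into (label, terms, intercept).

    Illustrative only: this is not how Spark parses formulas internally.
    """
    # "~" separates the target (label) from the terms.
    label, rhs = (part.strip() for part in formula.split("~", 1))
    terms, intercept = [], True
    # Put a "+" in front of every "-" so each token keeps its own sign.
    for token in rhs.replace("-", "+-").split("+"):
        token = token.strip()
        if not token:
            continue
        removing = token.startswith("-")
        name = token.lstrip("-").strip()
        if name in ("0", "1"):
            # "+ 0" and "- 1" both remove the intercept.
            if removing == (name == "1"):
                intercept = False
            continue
        # "." expands to every column except the target.
        names = [c for c in columns if c != label] if name == "." else [name]
        for n in names:
            if removing:
                terms = [t for t in terms if t != n]
            elif n not in terms:
                terms.append(n)
    return label, terms, intercept


columns = ["Item_ID", "Shop_ID", "Time"]
print(parse_formula("Item_ID ~ Shop_ID + Time", columns))
print(parse_formula("Item_ID ~ . - Time", columns))
```

The second call shows "." and "-" working together: every column except the target is selected, and then Time is removed again.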

 

An example of using RFormula

The easiest way to explain how to use RFormula to build a model is with an example. In this example, the encoded data is passed on to the following component, which builds the actual model.

 

Imagine you are a retailer and you want to model which products sell best at certain times of the day. You want answers to questions such as: at what time does bread sell the most in certain stores? What is the biggest seller in the evening in certain stores? You want to use the data to build a model of which items sell best at what times in your stores.

 

You collect tables of data from your sales points in the form of:

  • Shop_ID
  • Item_ID
  • Time

You want to use the sales data in those tables to build a model of the sales in your stores. For each sale, you know the ID of the item sold, and you also have the Shop_ID and Time variables.

 

So, you would use the following RFormula description: Item_ID ~ Shop_ID + Time.

 

RFormula is very simple. It lets you model data in a simple symbolic way using the operators listed above.

 

This formula is the basis on which the RFormula transformation prepares the data for the model. It takes all the data and constructs a dataframe that contains a features column and a label column. After a model is trained on that data, you can use a Talend tPredict component to make predictions on what items sell at what times.

 

In Spark, a dataframe is a distributed collection of data organized into named columns. It is considered to be the equivalent of a table in a relational database or a dataframe in R/Python. Dataframes can be constructed from a wide array of sources such as structured data files, tables in Hive, and external databases.
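
As a rough illustration of the encoding step, the following pure-Python sketch turns rows into a features vector and a label for the formula Item_ID ~ Shop_ID + Time. It is a simplification under stated assumptions, not Spark code: the encode_rows function and the sample shop IDs and times are hypothetical, Shop_ID is assumed to be a string column that is one-hot encoded with the last category dropped, and Time is assumed to be numeric. Categories are sorted alphabetically here for simplicity, whereas Spark orders them by frequency by default.

```python
def encode_rows(rows, label, categorical, numeric):
    """Return a list of (features, label) pairs for a list of dict rows."""
    # Collect the distinct categories of each categorical column.
    # Alphabetical order is a simplification of what Spark does.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        features = []
        for col in categorical:
            # One flag per category except the last one, which is
            # represented by all flags being 0.0.
            features += [1.0 if row[col] == lvl else 0.0 for lvl in levels[col][:-1]]
        # Numeric columns are passed through as floats.
        features += [float(row[col]) for col in numeric]
        encoded.append((features, row[label]))
    return encoded


# Hypothetical sample data: three sales in three shops.
sales = [
    {"Item_ID": 1, "Shop_ID": "A", "Time": 9},
    {"Item_ID": 2, "Shop_ID": "B", "Time": 13},
    {"Item_ID": 1, "Shop_ID": "C", "Time": 18},
]
for features, item in encode_rows(sales, "Item_ID", ["Shop_ID"], ["Time"]):
    print(features, item)
```

With three shops, each row gets two shop flags followed by the time value, and the target Item_ID becomes the label.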

 

On the Basic settings tab of the tModelEncoder component, in the Transformations table, add the following to the Parameters column:

featuresCol=Features;labelCol=Label;formula=Item_ID~Shop_ID+Time

The configuration above organizes the data within the dataframe, as shown below:

Item_ID   Shop_ID   Time
------------------------
1         X1        Y1
2         X2        Y2
n         Xn        Yn

for items 1 to n, with a Shop_ID value (X) and a Time value (Y) for each item
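
The Parameters value shown earlier reads as a list of key=value pairs separated by semicolons. The sketch below takes the string apart on that assumption to show which part is which. It is not Talend's own parser, and the parse_parameters function is hypothetical.

```python
def parse_parameters(text):
    """Parse 'key=value;key=value' into a dict, splitting each pair on the first '='."""
    params = {}
    for pair in text.split(";"):
        if pair.strip():
            key, value = pair.split("=", 1)
            params[key.strip()] = value.strip()
    return params


params = parse_parameters("featuresCol=Features;labelCol=Label;formula=Item_ID~Shop_ID+Time")
# The formula itself splits into the target and its terms.
label, terms = params["formula"].split("~", 1)
print(params["featuresCol"], params["labelCol"], label, terms.split("+"))
```

Read this way, featuresCol and labelCol name the output columns (Features and Label), and the formula names Item_ID as the target with Shop_ID and Time as the terms.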

 

When the Job containing the tModelEncoder component runs, it uses the data to train a predictive model. You can then use that model in another Job, with the tPredict component, to make predictions for each shop about what items sell at what times. The tPredict component uses the Spark predict() function to make predictions using the model. The predict function returns a dataframe that contains the original, unchanged columns and appends the features and the predicted label as additional columns, which makes it easy to see and display the predicted results.

 

This use case predicts which items sell best at what times, and the results would look like this:

Item_ID   Shop_ID   Time   Features               Label
-------------------------------------------------------
1         X1        Y1     [0.0, 0.0, X1, Y1]     Y1P
2         X2        Y2     [0.0, 1.0, X2, Y2]     Y2P
n         Xn        Yn     [1.0, 0.0, Xn, Yn]     YnP

where P denotes the predicted value

 

Conclusion

This article showed you how to use RFormula in the tModelEncoder component to build a model and how to use the tPredict component to make predictions. The straightforward use case it presented is an example that you can extend to construct more complicated models to suit your own use cases.

 

For more information, see the Apache Spark Class RFormula documentation page.

Version history
Revision 15 of 15, last updated 07-30-2019 01:32 PM