Using Linear Regression with Apache® Ignite™

In the previous article in this Machine Learning series, we looked at the Apache® Ignite™ Machine Learning Grid. Now let’s drill down into some of the Machine Learning algorithms that Apache Ignite supports and try out some examples using popular datasets.

Many suitable datasets are available. One good candidate for Linear Regression is House Prices, and, conveniently, suitable data are available through the UCI web site.

In this article, we will train a Linear Regression model and calculate its R2 score (the coefficient of determination).

Some data preparation is required to get the data into a suitable format for Apache Ignite. Data Scientists often spend a good part of their time on this kind of task.

First, we need to take the raw data and split it into training data (80%) and test data (20%). At the time of writing, Ignite does not provide dedicated data splitting, but this functionality is on the roadmap for a future release. In the meantime, many free and open source tools can perform this type of split, or we could code it ourselves in one of the programming languages supported by Ignite. For this article we'll use scikit-learn, and my colleague Anton Dmitriev at GridGain very kindly wrote the following code to achieve this task:


import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load Boston housing dataset.
# Note: load_boston() was removed in scikit-learn 1.2, so this script needs an earlier release.
boston_dataset = datasets.load_boston()
x = boston_dataset.data
y = boston_dataset.target

# Split it into train (80%) and test (20%) subsets; the fixed seed makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=23)

# Save train set (no index column, no header row; the target is the last column).
train_ds = pd.DataFrame(x_train, columns=boston_dataset.feature_names)
train_ds["TARGET"] = y_train
train_ds.to_csv("boston-housing-train.csv", index=False, header=False)

# Save test set in the same layout.
test_ds = pd.DataFrame(x_test, columns=boston_dataset.feature_names)
test_ds["TARGET"] = y_test
test_ds.to_csv("boston-housing-test.csv", index=False, header=False)

# Train linear regression model.
lr = LinearRegression()
lr.fit(x_train, y_train)

# Score the resulting model (R2 on the test set).
print(lr.score(x_test, y_test))

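This code loads the housing dataset (scikit-learn bundles a copy of the UCI data), performs the data split, saves the training and test subsets as CSV files, and then fits a scikit-learn Linear Regression model and calculates its R2 score on the test data. The value returned is 0.745021053016975, or 74.5%. Later, we’ll compare this against the value returned by Ignite.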
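The split could also be coded by hand, as mentioned earlier. The following is a minimal sketch in plain Java, assuming the rows have already been read into a list. The names TrainTestSplit, Split and split are hypothetical and are not part of Ignite or of the application described in this article. Because it shuffles differently, it will not select the same rows as scikit-learn, even with the same seed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of Ignite or of the article's application.
class TrainTestSplit {
    /** Holds the two subsets produced by a split. */
    static final class Split<T> {
        final List<T> train;
        final List<T> test;

        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * Shuffles a copy of the rows with a fixed seed and cuts it in two.
     * testFraction is the share of rows that goes to the test set, e.g. 0.2.
     */
    static <T> Split<T> split(List<T> rows, double testFraction, long seed) {
        if (testFraction < 0.0 || testFraction > 1.0)
            throw new IllegalArgumentException("testFraction must be in [0, 1]");

        // Shuffle a copy so that the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));

        int testSize = (int) Math.round(shuffled.size() * testFraction);
        int trainSize = shuffled.size() - testSize;

        List<T> train = new ArrayList<>(shuffled.subList(0, trainSize));
        List<T> test = new ArrayList<>(shuffled.subList(trainSize, shuffled.size()));
        return new Split<>(train, test);
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++)
            rows.add(i);

        Split<Integer> s = split(rows, 0.2, 23L);
        System.out.println("train size: " + s.train.size()); // 8 of the 10 rows
        System.out.println("test size: " + s.test.size());   // 2 of the 10 rows
    }
}
```

Fixing the seed keeps the split repeatable between runs, which plays the same role as random_state in the Python code.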

With our training and test data ready, we can start coding the application. My colleague Anton also very kindly wrote a Java application for use with Ignite, and you can download it from GitHub if you would like to follow along. The application performs the following steps:

  1. Read the training data and test data
  2. Store the training data and test data in Ignite
  3. Use the training data to fit the Linear Regression model
  4. Apply the model to the test data
  5. Determine the R2 score of the model

Since the dataset is quite small, we could just load it into standard Java data structures and run Linear Regression directly from within the Java program. Alternatively, we could load the data into Apache Ignite storage and then run Linear Regression on the stored data. The advantage of using Apache Ignite storage is that the data will be distributed across the cluster, so the training will also be distributed. For datasets too large to handle comfortably on a single machine, this could be a significant benefit. In our example we will load the data into Ignite storage.

Read the training data and test data

We have two CSV files to read: one for the training data and the other for the test data. We can use the following code to read values from the CSV files:


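// Requires java.io.File, java.io.FileNotFoundException, java.util.Scanner and org.apache.ignite.IgniteCache.
private static void loadData(String fileName, IgniteCache<Integer, HouseObservation> cache)
        throws FileNotFoundException {

   // try-with-resources closes the Scanner (and the underlying file) when done.
   try (Scanner scanner = new Scanner(new File(fileName))) {
      int cnt = 0;
      while (scanner.hasNextLine()) {
         String row = scanner.nextLine();
         String[] cells = row.split(",");

         // All columns except the last are features; the last column is the price.
         double[] features = new double[cells.length - 1];
         for (int i = 0; i < cells.length - 1; i++)
            features[i] = Double.parseDouble(cells[i]);
         double price = Double.parseDouble(cells[cells.length - 1]);

         // Use the row number as the cache key.
         cache.put(cnt++, new HouseObservation(features, price));
      }
   }
}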

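The code reads the data line by line and splits each line on the CSV field separator. Each field value is converted to a double: the last field becomes the price and the remaining fields become the features. The resulting observation is then stored in Ignite, with the row number as its key.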
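The loading code relies on a HouseObservation class that is not shown here. A minimal sketch of what such a class might look like follows. The field layout, the Serializable marker and the fromCsvRow helper are assumptions made for illustration, and the real class in the GitHub application may differ. The sample row in main is made up and is not taken from the dataset.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical sketch of the value class stored in the cache;
// the class shipped with the application may look different.
class HouseObservation implements Serializable {
    private static final long serialVersionUID = 1L;

    private final double[] features;
    private final double price;

    HouseObservation(double[] features, double price) {
        this.features = features;
        this.price = price;
    }

    double[] getFeatures() {
        return features;
    }

    double getPrice() {
        return price;
    }

    /** Parses one CSV row: every cell but the last is a feature, the last is the price. */
    static HouseObservation fromCsvRow(String row) {
        String[] cells = row.split(",");
        if (cells.length < 2)
            throw new IllegalArgumentException("Expected at least one feature and a price: " + row);

        double[] features = new double[cells.length - 1];
        for (int i = 0; i < features.length; i++)
            features[i] = Double.parseDouble(cells[i].trim());
        double price = Double.parseDouble(cells[cells.length - 1].trim());

        return new HouseObservation(features, price);
    }

    public static void main(String[] args) {
        // Made-up row with three features and a price.
        HouseObservation obs = fromCsvRow("0.5,12.0,3.25,24.0");
        // prints "[0.5, 12.0, 3.25] -> 24.0"
        System.out.println(Arrays.toString(obs.getFeatures()) + " -> " + obs.getPrice());
    }
}
```

Keeping the features in a single double[] means the same array can be handed directly to the feature extractor used during training.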

Store the training data and test data in Ignite

The previous code stores data values in Ignite. To use this code, we need to create the Ignite storage first, as follows:


IgniteCache<Integer, HouseObservation> trainData = ignite.createCache("BOSTON_HOUSING_TRAIN");

IgniteCache<Integer, HouseObservation> testData = ignite.createCache("BOSTON_HOUSING_TEST");

Use the training data to fit the Linear Regression model

Now that our data are stored, we can create the trainer as follows:


DatasetTrainer<LinearRegressionModel, Double> trainer = new LinearRegressionLSQRTrainer();

and fit a linear model to the training data, as follows:


LinearRegressionModel mdl = trainer.fit(
   ignite,
   trainData,
   (k, v) -> v.getFeatures(),  // Feature extractor.
   (k, v) -> v.getPrice()      // Label extractor.
);

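Ignite stores data in a Key-Value (K-V) format, so both extractors receive the key and the value of each entry and read only from the value part. The feature extractor returns the feature columns, and the label extractor returns the target value, which is the price.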
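To make concrete what fitting a linear model means, here is a small sketch of least squares for a single feature, using the closed-form solution. This is not the LSQR algorithm that the Ignite trainer uses, and it handles only one feature rather than the full set of housing features. SimpleLeastSquares and fit are hypothetical names, and the data points are made up.

```java
// Illustration only: closed-form least squares for ONE feature.
// Not the LSQR method used by the Ignite trainer.
class SimpleLeastSquares {
    /** Returns {intercept, slope} minimizing the sum of squared residuals. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("Need at least two points, with x and y of equal length");

        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of cross-deviations (sxy) and squared x-deviations (sxx).
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0)
            throw new IllegalArgumentException("x values must not all be equal");

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Made-up points that lie exactly on the line y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};

        double[] model = fit(x, y);
        System.out.println("intercept = " + model[0] + ", slope = " + model[1]);
    }
}
```

With several features, the fitted model has one weight per feature plus an intercept, and a prediction is the weighted sum of the feature values plus that intercept.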

Apply the model to the test data

Next, we are ready to check the test data against the trained linear model. On the Apache Ignite Machine Learning roadmap, there is a plan to provide built-in score calculators. For now, we can do the following:


double meanPrice = getMeanPrice(testData);
double u = 0, v = 0;

try (QueryCursor<Cache.Entry<Integer, HouseObservation>> cursor = testData.query(new ScanQuery<>())) {
   for (Cache.Entry<Integer, HouseObservation> testEntry : cursor) {
      HouseObservation observation = testEntry.getValue();

      double realPrice = observation.getPrice();
      double predictedPrice = mdl.apply(new DenseLocalOnHeapVector(observation.getFeatures()));

      u += Math.pow(realPrice - predictedPrice, 2);
      v += Math.pow(realPrice - meanPrice, 2);
   }
}

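Here we calculate the residual sum of squares (u) and the total sum of squares (v). The getMeanPrice helper, which is part of the application, returns the mean price across the test data.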

Determine the R2 score of the model

We can find the value of R2 as 1 - u / v:


double score = 1 - u / v;

System.out.println("Score : " + score);

This gives us the value 0.7450194305206714, or 74.5%. Rounded to one decimal place, this percentage matches what we achieved earlier with Scikit-learn; the unrounded scores differ by less than 0.00001.
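The score calculation from the last two steps can also be tried without a cluster. The sketch below works on plain arrays instead of an Ignite cache. R2Score, r2 and mean are hypothetical names (mean stands in for the getMeanPrice helper, whose code is not shown in this article), and the numbers in main are made up rather than taken from the housing data.

```java
// Stand-alone sketch of the R2 calculation on plain arrays (no Ignite required).
class R2Score {
    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0;
        for (double value : values)
            sum += value;
        return sum / values.length;
    }

    /** Computes R2 = 1 - u / v from actual and predicted values. */
    static double r2(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0)
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");

        double meanActual = mean(actual);
        double u = 0; // Residual sum of squares.
        double v = 0; // Total sum of squares.
        for (int i = 0; i < actual.length; i++) {
            u += Math.pow(actual[i] - predicted[i], 2);
            v += Math.pow(actual[i] - meanActual, 2);
        }
        if (v == 0)
            throw new IllegalArgumentException("R2 is undefined when all actual values are equal");

        return 1 - u / v;
    }

    public static void main(String[] args) {
        // Made-up values: u = 1.0 and v = 20.0, so the score is approximately 0.95.
        double[] actual = {3, 5, 7, 9};
        double[] predicted = {2.5, 5.5, 6.5, 9.5};
        System.out.println("Score : " + r2(actual, predicted));
    }
}
```

A model that always predicts the mean scores 0 and a perfect model scores 1, which is why a score of about 0.745 can be read as the model explaining roughly 74.5% of the variance in the test prices.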

Summary

Apache Ignite provides a library of Machine Learning algorithms. Through a Linear Regression example, we have seen the ease with which we can create a model, test the model and determine the R2 score of the model. We can now also use this model to make predictions.

Today, many Machine Learning tools are available, but many of them cannot scale beyond a single node and so can only handle relatively small quantities of data. In contrast, Ignite can scale both:

  1. The size of the cluster (hundreds or thousands of machines).
  2. The quantity of data stored (hundreds of Gigabytes, Terabytes or even Petabytes).

Ignite can therefore run Machine Learning at scale. It can truly manage Machine Learning on Big Data using distributed processing.

In the next part of this Apache Ignite Machine Learning series, we’ll look at another Machine Learning algorithm. Stay tuned!