Using K-Means Clustering with Apache® Ignite™

July 19, 2018

In the previous article in this Machine Learning series, we looked at k-NN Classification with Apache® Ignite™. We’ll now look at another Machine Learning algorithm and conclude our series. In this article, we’ll look at K-Means Clustering using the Titanic dataset. Conveniently, Kaggle provides the dataset in CSV form. For our analysis, we are interested in two clusters: passengers who survived and passengers who did not.

Some cleanup and formatting is required to get the data into a suitable format for Apache Ignite. The CSV data contains a number of columns, as follows:

Passenger id
Survived (0 = no, 1 = yes)
Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Passenger name
Gender
Age in years
Number of siblings / spouses aboard the Titanic
Number of parents / children aboard the Titanic
Ticket number
Passenger fare
Cabin number
Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Our first task is to remove any columns that are unique to a particular passenger and are therefore unlikely to correlate with survival. So, we can remove the following:

Passenger id
Passenger name
Ticket number
Cabin number

Next, we’ll remove any rows where data are missing, such as Age or Port of embarkation. We could impute these values instead, but for our initial analysis we will simply drop the incomplete rows.

Our final step will be to convert several fields to a numeric format. For example, Gender will be converted as follows:

0 = female
1 = male

and Port of embarkation as follows:

0 = Q (Queenstown)
1 = C (Cherbourg)
2 = S (Southampton)
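The two mappings above could be sketched in Java as follows. This is only an illustration: the class and method names (CategoryEncoder, encodeGender, encodePort) are hypothetical, and the sketch assumes the raw CSV holds the strings female/male and Q/C/S. The preprocessing code actually used for this article is not shown.

```java
import java.util.Locale;

// Hypothetical helper, not part of Apache Ignite or the article's code.
final class CategoryEncoder {
    private CategoryEncoder() {}

    /** Maps Gender to 0 (female) or 1 (male). */
    static double encodeGender(String gender) {
        switch (gender.trim().toLowerCase(Locale.ROOT)) {
            case "female": return 0.0;
            case "male":   return 1.0;
            default:
                throw new IllegalArgumentException("Unknown gender: " + gender);
        }
    }

    /** Maps Port of embarkation to 0 (Q), 1 (C) or 2 (S). */
    static double encodePort(String port) {
        switch (port.trim().toUpperCase(Locale.ROOT)) {
            case "Q": return 0.0; // Queenstown
            case "C": return 1.0; // Cherbourg
            case "S": return 2.0; // Southampton
            default:
                throw new IllegalArgumentException("Unknown port: " + port);
        }
    }
}
```

Throwing on an unknown value, rather than silently returning a default, makes any row that slipped through the missing-value cleanup visible immediately.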

The final dataset consists of the following columns:

Ticket class
Gender
Age in years
Number of siblings / spouses aboard the Titanic
Number of parents / children aboard the Titanic
Passenger fare
Port of embarkation
Survived

... and 712 rows of data. The Survived column has been moved to the end, so it is now the last column.

We’ll now split the data into training data (80%) and test data (20%). As we have done in the previous articles in this series, we’ll use Scikit-learn to do this data splitting for us.
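For readers who would rather stay in Java, a comparable 80/20 split can be approximated with a seeded shuffle. The sketch below is an assumption-laden illustration and not the Scikit-learn procedure used for this article: the class name TrainTestSplit, the seed value, and the choice to round the test size up are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffles CSV rows and splits them into train and test.
final class TrainTestSplit {
    final List<String> train;
    final List<String> test;

    private TrainTestSplit(List<String> train, List<String> test) {
        this.train = train;
        this.test = test;
    }

    /**
     * @param rows         CSV rows, one observation per element
     * @param testFraction fraction of rows to hold out, e.g. 0.2
     * @param seed         seed for the shuffle, so the split is repeatable
     */
    static TrainTestSplit split(List<String> rows, double testFraction, long seed) {
        // Copy first so the caller's list is left untouched.
        List<String> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));

        // Assumption: round the test size up (712 rows -> 143 test rows).
        int testSize = (int) Math.ceil(shuffled.size() * testFraction);

        List<String> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<String> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return new TrainTestSplit(train, test);
    }
}
```

With 712 rows and a test fraction of 0.2, this yields 569 training rows and 143 test rows.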

With our training and test data ready, we can start coding the application. You can download the code from GitHub if you would like to follow along. Our application will perform the following steps:

Read the training data and test data
Store the training data and test data in Ignite
Use the training data to fit the K-Means Clustering model
Apply the model to the test data
Determine the confusion matrix and the accuracy of the model

Read the training data and test data

We can use the following code to read in the values from the CSV files:


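import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.apache.ignite.IgniteCache;

private static void loadData(String fileName, IgniteCache<Integer, TitanicObservation> cache)
        throws FileNotFoundException {

   // try-with-resources closes the Scanner (and the underlying file) when done.
   try (Scanner scanner = new Scanner(new File(fileName))) {
      int cnt = 0;
      while (scanner.hasNextLine()) {
         String row = scanner.nextLine();
         if (row.trim().isEmpty())
            continue; // Skip blank lines, such as a trailing newline.

         String[] cells = row.split(",");

         // All columns except the last are features.
         double[] features = new double[cells.length - 1];
         for (int i = 0; i < cells.length - 1; i++)
            features[i] = Double.parseDouble(cells[i]);

         // The last column is the Survived label.
         double survivedClass = Double.parseDouble(cells[cells.length - 1]);

         cache.put(cnt++, new TitanicObservation(features, survivedClass));
      }
   }
}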

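The code reads the data line by line and splits each line into fields on the CSV separator (a comma). Each field value is converted to a double: the last field becomes the Survived label and the remaining fields become the feature vector. Each observation is then stored in Ignite under an incrementing integer key.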
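The TitanicObservation class itself is not listed in this article. Judging only from how it is used (a constructor taking the features and the label, plus getFeatures() and getSurvivedClass()), a minimal version might look like the sketch below. The field names and the decision to implement Serializable are assumptions; the version in the GitHub repository may differ.

```java
import java.io.Serializable;

// Minimal sketch of the value type stored in the Ignite caches.
// Inferred from usage in this article; the real class may differ.
final class TitanicObservation implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Numeric feature columns (ticket class, gender, age, and so on). */
    private final double[] features;

    /** Survived label: 0 = no, 1 = yes. */
    private final double survivedClass;

    TitanicObservation(double[] features, double survivedClass) {
        this.features = features;
        this.survivedClass = survivedClass;
    }

    double[] getFeatures() {
        return features;
    }

    double getSurvivedClass() {
        return survivedClass;
    }
}
```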

Store the training data and test data in Ignite

The previous code stores the data values in Ignite. Before calling it, we need to create the Ignite caches, as follows:


IgniteCache<Integer, TitanicObservation> trainData = getCache(ignite, "TITANIC_TRAIN");
IgniteCache<Integer, TitanicObservation> testData = getCache(ignite, "TITANIC_TEST");

loadData("src/main/resources/titanic-train.csv", trainData);
loadData("src/main/resources/titanic-test.csv", testData);

The code for getCache() is implemented as follows:


private static IgniteCache<Integer, TitanicObservation> getCache(Ignite ignite, String cacheName) {

   CacheConfiguration<Integer, TitanicObservation> cacheConfiguration = new CacheConfiguration<>();
   cacheConfiguration.setName(cacheName);
   cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 10));

   IgniteCache<Integer, TitanicObservation> cache = ignite.createCache(cacheConfiguration);

   return cache;
}

Use the training data to fit the K-Means Clustering model

Now that our data are stored, we can create the trainer as follows:


KMeansTrainer trainer = new KMeansTrainer()
        .withK(2)
        .withDistance(new EuclideanDistance())
        .withSeed(123L);

We set the value of k to 2 to represent the two clusters (survived and did not survive). For the distance measure, we have several options, such as Euclidean, Hamming or Manhattan; we’ll use Euclidean in this case. We have also set the seed to 123 so that runs are repeatable.
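To make the difference between the three distance measures concrete, here is a plain-Java sketch of each. It is only an illustration of the arithmetic and does not use Ignite's own distance classes; the class name Distances and its methods are hypothetical.

```java
// Hypothetical helper showing the arithmetic behind each distance measure.
final class Distances {
    private Distances() {}

    /** Square root of the sum of squared coordinate differences. */
    static double euclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Sum of absolute coordinate differences. */
    static double manhattan(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Number of coordinates at which the two vectors differ. */
    static double hamming(double[] a, double[] b) {
        checkSameLength(a, b);
        int differing = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i])
                differing++;
        return differing;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("Vectors must have the same length");
    }
}
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7.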

We can now fit the K-Means Clustering model to the training data, as follows:


KMeansModel mdl = trainer.fit(
        ignite,
        trainData,
        (k, v) -> v.getFeatures(),        // Feature extractor.
        (k, v) -> v.getSurvivedClass()    // Label extractor.
);

Ignite stores data in a Key-Value (K-V) format, so both extractors in the above code read from the value part of each entry. The target value is the Survived class and the features are the other columns.

Apply the model to the test data

Next, we are ready to check the test data against the trained model. We can do this as follows:


int amountOfErrors = 0;
int totalAmount = 0;
int[][] confusionMtx = {{0, 0}, {0, 0}};

try (QueryCursor<Cache.Entry<Integer, TitanicObservation>> cursor = testData.query(new ScanQuery<>())) {
   for (Cache.Entry<Integer, TitanicObservation> testEntry : cursor) {
      TitanicObservation observation = testEntry.getValue();

      double groundTruth = observation.getSurvivedClass();
      double prediction = mdl.apply(new DenseLocalOnHeapVector(observation.getFeatures()));

      totalAmount++;
      if ((int) groundTruth != (int) prediction)
         amountOfErrors++;

      int idx1 = (int) prediction;
      int idx2 = (int) groundTruth;

      confusionMtx[idx1][idx2]++;

      System.out.printf(">>> | %.4f\t | %.0f\t\t\t|\n", prediction, groundTruth);
   }
}

Determine the confusion matrix and the accuracy of the model

Now we can compare the model’s predictions with the actual Survived values (Ground Truth) using our test data.

Running the code gives us the following summary:


>>> Absolute amount of errors 56
>>> Accuracy 0.6084
>>> Precision 0.5865
>>> Recall 0.9873
>>> Confusion matrix is [[78, 55], [1, 9]]
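The summary figures can be derived from the confusion matrix alone. The sketch below is inferred from the reported numbers and is not taken from the article's code: it assumes rows are indexed by prediction and columns by ground truth (as in the loop above), and that class 0 is treated as the positive class for precision and recall. Under those assumptions it reproduces the values printed above. The class name ConfusionMetrics is hypothetical.

```java
// Hypothetical helper: derives summary metrics from a 2x2 confusion matrix
// laid out as m[prediction][groundTruth], with class 0 as the positive class.
final class ConfusionMetrics {
    private ConfusionMetrics() {}

    /** Off-diagonal cells are the misclassified observations. */
    static int errors(int[][] m) {
        return m[0][1] + m[1][0];
    }

    /** Correct predictions divided by all predictions. */
    static double accuracy(int[][] m) {
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (m[0][0] + m[1][1]) / (double) total;
    }

    /** Of everything predicted as class 0, the share that truly is class 0. */
    static double precision(int[][] m) {
        return m[0][0] / (double) (m[0][0] + m[0][1]);
    }

    /** Of everything that truly is class 0, the share predicted as class 0. */
    static double recall(int[][] m) {
        return m[0][0] / (double) (m[0][0] + m[1][0]);
    }
}
```

For the matrix [[78, 55], [1, 9]], this gives 56 errors out of 143 observations, an accuracy of 87/143, a precision of 78/133 and a recall of 78/79.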

Can we improve upon these initial results? One thing we can try is to scale the features, so that columns with large values (such as Passenger fare) do not dominate the Euclidean distance. In Scikit-learn and Ignite, we can use MinMaxScaler(), and applying this gives us the following summary:


>>> Absolute amount of errors 29
>>> Accuracy 0.7972
>>> Precision 0.8205
>>> Recall 0.8101
>>> Confusion matrix is [[64, 14], [15, 50]]
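For reference, min-max scaling rescales each column to the range [0, 1] by subtracting the column minimum and dividing by the column range. The plain-Java sketch below illustrates that arithmetic only. It is not the Ignite or Scikit-learn implementation, the class name MinMaxScaling is hypothetical, and mapping constant columns to 0 is an assumption made here to avoid dividing by zero.

```java
import java.util.Arrays;

// Hypothetical helper illustrating column-wise min-max scaling.
final class MinMaxScaling {
    private MinMaxScaling() {}

    /** Returns a scaled copy of rows; the input is not modified. */
    static double[][] scale(double[][] rows) {
        if (rows.length == 0)
            return new double[0][];

        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: find the minimum and maximum of every column.
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: apply (x - min) / (max - min) per column.
        double[][] scaled = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                // Assumption: a constant column is mapped to 0.
                scaled[i][j] = range == 0.0 ? 0.0 : (rows[i][j] - min[j]) / range;
            }
        }
        return scaled;
    }
}
```

In practice the minimum and maximum would normally be computed on the training data and then reused for the test data, so that both sets are scaled in the same way.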

As part of further analysis, we should also investigate the relationship between Survived and features such as Age and Gender.

Summary

K-Means Clustering is an unsupervised algorithm and, in the general case, doesn’t suit supervised learning tasks. However, such an approach can be effective if classes are well separated. For our analysis, we were interested in two clusters: passengers who survived and passengers who did not.

This concludes this series on Machine Learning with Apache Ignite. The reader is encouraged to try out the various examples provided with the Apache Ignite Machine Learning library.