Random forest is an ensemble learning method for classification and regression problems. Random forest training builds an ensemble (model composition) of models of one type and combines their answers with an aggregation algorithm. Each model is trained on a part of the training dataset, defined according to the bagging and feature subspace methods. More information about these concepts may be found here: 1, 2, and 3.
There are several implementations of aggregation algorithms in Apache Ignite ML:
MeanValuePredictionsAggregator - computes the answer of a random forest as the mean value of the predictions from all models in the given composition. It is usually used for regression tasks.
OnMajorityPredictionsAggregator - returns the mode of the predictions from all models in the given composition. This can be useful for classification tasks. NOTE: This aggregator supports multi-class classification tasks.
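As a standalone illustration of the two aggregation rules (this is plain Java re-implementing the math, not the Ignite classes themselves): regression-style aggregation averages the predictions, while majority-style aggregation takes the most frequent predicted label.

```java
import java.util.HashMap;
import java.util.Map;

public class AggregatorsSketch {
    // Regression-style aggregation: the mean of all model predictions.
    static double meanAggregate(double[] predictions) {
        double sum = 0.0;
        for (double p : predictions)
            sum += p;
        return sum / predictions.length;
    }

    // Classification-style aggregation: the most frequent predicted label (mode).
    static double majorityAggregate(double[] predictions) {
        Map<Double, Integer> counts = new HashMap<>();
        double best = predictions[0];
        int bestCount = 0;
        for (double p : predictions) {
            int c = counts.merge(p, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] regressionPreds = {1.0, 2.0, 3.0};
        double[] classPreds = {0.0, 1.0, 1.0, 2.0, 1.0};
        System.out.println(meanAggregate(regressionPreds));  // 2.0
        System.out.println(majorityAggregate(classPreds));   // 1.0
    }
}
```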
The random forest algorithm is implemented in Ignite ML as a special case of model composition with specific aggregators for different problems: MeanValuePredictionsAggregator for regression and OnMajorityPredictionsAggregator for classification.
Here is an example of model usage:
ModelsComposition randomForest = ….

double prediction = randomForest.apply(featuresVector);
The random forest training algorithm is implemented with the RandomForestClassifierTrainer trainer, which takes the following parameters:
meta - the features meta, a list of feature type descriptions, each containing:
featureId - the index in the features vector.
a flag that is true if the feature is categorical.
This meta-information is important for the random forest training algorithm because it builds feature histograms, and categorical features must be represented in the histograms for all of their values.
featuresCountSelectionStrgy - sets the strategy that defines the number of random features used for learning one tree. Several strategies are available: SQRT, LOG2, ALL, and ONE_THIRD, implemented in the FeaturesCountSelectionStrategies class.
maxDepth - sets the maximum tree depth.
minImpurityDelta - a node in a decision tree is split into two nodes if the impurity values of these two nodes are less than the unsplit node's impurity by at least minImpurityDelta.
subSampleSize - a value lying in the [0; MAX_DOUBLE] interval. This parameter defines the expected number of repetitions of each sample when sampling uniformly with replacement.
seed - the seed value used in the random generators.
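The feature-count selection strategies are simple functions of the total feature count. Assuming the conventional definitions (SQRT selects √n features, LOG2 selects log₂ n, ONE_THIRD selects n/3, and ALL selects every feature; the exact rounding Ignite applies may differ), they can be sketched in plain Java as:

```java
public class FeatureCountSketch {
    // Hypothetical re-implementations of the selection strategies,
    // shown only to illustrate how each one scales with n.
    static int sqrtStrategy(int n)     { return (int) Math.sqrt(n); }
    static int log2Strategy(int n)     { return (int) Math.round(Math.log(n) / Math.log(2)); }
    static int oneThirdStrategy(int n) { return Math.max(1, n / 3); }
    static int allStrategy(int n)      { return n; }

    public static void main(String[] args) {
        int n = 64; // total number of features in the dataset
        System.out.println(sqrtStrategy(n));     // 8
        System.out.println(log2Strategy(n));     // 6
        System.out.println(oneThirdStrategy(n)); // 21
    }
}
```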
Random forest training may be used as follows:
RandomForestClassifierTrainer trainer = new RandomForestClassifierTrainer(featuresMeta)
    .withCountOfTrees(101)
    .withFeaturesCountSelectionStrgy(FeaturesCountSelectionStrategies.ONE_THIRD)
    .withMaxDepth(4)
    .withMinImpurityDelta(0.)
    .withSubSampleSize(0.3)
    .withSeed(0);

ModelsComposition rf = trainer.fit(
    datasetBuilder,
    featureExtractor,
    labelExtractor
);
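The subSampleSize value used above (0.3) controls bagging: each tree sees a bootstrap sample in which every row of the original dataset appears subSampleSize times on average. A plain-Java sketch of that idea, assuming subSampleSize simply scales the number of draws with replacement (Ignite's internal bagging implementation may work differently, e.g. via per-row sampling):

```java
import java.util.Random;

public class BaggingSketch {
    // Draws (int)(subSampleSize * n) row indices uniformly with replacement,
    // so each of the n rows is picked subSampleSize times on average.
    static int[] bootstrapCounts(int n, double subSampleSize, long seed) {
        Random rnd = new Random(seed);
        int draws = (int) (subSampleSize * n);
        int[] counts = new int[n];
        for (int i = 0; i < draws; i++)
            counts[rnd.nextInt(n)]++;
        return counts;
    }

    public static void main(String[] args) {
        // 1000 rows, subSampleSize = 0.3 => 300 draws in total.
        int[] counts = bootstrapCounts(1000, 0.3, 0L);
        int total = 0;
        for (int c : counts)
            total += c;
        System.out.println(total); // 300
    }
}
```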