🔗 NaiveBayesClassifier
The NaiveBayesClassifier
implements a trivial Naive Bayes classifier for
numerical data. The class offers standard classification functionality. Naive
Bayes is useful for multi-class classification (i.e. classes are 0
, 1
, 2
,
etc.), and due to its simplicity scales well to large-data scenarios.
Simple usage example:
// Train a Naive Bayes classifier on random data and predict labels:
// All data and labels are uniform random; 5 dimensional data, 4 classes.
// Replace with a data::Load() call or similar for a real application.
arma::mat dataset(5, 1000, arma::fill::randu); // 1000 points.
arma::Row<size_t> labels =
arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 3));
arma::mat testDataset(5, 500, arma::fill::randu); // 500 test points.
mlpack::NaiveBayesClassifier nbc; // Step 1: create model.
nbc.Train(dataset, labels, 4); // Step 2: train model.
arma::Row<size_t> predictions;
nbc.Classify(testDataset, predictions); // Step 3: classify points.
// Print some information about the test predictions.
std::cout << arma::accu(predictions == 2) << " test points classified as class "
<< "2." << std::endl;
Quick links:
- Constructors: create
NaiveBayesClassifier
objects. Train()
: train model.Classify()
: classify with a trained model.- Other functionality for loading, saving, and inspecting.
- Examples of simple usage and links to detailed example projects.
- Template parameters for using different element types for a model.
See also:
🔗 Constructors
nbc = NaiveBayesClassifier()
- Initialize the model without training.
- You will need to call
Train()
later to train the model before callingClassify()
.
nbc = NaiveBayesClassifier(dimensionality, numClasses, epsilon=1e-10)
- Initialize model to the given dimensionality and number of classes without training.
- This is meant to be used with the incremental version of
Train()
that takes only a single point.
nbc = NaiveBayesClassifier(data, labels, numClasses, incremental=true, epsilon=1e-10)
- Train model, optionally specifying whether to do incremental training.
Constructor Parameters:
name | type | description | default |
---|---|---|---|
data |
arma::mat |
Column-major training matrix. | (N/A) |
labels |
arma::Row<size_t> |
Training labels, between 0 and numClasses - 1 (inclusive). Should have length data.n_cols . |
(N/A) |
numClasses |
size_t |
Number of classes in the dataset. | (N/A) |
incremental |
bool |
If true , then the model will not be reset before training, and will use a robust incremental algorithm for variance computation. |
true |
epsilon |
double |
Initial small value for sample variances, to prevent underflow (via log(0) ). |
1e-10 |
As an alternative to passing the epsilon
parameter, it can be set with the
standalone Epsilon()
method: nbc.Epsilon() = eps;
will set the value of
epsilon
to eps
for the next time non-incremental Train()
or Reset()
is
called.
🔗 Training
If training is not done as part of the constructor call, it can be done with the
Train()
function:
nbc.Train(data, labels, numClasses, incremental=true, epsilon=1e-10)
- Train model on the given data, optionally specifying whether to do incremental training.
- Arguments described in Constructor Parameters table above.
nbc.Train(point, label)
- Incrementally train on a single data point with the given label.
- Ensure that the model has the right size and number of classes by using the
appropriate constructor form to set
dimensionality
, or by callingReset()
(see other functionality).
name | type | description | default |
---|---|---|---|
point |
arma::vec |
Column-major training point (i.e. one column). | (N/A) |
label |
size_t |
Training label, in range 0 to numClasses . |
(N/A) |
Note: when performing incremental training, if data
has a different
dimensionality than the model, or if numClasses
is different, the model will
be reset. For single-point Train()
, if point
has different dimensionality,
an exception will be thrown.
🔗 Classification
Once a NaiveBayesClassifier
model is trained, the Classify()
member function
can be used to make class predictions for new data.
size_t predictedClass = nbc.Classify(point)
- (Single-point)
- Classify a single point, returning the predicted class (
0
throughnumClasses - 1
, inclusive).
nbc.Classify(point, prediction, probabilitiesVec)
- (Single-point)
- Classify a single point and compute class probabilities.
- The predicted class is stored in
prediction
. - The probability of class
i
can be accessed withprobabilitiesVec[i]
.
nbc.Classify(data, predictions)
- (Multi-point)
- Classify a set of points.
- The prediction for data point
i
can be accessed withpredictions[i]
.
nbc.Classify(data, predictions, probabilities)
- (Multi-point)
- Classify a set of points and compute class probabilities.
- The prediction for data point
i
can be accessed withpredictions[i]
. - The probability of class
j
for data pointi
can be accessed withprobabilities(j, i)
.
Classification Parameters:
usage | name | type | description |
---|---|---|---|
single-point | point |
arma::vec |
Single point for classification. |
single-point | prediction |
size_t& |
size_t to store class prediction into. |
single-point | probabilitiesVec |
arma::vec& |
arma::vec& to store class probabilities into; will have length 2. |
 |  |  |  |
multi-point | data |
arma::mat |
Set of column-major points for classification. |
multi-point | predictions |
arma::Row<size_t>& |
Vector of size_t s to store class prediction into; will be set to length data.n_cols . |
multi-point | probabilities |
arma::mat& |
Matrix to store class probabilities into (number of rows will be equal to 2; number of columns will be equal to data.n_cols ). |
🔗 Other Functionality
-
A
NaiveBayesClassifier
model can be serialized withdata::Save()
anddata::Load()
. -
nbc.Probabilities()
will return a column vector of lengthnumClasses
representing the prior probability of each class. -
nbc.Means()
will return a matrix with rows equal to the dimensionality of the model andnumClasses
columns. Columni
represents the sample mean of classi
. -
nbc.Variances()
will return a matrix with rows equal to the dimensionality of the model andnumClasses
columns. The element at rowi
and columnj
represents the sample variance in dimensioni
of classj
. -
nbc.Reset()
will reset the model to zeros; this is useful before incremental training. The formnbc.Reset(dimensionality, numClasses, epsilon=1e-10)
can also be used to set the dimensionality and number of classes in the reset model. -
nbc.TrainingPoints()
will return the number of points that the model has been trained on. Whennbc.Reset()
is called, this is reset to 0.
🔗 Simple Examples
See also the simple usage example for a trivial usage
of the NaiveBayesClassifier
class.
Train a Naive Bayes classifier incrementally, one point at a time, then compute accuracy on a test set and save the model to disk.
// See https://datasets.mlpack.org/mnist.train.csv.
arma::mat dataset;
mlpack::data::Load("mnist.train.csv", dataset, true);
// See https://datasets.mlpack.org/mnist.train.labels.csv.
arma::Row<size_t> labels;
mlpack::data::Load("mnist.train.labels.csv", labels, true);
mlpack::NaiveBayesClassifier nbc(dataset.n_rows /* dimensionality */,
10 /* numClasses */);
// Iterate over all points in the dataset and call Train() on each point.
for (size_t i = 0; i < dataset.n_cols; ++i)
nbc.Train(dataset.col(i), labels[i]);
// Now compute the accuracy of the fully trained model on a test set.
// See https://datasets.mlpack.org/mnist.test.csv.
arma::mat testDataset;
mlpack::data::Load("mnist.test.csv", testDataset, true);
// See https://datasets.mlpack.org/mnist.test.labels.csv.
arma::Row<size_t> testLabels;
mlpack::data::Load("mnist.test.labels.csv", testLabels, true);
arma::Row<size_t> predictions;
nbc.Classify(dataset, predictions);
const double trainAccuracy = 100.0 *
((double) arma::accu(predictions == labels)) / labels.n_elem;
std::cout << "Accuracy of model on training data: " << trainAccuracy << "\%."
<< std::endl;
nbc.Classify(testDataset, predictions);
const double testAccuracy = 100.0 *
((double) arma::accu(predictions == testLabels)) / testLabels.n_elem;
std::cout << "Accuracy of model on test data: " << testAccuracy << "\%."
<< std::endl;
// Save the model to disk with the name "nbc".
mlpack::data::Save("nbc_model.bin", "nbc", nbc, true);
Load a saved Naive Bayes classifier and print some information about it.
mlpack::NaiveBayesClassifier nbc;
// Load the model named "nbc" from "nbc_model.bin".
mlpack::data::Load("nbc_model.bin", "nbc", nbc, true);
// Print information about the model.
std::cout << "The dimensionality of the model in nbc_model.bin is "
<< nbc.Means().n_rows << "." << std::endl;
std::cout << "The number of classes in the model is "
<< nbc.Probabilities().n_elem << "." << std::endl;
std::cout << "The model was trained on " << nbc.TrainingPoints() << " points."
<< std::endl;
std::cout << "The prior probabilities of each class are: "
<< nbc.Probabilities().t();
// Compute the class probabilities of a random point.
// For our random point, we'll use one of the means plus some noise.
arma::vec randomPoint = nbc.Means().col(2) +
10.0 * arma::randu<arma::vec>(nbc.Means().n_rows);
size_t prediction;
arma::vec probabilities;
nbc.Classify(randomPoint, prediction, probabilities);
std::cout << "Random point class prediction: " << prediction << "."
<< std::endl;
std::cout << "Random point class probabilities: " << probabilities.t();
See also the following fully-working examples:
🔗 Advanced Functionality: Different Element Types
The NaiveBayesClassifier
class has one template parameter that can be used to
control the element type of the model. The full signature of the class is:
NaiveBayesClassifier<ModelMatType>
ModelMatType
specifies the type of matrix used for training data and internal
representation of model parameters.
-
Any matrix type that implements the Armadillo API can be used.
-
Train()
andClassify()
functions themselves are templatized and can allow any matrix type that has the same element type. So, for instance, aNaiveBayesClassifier<arma::mat>
can accept anarma::sp_mat
for training.
The example below trains a Naive Bayes model on sparse 32-bit floating point data, but uses dense 32-bit floating point matrices to store the model itself.
// Create random, sparse 100-dimensional data, with 3 classes.
arma::sp_fmat dataset;
dataset.sprandu(100, 5000, 0.3);
arma::Row<size_t> labels =
arma::randi<arma::Row<size_t>>(5000, arma::distr_param(0, 2));
mlpack::NaiveBayesClassifier<arma::fmat> nbc(dataset, labels, 3);
// Now classify a test point.
arma::sp_fvec point;
point.sprandu(100, 1, 0.3);
size_t prediction;
arma::fvec probabilitiesVec;
nbc.Classify(point, prediction, probabilitiesVec);
std::cout << "Prediction for random test point: " << prediction << "."
<< std::endl;
std::cout << "Class probabilities for random test point: "
<< probabilitiesVec.t();
Note: dense objects should be used for ModelMatType
, since in general
the mean and sample variance of sparse data is dense.