Friday, May 25, 2012

LATINO Model

Introduction
Exercise 1: Using the centroid classifier
Exercise 2: Using the batch-update centroid classifier
Exercise 3: Using the k-nearest neighbours classifier


Introduction


(Intro TBD)

The following list briefly describes the most important LATINO Model data structures, predictive models, and clustering algorithms:

LabeledDataset<LblT, ExT>: A labeled dataset data structure. Contains a collection of LabeledExample<LblT, ExT>. An input into the LATINO supervised learners (predictive models).
UnlabeledDataset<ExT>: An unlabeled dataset data structure. An input into the LATINO clustering algorithms.
CosineSimilarity: An implementation of the cosine similarity measure.
DotProductSimilarity: An implementation of the dot product similarity measure.
CentroidClassifier<LblT>: An implementation of the centroid classifier.
BatchUpdateCentroidClassifier<LblT>: Another variant of the centroid classifier.
KnnClassifier<LblT, ExT>: An implementation of the k-nearest neighbours (k-NN) classifier.
KnnClassifierFast<LblT>: A k-NN implementation optimized for speed.
MaximumEntropyClassifier: An implementation of the maximum entropy classifier.
MaximumEntropyClassifierFast<LblT>: A maximum entropy classifier implementation optimized for speed.
NaiveBayesClassifier<LblT>: An implementation of the Naive Bayes classifier.
MajorityClassifier<LblT, ExT>: An implementation of the majority classifier.
KMeansClustering: An implementation of the k-means clustering algorithm.
KMeansClusteringFast: A k-means clustering implementation optimized for speed.
Prediction<LblT>: The output of the LATINO predictive models.
ClusteringResult: The output of the LATINO clustering algorithms.

In the following, we give several examples of using the LATINO Model data structures and algorithms.
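Before diving into the exercises, it may help to see what the two similarity measures from the list above compute. The following is an illustrative plain-Python sketch (not the LATINO API) of the dot product and cosine similarity on sparse vectors represented as dicts:

```python
import math

# Sparse vectors represented as {feature_index: weight} dicts.
def dot_product(u, v):
    # Iterate over the smaller vector for efficiency.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[i] for i, w in u.items() if i in v)

def cosine(u, v):
    # Cosine similarity is the dot product normalized by the vector lengths.
    norm = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return dot_product(u, v) / norm if norm else 0.0

u = {0: 1.0, 2: 2.0}
v = {2: 2.0, 5: 1.0}
print(dot_product(u, v))       # 4.0
print(round(cosine(u, v), 4))  # 0.8
```

Unlike the dot product, cosine similarity is invariant to vector length, which is why it is the usual choice for comparing documents of different sizes.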


Exercise 1: Using the centroid classifier


The nearest centroid (or nearest prototype) classifier is a multi-class classifier that assigns an example to the class with the nearest centroid. In this exercise, we load a training set and a test set, train a centroid classifier, and compute its accuracy on the test set. CentroidClassifier takes a LabeledDataset of SparseVector<double> as input and can use any similarity measure (ISimilarity) for classification.
using System;
using Latino;
using Latino.Model;

namespace Latino.Model.Tutorials
{
    class Program
    {
        static void Main(string[] args)
        {
            // load datasets
            LabeledDataset<int, SparseVector<double>> trainDataset = ModelUtils.LoadDataset(@"C:\Latino\Tutorials\Datasets\Example1\train.dat");
            LabeledDataset<int, SparseVector<double>> testDataset = ModelUtils.LoadDataset(@"C:\Latino\Tutorials\Datasets\Example1\test.dat");
            // train a centroid classifier            
            CentroidClassifier<int> classifier = new CentroidClassifier<int>();
            classifier.Similarity = CosineSimilarity.Instance;
            classifier.NormalizeCentroids = false;
            classifier.Train(trainDataset);
            // test the classifier
            int correct = 0;
            int all = 0;
            foreach (LabeledExample<int, SparseVector<double>> labeledExample in testDataset)
            {
                if (labeledExample.Example.Count != 0)
                {
                    Prediction<int> prediction = classifier.Predict(labeledExample.Example);
                    if (prediction.BestClassLabel == labeledExample.Label) { correct++; }
                    all++;
                }
            }
            // output the result
            Console.WriteLine("Correctly classified: {0} of {1} ({2:0.00}%)", correct, all, (double)correct / (double)all * 100.0); 
        }
    }
}
The Python equivalent (requires Python for .NET):
import clr
from Latino import * 
from Latino.Model import *

# load datasets
trainDataset = ModelUtils.LoadDataset("C:\\Latino\\Tutorials\\Datasets\\Example1\\train.dat")
testDataset = ModelUtils.LoadDataset("C:\\Latino\\Tutorials\\Datasets\\Example1\\test.dat")

# train a centroid classifier
classifier = CentroidClassifier[int]()
classifier.Similarity = CosineSimilarity.Instance
classifier.NormalizeCentroids = False
classifier.Train(trainDataset)

# test the classifier
correct = 0
all = 0
for labeledExample in testDataset:
    if labeledExample.Example.Count != 0:
        prediction = classifier.Predict(labeledExample.Example)
        if prediction.BestClassLabel == labeledExample.Label:
            correct += 1
        all += 1

# output the result
print "Correctly classified: {0} of {1} ({2:0.2f}%)".format(correct, all, float(correct) / float(all) * 100.0)
This code outputs the following to the console:
 
 Correctly classified: 574 of 598 (95.99%)
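The nearest-centroid rule itself is simple enough to sketch in a few lines of plain Python. The following is an illustrative re-implementation on dense vectors (not the LATINO API): training averages the vectors of each class into a centroid, and prediction picks the class whose centroid is most similar to the query.

```python
import math

def centroid(vectors):
    # The class prototype: the mean of the class's feature vectors.
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train(labeled):
    # labeled: list of (label, vector) pairs; one centroid per label.
    by_label = {}
    for label, vec in labeled:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(centroids, x):
    # Assign x to the class with the most similar centroid.
    return max(centroids, key=lambda label: cosine(centroids[label], x))

train_set = [(0, [1.0, 0.1]), (0, [0.9, 0.0]), (1, [0.0, 1.0]), (1, [0.1, 0.9])]
model = train(train_set)
print(predict(model, [1.0, 0.2]))  # 0
print(predict(model, [0.2, 1.0]))  # 1
```

This mirrors what CentroidClassifier does conceptually; the real implementation works on sparse vectors and supports pluggable similarity measures.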
 

Exercise 2: Using the batch-update centroid classifier


The batch-update centroid classifier [Tan, 2007] is a variant of the centroid classifier that iteratively repositions the centroids in order to reduce the classification error on the training set. In this exercise, we load a training set and a test set, train a batch-update centroid classifier, and compute its accuracy on the test set. BatchUpdateCentroidClassifier takes a LabeledDataset of SparseVector<double> as input and uses the dot product similarity measure for both training and classification. The classifier construction and training block (the lines below the // train a centroid classifier comment) in the code snippet from the previous exercise needs to be replaced with the following:
// normalize the feature vectors
foreach (LabeledExample<int, SparseVector<double>> labeledExample in trainDataset) { Utils.TryNrmVecL2(labeledExample.Example); }
foreach (LabeledExample<int, SparseVector<double>> labeledExample in testDataset) { Utils.TryNrmVecL2(labeledExample.Example); } 
// train a batch-update centroid classifier
BatchUpdateCentroidClassifier<int> classifier = new BatchUpdateCentroidClassifier<int>();
classifier.Logger = Logger.GetInstanceLogger("Latino.Model.BatchUpdateCentroidClassifier");
classifier.Iterations = 20;
classifier.PositiveValuesOnly = true;
classifier.Damping = 0.8;
classifier.Train(trainDataset);
The modified code outputs the following to the console:
 
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Iteration 1 / 20 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Computing dot products ...
 Centroid 2 / 2 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Classifying training examples ...
 Example 2000 / 2000 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Training set error rate: 4.25%
 Centroid 2 / 2 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Iteration 2 / 20 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Computing dot products ...
 Centroid 2 / 2 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Classifying training examples ...
 Example 2000 / 2000 ...
 2012-05-29 10:24:11 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Training set error rate: 2.15%
 Centroid 2 / 2 ...
 ...
 2012-05-29 10:24:12 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Iteration 20 / 20 ...
 2012-05-29 10:24:12 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Computing dot products ...
 Centroid 2 / 2 ...
 2012-05-29 10:24:12 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Classifying training examples ...
 Example 2000 / 2000 ...
 2012-05-29 10:24:12 Latino.Model.BatchUpdateCentroidClassifier.1 Train
 INFO: Training set error rate: 0.35%
 Centroid 2 / 2 ...
 Correctly classified: 580 of 598 (96.99%)
 
Note that BatchUpdateCentroidClassifier normalizes the centroids in the Euclidean sense and uses the dot product similarity measure to compare feature vectors. The feature vectors should therefore be normalized in the Euclidean sense before they are used for training or classification. This is effectively equivalent to using the cosine similarity measure, but considerably faster. Note also that a CentroidClassifier set to use the cosine similarity measure produces exactly the same result as a BatchUpdateCentroidClassifier with Iterations set to 0 (provided that the vectors are normalized), but the latter runs faster in the classification phase.
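The equivalence claimed above — that the dot product of L2-normalized vectors equals their cosine similarity — can be verified with a few lines of plain Python (illustrative only, not the LATINO API):

```python
import math

def l2_normalize(v):
    # Scale v to unit Euclidean length (the "Euclidean sense" normalization).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u, v = [3.0, 4.0], [1.0, 2.0]
# The dot product of the normalized vectors equals the cosine similarity
# of the originals, so normalizing once up front saves work per comparison.
print(abs(cosine(u, v) - dot(l2_normalize(u), l2_normalize(v))) < 1e-12)  # True
```

This is why normalizing the feature vectors once, as done at the top of the modified snippet, lets the classifier use the cheaper dot product at prediction time.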


Exercise 3: Using the k-nearest neighbours classifier


The k-nearest neighbours (k-NN) classifier is a multi-class classifier that classifies objects based on the closest training examples in the feature space. In this exercise, we load a training set and a test set, train a k-NN classifier, and compute its accuracy on the test set. KnnClassifier takes examples of any type as input and can use any similarity measure (ISimilarity) for classification. The classifier construction and training block (the lines below the // train a centroid classifier comment) in the code snippet from Exercise 1 needs to be replaced with the following:
// train a k-NN classifier            
KnnClassifier<int, SparseVector<double>> classifier = new KnnClassifier<int, SparseVector<double>>(CosineSimilarity.Instance);
classifier.K = 30;
classifier.SoftVoting = true;
classifier.Train(trainDataset);
The modified code outputs the following to the console:
 
 Correctly classified: 575 of 598 (96.15%)
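The k-NN decision rule, including the soft-voting option set above, can be sketched in plain Python (an illustrative re-implementation, not the LATINO API). With soft voting, each of the k neighbours contributes a vote weighted by its similarity to the query; with hard voting, each neighbour counts equally.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train_set, x, k, soft_voting=True):
    # train_set: list of (label, vector); rank neighbours by similarity to x.
    neighbours = sorted(train_set, key=lambda lv: cosine(lv[1], x), reverse=True)[:k]
    votes = {}
    for label, vec in neighbours:
        # Soft voting weights each vote by similarity; hard voting counts 1 each.
        votes[label] = votes.get(label, 0.0) + (cosine(vec, x) if soft_voting else 1.0)
    return max(votes, key=votes.get)

train_set = [(0, [1.0, 0.0]), (0, [0.9, 0.2]), (1, [0.0, 1.0]), (1, [0.2, 0.9])]
print(knn_predict(train_set, [1.0, 0.1], k=3))  # 0
```

Soft voting tends to help when k is large relative to the class sizes, since distant neighbours of the wrong class contribute little weight.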
 

