View on GitHub

record-classification

This project provides an automatic record classification tool.

Running an Experiment

This example aims to illustrate how to use classli to run an experiment. Running an experiment involves executing a batch command file for a number of times and produce aggregated results, e.g. aggregated precision/recall measurements if experiment includes evaluation of a classifier.

Before we start, let’s create a new folder named experiment_example. This folder will contain all the input files we pass to classli and all the output files classli produces.

To run an experiment we first need a batch command file, which contains steps in our experiment. Here is the batch command file we are going to use in this example:

# Step 1
init

# Step 2
set --classifier EXACT_MATCH --seed 42

# Step 3
load --from training_data.csv gold_standard -h -d "," -t 0.8

# Step 4
clean -c COMBINED

# Step 5
train

# Step 6
evaluate -o classified_evaluation_data.csv

The lines starting with # are comments and are ignored by classli. The explanation of the steps are:

Step 1 – initialise a new classification process
Step 2 – set the classifier to EXACT_MATCH, and seed the random number generator with 42 in order to make the results reproducible
Step 3 – load comma-separated gold standard data from training_data.csv file at the current working directory, and skip header record, where 80% of records are used for training and the remaining 20% are used for evaluation.
Step 4 – clean loaded gold standard and unseen data using all the predefined cleaners.
Step 5 – train the classifier using the loaded training records (i.e. 80% of the loaded gold standard data).
Step 6 – evaluate the classifier using the remaining 20% of the cleaned gold standard data, and store the classified evaluation records in a file called classified_evaluation_data.csv at the current working directory.

Store the batch command file into a file named batch.txt within the folder we created earlier. In this example we are going to re-use the training data set from the simple classification example. Make sure the training_data.csv is also copied into the folder.

Fire up the command line interface of your operating system (Command Prompt in windows, Terminal in Mac OS X and Shell in Linux). Change the current directory to the one we created earlier by typing cd, followed by a space, followed by the path to the directory.

We are now ready to run the experiment:

classli experiment --commands batch.txt --repeat 3

The above command runs the commands specified in batch.txt 3 times, where the results of each repetition is stored in folders called repetition_1, repetition_2 and repetition_3. The aggregated results of the evaluation is printed in the console:

Number Of Classes:             7.00 ± 0.00
Number Of Classifications:     5.00 ± 0.00
Number Of True Positives:      0.00 ± 0.00
Number Of True Negatives:      25.00 ± 0.00
Number Of False Negatives:     5.00 ± 0.00
Number Of False Positives:     5.00 ± 0.00
Macro Average Accuracy:        0.83 ± 0.00
Macro Average F1:              0.00 ± 0.00
Macro Average Precision:       NaN ± NaN
Macro Average Recall:          0.00 ± 0.00
Micro Average Accuracy:        0.71 ± 0.00
Micro Average F1:              0.00 ± 0.00
Micro Average Precision:       0.00 ± 0.00
Micro Average Recall:          0.00 ± 0.00

The ± values are the confidence intervals with 95% confidence level.

Home | CLI