View on GitHub

record-classification

This project provides an automatic record classification tool.

Batch Command Execution Example

In order to execute multiple commands at once, the commands need to be specified in a text file, where each line in the file corresponds to a command and its options. Here is an example:

# Step 1: initialise classli
init

# Step 2: set the classifier to EXACT_MATCH
set --classifier EXACT_MATCH

# Step 3: load gold standard data from file 'gs.csv' in current working directory,
#         skip header record, use all records for training
load --from gs.csv gold_standard -h -t 1.0

# Step 4: load unseen data from file 'unseen.csv' in current working directory,
#         skip header record
load --from unseen.csv unseen -h

# Step 5: clean loaded gold standard and unseen data using all predefined cleaners
clean -c COMBINED

# Step 6: train classifier using all cleaned gold standard data (no internal evaluation)
train -it 1.0

# Step 7: classify unseen records, store classified records in file
#         'classified-unseen.csv' in current working directory
classify -o classified-unseen.csv

The lines starting with # character are comments; they are ignored by the classli. The comments in the above example explain what each of the 8 commands do. Additionally, any line that only contains whitespace characters in the batch commands file is also ignored.

The above commands do not specify all the possible options of classli. For any unspecified option, classli uses internal default values. For example, in steps 3 and 4 gold standard records (from gs.csv) and unseen records (from unseen.csv) are loaded respectively. There are a lot of options we can specify to tell classli how to read the two CSV files. These options include -d for delimiter (the character by which the values are separated), -c for character encoding and many more. Since these options are not specified the default values are used, in this case , as delimiter and the operating system’s default character encoding. See usage for a complete list of all the options and their default values.

The gold standard file, gs.csv, must have at least 3 columns: id, label and class. The -h option in step 3 specifies that the first row in gs.csv contains the column labels and should not be considered as part of the data. Here is an example gold standard data file:

id,label,class
1,fish,swims
2,dog,barks
3,cat,purrs

The unseen file, unseen.csv, must have at least 2 columns: id and label. Here is an example unseen data file:

id,label
1,fish
2,dog

To run the commands in batch mode, they need to be stored in a text file. In this example we assume they are stored in a file called commands.txt. The file is then passed to classli for execution:

classli -c commands.txt

The -c option enables batch command execution mode in classli, and commands.txt appearing after -c option specifies where to find the file containing the commands.

Example files:

Home | CLI