
record-classification

This project provides an automatic record classification tool.

clean Command

The clean command cleans unseen and gold standard records, typically prior to classification. There are three ways to clean records:

  1. cleaning using one or more predefined cleaners,
  2. stop words removal using a user-defined collection of stop words, and
  3. spelling correction using a user-defined dictionary.

To clean using one or more predefined cleaners, the cleaner option must be set; the example below uses -c.

One or more predefined cleaners can be specified in a single clean command. For example, the following command:

clean -c LOWER_CASE ENGLISH_STOP_WORDS PUNCTUATION

converts all the loaded record labels to lower case, removes a predefined list of English stop words from the labels, and finally removes punctuation characters from the labels.
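The effect of chaining cleaners in the order given can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function names, the tiny stop word set, and the sample label are all assumptions made for illustration, and the tool's own predefined cleaners may behave differently in detail.

```python
import string

# Hypothetical stand-in for the tool's predefined English stop word list.
ENGLISH_STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on"}


def lower_case(label):
    """Convert the label to lower case."""
    return label.lower()


def remove_english_stop_words(label):
    """Drop whitespace-separated words that appear in the stop word set."""
    return " ".join(w for w in label.split() if w not in ENGLISH_STOP_WORDS)


def remove_punctuation(label):
    """Delete ASCII punctuation characters from the label."""
    return label.translate(str.maketrans("", "", string.punctuation))


def clean(label, cleaners):
    """Apply each cleaner in turn, feeding the output of one into the next."""
    for cleaner in cleaners:
        label = cleaner(label)
    return label


cleaned = clean(
    "The Keeper of the Inn, retired.",
    [lower_case, remove_english_stop_words, remove_punctuation],
)
print(cleaned)
```

In this sketch the order matters: lower-casing first lets "The" match the lower-case stop word "the", whereas a stop word with punctuation attached (such as "the,") would survive because punctuation is removed last.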

To remove stop words using a custom list, the stop_words sub command is used. The stop_words sub command offers the following options:
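As a rough illustration of what removing a user-defined collection of stop words means, consider the Python sketch below. The function name and the case_sensitive parameter are hypothetical and are not taken from the tool's options.

```python
def remove_stop_words(label, stop_words, case_sensitive=False):
    """Return the label without any word found in the custom stop word collection.

    case_sensitive is a hypothetical parameter used only in this sketch.
    """
    if not case_sensitive:
        stop_words = {w.lower() for w in stop_words}
    kept = []
    for word in label.split():
        key = word if case_sensitive else word.lower()
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)


custom_stop_words = ["at", "home"]
print(remove_stop_words("Farmer at Home Farm", custom_stop_words))
```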

To correct the spelling of record labels using a custom dictionary, the spelling sub command is used. Spelling correction replaces words in the labels with words from the dictionary if their similarity is above a given threshold. The spelling sub command offers the following options:
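Threshold-based correction of this kind can be sketched in Python as follows. The similarity measure used by the tool is not stated here, so this sketch substitutes the ratio from Python's difflib.SequenceMatcher; the function name, the default threshold, and the sample dictionary are likewise assumptions.

```python
from difflib import SequenceMatcher


def correct_spelling(label, dictionary, threshold=0.8):
    """Replace each word with its most similar dictionary word,
    but only when the similarity is above the threshold."""
    corrected = []
    for word in label.split():
        best_word, best_score = word, 0.0
        for candidate in dictionary:
            # Assumed similarity measure: difflib's ratio in [0, 1].
            score = SequenceMatcher(None, word, candidate).ratio()
            if score > best_score:
                best_word, best_score = candidate, score
        corrected.append(best_word if best_score > threshold else word)
    return " ".join(corrected)


dictionary = ["carpenter", "joiner"]
print(correct_spelling("carpentr and joinr", dictionary))
```

A higher threshold makes the correction more conservative: words are left unchanged unless they are very close to a dictionary entry.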
