Simple, Fast, Effective, Active Learning

Recently, we “read” ten thousand recipes or so from a cooking web site.  The purpose of doing so was to produce a formal representation of those recipes for use in temporal reasoning by a robot.

Our task was to produce an ontology by reading the recipes, subject to two conflicting goals.  On the one hand, the ontology had to be accurate enough that the robot could reason, plan, and answer questions robustly.  On the other hand, it had to be produced automatically (i.e., with minimal human effort).[1]

In order to minimize human effort while still obtaining deep parses from which to produce the ontology, we used more techniques from statistical natural language processing than we typically do in knowledge acquisition for deep QA, compliance, or policy automation.  (Consider that statistical NLP typically achieves less than 90% syntactic accuracy, while such work demands near 100% semantic accuracy.)[2]

In the effort, we refined some prior work on representing words as vectors and on semi-supervised learning.  In particular, we adapted semi-supervised active learning, similar to Stratos & Collins 2015, using enhancements to the canonical correlation analysis (CCA) of Dhillon et al 2015, to obtain accurate part of speech tagging, as conveyed in a graphic from Stratos & Collins 2015.

Parts of Speech

In many NLP pipelines, a “parser” receives its input from the output of a part of speech “tagger”.  Most typically in such pipelines, the tagger outputs one part of speech tag for each token (e.g., word) in the input.

We typically permit all the possible parts of speech to be considered during parsing rather than pruning some parts of speech during tagging.  This allows us to find the right parse for many sentences that would otherwise be missed.  For example, if a part of speech tagger selects the right tag 95% of the time, 1 out of 20 words will have the wrong tag.  For text that contains an average of 15 or more words per sentence, relying on such tagging results in parsing failures for too many sentences.
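To make the arithmetic concrete, a back-of-the-envelope sketch (assuming, simplistically, that per-token errors are independent):

```python
# Chance that every token in a sentence is tagged correctly, assuming
# (simplistically) independent per-token errors.
per_token_accuracy = 0.95
words_per_sentence = 15

whole_sentence_accuracy = per_token_accuracy ** words_per_sentence
print(f"{whole_sentence_accuracy:.1%}")  # roughly 46%
```

In other words, committing to a single tag per token before parsing would leave more than half of typical sentences with at least one wrong tag.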

Our client accepted that a more automated approach could not rival curated parses.  But we agreed to minimize the use of human effort in an initial, phase I approach.  Thus, we needed to estimate the probabilities of parts of speech for a domain-specific vocabulary (i.e., for cooking).

The typical part of speech tagger uses a hidden Markov model (HMM) trained over a large corpus of manually tagged text.  Our task constraints precluded the human effort to tag much of the recipe text, however.
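For contrast, a minimal supervised HMM tagger (here via NLTK and its sample of the Penn Treebank; not the approach we took) looks like the following.  The point is that it presumes a sizable, manually tagged corpus up front, which is exactly the effort our constraints ruled out.

```python
# A conventional supervised HMM tagger, for contrast: it presumes a
# manually tagged corpus up front (here, NLTK's Penn Treebank sample).
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank", quiet=True)

train = treebank.tagged_sents()  # a few thousand hand-tagged sentences
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

print(tagger.tag("Whisk the eggs until fluffy .".split()))
```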

We needed a method that minimized the investment required for such “supervised” learning…

Simple Semi-Supervised POS Tagging

Stratos and Collins’ MINITAGGER demonstrated the use of active learning to achieve state of the art accuracy while requiring an astonishingly small fraction of the human tagging effort of other approaches.  There are a few reasons for this impressive result, especially the use of fairly low-dimensional word embeddings with an efficient support vector machine (SVM).

Distributed Representations of Words

There is a lot of literature on distributed representations of words.  Typically, such representations are points in a high-dimensional space in which “similar” words are “close”.  Such representations include Word2Vec and GloVe.  The former vectors are obtained using neural networks, while the latter are derived from co-occurrence matrices (from which GloVe derives the co-occurrence probabilities that are the basis of its algorithm).
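“Close” here is usually measured by cosine similarity.  A minimal sketch, assuming pre-trained vectors have already been loaded into a dictionary of numpy arrays (the example words in the comment are illustrative):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# 'vectors' is assumed: a dict mapping word -> embedding, e.g., loaded from a
# GloVe text file.  In a well-trained space we would expect, say,
#   cosine_similarity(vectors["simmer"], vectors["boil"])
# to be noticeably higher than
#   cosine_similarity(vectors["simmer"], vectors["spatula"])
```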

In some ways, the distributed representation of words is similar to the use of latent semantic analysis (LSA) to represent the meaning of documents as vectors.  LSA, in turn, is similar to principal components analysis (PCA), except that LSA uses term/document frequencies rather than term co-variance.  Both use singular value decomposition (SVD) which is also used in canonical correlation analysis (CCA).
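As a miniature illustration of that LSA connection (scikit-learn here is just for exposition; it is not what our pipeline used):

```python
# LSA in miniature: a truncated SVD over a term/document count matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "whisk the eggs and fold in the flour",
    "beat the eggs then sift the flour",
    "simmer the stock and reduce the sauce",
]

counts = CountVectorizer().fit_transform(docs)                    # term/document frequencies
doc_vectors = TruncatedSVD(n_components=2).fit_transform(counts)  # low-rank document space
print(doc_vectors.shape)  # (3, 2)
```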

Canonical Correlation Analysis (CCA)

CCA determines relationships between two sets of variables without considering either set as dependent or independent, per se. In the case of word embeddings, the vector representations of words comprise one set of variables and the contexts in which those words occur comprise the other.
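Before turning to how contexts are defined, here is a toy sketch of the two-view setup using scikit-learn’s CCA.  (The real pipelines in Stratos & Collins and Dhillon et al use fast, sparse, SVD-based variants over count matrices; this toy does not attempt to reproduce them, and the matrices below are random placeholders.)

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# View 1: features of the words themselves; View 2: features of their contexts.
# In practice both views are large, sparse count matrices; these are placeholders.
word_view = rng.random((1000, 50))
context_view = rng.random((1000, 80))

cca = CCA(n_components=10)
word_proj, context_proj = cca.fit_transform(word_view, context_view)
print(word_proj.shape, context_proj.shape)  # (1000, 10) (1000, 10)
```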

How we define the context of a word can vary.  A context of the immediately preceding word would correspond to a bigram model, one word to each side would correspond to a trigram model, etc.  We stuck with a window size of 5, as in the Stratos & Collins and Dhillon et al papers.  Longer window sizes do not improve performance on syntactic tasks like part of speech tagging but do provide better representation of semantics for tasks like word sense disambiguation, as suggested in Pennington et al 2014.[3]

Computing the co-occurrence matrices for a vocabulary of V words requires on the order of V³ cells.  For our target of at least 300,000 “words” in a multi-giga-word corpus, that’s a large number![4] We adapted Dhillon’s spectral learning toolkit (swell) and its use of the C++ linear algebra package Eigen to get past some memory issues, and we produced various sets of word vectors with dimensionality from 50, as used in Stratos & Collins, to 300, as in GloVe and other representations.[5]
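We cannot reproduce swell here, but a rough, memory-conscious sketch of the general computation (sparse word/context co-occurrence counts followed by a truncated SVD) looks something like the following; the corpus, window size, and dimensionality are illustrative only.

```python
# Sketch: sparse word/context co-occurrence counts followed by a truncated SVD.
# This illustrates the general computation only; it is not Dhillon et al's swell toolkit.
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.extmath import randomized_svd

def cooccurrence(sentences, window=5):
    """Build a sparse word-by-context count matrix from tokenized sentences."""
    vocab = {}
    counts = defaultdict(float)
    for sent in sentences:
        ids = [vocab.setdefault(w, len(vocab)) for w in sent]
        for i, w in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[(w, ids[j])] += 1.0
    rows, cols = zip(*counts)
    data = list(counts.values())
    return csr_matrix((data, (rows, cols)), shape=(len(vocab), len(vocab))), vocab

sentences = [["whisk", "the", "eggs", "until", "fluffy"],
             ["beat", "the", "eggs", "and", "the", "sugar"]]
matrix, vocab = cooccurrence(sentences, window=5)

# Low-dimensional word vectors from the count matrix.
U, S, Vt = randomized_svd(matrix, n_components=5, random_state=0)
word_vectors = U * np.sqrt(S)   # one common scaling choice
print(word_vectors.shape)       # (len(vocab), 5)
```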

We evaluated the higher-dimensional vectors subjectively and found them to be state of the art, as expected given Dhillon et al 2015.  We used the lower-dimensional vectors to train a support vector machine (SVM), roughly as in MINITAGGER and as described below.

We also experimented with a variety of techniques to improve the resulting word vectors, including tokenization that recognizes several dozen different kinds of tokens (e.g., numbers or identifiers of various types, dates, times) and careful consideration of case rather than ubiquitous case folding.  These adjustments to the co-occurrence matrices resulted in subjectively superior word vectors at smaller dimensionality or when given a smaller corpus.  (The selected corpus has only a few million words for 10,000 recipes.)
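A sketch of the kind of token normalization we mean follows; the specific type names and patterns are illustrative, not our production set.

```python
import re

# Map raw tokens to coarse token types before counting co-occurrences.
# The classes and patterns below are illustrative only.
TOKEN_PATTERNS = [
    ("CARDINAL", re.compile(r"^\d+$")),
    ("FRACTION", re.compile(r"^\d+/\d+$")),
    ("QUANTITY", re.compile(r"^\d+(\.\d+)?(g|kg|ml|l|oz|lb)$", re.IGNORECASE)),
    ("TIME",     re.compile(r"^\d{1,2}:\d{2}$")),
    ("ORDINAL",  re.compile(r"^\d+(st|nd|rd|th)$", re.IGNORECASE)),
]

def token_type(token: str) -> str:
    for name, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return name
    # Preserve case information instead of folding everything to lower case.
    if token.isupper():
        return "ALLCAPS:" + token.lower()
    if token[:1].isupper():
        return "CAP:" + token.lower()
    return token

print([token_type(t) for t in ["Whisk", "2", "1/2", "10:30", "eggs"]])
# ['CAP:whisk', 'CARDINAL', 'FRACTION', 'TIME', 'eggs']
```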

Support Vector Machine (SVM)

Using the libSVM package (from a Java wrapper, in our case) and the following features, most of which are as in MINITAGGER, we enjoyed the performance of the resulting active learning system, which would be impractical with less rapidly trainable alternatives, as suggested in Stratos & Collins 2015.  (A rough sketch of the feature assembly and training follows the list.)

  • One of several case features for the focal word
  • One of several token type features for the focal word
  • Prefixes and suffixes up to length 4 of the focal word
  • 50-dimensional (eigen) word vectors for each of the 5 words in the window
  • Optional part of speech features for each of the 5 words in the window
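Below is a rough sketch of how such a per-token feature vector might be assembled and handed to an SVM, here using scikit-learn’s SVC (which wraps libSVM).  The discrete features (case, token type, affixes) are omitted to keep the sketch short, and the words, tags, and vectors are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

EMBEDDING_DIM = 50  # dimensionality of the word vectors used as tagging features

def token_features(tokens, i, vectors):
    """Numeric features for the token at position i: a 5-word window (i-2 .. i+2)
    of word vectors, zero-padded at sentence boundaries.  'vectors' maps
    word -> EMBEDDING_DIM-dimensional embedding."""
    window = []
    for j in range(i - 2, i + 3):
        if 0 <= j < len(tokens):
            window.append(vectors.get(tokens[j], np.zeros(EMBEDDING_DIM)))
        else:
            window.append(np.zeros(EMBEDDING_DIM))  # padding outside the sentence
    return np.concatenate(window)  # 5 * 50 = 250 numeric features

# Tiny illustration with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
sentence = "whisk the eggs until fluffy".split()
tags = ["VB", "DT", "NNS", "IN", "JJ"]
vectors = {w: rng.random(EMBEDDING_DIM) for w in sentence}

X = np.array([token_features(sentence, i, vectors) for i in range(len(sentence))])
clf = SVC(kernel="linear").fit(X, tags)
print(clf.predict(X))
```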

So now we can take a modest corpus in a new domain and train an excellent part of speech tagger very easily.  And we can improve its performance as much as needed using active training.
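The active training loop itself is conceptually simple: tag everything, ask a human only about the examples the model is least sure of, retrain, and repeat.  A hedged sketch using uncertainty sampling over the SVM’s decision margins follows; ask_human is a hypothetical callback standing in for the human annotation step.

```python
import numpy as np
from sklearn.svm import SVC

def least_confident(clf, X, k=50):
    """Indices of the k examples with the smallest margin between the top two classes."""
    scores = clf.decision_function(X)
    if scores.ndim == 1:                          # binary case: distance to the hyperplane
        margins = np.abs(scores)
    else:                                         # multiclass: one column of scores per class
        top_two = np.sort(scores, axis=1)[:, -2:]
        margins = top_two[:, 1] - top_two[:, 0]
    return np.argsort(margins)[:k]

def active_training(X_labeled, y_labeled, X_pool, ask_human, rounds=10, k=50):
    """Uncertainty-sampling loop: retrain, query the least confident examples, repeat."""
    clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    for _ in range(rounds):
        if len(X_pool) == 0:
            break
        picked = least_confident(clf, X_pool, k)
        new_labels = ask_human(picked)            # a human tags only these examples
        X_labeled = np.vstack([X_labeled, X_pool[picked]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, picked, axis=0)
        clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    return clf
```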

Further Reading

For more background, see our recent post, Of Kalman Filtering and Hidden Markov Models, especially the comments by authors from BBN on the need for such active learning in some contexts.

Also consider the following from Dhillon et al 2015:

  • The most classic and successful algorithm for learning word embeddings is Latent Semantic Analysis (LSA), which works by performing SVD on the word by document matrix.
  • Unfortunately, the state-of-the-art embedding methods suffer from a number of short-comings:
    • They are slow to train (especially, the Deep Learning based approaches).
    • They are sensitive to the scaling of the embeddings (e.g., LSA/PCA)

And the following from Rastogi et al 2015 (poster):

  • LSA is an application of PCA to a single term-document co-occurrence matrix. CCA learns linear projections that are maximally correlated to each other from two views, Generalized CCA is a family of extensions of CCA to maximize correlation across multiple views.
  • Multiview LSA is a way of utilizing hundreds of data sources to learn representations for millions of words/phrases that outperform baselines like Word2Vec and Glove.

[1] There were other challenges, one of which was the unique vocabulary of cooking, including various words and phrases from multiple languages.  Another was the lack of grammatical structure in the web site content.  The content was much less formal than Wikipedia and not much more formal than Twitter.  It was rife with spelling errors, odd formatting, stray punctuation, and little or no discernible grammar.

[2] Toutanova et al 2003 describe the Stanford POS tagger as follows: “… a part-of-speech tagger with a per-position tag accuracy of 97.24%, and a whole-sentence correct rate of 56.34% on Penn Treebank WSJ data.  This is the best automatically learned part-of-speech tagging result known to us…”

[3] The GloVe paper demonstrated that an asymmetric window (i.e., shifted “left”) does not work as well for syntactic tasks, such as part of speech tagging (as seems intuitive and as suggested in Toutanova et al 2003).

[4] The cooking corpus had only a few million words, so a much smaller “vocabulary” was fine there.

[5] MINITAGGER is written in Python and uses MATLAB which we needed to replace for commercial reasons.