Knowledge acquisition using lexical and semantic ontology

In developing a compliance application based on the institutional review board policies of Johns Hopkins’ Dept. of Medicine, we have to clarify the following sentence:

  • Projects involving drugs or medical devices other than the use of an approved drug or medical device in the course of medical practice and projects whose data will be submitted to or held for inspection by the FDA will not be exempt from JHM IRB review UNLESS that use falls within the Emergency Use provisions of 21 CFR 56.102 (d).

As you can see, there are a number of compound words and acronyms, as well as a reference to the Code of Federal Regulations, that need to be defined or recognized to understand this sentence. The acronyms include the following:

  1. FDA
  2. JHM
  3. IRB
  4. CFR
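As a rough illustration of how such acronyms might be spotted automatically, here is a minimal sketch in Python. It is not part of the Linguist; the function name and the small list of emphasis words are assumptions made for this example. It simply collects all-caps tokens, skipping ordinary words that happen to be capitalized for emphasis (such as “UNLESS” in the sentence above).

```python
import re

SENTENCE = (
    "Projects involving drugs or medical devices other than the use of an "
    "approved drug or medical device in the course of medical practice and "
    "projects whose data will be submitted to or held for inspection by the "
    "FDA will not be exempt from JHM IRB review UNLESS that use falls within "
    "the Emergency Use provisions of 21 CFR 56.102 (d)."
)

# Hypothetical list of ordinary words that may be written in capitals for emphasis.
EMPHASIS_WORDS = {"UNLESS"}


def acronym_candidates(text):
    """Return all-caps tokens of two or more letters, in order of first appearance."""
    seen = []
    for token in re.findall(r"\b[A-Z]{2,}\b", text):
        if token not in EMPHASIS_WORDS and token not in seen:
            seen.append(token)
    return seen


print(acronym_candidates(SENTENCE))  # ['FDA', 'JHM', 'IRB', 'CFR']
```

A heuristic like this only proposes candidates; deciding what each acronym stands for still requires the kind of dialog described below.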

In addition, from a semantics standpoint, it is preferable to have ontological concepts that are more specific than the heads of the following compounds:

  1. medical device
  2. medical practice
  3. emergency use

And, perhaps, it is appropriate to have concepts for:

  1. approved drug
  2. JHM IRB

Ideally, perhaps, there would even be a concept for:

  • the Emergency Use provisions of 21 CFR 56.102 (d)

although we’ll skip some of these details to save our readers the tedium.

In the English Resource Grammar (ERG), which is one of the natural language processing systems that we use, there is already a lexical entry for “FDA”, defined as follows:

  • fda_n1 := n_-_pn_le & [ ORTH < "FDA" >, SYNSEM [ LKEYS.KEYREL.CARG "FDA", PHON.ONSET voc ] ].

From our standpoint, this is impoverished in that it will produce logical axioms using a predicate like “FDA” instead of “Food(and(Drug))(Administration)”; the latter avoids the ambiguity of anything else “FDA” might stand for. Even if the existing definition of FDA were acceptable, however, the other acronyms and compounds mentioned above do not exist in the ERG. To address this, the Linguist provides for the definition of additional words, as discussed here.

In general, unknown words are common in any new domain.  For example, in Project Sherlock, we had to define a few hundred biological terms that occurred in Campbell’s Biology but which were not defined in the ERG, such as:

  • intramolecular
  • semipermeable
  • hydrophilicity
  • HDL
  • acidity

and so on. As sentences that contain unknown words are processed by the Linguist, a dialog allows part-of-speech information to be entered so that the natural language processing produces better (i.e., more focused) results. In the case of this sentence, right-clicking on “FDA” brings up the part-of-speech dialog, which is filled out below:

For the most part, this suffices to produce logic with terms corresponding to the words (actually, the morphological lemmas or stems of the words). To define a more precise sense of the word, the dialog continues with an additional form that acquires more information about the singular proper noun “FDA”:

Here we’ve clarified that “FDA” is an acronym (not just an abbreviation) and that it is voiced (i.e., that we would use “an” rather than “a” before “FDA spokesperson”). We’ve also entered the unabbreviated form of the acronym and a word sense from WordNet.
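To make the “a” versus “an” distinction concrete, here is a minimal sketch in Python of how the onset of an acronym could be guessed. It is an illustration only, not the Linguist's logic: it assumes the acronym is read letter by letter, that “H” is pronounced “aitch”, and the function names are made up for this example. The values “voc” and “con” mirror the PHON.ONSET values in the lexical entries.

```python
# Letters whose spoken names begin with a vowel sound ("ef", "aitch", "em", ...).
# "U" is excluded because its name starts with a consonant sound ("you").
VOWEL_ONSET_LETTERS = set("AEFHILMNORSX")


def onset(acronym):
    """Return 'voc' or 'con' for an acronym that is read letter by letter."""
    return "voc" if acronym[0].upper() in VOWEL_ONSET_LETTERS else "con"


def indefinite_article(acronym):
    """Pick 'an' for a vocalic onset and 'a' for a consonantal one."""
    return "an" if onset(acronym) == "voc" else "a"


for term in ("FDA", "IRB", "JHM", "CFR"):
    print(indefinite_article(term), term)
```

Acronyms that are pronounced as words rather than spelled out would need a different treatment, which is one reason the dialog asks rather than guesses.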

On completing this dialog, the system wants to define the term “Food and Drug Administration”, which it knows is nominal, so it presents another form, as follows:

On completing this form, the system adds a lexical entry to the ERG (roughly) as follows:

  • Food+and+Drug+Administration_NNP_food_and_drug_administration%1:14:00:: := n_-_pn_le & [ ORTH < "Food", "and", "Drug", "Administration" >, SYNSEM [ LKEYS.KEYREL.CARG "Food and Drug Administration", PHON.ONSET con ] ].

In addition, the software maintains a lexical ontology that represents the more precise word sense and ontological information of this lexical entry and its predicate.  After completing this, the prior form is updated to reflect the more precise word sense, too:

This results in the following lexical entry being added to the ERG’s vocabulary, along with a variety of ontological information:

  • FDA_NNP_food_and_drug_administration%1:14:00:: := n_-_pn_le & [ ORTH < "FDA" >, SYNSEM [ LKEYS.KEYREL.CARG "Food and Drug Administration", PHON.ONSET voc ] ].
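The two entries for the full form and the acronym differ only in their name, orthography, and onset; they share the same CARG and word sense. A minimal sketch in Python of how such a pair could be generated from the dialog's answers follows. The function and its parameters are hypothetical and only reproduce the shape of the entries shown above; this is not the Linguist's actual code.

```python
SENSE_KEY = "food_and_drug_administration%1:14:00::"


def proper_noun_entry(orth_tokens, carg, onset, sense_key):
    """Format a TDL proper-noun entry in the shape of the entries shown above."""
    name = "+".join(orth_tokens) + "_NNP_" + sense_key
    orth = ", ".join(f'"{token}"' for token in orth_tokens)
    return (
        f"{name} := n_-_pn_le & [ ORTH < {orth} >, "
        f'SYNSEM [ LKEYS.KEYREL.CARG "{carg}", PHON.ONSET {onset} ] ].'
    )


full_form = proper_noun_entry(
    ["Food", "and", "Drug", "Administration"],
    "Food and Drug Administration",
    "con",
    SENSE_KEY,
)
acronym = proper_noun_entry(["FDA"], "Food and Drug Administration", "voc", SENSE_KEY)

print(full_form)
print(acronym)
```

Because both entries carry the same CARG, sentences that mention “FDA” and sentences that spell the name out yield the same constant in the resulting logic.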

Some of the other cases are more fun or interesting, such as IRB:

Which we clarify as being an acronym for “institutional review board”.  Note that we think an IRB is a common noun.

As was the case for “Food and Drug Administration” above, it is desirable to define the compound noun “institutional review board” more completely, as in:

Which leads to further clarification of the head and non-head of the compound.

Which leads to further dialog regarding “board” and “review”.  In the case of “board”, we can select a particular sense of the noun as follows:

This is a good thing to do in the case of defining “review board”, but the following shows that other lexical entries could have been selected:

This dialog also gives you some idea of additional ontological information known about various lexical entries.

Continuing with “review board”, the dialog allows the sense of “review” to be specified, as in:

Other senses of review are also available:

Having completed the senses of “review” and “board” within “review board”, we now have:

and an ERG entry with sense and ontological connections which will support logical and semantic interpretation of review boards in general (i.e., whether or not they are institutional).

  • review+board_NN := n_-_c_le & [ ORTH < "review", "board" >, SYNSEM [ LKEYS.KEYREL.PRED "_review+board_n_rel", PHON.ONSET con ] ].
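The point of defining “review board” separately is that reasoning about review boards in general also applies to more specific kinds. A minimal sketch in Python of that subsumption idea follows; the tiny hierarchy and the function names are assumptions for illustration, on the premise that a compound is a more specific concept than its head. The lexical ontology itself is far richer than this.

```python
# Hypothetical is-a links: each compound is treated as a subtype of its head.
IS_A = {
    "institutional review board": "review board",
    "review board": "board",
}


def ancestors(concept):
    """Follow is-a links upward and return the more general concepts in order."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def is_a(concept, candidate):
    """True if `concept` equals `candidate` or is a subtype of it."""
    return concept == candidate or candidate in ancestors(concept)


print(ancestors("institutional review board"))  # ['review board', 'board']
print(is_a("institutional review board", "review board"))  # True
print(is_a("review board", "institutional review board"))  # False
```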

This returns the dialog to “institutional review board” which offers the opportunity to clarify the semantics of “institutional”, as follows:

which is derived from the noun “institution” which can be further clarified, as in:

After saying OK to the dialogs for “institution”, “institutional”, “institutional review board”, and “IRB”, the ontology is updated and the following is added to the ERG:

  • institutional+review+board_NN := n_-_c_le & [ ORTH < "institutional", "review", "board" >, SYNSEM [ LKEYS.KEYREL.PRED "_institutional+review+board_n_rel", PHON.ONSET voc ] ].
  • IRB_NN := n_-_c_le & [ ORTH < "IRB" >, SYNSEM [ LKEYS.KEYREL.PRED "_institutional+review+board_n_rel", PHON.ONSET voc ] ].
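Note that the acronym and the spelled-out compound share one predicate, while “review board” keeps its own. The following minimal sketch in Python shows the effect with a toy lexicon and a greedy longest-match lookup. Both the lookup strategy and the function name are assumptions for illustration; this is not how the ERG's parser selects lexical entries.

```python
# Toy lexicon: orthography (as a tuple of tokens) mapped to the predicate it introduces.
LEXICON = {
    ("IRB",): "_institutional+review+board_n_rel",
    ("institutional", "review", "board"): "_institutional+review+board_n_rel",
    ("review", "board"): "_review+board_n_rel",
}


def match_entries(tokens):
    """Greedy longest-match lookup; returns (surface form, predicate) pairs."""
    longest = max(len(key) for key in LEXICON)
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                matches.append((" ".join(key), LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on to the next token
    return matches


print(match_entries("the IRB is an institutional review board".split()))
print(match_entries("a review board met".split()))
```

In the first sentence both surface forms resolve to the same predicate, so the logic treats “IRB” and “institutional review board” identically.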

Hopefully, this gives you an idea of how the lexicon can be extended to deal with unrecognized (i.e., new) vocabulary, including acronyms, compounds, and other parts of speech, along with precise word sense and ontological information.

There is more depth to this than meets the eye: there are additional capabilities to define senses as sentences of description logic, and the senses of WordNet are organized into “synonym sets” that are cross-referenced with widely available ontologies such as YAGO and OpenCyc.
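The word senses embedded in the entry names above, such as food_and_drug_administration%1:14:00::, follow WordNet's sense key layout (lemma%ss_type:lex_filenum:lex_id:head_word:head_id). Here is a minimal sketch in Python that takes such a key apart. The class and function names are our own for this example, and it makes no attempt to look the sense up in WordNet itself.

```python
from typing import NamedTuple

# Synset type codes used in WordNet sense keys.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    lemma: str
    ss_type: str
    lex_filenum: int  # lexicographer file number; 14 is noun.group
    lex_id: int
    head_word: str  # only filled in for adjective satellites
    head_id: str


def parse_sense_key(key):
    """Split a WordNet sense key into its named fields."""
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(
        lemma, SS_TYPES[int(ss_type)], int(lex_filenum), int(lex_id), head_word, head_id
    )


print(parse_sense_key("food_and_drug_administration%1:14:00::"))
```

For the FDA entry this yields a noun sense in the group category, which is consistent with treating the FDA as an organization.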