Iterative Disambiguation – Commercial Intelligence

In a prior post we showed how extraordinarily ambiguous, long sentences can be precisely interpreted. Here, by request, we take a simpler look.

Let’s take a sentence that has more than 10 parses and configure the software to disambiguate among no more than 10.

Once again, this is a trivial sentence to disambiguate in seconds without iterative parsing!

The immediate results might present:

Suppose the intent is not that the telescope is with my friend, so veto “telescope with my friend” with a right-click.

You can also affirm that “with my friend” is a complete prepositional phrase within the intent.

You can also affirm that “with the telescope” is a complete prepositional phrase within the intent.

Suppose the intent is not about a park with a telescope, so veto “park with a telescope” with a right-click.

This leaves only a single derivation, as shown below.

Now suppose that we were in the park but the man was not. If we veto “man in the park” we will have vetoed the last parse. (Try it, if you wish. Then press the reset button on the toolbar.)

Instead we can affirm that “in the park” is a complete prepositional phrase within the intent and reparse using the magnifying glass button on the toolbar (all of which is documented with tool-tips displayed upon hovering and elsewhere at http://linguist.haleyai.com).

In a default configuration, reparsing will ask if we want to use the parts of speech implied by our disambiguation thus far, such as follows:

The lack of a check on “saw” is beyond the intent of this article, but feel free to select it. Doing so has no practical effect in this case.

Next, because we have affirmed some parse tree structure, we can use that structure to constrain the parses we obtain on the next iteration.

This “Bracketing Dialog” simply reflects the constraints on parse tree structure that we have specified.

Any fragment of text can be constrained to remain contiguous during parsing using this dialog. But be careful with it! You can easily choose constraints that are not grammatical, in which case you will find no results!
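The effect of such constraints can be sketched with a toy chart parser. The Python below is only an illustration under stated assumptions: the tiny grammar, the lexicon, and the function names are invented for this example and are not the Linguist's grammar or API. It counts the derivations of a sentence and, when given bracketed fragments, skips every span that would cross a bracket, so each fragment has to stay together as a unit.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "Det", "N"), ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
# Hypothetical lexicon with one part of speech per word.
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "park": ["N"], "telescope": ["N"],
    "in": ["P"], "with": ["P"],
}

def find_span(words, phrase):
    """Return (start, end) of the first occurrence of phrase in words."""
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return (i, i + len(target))
    raise ValueError(f"phrase not found: {phrase!r}")

def crosses(span, bracket):
    """True if the two spans overlap without one containing the other."""
    (s, e), (a, b) = span, bracket
    return s < a < e < b or a < s < b < e

def count_parses(sentence, brackets=()):
    """Count derivations, keeping each bracketed fragment contiguous."""
    words = sentence.lower().split()
    required = [find_span(words, phrase) for phrase in brackets]
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of derivations
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1, tag] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            if any(crosses((start, end), b) for b in required):
                continue  # this span would split a bracketed fragment
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, "S"]

sentence = "I saw the man in the park with the telescope"
print(count_parses(sentence))
print(count_parses(sentence, ["the man in the park"]))
print(count_parses(sentence, ["man in"]))
```

Under this toy grammar the unconstrained sentence has 5 derivations, bracketing “the man in the park” leaves 2, and bracketing the ungrammatical fragment “man in” leaves none.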

This is no problem really, since you can just reparse again and you will have no such constraints (but you can specify some even with no parses, as you will see if you try…)

In most cases, this dialog is used to continue refining a parse or to start over, perhaps because of an error or uncertainty beforehand.

To use what you’ve specified by affirming parse tree structure, press the OK button. To proceed to reparse without constraining parse structure, toggle the selections and then press OK. The cancel button aborts the process of reparsing!

At this point, you will receive this dialog.

To get to our desired intent, veto “man in the park” to arrive at:

There is a little more grammatical disambiguation that can be done on the clauses tab. For example, whether “saw” is used in past or present tense!

Disambiguating Difficult Cases

The general guidelines for disambiguation have several aspects:

· reducing the degree of grammaticality will result in higher degrees of ambiguity (i.e., more time to parse and more resulting parses)

· the ambiguity of a sentence goes up significantly faster than its length

· certain words are highly ambiguous, and their ambiguities may multiply in the worst case

KnowBuddy™ and the Linguist™ address each of these so that correct parses can be obtained without extraordinary effort for longer sentences than most systems can handle.

Degree of Grammaticality

The parsing tab of the settings dialog allows for setting the degree of grammaticality used in parsing.

· parsing formal text results in the lowest number of parses

· parsing informal text introduces more ambiguity

· parsing fragmented text increases ambiguity significantly

· parsing robustly results in the highest degree of ambiguity

By ambiguity we are referring to the number of derivations (parses) that may be found for a piece of text.

Generally, formal parsing should be used first, followed by parsing of fragments as necessary. In some cases, however, robust parsing is necessary.

When there is too much Ambiguity

The parsing tab of the settings dialog may be used to adjust the amount of time parsing may take as well as the maximum number of parses that may be obtained.

Generally, these numbers should be in balance and reflect the degree of grammaticality with which parsing is being performed.

For example, to quickly work through relatively short, grammatical text from textbooks, parsing with formal grammaticality in less than 30 seconds with up to 100 derivations should be fine. For less grammatical or longer sentences, 30 seconds for 200 to 400 parses may be appropriate. For longer sentences, a minute or more for perhaps 800 parses may seem necessary.

There are facilities in the Linguist that allow even loose grammaticality of long sentences to result in precise parsing, however.

When there are no parses

It is not unusual to find no parses for a sentence.

In such cases, at least one of a few possibilities must be the case:

· the sentence is not grammatical, or some word or expression is used in a way that the parser does not handle

· there is too much ambiguity and parsing restrictions are set too low

o e.g., the parser “times out” after finding a great many partial derivations without yet finding a full parse

o e.g., the parser reports “out of memory” or “edge limit exceeded” before it finds any parses

· there are unknown words in the sentence and the parser setting to clarify unknown words has been turned off

o note that applications may allow parsing to guess the part of speech of unknown words

o also note that applications may provide part of speech information for known or unknown words

· there are some words in the sentence that have 2 or more parts of speech

The following sections address all the above except for unknown words, which are addressed elsewhere.

Impact of Ambiguity

This section addresses 2 situations:

1. natural language processing is returning no parses due to exceeding one of the limits specified in the parsing tab of the settings dialog

2. natural language processing is returning no parses without exceeding any of those limits

In the 2nd case, there is either an unknown word or the sentence is not grammatical given the “Degree of Grammaticality” settings discussed above.

The 1st case above can arise for the combined reasons of too much ambiguity given the parser settings and a lack of grammaticality.

By addressing too much ambiguity as discussed here, the latter case, if applicable, becomes evident.

Sources of Ambiguity

As discussed in the section “Degree of Grammaticality” above, ambiguity increases as the degree of grammaticality is relaxed from formal to informal or fragmented text.

Even parsing formal text, however, will not avoid the problem of obtaining no parses for some sentences.

In such cases, some combination of the following must be the case:

· the sentence is relatively long given the parser settings

· grammatical ambiguity is high given the parser settings

· lexical ambiguity is causing problems

Lexical ambiguity is easily handled as discussed below and elsewhere (e.g., with regard to parts of speech disambiguation).

For any degree of grammaticality, the ambiguity of a sentence increases more quickly than its length. Thus, the first two bullets above are not independent but directly related.

Understanding Ambiguity

For any given sentence, the number of possible parses increases faster than its length. The reason for this is that the parser does not understand the text as a person might.

Parsing considers plausible interpretations of words, phrases, and clauses and their composition into compound phrases and sentences. This may result in combinatorial explosions.
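One way to see the explosion is the classic textbook count for a verb, its object, and a chain of k trailing prepositional phrases: under a simple binary grammar, the number of possible attachments follows the Catalan numbers. The sketch below assumes that idealized grammar, not the Linguist's actual grammar, so the figures are indicative only. The function names are made up for this illustration.

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def pp_attachment_parses(k: int) -> int:
    """Parses of 'verb object PP...PP' with k trailing prepositional
    phrases under an idealized binary grammar (an assumption)."""
    return catalan(k + 1)

for k in range(7):
    print(f"{k} prepositional phrase(s): {pp_attachment_parses(k)} parse(s)")
```

Each added phrase lengthens the sentence by a few words, while the count grows from 1 to 2, 5, 14, 42, 132, and 429.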

Statistics help order parsing activity and resulting parses in a way that usually makes more sense to human beings who have common sense, but the parser itself knows nothing about meaning.

So, when the parser sees “CAN”, it must consider the possibility of that word being a noun for a metal container, a modal verb for ability or possibility, or an abbreviation for Canada.

· N possibilities for a single word in the sentence can, in the worst case, increase ambiguity by multiplication with N (although the impact is typically lower due to grammatical constraints).

· Note that applications can constrain the parts of speech considered in a variety of ways, including part of speech tagging and limiting the lexicon (e.g., not including “CAN” as an abbreviation).
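The worst-case multiplication can be made concrete with a few lines of Python. This is a rough sketch: apart from the three readings of “can” mentioned above, the per-word counts are invented for illustration, and the function name is hypothetical. As noted, grammatical constraints usually keep the real impact well below this bound.

```python
from math import prod

def worst_case_multiplier(readings_per_word):
    """Upper bound on the lexical ambiguity factor: the product of the
    number of readings of each word."""
    return prod(readings_per_word.values())

# Illustrative counts only; "can" has the three readings named in the text.
full_lexicon = {"I": 2, "can": 3, "mean": 3, "one": 3}
# The same words after dropping "CAN" as an abbreviation from the lexicon.
pruned_lexicon = {**full_lexicon, "can": 2}

print(worst_case_multiplier(full_lexicon))    # 2 * 3 * 3 * 3
print(worst_case_multiplier(pruned_lexicon))  # 2 * 2 * 3 * 3
```

Removing a single reading of a single word cuts the bound by a third here, which is why pruning the lexicon pays off so quickly.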

Such ambiguity is “lexical” in that it has to do with words, not syntax or grammar. The following sources are more specifically grammatical.

When the parser comes across a prepositional phrase, it may be plausible for that prepositional phrase to complement any phrase that came before it.

· “I saw the man” has some lexical ambiguity but no grammatical ambiguity. (What kind of saw are you using?)

· “I saw the man in the park” has some grammatical ambiguity. (Were you in the park when you saw him or was the man in the park?)

· “I saw the man in the park with the telescope” has 4 times as much ambiguity. (The parser does not know that you do not saw with a telescope!)
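To see where the alternative attachments come from, the three sentences above can be run through a toy parser that lists every derivation. This is a minimal sketch: the grammar and lexicon are invented for the example and give each word a single part of speech, so only grammatical ambiguity shows up. The exact counts depend on the grammar and need not match what the Linguist reports.

```python
from functools import lru_cache

# Hypothetical toy grammar: each symbol maps to its binary expansions.
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}
# Hypothetical lexicon: one part of speech per word.
LEXICON = {
    "i": {"NP"}, "saw": {"V"}, "the": {"Det"},
    "man": {"N"}, "park": {"N"}, "telescope": {"N"},
    "in": {"P"}, "with": {"P"},
}

def parses(sentence):
    """Return every derivation of the sentence as a bracketed string."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def build(symbol, start, end):
        trees = []
        # A single word covers the span if the lexicon allows this symbol.
        if end - start == 1 and symbol in LEXICON.get(words[start], ()):
            trees.append(words[start])
        # Otherwise try every rule at every split point of the span.
        for left, right in RULES.get(symbol, ()):
            for mid in range(start + 1, end):
                for left_tree in build(left, start, mid):
                    for right_tree in build(right, mid, end):
                        trees.append(f"[{symbol} {left_tree} {right_tree}]")
        return tuple(trees)

    return build("S", 0, len(words))

for text in (
    "I saw the man",
    "I saw the man in the park",
    "I saw the man in the park with the telescope",
):
    found = parses(text)
    print(f"{text}: {len(found)} parse(s)")
    for tree in found:
        print("   ", tree)
```

In the printed trees, each prepositional phrase appears either beside the verb phrase or inside a noun phrase, which is exactly the choice the vetoes and affirmations in KnowBuddy resolve.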

When the parser comes across disjunctions or conjunctions (which connect 2 or more phrases or clauses) there is frequent and possibly extreme ambiguity in what phrases or clauses it connects.

· “Walk or chew gum” is grammatically ambiguous. (The parser does not know whether gum can be walked.)

· “Walk or chew gum while listening” is perhaps twice as ambiguous. (Should you only listen while chewing or can you walk while listening?)

Another common but less pervasive source of ambiguity is compound words comprising more than 2 words.

· The parser does not know about carbon, steel, or bearings, so it considers “carbon steel bearings” in the same way it would “steel carbon bearings”.

The foregoing are the primary sources of grammatical ambiguity. Each of them is easily addressed using KnowBuddy, as described below.

Ambiguity in Practice

Anyone can learn a lot about ambiguity using KnowBuddy!

1. Start KnowBuddy and log on, if necessary.[1]

2. Open the “public” knowledge-base.[2]

3. Double-click on a short sentence (up to 10 words long).

4. Hit the reset button on the toolbar of the analysis form that opens.[3]

Unless you have settings on the “analysis” tab of the Settings Dialog configured otherwise, you will be presented with disambiguation dialogs containing parse trees and parts of speech.

The following assumes you have seen these dialogs, read about them in documentation provided elsewhere, and used them as encouraged above.

Pathological Cases of Lexical Ambiguity

Other documentation addresses the various parts of speech and dialogs pertaining to parts of speech that are presented by KnowBuddy and the Linguist. Here we focus on typically problematic sources of lexical ambiguity.

Assume we are using the system with the following configuration (not all of which can be adjusted in the settings dialog):

· including Roman numerals (integers written using the letters I, V, X, L, C, D, M in either case)

· including uppercase letters as potentially proper (e.g., “B” may refer to some labeled item)

· supporting the parsing of dates (e.g., integers up to 31 may designate days of the month and integers up to 12 may designate months of the year)

· supporting the parsing of times of day (e.g., integers up to 12 may designate hours of the day and integers up to 59 may designate minutes of an hour or seconds of a minute)

The following shows the result of parsing in a certain configuration in which over 800 parses were obtained in under 10 seconds for the following sentence:

· I can mean one as in the roman numeral I or 1.

The problem in this trivial sentence is a combination of annoying lexical ambiguity and grammatical ambiguity.

The lexical ambiguities include:

· The 1st ‘I’ can be either a pronoun or the Roman numeral for one.

· The ‘can’ can be either a noun, a modal auxiliary (correct), or a synonym of “fire”; ‘one’ can also be a pronoun. (Something can be a mean one!)

· The 2nd ‘I’ has the additional ambiguity that it could be a reference to section ‘I’ or the letter ‘i’ itself.[4]

· the ‘1’ can be either the integer 1 or the day of the month numbered 1 or the hour of the day numbered 1.

Of course, we can apply statistical part of speech tagging to inform the parser that:

· The 1st ‘I’ is a pronoun.

· The ‘one’ is an integer or cardinality.

· The 2nd ‘I’ is an integer or cardinality.

· The ‘1’ is an integer or cardinality.

The problem with introducing part of speech tagging is that it leads to many parsing failures which come in 2 “flavors”:

1. Interpreting a word with a given part of speech is not grammatical.

2. Only incorrect parses are obtained because the appropriate part of speech is not provided.

The state of the art in artificial intelligence is that, except for easy words, an incorrect tag is chosen 10% or more of the time.

· The math suggests, and empirical results verify, that the state of the art loses the correct parse most of the time (except for trivial sentences).
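The arithmetic behind that claim is short. Assuming, as a simplification, that tagging errors on the hard words are independent and each occurs 10% of the time, the chance that every hard word in a sentence is tagged correctly shrinks geometrically. The function name below is hypothetical.

```python
def chance_all_tags_correct(hard_words: int, error_rate: float = 0.10) -> float:
    """Probability that every hard word is tagged correctly, assuming
    independent errors at the given rate (a simplifying assumption)."""
    return (1.0 - error_rate) ** hard_words

for count in (1, 3, 5, 7, 10):
    print(f"{count} hard word(s): {chance_all_tags_correct(count):.2f}")
```

Under these assumptions, a sentence with seven hard words keeps its correct parse less than half the time, and the odds only worsen if the error rate is above 10%.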

Fortunately, this ambiguity is easy to handle, as described below.

Configuring Lexical Ambiguity

As suggested above, applications typically configure the system to their needs, avoiding issues that are superfluous given their needs or inconsistent with their requirements.

· Roman numerals may not be important in parsing medical text.

· Dates and times may not be important in parsing scientific journal articles.

· Treating stand-alone uppercase letters as proper may not be important for hand-written policies.

· 1st and 2nd person pronouns can be omitted from the lexicon in many cases.

· Proper nouns for lowercase letters are typically omitted from the lexicon.

· Proper nouns for the names of people, places, or things may be omitted (or augmented).

It is not uncommon for governance, risk, & compliance (GRC), policy automation, or decision management applications to prune and control the lexicon aggressively.

It is also common to introduce first-pass heuristics to reduce disambiguation effort. For example, every sentence with ‘I’ could first be parsed assuming the pronoun.

· Such options are incorporated directly into KnowBuddy as appropriate.

· Please send suggestions to linguist@haleyai.com.
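A first-pass heuristic of this kind amounts to trying a constrained parse and falling back to an unconstrained one only when nothing is found. The sketch below shows the control flow only. The `parse` callable, its `tags` argument, and `toy_parse` are hypothetical stand-ins, not the KnowBuddy or Linguist API, and the rule inside `toy_parse` is a pretence that exists solely to trigger the fallback.

```python
from typing import Callable, Dict, List, Tuple

Tags = Dict[str, str]
ParseFn = Callable[[str, Tags], List[str]]

def parse_with_first_pass(
    sentence: str, parse: ParseFn, assumed_tags: Tags
) -> Tuple[List[str], str]:
    """Try the assumed parts of speech first; reparse without them
    only if the first pass finds nothing."""
    first = parse(sentence, assumed_tags)
    if first:
        return first, "first pass"
    return parse(sentence, {}), "unconstrained"

def toy_parse(sentence: str, tags: Tags) -> List[str]:
    """Stand-in parser: pretends that a non-initial 'I' can never be
    parsed as a pronoun (not true in general, illustration only)."""
    words = sentence.split()
    if tags.get("I") == "pronoun" and "I" in words and words[0] != "I":
        return []
    return [f"parse of {sentence!r} with tags {tags}"]

for text in ("I saw the man", "Section I applies"):
    result, how = parse_with_first_pass(text, toy_parse, {"I": "pronoun"})
    print(f"{text}: {len(result)} parse(s) via {how}")
```

The design choice is that the cheap, constrained pass handles the common case, and the expensive pass runs only for the exceptions.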

Application-specific pre-processing may also be employed for selective tagging as well as recognizing words that “belong together”.[5]

· Please send any compound words that would be generally useful if added to the lexicon to linguist@haleyai.com.

In any case, handling even pathological ambiguity should be straightforward using KnowBuddy.

For example, checking the phrase below:

And selecting parts of speech, such as:

· that “mean” is used as a verb,

· that “in” is used as a preposition (not a phrase),

· that “one” is used as an adjective (if you are comfortable, otherwise cancel),

o one as a pronoun would be a noun phrase

o one as an hour of the day would be a noun

· that “1” is similarly adjectival (perhaps after you have some more experience)

Then, using the reparse button on the toolbar, you can affirm all the following parts of speech:

And that all these words “belong together” (these were implied by your selections above):

Lack of Grammaticality

o typically, the lack of grammaticality is apparent, but sometimes it is easily overlooked

§ information about where in the text parsing is having trouble is provided

o in some cases, the source of parse failure can be challenging to identify

§ using the parse tree dialogs and bracketing of the text usually identifies the problem

§ alternatively, reformulating the sentence, typically into shorter parts, always leads to the problem

[1] Logging in may be bypassed using cached credentials in accordance with the “general” tab of the Settings Dialog.

[2] Every new user has access to the “public” knowledge base, but the knowledge-base selection dialog may be bypassed if the last knowledge-base used is automatically re-opened. If so, clear the corresponding setting on the “general” tab of the Settings Dialog and re-launch to open the “public” knowledge base.

[3] Familiarize yourself with any toolbar by hovering over its icons or controls and reading the tool-tips.

[4] Pragmatically, ‘I’ is not considered as potentially proper (other than as the pronoun) when it is the first word in a sentence.

[5] Please feel free to contact linguist@haleyai.com to assess and plan for any potential application-specific needs.