TA/NLP: It’s a jungle out there!

Text analytics and natural language processing have made tremendous advances in the last few years. Unfortunately, there is a lot more to understanding natural language than TA/NLP.

I was reading a paper today about NLP pipelines for question answering. It used machine learning to find which tools are good at which tasks and to configure a pipeline by selecting, for each type of component in the pipeline, the best tool for a given task. The paper has a long list of components, so I checked a few out. Most of the interesting ones were available on the web, so they could be composed into pipelines without a lot of software setup. Looking at these, I quickly grew tired and disappointed. Here are some of the reasons.
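First, as an aside, the selection idea itself is simple enough to sketch. The following is only a rough illustration, not the paper's method: the paper learns which tools suit which tasks, whereas this sketch assumes a fixed table of scores. The tool names, component types, task labels, and numbers are all made up for the example.

```python
from typing import Dict, List

# Hypothetical quality scores: component type -> tool -> task -> score.
# In the paper these would be learned; here they are invented numbers.
SCORES: Dict[str, Dict[str, Dict[str, float]]] = {
    "entity_recognizer": {
        "ner_tool_a": {"legal": 0.41, "medical": 0.62},
        "ner_tool_b": {"legal": 0.55, "medical": 0.48},
    },
    "relation_extractor": {
        "rel_tool_a": {"legal": 0.30, "medical": 0.35},
        "rel_tool_b": {"legal": 0.28, "medical": 0.52},
    },
}


def configure_pipeline(task: str, order: List[str]) -> List[str]:
    """Pick the highest-scoring tool for `task` from each component type."""
    pipeline = []
    for component_type in order:
        candidates = SCORES[component_type]
        # A tool with no score for this task is treated as scoring 0.0.
        best = max(candidates, key=lambda tool: candidates[tool].get(task, 0.0))
        pipeline.append(best)
    return pipeline


print(configure_pipeline("legal", ["entity_recognizer", "relation_extractor"]))
```

Note that picking the best tool per component says nothing about whether the best available tool is any good, which is the concern in what follows.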

I am not surprised by these results.  NLU is hard.  But they are not particularly strong results either.  I’m surprised that people find such results useful (if they do).

I know that Cal. App. is an organization (i.e., the California Court of Appeal), and I’m not surprised this system missed it, but perhaps it should have identified California?

I understand that the People can be confusing, especially at the beginning of a sentence, but a knowledgeable or learned system would have recognized People as referring to persons (the public) or an organization (e.g., a government).

Similarly, a knowledgeable or learned system would recognize this as a reference to a legal case and perhaps classify it with regard to organization, place, and event.

I guess I can understand why this system doesn’t bother identifying single words as concepts.  But I don’t really understand how “oil well” is a concept but not “oil well driller” or “rock bit” or even “effective life”.  It’s important to understand that “transfer of title” is a concept or event here, too. (Perhaps it should even classify ‘(a) of this section’ as referring to a work.)

It’s pretty clear that a company is an organization, not just a concept, so I’m a little surprised by these errors.  It missed section 6352 as a work (as above), so this system would not be very good for legal or contract purposes.

The mistake on “private” is interesting, especially that it did not identify “private individuals” as either a concept or as persons.

In real life, text is not always pretty.  This is real life. The prior examples are verbatim from the California Sales and Use Tax Law & Regulations.

The following text occurs in the body of a medical test prep exam from one of our clients.  Note that it misses lots of people here.  It’s also almost completely arbitrary about what it classifies as a concept.  It’s particularly odd that it classifies an adjective as a concept when the concept should be “clinical features”.  It misses “Pancreas” (twice) and “Blumgart’s Surgery” as referring to a person or organization and being a concept specializing “surgery”.  It also misses that various parts of this are references to works.  And it misses that 2010 and 2012 are references to years (events).

The following may be fine, but how do you use it? How do those modifiers of 23 and year-old relate to each other? Isn’t she the subject of the telling? That’s enough for me.

The following is a little difficult to “parse”, especially the first row.  I don’t get the ‘that’ as the subject nor ‘her be’ as the object of several of these. (And I wonder, what is ‘it’ as a predicate?)

I expect activities to be the subject of ‘requires’ and the object of ‘participates’.  I expect her to be the subject of ‘participates’ and (part of) the object of ‘requires’.

In the “you get what you pay for” category, here are some results from Amazon’s Comprehend service.  I like that it got the section as an entity and I don’t fault it for perceiving some negative sentiment here (except as follows).

To the extent such things (like Amazon) do well with the smallest noun phrases, they could also do some prepositional phrases, which would be nice, but they are far from understanding what phrases complement (i.e., other phrases or clauses).

I would be more impressed if it noted the more interesting phrases here, such as ‘collect or pay’ or ‘collect or pay a use tax’ or ‘duty to collect or pay a use tax’.  On the other hand, it takes a knowledgeable or very learned system to do that.

And I would be even more impressed if it recognized (with useful confidence) that it’s the company being relieved from a duty that is negated (without negative sentiment).