Benjamin Grosof, co-founder of Coherent Knowledge Systems, is also involved in developing a standard ontology for the financial services industry (i.e., FIBO). In the course of working on FIBO, he is developing a demonstration of defeasible logic concerning Regulation W of the Federal Reserve Act. Regulation W specifies which transactions involving banks and their affiliates are prohibited under Section 23A of the Act. As part of this work, various documents are being captured within the Linguist™ platform. This is a brief note on how those documents can be imported into the platform for curation into formal semantics and logic (as Benjamin and Coherent are doing).
There is a document from the Federal Reserve with an appendix reviewing Regulation W:
PDFs are challenging because they are images more than documents. There are lots of problems in getting text for natural language processing out of PDFs (as discussed in great detail in this article). Microsoft Word does not extract the text well at all. Google's HTML conversion does a pretty nice job, but it breaks the paragraphs up into unrelated divisions: every line on the screen becomes its own division in the resulting HTML. That renders fine on screen, but it’s not good for extracting paragraphs and sentences for knowledge acquisition. You can see for yourself by Googling for the above URL and looking at the cached HTML Google maintains. Here’s how it looks in the Linguist:
The next step is to click the XHTML button. This leads to a question about normalizing the document from the Google PDF structure. Answering yes results in the following:
At this point we’re good to go. You can use the XHTML Explorer to get rid of extraneous things, such as headers and footers (e.g., Page X of Y). But the objective is to get sentences from the text ready for natural language processing. By clicking in a paragraph of text on the right and using the menu for the highlighted node on the left, you can pop up a dialog like this:
And so on to populate the knowledge base for curation.
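For readers who want to experiment with the same kind of clean-up outside the Linguist, here is a minimal Python sketch of the steps described above: collecting the one-division-per-line HTML, dropping "Page X of Y" footers, merging lines back into paragraphs, and splitting paragraphs into sentences. It is not the Linguist's implementation. The function names, the flat `<div>` structure, the 80% line-length heuristic for spotting the end of a paragraph, and the sample text are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed footer shape, taken from the "Page X of Y" example in the text.
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+$", re.IGNORECASE)


class LineDivCollector(HTMLParser):
    """Collect the text of each <div>, assuming one flat div per visual line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = None  # text chunks of the div currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._buf is not None:
            # Collapse runs of whitespace inside the line.
            self.lines.append(" ".join("".join(self._buf).split()))
            self._buf = None


def merge_lines(lines, short_ratio=0.8):
    """Merge visual lines into paragraphs (heuristic, not exact).

    A paragraph is assumed to end at an empty line, or at a line that ends
    with sentence punctuation and is clearly shorter than the widest line.
    """
    lines = [ln for ln in lines if not PAGE_FOOTER.match(ln)]
    width = max((len(ln) for ln in lines), default=0)
    paragraphs, current = [], []
    for ln in lines:
        if not ln:
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln)
        if ln.endswith((".", "?", "!")) and len(ln) < short_ratio * width:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def paragraphs_from_html(html):
    """Parse line-per-div HTML and return the reconstructed paragraphs."""
    collector = LineDivCollector()
    collector.feed(html)
    return merge_lines(collector.lines)


def split_sentences(paragraph):
    """Naive splitter: break after . ? ! when a capital letter follows."""
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", paragraph)


# Made-up sample imitating the line-per-division structure described above.
SAMPLE = """
<div>Regulation W specifies which transactions involving</div>
<div>banks and their affiliates are prohibited. This is a</div>
<div>made-up sample.</div>
<div>Page 1 of 2</div>
<div>A second paragraph starts here and continues on the</div>
<div>next line.</div>
"""

if __name__ == "__main__":
    for paragraph in paragraphs_from_html(SAMPLE):
        for sentence in split_sentences(paragraph):
            print(sentence)
        print()
```

The sentence splitter is deliberately naive and will stumble on abbreviations followed by a capitalized word, which is one reason a curation step with a human in the loop remains useful.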