Importing documents into a knowledge base

Many users end up wanting to import sentences into the Linguist rather than type or paste them in one at a time. One way to do this is to right-click on a group within the Linguist and import a file of sentences, one per line, into that group. But if you want to import a large document and retain its outline structure, some application-specific use of the Linguist APIs may be the way to go.

We’ve written up how to import a whole web site on California Sales & Use Tax.
There are quite a few ways to do this, but we took the following step-by-step approach:
  1. walk the web site and extract the content of the law into a number of files
  2. process the files into the content of individual sections within an outline structure
  3. break the content down into individual sentences
  4. create the outline structure and populate it with the sentences
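Steps 2 through 4 can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the write-up and not the Linguist API: the `Section` class, `OutlineBuilder`, and `build_outline` are hypothetical names, the outline is inferred from `<h1>` to `<h6>` headings, and the input is assumed to be HTML whose paragraphs are closed with `</p>`. Walking the site (step 1) is left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
import re


@dataclass
class Section:
    """One node of the outline: a heading, its sentences, and its subsections."""
    title: str
    level: int
    sentences: list = field(default_factory=list)
    children: list = field(default_factory=list)


class OutlineBuilder(HTMLParser):
    """Builds a Section tree from <h1>..<h6> headings and <p> paragraphs."""

    def __init__(self):
        super().__init__()
        self.root = Section(title="(document)", level=0)
        self._stack = [self.root]   # path from the root to the current section
        self._tag = None            # heading or paragraph tag being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]|p", tag):
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore inline tags such as </b> inside a paragraph
        text = " ".join("".join(self._text).split())
        self._tag = None
        if not text:
            return
        if tag == "p":
            # Step 3: naive split after ., ! or ? followed by whitespace.
            self._stack[-1].sentences += re.split(r"(?<=[.!?])\s+", text)
        else:
            # Steps 2 and 4: place the heading at the right depth in the outline.
            level = int(tag[1])
            while self._stack[-1].level >= level:
                self._stack.pop()
            section = Section(title=text, level=level)
            self._stack[-1].children.append(section)
            self._stack.append(section)


def build_outline(html: str) -> Section:
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `build_outline` on a page yields a tree whose nodes could then be walked to create groups and add sentences through whatever import mechanism is in use. The sentence split here is deliberately crude and will break on abbreviations.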
The resulting knowledge base is ready for text analysis, automated parsing, and disambiguation, as shown here:

Our favorite case is importing pages from web sites, such as Wikipedia articles or regulatory text published online. For the most part, the text is available as HTML. In some cases, it’s available as Word documents. And sometimes it’s only available as PDF.

  • PDF documents can be saved as text, imported into Word, or read programmatically
  • Microsoft Word documents are no problem since they can be saved as HTML
  • The Linguist has utilities that extract the HTML from ePub textbooks
  • The Linguist also has utilities that extract text from PDFs, but PDFs typically require more work
    • PDF can be saved to text (possibly with in-line headers & footers)
    • PDF can be imported by Word (sometimes losing headers & flow)
    • Multi-column PDF can be difficult if the flow of the document is not clear
    • PDF can be difficult if there is excess hyphenation due to even justification
  • HTML can be transformed into XML, which is easy to process in various languages (e.g., Java)
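Two of the PDF problems listed above, in-line headers and footers and excess hyphenation, can often be reduced with simple text clean-up after saving the PDF as text. The following is a heuristic sketch in Python, not one of the Linguist utilities. It assumes the text has already been split into one string per page, and the function names and the `min_fraction` threshold are illustrative choices.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages, which are likely headers or footers."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in repeated)
        for page in pages
    ]


def dehyphenate(text):
    """Rejoin words that were split with a hyphen at the end of a line."""
    # Join only when a lowercase letter starts the next line. This heuristic
    # will also merge genuine compounds that happen to break at the hyphen.
    return re.sub(r"(\w)-\n\s*([a-z])", r"\1\2", text)
```

Footers that change from page to page, such as page numbers, are not caught by the exact-match test and would need a pattern of their own. De-hyphenated output is worth spot-checking against a dictionary if accuracy matters.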
In the article we discuss a number of finer points, especially with regard to tokenization and sentence splitting.
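As one illustration of why sentence splitting needs care, a splitter that breaks on every period will cut sentences after abbreviations such as "e.g." or "Sec." The sketch below, in Python, shows one simple way around that. It is not the approach taken in the article: the `ABBREVIATIONS` set is a hypothetical and deliberately short list, and tokens are simply whitespace-separated words.

```python
# Hypothetical, deliberately short abbreviation list; a real one would be longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "sec.", "no.", "inc.", "u.s."}


def split_sentences(text):
    """Split text into sentences without breaking after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks closing punctuation.
        sentences.append(" ".join(current))
    return sentences
```

This still mishandles an abbreviation that genuinely ends a sentence, and punctuation followed by a closing quote or parenthesis, which is the kind of detail that makes tokenization and sentence splitting worth treating separately.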