We are collaborating in the acquisition of knowledge concerning clinical trials. Initially, we are looking at trials related to pancreatic cancer, such as A Study Using 18F-FAZA and PET Scans to Study Hypoxia in Pancreatic Cancer.
At http://clinicaltrials.gov, each trial is rendered as HTML for browsing from an underlying XML file, which can be downloaded. Although we can automatically parse the underlying XML into content for knowledge acquisition, this article looks at acquiring the knowledge about an individual trial using the web presentation. In particular, we look at the logical, semantic, and linguistic issues of understanding eligibility criteria.
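The automatic route mentioned above might start along the following lines. This is only a minimal sketch: the element names (`clinical_study`, `id_info/nct_id`, `eligibility/criteria/textblock`) and the "Inclusion Criteria:" heading convention are our assumptions about the downloadable XML, and the sample record is abbreviated and written by hand for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated record. The element names are assumptions about
# the downloadable XML, not something this article specifies.
SAMPLE_XML = """\
<clinical_study>
  <id_info><nct_id>NCT01542177</nct_id></id_info>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

        - Minimum age of 18 years old
        - Histologic diagnosis of pancreatic adenocarcinoma
      </textblock>
    </criteria>
  </eligibility>
</clinical_study>
"""


def extract_criteria(xml_text):
    """Return (trial_id, {"inclusion": [...], "exclusion": [...]}) from one record."""
    root = ET.fromstring(xml_text)
    trial_id = root.findtext("id_info/nct_id")
    block = root.findtext("eligibility/criteria/textblock") or ""

    sections = {}
    current = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("Criteria:"):
            # "Inclusion Criteria:" -> "inclusion"
            current = line[:-1].split()[0].lower()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
        elif current is not None and sections[current]:
            # A wrapped line continues the previous bullet.
            sections[current][-1] += " " + line
    return trial_id, sections


trial_id, sections = extract_criteria(SAMPLE_XML)
for criterion in sections["inclusion"]:
    print(trial_id, "|", criterion)
```

Each extracted bullet is still just a text fragment; the rest of this article is about what it takes to give such fragments a logical reading.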
In the trial referenced above, the following are listed as eligibility criteria for inclusion (there are additional criteria concerning exclusion):
- Minimum age of 18 years old
- Histologic diagnosis of pancreatic adenocarcinoma
- TNM (7th edition) cT1-4, N0-1, M0-1
- No cytotoxic anti-cancer therapy for advanced / metastatic pancreatic cancer prior to study entry
- Ability to provide written informed consent to participate in the study
- ECOG performance status 0, 1 or 2.
- Patient should have the following blood counts at baseline: ANC equal or greater to 1.5 x 10⁹/L; Platelets equal or greater to 100 x 10⁹/L; Hgb equal or greater to 9g/Dl
- Patient should have the following blood chemistry levels at baseline: AST (SGOT), ALT (SGPT) equal or less than 5 x upper limit of normal range (ULN) is allowed
- Patient has an identifiable tumor (pancreatic tumor and/or metastasis) by imaging (CT scan and/or MR)
- Patient must agree to use contraception considered adequate and appropriate by the investigator and if the patient is female of child-bearing potential, as evidenced by regular menstrual periods, she must have a negative serum pregnancy test (B-hCG)
Acquiring the logical semantics of such criteria so that they can be used in an application poses a number of issues and challenges. For example, the first criterion needs to be understood as any of the following:
- Every patient participating in the study must be at least 18 years old.
- No patient under 18 years old may participate in the study.
- Patients participating in the study must have a minimum age of 18 years old.
Some of these read better than the others. The fact is that the criteria are written with a lot of context in mind. The context includes the particular study and the fact that each criterion concerns the eligibility of each and every patient who may participate in the study.
Consider that the first criterion above is not a sentence in English, nor does it correspond to a logical axiom, whether in first order logic, a description logic, or an ontology. Linguistically, the first criterion above is a fragment of a noun phrase.
Interpreting such criteria logically requires that they be directly or indirectly interpreted as logical axioms, by which we mean to include first order axioms or description logic axioms, as in web ontology language (i.e., OWL).
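As a rough illustration of the two kinds of axiom, the first criterion might be written along the following lines. The predicate and role names (Patient, participatesIn, ageInYears) and the constant s standing for the study are assumptions made here for readability; they are not the output of our software, and the notation for the datatype restriction varies between description logic texts.

```latex
% First-order form: every patient who participates in study s is at least 18.
\[
\forall x \, \bigl( \mathrm{Patient}(x) \land \mathrm{participatesIn}(x, s)
  \rightarrow \exists a \, ( \mathrm{ageInYears}(x, a) \land a \geq 18 ) \bigr)
\]

% Description logic form, with s as a named individual.
\[
\mathrm{Patient} \sqcap \exists \mathrm{participatesIn}.\{ s \}
  \sqsubseteq \exists \mathrm{ageInYears}.(\geq_{18})
\]
```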
The reformulated bullets above meet the requirement of being complete English sentences, which correspond directly to logical axioms. However, even those sentences have a remaining issue with the definite reference in “the study”.
Of course, references to a particular study are pervasive within its inclusion and exclusion criteria. There is also an implicit reference to the patient in each such criterion. The question is, how do we capture or represent such implicit references or context within the knowledge acquired from the clinical trial summaries as drafted? For example, if we interpret the first criterion above as:
- The patient has a minimum age of 18 years old.
- {∀(?x6)patient(?x6)⇒{∃(?x9)({∃(?x15)(#(?x15,18)∧year(?x15)∧minimum(age)(of(?x15))(?x9))}∧have(predicate)(?x6,?x9,old(?x9)))}}
Note that the logic here is universally quantified over all patients, which is not what we desire. Rather, we want to quantify only over patients who participate in the study, as in:
- Every patient in the study has a minimum age of 18 years old.
- {∃(?x8)(study(?x8)∧{∀(?x6)patient(in(?x8))(?x6)⇒{∃(?x22)(#(?x22,18)∧year(?x22)∧{∃(?x16)(minimum(age)(of(?x22))(?x16)∧have(predicate)(?x6,?x16,old(?x16)))})}})}
To complete the logic, we want to resolve the particular study, perhaps using the Clinical Trials ID number (NCT01542177 in this case) instead of the variable ?x8 above, as in:
- {∀(?x6)patient(in(NCT01542177))(?x6)⇒{∃(?x22)(#(?x22,18)∧year(?x22)∧{∃(?x16)(minimum(age)(of(?x22))(?x16)∧have(predicate)(?x6,?x16,old(?x16)))})}}
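The grounding step can be pictured as dropping the existential over studies and substituting the trial identifier for its variable. The sketch below does this over a tiny formula tree. The classes, the `ground_study` helper, and the simplified predicates `patient_in` and `age_at_least` are hypothetical stand-ins chosen for brevity; they are not our software's representation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class Atom:
    predicate: str
    args: Tuple[Union[Var, Const], ...]


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Implies:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Quantified:
    quantifier: str  # "∀" or "∃"
    var: Var
    body: "Formula"


Formula = Union[Atom, And, Implies, Quantified]


def substitute(f, var, term):
    """Replace free occurrences of var in f with term."""
    if isinstance(f, Atom):
        return Atom(f.predicate, tuple(term if a == var else a for a in f.args))
    if isinstance(f, (And, Implies)):
        return type(f)(substitute(f.left, var, term), substitute(f.right, var, term))
    if f.var == var:
        return f  # an inner quantifier rebinds var, so leave it alone
    return Quantified(f.quantifier, f.var, substitute(f.body, var, term))


def ground_study(f, trial_id):
    """Turn ∃s (study(s) ∧ body) into body with s replaced by the trial id."""
    is_study_scope = (
        isinstance(f, Quantified)
        and f.quantifier == "∃"
        and isinstance(f.body, And)
        and f.body.left == Atom("study", (f.var,))
    )
    if not is_study_scope:
        raise ValueError("expected a formula of the form ∃s (study(s) ∧ ...)")
    return substitute(f.body.right, f.var, Const(trial_id))


def show(f):
    """Render a formula as text, writing variables with a leading '?'."""
    if isinstance(f, Atom):
        args = ", ".join("?" + a.name if isinstance(a, Var) else a.name for a in f.args)
        return f"{f.predicate}({args})"
    if isinstance(f, And):
        return f"({show(f.left)} ∧ {show(f.right)})"
    if isinstance(f, Implies):
        return f"({show(f.left)} ⇒ {show(f.right)})"
    return f"{f.quantifier}?{f.var.name} {show(f.body)}"


s, p = Var("x8"), Var("x6")
formula = Quantified("∃", s, And(
    Atom("study", (s,)),
    Quantified("∀", p, Implies(
        Atom("patient_in", (p, s)),
        Atom("age_at_least", (p, Const("18"))),
    )),
))
grounded = ground_study(formula, "NCT01542177")
print(show(grounded))
# ∀?x6 (patient_in(?x6, NCT01542177) ⇒ age_at_least(?x6, 18))
```

Keeping the study as an explicit variable until this last step means the same criterion text can be interpreted once and then grounded for whichever trial it came from.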
To simplify the acquisition of the linguistic fragments that occur as eligibility criteria, we are doing a few things. First, we parse and disambiguate noun phrases and other fragments in addition to complete sentences. Second, we understand that such fragments occur within the context of universals over patients and a definite trial (which could be quantified universally or existentially). In doing this, we have found that it is better (or, perhaps, we simply prefer) to make the fragments sentences, both because it’s easier for now and because the sentences tend to clarify the intended meaning beyond the text an author never thought a machine would process. Moreover, the resulting sentences are more accessible to patients! (Indeed, we hope patients and their loved ones do much of the reformulation and disambiguation through a non-profit.)
For example, the following sentences may be more verbose than necessary, since acronyms and abbreviations of units pose no issues for our software. Nonetheless, more people will find the information perspicuous, such as in the context of an application that helps people find trials for which they are eligible:
- Patients must be at least 18 years old.
- Patients must be histologically diagnosed with pancreatic adenocarcinoma.
- Patients’ cancer must have tumors of a measurable extent with no lymph node metastasis beyond regional lymph nodes.
- Patients have not received cytotoxic anti-cancer therapy for advanced or metastatic pancreatic cancer prior to the study.
- Patients are able to provide written informed consent to participate in the study.
- Patients are ambulatory and capable of all self-care.
- Patient’s blood has at least 1,500 neutrophil granulocytes per microliter.
- Patient’s blood has at least 100,000 platelets per microliter.
- Patient’s blood has at least 9 grams of hemoglobin per deciliter.
- Patients have acceptable liver function as indicated by serum alanine and aspartate aminotransferase levels at most five times the upper limit of normal.
- Patient has a pancreatic tumor or metastasis identifiable by computed tomography or magnetic resonance imaging.
- Patient must agree to use contraception considered adequate and appropriate by the investigator.
- Female patients with regular menstrual periods must have a negative blood serum pregnancy test for beta human chorionic gonadotrophin.
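To suggest how sentences like these might serve an application that helps people find trials, here is a small sketch that checks a patient record against a few of them. The `Patient` fields, the sample values, and the hand-coded checks are all hypothetical; in practice the checks would come from the acquired logic rather than being written by hand. The conversion helper shows why 1.5 x 10⁹ per liter in the original criteria reads as 1,500 per microliter in the reformulation.

```python
from dataclasses import dataclass

MICROLITERS_PER_LITER = 1_000_000


def per_liter_to_per_microliter(count_per_liter):
    """Convert a cell count per liter to a count per microliter."""
    return count_per_liter / MICROLITERS_PER_LITER


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    age_years: int
    ecog: int
    neutrophils_per_ul: float
    platelets_per_ul: float
    hemoglobin_g_per_dl: float


# Each entry pairs a reformulated sentence with a hand-written check.
# Thresholds are taken from the trial's criteria as quoted in this article.
CRITERIA = [
    ("Patients must be at least 18 years old.",
     lambda p: p.age_years >= 18),
    ("Patients have an ECOG performance status of 0, 1 or 2.",
     lambda p: p.ecog in (0, 1, 2)),
    ("Patient's blood has at least 1,500 neutrophil granulocytes per microliter.",
     lambda p: p.neutrophils_per_ul >= per_liter_to_per_microliter(1.5e9)),
    ("Patient's blood has at least 100,000 platelets per microliter.",
     lambda p: p.platelets_per_ul >= per_liter_to_per_microliter(100e9)),
    ("Patient's blood has at least 9 grams of hemoglobin per deciliter.",
     lambda p: p.hemoglobin_g_per_dl >= 9),
]


def unmet_criteria(patient):
    """Return the sentences for every criterion the patient does not satisfy."""
    return [text for text, check in CRITERIA if not check(patient)]


# Made-up values, not data from any real patient.
candidate = Patient(age_years=54, ecog=1, neutrophils_per_ul=1200,
                    platelets_per_ul=180_000, hemoglobin_g_per_dl=11.2)
for sentence in unmet_criteria(candidate):
    print("Not met:", sentence)
```

Reporting the unmet sentences, rather than a bare yes or no, is one way the plain-English reformulations could pay off for patients.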
Using our software, the reformulated sentences can be disambiguated to produce logical formulas such as those shown earlier. You can see some examples of that process in the video discussed in this article on translating English into logic and read more about it in this post concerning knowledge acquisition from a biology textbook.
When you look over the original sentences, you will probably agree that some need such clarity. But some could be handled more generally with less revision. We would really appreciate your opinions and suggestions as to which cases should be treated how!