Natural language processing (NLP)
Natural language processing (NLP) is an automated technique that converts narrative documents into a coded form that is appropriate for computer-based analysis. Capabilities that NLP provides in the context of healthcare include parsing a sentence into its component structures, understanding the medical vocabulary and clinical terms used, disambiguating the context in order to interpret the clinical terms correctly within the broader context of the documentation, and representing the processed information for further use.
Health care providers routinely capture large amounts of clinical data at the point of care. As we move into the era of Electronic Health Records (EHRs), much discussion has occurred about how best to enter relevant clinical data. Direct entry of data into the EHR by healthcare providers is important for obtaining the most accurate information. There are several options for data entry that have been considered.
Coded data entry is an approach that may result in the most effective capture of information. Free text entry of data by healthcare providers limits the usefulness of the data for anything beyond future healthcare encounters. However, coded entry systems, applied at the point of care, may limit the productivity of the user. Coding systems may also be too rigid or succinct to capture the broad details and subtleties of the clinical encounter.
NLP technology is now maturing and is showing great promise. Used properly and judiciously, it has a wide array of applications within the healthcare industry. It is currently used most commonly on medical records (for example, to detect adverse events that occur during hospitalizations and clinical conditions described in narrative radiology reports), as well as in coding for billing purposes. It remains unclear, however, whether natural language processors will eventually be able to detect a wide enough range of the complex events of clinical care to play a broader role in the coded data entry process in the EHRs of the future.
Natural Language Processing in EMRs
The term Natural Language Processing describes the function of software or hardware components in a computer system that analyze or synthesize spoken or written language. The term ‘natural’ is meant to distinguish human speech and writing from more formal languages, such as mathematical notations or programming languages, where the vocabulary and syntax are comparatively restricted (Jackson & Moulinier, 2007).
One of the goals of using EHRs in healthcare is to represent health information in a structured form so that it can be re-used for decision support and quality assurance. However, free text fields are often used in EHRs to facilitate rapid data entry by clinicians. These free text items create a major limitation because of the inability to reliably access the vast amount of clinical information that is locked in the application as free text. Natural language processing systems potentially offer a solution because they can extract individual words and also represent well-defined relations among words.
Understanding natural language involves three components (Friedman & Hripcsak, 1999):
- Syntax, or understanding the structure of sentences
- Semantics, or understanding the meaning of words and how they are combined to form the meaning of a sentence
- Domain knowledge, or information about the subject matter
Natural language processing has made good progress in medicine because the domain is restricted. Clinical text uses a sub-language that has less variety, ambiguity, and complexity than general language, and it involves only the specific information and relations relevant to the subject. For example, there are informational categories associated with body location (e.g., chest), symptom (e.g., pain), and severity (e.g., severe). The systems that have been developed differ in how much syntactic, semantic, and domain knowledge they use.
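The idea of informational categories can be illustrated with a minimal sketch. It assumes a tiny hand-written lexicon (the `LEXICON` table and the function names `tag_categories` and `to_frame` are hypothetical, introduced only for this illustration) and does simple word lookup. It is not how any of the systems discussed here are implemented.

```python
import re

# Hypothetical mini-lexicon: each term maps to one informational category.
# A real clinical system would draw on a far larger controlled vocabulary.
LEXICON = {
    "chest": "body_location",
    "abdomen": "body_location",
    "pain": "symptom",
    "cough": "symptom",
    "severe": "severity",
    "mild": "severity",
}


def tag_categories(sentence: str) -> list[tuple[str, str]]:
    """Return (word, category) pairs for every lexicon term in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(w, LEXICON[w]) for w in words if w in LEXICON]


def to_frame(sentence: str) -> dict[str, list[str]]:
    """Group the tagged words by category into a simple structured frame."""
    frame: dict[str, list[str]] = {}
    for word, category in tag_categories(sentence):
        frame.setdefault(category, []).append(word)
    return frame


if __name__ == "__main__":
    print(to_frame("Patient reports severe chest pain."))
```

Word lookup alone uses only domain knowledge. It ignores syntax, so it cannot tell which severity belongs to which symptom when a sentence mentions several.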
Several systems are well known in the clinical world. LSP (Linguistic String Project), a pioneer in general English language processing, has been adapted to medical text. MENELAS was created by a consortium that aimed to provide better access to patient discharge summaries. Both LSP and MENELAS use comprehensive syntactic and semantic knowledge about the structure of the complete sentence. The MedLEE system operates as an independent module of a clinical information system at New York Presbyterian Hospital and is used daily. It is the first NLP system used for actual patient care that was shown to improve care, and it has also been integrated into voice recognition systems. MedLEE relies heavily on general semantic patterns interleaved with some syntax and also includes knowledge of the entire sentence (Friedman, Hripcsak, & Shablinsky, 1998).
Once information is extracted from text, it can be stored in any well-defined format. For example, information stored in XML can be used in web reports (Friedman & Hripcsak, 1999). Besides free text, natural language processing can also be used in conjunction with continuous voice recognition systems in EHRs. Integrating a voice recognition system with a natural language processing system substantially enhances the functionality of the voice system. Physicians can dictate notes in their usual fashion while the natural language processor translates the report into a structured, encoded form in the background (Friedman & Hripcsak, 1999).
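As a rough illustration of storing extracted information in XML, the sketch below serializes one finding with Python's standard `xml.etree.ElementTree` module. The element names (`finding`, `symptom`, and so on) and the function `finding_to_xml` are assumptions made for this example and do not follow any published clinical schema.

```python
import xml.etree.ElementTree as ET


def finding_to_xml(finding: dict[str, str]) -> str:
    """Serialize one extracted finding as an XML string.

    The element and field names are hypothetical, chosen only for
    illustration; they do not follow any published clinical schema.
    """
    root = ET.Element("finding")
    for field_name, value in finding.items():
        child = ET.SubElement(root, field_name)
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    extracted = {"symptom": "pain", "body_location": "chest", "severity": "severe"}
    print(finding_to_xml(extracted))
```

Because the output is ordinary XML, a web report or another application can parse it without knowing anything about the NLP system that produced it.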
Another application of natural language processing in EHRs is linking from the EHR to on-line information resources. This is done by parsing the plain text reports from Web-based EHRs and using the results to identify clinical findings in the text. This information is used to provide automated links to on-line information resources via Infobuttons (Janetzki, Allen, & Cimino, 2004).
A study evaluating the automated detection of clinical conditions described in narrative reports found that the natural language processor could not be distinguished from physicians in how it interpreted narrative reports. It was also found to be superior to the other comparison subjects in the study, which included internists, radiologists, laypersons, and other computer methods. This study suggests that natural language processing can extract clinical information from narrative reports in a manner that would support automated decision support and clinical research (Hripcsak, Friedman, Alderson, DuMouchel, Johnson, & Clayton, 1995).
Biomedical Text Mining
One of my daily tasks as a librarian is searching in the published literature for articles on a specific topic. For example, a pediatrician needs the current practice guideline on immunizations. I find the needed article by searching in a biomedical bibliographic database called Medline. The search strategies employed in this task include using MeSH, Boolean operators, limits, etc. I am part of the labor force of manual text mining and didn’t know it.
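A search strategy of this kind can be sketched in code. The example below is only an illustration: it assumes PubMed-style field tags (`[MeSH Terms]`, `[Publication Type]`), and the helper names `field`, `all_of`, and `any_of` are hypothetical. It builds a query string and does not contact any database.

```python
def field(term: str, tag: str) -> str:
    """Quote a term and attach a search field tag, e.g. "x"[MeSH Terms]."""
    return f'"{term}"[{tag}]'


def all_of(*clauses: str) -> str:
    """Join clauses with Boolean AND."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with Boolean OR."""
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    # Immunization guidelines: either subject heading, limited by publication type.
    query = all_of(
        any_of(field("Immunization", "MeSH Terms"), field("Vaccination", "MeSH Terms")),
        field("Practice Guideline", "Publication Type"),
    )
    print(query)
```

The OR group broadens the search across related subject headings, while the AND clause acts as a limit that narrows the results to guidelines.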
What is text mining? “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.” 
As information seekers, we find medical information from two types of sources: structured (e.g. blood pressure, gender, age found in medical records) and unstructured (e.g. text in documents, web pages, manuals, reports, email, faxes, presentations, and published literature). Text mining applies to both types, but text mining tools are designed to target unstructured data, such as published literature.
In biomedicine, the volume of published research has resulted in an exponentially growing biomedical knowledge base. Medline 2007, for example, contains 15 million records, a 2.5 million increase from 2004. This knowledge base can help researchers in the diverse subfields of biomedicine discover new knowledge, and biomedical text mining is the means of doing so. Using this technology, researchers can identify needed information more efficiently and discover relationships obscured by the sheer volume of available information. Text mining tools such as SAS and Oracle Text help carry the burden of information overload. These tools employ algorithmic and statistical methods, context indexing, decision trees, filtering, etc.
CURRENT RESEARCH IN BIOMEDICAL TEXT MINING
In their article, “A survey of current work in biomedical text mining” published in 2005, Cohen and Hersh described five themes in text mining under current research.
- NAMED ENTITY RECOGNITION – identify all instances of a name for a specific thing (e.g. all of the gene names and symbols within a collection of articles) to extract key concepts of interest and allow those concepts to be represented in a consistent form.
- TEXT CLASSIFICATION – determine whether a document has certain characteristics of interest. Database curators found that this technique reduced the number of abstracts they had to read by two thirds.
- SYNONYM AND ABBREVIATION EXTRACTION – automate the collection and mapping of synonyms and abbreviations of biomedical entities.
- RELATIONSHIP EXTRACTION – detect a specific type of relationship between two entities, e.g. biochemical association.
- HYPOTHESIS GENERATION – identify unrecognized relationships worthy of further investigation that could lead to promising hypotheses.
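Of the five themes, abbreviation extraction is perhaps the easiest to sketch in a few lines. The example below is a deliberately simplified heuristic, not one of the methods surveyed by Cohen and Hersh: it assumes the pattern "long form (ABBR)" and accepts a pair only when the initials of the preceding words spell the abbreviation. The function name `extract_abbreviations` is hypothetical.

```python
import re


def extract_abbreviations(text: str) -> dict[str, str]:
    """Map each abbreviation to the long form found just before it.

    Heuristic: for "long form (ABBR)", take as many preceding words as the
    abbreviation has letters and accept them only if their initials match.
    """
    pairs: dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        preceding = re.findall(r"[A-Za-z]+", text[: match.start()])
        candidate = preceding[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == abbr:
            pairs[abbr] = " ".join(candidate)
    return pairs


if __name__ == "__main__":
    sample = "Natural language processing (NLP) can read an electronic health record (EHR)."
    print(extract_abbreviations(sample))
```

A heuristic this simple misses abbreviations with lowercase letters or skipped words (for example "MeSH"), which is part of why the problem remains an active research theme.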
Biomedical text mining increases the usability and quality of text data, which enables researchers to use this data in the clinical decision process. However, text-mining researchers need to coordinate and cooperate across disciplines to develop tools based on real-world needs.
Electronic Health Record (EHR) adoption has been very slow in the United States due to a number of factors [1]. One in particular, which is more idiosyncratic to medicine because of its unique culture, is the method of data entry. Providers typically dictate because it is the most efficient and effective method and, most importantly, the one they are accustomed to.
One of the main reasons proponents give for dictation is that it retains the meaning of the Provider-Patient dialogue, also referred to as “the story” [2]. Those opposed state that the need for and benefits of structured and encoded data far outweigh the inconvenience of data entry by means other than dictation [3].
Natural Language Processing (NLP) [4] has the potential to bridge this gap and satisfy the perspectives and concerns of both advocates and opponents. Numerous studies have demonstrated that NLP is able to extract some meaning from dictations, either live or retrospectively. This allows the captured information to be used instantly for Clinical Decision Support (CDS), which may improve outcomes at the point of care.
Although NLP carries much promise, it has not yet reached the level of accuracy and efficiency needed for general promotion and deployment. With further study and development of NLP, in conjunction with the rapid pace of development of computing technology (e.g., quantum computing [5]) and decreasing costs, we may soon attain the levels of accuracy needed for nationwide deployment. This may then resolve the issue of the lost “story”, thereby addressing the goals and concerns of all stakeholders.
- “Crossing the Quality Chasm: A New Health System for the 21st century” IOM, 2001
- Kalitzkus, Vera PhD, et al “Narrative-Based Medicine: Potential, Pitfalls, and Practice” The Permanente Journal/Winter 2009/Vol 13 No 1
- Johnson, Stephen B, et al: “JAMIA” 2008 15: 54-64
Submitted by Paul Z. Seville
Natural language processing and its future in medicine
Physicians regularly enter a huge amount of data for each patient into the electronic health record (EHR). Data can be entered either as narrative text (free text) or as coded (structured) data. Structured data entry makes information in the EHR easy to retrieve, but free text fields are used in EHRs to facilitate rapid data entry by clinicians, and free text is sometimes more appropriate for patient information that structured data does not describe well. Free text nevertheless creates a limitation, because clinical information locked in these fields is very hard to access. Retrieving that data is important for many purposes, such as automated decision support and statistical analysis.
Natural language processing (NLP) can offer a solution to this problem because it extracts words from free text, represents well-defined relations among those words, and attaches appropriate modifiers. NLP encodes data using terminology from a well-defined vocabulary and represents relations among the concepts. To understand textual language, NLP draws on several components, including syntactic, semantic, and domain knowledge.
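Encoding against a vocabulary can be sketched as follows. Everything in this example is an assumption made for illustration: the `VOCABULARY` table, the placeholder codes `X001` and `X002` (not real UMLS, SNOMED, or ICD identifiers), the `MODIFIERS` set, and the `encode` function. It normalizes synonyms to one preferred term and attaches a modifier only when the modifier word directly precedes the phrase.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabulary: surface phrase -> (preferred term, made-up code).
# The codes are placeholders, not real UMLS, SNOMED, or ICD identifiers.
VOCABULARY = {
    "heart attack": ("myocardial infarction", "X001"),
    "myocardial infarction": ("myocardial infarction", "X001"),
    "high blood pressure": ("hypertension", "X002"),
    "hypertension": ("hypertension", "X002"),
}
MODIFIERS = {"severe", "mild", "chronic", "acute"}


@dataclass
class CodedConcept:
    term: str
    code: str
    modifiers: list[str] = field(default_factory=list)


def encode(text: str) -> list[CodedConcept]:
    """Return coded concepts in text order, each with an adjacent modifier if any."""
    lowered = text.lower()
    found = []
    for phrase, (term, code) in VOCABULARY.items():
        for match in re.finditer(rf"\b{re.escape(phrase)}\b", lowered):
            before = lowered[: match.start()].split()
            mods = [before[-1]] if before and before[-1] in MODIFIERS else []
            found.append((match.start(), CodedConcept(term, code, mods)))
    return [concept for _, concept in sorted(found, key=lambda item: item[0])]


if __name__ == "__main__":
    for concept in encode("History of chronic high blood pressure and a heart attack."):
        print(concept)
```

Two different phrasings of the same condition end up with the same code, which is what makes the encoded output retrievable in a way that the original free text is not.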
Many systems use NLP to achieve their goals, including the SPRUS system, the MedLEE system, and the MENELAS system. Important resources for NLP include the UMLS and the SPECIALIST Lexicon, as well as grammar and domain models. There is some work investigating the use of SNOMED and ICD-10, like the UMLS, in NLP systems.
NLP can provide some important features in the future that would benefit the EHR. These could include a reasonable and accurate way to retrieve data from electronic health records. Such data could support billing more accurately than assigning ICD-9 codes manually at the time of discharge, and it could also support web searching for clinical information as well as statistics and research. Combined with XML tags, it could offer an easy way to retrieve data.
Continuous voice recognition systems can also be integrated with NLP to facilitate data entry and reduce the time physicians spend on it. This would help translate the textual report into structured, encoded data that is stored in real time along with the original text in a clinical repository. An NLP system would likely also help produce standardized output forms suitable for the web through the use of XML.
Natural language processing and its future in medicine, Friedman, C; Hripcsak, G
Submitted by (Tamer Etman)
- http://www.ischool.berkeley.edu/~hearst/text-mining.html accessed May 25, 2007.
- http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html#Medline accessed May 28, 2007.
- Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief Bioinform. 2005 Mar; 6(1):57-71.
- Friedman, C., & Hripcsak, G. (1999). Natural language processing and its future in medicine. Journal of the Association of American Medical Colleges.
- Friedman, C., Hripcsak, G., & Shablinsky, I. (1998). An evaluation of natural language processing methodologies. AMIA.
- Hripcsak, G., Friedman, C., Alderson, P. O., DuMouchel, W., Johnson, S., & Clayton, P. (1995). Unlocking clinical data from narrative reports: a study of natural language processing. Annals of Internal Medicine.
- Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications. John Benjamins Publishing Co.
- Janetzki, V., Allen, M., & Cimino, J. J. (2004). Using Natural Language Processing to Link from Medical text to On-Line Information Resources. MEDINFO.
- VandeVelde R, Degoulet P. Clinical Information Systems, A Component-Based Approach. New York: Springer, 2003.
- The on-ramp to EHR: Exploring the symbiotic relationship between EHR and transcription solutions. MedQuist 2006. 
- Melton GB, Hripcsak G. Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. JAMIA 12(4): 448, 2005.
- Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing. Ann Int Med 122(9): 681, 1995.