Natural language processing (NLP)

From Clinfowiki

Natural language processing (NLP) is an automated technique that converts narrative documents into a coded form that is appropriate for computer-based analysis. Capabilities that NLP provides in the context of healthcare include parsing a sentence into its component structures, understanding the medical vocabulary and clinical terms used, disambiguating the context in order to interpret the clinical terms correctly within the broader context of the documentation, and representing the processed information for further use.


Health care providers routinely capture large amounts of clinical data at the point of care. As we move into the era of Electronic Health Records (EHRs), much discussion has occurred about how best to enter relevant clinical data. Direct entry of data into the EHR by healthcare providers is important for obtaining the most accurate information. There are several options for data entry that have been considered.

Coded data entry may capture information most effectively. Free-text entry by healthcare providers limits the usefulness of the data for anything other than future healthcare encounters. However, coded entry systems applied at the point of care may reduce the productivity of the user, and coding systems may be too rigid or too succinct to capture the broad details and subtleties of the clinical encounter.

NLP technology is now maturing and showing great promise. Used properly and judiciously, it has a wide array of applications within the healthcare industry. It is currently used most commonly on medical records (for example, to detect adverse events that occur during hospitalizations and clinical conditions described in narrative radiology reports) and in coding for billing purposes. It remains unclear, however, whether natural language processors will eventually be able to detect a wide enough range of the complex events of clinical care to play a broader role in the coded data entry process in the EHRs of the future.

Natural Language Processing in EMRs

One of the goals of using EHRs in healthcare is to represent health information in a structured form so that it can be re-used for decision support and quality assurance. However, free text fields are often used in EHRs to facilitate rapid data entry by clinicians. These free text items create a major limitation because of the inability to reliably access the vast amount of clinical information that is locked in the application as free text. Natural language processing systems potentially offer a solution because they transform unstructured data into a more useful structured form by extracting individual words and concepts and representing well-defined relationships among words. Once information is extracted from text, it can be stored in any well-defined format. For example, information stored in XML can be used in web reports (Friedman & Hripcsak, 1999).
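As a minimal sketch of the last point, the following Python snippet serializes one extracted finding as XML using the standard library. The element names (`finding`, `problem`, `bodyloc`, `severity`) and the extracted values are assumptions made for illustration; they are not the output schema of any system described here.

```python
import xml.etree.ElementTree as ET


def finding_to_xml(finding: dict) -> str:
    """Serialize one extracted finding (a flat dict of strings) as an XML element."""
    root = ET.Element("finding")  # hypothetical element name
    for key, value in finding.items():
        child = ET.SubElement(root, key)
        child.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


# Hypothetical result of an extraction step for the phrase "severe chest pain".
extracted = {"problem": "pain", "bodyloc": "chest", "severity": "severe"}
xml_text = finding_to_xml(extracted)
print(xml_text)
```

Once the finding is in this form, a web report or decision-support rule can query it by element name instead of searching the original free text.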

The term natural language processing describes the function of software or hardware components in a computer system that analyze or synthesize spoken or written language. The term ‘natural’ is meant to distinguish human speech and writing from more formal languages, such as mathematical notations or programming languages, where the vocabulary and syntax are comparatively restricted (Jackson & Moulinier, 2007). Natural language processing (NLP) combines many disciplines, including information retrieval, artificial intelligence, machine learning and, above all, computational linguistics. The rules and concepts of computational linguistics drive the processing rules for methods and algorithms that are commonly applied in NLP (Nadkarni & Ohno-Machado 2011).

Research efforts to develop NLP systems began in the 1960s using “time-shared, mainframe computers that communicated by telephone lines to remote data-entry terminals and printers” and experienced a regrowth in the 1980s through the application of machine learning techniques made possible by advances in computer technology (Collen 2012). The goal of using artificial intelligence for NLP was to improve accuracy and reduce human interaction by applying sophisticated computer-driven methods to develop algorithms rather than relying on hand-written instructions (Spyns 1996).

Understanding Language

Understanding natural language involves three components (Friedman & Hripcsak, 1999):

  1. Syntax, or understanding the structure of sentences
  2. Semantics, or understanding the meaning of words and how they are combined to form the meaning of a sentence
  3. Domain knowledge, or information about the subject matter

As NLP systems process data, each stage relies on results from previous stages for accuracy. Often, different rules are applied at each stage, creating a complex system of dependent processes. For example, the syntactic stage may rely on tokenization processes, automatic spelling correction or normalization of terminology to detect phrases and sentences (Chomsky 2002). After phrases or complete sentences have been identified, the syntactic process parses the words into grammatical components and subcomponents, such as verbs, nouns, adjectives and other parts of speech (Li & Patrick 2012). One of the difficulties in creating meaning in a structured, progressive manner is that parts of a discourse may have distinctly different meanings in isolation; the true meaning of a phrase becomes clear only when the parts are combined into a complete discourse.
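The dependency between stages can be sketched as a small pipeline in which sentence splitting feeds tokenization, which in turn feeds terminology normalization. This is a deliberately naive illustration: the abbreviation table, the function names and the regular expressions are assumptions, and a real clinical system would use curated lexicons and far more robust rules.

```python
import re
from typing import List

# Hypothetical abbreviation table; a real system would use a curated clinical lexicon.
ABBREVIATIONS = {"sob": "shortness of breath", "pt": "patient", "hx": "history"}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and keep alphanumeric word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def normalize(tokens: List[str]) -> List[str]:
    """Expand known abbreviations; one abbreviation may expand to several tokens."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return expanded


def preprocess(text: str) -> List[List[str]]:
    """Run the stages in order; each stage consumes the previous stage's output."""
    return [normalize(tokenize(sentence)) for sentence in split_sentences(text)]


note = "Pt reports SOB. No hx of asthma."
result = preprocess(note)
print(result)
```

An error in an early stage propagates: if the sentence splitter breaks a sentence in the wrong place, every later stage works on the wrong unit of text.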

Natural language processing has made progress in the field of medicine because the domain is restricted. Medicine uses a sub-language that has less variety, ambiguity and complexity than general language, and that involves only the information and relations relevant to the particular subject. For example, there are informational categories associated with body location (e.g., chest), symptom (e.g., pain) and severity (e.g., severe). Despite this progress, developers have faced challenges, because basic algorithms for decoding contextual meaning require that a text or verbal discourse follow a recognized sentence structure. In a real-world clinical setting, however, clinical notes often contain spelling errors, abbreviations and acronyms, and partial or incorrectly formed sentences (Xu & Friedman 2008).
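The idea of informational categories can be illustrated with a small lookup that tags each word of a phrase with its category. The lexicon below is a hypothetical handful of entries chosen to match the examples in the text; it is not drawn from any real clinical vocabulary.

```python
# Hypothetical mini-lexicon mapping sub-language words to informational categories.
CATEGORY_LEXICON = {
    "chest": "body_location",
    "abdomen": "body_location",
    "pain": "symptom",
    "nausea": "symptom",
    "severe": "severity",
    "mild": "severity",
}


def tag_categories(phrase: str):
    """Return (word, category) pairs for every word found in the lexicon."""
    tags = []
    for raw_word in phrase.lower().split():
        word = raw_word.strip(".,;:")  # drop trailing punctuation
        category = CATEGORY_LEXICON.get(word)
        if category is not None:
            tags.append((word, category))
    return tags


tags = tag_categories("Severe chest pain.")
print(tags)
```

Because the sub-language is restricted, even a small lexicon covers a useful share of a note, which is part of why the medical domain has been tractable for NLP.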

Besides free text, natural language processing can also be used in conjunction with continuous voice recognition systems in EHRs. Integrating a voice recognition system with a natural language processing system substantially enhances the functionality of the voice system. Physicians can dictate notes in their usual fashion while the natural language processor translates the report into a structured encoded form in the background (Friedman & Hripcsak, 1999). Voice recognition, however, presents additional challenges. Systems that process voice recordings must be programmed to understand an even greater level of detail. Consequently, NLP systems must process data through several additional, increasingly complex stages beyond those mentioned above: phonology, the basic sounds (phonemes) used in human speech; morphology, the structure and construction of words from root words and affixes; pragmatics, the underlying meaning within the context of a given situation or scenario; and discourse, the larger meaning of a set of words and phrases (Raskin 1987).
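The morphology stage can be illustrated with a simple suffix analysis of medical terms. This sketch only strips one known suffix from the end of a word; the three-entry suffix table and the function name are assumptions for illustration, and real morphological analyzers handle prefixes, combining forms and irregular words as well.

```python
# A few common medical suffixes and their usual meanings (illustrative subset).
SUFFIX_MEANINGS = {
    "itis": "inflammation",
    "ectomy": "surgical removal",
    "megaly": "enlargement",
}


def split_suffix(word: str):
    """Split a word into (root, suffix, meaning); suffix and meaning are None if no match."""
    word = word.lower()
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(SUFFIX_MEANINGS, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix, SUFFIX_MEANINGS[suffix]
    return word, None, None


for term in ["Appendectomy", "hepatomegaly", "arthritis", "pain"]:
    print(term, split_suffix(term))
```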

Another application of natural language processing in EHRs is to provide the ability to link from the EHR to on-line information resources. This is done by parsing the plain text reports from Web-based EHRs and using the results to identify clinical findings in the text. This information is used to provide automated links to on-line information resources via Infobuttons (Janetzki, Allen, & Cimino, 2004). A study evaluating the automated detection of clinical conditions described in narrative reports found that the natural language processor was not distinguishable from physicians in how it interpreted the reports. It was also found to be superior to the other, non-physician comparison subjects; the study compared the processor with internists, radiologists, laypersons and other computer methods. This study suggests that natural language processing can extract clinical information from narrative reports in a manner that would support automated decision support and clinical research (Hripcsak, Friedman, Alderson, DuMouchel, Johnson, & Clayton, 1995).
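The linking step can be sketched as follows: once findings have been identified in a report, each one is turned into a search link to an on-line resource. The base URL and the `q` parameter are hypothetical placeholders; they are not taken from the cited paper or from any Infobutton specification.

```python
from urllib.parse import urlencode

# Hypothetical resource endpoint; substitute the real knowledge resource's search URL.
BASE_URL = "https://example.org/resources/search"


def build_resource_link(finding: str, base_url: str = BASE_URL) -> str:
    """Build a search link for one clinical finding, URL-encoding the query text."""
    return base_url + "?" + urlencode({"q": finding})


# Assume these findings were identified by an earlier NLP step.
findings = ["pleural effusion", "cardiomegaly"]
links = {finding: build_resource_link(finding) for finding in findings}
for finding, link in links.items():
    print(finding, "->", link)
```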

Common NLP Systems

All the systems that have been developed use different amounts of syntactic, semantic and domain knowledge. Some of the systems that are well known in the clinical world are:

  • LSP (Linguistic String Project) is a pioneer in general English language processing and has been adapted to medical text.
  • MENELAS was created by a consortium that aimed to provide better access to patient discharge summaries. Both LSP and MENELAS use comprehensive syntactic and semantic knowledge about the structure of the complete sentence.
  • The MedLEE system operates as an independent module of a clinical information system at New York Presbyterian Hospital and is used daily. It is the first NLP system used for actual patient care that was shown to improve care. It has also been integrated into voice recognition systems. The MedLEE system relies heavily on general semantic patterns interleaved with some syntax and also includes knowledge of the entire sentence (Friedman, Hripcsak, & Shablinsky, 1998).

Additional NLP Tools and Systems

The Electronic Medical Records and Genomics (eMERGE) Network has created a survey of commonly used NLP frameworks and tools.

General Frameworks

  • Apache Unstructured Information Management Architecture (UIMA): Java framework for developing NLP pipelines, released under the Apache 2 license. UIMA provides Eclipse plug-ins for developing and testing UIMA-based applications. UIMA wrappers exist for a variety of other Java-based NLP component libraries.
  • General Architecture for Text Engineering (GATE): Java framework for developing NLP pipelines, developed at the University of Sheffield (UK). GATE includes a number of rule-based NLP components, and GATE wrappers exist for a variety of other Java-based NLP libraries.
  • Natural Language Toolkit (NLTK): A Python library for developing NLP applications. This framework is accompanied by a book, which is useful for pedagogical purposes.

Clinical Natural Language Processing Tools

  • Clinical Text and Knowledge Extraction System (cTAKES): cTAKES is built on top of Apache UIMA, and is composed of sets of UIMA processors that are assembled together into pipelines. Some of the processors are wrappers for Apache OpenNLP components, and some are custom built. cTAKES was developed at the Mayo Clinic, and is distributed by the Open Health NLP Consortium.
  • Computational Language and Education Research toolkit (cleartk): cleartk has been developed at the University of Colorado at Boulder, and provides a framework for developing statistical NLP components in Java. It is built on top of Apache UIMA.
  • Health Information Text Extraction (HITEx): HITEx was developed as part of the i2b2 project. It is a rule-based NLP pipeline based on the GATE framework.
  • NegEx: NegEx is a tool developed at the University of Pittsburgh to detect negated terms in clinical text. The system uses trigger terms to determine likely negation scenarios within a sentence.
  • ConText: ConText is an inference algorithm developed to identify patients with existing medical conditions. It infers from the context of a patient’s medical record whether a condition found there is still present. The program is an extension of NegEx and was also developed at the University of Pittsburgh. ConText extends NegEx to detect not only negated concepts but also temporality (recent, historical or hypothetical scenarios) and the experiencer of the concept (patient or other).
  • National Library of Medicine's MetaMap: MetaMap is a comprehensive concept tagging system built on top of the Unified Medical Language System (UMLS). It requires an active UMLS Metathesaurus License Agreement for use.
  • MedEx, a tool for extracting medication information from clinical text: MedEx processes free-text clinical records to recognize medication names and signature information, such as drug dose, frequency, route, and duration. Use is free with a UMLS license. It is a standalone application for Linux and Windows.
  • SecTag, a section tagging hierarchy: SecTag recognizes note section headers using NLP, Bayesian, spelling correction, and scoring techniques. The linked download includes the SQL and CSV files for the section terminologies. Use is free with either a UMLS or LOINC license.
  • Stanford Named Entity Recognizer (NER): Stanford’s NER is a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German.
  • Stanford CoreNLP (CoreNLP): Stanford CoreNLP is an integrated suite of natural language processing tools for English in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference resolution.
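To make the trigger-term idea behind NegEx more concrete, here is a heavily simplified sketch in Python. It is not the NegEx implementation: the trigger list, the termination terms and the five-token window are assumptions chosen for illustration, and only triggers that precede the concept are handled.

```python
import re

# Illustrative subsets; the real NegEx trigger lists are much larger.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
TERMINATION_TERMS = {"but", "however", "except"}
WINDOW = 5  # assumed number of tokens to look back from the concept


def is_negated(sentence: str, concept: str) -> bool:
    """Return True if a negation trigger precedes the concept within the window."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != concept_tokens:
            continue
        scope = tokens[max(0, i - WINDOW):i]
        # A termination term ends the reach of any trigger that came before it.
        for j in range(len(scope) - 1, -1, -1):
            if scope[j] in TERMINATION_TERMS:
                scope = scope[j + 1:]
                break
        scope_text = " ".join(scope)
        for trigger in NEGATION_TRIGGERS:
            if re.search(r"\b" + re.escape(trigger) + r"\b", scope_text):
                return True
    return False


print(is_negated("Patient denies chest pain.", "chest pain"))
print(is_negated("No fever but has cough.", "cough"))
```

In the second call the termination term "but" cuts off the scope of "no", so "cough" is not treated as negated even though a trigger appears earlier in the sentence.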

Other Clinical NLP Systems

MediClass: The MediClass processor was developed at Kaiser Permanente’s CER Hub. MediClass ("Medical Classifier") is a software technology that incorporates natural language processing and knowledge-based systems to automatically identify clinical events in all types of data. MediClass applications process both coded data and text-based language in electronic health records to discern and measure study-defined aspects of health and healthcare.

Other Useful Information

A 2010 article by Stanfill, Hersh, et al. entitled “A systematic literature review of automated clinical coding and classification systems” provides one of the more recent comprehensive reviews of coding and classification systems used in NLP.

Electronic Health Record (EHR) adoption has been very slow in the United States due to a number of factors [1]. One in particular, which is more idiosyncratic to medicine because of its unique culture, is the method of data entry. Providers typically dictate because it is the most efficient and effective method and, most importantly, the one they are accustomed to.

One of the main reasons proponents give for dictation is that it retains the meaning of the provider-patient dialogue, also referred to as “the story” [2]. Those opposed state that the need for, and benefits of, structured and encoded data far outweigh the inconvenience of data entry methods other than dictation [3].

Natural Language Processing (NLP) [4] has the potential to bridge this gap and satisfy the perspectives and concerns of both the advocates and the opponents. Numerous studies have demonstrated that NLP is able to extract some meaning from dictations, either live or retrospectively. This allows the captured information to be used instantly for Clinical Decision Support (CDS), leading to improved outcomes at the point of care.

Although NLP carries much promise, it has not yet reached the level of accuracy and efficiency needed for general promotion and deployment. With further study and development of NLP, in conjunction with the rapid pace of development of computing technology (e.g., quantum computing [5]) and decreasing costs, we may soon attain the levels of accuracy needed for nationwide deployment. This may then resolve the issue of the lost “story”, thereby addressing the goals and concerns of all stakeholders.

  1. “Crossing the Quality Chasm: A New Health System for the 21st century” IOM, 2001
  2. Kalitzkus, Vera PhD, et al “Narrative-Based Medicine: Potential, Pitfalls, and Practice” The Permanente Journal/Winter 2009/Vol 13 No 1
  3. Johnson, Stephen B, et al: “JAMIA” 2008 15: 54-64

Submitted by Paul Z. Seville

Natural language processing and its future in medicine

Physicians regularly enter a large amount of data for each patient into the electronic health record (EHR). Data can be entered either as narrative (free) text or as coded (structured) data. Structured data entry makes information in the EHR easy to retrieve, but free-text fields are used in EHRs to facilitate rapid data entry by clinicians, and free text is sometimes more appropriate for patient information that structured data describes poorly. Free text nevertheless creates a limitation, because clinical information locked in these fields is very hard to access. Retrieving this data is important for many purposes, such as automated decision support and statistical analysis.

Natural language processing (NLP) can offer a solution to this problem: it extracts words from free text, presents well-defined relations among these words, and includes appropriate modifiers for the records. NLP encodes data using terminology from a well-defined vocabulary and represents relations among the concepts. To understand textual language, NLP involves several components, including syntactic, semantic and domain knowledge components.

Many systems use NLP to achieve their goals, among them the SPRUS system, the MedLEE system and the MENELAS system. Important resources for NLP include the UMLS and the SPECIALIST Lexicon, as well as grammar and domain models. Some work is investigating the use of SNOMED and ICD-10, like the UMLS, in NLP systems.

NLP can provide some important features in the future that can benefit the EHR. These could include a reasonable and accurate way to retrieve data from electronic health records for billing (more accurate than manually assigning ICD-9 codes to patients at the time of discharge), for web searches about clinical information, and for statistics and research. NLP output could also be marked up with XML tags to make data easy to retrieve.

Continuous voice recognition systems can also be integrated with NLP to facilitate data entry and reduce the time physicians spend on it. This will help translate the textual report into structured, encoded data that is stored in real time, along with the original text, in a clinical repository. An NLP system would also likely help produce standardized output forms suitable for the web through the use of XML.


Natural language processing and its future in medicine, Friedman, C; Hripcsak, G

Submitted by (Tamer Etman)


Friedman, C., & Hripcsak, G. (1999). Natural language processing and its future in medicine. Journal of the Association of American Medical Colleges.
Friedman, C., Hripcsak, G., & Shablinsky, I. (1998). An evaluation of natural language processing methodologies. AMIA.
Hripcsak, G., Friedman, C., Alderson, P. O., DuMouchel, W., Johnson, S., & Clayton, P. (1995). Unlocking clinical data from narrative reports: a study of natural language processing. Annals of Internal Medicine.
Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications. John Benjamins Publishing Co.
Janetzki, V., Allen, M., & Cimino, J. J. (2004). Using Natural Language Processing to Link from Medical text to On-Line Information Resources. MEDINFO.


  1. VandeVelde R, Degoulet P. Clinical Information Systems, A Component-Based Approach. New York: Springer, 2003.
  2. The on-ramp to EHR: Exploring the symbiotic relationship between EHR and transcription solutions. MedQuist 2006.
  3. Melton GB, Hripcsak G. Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. JAMIA 12(4): 448, 2005.
  4. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing. Ann Int Med 122(9): 681, 1995.


  1. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011 Sep-Oct;18(5):544-51. doi: 10.1136/amiajnl-2011-000464.
  2. Collen MF. Computer Medical Databases: the First Six Decades (1950-2010). London: Springer-Verlag London Ltd., 2012.
  3. Spyns P. Natural language processing in medicine: an overview. Methods Inf Med. 1996 Dec; 35(4-5):285-301.
  4. Raskin, V. (1987). Linguistics and natural language processing. Nirenburg (1987), 42-58.
  5. Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med. 1999 Aug;74(8):890-5.
  6. Chomsky N. Syntactic Structures. de Gruyter Mouton. 2002.
  7. Li M, Patrick J. Extracting temporal information from electronic patient records. AMIA Annu Symp Proc. 2012;2012:542-51. Epub 2012 Nov 3.

Submitted by (TJarmon)