Knowledge Extraction of Thoracic Radiology Reports
Using Statistical Natural Language Processing
Leandro Zerbinatti; Lincoln de Assis Moura Jr
Introduction
This work presents a health informatics study that analyzes
chest X-ray reports using statistical natural language processing
methods, with the aim of supporting interoperability between
health systems.
Results
Another 200 chest X-ray reports were selected for the
term-tagging experiment against the reference terms. The
efficiency obtained, defined as the percentage of the reports
that was labeled, was 45.55%.
Zipf’s constant was studied, and it was found that a few words
make up the majority of the reports, while most of the words
have no statistical significance.
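As an illustration of the rank-frequency analysis behind this observation, the following minimal sketch counts word frequencies and reports what share of all tokens the most frequent words cover. It uses only the Python standard library rather than any particular toolkit. The function names and the toy reports are assumptions made for the example and are not taken from the study.

```python
from collections import Counter
import re


def rank_frequency(texts):
    """Return (rank, word, count, rank * count) rows, most frequent word first.

    Under Zipf's law the product rank * count stays roughly constant.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, rank * count))
    return rows


def coverage_of_top(texts, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    rows = rank_frequency(texts)
    total = sum(count for _, _, count, _ in rows)
    if total == 0:
        return 0.0
    return sum(count for _, _, count, _ in rows[:k]) / total


# Toy placeholder reports (hypothetical, not from the study corpus).
toy_reports = [
    "pulmoes de transparencia normal",
    "area cardiaca normal",
    "seios costofrenicos livres e area cardiaca normal",
]

if __name__ == "__main__":
    for row in rank_frequency(toy_reports)[:3]:
        print(row)
    print(coverage_of_top(toy_reports, 3))
```

On a real corpus, plotting count against rank on logarithmic axes would be the usual way to inspect how closely the distribution follows Zipf's law.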
Subsequently, articles, prepositions and pronouns were
incorporated into the reference terms under a linking-concept
class. It is important to note that these terms do not add
health knowledge to the text. With them, the efficiency rose to
73.23%, a significant increase over the value obtained
previously.
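The effect of adding linking words can be sketched as follows. This is a simplified illustration, not the labeler used in the study: it assumes that efficiency is the percentage of tokens matching a single-word reference term, and the clinical terms and reports are hypothetical placeholders. Only the 16 linking words are taken from the poster.

```python
import re

# The 16 Portuguese linking words listed in the poster.
LINKING_WORDS = {
    "a", "com", "da", "de", "do", "dos", "e", "em",
    "no", "na", "que", "ao", "as", "nas", "o", "os",
}


def labeling_efficiency(reports, reference_terms):
    """Percentage of tokens that match a reference term.

    Assumption: token-level matching of single-word terms only;
    multi-word terms (n-grams, phrases) are ignored in this sketch.
    """
    total = 0
    labeled = 0
    for report in reports:
        for token in re.findall(r"\w+", report.lower()):
            total += 1
            if token in reference_terms:
                labeled += 1
    return 100.0 * labeled / total if total else 0.0


# Hypothetical clinical terms and toy reports, for illustration only.
clinical_terms = {"derrame", "pleural"}
toy_reports = [
    "derrame pleural a direita",
    "area cardiaca no limite da normalidade",
]

if __name__ == "__main__":
    before = labeling_efficiency(toy_reports, clinical_terms)
    after = labeling_efficiency(toy_reports, clinical_terms | LINKING_WORDS)
    print(f"clinical terms only: {before:.2f}%")
    print(f"with linking words:  {after:.2f}%")
```

Because linking words occur very often, adding even a small set of them to the reference terms raises the labeled share noticeably, which matches the direction of the reported increase.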
The identified terms were translated and compared with
SNOMED-CT, an existing standardized international medical
vocabulary. Terms that had a complete and direct correspondence
with the translated terms were incorporated into the reference
terms along with their class and word identifier.
Material and Methods
The method applied in our work is depicted in the figure below. In
summary, the following steps were performed:
- The selection of the reference reports from the corpus. First,
1000 reports were selected; later the study was repeated with
another 1000 reports, for a total of 2000 reports;
- The application of statistical natural language processing,
extracting the words, n-grams and phrases;
- The construction of the reference terms in Portuguese;
- The application of the clinical reports labeler.
Figure 1 – Token frequency
Figure 2 – N-gram frequency
The words entered as reference terms in Portuguese were 'a',
'com', 'da', 'de', 'do', 'dos', 'e', 'em', 'no', 'na', 'que', 'ao', 'as', 'nas',
'o' and 'os'.
These words are not associated with a SNOMED-CT concept.
Their function is to assign or connect concepts within sentences.
The improvement achieved is a direct result of the statistical
representation that these words have in the experimental set.
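The extraction step in the method list (words and n-grams) can be illustrated with a short sketch. It is a minimal version under stated assumptions: tokens are runs of word characters, n-grams do not cross report boundaries, and the helper names and toy reports are hypothetical rather than taken from the study.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a report and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(reports, n):
    """Count n-grams over a collection of reports, one report at a time."""
    counts = Counter()
    for report in reports:
        counts.update(ngrams(tokenize(report), n))
    return counts


# Toy placeholder reports (hypothetical, not from the study corpus).
toy_reports = [
    "area cardiaca normal",
    "area cardiaca aumentada",
]

if __name__ == "__main__":
    for gram, count in ngram_frequencies(toy_reports, 2).most_common():
        print(" ".join(gram), count)
```

Frequent n-grams found this way are natural candidates for multi-word reference terms, since they tend to correspond to recurring clinical expressions.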
The clinical reports, properly labeled with references to the
vocabulary, are the structured documents required for
interoperability between systems based on ontologies described
in OWL, HL7 CDA and archetypes.
[Figure: method workflow diagram. Elements: Corpus; Reference Reports; Statistical Natural Language Processing (Words, N-grams, Phrases); Generation of Portuguese Vocabulary; Reference Terms; Clinical Reports Labeler; Labeled Reports; Classifier; Interoperability Models (Ontologies, HL7/CDA, Archetypes)]
Conclusion
Adherence to Zipf's law was confirmed for the corpus
used. According to this law, a few words account for most of
what we use to communicate, which follows from the principle
of least effort.
The labeling efficiency obtained was 45.55%. When 16 words
were added, including articles, prepositions and other
words that link concepts, the labeling efficiency
increased to 73.26%.
This increase in efficiency demonstrates that a few words
have great statistical significance and are responsible for
much of the composition of the corpus.
The representation standards for clinical information used
to enable semantic interoperability were ontologies
expressed in OWL, HL7/CDA and openEHR Foundation
archetypes.