Knowledge Extraction of Thoracic Radiology Reports Using Statistical Natural Language Processing

Leandro Zerbinatti; Lincoln de Assis Moura Jr

Introduction

This work presents a health informatics study that analyzes chest X-ray reports using statistical natural language processing methods, with the aim of supporting interoperability between health systems.

Results

Another 200 chest X-ray reports were selected for the term-tagging experiment against the reference terms. The efficiency obtained, that is, the percentage of the reports that was labeled, was 45.55%.

Zipf's constant was studied, and it was found that a few words make up most of the report text, while most words have no statistical significance.

Subsequently, articles, prepositions and pronouns were incorporated into the reference terms under a class of linking concepts. It is important to note that these terms add no health knowledge to the text. With them, the efficiency rose to 73.23%, a significant increase over the value obtained previously.

The identified terms were then translated and compared with SNOMED-CT, an existing standardized medical vocabulary with international terminology. Terms that had a complete and direct correlation with the translated terms were incorporated into the reference terms along with their class and word identifier.

Material and Methods

The method applied in our work is depicted in the figure below and can be summarized in the following steps:

- The selection of the reference reports from the corpus.
An initial set of 1000 reports was selected, and the study was later repeated with another 1000 reports, for a total of 2000 reports;
- The application of statistical natural language processing, extracting the words, n-grams and phrases;
- The construction of the reference terms in Portuguese;
- The application of the clinical reports labeler.

[Figure: method workflow, with the stages Corpus; Reference Reports; Statistical Natural Language Processing (Words, N-grams, Phrases); Generation of Portuguese Vocabulary; Reference Terms; Clinical Reports Labeler; Labeled Reports; Classifier; Interoperability Models (Ontologies, HL7/CDA, Archetypes)]

Figure 1: Token frequency

Figure 2: N-gram frequency

The words entered as reference terms in Portuguese were 'a', 'com', 'da', 'de', 'do', 'dos', 'e', 'em', 'no', 'na', 'que', 'ao', 'as', 'nas', 'o' and 'os'. These words are not associated with any SNOMED-CT concept. Their function is to assign or connect concepts within sentences. The improvement achieved is a direct result of the statistical weight that these words have in the experimental set.

Clinical reports properly labeled with references to a vocabulary are the structured documents required for interoperability between systems based on ontologies described in OWL, HL7 CDA and archetypes.

Conclusion

Adherence to Zipf's law was confirmed for the corpus used. According to this law, a small number of words accounts for most of what we use to communicate; it is based on the principle of least effort.

The labeling efficiency initially obtained was 45.55%. When 16 words were added, including articles, prepositions and other words that link concepts, the labeling efficiency increased to 73.26%. This increase demonstrates that a few words carry great statistical weight and are responsible for much of the composition of the corpus.

The standards for representing clinical information used to enable semantic interoperability were ontologies expressed in OWL, HL7 CDA and the archetypes of the openEHR Foundation.

References

[1] Castilla AC.
Instrumento de Investigação Clínico-Epidemiológica em Cardiologia Fundamentado no Processamento de Linguagem Natural. Thesis, Faculdade de Medicina de São Paulo, USP, SP, 2007.
[2] Chang, C.-H., Kayed, M., Girgis, M. R., & Shaalan, K. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, 2006.
[3] Chomsky, N. Language and Mind. Cambridge University Press, 2006.
[4] Honorato, D. F. Metodologia para Mapeamento de Informações não estruturadas, descritas em laudos médicos para uma representação atributo-valor. Dissertation, Instituto de Ciências Matemáticas e Computação, USP, 2008.
[5] Manning, C. D., Schütze, H. Foundations of Statistical Natural Language Processing. Stanford, The MIT Press, 2003.
[6] Nardon, F. B. Compartilhamento de Conhecimento em Saúde utilizando Ontologias e Banco de Dados Dedutivos. Thesis, Escola Politécnica de São Paulo, USP, SP, 2003.
[7] NLTK, Natural Language Toolkit, http://nltk.sourceforge.net/index.php/Main_Page, accessed on 24/07/2008.
[8] SNOMED, Systematized Nomenclature of Medicine, http://snomed.org/, accessed on 25/07/2008.
[9] UMLS, Unified Medical Language System, http://www.nlm.nih.gov/research/umls/, accessed on 25/07/2008.
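
Illustrative sketch

The word-frequency analysis and the labeling efficiency discussed in this work can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy reports and the set of clinical terms are hypothetical, and efficiency is assumed here to mean the percentage of word tokens that match a reference term, which is only one possible reading of the measure. The 16 linking words are the ones listed in the text. Only the Python standard library is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (runs of letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent first.

    Under Zipf's law, rank * count stays roughly constant on a large corpus.
    """
    counts = Counter(tokens)
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def labeling_efficiency(tokens, reference_terms):
    """Percentage of tokens that match a reference term (assumed definition)."""
    if not tokens:
        return 0.0
    labeled = sum(1 for token in tokens if token in reference_terms)
    return 100.0 * labeled / len(tokens)


# Toy reports, invented for illustration only (not taken from the corpus).
reports = [
    "Opacidade no lobo superior do pulmão direito",
    "Aumento da área cardíaca com sinais de congestão",
]
tokens = [token for report in reports for token in tokenize(report)]

# Hypothetical clinical reference terms.
clinical_terms = {"opacidade", "lobo", "pulmão", "congestão"}

# The 16 linking words listed in the text.
linking_words = {
    "a", "com", "da", "de", "do", "dos", "e", "em",
    "no", "na", "que", "ao", "as", "nas", "o", "os",
}

before = labeling_efficiency(tokens, clinical_terms)
after = labeling_efficiency(tokens, clinical_terms | linking_words)

print(f"clinical terms only: {before:.2f}%")   # 4 of 15 tokens: 26.67%
print(f"with linking words:  {after:.2f}%")    # 9 of 15 tokens: 60.00%
```

Even on this toy input, adding the linking words raises the labeled share considerably, which mirrors the effect reported above. On a real corpus, the rows returned by `rank_frequency` could be plotted on log-log axes to inspect the Zipf behavior.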