medical informatics and health informatics
Health Informatics
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar
S. Kaisar Alam
President & Chief Engineer at
Imagine Consulting Services
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to
publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or
the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright
material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400.
CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification
and explanation without intent to infringe.
Preface xxiii
Chapter Outlines xxv
Classroom Use of this Textbook xxix
Acknowledgments xxxi
About the Authors xxxiii
1 Introduction 1
1.1 Informatics 2
1.2 Modeling Healthcare Information 3
1.2.1 Data Abstraction 3
1.2.2 Raw Data to Information to Knowledge 4
1.2.3 Inference and Learning 4
1.3 Medical Informatics 5
1.3.1 Health Informatics 5
1.3.2 Clinical Informatics 5
1.3.2.1 Nursing informatics 6
1.3.2.2 Pharmacoinformatics 6
1.3.3 Patients’ Privacy and Confidentiality 6
1.4 Computational Health Informatics 7
1.4.1 Acceptance and Adoption 7
1.4.2 Emulating Human–Human Interactions 8
1.4.3 Improving Clinical Interfaces 8
1.4.4 Privacy and Security 9
1.5 Motivation and Learning Outcomes 9
1.6 Overview of Computational Health Informatics 10
1.6.1 Medical Databases 10
1.6.1.1 Electronic medical records 10
1.6.1.2 Information retrieval issues 12
1.6.1.3 Information de-identification 12
1.6.1.4 Maintaining patient privacy 13
1.6.1.5 Standardized medical knowledge bases 13
1.6.1.6 Automated data collection 13
1.6.2 Medical Information Exchange 14
1.6.2.1 Standards for information exchange 14
1.6.2.2 Types of connectivity 15
1.6.3 Integration of Electronic Health Records 16
1.6.3.1 Accessing from heterogeneous databases 16
1.6.3.2 Heterogeneity and interoperability 17
1.6.4 Knowledge Bases for Health Vocabulary 18
1.6.4.1 LOINC 18
1.6.4.2 MedDRA 18
1.6.4.3 SNOMED 18
1.6.4.4 ICD 18
1.6.5 Concept Similarity and Ontology 19
1.6.6 Interfaces 20
1.6.6.1 Visual interfaces 20
1.6.6.2 Natural language interfaces 20
1.6.7 Intelligent Modeling and Data Analysis 21
1.6.7.1 Hidden Markov model 21
1.6.7.2 Uncertainty-based reasoning 22
1.6.7.3 Fuzzy logic 22
1.6.7.4 Bayesian probabilistic network 22
1.6.7.5 Speech-to-text conversion 22
1.6.7.6 Text analysis and generation 22
1.6.7.7 Heuristic reasoning 23
1.6.8 Machine Learning and Knowledge Discovery 23
1.6.8.1 Clustering 23
1.6.8.2 Regression analysis 24
1.6.8.3 Decision trees 25
1.6.8.4 Data mining 25
1.6.9 Medical Image Processing and Transmission 25
1.6.9.1 Image processing techniques 27
1.6.9.2 Medical image transmission 27
1.6.10 Biosignal Processing 27
1.6.10.1 ECG 28
1.6.10.2 EEG 28
1.6.10.3 MEG 28
1.6.11 Clinical Data Analytics 29
1.6.11.1 Evidence-based medicine 29
1.6.11.2 Survivability and hazard analysis 30
1.6.11.3 Randomized clinical trials 30
1.6.11.4 Clinical decision support system (CDSS) 30
1.6.11.5 Biomarkers discovery 30
1.6.12 Pervasive Health Care 31
1.6.12.1 Patient-care coordination 31
1.6.13 Bioinformatics for Disease and Drug Discovery 32
1.6.13.1 Biochemical reactions pathways 32
1.6.13.2 Genetic disease discovery 33
1.6.13.3 Vaccine development 33
1.6.13.4 Drug discovery 34
1.6.14 Pharmacokinetics and Drug Efficacy 34
1.6.14.1 Pharmacogenetics 35
1.7 Summary 35
1.8 Assessment 37
1.8.1 Concepts and Definitions 37
1.8.2 Problem Solving 37
1.8.3 Extended Response 40
Further Reading 41
2 Fundamentals 47
2.1 Data Modeling 47
2.1.1 Basic Data Structures 47
2.1.1.1 Histograms 48
2.1.1.2 Records in database 48
2.1.2 Modeling N-Dimensional Feature-Space 48
2.1.2.1 Proximity of multidimensional data 50
2.1.3 Modeling Graphs 51
2.1.3.1 Modeling graphs as matrices 51
2.1.3.2 Modeling graphs as a set of vertices 52
2.1.3.3 Modeling graphs as a set of edges 52
2.1.4 Trees for Database Search 53
2.1.4.1 Interval-based search 54
2.1.4.2 Limitations of binary trees 54
2.1.4.3 B+ trees for database access 54
2.1.4.4 PATRICIA tree – fast string-based search 55
2.1.5 Spatial Trees for Multidimensional Data 56
2.1.5.1 Quad tree 56
2.1.5.2 K-D (K-dimensional) tree 57
2.1.5.3 R (rectangular) tree 58
2.1.5.4 SS (similarity search) tree 59
2.1.5.5 VP (vantage-point) tree 60
2.1.6 Trees for Multidimensional Database Search 61
2.1.6.1 K-D-B tree and variants 61
2.1.7 Time-Series Data 63
2.1.7.1 Representing time-series data 63
2.1.7.2 Indexing structure 64
2.1.7.3 ISAX-based indexing 64
2.1.8 Trees for Spatiotemporal Access 65
2.1.8.1 Time-parameterized R-trees 65
2.2 Digitization of Sensor Data 66
2.2.1 Analog to Digital Conversion 66
2.2.1.1 Standardized sound format 67
2.2.1.2 Error correction and preprocessing 67
2.2.2 Digital Representation of Images 68
2.2.2.1 Proximity preserving image representation 68
2.2.2.2 Standardized image formats 69
2.2.2.3 Standardized video formats 69
2.2.3 Image Compression 70
2.2.3.1 Huffman coding 71
2.2.3.2 Segmentation and image compression 71
2.2.3.3 Compression in digital image formats 71
2.3 Approximate String Matching 71
2.3.1 Hamming Distance 72
2.3.2 Edit-Distance 72
2.3.2.1 Jaro–Winkler distance 72
2.3.2.2 Levenshtein edit-distance 73
2.3.3 Applications of Approximate String Matching 74
2.3.3.1 Dynamic programming 74
2.4 Statistics and Probability 75
2.4.1 Statistics 75
2.4.1.1 Basic metrics 76
2.4.1.2 Correlation 76
2.4.2 Probability 77
2.4.2.1 Bayes’ theorem 78
2.4.3 Probability Distribution Functions 78
2.4.3.1 Gaussian distribution 78
2.4.3.2 Bivariate Gaussian distribution 79
2.4.3.3 Other distributions 79
2.4.4 Hypothesis and Verification 80
2.4.4.1 Confidence intervals and margin-of-errors 80
2.4.4.2 Hypothesis testing 81
2.4.5 Curve Fitting 81
2.4.5.1 Fitting a straight line 82
2.5 Modeling Multimedia Feature Space 83
2.5.1 Texture Modeling 83
2.5.1.1 Histogram as texture 84
2.5.1.2 Gradients as texture 84
2.5.1.3 Run-length matrices 85
2.5.1.4 Hurst operator 86
2.5.1.5 Cooccurrence (SGLD) matrix 87
2.5.1.6 Local binary pattern 88
2.5.1.7 Gabor filters 89
2.5.1.8 Wavelets 89
2.5.2 Shape Modeling 90
2.5.2.1 Contour-based techniques 90
2.6 Similarity-Based Search Techniques 92
2.6.1 Matching Query and Database Entity 93
2.6.2 Tree Traversal Techniques 93
2.6.2.1 Traversing R+ tree 93
2.6.2.2 Traversing SS-tree 94
2.6.2.3 Traversing K-D and K-D-B trees 94
2.6.2.4 Traversing VP trees 94
2.7 Temporal Abstraction and Inference 95
2.7.1 Modeling Time 95
2.7.2 Time Interval-Based Matching 95
2.7.2.1 Dynamic time warping 97
2.7.3 Temporal Analysis 98
2.7.4 Knowledge-Directed Temporal Analysis 99
2.8 Types of Databases 100
2.8.1 Relational Database 100
2.8.1.1 Limitations of relational databases 101
2.8.2 Object-Based Databases 101
2.8.2.1 Types of object-based databases 102
2.8.3 Multimedia Databases 102
2.8.4 Temporal Databases 102
2.8.4.1 Queries in temporal databases 103
2.8.4.2 Issues in temporal databases 103
2.8.5 Knowledge Bases 103
2.8.6 Distributed Databases and Knowledge Bases 103
2.9 Middleware for Information Exchange 104
2.9.1 eXtended Markup Language (XML) 104
2.9.2 SOAP and Message Envelope 105
12 Conclusion 531
12.1 Evolution of Health Informatics 532
12.2 Evolution of Standards 533
Since the dawn of civilization, doctors and nurses have strived to relieve people from their pain. Continuous
improvement in science and information technology has enhanced the efforts of doctors and nurses by
giving them better tools to archive, analyze and transmit clinical data. Information technology promises
to provide available medical information seamlessly to providers and caregivers so they can optimize
their efforts for the best possible care.
In the last two decades, the increasing presence of computer processing has rendered health informa-
tion widely available. Combined with computational modeling and the development of distributed data-
bases, clinical data is being archived and analyzed using machine learning techniques and data mining,
generating a form of knowledge never seen before. This knowledge is improving life-expectancy by better
disease management, development of new vaccines, and drugs with reduced development-cycle time.
It is envisioned that, in the future, the seamless integration of information technology, intelligent
analysis techniques and medical science will provide quality care for an affordable price by incorporating
better clinical data analysis, providing pervasive care, removing duplicate medical treatment and labora-
tory data analysis and making data available electronically to collaborating healthcare providers.
The flow of information has raised many issues such as data-format standardizations, adoption of
technology and the need for intelligent user-friendly interfaces for the end users such as patients, doctors,
hospitals, nurses, pharmacies, insurance providers, policy makers and clinical researchers.
Despite the exponential growth of this multidisciplinary field, there has not been a single textbook
that provides the computational aspect of health informatics for both software developers and a new
generation of “Health Informatics Scientists”—the books written by clinical scientists present the topic
from the perspective of a clinical practitioner. There is a need for a textbook in Computational Health
Informatics that can prepare computer science or information technology students to understand the com-
putational techniques used in health informatics, along with the related medical concepts.
This book describes various computational techniques, including biostatistics, heterogeneous data-
bases, artificial intelligence, signal analysis, bioinformatics, image analysis, data communication for
transmission of clinical data and medical images and their application to clinical data analysis, as well as
management of electronic health records and their seamless integration to connect healthcare providers.
The book also discusses emerging areas of telemedicine, pervasive care, remote monitoring and bioinfor-
matics for the discovery of drugs, including pharmacokinetics and pharmacodynamics.
This textbook is based upon the Computational Health Informatics course that I have been teaching
since 2012, first to graduate students and then to senior-level undergraduate students beginning in 2014.
The course content evolved along with my understanding of the knowledge and concepts that students lack
but need to develop software for health informatics. As I started writing the book, the course material also
evolved along with my knowledge. I included new material based upon my research on the ongoing evolution
of Computational Health Informatics.
Javed, Kaisar and I committed to writing this textbook in 2015. It has taken four long years to come
to fruition, due to our other commitments as well as the need to do extensive research of the scattered
material available across multiple disciplines. When Leon Sterling came to know about our efforts, he
graciously contributed an important chapter about the need for new technology to meet patients’
emotional needs and satisfaction before it can be successfully adopted. The book itself has gone through
two revisions.
This textbook will assist (1) computer science students to understand concepts needed to develop tech-
niques and healthcare software; and (2) medical students and practitioners to understand the computational
background and concepts for healthcare software and data management. The material is sufficient for a
one-semester course at the senior or first-year graduate level. The book dwells on concepts and techniques;
however, specific in-depth algorithms have been avoided. We believe that the knowledge of the concepts
and techniques discussed will prepare the students to follow the necessary algorithms.
In my classes, I could cover Chapters 1 through 7, followed by Chapters 8 and 9. Other instructors
may find other combinations, including Chapter 10 on Bioinformatics for Drug Discovery, to be useful.
Because of the diversity of topics, I recommend sufficient classroom interactions between the instructors
and the students.
We hope that this book will provide a solid foundation to generate a new class of medical technocrats
who will understand and apply computational methods to facilitate patient-friendly automation in health-
care and improve the interpretation of clinical data.
We assume that students will have a background of two semesters of programming, introductory knowl-
edge of data structure concepts, and some knowledge of statistics and computer networks. The book
assumes that students can write at least 300 lines of code for developing projects. The book is divided into
12 chapters, including a concluding chapter. We have explained concepts in simple intuitive language at
an abstract level. We have described examples and case studies as needed.
Chapter 1 introduces informatics and data modeling, and describes modeling of health-related infor-
mation using the computational techniques to archive, retrieve, transmit, and analyze clinical and
patient-centric data. It introduces classifications of medical informatics and defines computational health
informatics. It describes the components of electronic health records, including medical images and trans-
parent integration of medical data from heterogeneous sources, including healthcare providers, medical
data warehouse, patients, pharmacy, hospitals and insurance agencies. It describes the need for secure
transmission of medical data between sources and user-friendly human–computer interfaces.
Chapter 2 describes foundational concepts derived from the needs of archiving, retrieving, transmitting
and intelligently analyzing data to extract human comprehensible knowledge from clinical data. It describes
data abstractions such as trees, graphs, strings and their matching; image modeling, matching analysis
techniques; formats to represent images; image compression techniques; basics of probability and statistics
needed for data analytics; curve fitting needed for data analysis; concepts in statistics; different types of
databases such as relational databases, object-based databases, multimedia databases, temporal databases;
knowledge bases and techniques to keep privacy and security in databases; middleware for data communi-
cation; basics of human physiology needed for health informatics; and basics of genomics and proteomics.
Chapter 3 describes various artificial intelligence and machine learning techniques that are used for the
automation of human–computer interactions, text extraction and summarization, form filling from doc-
tors’ natural language dictation about patients’ conditions, monitoring patients’ conditions, and clinical
data analysis to derive new information and knowledge from a huge amount of data generated. It describes
artificial intelligence techniques such as heuristic searches, probabilistic reasoning and modeling, deduction
and induction; machine learning techniques such as data clustering, regression analysis, neural networks,
support vector machines, Markov processes, Bayesian networks, hidden Markov models and data
mining. It also describes analysis and clustering of time-series data. It describes ontology and medical
dictionaries to understand and compare an assessment of the patients’ condition by various specialist
doctors. It also briefly describes techniques for automated information extraction, event analysis and sum-
marization from natural language texts.
Chapter 4 discusses organization of healthcare data that removes the duplication of patient records
while preserving the privacy of the patients as protected by HIPAA (Health Insurance Portability and
Accountability Act). HIPAA prevents providers and insurers from exposing information to others without
necessity and the patient’s consent. This chapter also describes automatic acquisition of data from medi-
cal sensors, conversion of data for automated archiving and retrieval from the heterogeneous healthcare
databases, and many popular standards for the exchange of health information electronically over the
Internet. It discusses interoperability and transformation of data to make them compatible with heteroge-
neous databases containing multimedia objects and temporal objects. It also discusses different views of
electronic medical records.
Chapter 5 describes various medical imaging techniques needed to derive and analyze medical images
such as X-ray, computer-aided tomography (CAT), magnetic resonance imaging (MRI), ultrasound, posi-
tron emission tomography (PET), and other nuclear medicine and optics. The analysis of medical images
can offer clinicians significant insights noninvasively. The chapter also describes various formats
and compression techniques for medical image archival, retrieval and transmission. Finally, the chapter
describes techniques for the application of medical image analysis with a focus on cancer detection and
computer-assisted treatment monitoring.
Chapter 6 describes DICOM, the standard for communicating digital images between medical databases.
The chapter describes data structures, modeling a medical process using entity-relationship modeling,
transmission protocols and various network levels involved in an image transmission. It also describes
briefly the security issues in transmitting medical images.
Chapter 7 describes the various signal analysis techniques to understand electrocardiograms (ECG) for
analyzing and monitoring heart-related diseases, electroencephalograms (EEG) to understand brain-related
diseases and electromyography (EMG) to analyze muscle-related abnormalities. It also describes
how various artificial intelligence techniques, described in Chapter 3, can be applied to extract and analyze
ECG and EEG. It discusses various applications of computational analysis to identify different diseases
related to the heart, brain and muscles.
Chapter 8 describes the application of various artificial intelligence techniques such as clustering, regression
analysis, time-series data analysis, neural networks and data mining to perform analysis of clinical
data derived from clinical trials. It discusses statistical and computational techniques to study
drug efficacy, survivability and risk analysis. It describes some applications of clinical decision support
systems utilizing knowledge-based systems and artificial neural networks. It discusses the techniques to
identify and improve clinical processes and biomarkers for the cost-effectiveness of treatments. Finally,
three applications of clinical data analytics have been discussed for cancer detection, detection and man-
agement of dynamic organ failure and fatty-liver disease.
Chapter 9 discusses the concepts, techniques and some algorithms for remote care, automated monitoring
and transmission of signals, biosignal analysis, archiving the derived data for future analysis and
managing information security and patients’ privacy during data transmission and archiving. Remote
monitoring is becoming important to handle the shortage of medical practitioners, to provide elder-care
and to identify refractory conditions important in identifying disease-states of patients.
Chapter 10 describes bioinformatics and its application to drug discovery, efficacy analysis of drugs and
derivation of drug dosage and toxicity using pharmacokinetics and pharmacodynamics. This chapter
discusses biological concepts necessary for explaining bioinformatics, causes of various diseases, genetic
diseases, and pathway aberration-related diseases, as well as vaccine development and improvement in
the efficacy of drugs. The analysis techniques described include similarity-based search, genome alignment
techniques, dynamic programming techniques, SNP (single-nucleotide polymorphism), GWAS (genome-wide
association studies) and microarray analysis to identify signaling pathways. This chapter briefly describes
the structure of antibodies and computational techniques to improve the binding-affinity of antibodies to
improve drug-effectiveness.
Chapter 11 discusses the lack of understanding about the emotional needs of the healthcare providers,
caretakers and patients on the part of software developers. The potential for health informatics software to
improve health outcomes for patients is enormous. However, the effective utilization of health informatics
software depends upon the adoption and appropriation of the software by a wide range of stakeholders, with
a wide range of abilities and motivations. The emotional aspect of this interaction is vital. However, software
developers are often unaware of the patients’ emotional needs, experiences, and physical and emotional
impairments, and thus ignore their needs in the developed software. This chapter also describes four
case studies where emotional factors have been taken into consideration during software development for
healthcare applications.
Chapter 12 describes the evolution of health informatics and its impact in an aging society and the need
of the developing world to provide quality care while maintaining the economy. It also discusses issues in
developing standards and adaptability. Finally, this chapter describes some future directions in computa-
tional health informatics.
There are five appendices at the end of the book that describe various sources of healthcare-related standards,
conferences and journals, organizations, databases and companies related to healthcare. These
lists are representative subsets and are not meant to be exhaustive. The purpose of the appendices is to
provide the needed data-sources for doing research and project-reports needed for the course. Appendix I
describes the websites for major standards and formats described in this book. Appendix II summarizes
the list of the conferences and journals that were the source of material for this book. These conferences
and journals are rich sources for graduate research and course projects. The list is still not comprehensive,
and students should also find other sources for research. Appendix III lists major funding and databank
agencies, which are a rich source of data and are also involved in policy decisions regarding healthcare.
Appendix IV lists some major national and international databases that will be helpful in graduate and
undergraduate students’ research and projects. The list is certainly not comprehensive and misses many
research databanks from individual research groups and universities, yet is a major source of archived
data sufficient for research and projects. Finally, Appendix V lists a small representative subset of compa-
nies involved in the healthcare industry. It is divided into different classes such as EHR, medical imaging
devices and diagnostics, wearable devices and pervasive care and drug discovery.
Classroom Use of this Textbook
Based upon the experience in the Health Informatics course, this textbook is suitable for a one-semester
senior-level undergraduate course or freshman-level graduate course. For the graduate-level offerings, the
course needs to be augmented by the research articles given at the back of each chapter and various journals
and conferences in the area (see Appendix II). A suggested distribution of the effort is given below:
CHAPTER      TIME IN MINUTES   75-MINUTE LECTURES   45-MINUTE LECTURES   SUGGESTED MINIMUM COVERAGE FOR A SEMESTER-LONG COURSE
Chapter 1    150               2.0                  3.0                  Full
Chapter 2    300               4.0                  6.5                  Full
Chapter 3    250               3.5                  5.5                  Full
Chapter 4    250               3.5                  5.5                  Full
Chapter 5    180               2.5                  4.0                  At least Sections 5.8 and 5.9
Chapter 6    180               2.5                  4.0                  Full
Chapter 7    180               2.5                  4.0                  At least Sections 7.1–7.5
Chapter 8    150               2.0                  3.0                  At least Sections 8.1, 8.5, and 8.6
Chapter 9    150               2.0                  3.0                  At least Sections 9.1–9.4
Chapter 10   180               2.5                  4.0                  Based on class makeup
Chapter 11   75                1.0                  1.0                  At least Sections 11.1 and 11.2
Total        Approx. 2100      28 units             43 units
Acknowledgments
I thank Kent State University for the “Kent State University-Summa Health System Collaborative
Research Grant” that started my collaboration with Dr. Jeffrey Neilson (MD), who steered me to com-
putational health informatics from a medical practitioner’s perspective. I also acknowledge Jeff for gra-
ciously accepting my request to deliver guest lectures in my first offering of a graduate-level course during
Fall 2012.
I acknowledge all the researchers in this fast-growing field for their valuable contributions that
became an invaluable source of knowledge and learning. I must acknowledge Javed Iqbal Khan, who
nudged me to write this textbook with a promise to contribute the chapter on DICOM. I acknowledge
Kaisar Alam for contributing the chapter on medical image informatics. I also acknowledge the acquisi-
tion editor, Randi Cohen, for her constant encouragement and support throughout this long process of
writing and improving the text. I acknowledge the reviewers who raised the bar with useful comments. I
acknowledge Siemens Healthcare, research groups, researchers, publishers and medical practitioners who
permitted their copyrighted images and drawings to be included in this book.
My former PhD student, now Dr. Purva Gawde, contributed to the teaching and provided immensely helpful
feedback on the material. She also developed an online version of the course material
from an earlier unpublished version of this book that was taught by her and myself. Finally, I acknowl-
edge my PhD advisor and friend, Leon Sterling, who graciously contributed a valuable chapter that raises
an important issue that any health technology ultimately must be human-friendly and easy to use for
adoption.
Being a longtime worker in medical informatics (medical image processing, high-fidelity and complex
video communication, information coding, computation for radiation treatment planning, HIPAA to
DICOM, HL7, and medical IoT security), I have felt the need for a common compiled source of knowledge
in this rich, vast and multidisciplinary area of computational health informatics. Each time I started
working on a topic, it involved vast self-learning into a seemingly different wilderness of knowledge.
There was never an ideal textbook in this highly important area that could prepare students from a com-
puter science and engineering background for the field.
Over the course of a year, discussions along these lines with Prof. Arvind Kumar Bansal eventually
resulted in this project. I am glad to see that finally that dream textbook is here. Given the rapid growth in
this area and its highly challenging multidisciplinary topical composition, it inevitably has many deficien-
cies. However, I am hopeful that, with feedback, this project will become perfect in a few years. More
importantly, it will now pave the way to allow students and practitioners to delve deeper into the area of
medical informatics with much sharper technical tools than has been possible previously.
I gratefully acknowledge the contribution of my advisor, David Y. Y. Yun, who introduced me to
the world of medical computing and affirmed the strength of seeking the bigger picture rising above the
individual subareas of computing.
I have been in the area of medical imaging informatics and computer-aided diagnosis since the early 1990s
and would like to thank Arvind for initiating this much-needed project. I believe that this book will fill a
conspicuous void and will be very useful to the practitioners in this area.
I would like to thank six individuals whom I consider both friends and mentors: my PhD advisor,
Kevin Parker; my postdoc supervisor, Jonathan Ophir (deceased); my former supervisor, Ernie Feleppa;
my former supervisor, Fred Lizzi (deceased); Kazi Khairul Islam; and Brian Garra. Finally, I would like
to thank my Creator, my parents (deceased), my wife, our two children, my two siblings and all my friends
and family. I really appreciate all your encouragement and support.
S. Kaisar Alam
About the Authors
Arvind Kumar Bansal is a full professor of Computer Science at Kent State
University. He received both B. Tech (1979) in Electrical Engineering and
M. Tech (1983) in Computer Engineering and Science from the Indian Institute
of Technology at Kanpur (IITK), India, and PhD (1988) in Computer Science
from Case Western Reserve University (CWRU), Cleveland, Ohio, USA.
He has been a faculty member of Computer Science at Kent State University,
Kent, Ohio, USA, since 1988 and has taught undergraduate and graduate-level
courses in the areas of artificial intelligence, computational health informatics,
multimedia languages and systems and programming languages. He also directs
the “Artificial Intelligence Laboratory” at Kent State University and has been
teaching “Computational Health Informatics” regularly since 2012.
His research contributions are in the areas of artificial intelligence, bioinfor-
matics, proteomics, biological computing models, massive parallel knowledge bases, program analysis,
ECG analysis, social robotics and multimedia languages and systems. He has published over 75 refereed
articles in journals and international conferences. His research has been funded by NASA and the US Air
Force. He has also served in many program committees in the areas of artificial intelligence, bioinformat-
ics, logic programming, multimedia, parallel programming and programming languages. In addition, he
has been an area editor in the international journal Tools with Artificial Intelligence and is a member of
IEEE and ACM.
Javed Iqbal Khan is a full professor of Computer Science at Kent State University.
He received his B. Tech (1987) in Electrical Engineering from the Bangladesh
University of Engineering and Technology (BUET), Bangladesh, and his MS (1990)
and PhD (1995) in Electrical Engineering (Computer Track) from the University of
Hawaii at Manoa, Hawaii, USA. He has been a faculty member of Computer Science
at Kent State University, Kent, Ohio, USA, since 1997. He has regularly taught under-
graduate and graduate courses in the areas of Internet engineering, peer-to-peer
systems, artificial intelligence, algorithms and networking.
His research contributions are in Internet engineering, artificial intelligence,
automated knowledge extraction, routing and network decision-making with medical
data, perceptual enhancement through eye-tracking, cyber infrastructure for medical-image
communication, and networking for education. He has published over 100 articles in refereed international
conferences and journals and has served on NSF panels, many program committees and the executive
committee of IEEE Internet Engineering. He also led a team that designed and implemented two national
educational networks as part of a UN-funded project. His research has been funded by the World Bank, NSF,
DARPA and NASA. He has also been a Fulbright scholar and has served as a senior specialist on
high-performance education networking in the Fulbright National Roster of experts. He is an associate editor
of the International Journal of Computer Networks and Applications and is a member of IEEE and ACM.
xxxiv About the Authors
S. Kaisar Alam received his PhD (1996) in Electrical Engineering from the
University of Rochester, New York, USA. His research publications and teaching
are in signal/image processing with applications to medical imaging. He was a
Principal Investigator at Riverside Research, New York from 1998 to 2013 and the
Chief Research Officer at an upcoming tech startup in Singapore from 2013 to 2017.
He has been a visiting professor at the Center for Computational Biomedicine Imaging
and Modeling (CBIM), Rutgers University, Piscataway, New Jersey (since 2013)
and an adjunct faculty member at The College of New Jersey (TCNJ), Ewing, New Jersey
(since 2017). Currently, he runs his own consulting company specializing in medical
image analysis and diagnostic and therapeutic applications of ultrasound. He is a Fellow of the American
Institute of Ultrasound in Medicine (AIUM) and a senior member of IEEE and has served on the AIUM
Technical Standards Committee and the Ultrasound Coordinating Committee of the RSNA-QIBA. He is
an associate editor of Ultrasonics (Elsevier) and Ultrasonic Imaging (Sage). Dr. Alam has been a recipient
of the prestigious Fulbright Scholar award.
2 Introduction to Computational Health Informatics
due to the availability of medical data and 14) reduction of paper consumption. For example, automated
image analysis can diagnose malignancies in mammograms and brain MRIs (magnetic resonance imaging).
Automated analysis of ECGs can facilitate the work of medical practitioners in the treatment of various
heart-related diseases. Automated analysis of EEGs can predict an impending epileptic seizure.
There are some disadvantages in automation:
1. Automated interpretation cannot replace the wisdom of human healthcare providers due to
inherent limitations in modeling techniques that can affect overall accuracy.
2. The availability of automation encourages excessive and often meaningless data generation.
Handling a large amount of data is difficult and error prone.
3. Computer programmers introduce and enforce unnecessary checks and mundane questions in
the human–computer interface that take away additional time from healthcare providers, mak-
ing the automation less attractive. This issue has plagued the adoption of automation tools by
medical practitioners.
One reason for the slow adoption of automated healthcare systems is that doctors and programmers
do not understand each other’s needs. Healthcare providers find the software overly imposing even though
they understand its advantages. Both groups need to be educated and trained: healthcare providers should
understand computational health informatics better, and information scientists should understand the needs of
the care providers and patients to make their software user-friendly.
Although there are a few disadvantages, the perceived advantages are significant. Due to automation,
healthcare services, pharmacy, medicine dispensing and nursing are getting seamlessly integrated. As shown
in Figure 1.1, there are many components of health-automation: patients’ electronic records, archiving and
analysis of physiological data, improvement in time-management of nursing, pharmacy, medicine dispensing
and billing. Computer-based system automation has seamlessly integrated these components in the last decade.
1.1 INFORMATICS
Informatics handles different aspects of information such as modeling a process, digitization of the
information, efficient electronic archiving and retrieval of the information, transferring the information,
grouping and classifying the information for enhanced data analysis and knowledge extraction, statisti-
cal analysis to identify data patterns, analysis of time-series data to identify a trend, learning from the
patterns to create simple rules and new medical knowledge, interacting with other information sources to
enhance existing knowledge and keeping the information current.
The key factor is the improvement of the overall system efficiency with a significant reduction in
processing time and required resources. In terms of health care, the resources are: 1) availability of healthcare
providers; 2) availability of hospital beds and 3) support personnel hours to handle duplications. The increase
in system efficiency shows up as: 1) a larger number of patients treated; 2) faster recovery of the patients due to
improved coordination and better diagnosis and 3) increased productivity.
Example 1.1
A patient is modeled abstractly as (personal information, list of healthcare providers, insurance provider,
disease history, list of medicines causing side-effects, symptoms, diagnosis, prognosis). Each field is fur-
ther decomposed into many subfields. For example, patients’ personal information is modeled as a tuple
(patient’s unique identifier such as social security number, name, address, emergency contact).
Example 1.2
A hospital bed is modeled as a tuple: (bed-id, location, patient-id, doctor-on-duty, nurse-on-duty,
patient’s entry-time, patient’s condition, patient’s lab data, list of medications administered, list of
signals monitored). A hospital department is modeled as an array of beds for inpatient treatment. Each
monitoring device is modeled as a tuple (type of signal, signal output, frequency of signal output,
archival format of the signal).
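The tuples in Examples 1.1 and 1.2 map naturally onto record types. The following is a minimal Python sketch, not a prescribed schema: the class names, field names and sample values are illustrative assumptions, and only a subset of the fields listed in the examples is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonalInfo:
    # Subfields of the patient's personal information (Example 1.1).
    patient_id: str            # unique identifier
    name: str
    address: str
    emergency_contact: str


@dataclass
class Patient:
    # Top-level patient tuple of Example 1.1; each field could be decomposed further.
    personal_info: PersonalInfo
    healthcare_providers: List[str] = field(default_factory=list)
    insurance_provider: Optional[str] = None
    disease_history: List[str] = field(default_factory=list)
    medicines_causing_side_effects: List[str] = field(default_factory=list)
    symptoms: List[str] = field(default_factory=list)
    diagnosis: Optional[str] = None
    prognosis: Optional[str] = None


@dataclass
class Bed:
    # A subset of the hospital-bed tuple of Example 1.2.
    bed_id: str
    location: str
    patient_id: Optional[str] = None   # None while the bed is free
    signals_monitored: List[str] = field(default_factory=list)


# A department is modeled as an array (here, a list) of beds.
department = [Bed("B1", "Ward A"), Bed("B2", "Ward A")]

# Hypothetical patient admitted to the first bed.
patient = Patient(PersonalInfo("P001", "Jane Doe", "1 Main St", "John Doe"))
department[0].patient_id = patient.personal_info.patient_id

free_beds = [bed.bed_id for bed in department if bed.patient_id is None]
```

Linking the bed to the patient through the shared identifier, rather than embedding the whole patient record, mirrors the way the tuples refer to each other by id.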
psychology. Clinical informatics involves health signal monitoring, nursing of the inpatients, man-
agement of patient–doctor encounters, nursing care, physiological data analysis, radiology image
analysis, ECG signal analysis, ophthalmological data analysis, managing the treatment record of the
patients, including medications dispensing record and the procedures involved in patients’ treatment.
There are many further subcategories of clinical informatics such as dental informatics, pharmaceuti-
cal informatics, nursing informatics and primary care informatics. Primary care informatics involves
all aspects of family practice, general internal medicine, educating patients, pediatrics, geriatrics and
advanced nursing. Dental informatics involves all aspects of dental care, including dental surgery and
prosthetics.
1.3.2.2 Pharmacoinformatics
AMIA defines pharmacoinformatics as all aspects related to the use of medications by patients.
Pharmacoinformatics includes all aspects of research, analytics and development of computational
techniques, including decision support systems, for prescribing, verifying, dispensing, administering and
monitoring medications and for educating the patients and care providers about the medication. Prescription
includes streamlining and automating the process of prescription, administration, verification and billing of the
medication such that a medication, once prescribed by the physician, is automatically checked for
side-effects, duplication and permission by the insurance company before dispensing.
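The automatic checks performed before dispensing can be sketched as a small validation function. This is only an illustration under simplifying assumptions: the record layout, the function name `check_prescription` and the drug names are hypothetical, and a real decision support system would consult drug-interaction databases and insurer services rather than in-memory sets.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class PatientRecord:
    known_side_effect_drugs: Set[str]   # medicines that caused side-effects before
    active_medications: Set[str]        # medicines already prescribed
    insurance_approved: Set[str]        # medicines the insurer permits


def check_prescription(drug: str, record: PatientRecord) -> List[str]:
    """Return the list of problems found; an empty list means the drug may be dispensed."""
    problems = []
    if drug in record.known_side_effect_drugs:
        problems.append("side-effect")
    if drug in record.active_medications:
        problems.append("duplication")
    if drug not in record.insurance_approved:
        problems.append("not permitted by insurance")
    return problems


# Hypothetical record and drug names, for illustration only.
record = PatientRecord(
    known_side_effect_drugs={"drug_x"},
    active_medications={"drug_y"},
    insurance_approved={"drug_x", "drug_y", "drug_z"},
)
ok_result = check_prescription("drug_z", record)
duplicate_result = check_prescription("drug_y", record)
```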
Another key aspect is HIPAA compliance. When information is transferred from the central
database to an end user, who could be a patient, a healthcare provider or a pharmacy, proper
software filters must remove the information not needed by that end user. Due to the privacy constraints, the
archived data also needs additional security and encryption.
The third important aspect is acceptance by the patients and their relatives. Electronic devices are viewed with
suspicion for many reasons: 1) violation of privacy; 2) presumed lack of response by the human care provider
when an alert occurs; 3) fear of failure to operate at a critical time; 4) cumbersome entanglement and
interaction with the human body in terms of the form factor (weight and size); 5) technical complexity and
lack of standardization resulting in a learning curve to operate; 6) lack of human courtesy and human-like
interaction and 7) lack of empathy, especially for elderly patients. Devices still have a large form factor,
although recent wireless sensors are better. However, there is no technology to assess the pain and emotion of
the patient just by watching the patient.
Technology is expensive to upgrade and integrate with the remaining information system. Because of
this limitation, hospitals are slow to upgrade the technology. A new technology also requires training
the staff and patients to use it. Unfortunately, due to the lack of standardization and backward compatibility
of operations, it is difficult to learn the changes in technology.
Example 1.3
Figure 1.2 shows a cross-sectional view into a heart showing various chambers and heart valves. A
physician when prompted with this visual interface can easily mark multiple heart-related abnormali-
ties on a computer screen.
1 • Introduction 9
FIGURE 1.2 A cross-sectional view of heart diagram for visual interface (Figure courtesy © Dr. Purva Gawde,
part of her PhD dissertation, used with permission).
This book is intended for a first course in “Computational Health Informatics” and is suitable for senior
undergraduate students and beginning graduate students. It is also suitable for researchers in one discipline, such as
computer scientists who want to understand the concepts and issues involved in “Computational Health Informatics,”
or physicians who want to understand the computational techniques involved in the automation
of healthcare.
This book prepares students and researchers to explore this fast-evolving field further.
It does not provide detailed algorithms and software for the various approaches described throughout
the book; detailed algorithms can be looked up in the cited research articles and explored
further.
medicines and their dosages. The billing department only needs to know the procedure codes that
the medical practitioner performed, the stay-time in the hospital and the types of treatment (regular or emergency;
inpatient or outpatient). Insurance companies need to know the procedure codes, the names of the
patients, any duplication of the procedure codes and whether the procedure codes are allowed. Medical
practitioners should be able to query by date, range of dates and by content-based similarity. This
requires innovative information archiving/retrieval techniques.
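The idea that each end user sees only the fields it needs can be sketched as a role-based filter over a record. The role names, field names and sample record below are assumptions chosen to mirror the billing and insurance needs just described; an actual HIPAA-compliant system would add authentication, auditing and encryption on top of such a filter.

```python
# Fields each role is allowed to see (illustrative assumption, following the text above).
ROLE_FIELDS = {
    "billing": {"procedure_codes", "stay_time_days", "treatment_type"},
    "insurance": {"procedure_codes", "patient_name"},
}


def filter_record(record: dict, role: str) -> dict:
    """Return a copy of the record restricted to the fields the role may see."""
    allowed = ROLE_FIELDS[role]          # raises KeyError for an unknown role
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical encounter record.
encounter = {
    "patient_name": "Jane Doe",
    "diagnosis": "fractured ankle",
    "procedure_codes": ["PROC-1", "PROC-2"],
    "stay_time_days": 2,
    "treatment_type": "inpatient",
}

billing_view = filter_record(encounter, "billing")
insurance_view = filter_record(encounter, "insurance")
```

Using an allow-list per role, rather than a list of fields to hide, means that a newly added field stays private until it is explicitly granted.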
With the availability of cheap computational power and the ever-growing networking of computers, it
has become possible to put all the health informatics data into large databases and share the information
electronically using a secure computer network, either by transmitting it or by remotely accessing a central
database. This automation of data requires interoperability among the multiple types of databases used by
health organizations. In addition, database records should be transferable over the Internet to other organizations,
medical practitioners or the patients.
EMR is a networked database that contains: personal information of patients, physicians’ infor-
mation, patient’s physician-related information, patients’ history, information about patient-doctor
encounters, including prescription and lab-results, information about patient monitoring, patient-
reminders, information about dispensing of the medicine by the pharmacists, list of pharmacies and
their information, information about prescriptions sent to pharmacies, information about hospital
facilities, information about lab facilities, information about insurance agencies, information about
billing, billing-audit related information, addresses of the entities (patients, physicians, clinics,
hospitals etc.), treatment eligibility information and drug-related information. The pieces of information are related to
each other using primary keys such as patient-id, provider-id, hospital-id, insurance-id, pharmacy-id,
encounter-id, etc. Each of these information contents has multiple fields. For example, patient information
includes fields such as (name, gender, title, occupation, employer, patient-id, social security
number, driver license, date of birth, new/repeat, address, phone(s), insurance company, insurance
type, emergency contact(s), etc.). Figure 1.3 shows an interconnection of a subset of these information
components related to an electronic health record (EHR) database. The textboxes show different rela-
tional tables.
The relational tables are connected to other relational tables through shared keys. For example,
the textbox “patient personal information” contains the set of information for each individual patient;
the textbox “physician information” contains the set of information for each individual physician; and
the textbox “pharmacy information” contains the set of information for each individual pharmacy. “Patient-physician
encounter” is a relational table containing all the information about the appointments of a patient with
a physician. Each edge between two relational tables shows a connection through
a common field that identifies individual records (tuples) uniquely in at least one of the
relational tables.
For example, “patient-id” will be a unique id for the relational table “Patient’s personal information,”
and “Physician-id” will be a unique-id for the relational table “physician information.” The field
“patient-id” connects the relational table “Patient-physician encounter” with the relational table “Patient’s
personal information” so that records related to patients can be retrieved. Similarly, the relational
table “Patient-physician encounter” is connected to the relational table “physician information” so that
physician-related information can be retrieved. These connections also make it possible to answer a query such as:
which patients has a physician seen over a given period?
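The query about which patients a physician has seen over a period can be sketched with three small relational tables joined through their shared keys. The example uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names and sample rows are assumptions made for illustration and cover only a tiny subset of the tables in Figure 1.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE physician (physician_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id TEXT REFERENCES patient(patient_id),
    physician_id TEXT REFERENCES physician(physician_id),
    visit_date TEXT   -- ISO 8601 date, so string comparison orders by date
);
INSERT INTO patient VALUES ('P1', 'Alice'), ('P2', 'Bob'), ('P3', 'Carol');
INSERT INTO physician VALUES ('D1', 'Dr. Rao'), ('D2', 'Dr. Lee');
INSERT INTO encounter (patient_id, physician_id, visit_date) VALUES
    ('P1', 'D1', '2019-01-10'),
    ('P2', 'D1', '2019-03-05'),
    ('P3', 'D2', '2019-02-20'),
    ('P1', 'D1', '2019-06-01');
""")


def patients_seen(physician_id, start, end):
    """Names of the distinct patients a physician saw between two dates (inclusive)."""
    rows = conn.execute(
        """SELECT DISTINCT p.name
           FROM encounter e JOIN patient p ON p.patient_id = e.patient_id
           WHERE e.physician_id = ? AND e.visit_date BETWEEN ? AND ?
           ORDER BY p.name""",
        (physician_id, start, end),
    ).fetchall()
    return [name for (name,) in rows]


seen = patients_seen("D1", "2019-01-01", "2019-03-31")   # ['Alice', 'Bob']
```

The encounter table carries only the two identifiers, so patient and physician details are stored once and retrieved through the join.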
Medical databases can be geographically separated as each hospital has its own proprietary informa-
tion about the patients that cannot be shared with others, other than the patient, without proper authoriza-
tion as permitted by HIPAA. The cross-references about the same patient-ids are stored in a centralized
database so that the records from other hospitals can be retrieved with no duplications as described in
Section 3.3.2.
The advantages of EMR are: 1) instant access to integrated data to avoid duplication of the tests;
2) data-archiving for a long period to study the recovery of the patients; 3) analysis of disease-specific data
to identify patterns of parameters that can cause the diseases; 4) data analytics to study the effectiveness,
toxicity (harmful effects) and proper dosage of medications; 5) integration of real-time monitoring of the
patients’ vital signs, lab results, diagnosis and medicine dissemination and 6) providing remote health
care to elderly patients. The overall effect is to reduce the cost of medical care while making the optimum
use of resources, measured as patients served per healthcare provider, patients treated per hospital
bed and patients served per nurse.
1. Procedures for the diagnosis, treatment, surgeries and billing that require efficient compression
and transmission over the Internet;
2. Medicines administered to patients in different diseases;
3. Synonyms and antonyms used for automated text analysis so that similarity between two seem-
ingly different texts can be understood; and
4. Images and ECG histories of patients to compare the progression/remission rates of diseases.
FIGURE 1.5 A schematic for automated data acquisition using health service bus
each having a different standard and format. The adapters transform data back and forth between the common
format and the data format of an individual virtual device, and produce structured data suitable for archiving in a
medical database. There are multiple types of adapters to provide the data transformation as illustrated
in Figure 1.5.
FIGURE 1.6 A schematic of Internet-based medical information exchange using HL7 format (diagram labels:
application layer, HL-7 message, web interface, SOAP message, transport layer)
1. Medical Information Bus (MIB) that is used to send the sensor data to a central monitoring
station where it can be analyzed for any emergency condition and visualized by the nurse-on-duty.
MIB is used to gather data from the sensors used to monitor the patients post-surgery, in inten-
sive care units and during the surgery. Different sensors are developed by different vendors
using their own proprietary data-format. The task of the MIB is to provide a common data-
format for information exchange for data coming from different sensors. The advantage of MIB
is that devices from different vendors become plug-and-play. The current standard for MIB is
IEEE 11073, and it has seven layers: physical layer, data-link layer, network layer, transport
layer, session layer, presentation layer and application layer. The communication standard is
based upon an intermediate XML message-based language called MDDL – Medical Device
Data Language. Each physical device is described as “Virtual Medical Device” (VMD) in
MDDL. MDDL codes for: i) medical devices; ii) different types of alerts and iii) units for mea-
surements. More details about clinical interfaces are given in Sections 3.1.3–3.1.5.
2. Communication between different units within the same organization with homogeneous data-
bases is done using HL7 over the Intranet or secure cloud.
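The role of an adapter that maps vendor-specific sensor output to a common format can be sketched as follows. This is not MDDL or IEEE 11073: the two vendor message layouts, the common dictionary format and all function names are hypothetical, and the sketch only illustrates why a common format makes devices from different vendors interchangeable for the monitoring station.

```python
def adapt_vendor_a(message: dict) -> dict:
    # Hypothetical vendor A reports a dictionary such as {"hr": 72}.
    return {"signal": "heart_rate", "value": float(message["hr"]), "unit": "bpm"}


def adapt_vendor_b(message: str) -> dict:
    # Hypothetical vendor B reports a string such as "HEART_RATE=72;UNITS=beats/min".
    fields = dict(part.split("=", 1) for part in message.split(";"))
    return {"signal": "heart_rate", "value": float(fields["HEART_RATE"]), "unit": "bpm"}


# One adapter per vendor; the monitoring station only ever sees the common format.
ADAPTERS = {"vendor_a": adapt_vendor_a, "vendor_b": adapt_vendor_b}


def to_common_format(vendor: str, message) -> dict:
    return ADAPTERS[vendor](message)


reading_a = to_common_format("vendor_a", {"hr": 72})
reading_b = to_common_format("vendor_b", "HEART_RATE=72;UNITS=beats/min")
```

Supporting a new device then means registering one more adapter, without changing the code that analyzes or displays the readings.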
Figure 1.7 shows a schema for information exchange using a unique patient index. The two databases
correspond to the requesting party and the sending party. Each database is connected through incoming and
outgoing data interfaces to the cross-reference database that stores the patients’ unique patient-ids, while the
database of an individual health organization stores its own local index for the patients.
The cross-reference database has two types of tables: 1) a table to look up the unique patient-index
given the local index and 2) a table to look up the local patient-id given the unique patient-index. Using
these tables, the local patient-id from the requesting database is converted to the local patient-id of the
sending database and vice versa. The outgoing data interfaces convert the local data format to a common
data format such as HL7, and the incoming data interfaces convert the HL7 format back to the
local data format.
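The two lookup tables of the cross-reference database can be sketched with two dictionaries. The organization names and identifiers below are made up for illustration, and the function name `translate_patient_id` is an assumption; the point is only the two-step translation from one local patient-id to another through the unique patient-index.

```python
# Table 1: (organization, local patient-id) -> unique patient-index
local_to_unique = {
    ("hospital_A", "A-17"): "U-0001",
    ("hospital_B", "B-903"): "U-0001",   # same person, known to both hospitals
    ("hospital_A", "A-18"): "U-0002",
}

# Table 2: (unique patient-index, organization) -> local patient-id
unique_to_local = {
    (unique, org): local for (org, local), unique in local_to_unique.items()
}


def translate_patient_id(local_id, from_org, to_org):
    """Convert a local patient-id of one organization into that of another.

    Returns None when the patient has no record at the target organization.
    """
    unique = local_to_unique[(from_org, local_id)]
    return unique_to_local.get((unique, to_org))


id_at_b = translate_patient_id("A-17", "hospital_A", "hospital_B")   # 'B-903'
missing = translate_patient_id("A-18", "hospital_A", "hospital_B")   # None
```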
Solving this problem of incompatible operating systems, standards, formats and language libraries,
and of the multitude of health vocabularies meaning similar things, requires intelligent knowledge-based software
that interfaces records and develops dictionaries providing interoperability, as well as intelligent software
that analyzes documents to identify and match the meaning of sentences and phrases.
1.6.4.1 LOINC
LOINC is a dictionary of universal code names for medical terminology related to EHRs. The use of
universal standardized codes facilitates the electronic exchange of medical information such as medical
procedures, including surgical procedures, lab-tests, devices used in lab-tests and clinical observations.
Data-exchange standards such as HL7 and IHE (Integrating the Healthcare Enterprise) interface with
LOINC codes. More details of LOINC have been described in Section 4.8.3.
1.6.4.2 MedDRA
MedDRA is a tool to encode and communicate the pharmaceutical terminologies related to the clinical
tests of drugs, vaccines and drug-delivery devices under development. It covers the tests and outcomes
starting from the clinical stage up to the marketing stage. Much of the clinical information, such as diseases
and disorders, observed signs and symptoms, drug efficacy (effectiveness of the drug), side-effects,
adverse events, social relationships and family history, is encoded and communicated using MedDRA.
These data are pooled, analyzed, compared and verified using standardized data analytics tools. The
medical terminology of MedDRA is extensive, and is structured hierarchically.
The advantage of MedDRA is to provide encoding free from language and cultural barriers. The
results and outcomes are shared among the clinical researchers. It is a rich, highly specific, hierarchical,
medically oriented and rigorously maintained terminology designed to meet the needs of drug regulators
and the pharmaceutical industry as a shared international standard.
1.6.4.3 SNOMED
SNOMED CT (Systematized Nomenclature of Medicine − Clinical Terms) describes universal codes
for medical terms such as medicine, procedures, microorganisms, diseases, synonyms, anatomy of
where a disease occurs, functions and structure of medicines, chemical agents or microorganisms caus-
ing a disease, chemical name of the drugs, disease diagnosis, devices and activities used in treating
diseases and social relationships associated with disease conditions. SNOMED codes are transmitted
over the Internet using exchange standards such as HL7. The coding structure of SNOMED has been
discussed in Section 4.8.2.
1.6.4.4 ICD
ICD is a world standard for the classification of diseases, their symptoms, their diagnostics, abnormal
findings, origin and spread, complaints and social circumstances. It is supported by WHO (World Health
Organization), and is used worldwide to collect statistics of the treatment, symptoms and fatalities caused
by various diseases. ICD keeps getting updated as the medicines, treatments and medical procedures get
updated. The current version is ICD-10. More details of ICD codes and their structure are discussed in
Section 4.8.4.
Example 1.4
The word “growth” in the context of tumor diagnosis may indicate malignancy of the tumor and will be
a cause of much anxiety. However, growth in the context of a child is a healthy and welcome
aspect. Conversely, multiple medical terms with similar meanings are used within the same dictionary and
across different dictionaries.
Example 1.5
A cardiologist tells a physician that a patient has “arrhythmia” (a condition describing an
irregular beat-pattern of the heart); another cardiologist describes it as “tachycardia”; a third cardiologist
describes it to the patient as “fast heartbeat”; and the first cardiologist describes it to another cardiologist as
“ventricular arrhythmia.”
FIGURE 1.8 An illustration of hierarchical structure for the ontology in Example 1.5
A close scrutiny of the phrases shows that all four conditions are related semantically using a hierar-
chical network. The phrase “fast heartbeat” is related to “tachycardia” using the relation is-meaning-of;
the entity “tachycardia” is related to the entity “arrhythmia” using the subclass relation is-a; the entity
“ventricular arrhythmia” is related to the entity “arrhythmia” using the subclass relationship is-a; the entity
“tachycardia” is related to the entity “ventricular arrhythmia” using a relationship can-be. A hierarchical
structure with entities as nodes and relations as edges relates all four terms using transitivity of relations as
illustrated in Figure 1.8.
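The hierarchical network of Example 1.5 can be sketched as a small labeled graph in which transitivity is obtained by following edges repeatedly. The entities and relations are those named in the example; the data layout and the function names `ancestors` and `common_concepts` are assumptions for illustration, and only the is-a and is-meaning-of relations are treated as hierarchical.

```python
from collections import defaultdict

# (subject, relation, object) triples taken from Example 1.5.
EDGES = [
    ("fast heartbeat", "is-meaning-of", "tachycardia"),
    ("tachycardia", "is-a", "arrhythmia"),
    ("ventricular arrhythmia", "is-a", "arrhythmia"),
    ("tachycardia", "can-be", "ventricular arrhythmia"),
]

# Relations followed when climbing the hierarchy (an assumption of this sketch).
HIERARCHICAL = {"is-a", "is-meaning-of"}

parents = defaultdict(set)
for subject, relation, obj in EDGES:
    if relation in HIERARCHICAL:
        parents[subject].add(obj)


def ancestors(term):
    """All entities reachable from term through hierarchical edges (transitive closure)."""
    found, stack = set(), [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def common_concepts(term_a, term_b):
    """Concepts that both terms are, or generalize to."""
    return (ancestors(term_a) | {term_a}) & (ancestors(term_b) | {term_b})


shared = common_concepts("fast heartbeat", "ventricular arrhythmia")   # {'arrhythmia'}
```

A non-empty result shows that two differently worded descriptions refer to related conditions, which is the kind of matching an ontology provides.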
In Section 1.6.4, many standards were described. These standards have overlapping domains. One
database may use medical codes using one standard such as LOINC while the other may use SNOMED.
Ontology will relate the two terms from different databases or match two natural language descriptions.
More details about ontology description and their role in integrating information in the heterogeneous
databases are described under Section 3.9.
1.6.6 Interfaces
When operating in a heterogeneous environment in a culturally diverse country, the data has to be inter-
faced to: 1) patient; 2) medical practitioners, including specialists, surgeons, nurses, pharmacists, para-
medics and radiologists; 3) billing personnel; 4) appointment and social interaction staff and 5) data
analysts and academic researchers.
The data have multiple types such as patients’ family history, encounter with the medical practitio-
ners, medication history, diagnosis from the symptoms, history of the symptoms, pre- and post-history
radiology images, billing amounts along with codes. The data is exchanged between multiple hetero-
geneous databases (often placed behind the firewalls) and between medical practitioners using differ-
ent medical terminology based upon their specialization. This information exchange requires medical
interfaces.
Medical interfaces provide: 1) a seamless transition of image and data between heterogeneous data-
bases and 2) interaction between different actors and the data collection system. The second type of
interface is a user interface and varies based upon the actor (patient, doctor, nurse), type of disorder (eye
disorder, heart disorder, lung disorder, various types of fractures, and brain tumors) and type of interven-
tion (biopsy, surgery, simulation of surgery, etc.).
An important aspect missing in present-day user interfaces is an understanding of human–computer
interaction and of the cognitive aspects of human comprehension in information presentation
and collection. Cognitive science studies human comprehension and performance and will help in developing
better user interfaces. The development of interfaces requires middleware tools that integrate well
with image-based representation and with the web-based and/or cloud-based databases.
transition is derived by statistical analysis of the large sample set of examples with a known outcome.
HMM is described in Section 3.7.3. It has been used to model many phenomena where time-series data
involving periodic measurement of values is known. Some applications of HMM relevant to computational health
informatics are ECG analysis, recovery response to medication, speech recognition, natural language
understanding, gene detection during genome analysis, etc.
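Deriving transition probabilities from a sample set can be sketched by counting how often one state follows another in observed sequences and normalizing the counts. This covers only the estimation of the transition table, not a complete HMM, and it assumes the states are directly labeled; the state names and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Estimate P(next state | current state) by relative frequency."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }


# Hypothetical labeled sequences, e.g. successive heartbeat classifications.
sequences = [
    ["normal", "normal", "abnormal", "normal"],
    ["normal", "abnormal", "abnormal", "normal"],
]
transitions = estimate_transitions(sequences)
# From "normal": one transition to "normal" and two to "abnormal".
```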
requires that healthcare software should be able to convert a natural language summary into structured
data in the database, generate natural language from structured data and find equivalence between
two textual summaries.
Textual analysis for extracting information requires detection of health-domain specific words and
the corresponding values (including fuzzy values). Text generation uses various templates to generate
natural language, and equivalence of the two sentences is found using concept similarity and ontology.
The technique has been described in Section 3.10.
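Extraction of domain-specific words with their values, and template-based generation, can be sketched with regular expressions and a format string. The two patterns, the template and the sample summary are assumptions for illustration; practical systems handle far more vocabulary, units, negation and fuzzy values than this sketch does.

```python
import re

# Health-domain terms to detect, each followed by a numeric value (illustrative).
PATTERNS = {
    "heart_rate": re.compile(r"heart rate (?:of |is )?(\d+)", re.IGNORECASE),
    "temperature": re.compile(r"temperature (?:of |is )?(\d+(?:\.\d+)?)", re.IGNORECASE),
}

TEMPLATE = "The patient has a heart rate of {heart_rate:g} and a temperature of {temperature:g}."


def extract(summary: str) -> dict:
    """Convert a free-text summary into structured (field, value) data."""
    data = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(summary)
        if match:
            data[name] = float(match.group(1))
    return data


def generate(data: dict) -> str:
    """Generate a natural language sentence from structured data using a template."""
    return TEMPLATE.format(**data)


summary = "Patient reports fatigue. Heart rate is 88 and temperature of 38.5 recorded."
structured = extract(summary)      # {'heart_rate': 88.0, 'temperature': 38.5}
sentence = generate(structured)
```

Two summaries can then be compared through their structured form: `extract(sentence) == structured` holds even though the wording differs from the original summary.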
1.6.8.1 Clustering
Clustering is an automated unsupervised learning technique for the classification of the data ele-
ments based upon modeling data in an N-dimensional space where each feature of a data element is a
dimension. Points are grouped together if the distance between the coordinate-vector for each point in
a group is less than a threshold. The underlying assumption is that two entities having similar feature-
values have other common attributes and behavior. Clustering has been used to automatically learn the
classes of entities exhibiting similar behavior. Many types of clustering techniques have been discussed
in Section 3.5.1.
Many notions of distance are used to derive the similarity between two feature-vectors. Popular
ones are: Euclidean distance, Manhattan distance and weighted Euclidean distance. Euclidean distance
is the length of the straight-line path between two points. Given two points in an N-dimensional
space as $\langle x_{11}, x_{12}, \ldots, x_{1N} \rangle$ and $\langle x_{21}, x_{22}, \ldots, x_{2N} \rangle$, the Euclidean distance between the points is given
by $\sqrt{\sum_{i=1}^{N} (x_{1i} - x_{2i})^2}$. Manhattan distance is the sum of the absolute differences of the values of the
same coordinates between two points. Given coordinate vectors $\langle x_{11}, x_{12}, \ldots, x_{1N} \rangle$ and $\langle x_{21}, x_{22}, \ldots, x_{2N} \rangle$, the
Manhattan distance is given by $\sum_{i=1}^{N} |x_{1i} - x_{2i}|$. Weighted Euclidean distance applies different weights to the
individual distance components contributed by different dimensions. Given coordinate vectors $\langle x_{11}, x_{12}, \ldots, x_{1N} \rangle$
and $\langle x_{21}, x_{22}, \ldots, x_{2N} \rangle$, weighted Euclidean distance is given by $\sqrt{\sum_{i=1}^{N} w_i (x_{1i} - x_{2i})^2}$ where $w_i$ is the weight
of the ith parameter. The rationale is that different parameters have different importance.
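The three distance measures can be written directly as short functions. The sketch below uses the standard definitions, with the square root taken over the (weighted) sum of squared differences; the function names and sample points are illustrative.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def weighted_euclidean(p, q, weights):
    """Euclidean distance in which dimension i contributes with weight weights[i]."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))


p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)                      # 5.0
d_manhattan = manhattan(p, q)                   # 7
d_weighted = weighted_euclidean(p, q, (4, 0))   # 6.0: only the first dimension counts
```

With all weights equal to 1, the weighted distance reduces to the ordinary Euclidean distance.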
Example 1.6
Figure 1.9 illustrates the concept for a popular type of clustering called K-means clustering, applied to nine
data elements. Each data element is a vector of two feature-values. Each feature becomes a dimension.
Thus, we mark the data elements as points in a two-dimensional plane. The points form two groups: within
each group the points are close to each other, and the distance between them is below a threshold. These two groups are called
clusters. The assumption is that all the points within the same cluster share common properties.
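A minimal K-means sketch on nine two-dimensional points is given below. The coordinates are made up and are not the data of Figure 1.9, and the initial centroids are fixed by hand so that the run is reproducible. Note that K-means assigns each point to its nearest centroid instead of applying an explicit distance threshold.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Nine hypothetical data elements with two feature-values each.
points = [(1, 1), (1.5, 2), (2, 1), (1, 2), (2, 2),
          (8, 8), (9, 8), (8, 9), (9, 9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
```

On this data the procedure converges to one cluster of five points and one of four.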
Example 1.7
A large sample of patients is given the same medication. Some get side-effects. Suppose that
an ethnicity-based analysis establishes that 90% of the patients with side-effects are African-Americans. Then
there is an associative pattern between the side-effect and African-American ethnicity.
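The associative pattern in Example 1.7 can be checked numerically. The sketch below uses synthetic records with neutral group labels G1 and G2; the counts are invented for illustration. It also compares the group's share among affected patients with its share in the whole sample, since a 90% share points to an association only if the group makes up a clearly smaller fraction of the sample.

```python
# Synthetic records: (group, had_side_effect). Counts are invented for illustration.
records = (
    [("G1", True)] * 18 + [("G1", False)] * 22 +
    [("G2", True)] * 2 + [("G2", False)] * 58
)


def share_among_affected(records, group):
    """Fraction of the patients with side-effects who belong to the group."""
    affected = [g for g, side_effect in records if side_effect]
    return affected.count(group) / len(affected)


def share_in_sample(records, group):
    """Fraction of all patients who belong to the group."""
    return sum(1 for g, _ in records if g == group) / len(records)


affected_share = share_among_affected(records, "G1")   # 18 of 20 affected patients
sample_share = share_in_sample(records, "G1")          # 40 of 100 patients
ratio = affected_share / sample_share                  # values well above 1 suggest an association
```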
In health care, a large amount of data is collected from sensors monitoring the patients, lab-results,
medication reports and diagnoses based on the lab results and symptoms. This data is mined to:
1) identify a new set of parameter values that cause diseases; 2) derive effective dosages for different
classes of patients based upon age, gender and ethnicity; 3) identify new disease patterns; and 4) identify
biomarkers, that is, biomolecules in the internal fluids of the body that indicate the presence of diseases before
other detectable symptoms appear. Data mining is described in Section 3.8.
FIGURE 1.11 An X-ray of a fractured ankle (Images provided by Siemens Healthcare, used with written
permission).
spine injuries; tuberculosis; cancer; hidden cysts and tumors in vital organs such as brain, lungs, heart,
liver, pancreas and kidney; malfunctioning internal organs; and state of the fetus during pregnancies.
These medical images could be a two-dimensional still image as in X-ray of a bone fracture, a cascaded
sequence of images of different slices of an organ used to computationally create a 3D structure as in CAT
scan and MRI, or a video of images to record or model the motion such as heart motions to understand
the problems in heart wall motion during blood pumping. Before archiving, an image is preprocessed to
remove the noise and enhance the image quality using image processing techniques.
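Noise removal before archiving can be illustrated with a median filter, one common technique for suppressing isolated noisy pixels; the text does not prescribe a particular method, so this choice is an assumption. The sketch operates on a small grayscale image stored as a list of rows, with made-up pixel values.

```python
from statistics import median


def median_filter(image, radius=1):
    """Replace each pixel by the median of its neighborhood (clipped at the borders)."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [
                image[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            row.append(median(window))
        result.append(row)
    return result


# A uniform 3x3 image with one noisy pixel in the center.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
cleaned = median_filter(noisy)   # the outlier is replaced by the neighborhood median
```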
Example 1.8
Figure 1.11 shows an X-ray of a fractured ankle that has been operated upon. The X-ray reveals
any remaining problem in the healing process without further surgery. Similarly,
Figure 1.12 shows the MRI scan of a brain, which can be checked for any abnormality such as a tumor in the
brain without performing any invasive surgery. These images can be used to plan future surgery
more accurately.
FIGURE 1.12 MRI scan of a brain (Images provided by Siemens Healthcare, used with written permission).