Baytar C. The Future of Data Mining 2022
Research Methodology and Data Analysis
The purpose of this book is to discuss data mining, which is a subset of data
science, from a variety of perspectives. With the technological advances of
recent years, new software and hardware-based systems are available in most
business environments. With these systems, data production continues to
increase in personal, corporate, commercial and many other areas. Information
systems process raw data, which have little meaning on their own, into
information. Database systems are necessary for the storage
and management of the information generated. The function of data mining is
to reveal meaningful relationships hidden in large volumes of data.
Processing big data has become important for producing information that
supports business decisions and serves as a strategic tool in today’s competitive
environment. In this context, data mining applications are becoming
increasingly effective as decision support systems for developing marketing
strategies in every sector by identifying customer behavior and target groups.
The wide perspective of the book’s chapters is likely to provide a rich source
of information for those interested in broadening their understanding of the
multiple facets of data mining. The primary audience for this
book includes business professionals, practitioners, and data scientists, as well as
researchers whose work addresses issues associated with data mining (or
data science). An important motivation for editing this book was the need to
create an organized framework that presents various points of view about the
future of data mining.
Chapter 1 provides details about the relationship between the concept of
analytics and the field of human resources. It also explains the opportunities
that await human resources. Chapter 2 presents toxicogenomic data mining
as a promising functional annotation-based prioritization tool, with a concise
summary of potentially useful software and tools, including their major
advantages and limitations. Chapter 3 introduces applications of data mining
algorithms (decision tree and random forest) for customer recommendations
in retail marketing. Chapter 4 is an analysis of customer churn in the banking
viii Cem Ufuk Baytar
August 2022
Acknowledgments
Abstract
Introduction
Data Analytics
The use of analytics can be traced back to the 1910s, when scientific
management emerged (Mortenson, Doherty and Robinson, 2015). It can be
considered a continuation of the optimization and simulation techniques of the
1940s, which aimed to achieve maximum output with limited resources (Delen and
Zolbanin, 2018). At the end of the 1960s, the use of computer systems to
analyze data and support decision-makers led to the use of analytics.
Applications called decision support systems were used in activities such as
product planning, investment portfolio management, and transportation routing
(Davenport and Harris, 2007). These decision support systems were generally
grouped under the name Operations Research and Management Science. In addition to the
Data Analytics Applied to the Human Resources Industry 3
• Data cleaning: Removal of noisy and inconsistent data
• Data integration: Combining multiple data sources
• Data selection: Retrieving the data relevant to the analysis from the
database
• Data conversion: Consolidating the data or converting them into formats
appropriate for mining
• Data mining: Extracting data patterns by applying appropriate methods
• Pattern evaluation: Identification of patterns representing information
based on certain criteria
• Information representation: Presentation of the extracted information by
using visualization techniques
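The steps listed above can be sketched as a small pipeline. The following Python example is only an illustration under stated assumptions: the records, the field names (`id`, `age`, `dept`, `salary`), the threshold, and every function name are hypothetical and do not come from the source.

```python
# Hypothetical toy data sources; all field names and values are assumptions.
hr_records = [
    {"id": 1, "age": 34, "dept": "sales"},
    {"id": 2, "age": None, "dept": "sales"},  # noisy record: missing age
    {"id": 3, "age": 45, "dept": "it"},
    {"id": 4, "age": 29, "dept": "it"},
]
payroll = {1: 4200, 3: 5100, 4: 3900}  # id -> monthly salary


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(records, salaries):
    """Data integration: join the second source on the shared id."""
    return [{**r, "salary": salaries[r["id"]]} for r in records if r["id"] in salaries]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def convert(records):
    """Data conversion: group salaries by department."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["dept"], []).append(r["salary"])
    return grouped


def mine(grouped):
    """Data mining: a deliberately trivial pattern, the mean salary per department."""
    return {dept: sum(vals) / len(vals) for dept, vals in grouped.items()}


def evaluate(patterns, threshold):
    """Pattern evaluation: keep only patterns that meet a chosen criterion."""
    return {dept: m for dept, m in patterns.items() if m >= threshold}


def represent(patterns):
    """Information representation: a plain-text bar per department."""
    return [f"{dept:<6}{'#' * int(m // 1000)} {m:.0f}" for dept, m in sorted(patterns.items())]


def run_pipeline():
    data = select(integrate(clean(hr_records), payroll), ["dept", "salary"])
    return evaluate(mine(convert(data)), threshold=4000)


if __name__ == "__main__":
    for line in represent(run_pipeline()):
        print(line)
```

Each function corresponds to one bullet, so a real project could replace any stage (for example, swapping the mean in `mine` for a clustering or classification model) without touching the others.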
into two groups. The first group consists of predictive methods and the
second of descriptive methods. The main predictive methods
are classification and regression models. Descriptive methods characterize
patterns hidden in the data. For example, revealing whether
the purchasing habits of married men and single
women are similar is a descriptive task.
Clustering methods and association rules are among the descriptive methods.
According to Özdemir et al. (2018), predictive methods create a model
from data whose details, including the outcomes, are already known. The model is
then applied to data whose outcomes are unknown in order to predict them.
Classification is the process of building a model from known data and then
using that model to assign new data to
predetermined classes (Silahtaroğlu, 2013). In this context, regression
analysis, which is commonly used in classical statistics, can be regarded as a kind of
classification method (Gamgam and Altunkaynak, 2017). Only numeric values are used in
the regression model. Some of the classification and regression models are
decision trees, artificial neural networks, genetic algorithms, k-nearest
neighbor, and Bayes classifiers (Cömert and Kaymaz, 2019).
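As an illustration of the predictive workflow described above (known data first, then unknown data), the following is a minimal sketch of a k-nearest neighbor classifier in plain Python. The two numeric features and the labels "stay" and "leave" are invented for the example, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Assign `query` the majority label among its k nearest known examples.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Known (labelled) data; features and labels are illustrative assumptions.
known = [
    ((1.0, 1.0), "stay"),
    ((1.5, 2.0), "stay"),
    ((2.0, 1.5), "stay"),
    ((6.0, 6.5), "leave"),
    ((7.0, 6.0), "leave"),
    ((6.5, 7.5), "leave"),
]

if __name__ == "__main__":
    # Unknown data: the labels are predicted from the known examples.
    print(knn_predict(known, (1.2, 1.4)))
    print(knn_predict(known, (6.4, 6.6)))
```

Here the "model" is simply the stored labelled data; other classifiers named in the text, such as decision trees, compress the known data into an explicit structure instead.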
Decision trees form a tree-shaped structure for decision-making (Sayad,
2022). A decision tree is made up of decision nodes, branches, and leaves.
A decision node splits into two or more branches and
tests the data for decision-making or classification purposes.
Leaf nodes represent decisions. The root node is at the top of
the decision tree. To reach a decision, a specific path is traced from the root of
the tree to a leaf node. Algorithms such as ID3, C4.5, and CART are used
to create decision trees (Balaban and Kartal, 2018).
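To make the tree-building step more concrete, the sketch below computes entropy and information gain, the criterion ID3 uses to choose which attribute to split on at a decision node (C4.5 and CART rely on related criteria, gain ratio and the Gini index). The tiny data set and its attribute names (`overtime`, `dept`, `left`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Attribute an ID3-style learner would place at the current decision node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical employee records; values are invented for illustration.
rows = [
    {"overtime": "yes", "dept": "sales", "left": "yes"},
    {"overtime": "yes", "dept": "it", "left": "yes"},
    {"overtime": "no", "dept": "sales", "left": "no"},
    {"overtime": "no", "dept": "it", "left": "no"},
]

if __name__ == "__main__":
    print(best_attribute(rows, ["overtime", "dept"], "left"))
```

In this toy data `overtime` separates the classes perfectly while `dept` carries no information, so a learner of this kind would put `overtime` at the root and reach a leaf after a single split.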
Clustering is the task of assigning data to groups of objects with similar
characteristics (Ceylan et al., 2017). The analysis forms
groups of data with common attributes, which helps in identifying
the categories present in the data. Clustering analysis is also used for
purposes such as hypothesis testing, prediction for groups of objects, and
outlier detection. Clustering contributes to the analysis
of the data, but it can also cause details to be overlooked. Each group is
internally homogeneous (Berkhin, 2006). Clustering
models include K-means, Fuzzy C-Means (FCM), Kohonen Networks, and
Hierarchical Clustering. The similarity between data points is
measured by the distances between them.
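Two common distance measures can be written in a few lines of Python. This is a minimal sketch; the source does not say which distance the cited studies used, so both functions are shown only as typical choices.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    p, q = (0.0, 0.0), (3.0, 4.0)
    print(euclidean(p, q), manhattan(p, q))
```

The smaller the distance, the more similar two records are considered to be, so the choice of measure (and the scaling of the attributes) directly shapes the clusters that are found.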
One of the best-known clustering methods is the k-means algorithm,
developed by MacQueen in 1967 (Arai and Barakbah, 2007). This algorithm
is based on the idea that a central point can represent a cluster (Steinbach et al.,
2000). It automatically forms clusters of similar items and is a type of
unsupervised learning, requiring no reference labels or preliminary
information. It is called k-means
because it creates k distinct clusters and the center of each cluster is the mean of
the values in that cluster (Harrington, 2012).
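The idea can be sketched in plain Python as follows. This is a minimal illustration of the standard assign-and-update iteration, not the implementation used in any cited work; taking the first k points as initial centers is a simplifying assumption made here so that the example is deterministic.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, k, max_iter=100):
    """Return (centroids, clusters) after iterating assignment and update steps."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialisation (assumption)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


# Invented two-dimensional points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

if __name__ == "__main__":
    centroids, clusters = k_means(points, k=2)
    print(centroids)
    print(clusters)
```

Because the result depends on the starting centers, practical implementations usually run several initialisations or use a more careful seeding strategy; the work cited above (Arai and Barakbah, 2007) addresses exactly this centroid initialisation problem.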
The process of human resources analytics consists of the following steps:
(a) Understanding the organization’s business goals, (b) Identifying the
metrics to be analyzed to achieve those goals, (c) Collecting and analyzing the
relevant data, (d) Obtaining insights into this data, and (e) Communicating
how this data affects the organization.
There are different industrial applications of human resources analytics.
Some of them have been summarized as follows:
General Electric (GE) experts collected different types of data related to
working environments ranging from power plants to hospital equipment. GE's
analytics team used these data to make the machines operating in those
environments more efficient. They stated that such business analyses may increase productivity in
the US by 1.5% and increase the average national income by 30% over 20
years (Sağıroğlu and Koç, 2017).
FedEx, which has 155,000 employees worldwide, is one of the institutions
that have achieved success in human resources analytics. The company has
increased its use of analytics and has become able to access all the data needed.
“We're always looking at HR metrics,” says Bob Bennett, FedEx Executive
Vice President of Human Resources, who also said, “When evaluating our
ability to support business goals, we also look at business metrics”
(Hattangadi, 2019).
eQuest, a company that provides recruitment consultancy services to
financial companies, found in an analysis that 175,000 dollars had been spent
every year on 48 job websites, and 45 of them did not give timely feedback.
In addition, the company determined the words and phrases that candidates
use the most while searching and decided to change the titles and definitions
of job postings accordingly. As a result, candidate traffic increased by 175%
and the budget spent on job postings decreased by 50% (Doğan, 2019).
Conclusion
References
Anger, O., Tessema, M., Craft, J., Tsegai, S., 2021. A framework for assessing the
effectiveness of hr metrics and analytics: the case of an American healthcare
institution. Global Journal of Human Resource Management 9(1), 1-19.
Arai, K., Barakbah, A., 2007. Hierarchical k-means: An algorithm for centroids
initialization for k-means. Faculty of Science and Engineering, Saga University.
Balaban, M. E., Kartal, E., 2018. Veri madenciliği ve makine öğrenmesi temel algoritmaları
ve R dili ile uygulamaları. [Data mining and machine learning basic algorithms and
applications with R language.] İstanbul: Çağlayan Publishing.
Berkhin, P., 2006. A survey of clustering data mining techniques in grouping
multidimensional data. Springer Berlin Heidelberg, 25-71.
Ceylan, Z., Gürsev, S., Bulkan, S., 2017. İki aşamalı kümeleme analizi ile bireysel
emeklilik sektöründe müşteri profilinin değerlendirilmesi. Bilişim Teknolojileri
Dergisi [Evaluation of customer profile in private pension sector with two-stage
cluster analysis. Journal of Information Technologies] 10(4), 475-485.
Cömert, N., Kaymaz, M., 2019. Araç sigortası hilelerinde veri madenciliğinin kullanımı.
Marmara Üniversitesi, İktisadi ve İdari Bilimler Dergisi [The use of data mining in
car insurance fraud. Marmara University, Journal of Economics and Administrative
Sciences] 41(2), 364-390.
Davenport, T. H., Harris, J.G., Morison, R., 2010. Analytics at Work: Smarter Decisions,
Better Results. Boston: Harvard Business Press.
Delen, D., Ram, S., 2018. Research Challenges and Opportunities in Business Analytics.
Journal of Business Analytics 1(1), 2-12.
Delen, D., Zolbanin, H.M., 2018. The Analytics Paradigm in Business Research. Journal of
Business Research 90, 186-195.
Doğan, G., 2019. Büyük Veri İnsan Kaynaklarına Ne Sunuyor? [What Does Big Data
Offer to Human Resources?]
https://kaynakbaltas.com/kultur/buyuk-veri-insan-kaynaklarina-ne-sunuyor/.
Dursun, B., 2013. İnsan Kaynakları Yönetimi. [Human Resources Management] Istanbul:
Beta publishing house.
Falletta, S.V., Combs, W. L., 2021. The HR analytics cycle: a seven-step process for
building evidence-based and ethical HR analytics capabilities. Journal of Work-
Applied Management 13(1), 51-68.
Fitz-Enz, J., Mattox, J., 2014. Predictive Analytics for Human Resources. New Jersey:
Wiley and SAS Business Series.
Gamgam, H., Altunkaynak, B., 2017. SPSS uygulamalı regresyon analizi [SPSS applied
regression analysis]. (2 ed.).
Han, J., Kamber, M., 2006. Data mining concepts and techniques. San Francisco: Elsevier,
Morgan Kaufmann Publishers.
Harrington, P., 2012. Machine learning in action. Manning Publications Co.
Hattangadi, V., 2019. Firms are turning to people analytics to achieve better efficiencies.
https://www.financialexpress.com/opinion/firms-are-turning-to-people-analytics-to-
achieve-better-efficiencies/1734443/
Heuvel, S., Bondarouk, T., 2016. The Rise (and Fall) of HR Analytics: A Study into the
Future Applications, Value, Structure, and System Support. Journal of Organizational
Effectiveness People and Performance 4(2), 127-148.
Jayanthi, R., 2020. A Study on Effectiveness of HR Metrics. International Journal of
Science and Research (IJSR) 9(2), 1630-1635.
Keçecioğlu, T., Oktay, S., 2010. İnsan Sermayesinin Sayısallaştırılmasının Dayanılmaz
Çekim Gücü [The Irresistible Attraction of the Digitization of Human Capital]. Dokuz
Eylül Üniversitesi, Sosyal Bilimler Enstitüsü Dergisi [Journal of Social Sciences
Institute] 12(3), 67-86.
Krishnamoorthi, S., Mathew, S.K., 2018. Business analytics and business value: A
comparative case study. Information & Management 55(5), 643-666.
Lalwani, P., 2022. What Is HR Analytics? Definition, Importance, Key Metrics, Data
Requirements, and Implementation. Accessed June 20. https://www.spiceworks.com/
hr/hr-analytics/articles/what-is-hr-analytics/.
Mortenson, M. J., Doherty, N. F., Robinson, S., 2015. Operational Research from
Taylorism to Terabytes: A Research Agenda for the Analytics Age. European Journal
of Operational Research, 241(3), 583-595.
Olavsrud, T., 2022. What is data analytics? Analyzing and managing data for decisions.
Accessed June 15. https://www.cio.com/article/191313/what-is-data-analytics-
analyzing-and-managing-data-for-decisions.html.
Özdemir, A., Sağlam, R., Bilen, B. B., 2018. Eğitim Sisteminde Veri Madenciliği
Uygulamaları ve Farkındalık Üzerine Bir Durum Çalışması. Atatürk Üniversitesi
Sosyal Bilimler Enstitüsü Dergisi [A Case Study on Data Mining Applications and
Awareness in Education System. Journal of Atatürk University Institute of Social
Sciences] 22 (Özel Sayı), 2159-2172.
Sağıroğlu, Ş., Koç, O., 2017. Büyük Veri ve Açık Veri Analitiği: Yöntemler ve
Uygulamalar. Ankara: Grafiker publishing house.
Abstract
† The authors contributed equally to the research.
Introduction
and/or secretion, which might suggest the ability of the investigated chemical
to disrupt different molecular pathways/processes in the organism (Cronin et
al., 2009; Živančević et al., 2019).
An in-depth understanding of the selected gene data can be achieved by
integrating the data from CTD with other mining tools and databases such
as ToppGene Suite, the Database for Annotation, Visualization and
Integrated Discovery (DAVID), DisGeNET, or Cytoscape, an open-source
platform for visualizing complex networks and integrating them with any
type of attribute data.
ToppGene Suite (https://toppgene.cchmc.org/) is a publicly available
website that can be used for gene list functional enrichment and candidate gene
prioritization. Candidate genes are prioritized by using either functional
annotations or network analysis, the latter identifying and ranking novel
candidate genes in the interactome, a protein-protein interaction network
(Chen et al., 2009). This is
particularly useful for gaining mechanistic knowledge of substance-induced
toxicity or adverse outcomes. Similarly, DAVID
(https://david.ncifcrf.gov/) was developed to provide a comprehensive set of
functional annotation tools for elucidating the biological meaning of genes.
Users can easily add large gene lists, change background populations, select
species and categories, and reset functional parameters for data analysis.
All available tools can also be applied simultaneously during the analysis,
and the final output can be formatted (Jiao et al., 2012). In other
words, DAVID facilitates the transition from the collection of genome-scale
datasets to biological meaning (Dennis et al., 2003).
Another useful resource is DisGeNET (https://www.disgenet.org/), designed
as the largest publicly available collection of genes and variants associated
with human diseases. The current version of DisGeNET (v7.0)
contains 1,134,942 gene-disease associations, curated for 21,671 genes and
30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes.
Moreover, 369,554 variant-disease associations between almost 195,000
variants and more than 14,000 diseases, traits, and phenotypes are described
in the latest version (Piñero et al., 2020). Thus, combining CTD data mining
with DisGeNET can be viewed as a useful way of predicting chemically
induced abnormal human phenotypes (Baralić et al., 2022).
Finally, one of the best-known software tools for genome-related
biological research is Cytoscape, a network visualization tool integrated with
different plug-ins (Shannon et al., 2003). For example, the GeneMANIA plug-in
can be used to identify the genes most related to a query gene set by
using a guilt-by-association approach (Montojo et al., 2014), while
Toxicogenomics Data Mining as a Promising Prioritization Tool … 17
processes. Glutathione metabolic process was listed as the most important for
ALS, cellular response to toxic substance for PD, and neuron death for AD
(Živančević et al., 2021).
In order to develop a feasible set of genomic biomarkers, Baralić et al.
(2022) used in silico toxicogenomic data mining to investigate the link
between phthalates and bisphenol A (BPA) co-exposure and obesity, as well
as its comorbid conditions. For this purpose, CTD was used as the
main in silico tool, along with GeneMANIA, ToppGene Suite and DisGeNET.
As a result, 7 mutual genes (6 relevant to humans: CCL2, IL6, LPL, PPARG,
SERPINE1, and TNF) were identified in all the investigated obesity
comorbidities. Additionally, of all the extracted genes, PPARG and LPL were
found most closely linked to obesity (Baralić et al., 2022). The same group of
authors investigated the connections between DEHP, DBP and BPA co-
exposure and type 2 diabetes mellitus (T2DM), male infertility and asthma
(Baralić et al., 2021a, 2021b, 2021c). In the aforementioned studies, CTD and
ToppGene Suite were used as the main data mining tools, along with the
Cytoscape software. The analysis revealed 44, 20 and 24 genes common to all
the investigated substances and involved in the development of T2DM, male
infertility and asthma, respectively. Apoptosis and oxidative stress were
singled out as the most important mechanisms of both T2DM and asthma
(Baralić et al., 2021a, 2021c), while inflammation was additionally
highlighted as a particularly important pathway in asthma development
(Baralić et al., 2021a). The proposed mechanisms of DEHP, DBP and BPA
mixture-induced testicular toxicity were effects on sex differentiation,
metabolism, nuclear receptors, aryl hydrocarbon receptors, apoptosis, and
oxidative stress (Baralić et al., 2021b). Since oxidative stress was highlighted
as one of the main mechanisms of phthalate and BPA mixture toxicity in all of
these disorders, the data mining results were further validated by additional in vivo
experiments, which suggested additive effects of the investigated substances
(Baralić et al., 2021a, 2021b, 2021c).
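Finding the genes common to all investigated substances, and then those also linked to a disease, amounts to intersecting gene sets. The sketch below shows only that logic: the gene symbols are placeholders, the per-chemical lists are invented rather than taken from CTD, and `common_genes` is a hypothetical helper name.

```python
def common_genes(gene_sets):
    """Genes that appear in every set of a {chemical: genes} mapping, sorted."""
    sets = [set(genes) for genes in gene_sets.values()]
    return sorted(set.intersection(*sets))


# Placeholder data: NOT real CTD results, symbols are invented for illustration.
interactions = {
    "DEHP": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "DBP": {"GENE_B", "GENE_C", "GENE_E"},
    "BPA": {"GENE_A", "GENE_B", "GENE_C", "GENE_F"},
}
disease_genes = {"GENE_C", "GENE_F", "GENE_G"}  # genes annotated to one disease

shared = common_genes(interactions)
candidates = sorted(set(shared) & disease_genes)

if __name__ == "__main__":
    print(shared)
    print(candidates)
```

In practice the input sets would be exported from the databases named in the text, and the resulting candidate list would then be passed on to enrichment or network analysis.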
Another study that combined in silico with in vivo results was
performed by Dong et al. (2018). As a follow-up to their in vivo zebrafish
experiment, these authors performed CTD data mining for BPA and DBP. The
CTD analysis revealed 4826 and 14737
interactions with various genes/proteins for DBP and BPA, respectively.
When orthologous human genes in BPA- and DBP-treated embryos were
compared to curated BPA- and DBP-interacting proteins in CTD, 18 and 13
mutual proteins were found, respectively, of which 7 and 9
were part of the BPA- and DBP-predicted pathways, respectively.
The results of CTD and proteomic analysis further highlighted the possible
impacts of BPA and DBP on the hypothesized networks (Dong et al., 2018).
Furthermore, Baralić et al. (2020) assessed both risks and benefits of the
COVID-19 treatment with candidate drug combinations: lopinavir/ritonavir
and chloroquine/hydroxychloroquine + azithromycin, by using CTD,
Cytoscape software and ToppGene Suite portal as a foundation in their
research. Lopinavir/ritonavir increased the expression of specific genes
involved in the immune response and lipid metabolism (IL6, ICAM1, CCL2,
TNF, APOA1, etc.). Chloroquine/hydroxychloroquine + azithromycin
interacted with six genes (CCL2, CTSB, CXCL8, IL1B, IL6 and TNF),
whereas chloroquine + azithromycin acted on two extra genes (BCL2L1 and
CYP3A4) (Baralić et al., 2020).
Using the immunomodulator sulforaphane (SFN) as a case study, Bozic
et al. (2021) applied toxicogenomic data mining (CTD, ToppGene Suite
portal and Reactome Knowledgebase) to investigate molecular mechanisms
and pathways which might be targeted in cancer treatment. Sulforaphane
interacted with 1896 different proteins, the most important being NFE2L2,
NQO1, HMOX1, GCLC, TXNRD1, IL1B, IFNG, AGT, KEAP1, and CASP3.
Direct evidence was found in CTD that SFN interacted with a total of 169
genes connected to its therapeutic effect against various forms of cancer,
including hepatocellular carcinoma, colorectal neoplasms, uterine cervical
neoplasms, and adenomatous polyposis coli (Bozic et al., 2021). The same
group of authors further explored SFN-induced adverse effects in colorectal
carcinoma patients. In this study, gene network analysis was performed using
Cytoscape plug-in STRING. Functional annotation clustering was performed
by the Cytoscape plug-in BINGO, while SFN interacting genes were obtained
from CTD. SFN was found to increase the expression of TIMP1, AURKA,
and CEP55, and decrease the expression of CRYAB, PLCE1, and MMP28,
which may contribute to the advancement of colorectal carcinoma (CTD).
According to the pathway enrichment analyses, SFN enhanced RUNX2 and
AURKA activation through TPX2 and IL-10 signaling (Bozic et al., 2022).
Advantages
Limitations
One of the challenges in mixture toxicity analysis is that detected
signals must be linked to predefined response chains. It is also necessary
to understand the crosstalk and convergence of the pathways, considering that
joint responses might switch between independent and concentration additive,
or non-interactive and interactive. Since quantitative mixture assessment
typically relies on monotonic changes in response, represented by sigmoidal
concentration-response relationships, U-shaped curves present specific
challenges (Altenburger et al., 2012). Elucidating the sequence of
biological responses, as well as separating toxicokinetic from
toxicodynamic responses, could also be a challenge in interpreting
toxicogenomic data (Altenburger et al., 2012). “Omics” techniques are
generally complex to compute and understand. Thus, toxicogenomic data
mining could be demanding, particularly in the integration of different data
formats (Martins et al., 2019) and given the variability of
available data and measured endpoints (Liu et al., 2020). Issues with the
reproducibility of toxicogenomic data, both biological and technical, could
also make drawing conclusions difficult (Liu et al., 2020). The applicability
domain of toxicogenomic data is also an important factor which may limit the
analysis. A false positive result could be generated by biologically based
read-across, which relies on toxicogenomic data (Liu et al., 2020). The fast
development of the toxicogenomic approach has greatly expanded its impact on
toxicity testing. However, the aforementioned challenges should be
addressed before toxicogenomic data mining can meet its full potential as a
promising prioritization tool in toxicity testing (Liu et al., 2020).
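The two reference models mentioned above, concentration addition and independent action, are commonly written as ECx_mix = 1 / Σ(p_i / ECx_i) and E_mix = 1 − Π(1 − E_i), where p_i is the fraction of component i in the mixture and E_i its individual effect on a 0 to 1 scale. The Python sketch below simply evaluates these textbook formulas; the numbers are invented and the function names are hypothetical, not taken from the cited studies.

```python
import math


def ca_mixture_ec(fractions, ecs):
    """Concentration addition: mixture ECx from component fractions and ECx values."""
    return 1.0 / sum(p / ec for p, ec in zip(fractions, ecs))


def ia_mixture_effect(effects):
    """Independent action: combined effect of components acting independently.

    Each effect is a fraction between 0 and 1.
    """
    return 1.0 - math.prod(1.0 - e for e in effects)


if __name__ == "__main__":
    # Invented example: a 50:50 mixture of two chemicals.
    print(ca_mixture_ec([0.5, 0.5], [10.0, 40.0]))  # mixture EC50 under CA
    print(ia_mixture_effect([0.2, 0.5]))            # combined effect under IA
```

Both formulas assume monotonic concentration-response curves for the individual components, which is one reason the U-shaped curves mentioned above are difficult to handle.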
Conclusion
Acknowledgments
References
Altenburger, R., Scholz, S., Schmitt-Jansen, M., Busch, W., Escher, B. I., 2012. Mixture
toxicity revisited from a toxicogenomic perspective. Environmental Science and
Technology 46, 2508-2522. https://doi.org/10.1021/es2038036.
Ball, N., Cronin, M. T. D., Shen, J., Blackburn, K., Booth, E. D., Bouhifd, M., Donley, E.,
Egnash, L., Hastings, C., Juberg, D. R., Maertens, A., Marty, S., Naciff, J. M., Palmer,
J., Pamies, D., 2016. Toward Good Read-Across Practice (GRAP) Guidance. Altex 33,
149-166.
Baralić, K., Bozic, D., Živančević, K., Milenković, M., Javorac, D., Marić, Đ., Miljaković,
E. A., Djordjevic, A. B., Vukomanović, P., Ćurčić, M., Bulat, Z., Antonijević, B.,
Đukić-Ćosić, D., 2021a. Integrating in silico with in vivo approach to investigate
phthalate and bisphenol A mixture-linked asthma development: Positive probiotic
intervention. Food and Chemical Toxicology 158, 112671. https://doi.org/10.1016/
j.fct.2021.112671.
Baralić, K., Jorgovanović, D., Živančević, K., Antonijević Miljaković, E., Antonijević, B.,
Buha Djordjevic, A., Ćurčić, M., Đukić-Ćosić, D., 2020. Safety assessment of drug
combinations used in COVID-19 treatment: in silico toxicogenomic data-mining
approach. Toxicology and Applied Pharmacology 406, 115237. https://doi.org/10.
1016/j.taap.2020.115237.
Baralić, K., Jorgovanović, D., Živančević, K., Buha Djordjević, A., Antonijević
Miljaković, E., Miljković, M., Kotur-Stevuljević, J., Antonijević, B., Đukić-Ćosić, D.,
2021b. Combining in vivo pathohistological and redox status analysis with in silico
toxicogenomic study to explore the phthalates and bisphenol A mixture-induced
testicular toxicity. Chemosphere 267, 129296. https://doi.org/10.1016/j.chemosphere.
2020.129296.
Baralić, K., Živančević, K., Božić, D., Jennen, D., Buha Djordjevic, A., Antonijević
Miljaković, E., Đukić-Ćosić, D., 2022. Potential genomic biomarkers of obesity and
its comorbidities for phthalates and bisphenol A mixture: In silico toxicogenomic
approach. Biocell 46, 519-533. https://doi.org/10.32604/biocell.2022.018271.
Baralić, K., Živančević, K., Jorgovanović, D., Javorac, D., Radovanović, J., Gojković, T.,
Djordjevic, A. B., Ćurčić, M., Mandinić, Z., Bulat, Z., Antonijević, B., Đukić-Ćosić,
24 Katarina Živančević, Dragica Bozic, Katarina Baralić et al.
D., 2021c. Probiotic reduced the impact of phthalates and bisphenol A mixture on type
2 diabetes mellitus development: merging bioinformatics with in vivo analysis. Food
and Chemical Toxicology 154, 112325. https://doi.org/10.1016/j.fct.2021.112325.
Basile, A. O., Yahi, A., Tatonetti, N. P., 2019. Artificial Intelligence for Drug Toxicity and
Safety. Trends in Pharmacological Sciences 40, 624-635. https://doi.org/10.1016/
j.tips.2019.07.005.
Bindea, G., Galon, J., Mlecnik, B., 2013. CluePedia Cytoscape plugin: Pathway insights
using integrated experimental and in silico data. Bioinformatics 29, 661-663.
https://doi.org/10.1093/bioinformatics/btt019.
Boverhof, D. R., Zacharewski, T. R., 2006. Toxicogenomics in risk assessment:
Applications and needs. Toxicological Sciences 89, 352-360. https://doi.org/10.1093/
toxsci/kfj018.
Bozic, D., Baralić, K., Živančević, K., Miljaković, E. A., Ćurčić, M., Antonijević, B.,
Djordjević, A. B., Bulat, Z., Zhang, Y., Yang, L., Đukić-Ćosić, D., 2022. Predicting
sulforaphane-induced adverse effects in colon cancer patients via in silico
investigation. Biomedicine & Pharmacotherapy 146, 112598. https://doi.org/10.1016/
j.biopha.2021.112598.
Bozic, D., Živančević, K., Baralić, K., Javorac, D., 2021. Applying in silico toxicogenomic
data mining to predict molecular mechanisms and pathways against carcinoma:
immunomodulator sulforaphane as a case study, in: 1st International Conference on
Chemo and Bioinformatics, October 26-27, Kragujevac, Serbia. pp. 470-473.
Breda, S. G. J. Van, Claessen, S. M. H., Lo, K., Herwijnen, M. Van, Gaj, S., Kok, T. M. C.
M. De, Kleinjans, J. C. S., 2014. Epigenetic mechanisms underlying
arsenic-associated lung carcinogenesis. Archives of Toxicology 89, 1959-1969. https://doi.org/
10.1007/s00204-014-1351-2.
Chen, J., Bardes, E. E., Aronow, B. J., Jegga, A. G., 2009. ToppGene Suite for gene list
enrichment analysis and candidate gene prioritization. Nucleic Acids Research 37,
305-311. https://doi.org/10.1093/nar/gkp427.
Chepelev, N. L., Moffat, I. D., Labib, S., Bourdon-Lacombe, J., Kuo, B., Buick, J. K.,
Lemieux, F., Malik, A. I., Halappanavar, S., Williams, A., Yauk, C. L., 2015.
Integrating toxicogenomics into human health risk assessment: Lessons learned from
the benzo[a]pyrene case study. Critical Reviews in Toxicology 45, 44-52. https://doi.
org/10.3109/10408444.2014.973935.
Chin, C. H., Chen, S. H., Wu, H. H., Ho, C. W., Ko, M. T., Lin, C. Y., 2014. cytoHubba:
Identifying hub objects and sub-networks from complex interactome. BMC Systems
Biology 8, S11. https://doi.org/10.1186/1752-0509-8-S4-S11.
Cronin, M. T. D., Bajot, F., Enoch, S. J., Madden, J. C., Roberts, D. W., Schwöbel, J., 2009.
The in chemico-in silico interface: Challenges for integrating experimental and
computational chemistry to identify toxicity. ATLA Alternatives to Laboratory
Animals 37, 513-521. https://doi.org/10.1177/026119290903700508.
Daina, A., Michielin, O., Zoete, V., 2017. SwissADME: A free web tool to evaluate
pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small
molecules. Scientific Reports 7, 1-13. https://doi.org/10.1038/srep42717.
Davis, A. P., Murphy, C. G., Rosenstein, M. C., Wiegers, T. C., Mattingly, C. J., 2008. The
Comparative Toxicogenomics Database facilitates identification and understanding of
Piñero, J., Ramírez-Anguita, J. M., Saüch-Pitarch, J., Ronzano, F., Centeno, E., Sanz, F.,
Furlong, L. I., 2020. The DisGeNET knowledge platform for disease genomics: 2019
update. Nucleic Acids Research 48, D845-D855. https://doi.org/10.1093/nar/gkz1021.
Prior, H., Casey, W., Kimber, I., Whelan, M., Sewell, F., 2019. Reflections on the progress
towards non-animal methods for acute toxicity testing of chemicals. Regulatory
Toxicology and Pharmacology 102, 30-33. https://doi.org/10.1016/j.yrtph.2018.12.
008.
Sewell, F., Gellatly, N., Beaumont, M., Burden, N., Currie, R., de Haan, L., Hutchinson, T.
H., Jacobs, M., Mahony, C., Malcomber, I., Mehta, J., Whale, G., Kimber, I., 2018.
The future trajectory of adverse outcome pathways: a commentary. Archives of
Toxicology 92, 1657-1661. https://doi.org/10.1007/s00204-018-2183-2.
Shannon, P., Markiel, A., Ozier, O., Shannon, P., Markiel, A., Ozier, O., Baliga, N. S.,
Wang, J. T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T., 2003. Cytoscape : A
Software Environment for Integrated Models of Biomolecular Interaction Networks
2498-2504. https://doi.org/10.1101/gr.1239303.
Shegokar, R., 2020. Preclinical testing-Understanding the basics first, Drug Delivery
Aspects. Elsevier Inc. https://doi.org/10.1016/b978-0-12-821222-6.00002-6.
Tung, C. W., Jen, H., Chia, C., Wang, C., Shan, S., Pinpin, W., 2020. Leveraging
complementary computational models for prioritizing chemicals of developmental
and reproductive toxicity concern : an example of food contact materials. Archives of
Toxicology 1-10. https://doi.org/10.1007/s00204-019-02641-0.
Uddin, M., Mustafa, F., Rizvi, T. A., Loney, T., Suwaidi, H. Al, Al-Marzouqi, A. H. H.,
Eldin, A. K., Alsabeeha, N., Adrian, T. E., Stefanini, C., Nowotny, N., Alsheikh-Ali,
A., Senok, A. C., 2020. SARS-CoV-2/COVID-19: Viral Genomics, Epidemiology,
Vaccines, and Therapeutic Interventions. Viruses 12, 526. https://doi.org/10.4324/978
1351118422-5.
Ulrich, R., Friend, S. H., 2002. Toxicogenomics and drug discovery: Will new technologies
help us produce better drugs? Nature Reviews Drug Discovery 1, 84-88. https://doi.
org/10.1038/nrd710.
Vachon, J., Campagna, C., Rodriguez, M. J., Sirard, M. A., Levallois, P., 2017. Barriers to
the use of toxicogenomics data in human health risk assessment: A survey of Canadian
risk assessors. Regulatory Toxicology and Pharmacology 85, 119-123. https://doi.org/
10.1016/j.yrtph.2017.01.008.
Yauk, C. L., Cheung, C., Barton-Maclaren, T. S., Boucher, S., Bourdon-Lacombe, J.,
Chauhan, V., Gagné, M., Gillespie, Z., Halappanavar, S., Honeyman, M., Jones, S. R.,
Jones-McLean, E., Labib, S., MacAulay, J., Moore, J., Paquette, M., Petronella, N.,
Semalulu, S., Slot, A., Vespa, A., Woodland, C. L. A., 2019. Toxicogenomic
applications in risk assessment at Health Canada. Current Opinion in Toxicology 18,
34-45. https://doi.org/10.1016/j.cotox.2019.02.005.
Živančević, K., Baralić, K., Jorgovanović, D., Buha Djordjević, A., Ćurčić, M., Antonijević
Miljaković, E., Antonijević, B., Bulat, Z., Đukić-Ćosić, D., 2021. Elucidating the
influence of environmentally relevant toxic metal mixture on molecular mechanisms
involved in the development of neurodegenerative diseases: In silico toxicogenomic
data-mining. Environmental Research 194. https://doi.org/10.1016/j.envres.2021.
110727.
28 Katarina Živančević, Dragica Bozic, Katarina Baralić et al.
Živančević, K., Baralić, K., Jorgovanović, D., Đukić-Ćosić, D., 2019. The Comparative
Toxicogenomics database: the influence of environmental chemicals on genes (In
Serbian). MD-Medical Data 11(3-4):159-164.
Chapter 3
Abstract
Introduction
Globalization, driven by advances in technology, has changed many sectors and
intensified competition. The most important change is that customers’
purchasing tendencies and expectations have shifted over time. These shifts
did not stop with the customer; through a domino effect they also had major
consequences for the rest of the process. The critical point is to manage the
effects of change by adopting an agile structure that remains consistent with
the business vision and mission. An agile management style, in turn, depends
directly on recognizing differences and keeping up with change. Because the
customer occupies such a central position, efficient management of customer
relations has a direct impact on market share and enables institutions to stay
competitive (Larson and Chang, 2016). At this point, data mining methods,
which have become increasingly important and convenient in recent years and
are now frequently encountered in applications, attract the attention of
researchers in many sectors and fields and facilitate the management of
processes.
Data mining is an analysis technique that is supported by algorithms to
reveal the information hidden in large amounts of data, plays a role in
Applications of Data Mining Algorithms … 31
Literature Review
Orange tool showed lower accuracy, while the Knime tool achieved better
accuracy compared to its predecessor. Wu et al. (2017) examined the
purchasing behavior of more than 500,000 customers of China Life Insurance
Company over 3 years. The researchers used SMOTE-based algorithms to analyze
the business data and proposed a Random Forest algorithm that uses the
parallel computing capability and memory caching mechanism of Spark.
Crises in the stock market directly affect the economy, so it is very
important to identify early warning indicators. Elagamy, Stanier and Sharp
(2018) applied text mining and data mining to identify these critical
indicators. The Random Forest, Rotation Forest, Bagging, J48, Bayes Net,
Decision Table and Decision Stump algorithms were used to classify the
critical indicators of the stock market. Random Forest performed best: it
reached 98.34% accuracy and correctly classified 535 of 544 articles.
Likewise, Tan, Yan and Zhu (2019) aimed to predict long-term and short-term
price trends using Chinese stock market data from February 8, 2013 to
August 8, 2017. In a different application area, Livari and Ghalam (2020)
aimed to group the customers of a food manufacturing company using the
recency of purchases (R), frequency of purchases (F) and monetary value of
purchases (M). Based on these variables, they clustered the company's
customers with the K-Means algorithm together with the Davies-Bouldin index.
Customers were divided into three groups, and marketing strategies were
suggested for each group based on its calculated customer lifetime value.
In recent years, the number of studies has increased, and data mining
methods have become an important tool for predicting the future.
Arminarahmah et al. (2021) aimed to view and map the spread of the COVID-19
virus in Asia with a data set from 2020 containing total cases, total deaths,
total recovered and active cases for 49 countries. They collected the data
from sources such as WHO, CDC, NHC and Worldometer and applied the X-Means
clustering method in RapidMiner. They concluded that the X-Means algorithm
can be used to map the spread of COVID-19 in Asia, and the spread mapping
analysis divided the countries into 4 regions. Other studies take different
perspectives. Influencers promote their products using social media accounts
such as Twitter, Facebook, and Instagram. Using the data collected from the
marketing agency in Korea
34 Elif Delice, Lütviye Özge Polatlı, İrem Düzdar Argun et al.
Author(s) | Problem | Compared algorithms | Algorithm with best performance | Accuracy (%) | Program used
Haghanikhameneh et al. (2012) | Finding the highest performing tool | Naïve Bayes, Decision Tree, Support Vector Machine, K Nearest Neighbor, One Rule, Zero Rule | Decision Tree | 97 | Orange, Tanagra, KNIME, WEKA
Alsultanny (2013) | Forecasting the needs of the labor market | Naïve Bayes, Decision Tree, Decision Rules | Decision Tree | 100 | -
Thomas, Vinod, and Raj (2014) | Classify spam documents by developing a computationally efficient classifier | Naive Bayes, Random Forest, Random Tree, J48, Adaboosting | Random Forest | 98.73 | WEKA
Naik and Samant (2016) | Classifying people with and without liver | Decision Tree, Decision Stump, K-Nearest Neighbor, Naïve Bayes | K-Nearest Neighbor | 99.70 | WEKA, RapidMiner, Tanagra, Orange, Knime
Elagamy, Stanier, and Sharp (2018) | Classification of critical indicators of stock markets | Random Forest, Rotation Forest, Bagging, J48, Bayes Net, Decision Table, Decision Stump | Random Forest | 98.34 | -
Abdulkareem et al. (2021) | Observing COVID-19 vaccine progress around the world using machine learning classification algorithms | Decision Tree, K Nearest Neighbor, Random Forest, Naive Bayes | Decision Tree | 99.9 | WEKA
As a result, many studies in the literature across many different fields are
supported by methods under the umbrella of data mining. The number of studies
increased greatly in 2021, and researchers frequently examine these methods
and use them to facilitate processes. Various software programs are used for
this purpose, with RapidMiner and WEKA being the most prominent.
Methodology
Decision trees, one of the tree-based learning algorithms, are among the most
widely used supervised learning algorithms. They can be adapted to both
classification and regression problems, so it is important for data analysts
to learn and use them. The topmost node of a decision tree is called the root
node. Each observation is first classified as “Yes” or “No” according to the
root condition. Below the root are the internal nodes, each of which tests a
further condition and routes the observation accordingly. As the number of
nodes increases, the complexity of the model also increases. At the bottom of
the decision tree are the leaf nodes, or leaves, which give the result.
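The root, internal node and leaf structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the feature names (total_weight, purchase_hour), the thresholds and the leaf labels are hypothetical and are not taken from the chapter's data set or from any fitted model.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # the result given by this leaf


@dataclass
class Node:
    feature: str                  # which attribute the node tests
    threshold: float              # "Yes" branch is taken when value <= threshold
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def predict(tree, observation):
    """Route one observation from the root down to a leaf."""
    node = tree
    while isinstance(node, Node):
        if observation[node.feature] <= node.threshold:
            node = node.yes
        else:
            node = node.no
    return node.label


# Hypothetical hand-built tree: root -> one internal node -> three leaves.
tree = Node(
    feature="total_weight", threshold=5.0,
    yes=Leaf("small order"),
    no=Node(
        feature="purchase_hour", threshold=12,
        yes=Leaf("morning bulk order"),
        no=Leaf("afternoon bulk order"),
    ),
)

print(predict(tree, {"total_weight": 8.0, "purchase_hour": 9}))
```

Each additional Node adds one more condition to evaluate, which is the sense in which more nodes make the model more complex.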
The Random Forest algorithm is a supervised classification algorithm that
works in two stages. The first stage creates the Random Forest, and the
second makes predictions with the Random Forest classifier created in the
first stage.
The K-means algorithm has the advantage of being applicable to large data
sets and easy to implement. The number of clusters must be defined at the
beginning, and the method is not recommended for categorical data sets. One
disadvantage of K-means is that it cannot give accurate results for data sets
whose clusters differ in density and size. In addition, K-means is weaker
than other algorithms in the presence of outliers and noise (Douzas, Bacao
and Last, 2018). The method converges to a local optimum and ends at that
point (Ahmadyfard and Modares, 2008). It is an iterative method that
partitions the elements into the specified number of clusters by minimizing
the sum of squared distances between the elements and their cluster centers.
In each iteration, elements may be reassigned to a different cluster in
search of the optimum result, so that a permutational operation is performed
at this stage (Omran, Salman and Engelbrecht, 2006).
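The iterative reassignment performed by K-means can be illustrated with a minimal sketch of the standard assign-and-update loop (often called Lloyd's iteration). The points and the initial centers below are made-up values chosen for illustration; the chapter's own analysis was carried out in RapidMiner, not with this code.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]      # keep the center of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:        # nothing moved: a local optimum
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """Sum of squared distances of points to their cluster center."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centroids, clusters)
               for p in cluster)


# Hypothetical 2-D data with k = 2 chosen in advance.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, sse(centroids, clusters))
```

Because the loop stops as soon as the centers no longer move, a different choice of initial centers can end at a different local optimum, which is the limitation noted above.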
The X-means algorithm addresses the question of how to choose the number of
clusters k by searching for the optimum number. For this, a reasonable range
for k must be specified. Each value of k in the specified range is scored by
a model selection criterion such as the Bayesian information criterion. In
summary, the algorithm proceeds from the lower limit to the upper limit of
the range, adding a new center where required. The set of centroids with the
best score among the values of k scored in this process is considered
optimum (Pelleg and Moore, 2000). The fact that the
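$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{1}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
$$\text{Accuracy Rate} = \frac{TP + TN}{TP + FN + FP + TN} \tag{4}$$
$$\text{Error Rate} = \frac{FN + FP}{TP + FN + FP + TN} \tag{5}$$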
True Positives (TP): These are instances where the true value is 1 and the
predicted value is 1.
True Negatives (TN): These are instances where the true value is 0 and the
predicted value is 0.
False Positives (FP): These are instances where the true value is 0 but the
predicted value is 1.
False Negatives (FN): These are instances where the true value is 1 but the
predicted value is 0.
The specificity criterion calculated in the third equation measures the true
negative performance, that is, the extent to which data that do not belong to
class C are not assigned to class C.
Precision is the ratio of samples correctly classified as positive by the
model to the total number of samples classified as positive. In other words,
it is the rate at which the data estimated as C are actually C. Precision
measures the performance of the model in eliminating false positives. Recall
is the ratio of correctly classified samples belonging to class C to the
total number of samples in class C. Unlike precision, recall measures the
performance of the model in eliminating false negatives.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$
Both criteria are expected to give results close to 1.
The harmonic mean of the precision and recall measures is expressed as F
and calculated as follows.
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
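As a worked illustration of the measures above, the following sketch computes them from the four confusion-matrix counts. The counts used in the example call are hypothetical and are not the chapter's results. Specificity is computed here as TN / (TN + FP), the usual true negative rate, which matches the verbal description of the criterion.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the evaluation measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        # F is the harmonic mean of precision and recall.
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a binary classifier evaluated on 100 test samples.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, so reporting both is a convenience rather than additional information.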
the data of the C1 class in the training set is defined as C1 again. The B
field (false negative) represents data that are not defined as C1 in the test
set although they are in the C1 class in the training set. The C field (false
positive) is data that are not in the C1 class but are classified as C1 in
the estimation. The D field (true negative) is data that are not in the C1
class and are not predicted as C1. Using these values, the accuracy and error
rate were calculated.
Table 4 and Figure 3 show the estimation performance of the models on the
test data. In Figure 3, correctly predicted data are marked in orange and
incorrect ones in gray. Class precision and class recall values are given in
the row and column subtotals of Table 3. Based on class precision, the
correct classification percentage of the Decision Tree algorithm is 81.13%,
and the average of its recall values is 71.70%. For the Random Forest
algorithm, the correct classification percentage is 85.71% and the average of
the recall values is 88.33%. The Random Forest model therefore performed
better in estimation than the Decision Tree model.
In the Decision Tree, the first node splits on total weight, whereas in the
Random Forest method the first node splits on purchase time. Images of the
trees obtained by the Decision Tree and Random Forest algorithms are given in
Figure 4. Although the Random Forest model performs better in estimation, the
colors shown under the clusters in the images indicate that the homogeneity
of the subsets is more pronounced in the Decision Tree. Taken together, these
findings show that the Random Forest algorithm was more successful than the
Decision Tree algorithm on the data set analyzed in the RapidMiner program.
Conclusion
Since 2019, consumers’ shopping habits have changed because of restricted
living conditions around the world. The data set used in this chapter
consists of data collected under pandemic conditions in August 2020. The aim
was to form a basis for the online shopping activities of retail businesses
under pandemic conditions. By examining the local customer base, and assuming
that the same customer group will behave similarly when ordering online or by
telephone, the chapter worked on customer identification and order estimation
in support of an effective sales policy.
First, clustering algorithms were used for customer segmentation, with the
aim of grouping customers with similar characteristics. Treating customers
with shared characteristics as a group, rather than treating each customer as
a separate value, makes it possible to reach results in a shorter time.
To make predictions about the clustered customers, decision tree algorithms
were tried, and the algorithm that performed best on the data set was
identified. The aim was to develop effective sales strategies by examining
the branches of the best performing algorithm.
The effect of the differences between the selected methods on the results was
observed, and the reasons why each method was or was not appropriate were
investigated. The chapter thus adds to the existing literature as a reference
for studies that compare performance on similar data sets in which integer
and string values are mixed.
References
Abdulkareem, N. M., Abdulazeez, A. M., Zeebaree, D. Q., and Hasan, D. A. (2021).
COVID-19 World Vaccination Progress Using Machine Learning Classification
Algorithms. Qubahan Academic Journal, 1 (2), 100–105.
Ahmadyfard, A., and Modares, H. (2008). Combining PSO and K-Means to Enhance Data
Clustering. 2008 International Symposium on Telecommunications, IST 2008, 688–
691.
Alsultanny, Y. A. (2013). Labor Market Forecasting by Using Data Mining. Procedia
Computer Science, 18, 1700–1709.
Arminarahmah, N., Daengs, A., Bhawika, G. W., Dewi, M. P., and Wanto, A. (2021).
Mapping the Spread of Covid-19 in Asia Using Data Mining X-Means Algorithms.
IOP Conference Series: Materials Science and Engineering, 1071 (1), 012018.
Balabantaray, R. C., Sarma, C., and Jha, M. (2015). Document Clustering Using K-Means
and K-Medoids. http://arxiv.org/abs/1502.07938.
Berrar, D. (2018). Cross-Validation. Encyclopedia of Bioinformatics and Computational
Biology: ABC of Bioinformatics, 1–3, 542–545.
Chandrashekar, M., and Lee, Y. (2019). MCDD: Multi-Class Distribution Model for Large
Scale Classification. Proceedings - 2018 IEEE International Conference on Big Data,
Big Data, 2018, 4906–4914.
Douzas, G., Bacao F., and Last, F. (2018). Improving Imbalanced Learning through a
Heuristic Oversampling Method Based on K-Means and SMOTE. Information
Sciences, 465, 1–20.
Duda, S., Miller, A. R., Statnikov, A., and Johnson, K. (2005). Extracting Drug-Drug
Interaction Articles from MEDLINE to Improve the Content of Drug Databases. AMIA
Annual Symposium Proceedings/AMIA Symposium. AMIA Symposium, 216–220.
Elagamy, M. N., Stanier, C., and Sharp, B. (2018). Stock Market Random Forest-Text
Mining System Mining Critical Indicators of Stock Market Movements. 2nd
International Conference on Natural Language and Speech Processing, ICNLSP
2018, 1–8.
Emel, G. G., and Taşkın, Ç. (2005). Veri Madenciliğinde Karar Ağaçları ve Bir Satış
Analizi Uygulaması [Decision Trees in Data Mining and a Sales Analysis
Application], 6 (2), 221–239.
Evdokimova, S. A. (2021). Segmentation of Store Customers to Increase Sales Using ABC-
XYZ-Analysis and Clustering Methods. Journal of Physics, Conference Series, 2032
(1).
Fithri, F. A., and Wardhana, S. (2021). Cluster Analysis of Sales Transaction Data Using
K-Means Clustering At Toko Usaha Mandiri. Jurnal PILAR Nusa Mandiri, 17 (2),
113–118.
Garcia, C. D. S., Meincheim, A., Junior, E. R. F., Dallagassa, M. R., Sato, D. M. V.,
Carvalho, D. R., Santos, E. A. P., and Scalabrin, E. E. (2019). Process Mining
Techniques and Applications – A Systematic Mapping Study. Expert Systems with
Applications, 133, 260–295.
Gibert, K., Izquierdo, J., Sànchez-Marrè, M., Hamilton, S. H., Rodríguez-Roda, I., and
Holmes, G. (2018). Which Method to Use? An Assessment of Data Mining Methods
in Environmental Data Science. Environmental Modelling and Software, 110, 3–27.
Guftar, M., Ali, S. H., Raja, A. A., and Qamar, U., (2015). A Novel Framework for
Classification of Syncope Disease Using K-Means Clustering Algorithm. IntelliSys
2015 - Proceedings of 2015 SAI Intelligent Systems Conference, 127–132.
Haghanikhameneh, F., Panahy, P. H. S., Khanahmadliravi, N., and Mousavi, S. A. (2012).
A Comparison Study between Data Mining Algorithms over Classification
Techniques in Squid Dataset. International Journal of Artificial Intelligence, 9 (12 A),
59–66.
Kale, B., and Yüksel, G., (2020). Veri Madenciliği Sınıflandırma Algoritmaları Ile E-Posta
Önemliliğinin Belirlenmesi. Ç.Ü Fen ve Mühendislik Bilimleri Dergisi
[Determination of E-Mail Importance with Data Mining Classification Algorithms.
CU Journal of Science and Engineering Sciences], 39 (9).
Kim, S., and Lee, H. (2021). Customer Churn Prediction in Influencer Commerce: An
Application of Decision Trees. Procedia Computer Science, 199, 1332–1339.
Larson, D., and Chang, V. (2016). A Review and Future Direction of Agile, Business
Intelligence, Analytics and Data Science. International Journal of Information
Management, 36 (5), 700–710.
Livari, R. T. and Ghalam, N. Z. (2020). Customers Grouping Using Data Mining
Techniques in the Food Distribution Industry (A Case Study). SRPH Journal of
Applied Management and Agile Organisation.
Naik, A., and Samant, L. (2016). Correlation Review of Classification Algorithm Using
Data Mining Tool: WEKA, Rapidminer, Tanagra, Orange and Knime. Procedia
Computer Science, 85, 662–668.
Omran, M. G. H., Salman, A., and Engelbrecht, A. P. (2006). Dynamic Clustering Using
Particle Swarm Optimization with Application in Image Segmentation. Pattern
Analysis and Applications, 8 (4), 332–344.
Pelleg, D., and Moore, A. (2000). X-Means: Extending K-Means with Efficient Estimation
of the Number of Clusters. ICML '00: Proceedings of the Seventeenth International
Conference on Machine Learning.
Tan, Y., Chen, H., Zhang, J., Tang, R., and Liu, P. (2022). Early Risk Prediction of Diabetes
Based on GA-Stacking. Applied Sciences (Switzerland), 12 (2).
Tan, Z., Yan, Z., and Zhu, G. (2019). Stock Selection with Random Forest: An Exploitation
of Excess Return in the Chinese Stock Market. Heliyon, 5 (8), e02310.
Thomas, J., Vinod, P., and Raj, N. S. (2014). Towards Spam Mail Detection Using Robust
Feature Evaluated with Feature Selection Techniques. International Journal of
Engineering and Technology, 6 (5), 2144–2158.
Utomo, W. (2021). The Comparison of K-Means and k-Medoids Algorithms for Clustering
the Spread of the Covid-19 Outbreak in Indonesia. ILKOM Jurnal Ilmiah, 13 (1), 31–
35.
Wei, C. P., and Chiu, I. T. (2002). Turning Telecommunications Call Details to Churn
Prediction: A Data Mining Approach. Expert Systems with Applications, 23 (2), 103–
112.
Wu, Z., Lin, W., Zhang, Z., Wen, A., and Lin, L. (2017). An Ensemble Random Forest
Algorithm for Insurance Big Data Analysis. Proceedings - 2017 IEEE International
Conference on Computational Science and Engineering and IEEE/IFIP International
Conference on Embedded and Ubiquitous Computing, CSE and EUC 2017, 1, 531–
36.
Chapter 4
Özge Doğuç*
Management Information Systems, Medipol University,
Istanbul, Turkey
Abstract
Introduction
Customer relationship management has become an area that benefits from and
is fed by data mining analysis, owing to advancing technology, a competitive
environment, and the information age. Customer relationship management aims
to increase customer loyalty, to stay in constant communication with the
customer and to provide the right service at the right time. Positive
relations with the customer are important for reducing customer loss rates
and acquiring new customers in a competitive environment. Using information
systems and algorithms to make sound decisions about the future will bring
significant advances in all areas of the company (Payne, 2005).
As a result of today's technological developments and growing data collection
and storage capacity, producing meaningful information from data has become
one of the issues that companies attach importance to. How the data are used
varies by sector: the problem to be solved, the data, and the algorithms may
differ for each company and sector. The study by Doğan, Erol and Buldu (2014)
applies data analysis algorithms to customer relationship management in the
insurance industry. The data cover various insurance policy sales made by one
of the most important insurance companies in Turkey between 2010 and 2012.
The data set consists of approximately 12 thousand rows and 9 columns
containing insurance information such as motor, fire and earthquake
insurance. To protect personal information, the customer ID and number are
hidden. The Apriori algorithm was used, with the letters "T" and "F"
indicating whether or not a customer holds a given type of insurance. After
data preparation, association analysis revealed significant results such as
“47% of those who purchased compulsory earthquake insurance also bought fire
insurance.” In the light of this information, future campaigns can be
organized to attract customers' attention. In the insurance sector, it is
thus possible to obtain meaningful information from customer data through
data analysis and to use this information in customer relationship management
and marketing activities. The study shows that looking at the preferred
policy types makes it possible to decide which customer should be directed to
which campaign in the future, and thus to increase efficiency.
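A statement of the form “X% of those who purchased A also bought B” is the confidence of the association rule A → B. The sketch below computes the support and confidence measures on which Apriori-style analysis relies; it is not a full Apriori implementation, and the five transactions are invented for illustration rather than taken from the insurance data set described above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical policy holdings, one set per customer.
transactions = [
    {"earthquake", "fire"},
    {"earthquake"},
    {"earthquake", "fire", "motor"},
    {"motor"},
    {"earthquake", "motor"},
]

rule_confidence = confidence(transactions, {"earthquake"}, {"fire"})
print(f"earthquake -> fire: confidence {rule_confidence:.0%}")
```

In a T/F encoded table like the one described above, each set would correspond to the columns marked "T" for one customer.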
Hsieh (2004) analyzed bank customer data with data mining and drew
conclusions from the results. Bank databases are large, as banks have
Analysis of Customer Churn in the Banking Industry Using Data Mining 55
large volumes of daily and monthly transactions, registered accounts and a
wide range of customers, which also makes data analysis challenging. The
study was conducted on credit card customers, and a behavioral scoring model
was created by analyzing the historical data of existing customers. Three
main profitable customer groups were identified using variables such as
repayment behavior, frequency of use, duration of use and demographic
information. The Apriori association algorithm was used. The study shows that
identifying customers with the behavioral scoring model provides useful
information and facilitates marketing strategy development. It also shows
that credit and behavior scoring models are useful for making more accurate
decisions on whether to grant credit to new loan applicants. The customer
profiles helped marketers determine efficient strategies and conduct more
personalized customer relationship marketing.
Gürsoy's (2010) study, titled “Customer Dropout Analysis in the
Telecommunication Sector”, examined customer churn in the telecommunication
sector with data analysis algorithms. A customer who cancels a membership
agreement with a company and becomes a customer of a competitor is called a
lost customer or a churn customer. Customer loss is closely related to
customer loyalty. Price advantages alone are not enough to keep a customer;
adding new value-added services to products is important for ensuring
loyalty. The main purpose of customer churn analysis is to find customers who
are likely to be lost and to take marketing or other strategic steps to avoid
losing them. Which customers count as lost differs by sector and by data. For
example, a credit card customer can easily start using another bank's credit
card without canceling the current one. In this case, the rate of reduction
in spending can be used to detect the loss of the customer. Customer loss can
be a big problem for banks, insurance companies and telecommunications
companies operating in competitive markets. The cost of acquiring new
customers is increasing day by day, so instead of organizing campaigns to
gain new customers, companies focus on increasing customer loyalty. A good
model that handles large amounts of data helps companies plan for competitive
advantage and adopt more accurate strategies. In the study of Şimşek and Oman
(Şimşek, 2018), the customer loss of a telecommunication company in Turkey
was estimated with SPSS. Identifying which customers are likely to be lost,
determining which customers should be retained, and developing strategies to
retain profitable customers are all matters of customer retention. The first
step is to determine the churn ratio. Using this ratio, companies can make
predictions about the future behavior of new customers and develop
appropriate strategies beforehand.
Logistic regression and decision tree algorithms were used in this study.
The target variable in the data set, whether or not the customer left, is
coded as 0 or 1. After preprocessing, the classes 0 and 1 were balanced at
ratios of 49% to 50%. The next step was to eliminate variables that have no
effect on the target variable, which were identified by examining the
correlations between the variables. The decision tree model produced
meaningful information such as “If the average of local and long-distance
calls is higher than 218, 42% of the subscribers will go to churn.” The
prediction accuracy rate of the logistic regression model that they did not
churn was 74%, and the accuracy rate of prediction that they did not churn
was 66%. The information obtained from this study can be used in marketing
activities: when campaigns are organized for products and services,
subscribers who are more likely to churn can be prioritized, thus reducing
customer churn (Gürsoy, 2010).
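One simple way to screen out variables that have no linear relationship with a 0/1 target is to compute each variable's Pearson correlation with the target and keep only those above a cutoff. The sketch below assumes this approach purely for illustration; the cited study does not state its exact procedure, and the column names, values and the 0.1 cutoff are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:   # a constant column carries no signal
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(columns, target, min_abs_corr=0.1):
    """Keep the columns whose |correlation| with the target reaches the cutoff."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


# Hypothetical subscriber data: 1 = churned, 0 = stayed.
churn = [0, 0, 1, 1, 0, 1]
columns = {
    "avg_call_minutes": [100, 120, 250, 300, 90, 220],
    "account_suffix": [1, 2, 1, 2, 3, 3],   # unrelated to churn by construction
}

print(select_features(columns, churn))
```

Correlation only captures linear association, so a variable dropped by this screen could still matter to a decision tree; the cutoff is a judgment call rather than a fixed rule.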
The study by Savaşçı and Tatlıdil (2006) evaluates banks' customer relations
practices in terms of customer loyalty. Banks in the financial services
sector aim to increase customer loyalty, maintain the number of customers,
and ensure satisfaction by giving importance to customer relations, which
requires creating a customer data warehouse. In this study, customer
impressions were obtained through credit cards, and the credit card
strategies used for customer loyalty were evaluated. A survey was conducted
with 500 card users in the İzmir district. The reasons why the surveyed
consumers use their most preferred credit cards were examined. When asked
whether they use the installment facility of their credit cards, 91% stated
that they pay for their purchases in installments. The advantage with the
highest level of satisfaction, at a rate of 84%, is that the most used card
is widely accepted and easy to use. In today's technological and competitive
environment, banks attach great importance to customer relationship
management to retain their customers and increase customer loyalty and
satisfaction. For this, it is important to keep customer data, to obtain
meaningful results using data mining methods, and to use communication
technologies effectively and efficiently.
Analysis of Customer Churn in the Banking Industry Using Data Mining 57
Gülpınar (2013) estimates customer loss in the Turkish telecommunications
sector with Artificial Neural Network analysis and, using Social Network
analysis of incoming and outgoing calls, identifies appropriate marketing
activities for influential customers. A 14-question survey was conducted with
100 different GSM users. For the Social Network analysis, respondents were
asked for the phone numbers of the last 5 people they had talked to, and a
survey was conducted with those people as well. Thus, a wide communication
network was created. A model with a high number of iterations was built, and
the margin of error was reduced below 0.05. This study has added a different
approach to customer loss analysis by examining the communication network
structures of customers using social network analysis. Customer value is
influenced not only by the customer's own behavior but also by the customer's
network. In the telecommunications sector, customer loss was affected not
only by a customer's personal characteristics but also by the people the
customer is connected to. In other fields such as banking, this approach to
customer analysis could support the emergence of better forecasting methods.
In a competitive telecommunications market, the company's goal of
minimizing customer loss by increasing customer loyalty is as important as
gaining customers. As studies have shown, acquiring customers is 20 times
more costly than retaining existing customers. Customer relationship
management (CRM) establishes and manages strong, long-term customer
relationships with high loyalty. For this purpose, customer loss analysis
tools are developed and used (Vafeiadis et al., 2015).
In this study, the Decision Tree, Random Forest, and Artificial Neural
Network data mining algorithms were applied to a data set containing bank
customer data. The data mining methodology used is CRISP-DM, which stands for
Cross-Industry Standard Process for Data Mining. It is the most widely used
process model in data mining and data science and has 6 stages.
The data set used in this study for customer churn analysis was obtained
from open sources and consists of 10,000 customers and 14 variables. One of
them, Exited, is the target (class) variable to be predicted. These are the
variables:
Explanation (left): It is seen that the extreme data is after the age of 60.
Explanation (right): It is seen that the extreme data is before 400.
Explanation (left): There is a total of 60 people with 4 product ownership,
seen as extreme data. 1 and 2 have agglomeration.
Explanation (right): It is seen that the estimated salary distribution is
balanced.
Figure 4. Distribution of number of years the customer has been a customer of the
bank by loss of customers.
Figure 6 shows that the rate of leaving the bank is lower among active
customers (1). In addition, the number of active customers is higher.
Figure 7 shows that customer loss is approximately equal in every salary
range.
All 60 customers who use 4 bank products have left the bank. Among the
remaining customers, Figure 8 shows that customer loss is proportionally
highest for those using 3 bank products, followed by those using 1 and 2.
Data Preparation
The next stage of data analysis is data cleaning, which makes the data usable
for the model. At this stage, the missing data in the data set is completed
according to the determined rules, and unnecessary variables such as customer
number and surname are discarded. In this study, KNIME was used, and these
two variables were removed from the data set with the Column Filter node.
The Exited variable, which is our target (class) variable, was converted
into a categorical (string) variable with the Number to String node so that
it could be used in the analysis. With the same node, the activity and credit
card ownership variables were also made categorical (string).
For a better analysis, new variables that are thought to be relevant can
be derived and added to the data set. In this study, the T/NOP variable was
produced with a Math Formula node; it is the ratio of the usage time to the
number of products used (Tenure/NumOfProducts). Variables with continuous
numerical values can be used in the model by dividing them into categories;
the Rule Engine node is used for this.
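The preparation steps above were carried out with KNIME nodes. The following
plain-Python sketch only illustrates the same ideas on a single record and is
not the study's workflow. Apart from Exited, Tenure, and NumOfProducts, the
column names are assumptions, and the age bins in the rule are hypothetical.

```python
# Identifier columns to discard (names are assumptions, not from the text).
DROP_COLUMNS = {"CustomerId", "Surname"}
# Numeric flags converted to categorical strings (names partly assumed).
TO_STRING = {"Exited", "IsActiveMember", "HasCrCard"}


def age_group(age):
    """Hypothetical binning rule, standing in for a Rule Engine node."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


def prepare(row):
    """Return a cleaned copy of one customer record (a dict)."""
    out = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
    for col in TO_STRING:
        out[col] = str(out[col])
    # Derived variable T/NOP = Tenure / NumOfProducts
    # (assumes every customer owns at least one product).
    out["T/NOP"] = out["Tenure"] / out["NumOfProducts"]
    out["AgeGroup"] = age_group(out["Age"])
    return out


# Made-up record used only to exercise the function.
raw = {"CustomerId": 1, "Surname": "X", "Age": 42, "Tenure": 6,
       "NumOfProducts": 2, "HasCrCard": 1, "IsActiveMember": 0, "Exited": 1}
clean = prepare(raw)
print(clean)
```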
In the data set, 7963 customers did not leave the bank, while 2037 did. The
large difference between these two numbers can make it difficult for the
model to learn. Therefore, the SMOTE node was used to increase the number of
rows by generating and adding data with the same statistical properties,
without disturbing the structural features of the data set. SMOTE does not
duplicate existing values; it creates new values based on the distance
between existing values.
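The interpolation idea behind SMOTE can be sketched in a few lines of plain
Python. This is a simplified illustration, not the KNIME SMOTE node: the
function name, the parameter k, and the sample points are hypothetical. Each
synthetic point is placed on the segment between a minority-class point and
one of its nearest minority-class neighbours.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # The new point lies between the base point and the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up two-dimensional minority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every new point is a convex combination of two existing points, the
synthetic rows stay inside the region already covered by the minority class.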
Data Modeling
After the Data Understanding and Data Preparation stages comes the Modeling
stage. The data mining algorithms to be applied to the prepared data are
selected, their parameters are determined, and the model is built. When the
model is ready, the accuracy value is calculated to measure its success. The
model emerges as a result of processing the data set with machine learning
algorithms. The inputs given to the model beforehand are hyperparameters,
which may differ for each algorithm. The image shows its structure. In this
study, the Decision Tree, Random Forest, and Artificial Neural Network
algorithms were used.
According to the confusion matrix given in Table 4, the model correctly
predicted 1044 test records of customers who left the bank, that is, records
with a value of 1 for the Churn variable. It also correctly predicted 6621
records of customers who did not leave the bank, that is, records with a
value of 0 for the Churn variable. The model incorrectly predicted 0 for 756
records with a true value of 1 and incorrectly predicted 1 for 579 records
with a true value of 0. Based on this table, the success of the model can
easily be measured with various metrics.
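As an illustration of such measures, the sketch below computes accuracy,
precision, recall, and F1 from the four counts reported for Table 4. It
assumes that 756 is the number of churners predicted as non-churners (false
negatives) and 579 the number of non-churners predicted as churners (false
positives); the function name is hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Standard success measures derived from a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted churners who churned
    recall = tp / (tp + fn)      # share of actual churners who were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the Table 4 paragraph; the fp/fn assignment is an assumption.
table4 = metrics(tp=1044, tn=6621, fp=579, fn=756)
for name, value in table4.items():
    print(f"{name}: {value:.3f}")
```

Under that reading, the accuracy is about 0.85 while the recall for churners
is 0.58, which shows why accuracy alone can hide weak performance on the
class of interest.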
variable (class) and compared it with the actual values. Thus, model success
metrics and Confusion Matrix were found. Table 3 shows the matrix.
The confusion matrix in Table 5 shows that the model correctly predicted 762
test records of customers who left the bank, that is, records with a value of
1 for the Churn variable. For customers who did not leave the bank, the model
correctly predicted 7017 records with a value of 0 for the Churn variable.
The model made incorrect predictions for 1038 records with a true value of 1,
which it predicted as 0, and for 183 records with a true value of 0, which it
predicted as 1.
According to the confusion matrix given in Table 6, the model correctly
predicted 889 test records of customers who left the bank, that is, records
with a value of 1 for the Churn variable, and 6864 records of customers who
did not leave the bank, with a value of 0. The model incorrectly predicted 0
for 944 records with a true value of 1 and incorrectly predicted 1 for 303
records with a true value of 0.
Conclusion
In the study, data mining models were created to predict which bank
customers will leave the bank and which will not. The data understanding
phase was carried out with the Python programming language, and the models
were then created on the KNIME platform. The data set was divided into two
parts: the model was built with the training part, and its accuracy values
were computed with the test part. Algorithms used in modeling; Decision Tree
74 Özge Doğuç
References
Alaybeg, F., 2019. Data Mining Introduction, Methods and Methodologies. Medium.
https://furkanalaybeg.medium.com/veri-madencili%C4%9Fi-ve-y%C3%B6ntemleri-
d0e2fd238e44.
Çınar, U., 2018. Neural Networks and Application with R Program.
https://www.veribilimiokulu.com/yapay-sinir-aglari/.
Doğan, B., Erol B. and Buldu, A., 2014. Using the Association Rule for Customer
Relationship Management in the Insurance Industry. International Journal of
Advances in Engineering and Pure Sciences 3, 105-114.
Gulpinar, V., 2013. Customer Loss Analysis in Turkish Telecommunications Market with
the Help of Artificial Neural Networks and Social Network Analysis. Marmara
University Journal of Economics and Administrative Sciences 34 (1), 331-350.
Gupta, D. and Kamilla, U., 2014. Cyber Banking in India: A Cross-Sectional Analysis
Using Structural Equation Model. IUP Journal of Bank Management 13 (2).
Gürsoy, U. Ş., 2010. Customer Churn Analysis in Telecommunication Sector. Journal of
Istanbul University Faculty of Business 39 (1), 35–49.
Hsieh, N. C., 2004. An Integrated Data Mining and Behavioral Scoring Model for
Analyzing Bank Customers. Expert Systems with Applications 27 (4), 623–633.
Ivanchenko, O., Mirgorodskaya, O. N., Baraulya, E. V. and Putilina T. I., 2019. Marketing
Relations and Communication Infrastructure Development in the Banking Sector
Based on Big Data Mining. International Journal of Economics and Business
Administration 7, 176-184.
Maryani, I. and Riana, D., 2017. Clustering and Profiling of Customers Using RFM for
Customer Relationship Management Recommendations. In 2017 5th International
Conference on Cyber and IT Service Management (CITSM), 1-6.
Mukherjee, A. and Nath, P., 2003. A Model of Trust in Online Relationship Banking.
International Journal of Bank Marketing 5, 5-15.
Parviainen, P., Tihinen M., Kääriäinen J. and Teppola, S. 2017. Tackling the Digitalization
Challenge: How to Benefit from Digitalization in Practice. International Journal of
Information Systems and Project Management 5 (1), 63-77.
Payne, A., 2005. Handbook of CRM. Routledge.
Sassen, S., 2000. Globalization and Its Discontents: Essays on the New Mobility of People
and Money. New York: NewPress.
Savaşçı, I. and Tatlıdil R., 2006. The Effect of the Crm (Customer Relationship
Management) Strategy Applied by the Banks in the Credit Card Market on Customer
Loyalty. Aegean Journal of Academic Perspective 6 (1), 62-73.
Şimşek, H., 2018. Machine Learning Lessons 5a: Random Forest (Classification). Veri
Bilimi Türkiye. [Data Science Turkey].https://medium.com/data-science-en/machine-
%C3%B6%C4%9Frenmesi-desleri-5-bagging-ve-random-forest-2f803cf21e07.
Vafeiadis, T., Diamantaras, K. I., Sarigiannidis, G. and Chatzisavvas K. C., 2015. A
Comparison of Machine Learning Techniques for Customer Churn Prediction.
Simulation Modeling Practice and Theory 55, 1-9.
Wijaya, A. and Girsang, A. S., 2015. Use of Data Mining for Prediction of Customer
Loyalty. CommIT (Communication and Information Technology) Journal 10 (1), 41-47.
Chapter 5
Abstract
Introduction
Increasing energy demand and the threat of global warming lead to the
exploitation of additional and cleaner energy sources. A consequence of this
trend is the growing penetration of SSDES. These installations may now
belong to microgrid consumers, which become prosumers, i.e., both producers
and consumers of energy. Photovoltaic (PV) panels, in particular, have seen
significant growth in recent years, with incentives given by EU countries
like Romania (see Law no. 184/2018) motivating their adoption and turning it
into a business case. The need to support bidirectional power flows resulting
from transactive energy in local communities, as well as the need to decrease
power losses, leads to changes in the microgrids, namely in Low Voltage (LV)
distribution, pointing in the direction of more responsive and efficient
Smart Grids (SGs).
Energy efficiency is a tool for saving money and resources, representing
a necessity for flexible adaptation to users' demands. Because the use of
electricity plays a fundamental role in modern lifestyles, users' load
characteristics must reflect people's lives at work and leisure. The volume
of information available and necessary for the operation, management,
planning, and security of the microgrids has gradually increased with
technological development, requiring the introduction of computing techniques
and intelligent solutions (Piccinini et al., 2015). Even if the prosumers
bring high benefits, their behavior is intermittent, and, for this reason,
the DNOs must consider a comprehensive microgrid planning strategy. Moreover,
the microgrids should be able to function both connected to the grid and
independently (autonomously) from it (Andoni et al., 2010). Developed for
communities, the microgrids integrate local electricity production to satisfy
the exact demand of the consumers. The active consumers and prosumers are
defined in the context of recent paradigms of energy independence, energy
policy, and distributive management.
The Crowdsourcing Concept-Based Data Mining Approach … 79
The microgrid is seen as the instrument for building a coordinated structure
for the fair integration of SSDES, which will be an instructive challenge for
DNOs and will require a different operation plan. The prosumers form the
active cells of the microgrids, in which any cell can supply installations to
the DNO to optimize the hosting capacity of the grid and hence SSRES
implementation in a profitable and efficient way (Espe et al., 2018).
In Romania, energy efficiency is improved by the increasing amounts of
electricity generated by small producers (prosumers) with installed power up
to 27 kW. These amounts are traded directly to the provider with which a
bilateral contract has already been signed (Neagu and Grigoras, 2020a).
The prosumer's behavior must be balanced, storing surplus power during
low-demand periods and delivering it when demand increases
(Diahovchenko et al., 2020). A recent report published by the European Union
Smart Grids Task Force, Final Report (2020), adds an explicit direction to
build a smart meter (SM) roadmap to satisfy the needs of future energy
markets through a modular and resilient metering architecture. Thus, the SM
should provide useful information on the shape of the prosumers' load
(Chicco et al., 2020).
The proliferation of SSDES changes both the operating conditions and the
management requirements of the microgrids, which must now integrate new
technologies and procedures. In recent years, the SG concept has been
implemented at several scales and in several initiatives in Romania, with
much of the hardware already tested and approved, albeit in isolation. In
this context, the adoption of a framework or unifying architecture of
interfaces and protocols based on norms and standards is essential. Thus, a
reference architecture for the data exchange between devices and electrical
systems must be defined, allowing the products, services, protocols, and
interfaces to interact (Kazmi et al., 2017; Deng et al., 2017). Figure 1
illustrates the transition from classical electricity grids to microgrids.
80 B. C. Neagu, M. Gavrilaș, O. Ivanov et al.
The mitigation of energy poverty (Neagu et al., 2020d) can be achieved using
crowdsourcing, a concept first introduced in 2005 (Surowiecki, 2005).
Crowdsourcing (Maxim et al., 2016) represents an emerging trend that
integrates contributions from users and the collective wisdom of the crowd
(Howe, 2006). It is very important to build a service-based approach to
make these crowdsourced sensor cloud data available. It can also be an
effective means to enable the crowd to provide a service-sharing community
within a geographical area by using their smartphones (Alt et al., 2010).
Users can take advantage of services from neighboring users through this
crowdsourced service community. Since the crowd (i.e., service providers) is
mobile, the availability of crowdsourced services to users is limited by
spatio-temporal adjacency, i.e., both service providers and users should be
within the same spatial region at a particular time.
A key issue is selecting and composing services from such a large number
of everchanging crowdsourced sensor cloud services to fulfill users’
requirements in a real-time fashion and based on spatio-temporal features. As
domains (Mika and Goudz, 2020). Nevertheless, not all fields are fully ready
to assimilate blockchain technology. The current technological opportunities
must be analyzed in each particular case together with the challenges that
the end-users face and with how a new decentralized architecture could create
value for them (Wang et al., 2019).
Fortunately, the electricity industry is a suitable candidate for innovation
through blockchain technology. It integrates a complex supply chain that
needs increased transparency and improved data management. In addition, it
contains a highly transactional trading market that would benefit from
instant settlement (Unguru, 2018). The clarity and immutability of the
blockchain can empower the end-users and consumers of this industry.
The solution for the optimal location of small-scale renewable energy sources
(SSRES) uses a simulation algorithm based on the iterative computation of the
steady state, in order to minimize losses in the microgrid. The main
objective is to locate the generation sources so that energy losses in the
microgrid are minimized. In essence, for a radial distribution network with N
nodes and a known structure, the optimization variables refer to the capacity
of the distributed generators, known a priori, located in the nodes of the
microgrid. In a first phase, all the generation sources are located in each
node of the microgrid. The optimization variables are included in the column
vector $x$, whose length is equal to the number of nodes,
$x = [x_1, x_2, \ldots, x_N]^t$, where $t$ denotes the transpose of the
vector. The entries $x_i$ ($i = 1, \ldots, N$) represent a discrete value of
the capacity of the SSRES connected in an arbitrary node $i$ (zero if the
SSRES is not present).
The optimization model considers the variation of the SSRES generated
power and of the active and reactive load consumed. The analysis timescale is
divided into H time intervals $\Delta t_h$, for any $h = 1, \ldots, H$
(Neagu et al., 2019).
$$\min F(x) = \sum_{h=1}^{H} \sum_{d=1}^{D} \Delta P_h^{(d)}(x) \, \Delta t_h \tag{1}$$
$$U_{h,\min}^{(i)} \le U_{i,h} \le U_{h,\max}^{(i)} \tag{3}$$
where $U_{h,\min}^{(i)}$ and $U_{h,\max}^{(i)}$ are the minimum and maximum
effective values of the voltage for bus $i = 1, \ldots, N$ at
$h = 1, \ldots, H$.
2. Branch thermal limits: the effective value of the current over the branch
$d = 1, \ldots, D$, at hour $h = 1, \ldots, H$, denoted as $I_h^{(d)}$, must
not exceed the maximum admissible current for the line, $I_{h,\max}^{(d)}$:
$$I_h^{(d)} \le I_{h,\max}^{(d)} \tag{4}$$
$$Q_{h,\min}^{(i,SSRES)} \le Q_h^{(i,SSRES)} \le Q_{h,\max}^{(i,SSRES)} \tag{6}$$
where $Q_{h,\min}^{(i,SSRES)}$ and $Q_{h,\max}^{(i,SSRES)}$ are the minimum
and maximum allowable reactive power of the SSRES, and $Q_h^{(i,SSRES)}$ is
the reactive power of the SSRES in bus $i = 1, \ldots, N$ at hour
$h = 1, \ldots, H$.
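The loss objective and the limit constraints of this section can be
illustrated with a short sketch. It assumes that the per-branch power losses
for each time interval have already been obtained from a steady-state
computation for a given vector x (not shown here); the function names and all
numeric values are hypothetical.

```python
def total_energy_losses(branch_losses_kw, dt_hours):
    """Sum over intervals h and branches d of loss[h][d] * dt[h], in kWh."""
    return sum(
        loss * dt
        for losses_h, dt in zip(branch_losses_kw, dt_hours)
        for loss in losses_h
    )


def within_limits(values, lower, upper):
    """Check lower[k] <= values[k] <= upper[k] for every entry."""
    return all(lo <= v <= hi for v, lo, hi in zip(values, lower, upper))


# Made-up example with H = 2 intervals of one hour and D = 3 branches.
losses = [[0.5, 0.2, 0.1],
          [0.4, 0.3, 0.1]]
objective = total_energy_losses(losses, dt_hours=[1.0, 1.0])

# Made-up bus voltages (per unit) checked against a 0.9 to 1.1 band.
voltages_ok = within_limits([0.98, 1.02], lower=[0.9, 0.9], upper=[1.1, 1.1])
print(objective, voltages_ok)
```

A candidate placement would be kept only if all limit checks pass, and the
feasible candidate with the smallest objective value would be preferred.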
PV panels. The buses where prosumers are present are depicted with a blue
light on the house roof. The objective function for the optimal case
(Scenario 0), when all PV panels are connected to the microgrid, was about
96.92 kWh.
In order to carry out a deeper analysis of the impact of prosumer operation
on microgrid power losses, four scenarios were considered, namely:
Scenario 1 (only one prosumer is connected, five cases), Scenario 2 (two
prosumers are connected simultaneously, ten cases), Scenario 3 (three
prosumers are connected simultaneously, ten cases), and Scenario 4 (four
prosumers are connected simultaneously, five cases). All of the proposed
scenarios are compared with the initial case and the optimal scenario
(Scenario 0). For instance, the daily energy losses in Scenario 1 are
depicted in Figure 6.
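The case counts of the four scenarios follow from choosing one, two, three,
or four prosumers out of five. The sketch below enumerates them; the prosumer
labels are hypothetical placeholders rather than the bus numbers of the
study.

```python
from itertools import combinations

# Hypothetical labels for the five prosumers of the test microgrid.
prosumers = ["P1", "P2", "P3", "P4", "P5"]

# Scenario k connects exactly k prosumers simultaneously.
scenarios = {k: list(combinations(prosumers, k)) for k in range(1, 5)}

for k, cases in scenarios.items():
    print(f"Scenario {k}: {len(cases)} cases")
```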
The minimum value of the losses is obtained when only prosumer no. 15 is in
operation. The reduction is 20.23% compared to the reference case. In the
second scenario, there are ten cases, as shown in Figure 7. The minimum value
of losses is obtained if prosumers 7 and 15 are in operation, with a
reduction of the daily energy losses of 26.07% compared to the reference
case. For the last two scenarios, the results are indicated in Figures 8 and
9. In Scenario 3, the minimum value of losses is obtained if prosumers 7, 15,
and 21 are in operation on the network, lower by 30.71% compared to the
reference case, while in Scenario 4 the energy losses are reduced by 34.37%
(prosumer 27 is not in operation), very close to the optimal scenario
(35.06%).
Conclusion
References
Alt, F., Shirazi, A. S., Schmidt, A., Kramer, U., and Nawaz, Z., (2010). Location-based
crowdsourcing: extending crowdsourcing to the real world. 6th Nordic Conference on
Human-Computer Interaction: Extending Boundaries, 13–22.
Andoni, M., Robu, V., Flynn, D., Abram, S., Geach, D., Jenkins, D., McCallum, P., and
Peacock, A., (2010). Blockchain technology in the energy sector: A systematic review
of challenges and opportunities, Renewable and Sustainable Energy Reviews, 143-174.
Chicco, G., Labate, D., Notaristefano, A., and Piglione, F., (2020). Unveil the Shape: Data
Analytics for Extracting Knowledge from Smart Meters, Energia Elettrica
Supplement Journal, 1-16.
Deng, S., Huang, L., Taheri, J., Yin, J., Zhou, M., and Zomaya, A. Y., (2017). Mobility-
Aware Service Composition in Mobile Communities, IEEE Transactions on Systems,
Man, and Cybernetics: Systems, 555–568.
Diahovchenko, I., Kolcun, M., Čonka, Z., Savkiv, V., and Mykhailyshyn, R., (2020).
Progress and Challenges in Smart Grids: Distributed Generation, Smart Metering,
Energy Storage and Smart Loads. Iran Journal Science Technology Transmission
Electrical Engineering, 1-15.
Espe, E., Potdar, V., and Chang, E., (2018). Prosumer Communities and Relationships in
Smart Grids: A Literature Review, Evolution and Future Directions. Energies, 2528.
European Smart Grids Task Force - Expert Group 3, Demand Side Flexibility - Perceived
barriers and proposed recommendations, Final Report, Apr. 2019.
Howe, J., (2006). The rise of crowdsourcing. Wired magazine, 1–4.
Kazmi, S. A. A., Shahzad, M. K., Khan, A. Z., and Shin, D. R., (2017). Smart Distribution
Networks: A Review of Modern Distribution Concepts from a Planning Perspective.
Energies, 501.
Maxim, A., Mihai, C., Apostoaie, C.-M., Popescu, C., Istrate, C., and Bostan, I., (2016).
Implications and Measurement of Energy Poverty across the European Union.
Sustainability, 483.
Mika, B., and Goudz, A., (2020). Blockchain-technology in the energy industry: blockchain
as a driver of the energy revolution? With focus on the situation in Germany. Energy
Systems.
National Regulatory Authority for Energy. The 228 Order for the Approval of the Technical
Norm Technical Conditions for Connection to the Public Electrical Networks of the
Prosumers; National Regulatory Authority for Energy: Bucharest, Romania, 2018.
Neagu B. C., and Grigoras, G., (2020a). A Fair Load Sharing Approach Based on Microgrid
Clusters and Transactive Energy Concept, 12th Int. Conf. Electr. Comp. and Artif.
Intell., Bucharest, Romania, 1-4.
Neagu B. C., and Grigoras, G., (2020b). A Data-Mining-Based Methodology to Identify
the Behavioural Characteristics of Prosumers within Active Distribution Networks,
International Symposium on Fundamentals of Electrical Engineering 2020 (ISFEE),
Bucharest, Romania, 1-4.
Neagu, B. C., Ivanov, O., Grigoras, G., and Gavrilas, M., (2020c). A New Vision on the
Prosumers Energy Surplus Trading Considering Smart Peer-to-Peer Contracts.
Mathematics, 235.
Neagu, B.-C., Ivanov, O., Grigoras, G., Gavrilas, M., and Istrate, D.-M. (2020d). New
Market Model with Social and Commercial Tiers for Improved Prosumer Trading in
Microgrids. Sustainability, 7265.
Neagu, B. C., Grigoras, G., Ivanov, O., (2019). An Efficient Peer-to-Peer Based Blokchain
Approach for Prosumers Energy Trading in Microgrids. International Conference on
Modern Power Systems (MPS), Cluj Napoca, Romania, 1-4.
Piccinini, E., Gregory, R. W., and Kolbe, L. M., (2015). Changes in the producer-consumer
relationship-towards digital transformation. Changes, 1634-1648.
Surowiecki, J., (2005). The Wisdom of Crowds; Anchor: San Diego, CA, USA.
The Romanian Parliament, “Law no. 184/2018 for approving the Government Emergency
Ordinance no. 24/2017 regarding the modification and updating of Law no. 220/2008
for determining the incentive system for producing energy from renewable energy
sources and the modification of other normative acts,” Official Gazette, Part I, No.
635/20.07.2018.
Unguru, M., (2018). Blockchain technology: opportunities for the energy sector,
EUROINFO, 53-58.
Wang, S., Taha, A. F., Wang, J., Kvaternik, K., and Hahn, A., (2019), Energy Crowd-
sourcing and Peer-to-Peer Energy Trading in Blockchain-Enabled Smart Grids. IEEE
Transactions on Systems, Man, and Cybernetics: Systems, 1612–1623.
Chapter 6
Active Learning
Abstract
Introduction
(in a batch scenario, after incorporating it into the labeled dataset). We depict
the aforementioned elements in Figure 2. Three aspects must be considered
when looking for the most valuable samples (Wu, 2019): informativeness
(contains rich information that would benefit the objective function),
representativeness (how many other samples are similar to it), and diversity
(the samples do not concentrate in a particular region, but rather are scattered
across the whole space).
Figure 2. Schematic diagram of active learning. The oracle provides the data directly
to the ML model in an online setting. In contrast, in an offline setting, such instances
are persisted into a dataset so that the model can leverage them when retrained. The
diagram holds for most cases. An exception could be an oracle being directly asked
for a particular kind of demonstration, and therefore data selection would not be
needed.
Query Strategies
(Settles, 2010), the least confidence criterion (selects the sample whose most
probable label has the least probability value) (Lewis, 1995), and the Gibbs
error criterion (selects the sample with the largest Gibbs error, which measures
the expected error of a Gibbs classifier predicting the label by sampling the
current label distribution) (Nguyen et al., 2013). Furthermore, they introduced
two greedy algorithms (Hamming loss and F1 loss), which achieved the best
performance under worst-case and average-case scenarios.
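The least confidence and Gibbs error criteria can be written as scores over
the predicted class probabilities of each unlabeled sample. The sketch below
is a minimal illustration with hypothetical function names and made-up
probabilities; it takes the Gibbs error to be one minus the sum of squared
class probabilities.

```python
def least_confidence_score(probs):
    """Higher when the most probable label has a low probability."""
    return 1.0 - max(probs)


def gibbs_error_score(probs):
    """Expected error when the label is sampled from the distribution."""
    return 1.0 - sum(p * p for p in probs)


def select_most_uncertain(pool_probs, score):
    """Index of the unlabeled sample with the highest uncertainty score."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))


# Made-up class probabilities for three unlabeled samples.
pool = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
by_confidence = select_most_uncertain(pool, least_confidence_score)
by_gibbs = select_most_uncertain(pool, gibbs_error_score)
print(by_confidence, by_gibbs)
```

In the binary case both criteria rank samples identically; they can differ
once there are three or more classes.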
One of the most popular approaches is the query-by-committee (QBC)
framework (Seung et al., 1992), based on the idea that given a committee of
machine learning models, the committee’s variance can be indirectly
measured by examining the disagreement between their predictions.
Therefore, candidate data instances among the pool of unlabeled data are
selected based on the committee’s disagreement. Furthermore, multiple
strategies were proposed to measure such disagreement; e.g., Dagan et al.
(1993) proposed measuring disagreement using the vote entropy, considering
only the final class label regardless of the prediction scores. A variation to the
approach mentioned above is to measure the vote entropy by considering the
prediction scores of each committee member instead of the predicted labels.
Furthermore, a more sophisticated approach would consider measuring the
confidence of the committee members by measuring their divergence to the
mean and taking into account the class distribution of each committee member
(Kee et al., 2018).
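Both variants of vote entropy mentioned above can be sketched briefly. The
first function uses only the hard label votes of the committee; the second
averages the members' prediction scores before computing the entropy, which
is one possible reading of the score-based variant. The function names and
inputs are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's hard label votes for one instance."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def consensus_entropy(member_probs):
    """Entropy of the committee's averaged class-probability vector."""
    n = len(member_probs)
    mean = [sum(col) / n for col in zip(*member_probs)]
    return -sum(p * math.log(p) for p in mean if p > 0)


# Made-up committee outputs for a single unlabeled instance.
split_votes = vote_entropy(["churn", "stay"])
unanimous_votes = vote_entropy(["churn", "churn", "churn"])
soft = consensus_entropy([[0.9, 0.1], [0.6, 0.4]])
print(split_votes, unanimous_votes, soft)
```

Instances with the highest entropy are the ones on which the committee
disagrees most and are therefore candidates for labeling.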
Another approach is to consider the disagreement margin, which measures
how unanimous the committee is about the given prediction. Abe and
Mamitsuka (1998) proposed measuring the disagreement margin by
considering the difference of votes between the first and second most
confident label estimates for a given instance. Weigl et al. (2016) reframed
the aforementioned approach by considering the prediction scores. Another
approach was proposed by Borisov et al. (2011), who determined the
disagreement based on the standard deviation of the weighted prediction
scores of each committee member. McCallum (1998) modified the QBC
approach by using Expectation-Maximization to model the density over the
unlabeled data instances and using it to weight the disagreement scores when
selecting the unlabeled instances of interest to submit to the oracle.
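The vote-based disagreement margin can be illustrated as the difference
between the vote counts of the two most voted labels, where a small margin
signals strong disagreement. This is a minimal sketch with hypothetical
function names and made-up votes.

```python
from collections import Counter


def vote_margin(votes):
    """Votes for the top label minus votes for the runner-up label."""
    counts = sorted(Counter(votes).values(), reverse=True)
    second = counts[1] if len(counts) > 1 else 0
    return counts[0] - second


def most_disputed(pool_votes):
    """Index of the instance with the smallest vote margin."""
    return min(range(len(pool_votes)),
               key=lambda i: vote_margin(pool_votes[i]))


# Made-up votes of a four-member committee on three unlabeled instances.
pool_votes = [["a", "a", "a", "a"],
              ["a", "a", "b", "b"],
              ["a", "a", "a", "b"]]
chosen = most_disputed(pool_votes)
print(chosen)
```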
Gammelsæter (2015) explored a different approach to QBC for neural
networks by applying a dropout layer to a Multi-Layer Perceptron (MLP) to
form a committee. Nevertheless, adding only one sample at a time created an
unbalanced weight among the training samples. This issue was addressed by
Ducoffe and Precioso (2015), who described a QBC technique to train a
100 Jože M. Rožanec, Blaž Fortuna and Dunja Mladenić
extracts rules from an SVM model to gain insight into its rationale and then
uses active learning to enhance the discrimination rules close to the SVM
decision boundary. A different take on the problem was developed by Guo and
Greiner (2007), who proposed querying unlabeled data instances, selecting
those providing the maximum conditional mutual information given the
labeled data, and using an optimistic guess for the unlabeled instances. There
are two possible scenarios given the optimistic guess: either the guess was
correct or not.
When the guess is incorrect, the authors propose selecting the most
uncertain unlabeled data instance, which helps to clarify the model’s decision
boundary. In the same line, Beygelzimer, Dasgupta and Langford (2009)
introduced Importance Weighted Active Learning (IWAL), advocating the use of
importance weights to correct for sampling bias. The learner uses a specific
strategy to estimate a rejection probability threshold based on the unlabeled
data instance and the history of unlabeled samples queried to the oracle. The
weight of the particular instance is determined as the inverse of the
rejection probability and is used only if the data instance is queried
against the oracle. Another method for weighting instances was developed by
Ganti and Gray (2012), who devised the Unbiased Pool Based Active Learning
(UPAL) algorithm for pool-based active learning. The algorithm proceeds in
rounds, putting a probability distribution over the entire pool and then
sampling a point from the pool. The probability distribution in each round
depends on the active learner and is obtained by minimizing the
importance-weighted risk over the hypothesis space. The authors demonstrated
that UPAL outperforms the Batch Mode Active Learning (BMAL) algorithm
(Hoi et al., 2006) in terms of discriminative power, scalability, and
execution time.
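The importance-weighting step of IWAL can be sketched as follows. The sketch
treats the probability produced by the learner's strategy (called the
rejection probability in the text) as the chance that an instance is queried,
which is an assumption about the intended meaning. The probability rule, the
oracle, and the stream values are hypothetical, and the rule here ignores the
query history for simplicity.

```python
import random


def iwal_round(stream, query_probability, oracle, seed=0):
    """Return (instance, label, weight) triples for the queried instances."""
    rng = random.Random(seed)
    labeled = []
    for x in stream:
        p = query_probability(x, labeled)   # must be greater than zero
        if rng.random() < p:                # coin flip with bias p
            # Queried instances are weighted by the inverse probability.
            labeled.append((x, oracle(x), 1.0 / p))
    return labeled


def hypothetical_query_probability(x, history):
    """Made-up rule: query more often near a decision boundary at 0.5."""
    return max(0.1, 1.0 - 2.0 * abs(x - 0.5))


def oracle(x):
    """Made-up labeling function standing in for the human annotator."""
    return int(x >= 0.5)


stream = [0.05, 0.45, 0.55, 0.95, 0.5]
queried = iwal_round(stream, hypothetical_query_probability, oracle)
print(queried)
```

Instances that are rarely queried receive large weights when they are
queried, so the weighted sample remains representative of the full stream.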
Xu et al. (2003) developed an approach to take into account the diversity
of data samples. To that end, the authors performed representative sampling
by clustering unlabeled data instances close to the decision boundary of an
SVM model and then selecting the medoid instances, which are expected to
preserve the density distribution information of the whole cluster. A similar
approach was followed by Nguyen and Smeulders (2004), who proposed
clustering all unlabeled data and then finding candidate samples among the
unlabeled data instances close to the decision boundary and the medoids of
the highest density clusters. The one that contributes most to the current
error is chosen from the initial set of candidates. The authors argue that,
while such a choice is not guaranteed to achieve the smallest future error,
it is highly probable that it would lead to a significant error decrease.
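The medoid of a cluster is the member with the smallest total distance to all
other members, so it is always an actual data instance that can be sent to
the oracle. The sketch below shows only this selection step, assuming the
clustering and the filtering of instances near the decision boundary have
already been done; the function names and points are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def medoid(cluster):
    """Member of the cluster with the smallest total distance to the rest."""
    return min(cluster, key=lambda p: sum(euclidean(p, q) for q in cluster))


# Made-up cluster of unlabeled instances near a decision boundary.
cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 0.2]]
representative = medoid(cluster)
print(representative)
```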
Yang, Hanneke and Carbonell (2013) explored the use of active learning
in a transfer learning setting, where target concepts are sampled from an
unknown distribution. Furthermore, the authors explored the benefits of
transfer learning to know whether to stop the active learning procedure (self-
verifying active learning). In their research, they contrasted their work with
research from Baxter (1997) regarding the role of the sample size and loss
observed in learning multiple tasks. Yang, Hanneke and Carbonell (2013)
consider that the results are related only to the number of tasks. In contrast,
the number of samples per task remains bounded given non-altruistic learners
and the fact that the desired number of labeled samples should never be
significantly larger than the number of samples required to solve the task. The
authors consider that transfer learning can be used in a setting where all
problems cannot be solved by a single individual but rather by a coalition of
cooperating individuals who are willing to share the data used to learn a
particular concept to help others to learn another task. Furthermore, they
demonstrate that transfer learning can lead to a sample complexity close to the
one achieved by algorithms that directly know the targets’ distribution. Shao
(2019) also addressed the problem of actively selecting informative instances
aided by transferred knowledge from related tasks. To that end, the author
proposed using a diverse committee with members from both domains to
select the most informative instances based on the maximum vote entropy,
improve the classification accuracy, and evaluate each member over multiple
iterations.
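The vote entropy criterion mentioned above can be sketched as follows. Each committee member casts one label vote per instance, and the instance with the highest entropy over the votes is selected. The committee votes below are made up for illustration, and the construction of the committee itself is not shown.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the label votes cast by committee members for one instance."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def most_informative(committee_votes):
    """Index of the instance on which the committee disagrees the most."""
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))

committee_votes = [
    ["a", "a", "a", "a"],  # full agreement
    ["a", "a", "a", "b"],  # mild disagreement
    ["a", "a", "b", "b"],  # evenly split committee
]
query_index = most_informative(committee_votes)
```

An evenly split committee yields the largest entropy, so the third instance would be sent to the oracle first.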
An entirely different view of the active learning problem was developed by
Zhu, Ghahramani and Lafferty (2003), who envisioned combining
semi-supervised learning and active learning by transforming labeled and
unlabeled data into vertices of a weighted graph connected by edges indicating
the similarity between data instances. While the semi-supervised learning
problem was framed in terms of a Gaussian random field model on the graph,
active learning was used to query unlabeled data and minimize the expected
classification error of the semi-supervised model. Active learning was also
applied to graph-specific problems. Ma, Garnett and Schneider (2013)
developed a graph-specific criterion called Σ-optimality, querying the nodes
that minimize the sum of the elements in the predictive covariance. Ostapuk,
Yang and Cudré-Mauroux (2019) developed an active learning framework for
knowledge graphs that takes into account the model uncertainty and the
underlying structure of the knowledge graph to improve the sampling
effectiveness and incrementally train deep learning models. A different
approach was developed by Li, Yin and Chen (2021), who proposed a
Active Learning 103
target label and therefore leading to faster convergence. Sinha, Ebrahimi, and
Darrell (2019) introduced variational adversarial active learning, which samples
instances using an adversarially trained discriminator that predicts whether an
instance is labeled or not based on the latent space of a variational
auto-encoder. A weakness of this approach is that it can end up selecting instances
that correspond to the same class, regardless of the proportion of labeled
samples per class. This issue was addressed by Laielli et al. (2020), who
developed a semi-supervised mini-max entropy-based active learning
algorithm that leverages uncertainty and diversity in an adversarial manner. A
similar approach was described by Liu et al. (2019), who proposed the Single-
Objective Generative Adversarial Active Learning (SO-GAAL) technique,
which generates outliers with informative potential based on the mini-max
game between a generator and a discriminator to assist a classifier in
describing a boundary that separates outliers from normal data.
The overview of query methods provided above is by no means
exhaustive. We refer the reader to the following surveys to learn more about
active learning in particular settings. The surveys by Fu, Zhu and Li (2013),
and Kumar and Gupta (2020) provide great insights about query strategies for
classification, regression, and clustering in a batch setting. To learn about
techniques applied with deep learning models, we recommend the survey by
Ren et al. (2021). Finally, the survey by Lughofer (2017) provides an in-depth
introduction to active learning in an online setting.
Use Case
While there is currently a research void regarding the use of active learning in
the manufacturing domain (Meng et al., 2020), the increasing digitalization
makes such approaches ever more relevant. Active learning is recognized for
alleviating the manual labeling workload, a quality that has also been
exploited in the manufacturing setting. Furthermore, active learning is
considered one of the pillars of human-machine collaboration, key to the
Industry 5.0 paradigm (Rožanec et al., 2022a).
It can be argued that the manufacturing process begins with acquiring the
raw material and components required to manufacture the product. Accurate
demand forecasting is key to avoiding distortions across the supply chain.
gather enough labeled data samples. The authors described an active learning
approach to select the wafer maps to alleviate this issue, ensuring no effort is
invested into data samples that would not provide enough information to the
classifier to learn. Dai et al. (2018) described a use case regarding automatic
optical inspection for the recognition of solder joint defects in printed circuit
boards. Active learning is applied to enlarge an initially labeled dataset. The
authors developed an original method combining semi-supervised learning
and active learning. In particular, they performed k-means clustering on
labeled and unlabeled data and trained a classifier on labeled data. They later
used the predictions issued by the classifier to analyze which clusters had the
least skewed distributions when considering the predicted classes of the
unlabeled samples, and sampled unlabeled data from them. A different use case
for visual inspection was described by Rožanec et al. (2022d) and Trajkova et al.
(2021), who considered multiple active learning approaches to reduce
the manual visual inspection and labeling effort when inspecting the quality
of printed logos on manufactured shavers.
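The cluster selection step described for the solder joint use case can be sketched as follows. The share of the majority predicted class is used here as a simple measure of skew; this measure, the function names, and the toy labels are assumptions for illustration and may differ from what Dai et al. (2018) actually used.

```python
from collections import Counter, defaultdict

def majority_share(labels):
    """Fraction of samples that fall into the most frequent predicted class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def rank_clusters_by_balance(cluster_ids, predicted_labels):
    """Order clusters from the least skewed to the most skewed predictions."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, predicted_labels):
        by_cluster[c].append(y)
    return sorted(by_cluster, key=lambda c: majority_share(by_cluster[c]))

cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
predicted = ["ok", "ok", "ok", "ok",
             "ok", "defect", "ok", "defect",
             "ok", "ok", "defect"]
ranking = rank_clusters_by_balance(cluster_ids, predicted)
```

Unlabeled samples would then be drawn first from the clusters at the front of the ranking, where the classifier's predictions are most mixed.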
Active learning addresses the problem of which data to request from the
oracle to maximize the learning of a given agent. This idea strongly connects
with research on how robots are taught from human demonstrations and how
a physical robot can improve its skills by asking for such demonstrations. For
example, Maeda et al. (2017) describe an active learning approach that
enables robots to decide whether they have the skill to deal with an unknown
task or must ask for demonstrations to learn movement primitives
incrementally. In the same line of research, Conkey and Hermans (2019)
explored the use of active learning to learn a library of probabilistic movement
primitives, while Koert et al. (2019) used active learning to rely on fewer
demonstrations while improving the generalization capabilities when learning
a specific set of tasks. For a more in-depth overview of the use of active
learning in robotics, we refer the reader to the survey by Taylor, Berrueta, and
Murphey (2021).
Healthcare
(2011), who used weak decision tree classifiers to determine carcinoma and
non-carcinoma regions on prostate histopathological images. The authors used
a QBC active learning strategy to annotate the images with the highest discrepancy
between the classification trees in a bagging setting. Padmanabhan et al.
(2014) used artificial intelligence to detect and analyze the chemical and
physical characteristics of human cells in the context of renal carcinoma
detection. The authors reported using active learning to quantify the
information contribution of unlabeled samples by estimating the determinant
of the Fisher information matrix. The samples expected to provide the greatest
contribution were selected for manual labeling. Das, Nair, and Peter (2020)
described the use of artificial intelligence to analyze histopathological tissue
sample images and determine variations in size and shape that correspond to
malignant lesions associated with breast cancer. Given that the annotation of
such images requires a considerable effort of domain experts and experienced
pathologists, the authors used active learning to select batches of instances
based on the kernelized Riemannian distance measures, such as the Jeffrey
and Stein divergences. Among the results, they report using only 20% of the
labeled data required to train a supervised classifier model without the active
learning setting. A different approach was developed by Doyle and
Madabhushi (2010), who used a consensus of ambiguity to identify images
considered ambiguous by multiple algorithms and prioritize them for manual
labeling. Active learning has also been successfully applied to image
classification models related to the detection of colorectal cancer (Zhao et al.,
2019), skin lesions (Shi et al., 2019) or other diseases, such as COVID-19 or
pneumonia (Wu et al., 2021; Nguyen et al., 2021).
Accurate diagnostics require the right level of medical expertise to assess
the evidence. Mu et al. (2021) reported using deep learning to map
semi-structured and unstructured text data from pathology synopses to provide
relevant semantic diagnostic labels. Such a system attempts to avoid
bottlenecks that result from the limited number of specialists available to
interpret the pathology synopses and provide relevant diagnostic information.
In addition, the authors used active learning to prioritize the labeling of
unlabeled samples considering those that are underrepresented in the dataset
or the ones that could enhance the models’ performance.
Once a disease is discovered, it is crucial to understand how critical it is
to ensure the best treatment possible for the patient while also meeting the
overall patient scheduling constraints. This problem was addressed by Nissim
et al. (2017), who developed the CAESAR-ALE (Classification Approach for
Extracting Severity Automatically from Electronic Health Records - Active
Cybersecurity
ROC change by an ad-hoc XGBoost model. Concept drift can make machine
learning models vulnerable to attackers who progressively distill attack traces
into models and make them hard to distinguish from the value changes related
to concept drift. Dey et al. (2020) explored using active learning to
leverage human expertise and maintain effective detection capabilities even in
the face of concept drift regarding behavioral models.
Active learning has also been used to annotate data in a timely manner for other
cybersecurity-related problems. Das Bhattacharjee et al. (2017) describe
the use of active learning to create a semi-supervised solution to annotate data
for phishing categorization. Among the particular challenges of the
domain are the need for real-time detection (the user must be warned about the
potential danger before clicking on the URL) and the ability to generalize
beyond the blacklisted URLs. The authors used the uncertainty sampling
strategy to achieve greater clarity near the classifier’s decision boundaries and
increase the model’s performance.
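Uncertainty sampling for a binary classifier can be sketched in a few lines: the instances whose predicted probability lies closest to 0.5 are the ones nearest to the decision boundary and are sent for annotation first. The probabilities below are invented for illustration, and the function name is hypothetical.

```python
def select_most_uncertain(probabilities, batch_size=1):
    """Indices of instances whose positive-class probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]

# Hypothetical predicted probabilities that each URL is a phishing URL.
phishing_probs = [0.97, 0.52, 0.08, 0.41, 0.66]
to_label = select_most_uncertain(phishing_probs, batch_size=2)
```

Confidently classified URLs (probabilities near 0 or 1) are left out of the batch, so the annotation effort concentrates on the ambiguous ones.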
Conclusion
References
Abe, N., Mamitsuka, H., 1998. Query Learning Strategies Using Boosting and Bagging. In
ICML ’98: Proceedings of the Fifteenth International Conference on Machine
Learning, 1–9.
Baxter, J., 1997. A Bayesian/Information Theoretic Model of Learning to Learn via
Multiple Task Sampling. Machine Learning 28 (1), 7–39.
https://doi.org/10.1023/A:1007327622663.
Beygelzimer, A., Dasgupta, S., Langford, J., 2009. Importance Weighted Active Learning.
Proceedings of the 26th International Conference On Machine Learning, ICML 2009,
no. 2008, 49–56.
Bhattacharjee, S. D., Talukder, A., Al-Shaer, E., Doshi, P., 2017. Prioritized Active
Learning for Malicious URL Detection Using Weighted Text-Based Features. 2017
IEEE International Conference on Intelligence and Security Informatics: Security and
Big Data, ISI 2017, 107–12. https://doi.org/10.1109/ISI.2017.8004883.
Borisov, A., Tuv, E., Runger, G., 2011. Active Batch Learning with Stochastic Query-by-
Forest (SQBF). Workshop on Active Learning and Experimental Design 16, 59–69.
Botcha, B., Iquebal, A. S., Bukkapatnam, S. B., 2021. Efficient Manufacturing Processes
and Performance Qualification via Active Learning: Application to a Cylindrical
Plunge Grinding Platform. Procedia Manufacturing 53 (2020), 716–725. https://doi.
org/10.1016/j.promfg.2021.06.070.
Brinker, K., 2003. Incorporating Diversity in Active Learning with Support Vector
Machines. In Proceedings of the Twentieth International Conference on Machine
Learning (ICML-2003). https://doi.org/10.1039/C19680000233.
Cai, W., Muhan Z., Ya Z., 2017. Batch Mode Active Learning for Regression with
Expected Model Change. IEEE Transactions on Neural Networks and Learning
Systems 28 (7), 1668–1681. https://doi.org/10.1109/TNNLS.2016.2542184.
Cai, W., Zhang, Y., Zhou, J., 2013. Maximizing Expected Model Change for Active
Learning in Regression. Proceedings - IEEE International Conference on Data
Mining, ICDM, 51–60. https://doi.org/10.1109/ICDM.2013.104.
Chung, M-H., Chignell, M., Wang, L., Jovicic, A., Raman, A., 2020. Interactive Machine
Learning for Data Exfiltration Detection: Active Learning with Human Expertise.
Conference Proceedings - IEEE International Conference on Systems, Man and
Cybernetics 2020-October: 280–287. https://doi.org/10.1109/SMC42975.2020.928
2831.
Conkey, A., Hermans, T., 2019. Active Learning of Probabilistic Movement Primitives.
IEEE-RAS International Conference on Humanoid Robots 2019-October, 425-432.
https://doi.org/10.1109/Humanoids43949.2019.9035026.
Cuong, N. V., Lee, W. S., Ye, N., 2014. Near-Optimal Adaptive Pool-Based Active
Learning with General Loss. Uncertainty in Artificial Intelligence - Proceedings of the
30th Conference, UAI 2014, 122–131.
Dagan, I., Engelson, S. P., and Gan, R., 1993. Committee-Based Sampling For Training
Probabilistic Classifiers. Science.
Dai, W., Mujeeb, A., Erdt, M., Sourin, A., 2018. Towards Automatic Optical Inspection of
Soldering Defects. Proceedings - 2018 International Conference on Cyberworlds, CW
2018, 375–382. https://doi.org/10.1109/CW.2018.00074.
Dang, Q. V., 2020. Proceedings - 2020 RIVF International Conference on Computing and
Communication Technologies, RIVF 2020. In 2020 RIVF International Conference
on Computing and Communication Technologies (RIVF), 13.
Danziger, S. A., Baronio, R., Ho, L., Hall, L., Salmon, K., Hatfield, G. W., Kaiser, P.,
Lathrop, R. H., 2009. Predicting Positive P53 Cancer Rescue Regions Using Most
Informative Positive (MIP) Active Learning. PLoS Computational Biology 5 (9).
https://doi.org/10.1371/journal.pcbi.1000498.
Das, A., Nair, M.S., Peter, D.S., 2020. Batch Mode Active Learning on the Riemannian
Manifold for Automated Scoring of Nuclear Pleomorphism in Breast Cancer.
Artificial Intelligence in Medicine 103, 101805. https://doi.org/10.1016/j.artmed.
2020.101805.
Dey, A., Velay, M., Fauvelle, J-P., Navers, S., 2020. Adversarial vs Behavioural-Based
Defensive AI with Joint, Continual and Active Learning: Automated Evaluation of
Robustness to Deception, Poisoning and Concept Drift, 1–25. http://arxiv.org/abs/
2001.11821.
Doyle, S., Madabhushi, A., 2010. Consensus of ambiguity: theory and application of active
learning for biomedical image analysis. In IAPR International Conference on Pattern
Recognition in Bioinformatics, 313–324.
Doyle, S., Monaco, J., Feldman, M., Tomaszewski, J., Madabhushi, A., 2011. An Active
Learning Based Classification Strategy for the Minority Class Problem: Application
to Histopathology Annotation. BMC Bioinformatics 12, 1–14. https://doi.org/
10.1186/1471-2105-12-424.
Duchi, J., Namkoong H., 2019. Variance-Based Regularization with Convex Objectives.
Journal of Machine Learning Research 20 (3), 1–10.
Ducoffe, M., Precioso, F., 2015. QBDC: Query by Dropout Committee for Training Deep
Supervised Architecture, no. 2013, 1–10. http://arxiv.org/abs/1511.06412.
Fu, Y., Zhu, X., Li, B., 2013. A Survey on Instance Selection for Active Learning.
Knowledge and Information Systems 35 (2), 249–283. https://doi.org/10.1007/s10115-
012-0507-8.
Gammelsæter, M., 2015. A Committee of One - Using Dropout for Active Learning in Deep
Networks | Semantic Scholar. 2015.
Ganti, R., Gray, A., 2012. UPAL: Unbiased Pool Based Active Learning. Journal of
Machine Learning Research 22 (422), 422–431.
Garderen, K. V., 2017. Active Learning for Overlay Prediction in Semi-Conductor
Manufacturing 6 (0), 1–18.
Guo, Y., Greiner, R., 2007. Optimistic Active Learning Using Mutual Information. IJCAI
International Joint Conference on Artificial Intelligence, 823–829.
Hoi, S. C. H., Jin, R., Zhu, J., Lyu, M. R., 2006. Batch Mode Active Learning and Its
Application to Medical Image Classification. ACM International Conference
Proceeding Series 148, 417–424. https://doi.org/10.1145/1143844.1143897.
Jin, Q., Tan, C., Chen, M., Yan, M., Huang, S., Zhang, N., Liu, X., 2020. Alibaba DAMO
academy at TREC precision medicine 2020: State-of-the-art evidence retriever for
Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., Wang, X., 2021.
A Survey of Deep Active Learning. ACM Computing Surveys 54 (9).
https://doi.org/10.1145/3472291.
Rožanec, J. M., Fortuna, B., Mladenić, D., 2022b. Knowledge Graph-Based Rich and
Confidentiality Preserving Explainable Artificial Intelligence (XAI). Information
Fusion 81 (December 2020), 91–102. https://doi.org/10.1016/j.inffus.2021.11.015.
Rožanec, J. M., Fortuna, B., Mladenić, D., 2022e. Reframing Demand Forecasting: A Two-
Fold Approach for Lumpy and Intermittent Demand. Sustainability 14 (15), 9295.
https://doi.org/10.3390/su14159295.
Rožanec, J. M., Novalija, I., Zajec, P., Kenda, K., Tavakoli, H., Suh, S., Veliou, E.,
Papamartzivanos, D., Giannetsos, T., Menesidou, S.A., 2022a. Human-Centric
Artificial Intelligence Architecture for Industry 5.0 Applications. http://arxiv.org/
abs/2203.10794.
Rožanec, J. M., Trajkova, E., Dam, P., Fortuna, B., Mladenić, D., 2022d. Streaming
Machine Learning and Online Active Learning for Automated Visual Inspection.
IFAC-PapersOnLine 55. Association for Computing Machinery. https://doi.org/
10.1016/j.ifacol.2022.04.206.
Rožanec, J. M., Trajkova, E., Novalija, I., Zajec, P., Kenda, K., Fortuna, B., Mladenić, D.,
2022c. Enriching Artificial Intelligence Explanations with Knowledge Fragments.
Future Internet 14 (5), 1–11. https://doi.org/10.3390/fi14050134.
Settles, B., 2010. Active Learning Literature Survey. Computer Sciences Technical Report
1648. https://doi.org/10.1016/j.matlet.2010.11.072.
Seung, H. S., Opper, M., Sompolinsky, H., 1992. Query by Committee, 287–294.
Shao, H., 2019. Query by Diverse Committee in Transfer Active Learning. Frontiers of
Computer Science 13 (2), 280–291. https://doi.org/10.1007/s11704-017-6117-6.
Shen, Y., Yun, H., Lipton, Z. C., Kronrod, Y., Anandkumar, A., 2017. Deep Active
Learning for Named Entity Recognition. Proceedings of the 2nd Workshop on
Representation Learning for NLP, Rep4NLP 2017 at the 55th Annual Meeting of the
Association for Computational Linguistics, ACL 2017, 252–.
Shi, X., Dou, Q., Xue, C., Qin, J., Chen, H., Heng, P-A., 2019. An Active Learning
Approach for Reducing Annotation Cost in Skin Lesion Analysis. Lecture Notes in
Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics) 11861 LNCS 628–636. https://doi.org/10.1007/978-
3-030-32692-0_72.
Shim, J., Kang, S., Cho, S., 2020. Active Learning of Convolutional Neural Network for
Cost-Effective Wafer Map Pattern Classification. IEEE Transactions on
Semiconductor Manufacturing 33 (2), 258–266. https://doi.org/10.1109/TSM.2020.
2974867.
Sinha, S., Ebrahimi, S., Darrell, T., 2019. Variational Adversarial Active Learning.
Proceedings of the IEEE International Conference on Computer Vision 2019-October:
5971–5980. https://doi.org/10.1109/ICCV.2019.00607.
Sugiyama, M., Kawanabe, M., 2013. Machine Learning in Non-Stationary Environments.
https://doi.org/10.7551/mitpress/9780262017091.001.0001.
Taylor, A. T., Berrueta, T. A., Murphey, T. D., 2021. Active Learning in Robotics: A
Review of Control Principles. Mechatronics 77. https://doi.org/10.1016/j.mecha
tronics.2021.102576.
Tong, S., Chang, E., 2001. Support Vector Machine Active Learning for Image Retrieval.
Proceedings of the Ninth ACM International Conference on Multimedia -
MULTIMEDIA ’01 54 (C), 1–12.
Trajkova, E., Rožanec, J. M., Dam, P., Fortuna, B., Mladenić, D., 2021. Active Learning
for Automated Visual Inspection of Manufactured Products. Ljubljana ’21: Slovenian
KDD Conference on Data Mining and Data Warehouses, October, 2021, Ljubljana,
Slovenia. Association for Computing Machinery. http://arxiv.org/abs/2109.02469.
Wang, Z., Ye, J., 2015. Querying Discriminative and Representative Samples for Batch
Mode Active Learning. ACM Transactions on Knowledge Discovery from Data 9 (3),
17. https://doi.org/10.1145/2700408.
Weigl, E., Wolfgang, H., Lughofer, E., Radauer, T., Eitzinger, C., 2016. On Improving
Performance of Surface Inspection Systems by Online Active Learning and Flexible
Classifier Updates. Machine Vision and Applications 27 (1), 103–27. https://doi.org/
10.1007/s00138-015-0731-9.
Wu, D., 2019. Pool-Based Sequential Active Learning for Regression. IEEE Transactions
on Neural Networks and Learning Systems 30 (5), 1348–1359. https://doi.org/
10.1109/TNNLS.2018.2868649.
Wu, D., Lawhern, V. J., Gordon, S., Lance, B. J., Lin, C. T., 2016. Offline EEG-Based
Driver Drowsiness Estimation Using Enhanced Batch-Mode Active Learning
(EBMAL) for Regression. 2016 IEEE International Conference on Systems, Man, and
Cybernetics, SMC 2016 - Conference Proceedings, 730–736. https://doi.org/
10.1109/SMC.2016.7844328.
Wu, X., Chen, C., Zhong, M., Wang, J., Shi, J., 2021. COVID-AL: The Diagnosis of
COVID-19 with Deep Active Learning. Medical Image Analysis 68, 101913.
https://doi.org/10.1016/j.media.2020.101913.
Xie, B., Shen, G., Guo, C., Cui, Y., 2021. The Named Entity Recognition of Chinese
Cybersecurity Using an Active Learning Strategy. Wireless Communications and
Mobile Computing 2021. https://doi.org/10.1155/2021/6629591.
Xu, Z., Yu, K., Tresp, V., Xu, X., Wang, J., 2003. Representative Sampling for Text
Classification. Advances in Information Retrieval. ECIR 2003. Lecture Notes in
Computer Science 2633, 1–15.
Yang, L., Hanneke, S., Carbonell, J., 2013. A Theory of Transfer Learning with
Applications to Active Learning. Machine Learning 90 (2), 161–189. https://doi.org/
10.1007/s10994-012-5310-y.
Yu, H., Kim, S., 2010. Passive Sampling for Regression. Proceedings - IEEE International
Conference on Data Mining, ICDM, 1151–1156. https://doi.org/10.1109/ICDM.
2010.9.
Zajec, P., Rožanec, J.M., Trajkova, E., Novalija, I., Kenda, K., Fortuna, B., Mladenić, D.,
2021. Help Me Learn! Architecture and Strategies to Combine Recommendations and
Active Learning in Manufacturing. Information (Switzerland) 12 (11), 1–27.
https://doi.org/10.3390/info12110473.
Zhang, T., Oles, F., 2000. The Value of Unlabeled Data for Classification Problems.
Proceedings of the Seventeenth International Conference on Machine Learning
(ICML’00), 1191-1198.
Zhao, Y., Chen, D., Xie, H., Zhang, S., Gu, L., 2019. Mammographic Image Classification
System via Active Learning. Journal of Medical and Biological Engineering 39 (4),
569–82. https://doi.org/10.1007/s40846-018-0437-3.
Zhu, D., Li, Z., Wang, X., Gong, B., Yang, T., 2019. A Robust Zero-Sum Game Framework
for Pool-Based Active Learning. AISTATS 2019 - 22nd International Conference on
Artificial Intelligence and Statistics 89.
Zhu, J. J., Bento, J., 2017. Generative Adversarial Active Learning, 1–11. http://arxiv.org/
abs/1702.07956.
Zhu, X., Ghahramani, Z., Lafferty, J., 2003. Semi-Supervised Learning Using Gaussian
Fields and Harmonic Functions. Proceedings, Twentieth International Conference on
Machine Learning 2 (2001), 912–919.
Chapter 7
Kevser Şahinbaş*
Management Information Systems, Istanbul Medipol University,
Istanbul, Turkey
Abstract
* Corresponding Author’s Email: [email protected].
Introduction
Literature Review
Xiong et al. (2021) proposed a model that applies an ensemble-based
Bayesian neural network to predict three disorders on the YODA dataset,
achieving an accuracy of 0.90 for social anxiety disorder. Chekroud et al.
(2016) used an ML algorithm to predict clinical remission from a 12-week
regimen of citalopram. The data were gathered from 1949 patients suffering
from level 1 depression, and 25 variables were used to achieve good
performance. Their model achieved an accuracy of 64.6% by applying the
gradient boosting approach.
Ahmed et al. (2020) presented a model for distinguishing the intensity
level of anxiety and depression at an early age using psychological
testing and some commonly used ML algorithms, namely CNN, K-Nearest
Neighbour, SVM, and linear discriminant analysis. CNN obtained the best
classification performance, with 96.8% for depression and 95% for anxiety.
Hilbert et al. (2017) applied ML algorithms to separate difficult cases from
healthy cases and GAD from major depression. They used a binary SVM
algorithm and achieved 90.10% accuracy.
Sau and Bhakta (2017) proposed a model for detecting anxiety and
depression in elderly patients using ML algorithms such as Naive Bayes,
Random Forest, Bayesian network, and logistic regression, among others.
They achieved an accuracy of 89% with the Random Forest algorithm using
data from 510 geriatric patients. Tat et al. designed a framework to predict
mental health problems in middle adolescence using 474 predictors and data
from 7,638 twins from the Child and Adolescent Twin Study in Sweden. The
authors applied ML algorithms, i.e., SVM, RF, XGBoost, and Neural Network,
together with SMOTEBoost, an imbalanced learning approach, and obtained an
accuracy of 0.754 with the RF algorithm.
Ćosić et al. (2020) addressed the prevention of mental health disorders
among healthcare professionals, aiming to predict a higher risk of chronic
mental health disorders during the COVID-19 pandemic. An objective
assessment of the intensity of exposure to stress and a self-report assessment
of health care workers were requested, with data obtained from clinical records
and hospital archives. The final stage included developing multimodal
stimulation paradigms to elicit neurophysiological responses. At this stage,
the response to audio-visual stimuli was measured. Sensors such as fNIRS,
EEG, EKG, EMG, and EDA were used for the reaction measurements, and a
built-in eye tracker, microphone, and webcam were also available. The authors
emphasized that, unlike statistical methods, feature selection and ML
classification can discover more complex nonlinear interactions. The study
showed that the problem can be formulated as a supervised learning task,
either regression or classification. The neurophysiological features retrieved
at stage 4 can be used in a properly formulated supervised learning task with
models such as RF, SVM, or ANN. The work offers a perspective, with
propositions such as how the approach could be integrated with a model.
Background
Anxiety disorders are among the most prevalent medical illnesses. They are
common both at the population level and in hospital cohorts and are
associated with a significant economic burden, poor quality of life, and a
variety of adverse outcomes (Kanwar et al., 2013). They are also a common
mental health problem among teenagers (Rapee, Schniering, & Hudson,
2009). In addition to anxiety disorders due to general medical conditions,
the five basic anxiety disorders are generalized anxiety disorder, panic
disorder, specific phobia, social anxiety disorder, and post-traumatic stress
disorder, together with their linked conditions (Karamustafalıoğlu &
Yumrukçal, 2011). Anxiety is a type of affect whose negative characteristics
distinguish it from other types of affect. Its physiological symptoms include
difficulty breathing, palpitations, rapid breathing, trembling in the hands and
feet, and excessive sweating. Its psychological characteristics include
discomfort, excitement, emotion, and the fear of something bad happening
unexpectedly. Some definitions separate anxiety from fear by limiting the
source to the anticipation of an unknown danger.
Generalized anxiety disorder (GAD) is defined by the DSM-IV as a
condition characterized by intense and pervasive anxiety accompanied by a
variety of physical symptoms that causes significant impairment in social or
occupational functions or significant stress in the patient (Karamustafalıoğlu
& Yumrukçal, 2011).
Worry occurs almost every day for at least 6 months. It is about many
events or activities and is extreme. The individual has difficulty controlling
Prediction of General Anxiety Disorder Using Machine … 123
The dataset used is described, the SVM, Decision Tree, ANN, RF and KNN
algorithms are explained, and the experimental results are presented in detail
in this section.
Questions                                                                      Score
1. In the last two weeks, I am feeling nervous, anxious, or on edge.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
2. In the last two weeks, I am not being able to stop or control worrying.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
3. In the last two weeks, I am worrying too much about different things.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
4. In the last two weeks, I have trouble relaxing.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
5. In the last two weeks, I am being so restless that it’s hard to sit still.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
6. In the last two weeks, I am becoming easily annoyed or irritable.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
7. In the last two weeks, I am feeling afraid as if something awful might happen.
   Not at all                                                                  0
   Several days                                                                1
   More than half of the days                                                  2
   Nearly every day                                                            3
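As a minimal sketch, the responses to the seven questions above can be turned into a total score by summing the per-item scores, which yields a value between 0 and 21. The dictionary and function names are hypothetical, and no cut-off is applied because the mapping from scores to class labels is not specified here.

```python
# Score assigned to each response option, as listed in the questionnaire.
RESPONSE_SCORES = {
    "Not at all": 0,
    "Several days": 1,
    "More than half of the days": 2,
    "Nearly every day": 3,
}

def questionnaire_total(responses):
    """Sum the item scores of a completed seven-item questionnaire."""
    if len(responses) != 7:
        raise ValueError("expected one response per question (7 in total)")
    return sum(RESPONSE_SCORES[r] for r in responses)

example = ["Several days", "Not at all", "Nearly every day",
           "More than half of the days", "Several days",
           "Not at all", "Several days"]
total = questionnaire_total(example)
```

The individual item scores, rather than only their sum, can equally serve as input features for the classifiers discussed in this chapter.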
SVM
(1)
Decision Tree
In the algorithm created by Morgan and Sonquist, sample data with known
classes are divided into small groups through simple decision-making steps.
With each division operation, similar data are grouped and classified
inductively (Safavian and Landgrebe, 1991). The basic step in decision trees is
to create decision nodes. When a decision node is created, the best attribute
should be chosen as the node so that the tree branches in a balanced way and
the classification is done correctly. To this end, the expected value in the
whole system is calculated with the “Information Gain Theory” introduced by
Shannon and Weaver (Safavian and Landgrebe, 1991). The
information gain is calculated as in Equation 2:
(2)
Q: Sample space
𝑆𝑣: Sample space subset
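As a hedged illustration, the sketch below computes the information gain in its commonly used form: the entropy of the sample space Q minus the weighted entropy of the subsets S_v obtained by splitting Q on the values v of an attribute. This standard definition is assumed here and may differ in notation from Equation 2. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the full sample Q minus the weighted entropy of its subsets S_v."""
    subsets = defaultdict(list)
    for v, y in zip(attribute_values, labels):
        subsets[v].append(y)  # S_v: labels of the samples with attribute value v
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["anxious", "anxious", "non-anxious", "non-anxious"]
perfect_split = ["high", "high", "low", "low"]  # separates the classes fully
useless_split = ["x", "y", "x", "y"]            # leaves both subsets mixed
gain_perfect = information_gain(perfect_split, labels)
gain_useless = information_gain(useless_split, labels)
```

The attribute with the highest gain is the one chosen for the decision node, since it leaves the purest subsets after the split.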
ANN
If 1,
(3)
If 0
RF
KNN
The k-nearest neighbor method determines the class label of a test sample
from the labels of the test sample’s nearest neighbours (Soucy & Mineau,
2001). During classification, the distance from the sample to each record in the
data set is calculated, and the k records closest to the sample are taken into
consideration. Determining the value of k is important. If k is too small, the
model is overly sensitive to individual points: points that belong to the same
class may be placed in separate classes, or a separate class may effectively be
created for them. Likewise, if k is too large, the model behaves as if there
were only one class, and dissimilar points are classified together. For these
reasons, the value of k influences the classification. The values 3, 5, and 7,
which allow a clear majority vote, are the most used k values.
K-nearest neighbor algorithm steps;
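The procedure described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the Euclidean distance, the function name `knn_predict` and the toy points are assumptions, since the chapter does not state which distance measure was used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Distance from the query to every training record (Euclidean, assumed).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest records and take the most frequent label among them.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of points.
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

pred_near_a = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
pred_near_b = knn_predict(train_points, train_labels, (5.5, 5.5), k=3)
```

An odd k is used here so that a vote between two classes cannot end in a tie.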
Performance Metrics
In the definitions below, the class without anxiety disorder is taken as the
positive class. True Positive (TP) indicates that the actual class and the
predicted class have the same positive value: a TP is counted when a person
without anxiety disorder is classified as not having anxiety disorder. True
Negatives (TN) are correctly predicted negative values, where the actual class
and the predicted class are again the same: a TN is counted when a person with
anxiety disorder is classified as having anxiety disorder. A False Positive
(FP) appears when the actual class and the predicted class conflict: an FP is
counted when a patient with anxiety disorder is classified as not having
anxiety disorder. A False Negative (FN) likewise appears when the actual class
conflicts with the predicted class: an FN is counted when a person who does not
have an anxiety disorder is classified as having one. Classification
performance is good when the true positive and true negative counts are high
and the false positive and false negative counts are low. The following metrics
can be calculated with the confusion matrix.
Confusion Matrix
                                      Predicted Class
Actual Class            Non-Anxiety Disorder    Anxiety Disorder
Non-Anxiety Disorder    TP                      FN
Anxiety Disorder        FP                      TN
(4)
Recall is the ratio of correctly classified positive data to the total positive
data. This metric gives the rate at which the model finds the positive class
labels. It is calculated as in Equation 5 (Larose, 2014):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
(6)
(7)
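The four metrics reported in the results of this chapter (accuracy, precision, recall and f1-score) can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the study's tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

Because the F1 score is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which makes it a useful complement to accuracy.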
Data Description
The dataset used in this study is publicly available¹. The dataset offers
a unique view of mental health in Bangladesh’s population during the
early stages of the COVID-19 pandemic. The data were collected from April
15 to May 10, 2020, when the Bangladeshi government implemented a curfew
and quarantine. A total of 672 (381 men, 291 women) individuals from the
¹ https://www.sciencedirect.com/science/article/pii/S2352340921006314?via%3Dihub.
Normalization Filter
Large differences in scale between the data reduce the learning accuracy of
some classification methods. The purpose of normalization is to bring the data
onto a comparable scale, so that such differences do not dominate the
mathematical operations performed by the classifiers. In this study, the data
were normalized with the StandardScaler class of the scikit-learn preprocessing
module in Python before the ANN, SVM, DT, KNN and RF algorithms were applied.
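scikit-learn's StandardScaler rescales each feature to zero mean and unit variance, z = (x - mean) / standard deviation, using the population standard deviation. The plain-Python sketch below reproduces that computation for a single feature column so that the effect can be inspected without any third-party library; the function name `standardize` and the sample values are illustrative assumptions.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical feature column on a large scale (e.g. ages in years).
ages = [20, 30, 40, 50]
scaled = standardize(ages)
```

After this step a feature measured in years and a questionnaire item scored from 0 to 3 lie on comparable scales, which matters most for distance-based and margin-based methods such as KNN and SVM.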
Feature Importance
Figure 2 shows the feature importance obtained with the RF algorithm. The
results from Figure 3 show that the question “In the last two weeks, I am
feeling nervous, anxious, or on edge” has the greatest impact on anxiety
disorder prediction, while marital status contributes the least.
ANN Results
The performance of the RF model is calculated, and the results are shown in
Table 6. Accuracy (0.95), precision (0.93), recall (0.9) and f1-score (0.91) are
obtained according to RF performance values.
In the analysis of the SVM algorithm, the parameter values that give the best
classification performance were tuned. The best kernel and parameter settings
were selected with the grid search method. The radial basis function (RBF)
kernel is used as the kernel function in the SVM model, the gamma value (γ) is
set to 1 and C (the cost parameter) is set to 1. The accuracy, precision,
recall and f1-score values obtained with SVM are 0.97, 1.0, 0.91 and 0.95,
respectively. The results in Table 7 show that the SVM algorithm achieves the
best performance in predicting anxiety disorder. The ROC curve of the SVM
model is shown in Figure 8.
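To make the role of the gamma parameter concrete, the snippet below evaluates the standard radial basis function kernel, K(x, y) = exp(-γ‖x − y‖²), with γ = 1 as in the reported setting. It is only an illustration of the kernel formula, not of the study's training code; the function name `rbf_kernel` and the sample vectors are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors at increasing distance from the origin.
same = rbf_kernel([0.0, 1.0], [0.0, 1.0])   # identical points
near = rbf_kernel([0.0, 0.0], [0.5, 0.0])   # close points
far = rbf_kernel([0.0, 0.0], [3.0, 0.0])    # distant points
```

The kernel value falls from 1 toward 0 as two samples move apart, and a larger gamma makes that decay faster, so each training sample influences a smaller neighbourhood.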
The findings in Figure 9 and Table 8 show that the SVM algorithm achieves the
highest classification performance in predicting anxiety disorder, with an
accuracy of 0.97.
Conclusion
References
Ahmed, A., Sultana, R., Ullas, M., Begom, M., Rahi, M., & Alam, M. A. (2020). A machine
learning approach to detect depression and anxiety using supervised learning. In 2020
IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE),
1-6.
Bland, R. C., Orn, H., & Newman, S. C. (1988). Lifetime prevalence of psychiatric
disorders in Edmonton. Acta Psychiatrica Scandinavica, 24-32.
Bodrud-Doza, M., Shammi, M., Bahlman, L., Islam, A. R., & Rahman, M. (2020).
Psychosocial and socio-economic crisis in Bangladesh due to COVID-19 pandemic: a
perception-based assessment. Frontiers in public health, 341.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brooks, S. K., Webster, R. K., Smith, L. E., Woodland, L., Wessely, S., Greenberg, N., &
Rubin, G. J. (2020). The psychological impact of quarantine and how to reduce it:
rapid review of the evidence. The Lancet, 912-920.
Chekroud, A. M., Zotti, R. J., Shehzad, Z., Guerguieva, R., Johnson, M. K., & Corlett, P.
R. (2016). Cross-trial prediction of treatment outcome in depression: a machine
learning approach. The Lancet Psychiatry, 243-250.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Fyfe, C. (2000). Artificial neural networks and information theory.
http://index-of.co.uk/InformationTheory/Neural%20Networks%20And%20Information%20Theory%20-%20Colin%20Fyfe.pdf
[Access Date: 02 April 2022].
Hilbert, K., Lueken, U., Muehlhan, M., & Beesdo-Baum, K. (2017). Separating generalized
anxiety disorder from major depression using clinical, hormonal, and structural MRI
data: a multimodal machine learning study. Brain and behavior, e00633.
Kanwar, A., Malik, S., Prokop, L. J., Sim, L. A., & Murad, M. H. (2013). The association
between anxiety disorders and suicidal behaviors: A systematic review and
meta-analysis. Depression and Anxiety, 917-929.
Karamustafalıoğlu, O., & Yumrukçal, H. (2011). Depresyon ve anksiyete bozuklukları.
Şişli Etfal Hastanesi Tıp Bülteni, 65-74.
Karaağaoğlu, E., Karakaya, J., & Kılıçkap, M. (2016). Tanı Testlerinin
Değerlendirilmesinde İstatistiksel Yöntemler/Statistical Methods in Evaluation of
Diagnostic Tests. Detay Publishing, Ankara.
Larose, D. T. (2014). Discovering knowledge in data: an introduction to data mining
(2nd ed.). John Wiley & Sons, Canada.
Rapee, R. M., Schniering, C. A., & Hudson, J. L. (2009). Anxiety disorders during
childhood and adolescence: Origins and treatment. Annual review of clinical
psychology, 311-341.
Sadock, B. J., & Sadock, V. A. (2008). Kaplan & Sadock’s concise textbook of clinical
psychiatry. Lippincott Williams & Wilkins.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology.
IEEE Transactions on Systems, Man, and Cybernetics, 660-674.
Sau, A., & Bhakta, I. (2017). Predicting anxiety and depression in elderly patients using
machine learning technology. Healthcare Technology Letters, 238-243.
Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, f-score and roc:
a family of discriminant measures for performance evaluation. Australasian Joint
Conference on Artificial Intelligence, 1021.
Soucy, P., & Mineau, G. W. (2001). A simple knn algorithm for text categorization. In: Data
Mining, ICDM, 647-648.
Spitzer, R. L., Kroenke, K., Williams, J. B., & Löwe, B. (2006). A brief measure for
assessing generalized anxiety disorder: the GAD-7. Archives of Internal Medicine,
1092-1097.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine
learning tools and techniques (2nd ed.). Morgan Kaufmann.
Xiong, H., Berkovsky, S., Romano, M., Sharan, R. V., Liu, S., Coiera, E., & McLellan, L.
F. (2021). Prediction of anxiety disorders using a feature ensemble based bayesian
neural network. Journal of Biomedical Informatics, 103921.