Faiml Unit 2
Machine Learning (ML) is broadly defined as a subfield of artificial intelligence that enables computers
to learn from data and improve over time without being explicitly programmed. In essence, ML
algorithms build models from data that can make predictions or decisions. For example, IBM defines
ML as a branch of AI focused on “enabling computers and machines to imitate the way that humans
learn” and improve performance through exposure to more data 1 . DataCamp similarly notes that ML
involves “algorithms that improve automatically through experience and by the use of data,” allowing
machines to learn patterns and make predictions on new data 2 . In practical terms, instead of writing
rules by hand, we give the computer many examples (input–output pairs) and let it generalize a
mapping from inputs to outputs.
ML’s power comes from learning from data. A machine‐learning model is trained by fitting a
mathematical function to a dataset: during training it adjusts internal parameters (e.g. weights in a
neural network) to minimize prediction error 3 4 . The training set comprises labeled examples (for
supervised learning) or just features (for unsupervised learning), and an optimization algorithm (like
gradient descent) tweaks the model so its predictions match the known answers 5 4 . Once trained,
the model can predict outputs for new, unseen inputs by applying the learned mapping. As Wikipedia
explains, ML “build[s] a mathematical model from input data” so that the model can make data-driven
predictions or decisions 6 .
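To make this concrete, here is a minimal sketch of such a training loop: a one-variable linear model y ≈ w·x + b fitted by gradient descent on synthetic data. The dataset, learning rate, and iteration count are illustrative assumptions, not taken from the cited sources.

# Gradient-descent sketch: repeatedly adjust w and b to shrink the
# mean squared prediction error on synthetic (x, y) training pairs.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)   # noisy line, true w=3 and b=2

w, b = 0.0, 0.0          # initial parameters
lr = 0.01                # learning rate
for _ in range(2000):
    error = (w * x + b) - y                # prediction error on the training data
    w -= lr * 2 * np.mean(error * x)       # gradient of MSE with respect to w
    b -= lr * 2 * np.mean(error)           # gradient of MSE with respect to b

print(f"learned w = {w:.2f}, b = {b:.2f}")   # should end up close to 3 and 2

Once trained, applying w·x + b to a new x is exactly the "predict outputs for new, unseen inputs" step described above.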
Machine learning is just one approach within the broader field of Artificial Intelligence (AI). AI refers
to any computer program that performs tasks “similar to how humans solve problems,” such as
reasoning, perception, or language understanding 7 . ML is a subset of AI that focuses specifically on
learning from data. For instance, DataCamp states that “Machine Learning is a subset of AI” that uses
data-driven algorithms for predictions 2 8 . In practice, ML powers many of today’s AI applications
(like image recognition, recommendation systems, autonomous vehicles) by automatically finding
patterns in data, whereas other AI systems might use hand-coded rules or logic without learning from
data.
A Brief History of Machine Learning
The 1960s and 70s saw symbolic AI and early neural models (e.g. the ELIZA chatbot in 1966) 13 , but
also a so-called “AI winter” when progress stalled. Research picked up again in the 1980s and 90s as
new algorithms were invented (e.g. backpropagation for training multi-layer networks, popularized in 1986 14 ,
convolutional networks for image recognition in 1989 15 , and Q-learning for reinforcement learning in
1989 16 ). Notable successes included IBM’s Deep Blue beating the world chess champion in 1997 and
advances in probabilistic methods. In the 2000s, deep learning emerged: Geoffrey Hinton popularized the
term “deep learning” in 2006 to describe multi-layer neural networks 17 , and the creation of large datasets like
ImageNet sparked a rapid AI boom. Today’s ML builds on these foundations: massive computational
power, big data, and advanced algorithms combine to drive modern AI applications.
ML has been hailed as a “transformative technology of our time” that is “going to transform every single vertical” 24 . These examples
show that ML leverages data to optimize processes, uncover hidden patterns, and make predictions
that benefit sectors from retail to robotics.
Machine Learning and Statistics
Machine learning and statistics are closely related fields, both dealing with data and inference, but they
emphasize different goals. A concise way to see the distinction is: statistics aims for inference; machine
learning aims for prediction. As Nature Methods summarizes: “Statistics draws population inferences from a
sample, and machine learning finds generalizable predictive patterns” 28 . In practice, many ML algorithms
(e.g. linear regression, logistic regression, Bayesian models) originated in statistics. However, statistics
traditionally focuses on drawing conclusions about data (e.g. confidence intervals, hypothesis testing)
and understanding relationships within data. Machine learning, by contrast, emphasizes building
models that perform well on new data: its goal is often “making repeatable predictions by finding
patterns within data” 29 .
DataRobot also explains that statistics is about making inferences about a population from a sample,
whereas ML is about making predictive models that work on unseen data 29 . Importantly, modern ML
practitioners often borrow statistical tools (e.g. probability theory, distributions) but apply them at large
scale with iterative algorithms and automation. In summary, ML can be seen as a practical extension of
statistical learning: it uses statistical foundations but with the objective of maximizing predictive
accuracy and scalability on complex datasets 28 29 .
While all machine learning is a form of AI, not all AI systems use machine learning. Traditional AI
might use rule-based expert systems or search algorithms, whereas ML systems learn patterns
automatically. In practice today, many AI applications (like vision, NLP, robotics) use ML as their core,
since ML has proven highly effective at tasks like recognizing images or translating language.
Types of Machine Learning
• Supervised Learning: The algorithm is trained on a labeled dataset (each example has an input
and a known output). The model learns a mapping from inputs to outputs and can then predict
labels for new data. Common supervised tasks include classification (predicting categories) and
regression (predicting continuous values). For example, training a model on thousands of
labeled cat/dog images lets it classify new images as “cat” or “dog.” 31
• Unsupervised Learning: The algorithm is given unlabeled data and must find structure or
patterns on its own. Common techniques include clustering (grouping similar data points) and
dimensionality reduction. For instance, an unsupervised model can group customers into
segments based on their behavior without any predefined labels 32 .
• Semi-Supervised Learning: This lies between supervised and unsupervised. The model is
trained on a large amount of unlabeled data plus a small amount of labeled data. Typically, the
model is first trained on the labeled subset and then refined using the unlabeled data 33 . This is
useful when labeling is expensive: a few labeled examples can guide the learning on much more
unlabeled data.
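As an illustration of the semi-supervised case, the sketch below uses scikit-learn's SelfTrainingClassifier, which treats examples labeled -1 as unlabeled; the synthetic dataset and the 10% labeling rate are assumptions made for illustration.

# Semi-supervised sketch: a few true labels guide learning over a large unlabeled pool.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend labeling is expensive: keep roughly 10% of the labels, mark the rest -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.10] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)          # trains on the labeled subset, then self-labels the rest
print(model.score(X, y))         # accuracy measured against the full set of true labels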
Figure: Comparing Supervised and Unsupervised Learning. In supervised learning, models train on labeled
data (known input–output pairs), whereas unsupervised learning uses unlabeled data to discover
inherent structures 31 32 .
In supervised learning, the model learns a function from inputs to outputs using labeled examples
31 . Two main types of supervised tasks are:
• Classification: The output is a discrete class or category (e.g. “spam” vs. “not spam”, or “cat” vs.
“dog”). For classification problems, the model learns decision boundaries between classes. As
GeeksforGeeks explains, classification handles “discrete outcomes” such as yes/no or multi-class
labels 35 . For example, an email filter that flags spam uses classification.
• Regression: The output is a continuous value (e.g. price, temperature). Regression aims to find
the best-fit curve or line through the data. It “focuses on finding the best-fitting line to predict
numerical outcomes” 36 . For example, predicting tomorrow’s stock price or a house’s market
value uses regression.
Both classification and regression require labeled training data. In practice, many algorithms (like
decision trees or neural networks) can be applied to either task depending on how the output is
encoded (categorical vs. numeric).
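As a brief sketch of both supervised tasks, the example below trains a classifier and a regressor on synthetic scikit-learn data; the datasets and model choices are illustrative assumptions.

# Classification (discrete labels) and regression (continuous targets) share one workflow.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class.
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: predict a continuous numeric value.
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))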
Bayesian Methods
Bayesian approaches apply probability theory and Bayes’ theorem to learning. In ML, naive Bayes
classifiers are a simple yet effective example. These models compute the probability of each class given
the input features and choose the class with the highest posterior probability. Naive Bayes assumes
feature independence and uses the formula P(C | x1, …, xn) ∝ P(C) · P(x1 | C) · P(x2 | C) · … · P(xn | C).
Despite its simplicity, naive Bayes often works well in practice. Wikipedia describes naive Bayes as a
family of “probabilistic classifiers which assumes that the features are conditionally independent” given
the class 37 . Because it relies on straightforward probability counts, naive Bayes scales to very large
datasets (counting feature occurrences) and is commonly used in text classification and spam filtering.
In general, Bayesian methods in ML combine prior beliefs and observed data to make inferences,
though many modern implementations are rooted in frequentist training for convenience 37 .
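For illustration, here is a minimal multinomial naive Bayes spam filter built from word counts; the tiny corpus below is hypothetical.

# Naive Bayes text classification: count words, then score each class by its posterior.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer, claim your prize", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free offer", "team meeting tomorrow"]))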
Clustering (Unsupervised Learning)
Clustering is a key unsupervised technique for grouping similar data. Given only features and no labels,
clustering algorithms partition the data into clusters so that points in the same cluster are more similar
to each other than to those in other clusters. A popular example is k-means clustering, which
iteratively assigns points to the nearest of k centroids and adjusts centroids to minimize within-cluster
variance. DataCamp notes that unsupervised learning is “often used for clustering,” with k-means as a
common algorithm 38 . Other clustering methods include hierarchical clustering, DBSCAN, and
Gaussian mixtures. Clustering helps uncover hidden structures, such as grouping customers by
purchasing behavior or segmenting images by content, without requiring labeled examples.
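A minimal k-means sketch on synthetic data follows; the blob dataset and the choice of k = 3 are assumptions for illustration.

# k-means: assign points to the nearest of k centroids, then recompute the centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels are ignored
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # learned centroids
print(km.labels_[:10])        # cluster assignments of the first 10 points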
Decision Trees
A decision tree is a versatile supervised learning model for both classification and regression 39 . It
builds a tree of decisions: each internal node tests a feature, branches to different child nodes based on
the feature value, and leaves represent predicted outputs 39 . For example, a decision tree for loan
approval might ask, “Is income > $50K?” then branch accordingly. Trees are intuitive and interpretable,
as they mimic human decision rules. However, they can overfit if too deep. IBM notes that smaller trees
are easier to generalize, while larger trees risk “data fragmentation” and overfitting 40 . To avoid this,
trees are often pruned (removing low-importance branches) or ensembled (e.g. Random Forests or
boosting) 41 . Overall, decision trees are powerful because they capture nonlinear feature interactions
and require little data preprocessing.
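The sketch below fits a depth-limited (effectively pre-pruned) decision tree on the Iris dataset and prints its learned rules; the dataset and depth are illustrative assumptions.

# A shallow decision tree: limiting max_depth is a simple guard against overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", tree.score(X_te, y_te))
print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))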
Dimensionality Reduction
Dimensionality reduction techniques reduce the number of features while retaining important
information. This helps simplify models and avoid the “curse of dimensionality” when data has many
variables. Principal Component Analysis (PCA) is a common method: it finds a new set of orthogonal
axes (principal components) that capture the most variance in the data. As DataCamp explains,
dimensionality reduction “involves reducing the number of random variables under consideration by
obtaining a set of principal variables” 42 . Other methods include Linear Discriminant Analysis (LDA), t-
SNE, and autoencoders (neural-network-based). Reducing dimensions can improve visualization, speed
up training, and remove noise.
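A minimal PCA sketch follows, projecting the four Iris features onto two principal components; the dataset and component count are assumptions for illustration.

# PCA: find orthogonal directions of maximum variance and project the data onto them.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # 150 samples reduced from 4 to 2 features
print(pca.explained_variance_ratio_)     # share of variance kept by each component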
Deep Learning and Neural Networks
“Deep learning” refers to neural networks with many layers (deep architectures) 44 . DataCamp
describes deep learning as a subfield of ML using multi-layered artificial neural networks that can
“learn from enormous amounts of data” to achieve high accuracy 44 . Deep networks automatically
extract features: for example, early layers in a convolutional neural network might learn edges, while
later layers learn shapes and objects. Modern deep learning has led to breakthroughs in image and
speech recognition, natural language processing, and more.
Figure: Illustration of a neural network concept. Neural networks consist of layers of interconnected
neurons 43 . Deep learning uses many such layers (often hidden) to learn complex feature hierarchies
from data 44 .
Neural networks are typically trained by backpropagation and gradient descent, similar to other models
3 . They require large amounts of data and computation, but can achieve superior performance on complex tasks such as image and speech recognition.
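For illustration, the sketch below trains a small multi-layer network with backpropagation using scikit-learn's MLPClassifier on the digits dataset; the architecture and dataset are assumptions, not taken from the cited sources.

# A small feed-forward network trained by gradient descent with backpropagation.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_tr, y_tr)                       # backpropagation adjusts the layer weights
print("test accuracy:", net.score(X_te, y_te))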
Training and Evaluating Models
After a model has been fit on the training set, it is evaluated on a separate validation set 45 . The validation set simulates unseen data
and is used to tune hyperparameters (e.g. number of layers, learning rate) and guard against overfitting
45 . For instance, if validation error starts rising while training error keeps falling, the model is likely
overfitting (memorizing the training data). Techniques like early stopping (halting training when
validation error increases) and regularization (penalizing large weights) are used to improve
generalization 45 .
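The sketch below illustrates this diagnostic, assuming a decision-tree model and the digits dataset: training accuracy keeps climbing as the tree gets deeper, while validation accuracy levels off (or drops) once the model starts memorizing the training data.

# Compare training vs. validation accuracy as model capacity (tree depth) grows.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for depth in (2, 4, 8, 16, None):          # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.3f}  "
          f"validation={tree.score(X_val, y_val):.3f}")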
Finally, the model’s performance is measured on the test set, which the model has never seen 46 . This
provides an unbiased estimate of how the model will perform in the real world. Common evaluation
metrics include accuracy, precision/recall for classification, and mean-squared error for regression.
Cross-validation (resampling the data into different train/validation folds) is often used to get a robust
performance estimate when data is limited. Throughout training, engineers also perform feature
engineering, tuning, and error analysis.
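A minimal end-to-end evaluation sketch follows, assuming a logistic-regression model and the digits dataset: cross-validation on the training portion gives a robust estimate, and the held-out test set is scored once at the end.

# Cross-validate on the training data, then report final metrics on the test set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=2000)
scores = cross_val_score(model, X_tr, y_tr, cv=5)          # 5-fold cross-validation
print("cv accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

model.fit(X_tr, y_tr)                                      # retrain on all training data
print(classification_report(y_te, model.predict(X_te)))    # precision/recall per class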
In summary, training involves optimizing a model on data and rigorously testing its predictive ability. As
IBM describes, the ML process has a decision process (predicting/classifying), an error function
(measuring accuracy), and a model optimization step (adjusting weights) 47 5 . Good training yields
a model that not only fits historical data but also generalizes well to new data.
Sources: Definitions and insights above are drawn from authoritative ML references and recent surveys,
including IBM and MIT Sloan articles 1 8 , DataCamp and Doma articles 2 48 , and technical
sources on ML history and methods 49 11 3 .
1 4 43 47 What Is Machine Learning (ML)? | IBM
https://www.ibm.com/think/topics/machine-learning