FDS Module 1 Notes
Organized data: This refers to data that is sorted into a row/column structure, where every row
represents a single observation and the columns represent the characteristics of that
observation.
Unorganized data: This is data in free form, usually text or raw audio/signals, that must be
parsed further to become organized.
Data science is the art and science of acquiring knowledge through data. Data science is all about how
we take data, use it to acquire knowledge, and then use that knowledge to do the following:
Make decisions
Predict the future
Understand the past/present
Create new industries/products
Machine learning: This refers to giving computers the ability to learn from data without explicit
"rules" being given by a programmer. Machine learning combines the power of computers with
intelligent learning algorithms in order to automate the discovery of relationships in data and
create powerful data models. Speaking of data models, we will concern ourselves with the
following two basic types of data models:
Probabilistic model: This refers to using probability to find a relationship between elements that
includes a degree of randomness.
Statistical model: This refers to taking advantage of statistical theorems to formalize
relationships between data elements in a (usually) simple mathematical formula.
Exploratory data analysis (EDA) refers to preparing data in order to standardize results and gain
quick insights. EDA is concerned with data visualization and preparation.
Data mining is the process of finding relationships between elements of data. Data mining is the
part of data science where we try to find relationships between variables (think spawn-recruit
model).
Big Data: Big Data refers to datasets that are too large and complex for traditional data
processing applications. It involves the processing of vast amounts of data using parallel and
distributed computing.
Artificial Intelligence (AI): Artificial Intelligence is a broader concept that refers to the
development of computer systems capable of performing tasks that normally require human
intelligence. Machine Learning is a subset of AI.
Types of Data
We will look at the three basic classifications of data: structured versus unstructured,
quantitative versus qualitative, and the four levels of data.
Structured (organized) data: This is data that can be thought of as observations and
characteristics. It is usually organized using a table method (rows and columns).
Unstructured (unorganized) data: This data exists as a free entity and does not follow any
standard organization hierarchy.
Here are a few examples that could help you differentiate between the two:
Most data that exists in text form, including server logs and Facebook posts, is unstructured
Scientific observations, as recorded by careful scientists, are kept in a very neat and organized
(structured) format
A genetic sequence of chemical nucleotides is unstructured
Quantitative versus qualitative data
Quantitative data: This data can be described using numbers, and basic mathematical
procedures, including addition, are possible on the set.
Qualitative data: This data cannot be described using numbers and basic mathematics. This data
is generally thought of as being described using "natural" categories and language.
Quantitative data can be broken down, one step further, into discrete and continuous quantities. These
can be defined as follows:
Discrete data: This describes data that is counted. It can only take on certain values. Examples of
discrete quantitative data include a dice roll, because it can only take on six values, and the
number of customers in a café, because you can't have a fractional number of people.
Continuous data: This describes data that is measured. It exists on an infinite range of values. A
good example of continuous data would be a person's weight because it can be 150 pounds or
197.66 pounds (note the decimals).
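A tiny illustration of the difference, using Python's standard library (the ranges below are arbitrary):
import random
# Discrete: counted -- a dice roll can only take the values 1 through 6
dice_rolls = [random.randint(1, 6) for _ in range(5)]
# Continuous: measured -- a weight can fall anywhere within a range
weights = [round(random.uniform(120.0, 200.0), 2) for _ in range(5)]
print(dice_rolls)   # e.g., [3, 1, 6, 6, 2] -- whole numbers only
print(weights)      # e.g., [150.0, 197.66, ...] -- arbitrary decimals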
The four levels of data
1. Nominal Level:
At the nominal level, data are categorized into distinct groups or classes with no inherent
order or ranking.
Nominal data are qualitative and involve the classification of items into discrete categories.
Arithmetic operations like addition or subtraction are not meaningful at this level.
2. Ordinal Level:
The ordinal level introduces the concept of order or rank among the categories, but the
intervals between them are not consistent or meaningful.
Data at the ordinal level represent relative positioning, but the differences between the
categories are not quantifiable.
3. Interval Level:
The interval level introduces a consistent and meaningful scale with equal intervals between
consecutive points.
Zero in interval data does not represent the absence of the attribute but is rather an
arbitrary point.
Arithmetic operations like addition and subtraction are meaningful, but multiplication or
division may not be appropriate.
4. Ratio Level:
The ratio level is the highest and most informative level of measurement.
It possesses all the properties of interval data, but with a true zero point that indicates the
absence of the attribute.
All basic arithmetic operations, including multiplication and division, are meaningful at the
ratio level.
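To make the four levels concrete, here are illustrative example values at each level (the specific categories below are made up for the example):
# Nominal: categories with no order; arithmetic is meaningless
blood_types = ["A", "B", "AB", "O"]
# Ordinal: ordered categories, but the gaps between ranks are not quantifiable
satisfaction = ["low", "medium", "high"]
# Interval: equal spacing, but zero is arbitrary (0 C is not "no temperature")
temps_celsius = [0.0, 10.0, 20.0]   # 20 C is not "twice as hot" as 10 C
# Ratio: equal spacing with a true zero, so ratios are meaningful
weights_kg = [0.0, 35.0, 70.0]      # 70 kg really is twice 35 kg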
The Data Science Process
1. Ask a question:
Purpose:
Data science starts with formulating a clear and relevant question or problem to solve.
This step is crucial in guiding the entire data science process.
2. Obtain the data:
Purpose:
Acquiring relevant and sufficient data to address the questions posed in the first step.
This involves collecting, accessing, or retrieving the necessary datasets.
3. Explore the data:
Purpose:
Understand the characteristics and patterns within the data. Exploratory Data Analysis
(EDA) is conducted to gain insights and inform subsequent steps.
Key Actions:
Descriptive statistics.
Data visualization.
4. Model the data:
Purpose:
Building a predictive model or statistical analysis to answer the questions posed initially.
This step involves selecting appropriate algorithms and techniques.
Key Actions:
Feature engineering.
When looking at a new dataset, whether it is familiar to you or not, it is important to begin with a
set of guiding questions as part of your preliminary analysis.
One of the reasons NumPy is so important for numerical computations in Python is because it is
designed for efficiency on large arrays of data. There are a number of reasons for this:
NumPy internally stores data in a contiguous block of memory, independent of other built-in
Python objects. NumPy arrays also use much less memory than built-in Python sequences.
NumPy operations perform complex computations on entire arrays without the need for Python
for loops, which are slow for large sequences. As a result, NumPy-based algorithms are generally
10 to 100 times faster (or more) than their pure Python counterparts and use significantly less
memory.
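A quick, informal way to see this for yourself is to time the same element-wise operation both ways; exact numbers vary by machine (and the array size below is an arbitrary choice), but the gap is usually dramatic:
import numpy as np
import time

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Pure Python: an explicit loop over every element
start = time.perf_counter()
py_result = [x * 2 for x in py_list]
py_time = time.perf_counter() - start

# NumPy: one vectorized operation over the whole array
start = time.perf_counter()
np_result = np_arr * 2
np_time = time.perf_counter() - start

print(f"Python loop: {py_time:.4f}s  NumPy: {np_time:.4f}s")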
Creating ndarrays:
A NumPy ndarray can be created in several ways, with different shapes, sizes, and initial values.
Here are some common ways to create NumPy arrays:
1. Using numpy.array():
The simplest way to create an array is by converting a Python list or tuple using the
numpy.array() function.
import numpy as np
arr = np.array([1, 2, 3])
2. Using numpy.zeros() and numpy.ones():
Create arrays filled with zeros or ones.
zeros_arr = np.zeros((3, 4)) # 3x4 array of zeros
ones_arr = np.ones((2, 2)) # 2x2 array of ones
3. Using numpy.arange():
Create an array with a range of values.
range_arr = np.arange(0, 10, 2) # Array from 0 to 8 with step 2
4. Using numpy.linspace():
Create an array with evenly spaced values over a specified range.
linspace_arr = np.linspace(0, 1, 5) # 5 evenly spaced values between 0 and 1
5. Using numpy.eye() and numpy.identity():
Create identity matrices.
identity_matrix = np.eye(3) # 3x3 identity matrix
identity_matrix_alt = np.identity(2) # 2x2 identity matrix
6. Using numpy.random module:
Create arrays with random values.
random_arr = np.random.rand(2, 3) # 2x3 array with random values between 0 and 1
7. Using numpy.full():
Create an array with a specified fill value.
filled_arr = np.full((2, 2), 7) # 2x2 array filled with the value 7
8. Using numpy.reshape():
Reshape an existing array.
original_arr = np.arange(1, 10)
reshaped_arr = original_arr.reshape((3, 3))
9. Using numpy.concatenate():
Concatenate multiple arrays along an existing axis.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
concatenated_arr = np.concatenate((arr1, arr2))
10. Using numpy.stack():
Stack arrays along a new axis.
stacked_arr = np.stack((arr1, arr2))
import numpy as np
# Creating ndarray from a list
my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)
print(my_array)
# Generating an array with NumPy
generated_array = np.arange(0, 10, 2) # Creates an array with values from 0 to 8 (step 2)
print(generated_array)
In NumPy, every element of an ndarray is of the same data type. NumPy provides a variety of data
types, and you can specify the data type when creating an array or convert the data type later. Here are
some common data types in NumPy:
1. Integer Types:
np.int8, np.int16, np.int32, np.int64
np.uint8, np.uint16, np.uint32, np.uint64
2. Floating-Point Types:
np.float16, np.float32, np.float64
3. Complex Types:
np.complex64, np.complex128
4. Other Types:
np.bool_: Boolean type
np.object_: Object type
np.string_, np.unicode_: String types
You can specify the data type when creating an array using the dtype parameter. For example:
import numpy as np
# Create an array with specified data type
arr_int32 = np.array([1, 2, 3], dtype=np.int32)
arr_float64 = np.array([1.1, 2.2, 3.3], dtype=np.float64)
If you don't explicitly specify a data type, NumPy will infer the data type based on the input values. You
can check the data type of an array using the dtype attribute:
arr = np.array([1, 2, 3])
print(arr.dtype) # Output: int64 (platform dependent; may be int32 on some systems)
You can also convert the data type of an existing array using the astype() method:
arr = np.array([1, 2, 3])
arr_float = arr.astype(np.float64)
NumPy also supports more advanced data types, such as structured data types and datetime types,
which are useful for working with structured data and time-related information.
It's important to choose the appropriate data type based on the nature of your data to ensure efficient
memory usage and appropriate precision for your computations. NumPy provides a high level of
flexibility in this regard.
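As a brief illustration of those more advanced types (the field names and dates here are made up for the example):
import numpy as np
# Structured dtype: each element is a record with named fields
people = np.array([("Alice", 30), ("Bob", 25)],
                  dtype=[("name", "U10"), ("age", np.int32)])
print(people["name"])   # ['Alice' 'Bob']
# datetime64: dates as array elements, with date arithmetic
dates = np.array(["2024-01-01", "2024-06-15"], dtype="datetime64[D]")
print(dates[1] - dates[0])   # 166 days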
NumPy arrays support element-wise arithmetic operations, broadcasting, and various mathematical
operations, making it convenient for numerical computations. Here are some common arithmetic
operations you can perform with NumPy arrays:
1. Element-wise Operations:
NumPy allows you to perform element-wise operations on arrays. The operation is applied
to each element in the array individually.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Element-wise addition
result_addition = arr1 + arr2
# [5, 7, 9]
# Element-wise multiplication
result_multiplication = arr1 * arr2
# [4, 10, 18]
2. Scalar Operations:
You can perform arithmetic operations with a scalar value, and the operation will be applied
to each element in the array.
arr = np.array([1, 2, 3])
# Scalar addition
result_scalar_addition = arr + 5
# [6, 7, 8]
# Scalar multiplication
result_scalar_multiplication = arr * 2
# [2, 4, 6]
Elements in an array can be accessed using indexing and sliced using the colon (:) notation.
Indexing starts at 0, and negative indices count from the end.
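A minimal sketch of basic indexing and slicing:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
print(arr[0])     # 10 -- first element (indexing starts at 0)
print(arr[-1])    # 50 -- negative indices count from the end
print(arr[1:4])   # [20 30 40] -- from index 1 up to, but not including, 4
print(arr[::2])   # [10 30 50] -- every second element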
Combining Conditions:
You can use the logical operators & (AND), | (OR), and ~ (NOT) to combine multiple conditions;
each condition must be parenthesized.
arr = np.array([1, 2, 3, 4, 5])
# Combined conditions
selected_elements = arr[(arr > 2) & (arr < 5)]
print(selected_elements)
# Output: [3 4]
Fancy Indexing:
Fancy indexing in NumPy refers to using arrays of indices, rather than single indices or slices, to access or
modify elements of an array. This technique provides a convenient and powerful way to select or modify
specific elements based on more complex criteria. Fancy indexing can be applied to one-dimensional
and multidimensional arrays.
1D Fancy Indexing:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Using an array of indices for selection
indices = np.array([0, 3, 4])
selected_elements = arr[indices]
print(selected_elements)
# Output: [1 4 5]
2D Fancy Indexing:
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Using arrays of indices for row and column selection
row_indices = np.array([0, 1, 2])
col_indices = np.array([1, 2, 0])
selected_elements = arr_2d[row_indices, col_indices]
print(selected_elements)
# Output: [2 6 7]
Fancy indexing provides a more advanced and flexible way to index and modify arrays compared to basic
indexing and slicing. It is particularly useful when you need to select or modify elements based on
specific patterns, conditions, or sets of indices. However, keep in mind that fancy indexing often creates
a copy of the data rather than a view, so modifications may not affect the original array.
# Fancy indexing
arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i             # broadcast i across row i
print(arr)
print(arr[[4, 3, 0, 6]])   # rows 4, 3, 0, 6, in that order
print(arr[[-3, -5, -7]])   # negative indices select rows from the end
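To see the copy-versus-view distinction mentioned above in action, here is a small sketch:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
fancy = arr[[0, 1, 2]]   # fancy indexing returns a copy
fancy[0] = 99
print(arr)               # [1 2 3 4 5] -- original unchanged
sliced = arr[0:3]        # basic slicing returns a view
sliced[0] = 99
print(arr)               # [99  2  3  4  5] -- original modified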
In NumPy, transposing an array involves swapping its rows and columns. The transpose of a 2D array is
obtained by switching its rows with columns. NumPy provides the .T attribute and the
numpy.transpose() function for this purpose.
Using .T attribute:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = arr.T
print(transposed_arr)
# Output:
# [[1 4]
# [2 5]
# [3 6]]
Using numpy.transpose() function:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = np.transpose(arr)
print(transposed_arr)
# Output:
# [[1 4]
# [2 5]
# [3 6]]
Swapping Axes:
For 2D arrays, the .T attribute is equivalent to swapping the axes using numpy.swapaxes().
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
swapped_arr = arr.swapaxes(0, 1)
print(swapped_arr)
# Output:
# [[1 4]
#  [2 5]
#  [3 6]]
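For arrays with more than two dimensions, the same ideas generalize: np.transpose() accepts an explicit axis order, and swapaxes() exchanges any single pair of axes. A small sketch:
import numpy as np
arr_3d = np.arange(24).reshape((2, 3, 4))
reordered = np.transpose(arr_3d, (1, 0, 2))   # move axis 1 to the front
print(arr_3d.shape)                 # (2, 3, 4)
print(reordered.shape)              # (3, 2, 4)
print(arr_3d.swapaxes(0, 1).shape)  # (3, 2, 4) -- same result for one pair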
1. Square Root (sqrt): Imagine you have an array filled with numbers, and you want to find the square
root of each element. Enter the np.sqrt() ufunc. It magically computes the square root of each element
in your array, all in one go. Here's a quick example:
import numpy as np
my_array = np.array([4, 9, 16])
result = np.sqrt(my_array)
print(result)   # [2. 3. 4.]
The np.sqrt() ufunc takes care of business, giving you the square roots of 4, 9, and 16.
2. Exponential (exp): Now, what if you want to calculate the exponential of each element in your array?
No worries, the np.exp() ufunc has got you covered. Check it out:
import numpy as np
my_array = np.array([1, 2, 3])
result = np.exp(my_array)
print(result)   # [ 2.71828183  7.3890561  20.08553692]
With a snap of its virtual fingers, np.exp() computes the exponential of 1, 2, and 3.
3. Addition (add): When you have two arrays and you want to add their corresponding elements
together, the np.add() ufunc is your go-to friend. It's as easy as pie:
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
result = np.add(array1, array2)
print(result)   # [5 7 9]
np.add() works its magic, summing up each pair of elements from array1 and array2.
4. Maximum (maximum): If you're on the hunt for the maximum values between two arrays, look no
further than the np.maximum() ufunc. Here's a quick example:
import numpy as np
array1 = np.array([1, 5, 3])
array2 = np.array([4, 2, 6])
result = np.maximum(array1, array2)
print(result)
np.maximum() compares elements from array1 and array2 and delivers the maximum value at each
position.
5. Fractional and Integer Parts (modf): Ever wondered about the fractional and integer parts of the
numbers in your array? Enter the np.modf() ufunc:
import numpy as np
my_array = np.array([3.5, -2.25, 7.0])
fractional, integral = np.modf(my_array)
print(fractional)   # [ 0.5  -0.25  0.  ]
print(integral)     # [ 3. -2.  7.]
With a wave of its virtual wand, np.modf() separates the fractional and integer parts of your array
elements.
Array-oriented Programming with arrays:
Array-oriented programming refers to a programming paradigm that emphasizes the use of array
operations for efficient and concise code. It is particularly relevant in the context of scientific computing,
data analysis, and numerical programming. The idea is to express operations in a way that can be
applied to entire arrays, avoiding the need for explicit loops over individual elements. Array-oriented
programming is closely associated with languages and libraries that support array operations efficiently,
and NumPy is a prime example in the Python ecosystem.
1. Vectorized Operations: Suppose you have two arrays, array1 and array2, and you want to add their
corresponding elements together. Instead of a loop, you can use the '+' operator directly:
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
result = array1 + array2
print(result)
NumPy takes care of the element-wise addition without the need for explicit loops. It's like telling
NumPy to handle the math for you.
2. Broadcasting: Broadcasting is a feature that allows arrays of different shapes to be combined in a way
that makes sense. This enables operations between arrays with different dimensions without the need
for explicit reshaping.
import numpy as np
my_array = np.array([1, 2, 3])
result = my_array * 2
print(result)
Here, NumPy automatically broadcasts the scalar value 2 to match the shape of my_array, making the
multiplication happen effortlessly.
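Broadcasting also works between arrays of different dimensions, not just array-and-scalar. For instance, a 1D array can be added to each row of a 2D array (the values here are illustrative):
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)
print(matrix + row)              # the row is "stretched" across both rows
# [[11 22 33]
#  [14 25 36]]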
3. Universal Functions (ufuncs): We've already touched on this briefly. Universal functions in NumPy, or
ufuncs, are like special operators that perform element-wise operations on arrays. They are the
workhorses behind array-oriented programming, ensuring efficient and optimized computations.
Whether you're calculating square roots, exponentials, or any other element-wise operation, ufuncs
make it easy to apply the same operation to every element in an array without breaking a sweat.
In a nutshell, array-oriented programming with the ndarray in NumPy is about embracing a style that
simplifies your code, makes it more readable, and takes advantage of NumPy's efficient array
operations. It's like telling NumPy, "Handle the arrays as a whole, and I'll focus on the bigger picture."
This approach can significantly enhance your data science coding experience.
Imagine you have a dataset stored in a file, and you want to bring it into your Python environment as a
NumPy array. NumPy provides a straightforward way to achieve this using functions like np.loadtxt().
Here's a simple example. Let's say you have a file named "data.txt" containing numeric values arranged
in columns:
import numpy as np
my_data = np.loadtxt('data.txt')
print(my_data)
In this snippet, np.loadtxt() reads the data from the file and converts it into a NumPy array. You can now
work with my_data as a regular NumPy array in your Python code.
import numpy as np
result = np.arange(5) * 2.0   # an example array to save
np.savetxt('output.txt', result)
With this, your NumPy array result is saved to a file named "output.txt," and you can share or use it later
as needed.
NumPy's file I/O functions can handle various file formats, from simple text files to more complex ones.
For instance, you can read and write data in CSV format, which is a common way of storing structured
data:
import numpy as np
csv_data = np.genfromtxt('data.csv', delimiter=',')    # read CSV into an array
np.savetxt('output.csv', csv_data, delimiter=',')      # write it back out as CSV
Here, np.genfromtxt() is used to read data from a CSV file, and np.savetxt() is employed to write data
back to a CSV file.
In essence, file input and output with NumPy's ndarray make it seamless to transfer data between your
Python environment and external files. Whether you're loading datasets for analysis or saving your
results for future use, NumPy simplifies the process, allowing you to focus on the data science tasks at
hand.
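Beyond text-based formats, NumPy also has its own binary .npy format via np.save() and np.load(), which preserves the array's dtype and shape exactly (a minimal sketch, with an assumed file name):
import numpy as np
arr = np.arange(10)
np.save('my_array.npy', arr)      # write in NumPy's binary format
loaded = np.load('my_array.npy')  # read it back, dtype and shape intact
print(loaded)                     # [0 1 2 3 4 5 6 7 8 9]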
Linear algebra:
In the fundamentals of data science, we often encounter scenarios where we need to perform
mathematical operations on arrays, and this is where the linear algebra capabilities of NumPy's ndarray
come into play. Let's dive into the basics of how NumPy makes linear algebra operations smooth and
accessible.
1. Element-wise Operations:
First off, NumPy allows us to perform basic arithmetic operations element-wise on arrays. This means if
you have two arrays, you can add, subtract, multiply, or divide their corresponding elements effortlessly.
import numpy as np
# Two arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Element-wise addition
print(a + b)   # [5 7 9]
# Element-wise multiplication
print(a * b)   # [ 4 10 18]
NumPy takes care of handling each element independently, making these operations concise and
readable.
2. Dot Product:
Moving a step further, the dot product is a fundamental operation in linear algebra. NumPy provides a
simple way to calculate the dot product of two arrays using the np.dot() function.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Dot product: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))   # 32
The dot product is the sum of the products of corresponding elements, and NumPy handles this
calculation seamlessly.
3. Matrix Multiplication:
Linear algebra often involves working with matrices, and NumPy excels in matrix multiplication. The @
operator or the np.matmul() function can be used for matrix multiplication.
import numpy as np
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
# Matrix multiplication via @ (equivalently: np.matmul(mat1, mat2))
result_matrix_multiply = mat1 @ mat2
print(result_matrix_multiply)
# [[19 22]
#  [43 50]]
NumPy's matrix multiplication functions make it easy to handle complex linear algebra operations
without diving into intricate manual calculations.
For more advanced linear algebra tasks, such as finding eigenvalues and eigenvectors, NumPy has got
you covered. The np.linalg.eig() function can be employed to compute these essential properties of
matrices.
import numpy as np
# A matrix
matrix = np.array([[4, 2], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)
NumPy simplifies these complex linear algebra concepts, making them accessible even for those not
deeply immersed in mathematical intricacies.
In a nutshell, NumPy's ndarray brings the power of linear algebra to data science by providing
convenient and efficient ways to perform essential operations. Whether you're dealing with simple
element-wise calculations or diving into more advanced matrix manipulations, NumPy's functionalities
streamline the process, allowing you to focus on the insights hidden in your data.
Random number generators in computer programs are typically pseudo-random. They generate
sequences of numbers that appear random, but they are actually determined by an initial value called a
seed. If you use the same seed, you will get the same sequence of random numbers. This property is
useful for reproducibility in scientific research, simulations, or any situation where you want to ensure
that the randomness is repeatable.
1. Setting the Seed:
import numpy as np
np.random.seed(42)
By setting the seed to 42, you're starting the pseudorandom sequence at a specific point, making the
results reproducible.
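A quick sketch of that reproducibility: resetting to the same seed replays the exact same "random" sequence.
import numpy as np
np.random.seed(42)
first_run = np.random.rand(3)
np.random.seed(42)           # reset to the same seed...
second_run = np.random.rand(3)
print(np.array_equal(first_run, second_run))  # True -- identical sequences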
2. Uniform Distribution:
NumPy provides functions to generate pseudorandom numbers following various distributions. For
example, np.random.rand() produces numbers from a uniform distribution between 0 and 1.
import numpy as np
uniform_numbers = np.random.rand(5)
These numbers appear evenly spread out between 0 and 1, giving you a sense of randomness.
3. Normal Distribution:
If you're into a normal distribution (bell curve), NumPy has got your back with np.random.randn().
import numpy as np
normal_numbers = np.random.randn(5)
These numbers tend to cluster around 0, resembling the typical bell-shaped curve of a normal
distribution.
4. Integer Values:
Sometimes, you might need pseudorandom integers. NumPy's got a function for that too –
np.random.randint().
import numpy as np
random_integers = np.random.randint(0, 10, size=5)   # five random ints from 0 to 9
print(random_integers)
5. Shuffling Arrays:
Another cool use case is shuffling arrays. Let's say you have an array, and you want to randomize its
order.
import numpy as np
# An array
my_array = np.array([1, 2, 3, 4, 5])
np.random.shuffle(my_array)   # shuffles the array in place
print(my_array)
In essence, pseudorandom number generation with NumPy's ndarray is like having controlled chaos.
You can create the appearance of randomness, set the starting point with a seed, and explore a variety
of distributions for your data science experiments. It's a powerful tool for generating diverse sets of
numbers with a touch of predictability.