
FDS Module 1 Notes

This document provides an introduction to key concepts in data science. It defines data science as using data to acquire knowledge and make decisions. The three basic areas of data science are math/statistics, computer programming, and domain knowledge. It also defines common terminologies like machine learning, data mining, and big data. Additionally, it outlines the different types of data like structured vs unstructured, quantitative vs qualitative, and the four levels of data. Finally, it describes the five essential steps of data science: asking a question, obtaining data, exploring data, modeling data, and communicating results.

Uploaded by Nanda Kishore. E

Module 1: INTRODUCTION TO DATA SCIENCE

Basic terminologies of data science


Whenever we use the word "data", we refer to a collection of information in either an organized or
unorganized format:

 Organized data: This refers to data that is sorted into a row/column structure, where every row
represents a single observation and the columns represent the characteristics of that
observation.
 Unorganized data: This is data that exists in free form, usually text or raw audio/signals, which
must be parsed further to become organized.

Data science is the art and science of acquiring knowledge through data. It is all about how we take
data, use it to acquire knowledge, and then use that knowledge to do the following:

 Make decisions
 Predict the future
 Understand the past/present
 Create new industries/products

Understanding data science begins with three basic areas:

 Math/statistics: This is the use of equations and formulas to perform analysis


 Computer programming: This is the ability to use code to create outcomes on the computer
 Domain knowledge: This refers to understanding the problem domain (medicine, finance, social
science, and so on)
Here are some common terminologies you are likely to come across:

 Machine learning: This refers to giving computers the ability to learn from data without explicit
"rules" being given by a programmer. Machine learning combines the power of computers with
intelligent learning algorithms in order to automate the discovery of relationships in data and the
creation of powerful data models. Speaking of data models, we will concern ourselves with the
following two basic types:
 Probabilistic model: This refers to using probability to find a relationship between elements that
includes a degree of randomness.
 Statistical model: This refers to taking advantage of statistical theorems to formalize
relationships between data elements in a (usually) simple mathematical formula.
 Exploratory data analysis (EDA) refers to preparing data in order to standardize results and gain
quick insights. EDA is concerned with data visualization and preparation.
 Data mining is the process of finding relationships between elements of data. Data mining is the
part of data science where we try to find relationships between variables (think spawn-recruit
model).
 Big Data: Big Data refers to datasets that are too large and complex for traditional data
processing applications. It involves the processing of vast amounts of data using parallel and
distributed computing.
 Artificial Intelligence (AI): Artificial Intelligence is a broader concept that refers to the
development of computer systems capable of performing tasks that normally require human
intelligence. Machine Learning is a subset of AI.

Types of Data
We will look at the three basic classifications of data:

 Structured vs unstructured (sometimes called organized vs unorganized)


 Quantitative vs qualitative
 The four levels of data

Structured versus unstructured data

 Structured (organized) data: This is data that can be thought of as observations and
characteristics. It is usually organized using a table method (rows and columns).
 Unstructured (unorganized) data: This data exists as a free entity and does not follow any
standard organization hierarchy.

Here are a few examples that could help you differentiate between the two:

 Most data that exists in text form, including server logs and Facebook posts, is unstructured
 Scientific observations, as recorded by careful scientists, are kept in a very neat and organized
(structured) format
 A genetic sequence of chemical nucleotides is unstructured
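The difference can be made concrete with a small sketch that parses free-form text into a row/column structure. The log lines below are made up purely for illustration:

```python
import numpy as np

# Unstructured data: free-form log lines (hypothetical examples)
log_lines = [
    "2024-01-01 INFO start",
    "2024-01-01 ERROR disk full",
    "2024-01-02 INFO done",
]

# Parse each line into (date, level) -- now every row is one observation
# and every column is a characteristic of that observation
parsed = np.array([line.split()[:2] for line in log_lines])
print(parsed)

# Once structured, the data supports tabular operations like counting by category
levels, counts = np.unique(parsed[:, 1], return_counts=True)
print(dict(zip(levels.tolist(), counts.tolist())))  # {'ERROR': 1, 'INFO': 2}
```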
Quantitative versus qualitative data

 Quantitative data: This data can be described using numbers, and basic mathematical
procedures, including addition, are possible on the set.
 Qualitative data: This data cannot be described using numbers and basic mathematics. This data
is generally thought of as being described using "natural" categories and language.

Quantitative data can be broken down, one step further, into discrete and continuous quantities. These
can be defined as follows:

 Discrete data: This describes data that is counted. It can only take on certain values. Examples of
discrete quantitative data include a dice roll, because it can only take on one of six values, and the
number of customers in a café, because you can't have a fractional number of people.
 Continuous data: This describes data that is measured. It exists on an infinite range of values. A
good example of continuous data would be a person's weight because it can be 150 pounds or
197.66 pounds (note the decimals).
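A minimal sketch of the distinction using NumPy (introduced in detail later in this module): counted data fits naturally in integer arrays, while measured data needs floating-point values. The arrays below are illustrative:

```python
import numpy as np

# Discrete: counted, whole numbers only
customers_per_day = np.array([12, 7, 19])

# Continuous: measured, can take any value in a range (note the decimals)
weights_lb = np.array([150.0, 197.66, 163.2])

print(customers_per_day.dtype)  # an integer type (typically int64)
print(weights_lb.dtype)         # float64
```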

The four levels of data

It is generally understood that a specific characteristic (feature/column) of structured data can be
broken down into one of four levels of data. The levels are:

 The nominal level


 The ordinal level
 The interval level
 The ratio level

1. Nominal Level:

 At the nominal level, data are categorized into distinct groups or classes with no inherent
order or ranking.

 Nominal data are qualitative and involve the classification of items into discrete categories.

 Examples include gender, ethnicity, colors, or types of fruits.

 Arithmetic operations like addition or subtraction are not meaningful at this level.

2. Ordinal Level:

 The ordinal level introduces the concept of order or rank among the categories, but the
intervals between them are not consistent or meaningful.

 Data at the ordinal level represent relative positioning, but the differences between the
categories are not quantifiable.

 Examples include education levels, customer satisfaction ratings, or socio-economic classes.


 While you can determine the order, mathematical operations like multiplication or division
are not meaningful.

3. Interval Level:

 The interval level introduces a consistent and meaningful scale with equal intervals between
consecutive points.

 Zero in interval data does not represent the absence of the attribute but is rather an
arbitrary point.

 Temperature measured in Celsius or Fahrenheit is a classic example of interval data.

 Arithmetic operations like addition and subtraction are meaningful, but multiplication or
division may not be appropriate.

4. Ratio Level:

 The ratio level is the highest and most informative level of measurement.

 It possesses all the properties of interval data, but with a true zero point that indicates the
absence of the attribute.

 Examples of ratio data include height, weight, income, and age.

 All basic arithmetic operations, including multiplication and division, are meaningful at the
ratio level.

The Five Steps of Data Science


The five essential steps to perform data science are as follows:

1. Asking an interesting question


2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results.

1. Asking an Interesting Question:

 Purpose:

 Data science starts with formulating a clear and relevant question or problem to solve.
This step is crucial in guiding the entire data science process.

 Key Actions:

 Identify the problem or goal.


 Formulate specific, answerable questions.

 Define success criteria.

2. Obtaining the Data:

 Purpose:

 Acquiring relevant and sufficient data to address the questions posed in the first step.
This involves collecting, accessing, or retrieving the necessary datasets.

 Key Actions:

 Identify data sources.

 Collect or access the data.

 Ensure data quality and reliability.

3. Exploring the Data:

 Purpose:

 Understand the characteristics and patterns within the data. Exploratory Data Analysis
(EDA) is conducted to gain insights and inform subsequent steps.

 Key Actions:

 Descriptive statistics.

 Data visualization.

 Identify outliers and missing values.

4. Modeling the Data:

 Purpose:

 Building a predictive model or statistical analysis to answer the questions posed initially.
This step involves selecting appropriate algorithms and techniques.

 Key Actions:

 Feature engineering.

 Selecting and training models.

 Evaluating model performance.

5. Communicating and Visualizing the Results:


 Purpose:

 Effectively communicate findings and insights to both technical and non-technical
stakeholders. Visualization plays a key role in conveying complex information.

 Key Actions:

 Prepare a clear and concise report.

 Create visualizations to support key findings.

 Communicate results to stakeholders.
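The five steps can be sketched end to end on a toy problem. The data here is simulated rather than collected, and a simple least-squares line fit (np.polyfit) stands in for whatever model the question actually calls for:

```python
import numpy as np

# 1. Ask an interesting question: does study time predict exam score?
rng = np.random.default_rng(0)

# 2. Obtain the data (simulated here for illustration)
hours = rng.uniform(0, 10, 50)
scores = 40 + 5 * hours + rng.normal(0, 5, 50)

# 3. Explore the data: basic descriptive statistics
print("mean hours:", hours.mean().round(2), "| mean score:", scores.mean().round(2))

# 4. Model the data: fit a straight line by least squares
slope, intercept = np.polyfit(hours, scores, 1)

# 5. Communicate the results
print(f"Each extra hour of study is associated with ~{slope:.1f} more points")
```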

Basic questions for data exploration

When looking at a new dataset, whether it is familiar to you or not, it is important to use the following
questions as guidelines for your preliminary analysis:

 Is the data organized or not?

 What does each row represent?

 Are there any missing data points?

 Do we need to perform any transformations on the columns?
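These guideline questions can be checked programmatically. A minimal sketch on a made-up array with one missing value:

```python
import numpy as np

# A toy dataset: each row is one observation, each column a characteristic
data = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])

# What does each row represent? How is the data organized?
print("observations (rows):", data.shape[0], "| features (columns):", data.shape[1])

# Are there any missing data points?
print("missing values:", np.isnan(data).sum())  # 1

# Do we need transformations? e.g. fill missing values with the column mean
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)
print(filled)
```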

NumPy Basics: Arrays and Vectorized Computation


NumPy, short for Numerical Python, is one of the most important foundational packages for numerical
computing in Python. Many computational packages providing scientific functionality use NumPy's array
objects as one of the standard interfaces for data exchange. Here are some of the things you'll find in
NumPy:

 ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations
and flexible broadcasting capabilities.
 Mathematical functions for fast operations on entire arrays of data without having to write
loops.
 Tools for reading/writing array data to disk and working with memory-mapped files.
 Linear algebra, random number generation, and Fourier transform capabilities

One of the reasons NumPy is so important for numerical computations in Python is because it is
designed for efficiency on large arrays of data. There are a number of reasons for this:

 NumPy internally stores data in a contiguous block of memory, independent of other built-in
Python objects. NumPy arrays also use much less memory than built-in Python sequences.
 NumPy operations perform complex computations on entire arrays without the need for Python
for loops, which can be slow for large sequences. NumPy-based algorithms are generally 10 to 100
times faster (or more) than their pure Python counterparts and use significantly less memory.
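The speed claim is easy to verify with a rough timing sketch (exact numbers vary by machine):

```python
import time
import numpy as np

# Sum of squares: a pure Python loop vs a vectorized NumPy expression
n = 1_000_000
values = list(range(n))
arr = np.arange(n)

start = time.perf_counter()
loop_result = sum(v * v for v in values)   # pure Python, one element at a time
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = int((arr * arr).sum())        # vectorized: no explicit Python loop
vec_time = time.perf_counter() - start

print(loop_result == vec_result)           # True: same answer
print(f"loop: {loop_time:.4f}s, numpy: {vec_time:.4f}s")
```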

The NumPy ndarray: A Multidimensional Array Object


NumPy is a powerful numerical computing library in Python, and one of its key features is the N-
dimensional array object, known as ndarray (short for N-dimensional array). The ndarray is a flexible
and efficient container for large datasets in Python, and it provides a wide range of functions for
performing mathematical operations on these arrays.

Creating ndarrays:
Creating NumPy ndarray involves several methods, and you can create arrays with different shapes,
sizes, and initial values. Here are some common ways to create NumPy arrays:
1. Using numpy.array():
 The simplest way to create an array is by converting a Python list or tuple using the
numpy.array() function.
import numpy as np
arr = np.array([1, 2, 3])
2. Using numpy.zeros() and numpy.ones():
 Create arrays filled with zeros or ones.
zeros_arr = np.zeros((3, 4)) # 3x4 array of zeros
ones_arr = np.ones((2, 2)) # 2x2 array of ones
3. Using numpy.arange():
 Create an array with a range of values.
range_arr = np.arange(0, 10, 2) # Array from 0 to 8 with step 2
4. Using numpy.linspace():
 Create an array with evenly spaced values over a specified range.
linspace_arr = np.linspace(0, 1, 5) # 5 evenly spaced values between 0 and 1
5. Using numpy.eye() and numpy.identity():
 Create identity matrices.
identity_matrix = np.eye(3) # 3x3 identity matrix
identity_matrix_alt = np.identity(2) # 2x2 identity matrix
6. Using numpy.random module:
 Create arrays with random values.
random_arr = np.random.rand(2, 3) # 2x3 array with random values between 0 and 1
7. Using numpy.full():
 Create an array with a specified fill value.
filled_arr = np.full((2, 2), 7) # 2x2 array filled with the value 7
8. Using numpy.reshape():
 Reshape an existing array.
original_arr = np.arange(1, 10)
reshaped_arr = original_arr.reshape((3, 3))
9. Using numpy.concatenate():
 Concatenate multiple arrays along an existing axis.

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
concatenated_arr = np.concatenate((arr1, arr2))
10. Using numpy.stack():
 Stack arrays along a new axis.

stacked_arr = np.stack((arr1, arr2))

import numpy as np
# Creating ndarray from a list
my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)
print(my_array)
# Generating an array with NumPy
generated_array = np.arange(0, 10, 2) # Creates an array from 0 up to 10 (exclusive) with step 2
print(generated_array)

Data Types for ndarrays:

In NumPy, every element of an ndarray is of the same data type. NumPy provides a variety of data
types, and you can specify the data type when creating an array or convert the data type later. Here are
some common data types in NumPy:
1. Integer Types:
 np.int8, np.int16, np.int32, np.int64
 np.uint8, np.uint16, np.uint32, np.uint64
2. Floating-Point Types:
 np.float16, np.float32, np.float64
 np.complex64, np.complex128
3. Other Types:
 np.bool_: Boolean type
 np.object_: Object type
 np.string_, np.unicode_: String types
You can specify the data type when creating an array using the dtype parameter. For example:

import numpy as np
# Create an array with specified data type
arr_int32 = np.array([1, 2, 3], dtype=np.int32)
arr_float64 = np.array([1.1, 2.2, 3.3], dtype=np.float64)

If you don't explicitly specify a data type, NumPy will infer the data type based on the input values. You
can check the data type of an array using the dtype attribute:

arr = np.array([1, 2, 3])
print(arr.dtype) # Output: int64 (on most 64-bit platforms)

You can also convert the data type of an existing array using the astype() method:

arr = np.array([1, 2, 3])
arr_float = arr.astype(np.float64)

NumPy also supports more advanced data types, such as structured data types and datetime types,
which are useful for working with structured data and time-related information.
It's important to choose the appropriate data type based on the nature of your data to ensure efficient
memory usage and appropriate precision for your computations. NumPy provides a high level of
flexibility in this regard.

# Specifying data type while creating an array


int_array = np.array([1, 2, 3], dtype=int)
float_array = np.array([1.1, 2.2, 3.3], dtype=float)
print(int_array)
print(float_array)
Arithmetic with NumPy Arrays:

NumPy arrays support element-wise arithmetic operations, broadcasting, and various mathematical
operations, making it convenient for numerical computations. Here are some common arithmetic
operations you can perform with NumPy arrays:
1. Element-wise Operations:
 NumPy allows you to perform element-wise operations on arrays. The operation is applied
to each element in the array individually.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Element-wise addition
result_addition = arr1 + arr2
# [5, 7, 9]
# Element-wise multiplication
result_multiplication = arr1 * arr2
# [4, 10, 18]
2. Scalar Operations:
 You can perform arithmetic operations with a scalar value, and the operation will be applied
to each element in the array.
arr = np.array([1, 2, 3])
# Scalar addition
result_scalar_addition = arr + 5
# [6, 7, 8]
# Scalar multiplication
result_scalar_multiplication = arr * 2
# [2, 4, 6]

# Performing arithmetic operations on arrays


array_a = np.array([1, 2, 3])
array_b = np.array([4, 5, 6])
add_result = array_a + array_b
subtract_result = array_a - array_b
multiply_result = array_a * array_b
divide_result = array_a / array_b
print(add_result)
print(subtract_result)
print(multiply_result)
print(divide_result)

Basic Indexing and Slicing:

 Elements in an array can be accessed using indexing and sliced using the colon (:) notation.
 Indexing starts at 0, and negative indices count from the end.

# Basic indexing and slicing


my_array = np.array([0, 1, 2, 3, 4, 5])
# Accessing a specific element
print(my_array[2])
# Slicing to get a subset of the array
subset = my_array[1:4]
print(subset)
Boolean Indexing:
Boolean indexing in NumPy is a powerful technique that allows you to use boolean arrays to index and
select elements from an array based on specified conditions. Here's a more detailed explanation of
Boolean indexing:
Boolean Indexing Basics:
1. Creating a Boolean Array:
 You can create a boolean array by applying a condition to an existing array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
condition = arr > 2
print(condition)
# Output: [False False True True True]

2. Using Boolean Array for Indexing:


 You can use the boolean array to index the original array, selecting only the elements that
satisfy the condition.
selected_elements = arr[condition]
print(selected_elements)
# Output: [3 4 5]

Direct Boolean Indexing:


You can directly use a boolean condition inside the square brackets when indexing the array. This is a
concise way to perform boolean indexing.
arr = np.array([1, 2, 3, 4, 5])
# Direct boolean indexing
selected_elements = arr[arr > 2]
print(selected_elements)
# Output: [3 4 5]

Combining Conditions:
You can use logical operators (AND, OR, NOT) to combine multiple conditions.
arr = np.array([1, 2, 3, 4, 5])
# Combined conditions
selected_elements = arr[(arr > 2) & (arr < 5)]
print(selected_elements)
# Output: [3 4]

Modifying Values with Boolean Indexing:


You can also use boolean indexing to modify values in an array based on a condition.
arr = np.array([1, 2, 3, 4, 5])
# Modify values based on condition
arr[arr > 2] = 10
print(arr)
# Output: [ 1 2 10 10 10]
Boolean indexing is a flexible and concise way to filter, select, and modify elements in NumPy arrays
based on specified conditions. It is commonly used in data manipulation and cleaning operations.
# Boolean indexing
data = np.array([1, 2, 3, 4, 5])
# Creating a boolean mask
mask = data > 2
print(mask)
# Applying the mask to filter the array
filtered_data = data[mask]
print(filtered_data)

Fancy Indexing:

Fancy indexing in NumPy refers to using arrays of indices, rather than single indices or slices, to access or
modify elements of an array. This technique provides a convenient and powerful way to select or modify
specific elements based on more complex criteria. Fancy indexing can be applied to one-dimensional
and multidimensional arrays.
1D Fancy Indexing:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Using an array of indices for selection
indices = np.array([0, 3, 4])
selected_elements = arr[indices]
print(selected_elements)
# Output: [1 4 5]

# Modifying values using fancy indexing


arr[indices] = 10
print(arr) # Output: [10 2 3 10 10]

2D Fancy Indexing:
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Using arrays of indices for row and column selection
row_indices = np.array([0, 1, 2])
col_indices = np.array([1, 2, 0])
selected_elements = arr_2d[row_indices, col_indices]
print(selected_elements)
# Output: [2 6 7]

# Modifying values using fancy indexing


arr_2d[row_indices, col_indices] = 10
print(arr_2d)
# Output:
# [[ 1 10 3]
# [ 4 5 10]
# [10 8 9]]

Using Boolean Arrays for Fancy Indexing:


You can also use boolean arrays for fancy indexing.
arr = np.array([1, 2, 3, 4, 5])
# Using a boolean array for selection
condition = arr > 2
selected_elements = arr[condition]
print(selected_elements)
# Output: [3 4 5]

# Modifying values using fancy indexing with a boolean array


arr[condition] = 10
print(arr)
# Output: [ 1 2 10 10 10]

Fancy indexing provides a more advanced and flexible way to index and modify arrays compared to basic
indexing and slicing. It is particularly useful when you need to select or modify elements based on
specific patterns, conditions, or sets of indices. However, keep in mind that fancy indexing often creates
a copy of the data rather than a view, so modifications may not affect the original array.
# Fancy indexing
arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i
print(arr)
print(arr[[4, 3, 0, 6]])
print(arr[[-3, -5, -7]])

Transposing Arrays and Swapping Axes:

In NumPy, transposing an array involves swapping its rows and columns. The transpose of a 2D array is
obtained by switching its rows with columns. NumPy provides the .T attribute and the
numpy.transpose() function for this purpose.

Using .T attribute:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = arr.T
print(transposed_arr)
# Output:
# [[1 4]
# [2 5]
# [3 6]]
Using numpy.transpose() function:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = np.transpose(arr)
print(transposed_arr)
# Output:
# [[1 4]
# [2 5]
# [3 6]]

Transposing Higher-Dimensional Arrays:


Transposing higher-dimensional arrays involves rearranging their axes. The numpy.transpose() function
allows you to specify the order of axes through the axes parameter.
import numpy as np
arr_3d = np.random.random((2, 3, 4))
# Example 3D array
transposed_arr = np.transpose(arr_3d, axes=(1, 2, 0))
# The order of axes is changed from (0, 1, 2) to (1, 2, 0)
print(arr_3d.shape)
# Output: (2, 3, 4)
print(transposed_arr.shape)
# Output: (3, 4, 2)

Swapping Axes:
For 2D arrays, the .T attribute is equivalent to swapping the axes using numpy.swapaxes().
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
swapped_arr = arr.swapaxes(0, 1)
print(swapped_arr)
# Output:
# [[1 4]
#  [2 5]
#  [3 6]]

Transposing is a common operation in linear algebra and various scientific computations. It is
particularly useful when dealing with matrix operations or rearranging data for different purposes.

# Transposing and swapping axes


original_array = np.array([[1, 2, 3], [4, 5, 6]])
# Transposing the array (flipping rows and columns)
transposed_array = np.transpose(original_array)
print(transposed_array)
# Swapping axes
swapped_axes_array = original_array.swapaxes(0, 1)
print(swapped_axes_array)
In a nutshell, the NumPy ndarray is your data companion, simplifying the creation, manipulation, and
transformation of numerical arrays. It's like having a set of magical tools at your disposal, making data
science feel like a captivating adventure. So, whether you're crafting arrays or reshaping them, the
NumPy ndarray is your trusty guide in the world of data magic.

Universal functions: Fast element-wise Array functions


Universal Functions (ufuncs) in NumPy are functions that operate element-wise on ndarrays, allowing
for fast and efficient numerical computations. Ufuncs are a key feature of NumPy and provide a way to
perform vectorized operations without the need for explicit looping over elements.

1. Square Root (sqrt): Imagine you have an array filled with numbers, and you want to find the square
root of each element. Enter the np.sqrt() ufunc. It magically computes the square root of each element
in your array, all in one go. Here's a quick example:

import numpy as np

my_array = np.array([4, 9, 16])

result = np.sqrt(my_array)

print(result)

The np.sqrt() ufunc takes care of business, giving you the square roots of 4, 9, and 16.

2. Exponential (exp): Now, what if you want to calculate the exponential of each element in your array?
No worries, the np.exp() ufunc has got you covered. Check it out:

import numpy as np

my_array = np.array([1, 2, 3])

result = np.exp(my_array)

print(result)

With a snap of its virtual fingers, np.exp() computes the exponential of 1, 2, and 3.

3. Addition (add): When you have two arrays and you want to add their corresponding elements
together, the np.add() ufunc is your go-to friend. It's as easy as pie:

import numpy as np

array1 = np.array([1, 2, 3])

array2 = np.array([4, 5, 6])

result = np.add(array1, array2)


print(result)

np.add() works its magic, summing up each pair of elements from array1 and array2.

4. Maximum (maximum): If you're on the hunt for the maximum values between two arrays, look no
further than the np.maximum() ufunc. Here's a quick example:

import numpy as np

array1 = np.array([2, 7, 1])

array2 = np.array([5, 3, 9])

result = np.maximum(array1, array2)

print(result)

np.maximum() compares elements from array1 and array2 and delivers the maximum value at each
position.

5. Modulus and Fractional Part (modf): Ever wondered about the fractional and integer parts of
numbers in your array? Enter the np.modf() ufunc:

import numpy as np

my_array = np.array([3.14, -2.71, 1.618])

fractional, integer = np.modf(my_array)

print("Fractional Part:", fractional)

print("Integer Part:", integer)

With a wave of its virtual wand, np.modf() separates the fractional and integer parts of your array
elements.
Array-oriented Programming with arrays:
Array-oriented programming refers to a programming paradigm that emphasizes the use of array
operations for efficient and concise code. It is particularly relevant in the context of scientific computing,
data analysis, and numerical programming. The idea is to express operations in a way that can be
applied to entire arrays, avoiding the need for explicit loops over individual elements. Array-oriented
programming is closely associated with languages and libraries that support array operations efficiently,
and NumPy is a prime example in the Python ecosystem.

1. Vectorized Operations: Vectorization is a core principle of array-oriented programming. Instead of


using explicit loops to iterate over each element of an array, operations are applied to entire arrays at
once. This takes advantage of optimized, low-level implementations of these operations, often provided
by specialized libraries like NumPy.

For example, suppose you have two arrays, array1 and array2, and you want to add their corresponding
elements together. Instead of a loop, you can use the '+' operator directly:

import numpy as np

array1 = np.array([1, 2, 3])

array2 = np.array([4, 5, 6])

result = array1 + array2

print(result)

NumPy takes care of the element-wise addition without the need for explicit loops. It's like telling
NumPy to handle the math for you.

2. Broadcasting: Broadcasting is a feature that allows arrays of different shapes to be combined in a way
that makes sense. This enables operations between arrays with different dimensions without the need
for explicit reshaping.

For instance, consider multiplying an array by a scalar value:

import numpy as np
my_array = np.array([1, 2, 3])

result = my_array * 2

print(result)

Here, NumPy automatically broadcasts the scalar value 2 to match the shape of my_array, making the
multiplication happen effortlessly.

3. Universal Functions (ufuncs): We've already touched on this briefly. Universal functions in NumPy, or
ufuncs, are like special operators that perform element-wise operations on arrays. They are the
workhorses behind array-oriented programming, ensuring efficient and optimized computations.

Whether you're calculating square roots, exponentials, or any other element-wise operation, ufuncs
make it easy to apply the same operation to every element in an array without breaking a sweat.

In a nutshell, array-oriented programming with the ndarray in NumPy is about embracing a style that
simplifies your code, makes it more readable, and takes advantage of NumPy's efficient array
operations. It's like telling NumPy, "Handle the arrays as a whole, and I'll focus on the bigger picture."
This approach can significantly enhance your data science coding experience.

File input and output with arrays:


Essentially, this is about how you can read data from files into NumPy arrays and vice versa, making your
life easier when working with datasets.

Reading Data from a File:

Imagine you have a dataset stored in a file, and you want to bring it into your Python environment as a
NumPy array. NumPy provides a straightforward way to achieve this using functions like np.loadtxt().

Here's a simple example. Let's say you have a file named "data.txt" containing numeric values arranged
in columns:

import numpy as np

# Assuming data.txt has columns of numbers

my_data = np.loadtxt('data.txt')

print(my_data)

In this snippet, np.loadtxt() reads the data from the file and converts it into a NumPy array. You can now
work with my_data as a regular NumPy array in your Python code.

Writing Data to a File:


Conversely, if you've performed some data analysis using NumPy and want to save your results, you can
write your NumPy array back to a file using functions like np.savetxt().

Let's say you have a NumPy array called result:

import numpy as np

# Assuming result is a NumPy array

result = np.array([[1, 2, 3], [4, 5, 6]])

# Save result to a file named output.txt

np.savetxt('output.txt', result)

With this, your NumPy array result is saved to a file named "output.txt," and you can share or use it later
as needed.

Dealing with Different File Formats:

NumPy's file I/O functions can handle various file formats, from simple text files to more complex ones.
For instance, you can read and write data in CSV format, which is a common way of storing structured
data:

import numpy as np

# Reading data from a CSV file

my_csv_data = np.genfromtxt('data.csv', delimiter=',')

# Writing data to a CSV file

np.savetxt('output.csv', result, delimiter=',')

Here, np.genfromtxt() is used to read data from a CSV file, and np.savetxt() is employed to write data
back to a CSV file.

In essence, file input and output with NumPy's ndarray make it seamless to transfer data between your
Python environment and external files. Whether you're loading datasets for analysis or saving your
results for future use, NumPy simplifies the process, allowing you to focus on the data science tasks at
hand.

Linear algebra:
In the fundamentals of data science, we often encounter scenarios where we need to perform
mathematical operations on arrays, and this is where the linear algebra capabilities of NumPy's ndarray
come into play. Let's dive into the basics of how NumPy makes linear algebra operations smooth and
accessible.
1. Element-wise Operations:

First off, NumPy allows us to perform basic arithmetic operations element-wise on arrays. This means if
you have two arrays, you can add, subtract, multiply, or divide their corresponding elements effortlessly.

import numpy as np

# Two arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Element-wise addition
result_addition = array1 + array2

# Element-wise multiplication
result_multiplication = array1 * array2

print("Addition Result:", result_addition)
print("Multiplication Result:", result_multiplication)

NumPy takes care of handling each element independently, making these operations concise and
readable.
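Element-wise operations also work between arrays of different shapes through broadcasting: NumPy stretches the smaller array across the larger one when their shapes are compatible. A small sketch:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Broadcasting adds the 1-D row to each row of the matrix, no loop needed
shifted = matrix + row
print(shifted)
```

Each row of the matrix gets the same offsets added, which is a common pattern when centering or scaling columns of a dataset.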

2. Dot Product:

Moving a step further, the dot product is a fundamental operation in linear algebra. NumPy provides a
simple way to calculate the dot product of two arrays using the np.dot() function.

import numpy as np

# Two arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Dot product
dot_product = np.dot(array1, array2)
print("Dot Product:", dot_product)

The dot product is the sum of the products of corresponding elements, and NumPy handles this
calculation seamlessly.
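To make the definition concrete: for the arrays above, the dot product expands to 1*4 + 2*5 + 3*6 = 32. The sketch below computes it both ways and shows they agree:

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Manual definition: sum of element-wise products
manual = np.sum(array1 * array2)

# NumPy's built-in dot product
via_dot = np.dot(array1, array2)

print(manual, via_dot)
```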
3. Matrix Multiplication:

Linear algebra often involves working with matrices, and NumPy excels in matrix multiplication. The @
operator or the np.matmul() function can be used for matrix multiplication.

import numpy as np

# Two matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication
result_matrix_multiply = matrix1 @ matrix2
print("Matrix Multiplication Result:")
print(result_matrix_multiply)

NumPy's matrix multiplication functions make it easy to handle complex linear algebra operations
without diving into intricate manual calculations.
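A closely related operation, not shown above but built on the same machinery, is solving a linear system A x = b with np.linalg.solve(). As a hedged sketch, consider the system 2x + y = 5 and x + 3y = 10:

```python
import numpy as np

# Coefficient matrix and right-hand side for 2x + y = 5, x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# Solve A x = b directly, without computing an inverse
x = np.linalg.solve(A, b)
print(x)

# Check: multiplying A by the solution should reproduce b
print(A @ x)
```

This is generally preferred over computing np.linalg.inv(A) and multiplying, because it is faster and numerically more stable.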

4. Eigenvalues and Eigenvectors:

For more advanced linear algebra tasks, such as finding eigenvalues and eigenvectors, NumPy has got
you covered. The np.linalg.eig() function can be employed to compute these essential properties of
matrices.

import numpy as np

# A matrix
matrix = np.array([[1, 2], [3, 4]])

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)

NumPy simplifies these complex linear algebra concepts, making them accessible even for those not
deeply immersed in mathematical intricacies.
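It helps to verify the defining property A v = λ v for each eigenpair. Note that np.linalg.eig() stores the eigenvectors as the columns of the returned matrix, which the sketch below relies on:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)

# The defining property: A v equals lambda v for each eigenpair
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]            # eigenvectors are stored as columns
    lhs = matrix @ v
    rhs = eigenvalues[i] * v
    print(np.allclose(lhs, rhs))
```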
In a nutshell, NumPy's ndarray brings the power of linear algebra to data science by providing
convenient and efficient ways to perform essential operations. Whether you're dealing with simple
element-wise calculations or diving into more advanced matrix manipulations, NumPy's functionalities
streamline the process, allowing you to focus on the insights hidden in your data.

Pseudorandom number generation:
In the vast landscape of data science basics, there's a fascinating aspect that involves creating
pseudorandom numbers using NumPy's ndarray. This isn't about true randomness – it's a way to
generate numbers that seem random but follow a predictable pattern. Let's unravel how NumPy makes
this pseudorandom dance possible.

1. Seed for Reproducibility:

Random number generators in computer programs are typically pseudo-random. They generate
sequences of numbers that appear random, but they are actually determined by an initial value called a
seed. If you use the same seed, you will get the same sequence of random numbers. This property is
useful for reproducibility in scientific research, simulations, or any situation where you want to ensure
that the randomness is repeatable.

import numpy as np

# Setting the seed
np.random.seed(42)

# Generating pseudorandom numbers
random_numbers = np.random.rand(3)
print("Pseudorandom Numbers:", random_numbers)

By setting the seed to 42, you're starting the pseudorandom sequence at a specific point, making the
results reproducible.
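As a side note, newer NumPy versions (1.17+) recommend the Generator API via np.random.default_rng(), which gives each consumer its own seeded stream instead of mutating global state. A brief sketch of the same reproducibility idea:

```python
import numpy as np

# Two independent generators seeded identically (NumPy >= 1.17)
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

# Same seed, same sequence
a = rng1.random(3)
b = rng2.random(3)
print(a)
print(b)
```

Because each generator carries its own state, seeding one part of a program cannot accidentally disturb the random streams used elsewhere.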

2. Uniform Distribution:

NumPy provides functions to generate pseudorandom numbers following various distributions. For
example, np.random.rand() produces numbers from a uniform distribution between 0 and 1.

import numpy as np

# Generating pseudorandom numbers from a uniform distribution
uniform_numbers = np.random.rand(5)
print("Uniform Distribution Numbers:", uniform_numbers)

These numbers appear evenly spread out between 0 and 1, giving you a sense of randomness.
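If you need a uniform distribution over a range other than 0 to 1, np.random.uniform() takes explicit low and high bounds (low inclusive, high exclusive). A quick sketch:

```python
import numpy as np

np.random.seed(0)

# Uniform samples over [-5, 5) rather than the default [0, 1)
samples = np.random.uniform(low=-5, high=5, size=1000)
print(samples.min(), samples.max())
```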

3. Normal Distribution:

If you're into a normal distribution (bell curve), NumPy has got your back with np.random.randn().

import numpy as np

# Generating pseudorandom numbers from a normal distribution
normal_numbers = np.random.randn(5)
print("Normal Distribution Numbers:", normal_numbers)

These numbers tend to cluster around 0, resembling the typical bell-shaped curve of a normal
distribution.
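np.random.randn() draws from the standard normal (mean 0, standard deviation 1). To get any other normal distribution, shift and scale the draws: mu + sigma * randn. A sketch targeting mean 100 and standard deviation 15:

```python
import numpy as np

np.random.seed(0)

# Shift and scale standard normals: mean 100, standard deviation 15
samples = 100 + 15 * np.random.randn(10000)
print(samples.mean(), samples.std())
```

With 10,000 samples, the sample mean and standard deviation land close to the target values.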

4. Integer Values:

Sometimes, you might need pseudorandom integers. NumPy's got a function for that too –
np.random.randint().

import numpy as np

# Generating pseudorandom integers
random_integers = np.random.randint(1, 10, size=5)
print("Pseudorandom Integers:", random_integers)


In this example, you get integers from 1 to 9; the upper bound of np.random.randint() is exclusive.
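The exclusive upper bound is easy to forget. For instance, simulating a six-sided die requires randint(1, 7), not randint(1, 6):

```python
import numpy as np

np.random.seed(0)

# The upper bound is exclusive: randint(1, 7) yields values 1 through 6
rolls = np.random.randint(1, 7, size=1000)
print(rolls.min(), rolls.max())
```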

5. Shuffling Arrays:

Another cool use case is shuffling arrays. Let's say you have an array, and you want to randomize its
order.

import numpy as np

# An array
my_array = np.array([1, 2, 3, 4, 5])

# Shuffling the array (in place)
np.random.shuffle(my_array)
print("Shuffled Array:", my_array)

np.random.shuffle() rearranges the elements, providing a shuffled version of your array.
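Note that np.random.shuffle() modifies the array in place. When you want to keep the original intact, the related function np.random.permutation() (not covered above) returns a shuffled copy instead:

```python
import numpy as np

np.random.seed(42)

my_array = np.array([1, 2, 3, 4, 5])

# permutation returns a shuffled copy and leaves the original untouched
shuffled = np.random.permutation(my_array)
print("Original:", my_array)
print("Shuffled copy:", shuffled)
```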

In essence, pseudorandom number generation using NumPy's ndarray is like having a controlled chaos.
You can create the appearance of randomness, set the starting point with a seed, and explore a variety
of distributions for your data science experiments. It's a powerful tool for generating diverse sets of
numbers with a touch of predictability.
