Full Stack Data Science Roadmap
📝 1. Introduction
🎯 Goal
This notebook gives you an overview of the fundamental skills required for end-to-end data
science projects and helps you explore them. Resources on the Internet are abundant, which
makes it hard to know which ones to use, so I have also included my recommended resources
for you to start learning those skills.
✍️ Final notes
Watch my YouTube video on Full Stack Data Science for a walk-through of the skills.
If you want to become a collaborator on this roadmap, please reach out to me via email
([email protected]).
If you are looking for a friendly data science community and like-minded buddies to study
with, you can join my Discord server to enjoy the companionship of almost 3,000 members.
Making great stuff takes time and $$. Some links included in this notebook are affiliate links.
By using those links, you help me continue sharing (for free) data science-related
content like this, at no cost to you.
Source: https://www.kaggle.com/code/lynnxy/a-deep-dive-into-the-kaggle-survey-from-2017-2021#1.-Introduction
🤖 Python
What is it?
Python is a widely used, general-purpose, high-level programming language. It was created by
Guido van Rossum, first released in 1991, and is now developed under the stewardship of the
Python Software Foundation. Its design emphasizes code readability, and its syntax lets
programmers express concepts in fewer lines of code.
# Select the rows for 2021 (df_3 is assumed to be a pandas DataFrame defined earlier)
data_filtered = df_3[df_3["work_year"] == 2021]
data_filtered
[Figure: density plot of salary_in_usd by experience_level (EN, EX, SE, MI); x-axis: salary_in_usd, y-axis: density]
🤖 R
What is it?
R is a programming language for statistical computing and graphics. It is an implementation of
the S language.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland in 1991. Its
name comes from the first letter of its authors' first names and is also a play on the name of S.
R is used by data miners, bioinformaticians, and statisticians for data analysis and for
developing statistical software.
Learn R basics
IDEs (Integrated Development Environments)
RStudio
Important libraries:
data.table
ggplot2
stats
Topics:
Data types (character, numeric, integer, logical, complex)
Vectors
Matrices
Data frames
Control flow (if-else, while loops)
apply function family
Descriptive statistics in R
Creating R project in RStudio
Installing & Importing libraries
Learn R advanced
Topics:
Error handling
Lexical scoping
Creating R packages
📈 Linear Algebra
Topics:
Basic properties of matrix and vectors:
scalar multiplication,
linear transformation,
transpose,
conjugate,
rank,
determinant
Inner and outer products
Matrix multiplication rule
Matrix inverse
Special matrices (e.g. square matrix, identity matrix, triangular matrix, sparse and
dense matrices, unit vectors, symmetric matrix)
Matrix factorization concept/LU decomposition
Gaussian/Gauss-Jordan elimination
Solving the linear system of equations Ax = b
Vector space, basis, span, orthogonality, orthonormality, linear least square
Eigenvalues, eigenvectors, diagonalization, singular value decomposition
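To make the "Gaussian elimination" and "solving Ax = b" topics above concrete, here is a minimal, library-free sketch. The function name solve_linear_system and the example system are illustrative assumptions, not part of the original roadmap; in practice you would normally call a tested routine such as numpy.linalg.solve.

```python
def solve_linear_system(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [A | b] from copies so the inputs are not modified.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]

        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


# Example system (illustrative): 2x + y = 3 and x + 3y = 5.
solution = solve_linear_system([[2, 1], [1, 3]], [3, 5])
```

Partial pivoting is included because dividing by a tiny pivot amplifies rounding error; it costs one extra scan per column.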
Why learn Linear Algebra?
You will encounter linear algebra in many machine learning algorithms. For example, principal
component analysis can use singular value decomposition to represent your data in fewer dimensions.
Neural networks also rely on linear algebra to represent network structures and to compute
the network parameters.
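As a rough illustration of the PCA example just mentioned, the sketch below reduces toy 2-D data to one dimension using NumPy's singular value decomposition. The function name pca_svd and the synthetic data are assumptions made for this example; this is one common way to compute PCA, not the only one.

```python
import numpy as np


def pca_svd(data, n_components):
    """Project data (rows = samples) onto its top principal components."""
    # Center each feature so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Share of the total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, components, explained[:n_components]


# Toy data (synthetic): 2-D points lying close to the line y = 2x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
points = np.column_stack([x, 2 * x + 0.05 * rng.normal(size=200)])
projected, components, explained = pca_svd(points, n_components=1)
```

Because the points lie almost on a line, a single component should capture nearly all of the variance, which is the sense in which PCA represents the data "in fewer dimensions".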
Resources & Courses:
1. Mathematics for Machine Learning and Data Science Specialization (Coursera +
Deeplearning.ai) (first course)
📉 Calculus
The mathematical study of continuous change.
Topics:
Limits
Derivative of a function
Integrals
Partial derivatives & the chain rule
Maxima and minima
Why learn Calculus?
Ever come across the “gradient descent” method in machine learning? It is a direct application
of calculus: it uses derivatives to step toward a minimum of a function.
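A minimal sketch of gradient descent on a one-variable function shows where the derivative comes in. The function name gradient_descent, the learning rate, and the example function f(x) = (x - 3)^2 are illustrative choices, not taken from the roadmap.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

The result approaches x = 3, where the derivative is zero. A learning rate that is too large makes the iterates overshoot and diverge, which is why it is usually treated as a tuning parameter.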
Resources & Courses:
1. Mathematics for Machine Learning and Data Science Specialization (Coursera +
Deeplearning.ai)
Topics:
Feature selection
Feature scaling / standardization
Data resampling
Undersampling
Oversampling
Handling missing values / data imputation
Detecting outliers
Train-test split, cross-validation
Evaluating an ML model & performance metrics
Variety of algorithms:
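Three of the preprocessing topics listed above (train-test split, feature standardization, and cross-validation) can be sketched with the standard library alone. All function names and the toy feature below are hypothetical and chosen for illustration; in real projects, scikit-learn's train_test_split, StandardScaler, and KFold are the usual tools.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Return the mean and (population) standard deviation of the values."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Rescale values using the given mean and standard deviation."""
    return [(v - mean) / std for v in values]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        yield indices[:start] + indices[start + size:], validation
        start += size


feature = [float(v) for v in range(1, 21)]  # hypothetical toy feature
train, test = split_train_test(feature, test_fraction=0.25)
mean, std = fit_scaler(train)               # statistics come from the training set only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuse the training statistics
folds = list(k_fold_indices(len(train), k=5))
```

The scaler is fitted on the training split only and then reused on the test split; fitting it on all of the data would leak information from the test set into training.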
Machine Learning Resources
🤖 Machine Learning Specialization by Andrew Ng (Coursera)
🤖 Deep Learning Specialization by Andrew Ng (Coursera)
📚 Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
📚 Probabilistic Machine Learning: An Introduction (Kevin P. Murphy)
📚 Deep Learning (Ian Goodfellow, Yoshua Bengio, and Aaron Courville)
5.5. Software Development
🤖 Unit testing
Unit testing is a technique in which developers test an individual module or function in
isolation to check that it behaves as expected.
Learn to use the pytest library in Python.
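As a small illustration of what such tests look like, the sketch below defines a function and three pytest-style tests for it. The file name test_stats.py and the mean function are hypothetical; pytest collects functions whose names start with test_ and reports any failing assert.

```python
# test_stats.py (hypothetical file name; pytest collects files named test_*.py)


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_of_integers():
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_of_single_value():
    assert mean([7]) == 7


def test_mean_rejects_empty_input():
    # Plain try/except keeps this sketch dependency-free;
    # with pytest installed, pytest.raises(ValueError) is the usual idiom.
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
```

With pytest installed, running `pytest test_stats.py` from the same directory should collect and run the three tests.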