Full Stack Data Science Roadmap

Uploaded by

Ifra
👨‍💻 Full Stack Data Science Roadmap 2023

Created by: Thu Vu

📝 1. Introduction
🎯 Goal
This notebook aims to give you an overview and help you explore the fundamental skills
required for end-to-end data science projects. Resources on the Internet are abundant, but it is
also hard to know what you should use. So in this notebook I also included my recommended
resources for you to start learning those skills.

Good luck & keep learning!


Thu Vu xx

🤖 How to use this notebook


You can pick and choose what you want to learn first. Please note that the order in which you
learn the skills does not really matter (!), as long as you have the basic programming skills
and Math/ Stats.
Always combine learning theory with practice! You can practice SQL, R and Python directly
on Datalore notebooks 😉 .
Companies differ in their tooling and data science infrastructure. However, if you
have solid fundamentals, there is no doubt you can easily learn new skills and tools down the
road.
If you want to open the links in a new tab, Ctrl+click on the links in this notebook.

📈 Try Datalore for yourself!


Use my gift code THUVUDL for 1 month of free Datalore Professional 🚀 . Click the “Edit copy”
button in the upper right corner of this notebook to create a free Community account, then
upgrade to Datalore Professional in the Account settings.

Try Datalore Enterprise for your team


If you can’t use cloud tools to work with data, your team can host a private version of Datalore
Enterprise on AWS, GCP, Azure and on-premises, ensuring the data doesn’t leave the company’s
environment.

✍️ Final notes
Visit my Youtube video on Full Stack Data Science to get a walk-through of the skills.
If you want to become a collaborator of this roadmap, please reach out to me via email
([email protected]).
If you are looking for a friendly data science community and like-minded buddies to study
with, you can join my Discord server to enjoy the companionship of almost 3,000 members.
Making great stuff takes time and $$. Some links included in this notebook are affiliate links.
By using those links, you help support me to continue sharing (for free) data science related
content like this, at zero cost to you.

👨‍💻 2. Becoming Full Stack


2.1. Programming
When working with data and building data applications, the main programming languages used to
date are:
Python
SQL
R
JavaScript/ C++/ Java (more useful for building high-scale applications)
The graph below shows the current state of programming languages in the Kaggle Machine
Learning & Data Science Survey results (2018-2021). Python and SQL continue to dominate the
toolkit of data science practitioners.

Source: https://www.kaggle.com/code/lynnxy/a-deep-dive-into-the-kaggle-survey-from-2017-2021#1.-Introduction

🤖 SQL (Structured Query Language)


What is it?
SQL is a language designed to query and manage data stored in relational databases. It is
widely used today across web frameworks and database applications. Used well, it helps keep
data accurate and secure, and helps maintain the integrity of databases, regardless of size.
Roughly 70% of SQL is very straightforward to learn. You can find a few demo PostgreSQL
databases in this notebook, which you can use for practicing SQL!
Example SQL queries

-- Select all data from the ds_salaries table (Datalore demo database)


select * from datalore.public.ds_salaries

     id   work_year  experience_level  employment_type  job_title                   salary    salary_currency  salary_in_usd
0    0    2020       MI                FT               Data Scientist              70000.0   EUR              79833.0
1    1    2020       SE                FT               Machine Learning Scientist  260000.0  USD              260000.0
2    2    2020       SE                FT               Big Data Engineer           85000.0   GBP              109024.0
3    3    2020       MI                FT               Product Data Analyst        20000.0   USD              20000.0
4    4    2020       SE                FT               Machine Learning Engineer   150000.0  USD              150000.0
...  ...  ...        ...               ...              ...                         ...       ...              ...
602  602  2022       SE                FT               Data Engineer               154000.0  USD              154000.0
603  603  2022       SE                FT               Data Engineer               126000.0  USD              126000.0
604  604  2022       SE                FT               Data Analyst                129000.0  USD              129000.0
605  605  2022       SE                FT               Data Analyst                150000.0  USD              150000.0
606  606  2022       MI                FT               AI Scientist                200000.0  USD              200000.0

607 rows × 12 columns

-- Find average salary in dataset


select avg(salary) from datalore.public.ds_salaries

avg
0 324000.062603

Learn SQL basics


Topics:
Relational Database Management System (RDBMS)
Database design - Entity Relationship Diagram (ERD)
Primary key
Foreign key
Data Types
Operators
Expressions
Create Database
Drop Database
Select Database
Create Table
Drop Table
Insert Query
Select Query
Where Clause
AND & OR Clauses
Update Query
Delete Query
Like Clause
Top Clause
Order By
Group By
Distinct Keyword
Sorting Results
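
To make a few of the basic topics concrete, here is a minimal sketch that runs SQL from Python using the standard-library sqlite3 module. The `salaries` table and its four rows are made up for illustration (they are not the Datalore demo data), and SQLite is used only because it needs no server; the same statements should carry over to PostgreSQL with little or no change.

```python
import sqlite3

# In-memory SQLite database; table name and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# Create Table, with a primary key and data types
conn.execute(
    """
    CREATE TABLE salaries (
        id INTEGER PRIMARY KEY,
        job_title TEXT NOT NULL,
        experience_level TEXT NOT NULL,
        salary_in_usd REAL NOT NULL
    )
    """
)

# Insert Query (placeholders keep values separate from the SQL text)
conn.executemany(
    "INSERT INTO salaries (job_title, experience_level, salary_in_usd) VALUES (?, ?, ?)",
    [
        ("Data Scientist", "MI", 80000.0),
        ("Data Scientist", "SE", 120000.0),
        ("Data Engineer", "SE", 150000.0),
        ("Data Analyst", "MI", 60000.0),
    ],
)

# Select Query with Where, AND, Like and Order By
senior_rows = conn.execute(
    """
    SELECT job_title, salary_in_usd
    FROM salaries
    WHERE experience_level = 'SE' AND job_title LIKE 'Data%'
    ORDER BY salary_in_usd DESC
    """
).fetchall()

# Group By with an aggregate function
avg_by_title = conn.execute(
    """
    SELECT job_title, AVG(salary_in_usd) AS avg_salary
    FROM salaries
    GROUP BY job_title
    ORDER BY job_title
    """
).fetchall()

# Distinct Keyword
levels = conn.execute(
    "SELECT DISTINCT experience_level FROM salaries ORDER BY experience_level"
).fetchall()

print(senior_rows)
print(avg_by_title)
print(levels)
```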

Learn SQL Intermediate


Topics:
Constraints
Table joins
NULL values
Alias syntax
Indexes
Alter Command
Truncate Table
Using Views
Having clause
Transactions
Wildcards
Date functions
Temporary tables
Clone tables
Using Sequences
Handling duplicates
SQL injection
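
Several of the intermediate topics (table joins, NULL values, aliases, the HAVING clause, and protection against injection) fit in one short sketch. It again uses Python's built-in sqlite3 module; the `departments` and `employees` tables and all names in them are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department_id INTEGER REFERENCES departments(id)  -- NULL means unassigned
    );
    INSERT INTO departments (id, name) VALUES (1, 'Analytics'), (2, 'Engineering');
    INSERT INTO employees (name, department_id) VALUES
        ('Ana', 1), ('Ben', 1), ('Chen', 2), ('Dee', NULL);
    """
)

# LEFT JOIN keeps employees without a department; COALESCE replaces the NULL.
joined = conn.execute(
    """
    SELECT e.name, COALESCE(d.name, 'Unassigned') AS department
    FROM employees AS e
    LEFT JOIN departments AS d ON d.id = e.department_id
    ORDER BY e.name
    """
).fetchall()

# HAVING filters groups after aggregation (WHERE filters rows before it).
big_departments = conn.execute(
    """
    SELECT d.name, COUNT(*) AS headcount
    FROM employees AS e
    JOIN departments AS d ON d.id = e.department_id
    GROUP BY d.name
    HAVING COUNT(*) >= 2
    """
).fetchall()

# Injection: pass user input as a parameter instead of building the SQL string.
user_input = "Ana' OR '1'='1"
safe_rows = conn.execute(
    "SELECT name FROM employees WHERE name = ?", (user_input,)
).fetchall()

print(joined)
print(big_departments)
print(safe_rows)  # the malicious string matches no employee name
```

Because the input is bound as a value, the quote characters in `user_input` are treated as plain text rather than as SQL.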

Learn SQL Advanced


Topics:
Subqueries
Set operations (UNION, UNION ALL, INTERSECT, EXCEPT/MINUS)
GROUP BY extensions (ROLLUP, CUBE, and GROUPING SETS)
Window functions
PARTITION BY
Recursive Queries
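
The advanced topics are easier to grasp with a small example. The sketch below shows a subquery, a window function with PARTITION BY, and a recursive query, run through Python's sqlite3 module. It assumes the SQLite library bundled with your Python is version 3.25 or newer (needed for window functions); the `salaries` rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE salaries (job_title TEXT, work_year INTEGER, salary_in_usd REAL);
    INSERT INTO salaries VALUES
        ('Data Scientist', 2021, 90000),
        ('Data Scientist', 2022, 110000),
        ('Data Engineer', 2021, 100000),
        ('Data Engineer', 2022, 130000);
    """
)

# Window function: rank salaries within each year without collapsing rows.
ranked = conn.execute(
    """
    SELECT job_title, work_year, salary_in_usd,
           RANK() OVER (PARTITION BY work_year ORDER BY salary_in_usd DESC) AS rank_in_year
    FROM salaries
    ORDER BY work_year, rank_in_year
    """
).fetchall()

# Subquery: rows whose salary is above the overall average.
above_avg = conn.execute(
    """
    SELECT job_title, work_year
    FROM salaries
    WHERE salary_in_usd > (SELECT AVG(salary_in_usd) FROM salaries)
    ORDER BY job_title, work_year
    """
).fetchall()

# Recursive query: a common table expression that generates 1..5.
numbers = conn.execute(
    """
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
    """
).fetchall()

print(ranked)
print(above_avg)
print(numbers)
```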

SQL Resources, Courses & Certificates


1. Learn SQL Basics for Data Science (Coursera)
2. Complete SQL and Databases Bootcamp: Zero to Mastery (Udemy)
3. Youtube - FREE :)
4. SQLBolt - FREE :)

🤖 Python
What is it?
Python is a widely used, general-purpose, high-level programming language. It was designed by
Guido van Rossum and first released in 1991; its development is now stewarded by the Python
Software Foundation. It was designed with an emphasis on code readability, and its syntax allows
programmers to express concepts in fewer lines of code.
# Select data in 2021
# (df_3 is assumed to be the pandas DataFrame returned by the SQL cell above)
data_filtered = df_3[df_3["work_year"] == 2021]
data_filtered

     id   work_year  experience_level  employment_type  job_title                   salary     salary_currency  salary_in_usd
72   72   2021       EN                FT               Research Scientist          60000.0    GBP              82528.0
73   73   2021       EX                FT               BI Data Analyst             150000.0   USD              150000.0
74   74   2021       EX                FT               Head of Data                235000.0   USD              235000.0
75   75   2021       SE                FT               Data Scientist              45000.0    EUR              53192.0
76   76   2021       MI                FT               BI Data Analyst             100000.0   USD              100000.0
...  ...  ...        ...               ...              ...                         ...        ...              ...
284  284  2021       MI                FT               Research Scientist          69999.0    USD              69999.0
285  285  2021       SE                FT               Data Science Manager        7000000.0  INR              94665.0
286  286  2021       SE                FT               Head of Data                87000.0    EUR              102839.0
287  287  2021       MI                FT               Data Scientist              109000.0   USD              109000.0
288  288  2021       MI                FT               Machine Learning Engineer   43200.0    EUR              51064.0

217 rows × 12 columns

from lets_plot import *


ggplot(data_filtered) + geom_area(aes(fill="experience_level", color="experie
[Figure: density of salary_in_usd (x-axis, 0 to 600,000) by experience_level (EN, EX, SE, MI); y-axis: density]

Learn Python Core


IDEs (Integrated Development Environments)
Popular IDEs for Python are:
Pycharm
VSCode
Jupyterlab/ Jupyter Notebook for interactive coding
Important libraries:
pandas
numpy
matplotlib
sklearn (for machine learning)
requests (for working with APIs)
Topics:
Data types
Variables
Typecasting
Operators (Assignment, Logical, Arithmetic etc.)
Conditional Statements – If else and Nested If else and elif
Collections (Arrays) – List, Tuple, Sets and Dictionary
List comprehension
Loops in Python – For Loop, While Loop & Nested Loops
String Manipulation – Basic Operations, Slicing & Functions and Methods
User Defined Functions – Defining, Calling, Types of Functions, Arguments
Lambda Function
Installing & Importing Modules
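
As a quick illustration of several core topics in the list above (collections, list comprehension, conditionals, user-defined functions, lambda functions, string manipulation and typecasting), here is a small sketch. The sample values and the `salary_band` helper are made up for this example.

```python
salaries = [70000, 260000, 85000, 20000, 150000]  # made-up sample values

# List comprehension with a condition
high_salaries = [s for s in salaries if s >= 100000]


# User-defined function with a default argument and if / elif / else
def salary_band(salary, threshold=100000):
    if salary >= threshold:
        return "high"
    elif salary >= threshold / 2:
        return "medium"
    else:
        return "low"


bands = [salary_band(s) for s in salaries]

# Lambda function used as a sort key (largest first)
sorted_desc = sorted(salaries, key=lambda s: -s)

# Dictionary and set built from the same data
band_by_salary = dict(zip(salaries, bands))
unique_bands = set(bands)

# String manipulation: strip whitespace, title-case, then take initials
raw_title = "  data scientist "
clean_title = raw_title.strip().title()
initials = "".join(word[0] for word in clean_title.split())

# Typecasting between str, float and int
as_float = float("79833.0")
as_int = int(as_float)

# While loop: count how many times 260000 can be halved before dropping below 50000
value, halvings = 260000, 0
while value >= 50000:
    value /= 2
    halvings += 1

print(high_salaries, bands, sorted_desc, clean_title, initials, as_int, halvings)
```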

Learn Python Intermediate


Virtual Environment
Enumerate
Zip and unzip
Map, Filter and Reduce
*args and **kwargs
Errors and exception handling
Context Managers
Creating Python modules
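
The sketch below touches most of the intermediate topics in one place: enumerate, zip and unzip, map/filter/reduce, *args and **kwargs, exception handling, and a context manager. All names and values (`describe`, `safe_divide`, `logged`, the scores) are hypothetical and chosen only for illustration.

```python
from contextlib import contextmanager
from functools import reduce

names = ["Ana", "Ben", "Chen"]
scores = [82, 91, 77]

# Enumerate: index and value together
numbered = [f"{i}. {name}" for i, name in enumerate(names, start=1)]

# Zip pairs items up; zip(*pairs) "unzips" them again
pairs = list(zip(names, scores))
unzipped_names, unzipped_scores = zip(*pairs)

# Map, filter and reduce
doubled = list(map(lambda s: s * 2, scores))
passed = list(filter(lambda s: s >= 80, scores))
total = reduce(lambda acc, s: acc + s, scores, 0)


# *args collects extra positional arguments, **kwargs extra keyword arguments
def describe(*args, **kwargs):
    return f"{len(args)} positional, keywords: {sorted(kwargs)}"


# Errors and exception handling
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None


# Context manager: setup runs before the block, cleanup always runs after it
events = []


@contextmanager
def logged(label):
    events.append(f"enter {label}")
    try:
        yield
    finally:
        events.append(f"exit {label}")


with logged("block"):
    events.append("working")

print(numbered, pairs, doubled, passed, total)
print(describe(1, 2, unit="usd"), safe_divide(1, 0), events)
```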

Learn Object Oriented Programming (OOP) in Python


(This is mostly useful for model productization and software development. I gave a simple
explanation of OOP in an older video.)
Basics of Object Oriented Programming
Creating Class and Object
Constructors – Parameterized and Non-parameterized
Inheritance in Python
Built-in class methods and attributes
Multi-Level and Multiple Inheritance
Method Overriding and Data Abstraction
Encapsulation
Polymorphism
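
To tie the OOP topics together, here is a minimal sketch with a base class, two subclasses, method overriding, abstraction, encapsulation by convention, and polymorphism. The `Model`, `MeanModel` and `MaxModel` classes are hypothetical toy examples, not part of any library.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Abstract base class: defines the interface, hides the details."""

    def __init__(self, name):
        self.name = name
        self._fitted = False  # leading underscore: internal by convention (encapsulation)

    def fit(self, values):
        self._fit(values)
        self._fitted = True
        return self

    @property
    def is_fitted(self):
        return self._fitted

    @abstractmethod
    def _fit(self, values):
        """Subclasses must implement the actual fitting step."""

    @abstractmethod
    def predict(self):
        """Subclasses must implement prediction."""


class MeanModel(Model):
    def __init__(self):
        super().__init__("mean")  # call the parent constructor (inheritance)

    def _fit(self, values):  # method overriding
        self._value = sum(values) / len(values)

    def predict(self):
        return self._value


class MaxModel(Model):
    def __init__(self):
        super().__init__("max")

    def _fit(self, values):
        self._value = max(values)

    def predict(self):
        return self._value


# Polymorphism: the same calls work on any Model subclass.
data = [1.0, 2.0, 6.0]
models = [MeanModel(), MaxModel()]
predictions = {m.name: m.fit(data).predict() for m in models}
print(predictions)
```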

Python Resources, Courses & Certificates


1. Python for Everybody Specialization (Coursera)
2. Applied Data Science with Python (Coursera)
3. Python Tips (Free online) - for references
4. 📚 Python for Data Analysis
5. 📚 Automate the Boring Stuff with Python
6. 📚 Interactive Python Book (How to Think Like a Computer Scientist, Runestone Academy)

🤖 R
What is it?
R is a programming language for statistical computing and graphics. It is an implementation of
the S language.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early
1990s. Its name comes from the first letter of its authors' first names and is also a play on
the name of S.
R is used among data miners, bioinformaticians and statisticians for data analysis and developing
statistical software.

Learn R basics
IDEs (Integrated Development Environments)
RStudio
Important libraries:
data.table
ggplot2
stats (built-in package for statistical models and tests)

Topics:
Data types (character, numeric, integer, logical, complex)
Vectors
Matrices
Dataframe
Control flow (if-else, while)
apply function family
Descriptive statistics in R
Creating R project in RStudio
Installing & Importing libraries

Learn R advanced
Topics:
Error handling
Lexical scoping
Creating R packages

R Resources, Courses & Certificates


1. Data Analysis with R Specialization (Coursera)
2. 📚 R for Data Science (Hadley Wickham & Garrett Grolemund)
3. 📚 Advanced R by Hadley Wickham

2.2. Data visualization


What is it?
Data visualization is the representation of data through use of common graphics, such as charts,
plots, infographics, and even animations. These visual displays of information communicate
complex data relationships and data-driven insights in a way that is easy to understand.
Data visualization can be created using Python/ R, or proprietary software like Tableau and
PowerBI (which are popular dashboarding tools in businesses).
Popular data viz libraries in Python:
matplotlib
bokeh
plotly
seaborn
altair

Popular data viz libraries in R:


ggplot2
plotly

Data Viz Resources, Courses & Certificates


1. Data Visualization with Tableau Specialization (Coursera)
2. PowerBI course (Codebasics)
3. 📚 Storytelling with Data
4. Mistakes in Data visualization (video)

Data Viz Portfolio Projects


1. Creating an interactive Python visualization dashboard with Panel

from IPython.display import HTML, IFrame


HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/uhx

2.3. Math, Probability & Statistics


Advanced Math & Statistics are mostly useful for machine learning and advanced statistical
analyses. Don't worry if you don't cover everything :)

📈 Linear Algebra
Topics:
Basic properties of matrix and vectors:
scalar multiplication,
linear transformation,
transpose,
conjugate,
rank,
determinant
Inner and outer products
Matrix multiplication rule
Matrix inverse
Special matrices (e.g. square matrix, identity matrix, triangular matrix, sparse and dense
matrices, unit vectors, symmetric matrix)
Matrix factorization concept/LU decomposition
Gaussian/Gauss-Jordan elimination
Solving the linear system of equations Ax = b
Vector space, basis, span, orthogonality, orthonormality, linear least squares
Eigenvalues, eigenvectors, diagonalization, singular value decomposition
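
To see how Gaussian elimination solves Ax = b, here is a minimal pure-Python sketch with partial pivoting. The function name `solve_linear_system` and the 2×2 example system are made up for illustration; in practice you would normally use a numerical library such as numpy rather than hand-written loops.

```python
def solve_linear_system(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [A | b] as floats, leaving the inputs untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Forward elimination: make the matrix upper triangular.
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution: solve from the last row upwards.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


# Example system: 2x + y = 3 and x + 3y = 5
A = [[2, 1], [1, 3]]
b = [3, 5]
x = solve_linear_system(A, b)
print(x)  # approximately [0.8, 1.4]
```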
Why learn Linear Algebra?
You will encounter linear algebra in several machine learning algorithms. For example, principal
component analysis uses singular value decomposition to represent your data in fewer dimensions.
Neural networks also rely on linear algebra to represent network structures and compute
the network parameters.
Resources & Courses:
1. Mathematics for Machine Learning and Data Science Specialization (Coursera +
Deeplearning.ai) (first course)
📉 Calculus
The mathematical study of continuous change.
Topics:
Limits
Derivative of a function
Integrals
Partial derivatives & the chain rule
Maxima and minima
Why learn Calculus?
Ever come across the “gradient descent” method in machine learning? It is a direct application
of calculus.
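
Here is a minimal sketch of that idea, assuming a simple one-dimensional function f(x) = (x - 3)^2 chosen for illustration. Its derivative f'(x) = 2(x - 3) tells us which direction is downhill, and gradient descent repeatedly steps against it. The function names and the learning rate are arbitrary choices, not a recommendation.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


def f(x):
    return (x - 3) ** 2  # minimum at x = 3


def f_prime(x):
    return 2 * (x - 3)  # derivative of f


minimum = gradient_descent(f_prime, start=0.0)
print(minimum, f(minimum))  # x close to 3, f(x) close to 0
```

Each step multiplies the distance to the minimum by 1 - 2 × 0.1 = 0.8, which is why the iterate converges here; a learning rate that is too large would make it overshoot instead.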
Resources & Courses:
1. Mathematics for Machine Learning and Data Science Specialization (Coursera +
Deeplearning.ai)

🤔 Probability & Statistics


Topics:
Basic statistics like data summaries and descriptive statistics:
mean
mode
quantile
standard deviation
variance/ covariance
Conditional probability (for example when you learn about Bayes' theorem)
Probability distributions
Sampling
Hypothesis testing
Central Limit Theorem
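
A few of these topics can be tried out with nothing but Python's standard library. The sketch below computes descriptive statistics, applies Bayes' theorem to a hypothetical diagnostic test, and simulates the Central Limit Theorem. All numbers (the data, the 1% prevalence, the 95% sensitivity, the 5% false positive rate) are invented for illustration.

```python
import random
import statistics

# Descriptive statistics on a small made-up sample
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

# Bayes' theorem: P(disease | positive test) for a hypothetical test
p_disease = 0.01
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.05   # false positive rate
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = p_pos_given_disease * p_disease / p_pos

# Central Limit Theorem: means of uniform samples cluster around 0.5,
# with spread close to sigma / sqrt(n), even though the data are not normal.
rng = random.Random(0)
sample_size = 30
sample_means = [
    statistics.fmean(rng.random() for _ in range(sample_size))
    for _ in range(2000)
]
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
expected_spread = (1 / 12) ** 0.5 / sample_size ** 0.5  # uniform(0, 1) has variance 1/12

print(mean, median, mode, variance, std_dev)
print(round(posterior, 3))
print(mean_of_means, spread_of_means, expected_spread)
```

Note how low the posterior is despite the accurate test: with a rare condition, most positive results come from the much larger healthy group.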
Why learn Prob/ Stats?
Because it is the backbone of statistical learning (traditional ML).
Resources & Courses:
1. An Introduction to Statistical Learning
2. 📚 Naked Statistics - beginner friendly
3. Practical Statistics for Data Scientists - beginner friendly

2.4. Machine learning/ Deep learning

Topics
Feature Selection
Feature Scaling/ standardizing
Data Resampling
Undersampling
Oversampling
Handling missing values/ Data imputation
Detecting outliers
Train-test split, cross validation
Evaluating an ML model & performance metrics
Variety of algorithms:
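
Some of the preprocessing and validation topics in this list (feature scaling, train-test split, cross validation) can be sketched in plain Python to show what they do under the hood. The helper names below are hypothetical and written only for illustration; libraries such as scikit-learn provide tested equivalents that you would use in real projects.

```python
import random
import statistics


def standardize(values):
    """Feature scaling: shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


salaries = [20000.0, 70000.0, 85000.0, 150000.0, 260000.0]  # made-up values
scaled = standardize(salaries)

train_rows, test_rows = train_test_split(list(range(10)))
folds = list(k_fold_indices(6, k=3))

print(scaled)
print(train_rows, test_rows)
print(folds)
```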
Machine Learning Resources
🤖 Machine Learning Specialization by Andrew Ng (Coursera)
🤖 Deep Learning Specialization by Andrew Ng (Coursera)
📚 Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
📚 Probabilistic Machine Learning: An Introduction (Kevin P. Murphy)
📚 Deep learning book (Ian Goodfellow and Yoshua Bengio and Aaron Courville)
2.5. Software Development

📔 Git version control


Git, invented by Linus Torvalds in 2005, is a version control system that developers use all over
the world. It helps you track different versions of your code and collaborate with other
developers.
Note: Git is NOT the same as GitHub. Git is version control software; GitHub is a cloud-based
hosting service that lets you manage Git repositories.
🎨 Coding style
It is a good practice to stick to a certain style guide when coding. It helps make the code more
readable and easier to maintain. It also makes you look much more professional. 😉
R: Google's R style guide - based on Tidyverse style guide
Python: PEP8, and PEP484 for type hints
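
As a small illustration of what these guides look like in practice, the sketch below follows common PEP 8 conventions (snake_case names, an UPPER_CASE constant, a docstring) and adds PEP 484 type hints. The function and constant are made up for this example.

```python
from typing import Optional, Sequence

MIN_SALARIES_REQUIRED = 1  # module-level constants use UPPER_SNAKE_CASE


def average_salary(salaries: Sequence[float]) -> Optional[float]:
    """Return the mean of the salaries, or None if there are too few values."""
    if len(salaries) < MIN_SALARIES_REQUIRED:
        return None
    return sum(salaries) / len(salaries)


print(average_salary([100.0, 200.0]))
print(average_salary([]))
```

Type hints are not enforced at runtime; they document intent and let tools such as type checkers and IDEs flag mismatches early.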

🧩 Data Structures & Algorithms (CS Fundamentals)


For pure data science, it is probably not necessary to learn in-depth DS&A. But when I did
network analysis, I found it quite useful to know how graph data structures work and the
algorithms on graphs.
Resources:
[https://www.programiz.com/dsa]
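
To give a flavor of graph data structures and algorithms, here is a minimal sketch of breadth-first search on a graph stored as an adjacency list (a dictionary mapping each node to its neighbors). The graph and node names are invented for illustration.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return a path with the fewest edges, or None."""
    queue = deque([start])
    parents = {start: None}  # also serves as the set of visited nodes
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical directed graph as an adjacency list
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

path = shortest_path(graph, "A", "E")
print(path)
```

A deque is used because popping from the left of a list is slow for large queues, while `deque.popleft()` takes constant time.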

🤖 Unit testing
Unit testing is a technique in which a developer tests an individual module or function in
isolation to check that it behaves as expected.
Learn to use the pytest library in Python
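
A minimal sketch of what such tests can look like is shown below. The `clean_job_title` function is a hypothetical example; if the code were saved in a file such as test_cleaning.py (an assumed name), running `pytest` in that folder should discover and run every function whose name starts with `test_`.

```python
def clean_job_title(raw_title):
    """Collapse extra whitespace and title-case a job title."""
    if not raw_title.strip():
        raise ValueError("job title must not be empty")
    return " ".join(raw_title.split()).title()


def test_clean_job_title_collapses_whitespace():
    assert clean_job_title("  data   scientist ") == "Data Scientist"


def test_clean_job_title_keeps_clean_input():
    assert clean_job_title("Data Engineer") == "Data Engineer"


def test_clean_job_title_rejects_empty_input():
    # Plain try/except keeps this sketch dependency-free;
    # with pytest installed, `pytest.raises(ValueError)` is the usual idiom.
    try:
        clean_job_title("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty title")
```

Each test checks one behavior, so a failure points directly at what broke.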

2.6. Other skills
