
Lecture 1: Introduction to Reinforcement Learning

David Silver
Lecture 1: Introduction to Reinforcement Learning

Outline

1 Admin

2 About Reinforcement Learning

3 The Reinforcement Learning Problem

4 Inside An RL Agent

5 Problems within Reinforcement Learning


Lecture 1: Introduction to Reinforcement Learning
Admin

Class Information

Thursdays 9:30 to 11:00am


Website:
http://www.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
Group:
http://groups.google.com/group/csml-advanced-topics
Contact me: [email protected]
Lecture 1: Introduction to Reinforcement Learning
Admin

Assessment

Assessment will be 50% coursework, 50% exam


Coursework
Assignment A: RL problem
Assignment B: Kernels problem
Assessment = max(assignment1, assignment2)
Examination
A: 3 RL questions
B: 3 kernels questions
Answer any 3 questions
Lecture 1: Introduction to Reinforcement Learning
Admin

Textbooks

An Introduction to Reinforcement Learning, Sutton and Barto
MIT Press, 1998
40 pounds
Available free online!
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
Algorithms for Reinforcement Learning, Szepesvari
Morgan and Claypool, 2010
20 pounds
Available free online!

http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
Lecture 1: Introduction to Reinforcement Learning
About RL

Many Faces of Reinforcement Learning

[Venn diagram: reinforcement learning sits at the intersection of Machine Learning (Computer Science), Optimal Control (Engineering), the Reward System (Neuroscience), Classical/Operant Conditioning (Psychology), Bounded Rationality (Economics), and Operations Research (Mathematics).]
Lecture 1: Introduction to Reinforcement Learning
About RL

Branches of Machine Learning

[Venn diagram: Supervised Learning, Unsupervised Learning, and Reinforcement Learning as the three branches of Machine Learning.]
Lecture 1: Introduction to Reinforcement Learning
About RL

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent's actions affect the subsequent data it receives
Lecture 1: Introduction to Reinforcement Learning
About RL

Examples of Reinforcement Learning

Fly stunt manoeuvres in a helicopter


Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
Lecture 1: Introduction to Reinforcement Learning
About RL

Helicopter Manoeuvres
Lecture 1: Introduction to Reinforcement Learning
About RL

Bipedal Robots
Lecture 1: Introduction to Reinforcement Learning
About RL

Atari
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward

Rewards

A reward Rt is a scalar feedback signal


Indicates how well the agent is doing at step t
The agent's job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected
cumulative reward
Do you agree with this statement?
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward

Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
−ve reward for crashing
Defeat the world champion at Backgammon
+/−ve reward for winning/losing a game
Manage an investment portfolio
+ve reward for each $ in bank
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
Play many different Atari games better than humans
+/−ve reward for increasing/decreasing score
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward

Sequential Decision Making

Goal: select actions to maximise total future reward


Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more
long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many
moves from now)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Environments

Agent and Environment

[Diagram: the agent receives observation Ot and reward Rt from the environment and sends action At back to it.]
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Environments

Agent and Environment

At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1
t increments at env. step
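This interaction loop can be sketched in a few lines of Python. It is a minimal illustration, not any particular library's API: the Environment class, its step method, and the random agent below are hypothetical stand-ins for whatever system the agent faces.

```python
import random

class Environment:
    """Toy environment: a hypothetical stand-in for the real system."""
    def step(self, action):
        # Emit the next observation and a scalar reward in response to the action.
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    """Trivial agent that picks actions at random."""
    def act(self, observation):
        return random.choice([0, 1])

env, agent = Environment(), Agent()
observation = 0.0
for t in range(10):
    action = agent.act(observation)          # agent executes action At
    observation, reward = env.step(action)   # environment emits Ot+1 and Rt+1
    print(f"t={t}  action={action}  reward={reward}")
```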
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

History and State

The history is the sequence of observations, actions, rewards

Ht = O1, R1, A1, ..., At−1, Ot, Rt

i.e. all observable variables up to time t


i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:

St = f(Ht)
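To make St = f(Ht) concrete, here is a small sketch in which the history is stored as a list of (action, observation, reward) triples and two possible choices of f are shown; both the tuple layout and the choice of state functions are illustrative assumptions, not part of the lecture.

```python
# History Ht: a list of (action, observation, reward) triples up to time t.
history = [(0, 0.2, 0.0), (1, 0.7, 1.0), (0, 0.4, 0.0)]

def last_observation(history):
    """One possible state: keep only the most recent observation."""
    return history[-1][1]

def full_history(history):
    """Another possible state: the agent remembers everything."""
    return tuple(history)

print(last_observation(history))  # 0.4
print(full_history(history))
```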
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Environment State

[Diagram: agent-environment loop, with the environment state Ste held inside the environment.]

The environment state Ste is the environment's private representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment state is not usually visible to the agent
Even if Ste is visible, it may contain irrelevant information
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Agent State
[Diagram: agent-environment loop, with the agent state Sta held inside the agent.]

The agent state Sta is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be any function of history:

Sta = f(Ht)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Information State
An information state (a.k.a. Markov state) contains all useful
information from the history.
Definition
A state St is Markov if and only if

P[St+1 | St] = P[St+1 | S1, ..., St]

The future is independent of the past given the present


H1:t → St → Ht+1:∞
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state Ste is Markov
The history Ht is Markov
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Rat Example

What if agent state = last 3 items in sequence?


What if agent state = counts for lights, bells and levers?
What if agent state = complete sequence?
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Fully Observable Environments

[Diagram: the agent observes the environment state St directly, takes action At, and receives reward Rt.]

Full observability: agent directly observes environment state

Ot = Sta = Ste

Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
(Next lecture and the majority of this course)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State

Partially Observable Environments

Partial observability: agent indirectly observes environment:


A robot with camera vision isn't told its absolute location
A trading agent only observes current prices
A poker playing agent only observes public cards
Now agent state ≠ environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation Sta, e.g.
Complete history: Sta = Ht
Beliefs of environment state: Sta = (P[Ste = s1], ..., P[Ste = sn])
Recurrent neural network: Sta = σ(St−1a Ws + Ot Wo)
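A rough numerical sketch of the recurrent update above, using NumPy. The dimensions, the random weight matrices Ws and Wo, and the choice of a sigmoid for σ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 4, 3

W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights (assumed)
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = np.zeros(state_dim)          # previous agent state
o = rng.normal(size=obs_dim)     # current observation Ot
s = sigmoid(s @ W_s + o @ W_o)   # new agent state Sta
print(s)
```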
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent's behaviour function
Value function: how good is each state and/or action
Model: agent's representation of the environment
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Policy

A policy is the agent's behaviour

It is a map from state to action, e.g.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
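As a minimal sketch, a deterministic policy can be a plain lookup table from state to action, and a stochastic policy a distribution over actions for each state; the states and actions below are invented for illustration.

```python
import random

# Deterministic policy: a = pi(s)
deterministic_pi = {"s1": "left", "s2": "right"}

# Stochastic policy: pi(a|s) = P[At = a | St = s]
stochastic_pi = {"s1": {"left": 0.8, "right": 0.2},
                 "s2": {"left": 0.1, "right": 0.9}}

def sample_action(pi, state):
    """Draw an action from the policy's distribution for this state."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_pi["s1"])              # always "left"
print(sample_action(stochastic_pi, "s1"))  # "left" roughly 80% of the time
```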
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Value Function

Value function is a prediction of future reward

Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.

vπ(s) = E[Rt+1 + γ Rt+2 + γ² Rt+3 + ... | St = s]
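The expectation is taken over the agent's future experience; the discounted sum inside it is easy to compute for any finite reward sequence. A small sketch, with a made-up reward sequence and the common choice γ = 0.9:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * R_{t+1+k} over a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example rewards Rt+1, Rt+2, Rt+3 observed from some state s
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```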



Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Example: Value Function in Atari


Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Model

A model predicts what the environment will do next

P predicts the next state
R predicts the next (immediate) reward, e.g.

Pss'a = P[St+1 = s' | St = s, At = a]
Rsa = E[Rt+1 | St = s, At = a]
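A tabular model can be sketched as two dictionaries keyed by (state, action): one holding the transition probabilities Pss'a and one the expected immediate reward Rsa. All entries below are invented for illustration.

```python
# Transition model: P[s' | s, a], stored as {(s, a): {s': prob}}
P = {("s1", "go"): {"s2": 0.9, "s1": 0.1}}

# Reward model: E[Rt+1 | s, a]
R = {("s1", "go"): 1.0}

def expected_next_value(s, a, value):
    """One-step lookahead under the model: E[R + value(s') | s, a]."""
    return R[(s, a)] + sum(p * value[s2] for s2, p in P[(s, a)].items())

value = {"s1": 0.0, "s2": 5.0}
print(expected_next_value("s1", "go", value))  # 1.0 + 0.9*5.0 + 0.1*0.0 = 5.5
```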
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Maze Example

[Figure: a maze gridworld with a Start cell and a Goal cell.]

Rewards: −1 per time-step
Actions: N, E, S, W
States: Agent's location
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Maze Example: Policy

[Figure: the maze with an arrow drawn in each cell, pointing along the route from Start to Goal.]

Arrows represent policy π(s) for each state s
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Maze Example: Value Function

[Figure: the maze with a number in each cell, counting down from −24 in the cells furthest from the goal to −1 in the cell next to it.]

Numbers represent value vπ(s) of each state s
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Maze Example: Model

[Figure: the agent's internal model of the maze, a partial grid showing reward −1 in each cell it has encountered.]

Agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect

Grid layout represents transition model Pss'a
Numbers represent immediate reward Rsa from each state s (same for all a)
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Categorizing RL agents (1)

Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

Categorizing RL agents (2)

Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent

RL Agent Taxonomy

[Venn diagram: RL agent taxonomy. Value-Based agents (value function only), Policy-Based agents (policy only) and Actor-Critic agents (both) overlap in the policy/value-function plane; each may additionally be Model-Free or Model-Based depending on whether it uses a Model.]
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Learning and Planning

Two fundamental problems in sequential decision making


Reinforcement Learning:
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any
external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering,
thought, search
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Atari Example: Reinforcement Learning

[Figure: the agent playing Atari from screen pixels, receiving observation Ot and reward Rt and sending action At.]

Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on joystick, see pixels and scores
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Atari Example: Planning

[Figure: a lookahead search tree over emulator states, branching left/right on the agent's actions.]

Rules of the game are known
Can query emulator: perfect model inside the agent's brain
If I take action a from state s:
what would the next state be?
what would the score be?
Plan ahead to find optimal policy
e.g. tree search
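A minimal sketch of planning with a known model: exhaustive depth-limited lookahead over a tiny hypothetical emulator. The transition table is invented, and real planners (e.g. Monte-Carlo tree search) are far more selective, but the principle of querying the model rather than the environment is the same.

```python
# Hypothetical deterministic emulator: (state, action) -> (next_state, reward)
MODEL = {
    ("s0", "left"):  ("s1", 0.0),
    ("s0", "right"): ("s2", 1.0),
    ("s1", "left"):  ("s1", 0.0),
    ("s1", "right"): ("s0", 0.0),
    ("s2", "left"):  ("s0", 0.0),
    ("s2", "right"): ("s2", 1.0),
}
ACTIONS = ["left", "right"]

def plan(state, depth):
    """Best achievable total reward from `state` within `depth` steps."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        next_state, reward = MODEL[(state, a)]
        best = max(best, reward + plan(next_state, depth - 1))
    return best

print(plan("s0", depth=3))  # 3.0: keep choosing "right"
```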
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Exploration and Exploitation (1)

Reinforcement learning is like trial-and-error learning


The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Exploration and Exploitation (2)

Exploration finds more information about the environment


Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
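One common way to balance the two is ε-greedy action selection: exploit the action currently believed best most of the time, and explore a random action with small probability ε. A sketch with made-up action-value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the argmax action with probability 1-epsilon, a random one otherwise."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore
    return max(q_values, key=q_values.get)      # exploit

q = {"favourite restaurant": 4.2, "new restaurant": 0.0}
print(epsilon_greedy(q))  # usually the favourite, occasionally the new one
```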
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Examples

Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Prediction and Control

Prediction: evaluate the future


Given a policy
Control: optimise the future
Find the best policy
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Gridworld Example: Prediction

[Figure: (a) a 5×5 gridworld with special states A (+10) and B (+5) and the available actions; (b) the value function of the uniform random policy, with values ranging from 8.8 near A down to −2.0 in the bottom row.]
What is the value function for the uniform random policy?
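Prediction problems like this one can be solved by iterative policy evaluation. The sketch below evaluates the uniform random policy on a generic tabular MDP with deterministic transitions; the tiny two-state MDP at the bottom is an invented placeholder rather than the gridworld in the figure.

```python
def policy_evaluation(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iteratively compute v_pi for the uniform random policy."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Average the one-step backup over all actions (uniform policy).
            new_v = sum((1.0 / len(actions)) * (R[(s, a)] + gamma * v[P[(s, a)]])
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Invented two-state MDP purely to exercise the routine.
states, actions = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): "s0", ("s0", "move"): "s1",
     ("s1", "stay"): "s1", ("s1", "move"): "s0"}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "move"): 0.0}
print(policy_evaluation(states, actions, P, R))
```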
Lecture 1: Introduction to Reinforcement Learning
Problems within RL

Gridworld Example: Control

[Figure: (a) the same gridworld; (b) the optimal value function v*, with values ranging from 24.4 near A down to 11.7; (c) the optimal policy π*.]

What is the optimal value function over all possible policies?


What is the optimal policy?
Lecture 1: Introduction to Reinforcement Learning
Course Outline

Course Outline

Part I: Elementary Reinforcement Learning


1 Introduction to RL
2 Markov Decision Processes
3 Planning by Dynamic Programming
4 Model-Free Prediction
5 Model-Free Control
Part II: Reinforcement Learning in Practice
1 Value Function Approximation
2 Policy Gradient Methods
3 Integrating Learning and Planning
4 Exploration and Exploitation
5 Case study - RL in games
