@(Cabinet)[ml_dl_rl, aca_book|published_gitbook]
date = "2016-01-10"
RL: An Introduction, Ch. 3
p64
3.1 The Agent-Environment Interface
At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_t$, where $\pi_t(s,a)$ is the probability that $a_t = a$ if $s_t = s$.
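As a minimal illustration of such a mapping (the two states, two actions, and probabilities below are made up for this sketch, not taken from the book), a tabular stochastic policy can be stored as a dictionary and sampled from:

```python
import random

# Hypothetical tabular policy: policy[s][a] = probability of selecting action a in state s.
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(pi, state):
    """Draw an action according to the probabilities pi(state, .)."""
    actions = list(pi[state])
    probs = list(pi[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))  # "right" roughly 70% of the time
```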
3.3 Returns
In general, we seek to maximize the expected return, $E[R_t]$, where the return, $R_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards: $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T$, where $T$ is a final time step.
3.4 Unified Notation for Episodic and Continuing Tasks
The return is $R_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$, including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
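As a quick worked example (the reward values and discount rate are arbitrary), the return of a finite episode can be accumulated backwards; with $\gamma = 1$ this reduces to the plain sum of rewards from Section 3.3:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + ... by folding the reward list from the end."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

rewards = [1.0, 0.0, 0.0, 2.0]              # r_{t+1}, ..., r_T (made-up values)
print(discounted_return(rewards, 1.0))      # 3.0, the undiscounted sum
print(discounted_return(rewards, 0.9))      # 2.458 = 1.0 + 0.9**3 * 2.0
```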
3.6 Markov Decision Processes
A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment.
Given any state and action, $s$ and $a$, the probability of each possible next state, $s'$, is $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$. These quantities are called transition probabilities.
Similarly, given any current state and action, $s$ and $a$, together with any next state, $s'$, the expected value of the next reward is $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$.
These quantities, $P^a_{ss'}$ and $R^a_{ss'}$, completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost).
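One straightforward way to hold these quantities in code is a pair of nested dictionaries keyed by state, action, and next state; the two-state MDP below is an invented example, not one from the book:

```python
# Hypothetical finite MDP: P[s][a][s2] is a transition probability, R[s][a][s2] an expected reward.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.6, "s1": 0.4}},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0, "s1": 2.0}},
}

# Sanity check: for every (s, a) pair the next-state probabilities must sum to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9, (s, a)
```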
3.7 Value Functions
Almost all reinforcement learning algorithms are based on estimating value functions -- functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected or, to be precise, in terms of expected return. Accordingly, value functions are defined with respect to particular policies.
A policy, $\pi$, is a mapping from each state, $s \in S$, and action, $a \in A(s)$, to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. Informally, the value of a state $s$ under a policy $\pi$, denoted $V^\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we define $V^\pi(s)$ formally as $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$. We call $V^\pi$ the state-value function for policy $\pi$.
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $Q^\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$: $Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right\}$. We call $Q^\pi$ the action-value function for policy $\pi$.
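Because both definitions are just expected returns, they can be estimated by averaging sampled returns. The sketch below does this for $V^\pi$ with a truncated-horizon Monte Carlo rollout on a made-up two-state MDP (P, R, and a uniform policy are redefined here so the snippet runs on its own); it is an illustration of the definition, not an algorithm from the book.

```python
import random

# Made-up two-state MDP (same shape as the P/R dictionaries above) and a uniform random policy.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.6, "s1": 0.4}}}
R = {"s0": {"stay": {"s0": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0, "s1": 2.0}}}
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 0.5, "go": 0.5}}

def mc_state_value(state, gamma=0.9, episodes=2000, horizon=50):
    """Estimate V^pi(state) as the average (truncated) discounted return over sampled rollouts."""
    total = 0.0
    for _ in range(episodes):
        s, ret, disc = state, 0.0, 1.0
        for _ in range(horizon):
            a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
            s2 = random.choices(list(P[s][a]), weights=list(P[s][a].values()))[0]
            ret += disc * R[s][a][s2]
            disc *= gamma
            s = s2
        total += ret
    return total / episodes

print(mc_state_value("s0"))  # noisy estimate of V^pi(s0)
```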
(p83) The Bellman equation for $V^\pi$ is: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$.
It expresses a relationship between the value of a state and the values of its successor states.
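That recursive relationship is what iterative policy evaluation exploits: repeatedly replace $V(s)$ by the right-hand side of the Bellman equation until the values stop changing. A minimal sketch, written against the hypothetical P, R, and pi dictionaries used in the earlier snippets:

```python
def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-10):
    """Apply the Bellman equation for V^pi as an update rule until the values converge."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in P[s][a])
                        for a in pi[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: V_pi = policy_evaluation(P, R, pi) with the dictionaries defined above.
```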
3.8 Optimal Value Functions
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of rewards over the long run. Value functions define a partial ordering over policies. A policy is defined to be better than or equal to a policy if its expected return is greater than or equal to that of for all
states. There is always at least one policy that is better than or equal to all other policies. Although there may be more than one optimal policy, we denote all the optimal policies by . They share the same state-value function, called the optimal state-value
function, denoted , and defined as , for all .
Optimal policies also share the same optimal action-value function, denoted $Q^*$, and defined as $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s \in S$ and $a \in A(s)$. This function gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.
We can write $Q^*$ in terms of $V^*$ as follows: $Q^*(s,a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$.
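When the one-step dynamics are available, this relation is a single sum over next states. A hedged sketch, assuming the hypothetical P and R dictionaries from earlier and an already-computed table V_star of optimal state values:

```python
def q_from_v(P, R, V_star, s, a, gamma=0.9):
    """Q*(s,a) = sum over s' of P[s][a][s'] * (R[s][a][s'] + gamma * V*(s'))."""
    return sum(P[s][a][s2] * (R[s][a][s2] + gamma * V_star[s2]) for s2 in P[s][a])
```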
p89
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$.
The Bellman optimality equation for $Q^*$ is: $Q^*(s,a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$.
For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $N$ states, then there are $N$ equations in $N$ unknowns.
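In practice this system of nonlinear equations is rarely solved directly; value iteration turns the Bellman optimality equation into an update rule, sweeps it until convergence, and then reads off a greedy (optimal) policy. A sketch against the same hypothetical P and R dictionaries as above:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-10):
    """Sweep V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma*V(s')) to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # A policy that is greedy with respect to V* is an optimal policy.
    pi_star = {s: max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                              for s2 in P[s][a]))
               for s in P}
    return V, pi_star
```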