A Shallow Dive into Deep Reinforcement Learning
Background
Once we take an action $A_t \in \mathcal{A}$ from a state $S_t \in \mathcal{S}$, we receive a reward $R_{t+1} \in \mathbb{R}$ from the environment and enter a new state $S_{t+1}$. Usually the environment and the policy are stochastic rather than deterministic, so all the quantities above are random variables. Define the reward function $r$ as the expected reward given a state and action:
\[r: \mathcal{S} \times \mathcal{A} \longrightarrow \mathbb{R}, \; (s, a) \mapsto \mathbb{E} [R_{t+1} \vert S_t = s, A_t = a].\]
A policy tells us which action to take at state $S_t$. It can be either deterministic, $\pi (s) = a$, or stochastic, $\pi (a\vert s) = \mathbb{P}_{\pi} [A = a \vert S = s]$. The future reward, a.k.a. the return, is the sum of discounted upcoming rewards:
\[G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k = 0}^{\infty} \gamma^k R_{t+k+1}, \; \gamma \in [0, 1].\]
The action-value, a.k.a. Q-value, of a state-action pair is defined as:
\[Q_{\pi} (s, a) = \mathbb{E}_{\pi} [G_t \vert S_t = s, A_t = a].\]
The state-value, a.k.a. V-value, of a state is defined as:
\[V_{\pi} (s) = \mathbb{E}_{\pi} [G_t \vert S_t = s] = \sum_{a \in \mathcal{A}} Q_{\pi} (s, a) \pi (a\vert s) = \mathbb{E}_{a \sim \pi (s)} [Q_{\pi} (s, a)].\]
The difference between the action-value and the state-value is the action advantage function, a.k.a. A-value: $A_{\pi} (s, a) = Q_{\pi} (s, a) - V_{\pi} (s)$.
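As a quick sanity check on these definitions, here is a small Python sketch (the reward sequence and discount are just for illustration) that computes the discounted return $G_t$ at every step of a finished episode by scanning the rewards backwards:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for each step of a finished episode.

    rewards[t] is the reward R_{t+1} received after taking action A_t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example with gamma = 0.5: G_0 = 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```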
Markov Decision Process (MDP)
A Markov decision process consists of five elements $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $P$ is the transition probability function defined as:
\[P^a_{ss'} = \mathbb{P} [S_{t+1} = s' \vert S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} \mathbb{P} [S_{t+1} = s', R_{t+1} = r \vert S_t = s, A_t = a],\]
and all states are assumed to be Markov: $\mathbb{P} [S_{t+1} \vert S_t] = \mathbb{P} [S_{t+1} \vert S_1, \ldots, S_t]$.
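To make the later sketches concrete, a toy two-state, two-action MDP can be written down as plain NumPy tables; the particular numbers below are made up for illustration. `P[s, a, s']` stores $P^a_{ss'}$ and `r[s, a]` stores the expected reward $r(s, a)$:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = P(S_{t+1} = s' | S_t = s, A_t = a); each row P[s, a, :] sums to 1.
P = np.array([
    [[0.8, 0.2],   # state 0, action 0
     [0.1, 0.9]],  # state 0, action 1
    [[0.5, 0.5],   # state 1, action 0
     [0.0, 1.0]],  # state 1, action 1
])

# r[s, a] = expected immediate reward r(s, a).
r = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
```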
Bellman Equations
Value Equation
State-value:
\[V_{\pi} (s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma V_{\pi} (S_{t+1}) \vert S_t = s].\]
Action-value:
\[Q_{\pi} (s, a) = \mathbb{E}_{\pi} [R_{t+1} + \gamma \mathbb{E}_{a' \sim \pi (S_{t+1})} [Q_{\pi} (S_{t+1}, a')] \vert S_t = s, A_t = a].\]
Expectation Equation
State-value:
\[V_{\pi} (s) = \sum_{a \in \mathcal{A}} \pi (a\vert s) Q_{\pi} (s, a) = \sum_{a \in \mathcal{A}} \pi (a\vert s) \bigg(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s')\bigg).\]
Action-value:
\[Q_{\pi} (s, a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s') = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} \sum_{a' \in \mathcal{A}} \pi (a' \vert s') Q_{\pi} (s', a').\]
Optimality Equation
State-value:
\[V_* (s) = \max_{a \in \mathcal{A}} Q_* (s, a) = \max_{a \in \mathcal{A}} \bigg(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_* (s')\bigg).\]
Action-value:
\[Q_* (s, a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_* (s') = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} \max_{a' \in \mathcal{A}} Q_* (s', a').\]
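Both families of equations translate directly into one-step backup operators. A minimal NumPy sketch, assuming tables shaped like the toy `P`, `r`, `gamma` above and a policy table `pi[s, a]` $= \pi(a \vert s)$:

```python
import numpy as np

def bellman_expectation(V, pi, P, r, gamma):
    """One Bellman expectation backup for a stochastic policy pi[s, a] = pi(a|s)."""
    Q = r + gamma * P @ V          # Q(s, a) = r(s, a) + gamma * sum_{s'} P[s, a, s'] V(s')
    return (pi * Q).sum(axis=1)    # V(s)    = sum_a pi(a|s) Q(s, a)

def bellman_optimality(V, P, r, gamma):
    """One Bellman optimality backup."""
    Q = r + gamma * P @ V
    return Q.max(axis=1)           # V_*(s) = max_a Q_*(s, a)

# Example (using P, r, gamma, n_states from the MDP sketch above):
#   V = np.zeros(n_states)
#   for _ in range(200):
#       V = bellman_optimality(V, P, r, gamma)   # repeated backups converge to V_*
```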
Dynamic Programming
- Policy evaluation: iteratively apply the Bellman expectation backup, $V_{k+1} (s) = \sum_{a \in \mathcal{A}} \pi (a\vert s) \big(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_k (s')\big)$, until $V$ converges to $V_{\pi}$.
- Policy improvement: act greedily with respect to the evaluated values, $\pi' (s) = \arg\max_{a \in \mathcal{A}} Q_{\pi} (s, a)$ (policy iteration; see the sketch below).
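A minimal tabular policy-iteration sketch, assuming `P[s, a, s']` and `r[s, a]` tables shaped like the toy MDP above:

```python
import numpy as np

def policy_iteration(P, r, gamma, n_states, n_actions, tol=1e-8):
    pi = np.ones((n_states, n_actions)) / n_actions  # start from the uniform policy
    while True:
        # Policy evaluation: sweep the Bellman expectation backup until V stops changing.
        V = np.zeros(n_states)
        while True:
            Q = r + gamma * P @ V
            V_new = (pi * Q).sum(axis=1)
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to the evaluated Q.
        greedy = np.argmax(r + gamma * P @ V, axis=1)
        pi_new = np.eye(n_actions)[greedy]
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

# Example: pi_star, V_star = policy_iteration(P, r, gamma, n_states, n_actions)
```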
Monte-Carlo Methods
Generate complete episodes $\{(S_t, A_t, R_{t+1}, S_{t+1})\}$ from policy $\pi$, and compute the returns $G_t = \sum_{k = 0}^{T-t-1} \gamma^k R_{t+k+1}$.
- Policy evaluation: estimate $Q (s, a)$ as the average of the returns observed for $(s, a)$, $Q (s, a) = \frac{\sum_{t} \mathbb{1} [S_t = s, A_t = a] G_t}{\sum_{t} \mathbb{1} [S_t = s, A_t = a]}$ (see the sketch after this list).
- Policy improvement: act greedily (in practice $\epsilon$-greedily, to keep exploring) with respect to the estimated values, $\pi (s) = \arg\max_{a \in \mathcal{A}} Q (s, a)$.
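A minimal every-visit Monte-Carlo evaluation sketch for the tabular case; the `env.reset()` / `env.step(a)` interface returning `(next_state, reward, done)` is an assumption for illustration, not part of the original text:

```python
from collections import defaultdict

def mc_q_evaluation(env, policy, n_episodes=1000, gamma=0.99):
    """Every-visit Monte-Carlo estimate of Q(s, a) as the mean of observed returns."""
    q_sum = defaultdict(float)
    q_count = defaultdict(int)
    for _ in range(n_episodes):
        # Roll out one complete episode under the given policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, reward, done = env.step(a)
            episode.append((s, a, reward))
            s = s_next
        # Walk the episode backwards, accumulating the discounted return G_t.
        g = 0.0
        for s, a, reward in reversed(episode):
            g = reward + gamma * g
            q_sum[(s, a)] += g
            q_count[(s, a)] += 1
    return {sa: q_sum[sa] / q_count[sa] for sa in q_sum}
```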
Temporal-Difference (TD) Learning
SARSA
Generate transitions $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ by following policy $\pi$.
- Policy evaluation: $Q (s_t, a_t) \leftarrow Q (s_t, a_t) + \alpha \big(r_{t+1} + \gamma Q (s_{t+1}, a_{t+1}) - Q (s_t, a_t)\big)$ (see the sketch below).
- Policy improvement after $T$ steps.
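A minimal tabular SARSA sketch with an $\epsilon$-greedy behavior policy, under the same assumed `env` interface as the Monte-Carlo sketch:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # Behave epsilon-greedily with respect to the current Q.
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, reward, done = env.step(a)
            a_next = eps_greedy(s_next)
            # On-policy TD target: bootstrap from the action actually chosen at s_next.
            td_target = reward + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next
    return Q
```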
Q-Learning
Generate transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ by following policy $\pi$.
- Policy evaluation: $Q (s_t, a_t) \leftarrow Q (s_t, a_t) + \alpha \big(r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q (s_{t+1}, a') - Q (s_t, a_t)\big)$ (see the sketch below).
- Policy improvement after $T$ steps.
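The Q-learning loop differs from SARSA only in its TD target, which maximizes over next actions instead of using the action actually taken. A minimal tabular sketch under the same assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, reward, done = env.step(a)
            # Off-policy TD target: bootstrap from the greedy action at s_next.
            td_target = reward + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```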
Deep Q-Network
Generate a replay buffer $\{(s_t, a_t, r_{t+1}, s_{t+1})\}$ from policy $\pi$.
Freeze target network $Q_{\pi} (s, a; \bar{\theta})$.
Update the Q-network $Q (s, a; \theta)$ by minimizing the loss:
\[\mathcal{L} (\theta) = \mathbb{E}_{(s, a, r, s') \sim \text{replay buffer}} \Big[\big(r + \gamma \max_{a' \in \mathcal{A}} Q (s', a'; \bar{\theta}) - Q (s, a; \theta)\big)^2\Big].\]
- Policy improvement after $T$ steps.
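A minimal single-update sketch of this loss in PyTorch; the network architecture, batch layout, and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # frozen copy with parameters theta_bar
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(batch):
    """One gradient step on a minibatch sampled from the replay buffer."""
    # s: (B, n_states) float, a: (B,) long, r: (B,) float, s_next: (B, n_states), done: (B,) float mask.
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                       # no gradient through the target network
        target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every T steps: sync the frozen target, target_net.load_state_dict(q_net.state_dict()).
```

The `torch.no_grad()` block is what "freezing" amounts to in practice: gradients flow only through $Q (s, a; \theta)$, never through the target term.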
Policy Gradient
We parameterize the policy by $\pi_{\theta}$ and define the objective function as:
\[J (\theta) = \mathbb{E}_{S \sim d_{\pi}} [V_{\pi_{\theta}} (S)] = \mathbb{E}_{S \sim d_{\pi}} \big[\mathbb{E}_{A \sim \pi_{\theta} (S)} [Q_{\pi_{\theta}} (S, A)]\big],\]
where $d_{\pi}$ is the stationary distribution of the Markov chain induced by $\pi_{\theta}$, defined as:
\[d_{\pi} (s) = \lim_{t \rightarrow \infty} \mathbb{P} (S_t = s \vert S_0 = s_0).\]
The policy gradient theorem states that:
\[\nabla_{\theta} J (\theta) \propto \mathbb{E}_{S \sim d_{\pi}} \big[\mathbb{E}_{A \sim \pi_{\theta} (S)} [Q_{\pi_{\theta}} (S, A) \nabla_{\theta} \ln \pi_{\theta} (A \vert S)]\big].\]
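One common way to use the theorem is REINFORCE, which replaces $Q_{\pi_{\theta}} (S, A)$ with the sampled return $G_t$. A minimal PyTorch sketch of one update from a single finished episode (network size and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2
policy_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One REINFORCE step: ascend sum_t G_t * log pi_theta(a_t | s_t).

    states: list of float tensors of shape (n_states,), actions: list of ints,
    rewards: list of floats, all from one finished episode.
    """
    # Discounted returns G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for rew in reversed(rewards):
        g = rew + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy_net(torch.stack(states))                    # (T, n_actions)
    log_probs = torch.log_softmax(logits, dim=1)
    log_pi = log_probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)

    # Gradient ascent on J(theta) == gradient descent on the negated estimator.
    loss = -(returns * log_pi).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice a baseline is subtracted from $G_t$ (e.g. a learned $V (s_t)$, giving the advantage $A_{\pi}$ from the Background section) to reduce the variance of this estimator without changing its expectation.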