A Shallow Dive into Deep Reinforcement Learning

4 minute read



Once we take an action $A_t \in \mathcal{A}$ starting at a state $S_t \in \mathcal{S}$, we get a reward $R_{t+1} \in \mathbb{R}$ from the environment, and enter a new state $S_{t+1}$. Usually the policy is stochastic, not deterministic, hence all the notions above are random variables. Define the reward function $r$ as the expected reward given a status and action:

\[r: \mathcal{S} \times \mathcal{A} \longrightarrow \mathbb{R}, \; (s, a) \mapsto \mathbb{E} [R_{t+1} \vert S_t = s, A_t = a].\]

A policy tells us which action to take at state $S_t$. It can be either deterministic $\pi (s) = a$, or stochastic $\pi (a\vert s) = \mathbb{P}_{\pi} [A = a \vert S = s]$. The future reward, a.k.a return, is the sum of discounted upcoming rewards:

\[G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k = 0}^{\infty} \gamma^k R_{t+k+1}, \; \gamma \in [0, 1].\]

The action-value, a.k.a Q-value, of a state-action pair is defined as:

\[Q_{\pi} (s, a) = \mathbb{E}_{\pi} [G_t \vert S_t = s, A_t = a].\]

The state-value, a.k.a V-value, of a state is defined as:

\[V_{\pi} (s) = \mathbb{E}_{\pi} [G_t \vert S_t = s] = \sum_{a \in \mathcal{A}} Q_{\pi} (s, a) \pi (a\vert s) = \mathbb{E}_{a \sim \pi (s)} [Q_{\pi} (s, a)].\]

The difference between action-value and state-value is the action advantage function, a.k.a. A-value: $A_{\pi} (s, a) = Q_{\pi} (s, a) - V_{\pi} (s)$.

Markov Decision Process (MDP)

A Markov decision process consists of five elements $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $P$ is the transition probability function defined as:

\[P^a_{ss'} = \mathbb{P} [S_{t+1} = s' \vert S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} \mathbb{P} [S_{t+1} = s', R_{t+1} = r\vert S_t = s, A_t = a],\]

and all states are assumed to be Markov: $\mathbb{P} [S_{t+1} \vert S_t] = \mathbb{P} [S_{t+1} \vert S_1, \ldots, S_t]$.

Bellman Equations

Value Equation


\[V_{\pi} (s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma V_{\pi} (S_{t+1}) \vert S_t = s].\]


\[Q_{\pi} (s, a) = \mathbb{E}_{\pi} [R_{t+1} + \gamma \mathbb{E}_{a' \sim \pi (S_{t+1})} [Q_{\pi} (S_{t+1}, a')] \vert S_t = s, A_t = a].\]

Expectation Equation


\[V_{\pi} (s) = \sum_{a \in \mathcal{A}} \pi (a\vert s) Q_{\pi} (s, a) = \sum_{a \in \mathcal{A}} \pi (a\vert s) \bigg(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s')\bigg).\]


\[Q_{\pi} (s, a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s') = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} \sum_{a' \in \mathcal{A}} \pi (a' \vert s') Q_{\pi} (s', a').\]

Optimality Equation


\[V_* (s) = \max_{a \in \mathcal{A}} Q_* (s, a) = \max_{a \in \mathcal{A}} \bigg(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_* (s')\bigg).\]


\[Q_* (s, a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_* (s') = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} \max_{a' \in \mathcal{A}} Q_* (s', a').\]

Dynamic Programming

  • Policy evaluation:
\[V_{\pi} (s) = \sum_{a \in \mathcal{A}} \pi (a\vert s) \bigg(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s')\bigg).\]
  • Policy improvement:
\[\pi (s) = \arg \max_{a \in \mathcal{A}} Q_{\pi} (s, a) = \arg \max_{a \in \mathcal{A}} \bigg(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P^a_{ss'} V_{\pi} (s')\bigg).\]

Monte-Carlo Methods

  • Generate complete episodes ${(S_t, A_t, R_{t+1}, S_{t+1})}$ from policy $\pi$, and compute returns $G_t = \sum_{k = 0}^{T-t-1} \gamma^k R_{t+k+1}$.

  • Policy evaluation:

\[Q_{\pi} (s,a) = \frac{\sum_{t = 1}^T \mathbf{1}_{\{S_t = s, A_t = a\}} G_t}{\sum_{t = 1}^T \mathbf{1}_{\{S_t = s, A_t = a\}}}.\]
  • Policy improvement:
\[\pi (s) = \arg \max_{a \in \mathcal{A}} Q_{\pi} (s, a).\]

Temporal-Difference (TD) Learning


  • Generate episode $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ from policy $\pi$.

  • Policy evaluation:

\[Q_{\pi} (s_t, a_t) \leftarrow Q_{\pi} (s_t, a_t) + \alpha (r_{t+1} + \gamma Q_{\pi} (s_{t+1} , a_{t+1}) - Q_{\pi} (s_t, a_t)).\]
  • Policy improvement after $T$ steps.


  • Generate episode $(s_t, a_t, r_{t+1}, s_{t+1})$ from policy $\pi$.

  • Policy evaluation:

\[Q_{\pi} (s_t, a_t) \leftarrow Q_{\pi} (s_t, a_t) + \alpha (r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q_{\pi} (s_{t+1} , a) - Q_{\pi} (s_t, a_t)).\]
  • Policy improvement after $T$ steps.

Deep Q-Network

  • Generate replay buffer ${(s_t, a_t, r_{t+1}, s_{t+1})}$ from policy $\pi$.

  • Freeze target network $Q_{\pi} (s, a; \bar{\theta})$.

  • Update Q-network by minimizing loss:

\[L (\theta) = \frac{1}{T} \sum_{t = 1}^T l(r_{t+1} + \gamma \max_{a'} Q_{\pi} (s_{t+1}, a'; \bar{\theta}), Q_{\pi} (s_t, a_t; \theta)).\]
  • Policy improvement after $T$ steps.

Policy Gradient

We parameterize the policy by $\pi_{\theta}$ and define the reward function as:

\[J (\theta) = \mathbb{E}_{S \sim d_{\pi}} [V_{\pi_{\theta}} (S)] = \mathbb{E}_{S \sim d_{\pi}} \big[\mathbb{E}_{A \sim \pi_{\theta} (S)} [Q_{\pi_{\theta}} (S, A)]\big],\]

where $d_{\pi}$ is the stationary distribution of Markov chain for $\pi_{\theta}$, defined as:

\[d_{\pi} (s) = \lim_{t \rightarrow \infty} \mathbb{P} (S_t = s \vert S_0 = s_0).\]

The policy gradient theorem states that:

\[\nabla_{\theta} J (\theta) \propto \mathbb{E}_{S \sim d_{\pi}} \big[\mathbb{E}_{A \sim \pi_{\theta} (S)} [Q_{\pi_{\theta}} (S, A) \nabla_{\theta} \ln \pi_{\theta} (A \vert S)]\big].\]