Bellman Equation and Its Optimality

In solving MDP problems and in optimal control theory, the Bellman equation is a crucial tool.

Bellman Operator

For an MDP, let \(s\) be the system state, \(a\) an action, and \(Pr(s'|s, a)\) the probability of transitioning from state \(s\) to \(s'\) when taking action \(a\). Let \(Q(s)\) be the expected long-term (discounted) reward starting from state \(s\), \(r(s, a)\) the instantaneous reward for taking action \(a\) in state \(s\), and \(\gamma \in [0, 1)\) the discount factor. The Bellman operator \(B\) acts as

\[B~Q(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a)\, Q(s') \Big]\]
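To make the operator concrete, here is a minimal tabular sketch (an illustration, not code from the original text). It assumes the MDP is stored as a NumPy array `P[s, a, s']` of transition probabilities and a reward matrix `r[s, a]`, with the value estimate `Q` kept as a vector indexed by state, matching the \(Q(s)\) notation above; all names are assumptions.

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """One application of the Bellman optimality operator B.

    Q     : (S,)      current estimate of the long-term reward per state
    P     : (S, A, S) transition probabilities, P[s, a, s'] = Pr(s'|s, a)
    r     : (S, A)    instantaneous reward r(s, a)
    gamma : discount factor in [0, 1)
    """
    # r(s, a) + gamma * sum_{s'} Pr(s'|s, a) Q(s') for every (s, a) pair
    q_sa = r + gamma * np.einsum("sax,x->sa", P, Q)
    # maximise over actions to obtain (B Q)(s)
    return q_sa.max(axis=1)
```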

Optimality

Prerequisites:

If the Bellman operator \(B\) is a contraction mapping, the Banach fixed-point theorem guarantees that repeatedly applying \(B\) converges to a unique fixed point. It is indeed a contraction: under the infinity (sup) norm, its Lipschitz constant is \(\gamma\).
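In other words, the optimal value function can be reached by simply iterating \(B\), with the sup-norm error shrinking geometrically at rate \(\gamma\). Below is a hedged sketch of this value-iteration loop, reusing the same tabular representation assumed in the earlier sketch (`P[s, a, s']` and `r[s, a]` are illustrative names, not from the original text).

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman operator; the contraction property makes this converge."""
    Q = np.zeros(P.shape[0])                     # any initial guess works
    for _ in range(max_iter):
        # One application of B, as in the bellman_operator sketch above
        Q_new = (r + gamma * np.einsum("sax,x->sa", P, Q)).max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:      # infinity-norm stopping rule
            return Q_new
        Q = Q_new
    return Q
```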

\[\begin{split}\begin{align}
\|B Q_1 - B Q_2\|_\infty &= \max_s \Big| \max_a \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_1(s') \big] - \max_{a'} \big[ r(s, a') + \gamma \sum_{s'} Pr(s'|s, a') Q_2(s') \big] \Big| \\
&\leq \max_s \max_a \Big| \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_1(s') \big] - \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_2(s') \big] \Big| \\
&= \gamma \max_s \max_a \Big| \sum_{s'} Pr(s'|s, a) \big( Q_1(s') - Q_2(s') \big) \Big| \\
&\leq \gamma \max_s \max_a \sum_{s'} Pr(s'|s, a) \big| Q_1(s') - Q_2(s') \big| \\
&\leq \gamma \, \|Q_1 - Q_2\|_\infty \max_s \max_a \sum_{s'} Pr(s'|s, a) \\
&= \gamma \, \|Q_1 - Q_2\|_\infty
\end{align}\end{split}\]

The first inequality uses the fact that \(|\max_a f(a) - \max_{a'} g(a')| \leq \max_a |f(a) - g(a)|\), and the final equality holds because \(\sum_{s'} Pr(s'|s, a) = 1\) for every \(s\) and \(a\).
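The inequality can also be checked numerically. Below is a small sanity-check sketch on a randomly generated MDP (the array layout and names are assumptions carried over from the earlier sketches): for any two value functions, the sup-norm distance after applying \(B\) should shrink by at least a factor of \(\gamma\).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9

# Random MDP: rewards r[s, a] and a row-stochastic transition tensor P[s, a, s']
r = rng.uniform(size=(S, A))
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

def B(Q):
    # Bellman optimality operator as defined earlier
    return (r + gamma * np.einsum("sax,x->sa", P, Q)).max(axis=1)

Q1, Q2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(B(Q1) - B(Q2)))     # ||B Q1 - B Q2||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))   # gamma * ||Q1 - Q2||_inf
print(lhs <= rhs)                       # expected to print True
```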