Bellman Equation and Its Optimality

In solving MDP problems and in optimal control theory, the Bellman equation is a crucial tool.

Bellman Operator

For an MDP, let \(s\) be the system state, \(a\) an action, and \(Pr(s'|s, a)\) the probability of transitioning from state \(s\) to \(s'\) when taking action \(a\). Let \(Q(s)\) be the expected long-term (discounted) reward starting from state \(s\), \(r(s, a)\) the instantaneous reward for taking action \(a\) in state \(s\), and \(\gamma \in [0, 1)\) the discount factor. The Bellman operator \(B\) acts as

\[B~Q(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a)\, Q(s') \Big]\]
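To make the operator concrete, here is a minimal tabular sketch (an illustration, not code from the original text). It assumes the MDP is stored as a NumPy array `P[s, a, s']` of transition probabilities and a reward matrix `r[s, a]`, with the value estimate `Q` kept as a vector indexed by state, matching the \(Q(s)\) notation above; all names are assumptions.

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """One application of the Bellman optimality operator B.

    Q     : (S,)      current estimate of the long-term reward per state
    P     : (S, A, S) transition probabilities, P[s, a, s'] = Pr(s'|s, a)
    r     : (S, A)    instantaneous reward r(s, a)
    gamma : discount factor in [0, 1)
    """
    # r(s, a) + gamma * sum_{s'} Pr(s'|s, a) Q(s') for every (s, a) pair
    q_sa = r + gamma * np.einsum("sax,x->sa", P, Q)
    # maximise over actions to obtain (B Q)(s)
    return q_sa.max(axis=1)
```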

Optimality

Prerequisites:

If the Bellman operator \(B\) is a contraction mapping, the Banach fixed-point theorem guarantees that repeatedly applying \(B\) converges to a unique fixed point. It is indeed a contraction: under the infinity (sup) norm, its Lipschitz constant is \(\gamma\).
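In other words, the optimal value function can be reached by simply iterating \(B\), with the sup-norm error shrinking geometrically at rate \(\gamma\). Below is a hedged sketch of this value-iteration loop, reusing the same tabular representation assumed in the earlier sketch (`P[s, a, s']` and `r[s, a]` are illustrative names, not from the original text).

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman operator; the contraction property makes this converge."""
    Q = np.zeros(P.shape[0])                     # any initial guess works
    for _ in range(max_iter):
        # One application of B, as in the bellman_operator sketch above
        Q_new = (r + gamma * np.einsum("sax,x->sa", P, Q)).max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:      # infinity-norm stopping rule
            return Q_new
        Q = Q_new
    return Q
```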

\[\begin{split}\begin{align}
\|B Q_1 - B Q_2\|_\infty &= \max_s \Big| \max_a \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_1(s') \big] - \max_{a'} \big[ r(s, a') + \gamma \sum_{s'} Pr(s'|s, a') Q_2(s') \big] \Big| \\
&\leq \max_s \max_a \Big| \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_1(s') \big] - \big[ r(s, a) + \gamma \sum_{s'} Pr(s'|s, a) Q_2(s') \big] \Big| \\
&= \gamma \max_s \max_a \Big| \sum_{s'} Pr(s'|s, a) \big( Q_1(s') - Q_2(s') \big) \Big| \\
&\leq \gamma \max_s \max_a \sum_{s'} Pr(s'|s, a) \big| Q_1(s') - Q_2(s') \big| \\
&\leq \gamma \, \|Q_1 - Q_2\|_\infty \max_s \max_a \sum_{s'} Pr(s'|s, a) \\
&= \gamma \, \|Q_1 - Q_2\|_\infty
\end{align}\end{split}\]

The first inequality uses the fact that \(|\max_a f(a) - \max_{a'} g(a')| \leq \max_a |f(a) - g(a)|\), and the final equality holds because \(\sum_{s'} Pr(s'|s, a) = 1\) for every \(s\) and \(a\).
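The inequality can also be checked numerically. Below is a small sanity-check sketch on a randomly generated MDP (the array layout and names are assumptions carried over from the earlier sketches): for any two value functions, the sup-norm distance after applying \(B\) should shrink by at least a factor of \(\gamma\).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9

# Random MDP: rewards r[s, a] and a row-stochastic transition tensor P[s, a, s']
r = rng.uniform(size=(S, A))
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

def B(Q):
    # Bellman optimality operator as defined earlier
    return (r + gamma * np.einsum("sax,x->sa", P, Q)).max(axis=1)

Q1, Q2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(B(Q1) - B(Q2)))     # ||B Q1 - B Q2||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))   # gamma * ||Q1 - Q2||_inf
print(lhs <= rhs)                       # expected to print True
```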