In general, RL algorithms iterate over two steps:
- Policy evaluation: For a given policy π, the value of every state Vπ(s) or every state-action pair Qπ(s,a) is calculated or estimated.
- Policy improvement: From the current estimates Vπ(s) or Qπ(s,a), a new, better policy π′ is derived.
After enough iterations, the policy converges to the optimal policy (if the states are Markov).
This alternation between policy evaluation and policy improvement is called generalized policy iteration (GPI).
One particular form of GPI is dynamic programming, where the Bellman equations are used to evaluate a policy.
Policy Evaluation
When the environment model is known, the Bellman equation for the state-value function vπ(s) yields a system of |S| linear equations in the |S| unknowns vπ(s). Rather than solving this system directly, we can turn the Bellman expectation equation into an iterative update rule; the resulting algorithm is known as iterative policy evaluation.
$$
\begin{aligned}
&\text{Loop:} \\
&\quad \Delta \leftarrow 0 \\
&\quad \text{Loop for each } s \in \mathcal{S}: \\
&\qquad v \leftarrow V(s) \\
&\qquad V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr] \\
&\qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\text{until } \Delta < \theta
\end{aligned}
$$
In practice, we consider the process converged when the maximum change Δ is less than a small positive threshold θ.
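As a concrete illustration, here is a minimal Python sketch of iterative policy evaluation. It assumes a tabular MDP whose dynamics are stored as nested lists `P[s][a]` of `(prob, next_state, reward)` triples and a stochastic policy `pi[s][a]`; these names and the representation are illustrative choices, not fixed by the text.

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation on a tabular MDP.

    P[s][a]  : list of (prob, next_state, reward) triples, i.e. p(s', r | s, a)
    pi[s][a] : probability of taking action a in state s under the policy
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # V(s) <- sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]
            V[s] = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```

As in the pseudocode, the updates are done in place, so later states in a sweep already see the new values of earlier states.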
Policy Improvement
Given an existing policy π, how can we find a better policy π′? Consider a deterministic policy. Suppose that in a single state s we change the action to π′(s) = a ≠ π(s), while keeping the policy unchanged in all other states s′ ≠ s. If qπ(s,a) > vπ(s), then the new policy π′ is better than π. This is actually a special case of the policy improvement theorem stated below. If for all s ∈ S the following condition holds:
$$
q_\pi(s, \pi'(s)) \ge v_\pi(s)
$$
Then the policy π′ must be as good as, or better than, π. That is, for all s∈S:
$$
v_{\pi'}(s) \ge v_\pi(s)
$$
The policy improvement theorem can be derived as follows:
$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2}) \mid S_{t+1}, A_{t+1} = \pi'(S_{t+1})] \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t = s\bigr] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s\bigr] \\
&= v_{\pi'}(s)
\end{aligned}
$$
With the policy improvement theorem, we can use a greedy approach to obtain a new, better policy. The theorem guarantees that choosing actions greedily with respect to the value function of the current policy leads to an improved (or equal) policy.
$$
\begin{aligned}
\pi'(s) &\doteq \operatorname*{arg\,max}_a q_\pi(s, a) \\
&= \operatorname*{arg\,max}_a \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\bigr] \\
&= \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]
\end{aligned}
$$
When the greedy policy π′ is no longer strictly better than the previous policy π (i.e., vπ′ = vπ), then vπ′ must satisfy the Bellman optimality equation:
$$
v_{\pi'}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_{\pi'}(s')\bigr]
$$
This means vπ′=v∗, and thus π′ is an optimal policy π∗.
Note that the specific calculation shown above is for deterministic policies, but the derivation of the policy improvement theorem holds for stochastic policies as well.
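Continuing the same assumed representation (`P[s][a]` as `(prob, next_state, reward)` triples), a greedy improvement step might look like the following sketch, which performs the one-step lookahead from the formula above:

```python
import numpy as np

def greedy_policy(P, V, gamma=0.9):
    """Greedy improvement: pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')].

    Returns a deterministic policy as an array of action indices.
    """
    n_states = len(P)
    pi_new = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # One-step lookahead: action values computed from the model and the current V
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        pi_new[s] = int(np.argmax(q))
    return pi_new
```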
Policy Iteration
Combining policy evaluation and policy improvement gives us the policy iteration algorithm. Since the policy improvement step guarantees improvement until the optimal policy is reached (unless already optimal), and a finite Markov Decision Process (MDP) has a finite number of policies, policy iteration is guaranteed to converge to an optimal policy in a finite number of iterations.
Note that during the policy evaluation phase, initializing V(s) with the value function from the previous policy can speed up convergence.
$$
\begin{aligned}
&\textbf{1. Initialization} \\
&\quad V(s) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily for all } s \in \mathcal{S} \\
&\textbf{2. Policy Evaluation} \\
&\quad \text{Loop:} \\
&\qquad \Delta \leftarrow 0 \\
&\qquad \text{Loop for each } s \in \mathcal{S}: \\
&\qquad\quad v \leftarrow V(s) \\
&\qquad\quad V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\bigl[r + \gamma V(s')\bigr] \quad \text{// using the current policy } \pi \\
&\qquad\quad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\quad \text{until } \Delta < \theta \quad \text{// } V \approx v_\pi \\
&\textbf{3. Policy Improvement} \\
&\quad \textit{policy-stable} \leftarrow \textit{true} \\
&\quad \text{For each } s \in \mathcal{S}: \\
&\qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad \pi(s) \leftarrow \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr] \\
&\qquad \text{If } \textit{old-action} \ne \pi(s), \text{ then } \textit{policy-stable} \leftarrow \textit{false} \\
&\quad \text{If } \textit{policy-stable}, \text{ then stop and return } V \approx v_* \text{ and } \pi \approx \pi_*; \text{ else go to 2}
\end{aligned}
$$
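Putting the two steps together, here is a self-contained Python sketch of policy iteration under the same assumed MDP representation as before. Note that `V` is carried over between iterations rather than reinitialized, per the remark above:

```python
import numpy as np

def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Policy iteration on a tabular MDP with P[s][a] = [(prob, next_state, reward), ...]."""
    n_states = len(P)
    pi = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
    V = np.zeros(n_states)               # reused across iterations to speed up evaluation
    while True:
        # 2. Policy evaluation for the current deterministic policy pi
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement: act greedily with respect to V
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(len(P[s]))]
            pi[s] = int(np.argmax(q))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi
```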
Value Iteration
A potential drawback of policy iteration is that each iteration involves a full policy evaluation phase, which can be computationally expensive as it requires multiple sweeps through the state space until convergence. In fact, we can truncate the policy evaluation step after just one sweep (one update for each state). This leads to the value iteration algorithm:
$$
v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]
$$
This update rule can also be seen as applying the Bellman optimality equation directly as an update. It can be proven that this iterative method converges to the optimal value function v∗. To show this, we define the Bellman optimality operator T:
$$
(T v_k)(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]
$$
So the update rule is vk+1=Tvk.
We introduce the concept of a contraction mapping. An operator O mapping a Banach space (such as the space of value functions) to itself is a γ-contraction, for some γ ∈ [0,1), if for any two elements V, V′ of the space, $\|OV - OV'\| \le \gamma \|V - V'\|$, where $\|\cdot\|$ is a norm. Here we use the max norm (or infinity norm), $\|x\|_\infty = \max_s |x(s)|$.
We prove that the Bellman optimality operator T is a γ-contraction in the max norm:
$$
\begin{aligned}
\|Tv - Tv'\|_\infty
&= \max_{s \in \mathcal{S}} \Bigl| \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v(s')\bigr] - \max_{a'} \sum_{s', r} p(s', r \mid s, a')\bigl[r + \gamma v'(s')\bigr] \Bigr| \\
&\le \max_{s \in \mathcal{S}} \max_a \Bigl| \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v(s')\bigr] - \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v'(s')\bigr] \Bigr| \\
&= \max_{s \in \mathcal{S}} \max_a \Bigl| \sum_{s', r} p(s', r \mid s, a)\bigl[\gamma v(s') - \gamma v'(s')\bigr] \Bigr| \\
&= \gamma \max_{s \in \mathcal{S}} \max_a \Bigl| \sum_{s', r} p(s', r \mid s, a)\bigl(v(s') - v'(s')\bigr) \Bigr| \\
&\le \gamma \max_{s \in \mathcal{S}} \max_a \sum_{s', r} p(s', r \mid s, a)\bigl|v(s') - v'(s')\bigr| \\
&\le \gamma \max_{s \in \mathcal{S}} \max_a \sum_{s', r} p(s', r \mid s, a) \max_{s''}\bigl|v(s'') - v'(s'')\bigr| \\
&= \gamma \|v - v'\|_\infty \max_{s \in \mathcal{S}} \max_a \sum_{s', r} p(s', r \mid s, a) \\
&= \gamma \|v - v'\|_\infty \qquad \Bigl(\text{since } \sum_{s', r} p(s', r \mid s, a) = 1\Bigr)
\end{aligned}
$$
The key step is the first inequality, which uses the fact that $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$; this follows by applying $\max_x f(x) - \max_y g(y) \le \max_x \bigl(f(x) - g(x)\bigr)$ with the roles of $f$ and $g$ swapped as well.
Since T is a γ-contraction and the optimal value function v∗ is its unique fixed point (Tv∗=v∗), the Banach fixed-point theorem guarantees that the sequence vk defined by vk+1=Tvk converges to v∗ for any starting v0. Specifically, when v′=v∗:
$$
\|v_{k+1} - v_*\|_\infty = \|T v_k - T v_*\|_\infty \le \gamma \|v_k - v_*\|_\infty \le \cdots \le \gamma^{k+1} \|v_0 - v_*\|_\infty
$$
Therefore, when γ < 1, $\lim_{k \to \infty} v_k = v_*$.
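For a concrete sense of this geometric rate (the numbers here are purely illustrative): with $\gamma = 0.9$ and an initial error $\|v_0 - v_*\|_\infty = 10$,

$$
\|v_k - v_*\|_\infty \le 10 \cdot 0.9^{\,k} < 0.01 \quad\Longleftrightarrow\quad k > \frac{\ln 1000}{\ln(1/0.9)} \approx 65.6,
$$

so about 66 sweeps are enough to guarantee a maximum error below 0.01.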
The overall algorithm is as follows:
$$
\begin{aligned}
&\text{Initialize } V(s) \in \mathbb{R} \text{ arbitrarily for all } s \in \mathcal{S} \\
&\text{Loop:} \\
&\quad \Delta \leftarrow 0 \\
&\quad \text{Loop for each } s \in \mathcal{S}: \\
&\qquad v \leftarrow V(s) \\
&\qquad V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr] \\
&\qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\text{until } \Delta < \theta \quad (\text{a small positive number determining accuracy}) \\
&\text{Output a deterministic policy } \pi \approx \pi_* \text{ such that} \\
&\quad \pi(s) = \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr]
\end{aligned}
$$
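Under the same assumed MDP representation as before, a minimal sketch of value iteration, including the final greedy policy extraction, might look like this:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration on a tabular MDP with P[s][a] = [(prob, next_state, reward), ...]."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality update: V(s) <- max_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the (approximately) optimal values
    pi = np.array([int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                                  for a in range(len(P[s]))]))
                   for s in range(n_states)])
    return V, pi
```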
Asynchronous Dynamic Programming
A major drawback of the dynamic programming methods discussed (policy iteration and value iteration) is that they involve iterative sweeps through the entire state set S. If the state set is very large, even a single sweep can be prohibitively expensive.
Asynchronous dynamic programming algorithms are in-place iterative DP algorithms that are not organized around systematic sweeps over the state set. They update the values of states in any order, using whatever values of other states happen to be available. This offers great flexibility in selecting which states to update: it can significantly improve efficiency when some state values converge faster than others, and it allows computation to be focused on the states most relevant to the agent's current behavior or situation (e.g., updating only the states visited in an ongoing episode). For convergence, it is typically required that every state continues to be updated eventually, so that no state is neglected forever.
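The sketch below shows one simple instantiation of this idea, again under the assumed `P[s][a]` representation: states are picked at random and updated in place with the Bellman optimality update. Other selection rules (prioritizing states with large Bellman errors, or following the agent's own trajectories) fit the same template, and the stopping test here is only a heuristic.

```python
import random
import numpy as np

def async_value_iteration(P, gamma=0.9, theta=1e-8, max_updates=1_000_000):
    """Asynchronous (in-place) value iteration: one state is updated at a time, in arbitrary order."""
    n_states = len(P)
    V = np.zeros(n_states)
    residual = np.full(n_states, np.inf)   # size of the last update applied to each state
    for _ in range(max_updates):
        s = random.randrange(n_states)     # any selection rule works, e.g. states the agent actually visits
        v = V[s]
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in range(len(P[s])))
        residual[s] = abs(v - V[s])
        # Heuristic stop: every state has been updated at least once and its last update was tiny
        if residual.max() < theta:
            break
    return V
```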
Efficiency of Dynamic Programming
Dynamic programming methods can be computationally efficient for solving MDP planning problems when a perfect model is available. Their worst-case time complexity is polynomial in the number of states |S| and actions |A|. This compares favorably with direct search in policy space, which is exponential, and with linear programming methods, especially for large state spaces; DP still struggles with extremely large state spaces, however (the "curse of dimensionality"). Furthermore, in practice, DP algorithms such as value iteration often converge much faster than their theoretical worst-case bounds suggest.
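For a rough sense of these scales: a single sweep of value iteration touches every state-action pair and sums over the possible successors, costing on the order of |S|²·|A| operations per sweep, whereas a naive direct search would have to compare up to |A|^|S| deterministic policies.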