The MDPDist Mathematica package
On-the-fly exact computation of bisimilarity distances for Markov Decision Processes with rewards
Authors: Giorgio Bacci, Giovanni Bacci, Kim G. Larsen, and Radu Mardare
In this tutorial we present the main features of the MDPDist package. This Mathematica package implements an on-the-fly method for computing the bisimilarity pseudometric of Ferns et al. [1] for Markov Decision Processes with rewards. The algorithm implemented in this package is an extension of the on-the-fly method recently proposed in [2] for Markov Chains.
Preliminaries
Markov Decision Processes and Bisimulation
A probability distribution over a nonempty enumerable set S is a function μ : S → [0,1] such that ∑_{s ∈ S} μ(s) = 1. We denote by Δ(S) the set of probability distributions over S.
Definition (Markov Decision Process). A Markov Decision Process with rewards (MDP) is a tuple M =(S,A,τ,ρ) consisting of a finite nonempty set S of states, a finite nonempty set A of actions, a transition function τ : S × A → Δ(S), and a reward function ρ : S × A → ℝ.
The operational behavior of an MDP M =(S,A,τ,ρ) is as follows: the process in state s₀ nondeterministically chooses an action a ∈ A and changes state to s₁ with probability τ(s₀,a)(s₁). The choice of the action a made in s₀ is rewarded by ρ(s₀,a). The executions are transition sequences w = (s₀,a₀),(s₁,a₁),…; the challenge is to find strategies for choosing the actions in order to maximize the accumulated reward ∑_{i≥0} λⁱ ρ(sᵢ,aᵢ), where λ ∈ (0,1) is the discount factor. A strategy is given by a function π : S → Δ(A), called policy, where π(s)(a) is the probability of choosing the action a in state s; it induces a probability distribution over executions defined, for arbitrary w = (s₀,a₀),(s₁,a₁),…, by P^π(w) = ∏_{i≥0} π(sᵢ)(aᵢ) τ(sᵢ,aᵢ)(sᵢ₊₁). The value of s according to π is characterized by V^π(s) = E_{w ∈ Ω(s)}[∑_{i≥0} λⁱ ρ(sᵢ,aᵢ)], where Ω(s) is the set of executions from s.
Definition (Stochastic Bisimulation). Let M =(S,A,τ,ρ) be an MDP. An equivalence relation R ⊆ S × S is a stochastic bisimulation if whenever (s,t) ∈ R then, for all a ∈ A, ρ(s,a) = ρ(t,a) and, for all R-equivalence classes C, τ(s,a)(C) = τ(t,a)(C). Two states s,t ∈ S are stochastically bisimilar, written s ∼ t, if they are related by some stochastic bisimulation on M.
Bisimilarity Pseudometric
To cope with the problem of measuring how similar two states of an MDP are, Ferns et al. [1] defined a bisimilarity pseudometric that measures the behavioral similarity of non-bisimilar states. It is defined as the least fixed point of a transformation operator on functions in S × S → [0,∞).
Consider an arbitrary MDP M =(S,A,τ,ρ) and a discount factor λ ∈ (0,1). The set S × S → [0,∞) of [0,∞)-valued maps on S × S is equipped with the point-wise partial order defined by d ⊑ d′ ⇔ d(s,t) ≤ d′(s,t), for all s,t ∈ S.
We define the fixed point operator F_λ, for d : S × S → [0,∞) and s,t ∈ S, as

F_λ(d)(s,t) = max_{a ∈ A} ( |ρ(s,a) − ρ(t,a)| + λ T_d(τ(s,a), τ(t,a)) ),

where T_d(τ(s,a), τ(t,a)) denotes the optimal value of a Transportation Problem with marginals τ(s,a) and τ(t,a) and costs given by d.
F_λ is monotonic; thus, by Tarski's fixed point theorem, it admits a least fixed point. This fixed point is the bisimilarity pseudometric.
Definition (Bisimilarity pseudometric). Let M =(S,A,τ,ρ) be an MDP and λ ∈ (0,1) be a discount factor; then the λ-discounted bisimilarity pseudometric for M, written δ_λ, is the least fixed point of F_λ.
The pseudometric δ_λ enjoys the property that two states are at distance zero if and only if they are bisimilar.
Getting started
The package can be loaded by setting the current directory to the directory of the MDPDist package, and then using the command << (Get) as follows
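A minimal loading sketch (assuming the notebook is saved alongside the package and that the package context is MDPDist`, after the package name):

    SetDirectory[NotebookDirectory[]];  (* move to the package directory *)
    << MDPDist`                         (* load the package *)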
Encoding and manipulating Markov Decision Processes
Encoding an MDP
An MDP M =(S,A,τ,ρ) with S = {s₁,…,sₙ} and A = {a₁,…,aₘ} is represented as a term of the form MDP[tm,rw,act] where
- tm is an n × m × n probability matrix such that tm[[i,k,j]] = τ(sᵢ,aₖ)(sⱼ) for all 1 ≤ i,j ≤ n and all 1 ≤ k ≤ m,
- rw is an n × m real-valued matrix such that rw[[i,k]] = ρ(sᵢ,aₖ) for all 1 ≤ i ≤ n and all 1 ≤ k ≤ m, and
- act is a list of pairwise distinct strings such that act[[k]] represents the action aₖ for all 1 ≤ k ≤ m.
A state sᵢ is implicitly represented by its index 1 ≤ i ≤ n.
Assume we want to encode the following MDP M =(S,A,τ,ρ) with S = {1,2,3,4,5,6} and A = {inc, dec}, with the following intuitive operational behavior. Given M in state i, the choice of the action inc causes the MDP to move to state i+1 with probability 0.8, and to stay in state i or to move to state i−1 each with probability 0.1. The choice of the action dec causes the MDP to move to state i−1 with probability 0.7, to stay in state i with probability 0.1, and to move to state i+1 with probability 0.2. The reward of choosing an inc action is always −1, whereas the reward for a dec action is 5 if it is made from state 5 or 6, and 0 otherwise.
Mathematica makes it easy to define functions and lists, and we can exploit these features to define the data structures needed to encode M.
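Below is a sketch of such definitions; the clamping of probability mass at the boundary states 1 and 6 is our assumption, since the description above leaves their behavior open:

    A = {"inc", "dec"};  (* the actions *)
    n = 6;               (* the number of states *)

    (* τ[i, a] returns the successor distribution of state i under action a
       as a length-n list; probability mass that would leave {1, ..., n} is
       clamped onto the nearest boundary state (an assumption) *)
    τ[i_, a_] := Module[{row = ConstantArray[0, n], moves},
      moves = If[a == "inc",
        {1 -> 0.8, 0 -> 0.1, -1 -> 0.1},
        {-1 -> 0.7, 0 -> 0.1, 1 -> 0.2}];
      Do[row[[Clip[i + First[mv], {1, n}]]] += Last[mv], {mv, moves}];
      row];

    (* ρ[i, a] is the reward for choosing action a in state i *)
    ρ[i_, "inc"] := -1;
    ρ[i_, "dec"] := If[i >= 5, 5, 0];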
A represents the set of actions, τ is the transition function and ρ is the reward function.
One can obtain the respective transition and reward matrix using the built-in function Table as follows:
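A sketch using the definitions above:

    m = Length[A];
    tm = Table[τ[i, A[[k]]], {i, n}, {k, m}];   (* n × m × n transition matrix *)
    rw = Table[ρ[i, A[[k]]], {i, n}, {k, m}];   (* n × m reward matrix *)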
Then M is encoded by
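For instance (the head MDP is the term form described above):

    mdp = MDP[tm, rw, A];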
Alternatively, one may use the functions MDPtm and MDPrw to construct the transition and the reward matrices by listing only the values different from zero. Given a list of rules tmRules = {{i, k, j} → τ(sᵢ,aₖ)(sⱼ), …} collecting the nonzero transition probabilities, the probability transition matrix induced by the probability transition function τ may be constructed by calling MDPtm[tmRules, n, m].
Analogously, for rwRules = {{i, k} → ρ(sᵢ,aₖ), …}, one can construct the reward matrix induced by the reward function ρ by calling MDPrw[rwRules, n, m].
The list of rules may be given explicitly or by using patterns. For example, the matrices tm and rw defined above can be constructed as follows.
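A sketch, assuming MDPtm and MDPrw accept pattern-based rules of the shape described above; the extra rules needed to reproduce the clamping at the boundary states are omitted for brevity:

    tmRules = {
      {i_, 1, j_} /; j == i + 1 -> 0.8,   (* inc: move up *)
      {i_, 1, i_} -> 0.1,                 (* inc: stay *)
      {i_, 1, j_} /; j == i - 1 -> 0.1,   (* inc: move down *)
      {i_, 2, j_} /; j == i - 1 -> 0.7,   (* dec: move down *)
      {i_, 2, i_} -> 0.1,                 (* dec: stay *)
      {i_, 2, j_} /; j == i + 1 -> 0.2};  (* dec: move up *)
    rwRules = {{_, 1} -> -1, {i_, 2} /; i >= 5 -> 5};

    tm2 = MDPtm[tmRules, 6, 2];
    rw2 = MDPrw[rwRules, 6, 2];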
Manipulating MDPs
An MDP is displayed by the function PlotMDP. States are represented as white nodes labeled with their respective index. Red edges exiting from each node represent a nondeterministic choice of an action. The chosen action is displayed as a colored node labeled with its associated reward, called a choice node. Edges exiting from a choice node represent the probability distribution obtained after a nondeterministic choice. For example, the MDP mdp defined above is displayed as follows
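A sketch of the call:

    PlotMDP[mdp]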
Given a sequence of MDPs M₁, …, Mₖ, JoinMDP[M₁, …, Mₖ] yields an MDP representing their disjoint union, where the indices representing the set of states are obtained by shifting the indices of the states of its arguments according to their relative order in the sequence (e.g., if M₁ has n states, the index corresponding to the i-th state of M₂ in JoinMDP[M₁, M₂] is n + i).
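For example, joining mdp with itself (the states of the second copy become indices 7 through 12):

    joined = JoinMDP[mdp, mdp];
    PlotMDP[joined]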
Computing Bisimilarity Distances
Given an MDP M, a discount factor λ, and a list Q of state pairs, BDistMDP[M,λ,Q] computes the distance associated with each pair of states (s,t) in Q. For example, taking the MDP mdp considered so far, a discount factor 0.5, and the single-pair query Q = {(1,2)}, we have
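A sketch of the call, assuming state pairs are given as lists {s, t}:

    BDistMDP[mdp, 0.5, {{1, 2}}]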
while for the query Q = {(1,2),(3,4),(6,5)} we have
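Analogously:

    BDistMDP[mdp, 0.5, {{1, 2}, {3, 4}, {6, 5}}]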
If one is interested in computing the distance between all state pairs, one can use the alias All as follows
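    BDistMDP[mdp, 0.5, All]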
The computation can be traced by setting the option Verbose to True. This will display the couplings encountered during each iteration of the on-the-fly algorithm. The option can also be used to analyze how the algorithm explores the state space of the MDP.
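For instance:

    BDistMDP[mdp, 0.5, {{1, 2}}, Verbose -> True]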
One can provide distance estimates in order to further reduce the exploration of the state space. This can be done using the option Estimates.
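A hypothetical sketch; the exact shape of the value expected by Estimates is an assumption here (a list of rules mapping state pairs to known over-approximations of their distance):

    BDistMDP[mdp, 0.5, {{1, 2}}, Estimates -> {{3, 4} -> 1}]  (* hypothetical format *)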
Exact Computations
Mathematica allows one to perform exact computations over rational numbers. Usually the Bellman-optimality equation system is solved approximately, since it is in general a non-linear equation system. Despite this fact, our method is able to compute the exact result when the input MDP is rational-valued (i.e., both the transition function and the reward function are rational-valued).
For example, the MDP we considered so far can be rationalized as follows
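A sketch using the built-in Rationalize (the name exactMdp is illustrative):

    exactMdp = MDP[Rationalize[tm, 0], Rationalize[rw, 0], A];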
Now the exact value of the distance can be computed by setting the option Exact to True
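For instance (note that the discount factor is given as an exact rational as well):

    BDistMDP[exactMdp, 1/2, {{1, 2}}, Exact -> True]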
Bisimilarity Classes & Quotient
Due to the fact that δ_λ(s,t) equals zero if and only if the states s and t are bisimilar, we can use the algorithm for computing δ_λ to compute the bisimilarity classes and the bisimilarity quotient of an MDP.
The function BisimClassesMDP[M] returns the set of bisimilarity classes of M, that is S/∼.
The function BisimQuotientMDP[M] yields an MDP representing the bisimilarity quotient of M.
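For example (a sketch; the exact shape of the returned partition is not fixed here):

    BisimClassesMDP[mdp]    (* the partition S/∼ *)
    BisimQuotientMDP[mdp]   (* an MDP with one state per bisimilarity class *)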