The MDPDist Mathematica package

On-the-fly exact computation of bisimilarity distances for Markov Decision Processes with rewards

Authors: Giorgio Bacci, Giovanni Bacci, Kim G. Larsen, and Radu Mardare

In this tutorial we present the main features of the MDPDist package. This Mathematica package implements an on-the-fly method for computing the bisimilarity pseudometric of Ferns et al. [1] for Markov Decision Processes with rewards. The algorithm implemented in this package is an extension of the on-the-fly method recently proposed in [2] for Markov Chains.

Preliminaries

Markov Decision Processes and Bisimulation

A probability distribution over a nonempty enumerable set S is a function μ : S → [0,1] such that ∑_{s ∈ S} μ(s) = 1. We denote by Δ(S) the set of probability distributions over S.

Definition (Markov Decision Process). A Markov Decision Process with rewards (MDP) is a tuple M = (S,A,τ,ρ) consisting of a finite nonempty set S of states, a finite nonempty set A of actions, a transition function τ : S × A → Δ(S), and a reward function ρ : S × A → ℝ.

The operational behavior of an MDP M = (S,A,τ,ρ) is as follows: in a state s the process nondeterministically chooses an action a ∈ A and moves to a state s′ with probability τ(s,a)(s′). The choice of the action a made in s is rewarded by ρ(s,a). Executions are transition sequences w = (s_0,a_0),(s_1,a_1),…; the challenge is to find strategies for choosing the actions so as to maximize the discounted accumulated reward ∑_{i≥0} λ^i ρ(s_i,a_i), where λ ∈ (0,1) is the discount factor. A strategy is given by a function π : S → Δ(A), called a policy, where π(s)(a) is the probability of choosing the action a in the state s. Each policy π induces a probability distribution over executions, and the value of s according to π is the expected discounted accumulated reward over Ω(s), the set of executions starting from s.

Definition (Stochastic Bisimulation). Let M = (S,A,τ,ρ) be an MDP. An equivalence relation R ⊆ S × S is a stochastic bisimulation if whenever (s,t) ∈ R then, for all a ∈ A, ρ(s,a) = ρ(t,a) and, for all R-equivalence classes C, τ(s,a)(C) = τ(t,a)(C). Two states s,t ∈ S are stochastically bisimilar, written s ∼ t, if they are related by some stochastic bisimulation on M.

Bisimilarity Pseudometric

To cope with the problem of measuring how similar two MDPs are, Ferns et al. [1] defined a bisimilarity pseudometric that measures the behavioral similarity of (possibly non-bisimilar) states. It is defined as the least fixed point of a transformation operator on the functions in S × S → [0,∞).
Consider an arbitrary MDP M = (S,A,τ,ρ) and a discount factor λ ∈ (0,1). The set of [0,∞)-valued maps on S × S is equipped with the point-wise partial order defined by d ⊑ d′ ⇔ d(s,t) ≤ d′(s,t), for all s,t ∈ S.
We define a fixed point operator F_λ, for d : S × S → [0,∞) and s,t ∈ S, as

F_λ(d)(s,t) = max_{a ∈ A} ( |ρ(s,a) − ρ(t,a)| + λ · T_d(τ(s,a), τ(t,a)) )

where T_d(τ(s,a), τ(t,a)) denotes the optimal value of a Transportation Problem with marginals τ(s,a) and τ(t,a) and costs given by d:

T_d(μ, ν) = min { ∑_{u,v ∈ S} ω(u,v) · d(u,v) : ω ∈ Δ(S × S), ∑_{v ∈ S} ω(u,v) = μ(u) and ∑_{u ∈ S} ω(u,v) = ν(v), for all u,v ∈ S }

The operator F_λ is monotonic, thus, by Tarski’s fixed point theorem, it admits a least fixed point. This fixed point is the bisimilarity pseudometric.

Definition (Bisimilarity pseudometric). Let M = (S,A,τ,ρ) be an MDP and λ ∈ (0,1) be a discount factor; then the λ-discounted bisimilarity pseudometric for M, written δ_λ, is the least fixed point of F_λ.

The pseudometric δ_λ enjoys the property that two states are at distance zero if and only if they are bisimilar.
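As an aside for readers who want to experiment with the transportation problem in isolation, its optimal value can be obtained with Mathematica's built-in LinearProgramming function. The helper below is a minimal sketch, not part of the MDPDist package; the name transportDist and the encoding of the marginals as plain probability vectors are assumptions made for illustration.

(* Sketch: optimal value T_d(mu, nu) of the transportation problem with
   cost matrix d (d[[u, v]] is the cost of moving mass from u to v)
   and marginal probability vectors mu, nu. Illustrative only. *)
transportDist[d_, mu_, nu_] :=
 Module[{n = Length[mu], c, cons, rhs},
  c = Flatten[d];                       (* costs, flattened row-major *)
  cons = Join[                          (* coefficient rows of the constraints *)
    Table[Flatten[Table[Boole[u == i], {u, n}, {v, n}]], {i, n}],   (* Sum_v w[u,v] = mu[[u]] *)
    Table[Flatten[Table[Boole[v == j], {u, n}, {v, n}]], {j, n}]];  (* Sum_u w[u,v] = nu[[v]] *)
  rhs = Thread[{Join[mu, nu], 0}];      (* a pair {value, 0} marks an equality constraint *)
  c . LinearProgramming[c, cons, rhs]]

For instance, transportDist[{{0, 1}, {1, 0}}, {1, 0}, {0, 1}] evaluates to 1, since all the probability mass has to be moved from the first state to the second at cost 1.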

Getting started

The package can be loaded by setting the current directory to the directory of the MDPDist package and then using the command << (Get) as follows

MDP Tutorial_29.gif
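The cell above is rendered as an image; a minimal sketch of such a loading cell, assuming the notebook is saved in the same directory as the package files, is:

SetDirectory[NotebookDirectory[]];  (* directory containing the MDPDist package *)
<< MDPDist`                         (* load the package *)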

Encoding and manipulating Markov Decision Processes

Encoding an MDP

An MDP M = (S,A,τ,ρ) with S = {s_1, …, s_n} and A = {a_1, …, a_m} is represented as a term of the form MDP[tm, rw, act] where
- tm is an n × m × n probability matrix such that tm[[i,k,j]] = τ(s_i, a_k)(s_j) for all 1 ≤ i,j ≤ n and all 1 ≤ k ≤ m,
- rw is an n × m real-valued matrix such that rw[[i,k]] = ρ(s_i, a_k) for all 1 ≤ i ≤ n and all 1 ≤ k ≤ m, and
- act is a list of pairwise distinct strings such that act[[k]] represents the action a_k for all 1 ≤ k ≤ m.
A state s_i is implicitly represented by its index 1 ≤ i ≤ n.
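As a toy illustration of this representation (independent of the running example below), a two-state MDP with a single action "a", in which state 1 loops on itself and state 2 moves to state 1 with probability 1, could be written directly as an MDP term:

(* toy MDP: 2 states, 1 action; dimensions are 2 x 1 x 2 (tm) and 2 x 1 (rw) *)
toyMDP = MDP[
   {{{1, 0}},     (* from state 1, action "a": stay in state 1 *)
    {{1, 0}}},    (* from state 2, action "a": move to state 1 *)
   {{0},          (* reward of "a" in state 1 *)
    {1}},         (* reward of "a" in state 2 *)
   {"a"}];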

Assume we want to encode the following MDP M = (S,A,τ,ρ) with S = {1,2,3,4,5,6} and A = {inc, dec}, with the following intuitive operational behavior. Given M in state i, the choice of the action inc causes the MDP to move to state i+1 with probability 0.8, and to stay in state i or to move to state i-1 each with probability 0.1. The choice of the action dec causes the MDP to move to state i-1 with probability 0.7, to stay in state i with probability 0.1, and to move to state i+1 with probability 0.2. The reward for choosing an inc action is always -1, whereas the reward for a dec action is 5 if it is chosen in state 5 or 6, and 0 otherwise.

Mathematica makes it easy to define functions and lists, and we can exploit these features to define the data structures needed to encode M.

MDP Tutorial_36.gif
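The actual definitions are in the cell above. A possible way of writing them, consistent with the informal description of M (the identifiers tau and rho, and the choice of redirecting moves that would leave {1,…,6} back to the nearest state in range, are assumptions of this sketch), is:

A = {"inc", "dec"};                 (* set of actions *)
clamp[j_] := Min[Max[j, 1], 6];     (* keep target states within 1..6 (assumption) *)

(* tau[i, a] : probability vector over the states 1..6 *)
tau[i_, a_] := Module[{p = ConstantArray[0, 6], moves},
  moves = If[a === "inc",
    {i + 1 -> 0.8, i -> 0.1, i - 1 -> 0.1},
    {i - 1 -> 0.7, i -> 0.1, i + 1 -> 0.2}];
  Do[p[[clamp[First[m]]]] += Last[m], {m, moves}];
  p];

(* rho[i, a] : reward of choosing action a in state i *)
rho[_, "inc"] := -1;
rho[i_, "dec"] := If[i >= 5, 5, 0];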

A represents the set of actions, τ is the transition function and ρ is the reward function.
The transition and reward matrices can then be obtained using the built-in function Table as follows:

MDP Tutorial_37.gif

MDP Tutorial_38.gif

MDP Tutorial_39.gif

Then M is encoded by

MDP Tutorial_40.gif
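Using the definitions sketched above, the Table-based construction and the final encoding could look as follows (tm, rw and mdp are the names used in the rest of this tutorial; the cells above show the authors' actual code):

tm = Table[tau[i, a], {i, 6}, {a, A}];   (* 6 x 2 x 6 probability matrix *)
rw = Table[rho[i, a], {i, 6}, {a, A}];   (* 6 x 2 reward matrix *)
mdp = MDP[tm, rw, A];                    (* the encoded MDP *)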

Alternatively, one may use the functions MDPtm and MDPrw to construct the transition and reward matrices by listing only the nonzero values. Given the list of rules tmRules = MDP Tutorial_41.gif, the transition probability matrix induced by the transition function τ may be constructed by calling MDPtm[tmRules, n, m].
Analogously, for rwRules = MDP Tutorial_42.gif, one can construct the reward matrix induced by the reward function ρ by calling MDPrw[rwRules, n, m].

The list of rules may be given explicitly or using patterns. For example, the matrices tm and rw defined above can be constructed as follows

MDP Tutorial_43.gif
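The cells above show the syntax actually accepted by the package. Purely as an illustration of the idea — the rule format used here, with {state, action, state} keys for MDPtm and {state, action} keys for MDPrw, is hypothetical and should be checked against the cells above — such rules might look like:

(* hypothetical rule syntax, for illustration only *)
tmRules = {
   {i_, "inc", j_} /; j == i + 1 -> 0.8,              (* pattern-based rules *)
   {i_, "inc", j_} /; j == i || j == i - 1 -> 0.1,
   {i_, "dec", j_} /; j == i - 1 -> 0.7,
   {i_, "dec", j_} /; j == i -> 0.1,
   {i_, "dec", j_} /; j == i + 1 -> 0.2};
rwRules = {{_, "inc"} -> -1, {i_, "dec"} /; i >= 5 -> 5};
tm2 = MDPtm[tmRules, 6, 2];   (* 6 states, 2 actions *)
rw2 = MDPrw[rwRules, 6, 2];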

Manipulating MDPs

An MDP is displayed by the function PlotMDP. States are represented as white nodes labeled with their respective index. Red edges exiting from each node represent a nondeterministic choice of an action. The chosen action is displayed as a colored node, called a choice node, labeled with its associated reward. Edges exiting from a choice node represent the probability distribution obtained after the nondeterministic choice. For example, the MDP mdp defined above is displayed as follows

MDP Tutorial_44.gif

MDP Tutorial_45.gif
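The picture above is produced by a call of the form:

PlotMDP[mdp]   (* white nodes: states; colored nodes: action choices with their rewards *)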

Given a sequence M_1, …, M_k of MDPs, JoinMDP[M_1, …, M_k] yields an MDP representing their disjoint union, where the indices representing the set of states are obtained by shifting the indices of the states of its arguments according to their relative order in the sequence (e.g., if M_1 has n states, the index corresponding to the i-th state of M_2 in JoinMDP[M_1, M_2] is n + i).

MDP Tutorial_52.gif

MDP Tutorial_53.gif
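For example, the disjoint union of two copies of mdp (in which states 7–12 correspond to states 1–6 of the second copy) can be built and displayed as follows:

bigMDP = JoinMDP[mdp, mdp];   (* 12 states: 1-6 from the first copy, 7-12 from the second *)
PlotMDP[bigMDP]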

Computing Bisimilarity Distances

Given an MDP M, a discount factor λ, and a list of state pairs Q, BDistMDP[M, λ, Q] computes the distance δ_λ(s,t) for each pair of states (s,t) in Q. For example, taking the MDP mdp considered so far, a discount factor 0.5, and the single-pair query Q = {(1,2)}, we have

MDP Tutorial_55.gif

MDP Tutorial_56.gif

while for the query Q = {(1,2),(3,4),(6,5)} we have

MDP Tutorial_57.gif

MDP Tutorial_58.gif
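In code, the two queries above take the following form (writing each pair of states as a two-element list is an assumption of this sketch):

BDistMDP[mdp, 0.5, {{1, 2}}]                    (* single-pair query *)
BDistMDP[mdp, 0.5, {{1, 2}, {3, 4}, {6, 5}}]    (* three-pair query *)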

If one is interested in computing the distance between all state pairs, one can use the alias All as follows

MDP Tutorial_59.gif

MDP Tutorial_60.gif
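That is, a call of the following form returns the distances for all pairs of states:

BDistMDP[mdp, 0.5, All]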

The computation can be traced by setting the option Verbose to True. This displays the couplings encountered during each iteration of the on-the-fly algorithm. The option can also be used to analyze how the algorithm explores the state space of the MDP.
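For example, a traced run can be requested as follows (the query used for the trace shown below may differ):

BDistMDP[mdp, 0.5, {{1, 2}}, Verbose -> True]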

[Verbose run of BDistMDP: output cells MDP Tutorial_61.gif – MDP Tutorial_263.gif, showing the couplings encountered at each iteration of the on-the-fly algorithm.]

One can provide estimates in order to further reduce the exploration of the state space. This can be done by using the option Estimates

MDP Tutorial_264.gif

MDP Tutorial_265.gif

Exact Computations

Mathematica allows one to perform exact computations over rational numbers. Usually the Bellman-optimality equation system is solved approximately, since it is in general a non-linear equation system. Despite this, our method is able to compute the exact result when the input MDP is rational-valued (i.e., both the transition function and the reward function are rational-valued).

For example, the MDP we considered so far can be rationalized as follows

MDP Tutorial_266.gif
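One way to obtain a rational-valued copy of mdp, assuming the matrices tm and rw built earlier, is to rationalize each entry (the names tmQ, rwQ, mdpQ are chosen for this sketch):

tmQ = Map[Rationalize, tm, {3}];   (* e.g. 0.8 becomes 4/5 *)
rwQ = Map[Rationalize, rw, {2}];
mdpQ = MDP[tmQ, rwQ, A];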

Now the exact value of the distance can be computed by setting the option Exact to True

MDP Tutorial_267.gif

MDP Tutorial_268.gif
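With the rational-valued MDP, the exact distance for, say, the pair (1,2) can be requested as follows (1/2 is used instead of 0.5 so that the discount factor is exact as well):

BDistMDP[mdpQ, 1/2, {{1, 2}}, Exact -> True]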

Bisimilarity Classes & Quotient

Since δ_λ(s,t) equals zero if and only if the states s and t are bisimilar, we can use the algorithm for computing δ_λ to compute the bisimilarity classes and the bisimilarity quotient of an MDP.

The function BisimClassesMDP[M] returns the set of bisimilarity classes of M, that is S/∼.

MDP Tutorial_271.gif

MDP Tutorial_272.gif

MDP Tutorial_273.gif

MDP Tutorial_274.gif
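For instance, applying it to the disjoint union of two copies of mdp (the bigMDP built in the JoinMDP example) should put each state of the first copy in the same class as the corresponding state of the second copy:

BisimClassesMDP[bigMDP]   (* states i and i + 6 end up in the same class *)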

The function BisimQuotientMDP[M] yields an MDP representing the bisimilarity quotient of M.

MDP Tutorial_275.gif

MDP Tutorial_276.gif
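Similarly, the quotient of bigMDP has one state per bisimilarity class and can be inspected with PlotMDP:

quotient = BisimQuotientMDP[bigMDP];
PlotMDP[quotient]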
