Part IB — Markov Chains

Based on lectures by G. R. Grimmett

Notes taken by Dexter Chua

Michaelmas 2015

These notes are not endorsed by the lecturers, and I have modified them (often

significantly) after lectures. They are nowhere near accurate representations of what

was actually lectured, and in particular, all errors are almost surely mine.

Discrete-time chains

Definition and basic properties, the transition matrix. Calculation of n-step transition probabilities. Communicating classes, closed classes, absorption, irreducibility. Calculation of hitting probabilities and mean hitting times; survival probability for birth and death chains. Stopping times and statement of the strong Markov property. [5]

Recurrence and transience; equivalence of transience and summability of n-step transition probabilities; equivalence of recurrence and certainty of return. Recurrence as a class property, relation with closed classes. Simple random walks in dimensions one, two and three. [3]

Invariant distributions, statement of existence and uniqueness up to constant multiples. Mean return time, positive recurrence; equivalence of positive recurrence and the existence of an invariant distribution. Convergence to equilibrium for irreducible, positive recurrent, aperiodic chains *and proof by coupling*. Long-run proportion of time spent in a given state. [3]

Time reversal, detailed balance, reversibility, random walk on a graph. [1]

Contents

0 Introduction

1 Markov chains

1.1 The Markov property

1.2 Transition probability

2 Classification of chains and states

2.1 Communicating classes

2.2 Recurrence or transience

2.3 Hitting probabilities

2.4 The strong Markov property and applications

2.5 Further classification of states

3 Long-run behaviour

3.1 Invariant distributions

3.2 Convergence to equilibrium

4 Time reversal

0 Introduction

So far, in IA Probability, we have always dealt with one random variable, or

numerous independent variables, and we were able to handle them. However, in

real life, things often are dependent, and things become much more difficult.

In general, there are many ways in which variables can be dependent. Their

dependence can be very complicated, or very simple. If we are just told two

variables are dependent, we have no idea what we can do with them.

This is similar to our study of functions. We can develop theories about

continuous functions, increasing functions, or differentiable functions, but if we

are just given a random function without assuming anything about it, there

really isn’t much we can do.

Hence, in this course, we are just going to study a particular kind of dependent

variables, known as Markov chains. In fact, in IA Probability, we have already

encountered some of these. One prominent example is the random walk, in

which the next position depends on the previous position. This gives us some

dependent random variables, but they are dependent in a very simple way.

In reality, a random walk is too simple a model to describe the world. We

need something more general, and these are Markov chains. These, by definition,

are sequences of random variables that satisfy the Markov assumption. This assumption,

intuitively, says that the future depends only upon the current state, and not

how we got to the current state. It turns out that just given this assumption,

we can prove a lot about these chains.

1 Markov chains

1.1 The Markov property

We start with the definition of a Markov chain.

Definition (Markov chain). Let X = (X_0, X_1, ···) be a sequence of random variables taking values in some set S, the state space. We assume that S is countable (which could be finite).

We say X has the Markov property if for all n ≥ 0, i_0, ··· , i_{n+1} ∈ S, we have

P(X_{n+1} = i_{n+1} | X_0 = i_0, ··· , X_n = i_n) = P(X_{n+1} = i_{n+1} | X_n = i_n).

If X has the Markov property, we call it a Markov chain.

We say that a Markov chain X is homogeneous if the conditional probabilities P(X_{n+1} = j | X_n = i) do not depend on n.

All our chains X will be Markov and homogeneous unless otherwise specified. Since the state space S is countable, we usually label the states by integers i ∈ N.
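To make the definition concrete, here is a short simulation sketch (my addition, not part of the lectured notes); `simulate_chain` is a hypothetical helper name, and states are labelled 0, 1, ···:

```python
import random

def simulate_chain(lam, P, n_steps, seed=0):
    """Sample a path X_0, ..., X_{n_steps} of a homogeneous Markov chain
    with initial distribution lam and transition matrix P.
    States are labelled 0, 1, ..., len(P) - 1."""
    rng = random.Random(seed)
    # X_0 is drawn from lam; thereafter X_{n+1} is drawn from row P[X_n],
    # which depends only on the current state (the Markov property).
    x = rng.choices(range(len(lam)), weights=lam)[0]
    path = [x]
    for _ in range(n_steps):
        x = rng.choices(range(len(P)), weights=P[x])[0]
        path.append(x)
    return path

# A two-state chain: from either state, stay with probability 0.9.
lam = [1.0, 0.0]                      # start in state 0
P = [[0.9, 0.1],
     [0.1, 0.9]]
path = simulate_chain(lam, P, 20)
```

Because the chain is homogeneous, the same row of P is used at every time step; only the current state matters.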

Example.

(i) A random walk is a Markov chain.

(ii) The branching process is a Markov chain.

In general, to fully specify a (homogeneous) Markov chain, we will need two items:

(i) The initial distribution λ_i = P(X_0 = i). We can write this as a vector λ = (λ_i : i ∈ S).

(ii) The transition probabilities p_{i,j} = P(X_{n+1} = j | X_n = i). We can write this as a matrix P = (p_{i,j})_{i,j∈S}.

We will start by proving a few properties of λ and P. These let us know whether an arbitrary pair of vector and matrix (λ, P) actually specifies a Markov chain.

Proposition.

(i) λ is a distribution, i.e. λ_i ≥ 0 and Σ_i λ_i = 1.

(ii) P is a stochastic matrix, i.e. p_{i,j} ≥ 0 and Σ_j p_{i,j} = 1 for all i.

Proof.

(i) Obvious since λ is a probability distribution.

(ii) p_{i,j} ≥ 0 since p_{i,j} is a probability. We also have

Σ_j p_{i,j} = Σ_j P(X_1 = j | X_0 = i) = 1

since P(X_1 = · | X_0 = i) is a probability distribution.

Note that we only require the row sum to be 1, and the column sum need

not be.
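These two conditions are mechanical to check for a finite candidate pair (λ, P); a minimal sketch (the helper names are my own, not from the notes):

```python
def is_distribution(lam, tol=1e-12):
    """lam is a distribution iff entries are non-negative and sum to 1."""
    return all(x >= 0 for x in lam) and abs(sum(lam) - 1) < tol

def is_stochastic(P, tol=1e-12):
    # Each ROW must be a distribution; column sums are unconstrained.
    return all(is_distribution(row, tol) for row in P)

P = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [1/3, 1/3, 1/3]]
```

Note that the first column of this P sums to 0.5 + 0 + 1/3 ≠ 1, which is fine: only the rows are constrained.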

We will prove another seemingly obvious fact.

Theorem. Let

λ

be a distribution (on

S

) and

P

a stochastic matrix. The

sequence

X

= (

X

0

, X

1

, ···

) is a Markov chain with initial distribution

λ

and

transition matrix P iff

P(X

0

= i

0

, X

1

= i

1

, ··· , X

n

= i

n

) = λ

i

0

p

i

0

,i

1

p

i

1

,i

2

···p

i

n−1

,i

n

(∗)

for all n, i

0

, ··· , i

n

Proof. Let A_k be the event X_k = i_k. Then we can write (∗) as

P(A_0 ∩ A_1 ∩ ··· ∩ A_n) = λ_{i_0} p_{i_0,i_1} p_{i_1,i_2} ··· p_{i_{n−1},i_n}.   (∗)

We first assume that X is a Markov chain. We prove (∗) by induction on n. When n = 0, (∗) says P(A_0) = λ_{i_0}. This is true by definition of λ.

Assume that it is true for all n < N. Then

P(A_0 ∩ A_1 ∩ ··· ∩ A_N) = P(A_0 ∩ ··· ∩ A_{N−1}) P(A_N | A_0, ··· , A_{N−1})

= λ_{i_0} p_{i_0,i_1} ··· p_{i_{N−2},i_{N−1}} P(A_N | A_0, ··· , A_{N−1})

= λ_{i_0} p_{i_0,i_1} ··· p_{i_{N−2},i_{N−1}} P(A_N | A_{N−1})

= λ_{i_0} p_{i_0,i_1} ··· p_{i_{N−2},i_{N−1}} p_{i_{N−1},i_N}.

So it is true for N as well. Hence we are done by induction.

Conversely, suppose that (∗) holds. Then for n = 0, we have P(X_0 = i_0) = λ_{i_0}. Otherwise,

P(X_n = i_n | X_0 = i_0, ··· , X_{n−1} = i_{n−1}) = P(A_n | A_0 ∩ ··· ∩ A_{n−1})

= P(A_0 ∩ ··· ∩ A_n) / P(A_0 ∩ ··· ∩ A_{n−1})

= p_{i_{n−1},i_n},

which is independent of i_0, ··· , i_{n−2}. So this is Markov.

Often, we do not use the Markov property directly. Instead, we use the following:

Theorem (Extended Markov property). Let X be a Markov chain. For n ≥ 0, any H given in terms of the past {X_i : i < n}, and any F given in terms of the future {X_i : i > n}, we have

P(F | X_n = i, H) = P(F | X_n = i).

To prove this, we need to stitch together many instances of the Markov property. The actual proof is omitted.

1.2 Transition probability

Recall that we can specify the dynamics of a Markov chain by the one-step transition probability,

p_{i,j} = P(X_{n+1} = j | X_n = i).

However, we don't always want to take 1 step. We might want to take 2 steps, 3 steps, or, in general, n steps. Hence, we define

Definition (n-step transition probability). The n-step transition probability from i to j is

p_{i,j}(n) = P(X_n = j | X_0 = i).

How do we compute these probabilities? The idea is to break this down into smaller parts. We want to express p_{i,j}(m + n) in terms of n-step and m-step transition probabilities. Then we can iteratively break down an arbitrary p_{i,j}(n) into expressions involving the one-step transition probabilities only.

To compute p_{i,j}(m + n), we can think of this as a two-step process. We first go from i to some unknown point k in m steps, and then travel from k to j in n more steps. To find the probability to get from i to j, we consider all possible routes from i to j, and sum up the probabilities of the paths. We have

p_{i,j}(m + n) = P(X_{m+n} = j | X_0 = i)

= Σ_k P(X_{m+n} = j | X_m = k, X_0 = i) P(X_m = k | X_0 = i)

= Σ_k P(X_{m+n} = j | X_m = k) P(X_m = k | X_0 = i)

= Σ_k p_{i,k}(m) p_{k,j}(n).

Thus we get

Theorem (Chapman-Kolmogorov equation).

p_{i,j}(m + n) = Σ_{k∈S} p_{i,k}(m) p_{k,j}(n).

This formula is suspiciously familiar. It is just matrix multiplication!

Notation. Write P(m) = (p_{i,j}(m))_{i,j∈S}.

Then we have

P(m + n) = P(m)P(n).

In particular, we have

P(n) = P(1)P(n − 1) = ··· = P(1)^n = P^n.
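Since P(m + n) = P(m)P(n), the n-step probabilities are just entries of the matrix power P^n. A bare-bones sketch of this (my addition; no libraries assumed):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def mat_pow(P, n):
    # P(n) = P^n by repeated multiplication (identity matrix for n = 0)
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = matmul(result, P)
    return result

P = [[0.7, 0.3],
     [0.4, 0.6]]
P2 = matmul(P, P)                            # two-step probabilities
P5 = mat_pow(P, 5)
P2_3 = matmul(mat_pow(P, 2), mat_pow(P, 3))  # Chapman-Kolmogorov: P(2)P(3) = P(5)
```

For instance p_{1,2}(2) = p_{1,1} p_{1,2} + p_{1,2} p_{2,2} = 0.7 · 0.3 + 0.3 · 0.6 = 0.39 in this (made-up) example.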

This allows us to easily compute the n-step transition probability by matrix multiplication.

Example. Let S = {1, 2}, with

P =
( 1 − α    α   )
(   β    1 − β )

We assume 0 < α, β < 1. We want to find the n-step transition probability.

We can achieve this via diagonalization. We can write P as

P = U^{−1} ( κ_1  0  ) U,
           (  0  κ_2 )

where the κ_i are the eigenvalues of P, and U is composed of the eigenvectors. To find the eigenvalues, we calculate

det(P − λI) = (1 − α − λ)(1 − β − λ) − αβ = 0.

We solve this to obtain

κ_1 = 1,  κ_2 = 1 − α − β.

Usually, the next thing to do would be to find the eigenvectors to obtain U. However, here we can cheat a bit and not do that. Using the diagonalization of P, we have

P^n = U^{−1} ( κ_1^n   0    ) U.
             (  0    κ_2^n  )

We can now attempt to compute p_{1,2}(n). We know that it must be of the form

p_{1,2}(n) = Aκ_1^n + Bκ_2^n = A + B(1 − α − β)^n,

where A and B are constants coming from U and U^{−1}. However, we know well that

p_{1,2}(0) = 0,  p_{1,2}(1) = α.

So we obtain

A + B = 0

A + B(1 − α − β) = α.

This is something we can solve, and we obtain

p_{1,2}(n) = α/(α + β) · (1 − (1 − α − β)^n) = 1 − p_{1,1}(n).

How about p_{2,1} and p_{2,2}? Well, we don't need additional work. We can obtain these simply by interchanging α and β. So we obtain

P^n = 1/(α + β) ( β + α(1 − α − β)^n    α − α(1 − α − β)^n )
                ( β − β(1 − α − β)^n    α + β(1 − α − β)^n )

What happens as n → ∞? We can take the limit and obtain

P^n → 1/(α + β) ( β  α )
                ( β  α )

We see that the two rows are the same. This means that as time goes on, where we end up does not depend on where we started. We will later (near the end of the course) see that this is generally true for most Markov chains.

Alternatively, we can solve this by a difference equation. The recurrence relation is given by

p_{1,1}(n + 1) = p_{1,1}(n)(1 − α) + p_{1,2}(n)β.

Writing in terms of p_{1,1} only, we have

p_{1,1}(n + 1) = p_{1,1}(n)(1 − α) + (1 − p_{1,1}(n))β.

We can solve this as we have done in IA Differential Equations.
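As a sanity check on the closed form (my addition, not lectured), we can compare it with brute-force matrix powers for some made-up values of α and β:

```python
def p12_closed(n, alpha, beta):
    # closed form: p_{1,2}(n) = alpha/(alpha+beta) (1 - (1-alpha-beta)^n)
    return alpha / (alpha + beta) * (1 - (1 - alpha - beta) ** n)

def p12_power(n, alpha, beta):
    # brute force: take the (1,2) entry of P^n
    P = [[1 - alpha, alpha], [beta, 1 - beta]]
    M = [[1.0, 0.0], [0.0, 1.0]]   # identity, i.e. P^0
    for _ in range(n):
        M = [[sum(M[i][k] * P[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return M[0][1]

alpha, beta = 0.3, 0.5
```

For large n the closed form tends to α/(α + β), matching the limiting matrix above.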

We saw that the Chapman-Kolmogorov equation can be concisely stated as

a rule about matrix multiplication. In general, many statements about Markov

chains can be formulated in the language of linear algebra naturally.

For example, let X_0 have distribution λ. What is the distribution of X_1? By definition, it is

P(X_1 = j) = Σ_i P(X_1 = j | X_0 = i) P(X_0 = i) = Σ_i λ_i p_{i,j}.

Hence X_1 has distribution λP, where λ is treated as a row vector. Similarly, X_n has distribution λP^n.
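The relation "X_n has distribution λP^n" can be computed by repeatedly multiplying the row vector by P; a minimal sketch with my own example numbers:

```python
def step_distribution(lam, P, n):
    """Distribution of X_n: the row vector lam P^n."""
    mu = list(lam)
    for _ in range(n):
        # one step: mu <- mu P, i.e. mu_j = sum_i mu_i p_{i,j}
        mu = [sum(mu[i] * P[i][j] for i in range(len(P)))
              for j in range(len(P))]
    return mu

lam = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.2, 0.8]]
mu1 = step_distribution(lam, P, 1)   # distribution of X_1
```

Here P(X_1 = 1) = 0.5 · 0.9 + 0.5 · 0.2 = 0.55, and the result remains a distribution at every step.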

In fact, historically, the theory of Markov chains was initially developed as a branch of

linear algebra, and a lot of the proofs were just linear algebra manipulations.

However, nowadays, we often look at it as a branch of probability theory instead,

and this is what we will do in this course. So don’t be scared if you hate linear

algebra.

2 Classification of chains and states

2.1 Communicating classes

Suppose we have a Markov chain X over some state space S. While we would usually expect the different states in S to be mutually interacting, it is possible that we have a state i ∈ S that can never be reached, or we might get stuck in some state j ∈ S and can never leave. These are usually less interesting. Hence we would like to rule out these scenarios, and focus on what we call irreducible chains, where we can freely move between different states.

We start with some elementary definitions.

Definition (Leading to and communicate). Suppose we have two states i, j ∈ S. We write i → j (i leads to j) if there is some n ≥ 0 such that p_{i,j}(n) > 0, i.e. it is possible for us to get from i to j (in multiple steps). Note that we allow n = 0. So we always have i → i.

We write i ↔ j if i → j and j → i. If i ↔ j, we say i and j communicate.

Proposition. ↔ is an equivalence relation.

Proof.

(i) Reflexive: we have i ↔ i since p_{i,i}(0) = 1.

(ii) Symmetric: trivial by definition.

(iii) Transitive: suppose i → j and j → k. Since i → j, there is some m > 0 such that p_{i,j}(m) > 0. Since j → k, there is some n such that p_{j,k}(n) > 0. Then

p_{i,k}(m + n) = Σ_r p_{i,r}(m) p_{r,k}(n) ≥ p_{i,j}(m) p_{j,k}(n) > 0.

So i → k. Similarly, if j → i and k → j, then k → i. So i ↔ j and j ↔ k implies that i ↔ k.

So we have an equivalence relation, and we know what to do with equivalence

relations. We form equivalence classes!

Definition (Communicating classes). The equivalence classes of ↔ are the communicating classes.

We have to be careful with these communicating classes. Different communicating classes are not completely isolated. Within a communicating class A, of course we can move between any two vertices. However, it is also possible that we can escape from a class A to a different class B. It is just that after going to B, we cannot return to class A. From B, we might be able to get to another class C. We can jump around all the time, but (if there are finitely many communicating classes) eventually we have to stop when we have visited every class. Then we are bound to stay in that class.

Since we are eventually going to be stuck in that class anyway, often, we can just consider this final communicating class and ignore the others. So wlog we can assume that the chain only has one communicating class.

Definition (Irreducible chain). A Markov chain is irreducible if there is a unique communicating class.

From now on, we will mostly care about irreducible chains only.

More generally, we call a subset closed if we cannot escape from it.

Definition (Closed). A subset C ⊆ S is closed if p_{i,j} = 0 for all i ∈ C, j ∉ C.

Proposition. A subset C is closed iff "i ∈ C, i → j implies j ∈ C".

Proof. Assume C is closed. Let i ∈ C, i → j. Since i → j, there is some m such that p_{i,j}(m) > 0. Expanding the Chapman-Kolmogorov equation, we have

p_{i,j}(m) = Σ_{i_1,··· ,i_{m−1}} p_{i,i_1} p_{i_1,i_2} ··· p_{i_{m−1},j} > 0.

So there is some route i, i_1, ··· , i_{m−1}, j such that p_{i,i_1}, p_{i_1,i_2}, ··· , p_{i_{m−1},j} > 0. Since p_{i,i_1} > 0, we have i_1 ∈ C. Since p_{i_1,i_2} > 0, we have i_2 ∈ C. By induction, we get that j ∈ C.

To prove the other direction, assume that "i ∈ C, i → j implies j ∈ C". So for any i ∈ C, j ∉ C, i does not lead to j. So in particular p_{i,j} = 0.

Example. Consider S = {1, 2, 3, 4, 5, 6} with transition matrix

P =
( 1/2  1/2   0    0    0    0 )
(  0    0    1    0    0    0 )
( 1/3   0    0   1/3  1/3   0 )
(  0    0    0   1/2  1/2   0 )
(  0    0    0    0    0    1 )
(  0    0    0    0    1    0 )

(The original notes also draw the transition diagram on the states 1, ··· , 6.)

We see that the communicating classes are {1, 2, 3}, {4}, {5, 6}, where {5, 6} is closed.
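For a finite chain, the communicating classes can be computed mechanically from the support of P: build the reachability relation → by transitive closure, then intersect it with its transpose. A small sketch of this (my addition), run on the six-state example above with the states relabelled 0-5:

```python
def communicating_classes(P):
    """Equivalence classes of <-> for a finite transition matrix P."""
    n = len(P)
    # reach[i][j]: is p_{i,j}(m) > 0 for some m >= 0?  (m = 0 gives i -> i)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):            # Floyd-Warshall style transitive closure
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    classes = []
    for i in range(n):
        cls = frozenset(j for j in range(n) if reach[i][j] and reach[j][i])
        if cls not in classes:
            classes.append(cls)
    return classes

# The six-state example above, states relabelled 0-5.
P = [[1/2, 1/2, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [1/3, 0, 0, 1/3, 1/3, 0],
     [0, 0, 0, 1/2, 1/2, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1, 0]]
classes = communicating_classes(P)
```

This recovers the classes {0, 1, 2}, {3} and {4, 5}, the last of which is closed (its rows place no mass outside it).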

2.2 Recurrence or transience

The major focus of this chapter is recurrence and transience. This was something

that came up when we discussed random walks in IA Probability — given a

random walk, say starting at 0, what is the probability that we will return to

0 later on? Recurrence and transience is a qualitative way of answering this

question. As we mentioned, we will mostly focus on irreducible chains. So by

definition there is always a non-zero probability of returning to 0. Hence the

question we want to ask is whether we are going to return to 0 with certainty, i.e.

with probability 1. If we are bound to return, then we say the state is recurrent.

Otherwise, we say it is transient.

It should be clear that this notion is usually interesting only for an infinite

state space. If we have an infinite state space, we might get transience because we

are very likely to drift away to a place far, far away and never return. However,

in a finite state space, this can’t happen. Transience can occur only if we get

stuck in a place and can’t leave, i.e. we are not in an irreducible state space.

These are not interesting.

Notation. For convenience, we will introduce some notation. We write

P_i(A) = P(A | X_0 = i),

and

E_i(Z) = E(Z | X_0 = i).

Suppose we start from i, and randomly wander around. Eventually, we may or may not get to j. If we do, there is a time at which we first reach j. We call this the first passage time.

Definition (First passage time and probability). The first passage time of j ∈ S starting from i is

T_j = min{n ≥ 1 : X_n = j}.

Note that this implicitly depends on i. Here we require n ≥ 1. Otherwise T_i would always be 0.

The first passage probability is

f_{i,j}(n) = P_i(T_j = n).

Definition (Recurrent state). A state i ∈ S is recurrent (or persistent) if

P_i(T_i < ∞) = 1,

i.e. we will eventually get back to the state. Otherwise, we call the state transient.

Note that transient does not mean we don’t get back. It’s just that we are

not sure that we will get back. We can show that if a state is recurrent, then

the probability that we return to i infinitely many times is also 1.

Our current objective is to show the following characterization of recurrence.

Theorem. i is recurrent iff Σ_n p_{i,i}(n) = ∞.

The technique to prove this would be to use generating functions. We need to first decide what sequence to work with. For any fixed i, j, consider the sequence p_{i,j}(n) as a sequence in n. Then we define

P_{i,j}(s) = Σ_{n=0}^∞ p_{i,j}(n) s^n.

We also define

F_{i,j}(s) = Σ_{n=0}^∞ f_{i,j}(n) s^n,

where f_{i,j} is our first passage probability. For the sake of clarity, we make it explicit that p_{i,j}(0) = δ_{i,j}, and f_{i,j}(0) = 0.

Our proof will be heavily based on the result below:

Theorem.

P_{i,j}(s) = δ_{i,j} + F_{i,j}(s) P_{j,j}(s),

for −1 < s ≤ 1.

Proof. Using the law of total probability,

p_{i,j}(n) = Σ_{m=1}^n P_i(X_n = j | T_j = m) P_i(T_j = m).

Using the Markov property, we can write this as

p_{i,j}(n) = Σ_{m=1}^n P(X_n = j | X_m = j) P_i(T_j = m)

= Σ_{m=1}^n p_{j,j}(n − m) f_{i,j}(m).

We can multiply through by s^n and sum over all n to obtain

Σ_{n=1}^∞ p_{i,j}(n) s^n = Σ_{n=1}^∞ Σ_{m=1}^n p_{j,j}(n − m) s^{n−m} f_{i,j}(m) s^m.

The left hand side is almost the generating function P_{i,j}(s), except that we are missing an n = 0 term, which is p_{i,j}(0) = δ_{i,j}. The right hand side is the "convolution" of the power series P_{j,j}(s) and F_{i,j}(s), which we can write as the product P_{j,j}(s) F_{i,j}(s). So

P_{i,j}(s) − δ_{i,j} = P_{j,j}(s) F_{i,j}(s).

Before we actually prove our theorem, we need one helpful result from Analysis that allows us to deal with power series nicely.

Lemma (Abel's lemma). Let u_1, u_2, ··· be real numbers such that U(s) = Σ_n u_n s^n converges for 0 < s < 1. Then

lim_{s→1^−} U(s) = Σ_n u_n.

The proof is an exercise in analysis, which happens to be on the first example sheet of Analysis II.

We now prove the theorem we initially wanted to show.

Theorem. i is recurrent iff Σ_n p_{i,i}(n) = ∞.

Proof. Using j = i in the above formula, for 0 < s < 1, we have

P_{i,i}(s) = 1 / (1 − F_{i,i}(s)).

Here we need to be careful that we are not dividing by 0. This would be a problem if F_{i,i}(s) = 1. By definition, we have

F_{i,i}(s) = Σ_{n=1}^∞ f_{i,i}(n) s^n.

Also, by definition of f_{i,i}, we have

F_{i,i}(1) = Σ_n f_{i,i}(n) = P(ever returning to i) ≤ 1.

So for |s| < 1, F_{i,i}(s) < 1. So we are not dividing by zero. Now we use our original equation

P_{i,i}(s) = 1 / (1 − F_{i,i}(s)),

and take the limit as s → 1. By Abel's lemma, we know that the left hand side is

lim_{s→1} P_{i,i}(s) = P_{i,i}(1) = Σ_n p_{i,i}(n).

The other side is

lim_{s→1} 1 / (1 − F_{i,i}(s)) = 1 / (1 − Σ f_{i,i}(n)).

Hence we have

Σ_n p_{i,i}(n) = 1 / (1 − Σ f_{i,i}(n)).

Since Σ f_{i,i}(n) is the probability of ever returning, the probability of ever returning is 1 if and only if Σ_n p_{i,i}(n) = ∞.
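To see the criterion in action (an illustration I am adding, not in the lectures), take the walk on Z that steps +1 with probability p and −1 with probability 1 − p, for which p_{0,0}(2n) = C(2n, n) p^n (1 − p)^n. The sketch below accumulates partial sums of this series via the term-to-term ratio, to avoid forming enormous factorials; for p = 1/2 the partial sums keep growing (the series diverges, recurrence), while for a biased walk they level off (the series converges, transience):

```python
def partial_sum(p, n_terms):
    """Sum of p00(2n) = C(2n, n) p^n (1-p)^n for n = 0, ..., n_terms.
    Uses the ratio p00(2n+2)/p00(2n) = ((2n+1)(2n+2)/(n+1)^2) p (1-p)
    so no huge factorials (or underflowing powers) are formed."""
    term, total = 1.0, 1.0        # n = 0 term is p00(0) = 1
    for n in range(n_terms):
        term *= (2 * n + 1) * (2 * n + 2) / (n + 1) ** 2 * p * (1 - p)
        total += term
    return total

# Symmetric walk: partial sums grow without bound.
sym_small, sym_large = partial_sum(0.5, 500), partial_sum(0.5, 2000)
# Biased walk: partial sums stabilise, so the state 0 is transient.
bias_small, bias_large = partial_sum(0.8, 500), partial_sum(0.8, 2000)
```

This matches the d = 1 computation in the proof of Pólya's theorem below in the symmetric case, and shows how a drift destroys recurrence.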

Using this result, we can check if a state is recurrent. However, a Markov

chain has many states, and it would be tedious to check every state individually.

Thus we have the following helpful result.

Theorem. Let C be a communicating class. Then

(i) Either every state in C is recurrent, or every state is transient.

(ii) If C contains a recurrent state, then C is closed.

Proof.

(i) Let i ↔ j and i ≠ j. Then by definition of communicating, there is some m such that p_{i,j}(m) = α > 0, and some n such that p_{j,i}(n) = β > 0. So for each k, we have

p_{i,i}(m + k + n) ≥ p_{i,j}(m) p_{j,j}(k) p_{j,i}(n) = αβ p_{j,j}(k).

So if Σ_k p_{j,j}(k) = ∞, then Σ_r p_{i,i}(r) = ∞. So j recurrent implies i recurrent. Similarly, i recurrent implies j recurrent.

(ii) If C is not closed, then there is a non-zero probability that we leave the class and never get back. So the states are not recurrent.

Note that there is a profound difference between a finite state space and an infinite state space. A finite state space can be represented by a finite matrix, and we are all very familiar with finite matrices. We can use everything we know about finite matrices from IA Vectors and Matrices. However, infinite matrices are weirder.

For example, any finite transition matrix P has an eigenvalue of 1. This is since the row sums of a transition matrix are always 1. So if we multiply P by e = (1, 1, ··· , 1), then we get e again. However, this is not true for infinite matrices, since we don't usually allow arbitrary infinite vectors. To avoid getting infinitely large numbers when multiplying vectors and matrices, we usually restrict our focus to vectors x such that Σ x_i^2 is finite. In this case the vector e is not allowed, and the transition matrix need not have eigenvalue 1.

Another thing about a finite state space is that probability “cannot escape”.

Each step of a Markov chain gives a probability distribution on the state space,

and we can imagine the progression of the chain as a flow of probabilities around

the state space. If we have a finite state space, then all the probability flow must

be contained within our finite state space. However, if we have an infinite state

space, then probabilities can just drift away to infinity.

More concretely, we get the following result about finite state spaces.

Theorem. In a finite state space,

(i) There exists at least one recurrent state.

(ii) If the chain is irreducible, every state is recurrent.

Proof. (ii) follows immediately from (i), since if a chain is irreducible, either all states are transient or all states are recurrent. So we just have to prove (i).

We first fix an arbitrary i. Recall that

P_{i,j}(s) = δ_{i,j} + P_{j,j}(s) F_{i,j}(s).

If j is transient, then Σ_n p_{j,j}(n) = P_{j,j}(1) < ∞. Also, F_{i,j}(1) is the probability of ever reaching j from i, and is hence finite as well. So we have P_{i,j}(1) < ∞. By Abel's lemma, P_{i,j}(1) is given by

P_{i,j}(1) = Σ_n p_{i,j}(n).

Since this is finite, we must have p_{i,j}(n) → 0.

Since we know that

Σ_{j∈S} p_{i,j}(n) = 1,

if every state is transient, then since the sum is finite, we know Σ_j p_{i,j}(n) → 0 as n → ∞. This is a contradiction. So we must have a recurrent state.

Theorem (Pólya's theorem). Consider Z^d = {(x_1, x_2, ··· , x_d) : x_i ∈ Z}. This generates a graph with x adjacent to y if |x − y| = 1, where | · | is the Euclidean norm.

(The original notes picture the lattices for d = 1 and d = 2 here.)

Consider a random walk in Z^d. At each step, it moves to a neighbour, each chosen with equal probability, i.e.

P(X_{n+1} = j | X_n = i) = 1/(2d) if |j − i| = 1, and 0 otherwise.

This is an irreducible chain, since it is possible to get from one point to any other point. So the points are either all recurrent or all transient.

The theorem says this is recurrent iff d = 1 or 2.

Intuitively, it makes sense that we get recurrence only in low dimensions, since if we have more dimensions, then it is easier to get lost.

Proof. We will start with the case d = 1. We want to show that Σ p_{0,0}(n) = ∞. Then we know the origin is recurrent. However, we can simplify this a bit. It is impossible to get back to the origin in an odd number of steps. So we can instead consider Σ p_{0,0}(2n). However, we can write down this expression immediately. To return to the origin after 2n steps, we need to have made n steps to the left, and n steps to the right, in any order. So we have

p_{0,0}(2n) = P(n steps left, n steps right) = C(2n, n) (1/2)^{2n}.

To show whether this converges, it is not helpful to work with these binomial coefficients and factorials. So we use Stirling's formula n! ≃ √(2πn) (n/e)^n. If we plug this in, we get

p_{0,0}(2n) ∼ 1/√(πn).

This tends to 0, but really slowly — even more slowly than the terms of the harmonic series. So we have Σ p_{0,0}(2n) = ∞.
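The Stirling estimate is easy to check numerically; this is a small sketch I am adding (not lectured), comparing the exact return probability with 1/√(πn):

```python
from math import comb, pi, sqrt

def p00_exact(n):
    # p_{0,0}(2n) = C(2n, n) / 2^{2n} for simple random walk on Z
    return comb(2 * n, n) / 4 ** n

exact = p00_exact(200)
approx = 1 / sqrt(pi * 200)   # the Stirling asymptotic 1/sqrt(pi n)
```

Already at n = 200 the two values agree to well under one percent.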

In the d = 2 case, suppose after 2n steps, I have taken r steps right, ℓ steps left, u steps up and d steps down. We must have r + ℓ + u + d = 2n, and we need r = ℓ, u = d to return to the origin. So we let r = ℓ = m, u = d = n − m. So we get

p_{0,0}(2n) = (1/4)^{2n} Σ_{m=0}^n (2n)! / (m! m! (n − m)! (n − m)!)

= (1/4)^{2n} C(2n, n) Σ_{m=0}^n (n! / (m! (n − m)!))^2

= (1/4)^{2n} C(2n, n) Σ_{m=0}^n C(n, m) C(n, n − m).

We now use a well-known identity (proved in IA Numbers and Sets) to obtain

p_{0,0}(2n) = (1/4)^{2n} C(2n, n) C(2n, n) = [ C(2n, n) (1/2)^{2n} ]^2 ∼ 1/(πn).

So the sum diverges. So this is recurrent. Note that the two-dimensional probability turns out to be the square of the one-dimensional probability. This is not a coincidence, and we will explain this after the proof. However, this does not extend to higher dimensions.

In the d = 3 case, we have

p_{0,0}(2n) = (1/6)^{2n} Σ_{i+j+k=n} (2n)! / (i! j! k!)^2.

This time, there is no neat combinatorial formula. Since we want to show this is summable, we can try to bound it from above. We have

p_{0,0}(2n) = (1/6)^{2n} C(2n, n) Σ_{i+j+k=n} (n! / (i! j! k!))^2

= (1/2)^{2n} C(2n, n) Σ_{i+j+k=n} (n! / (3^n i! j! k!))^2.

Why do we write it like this? We are going to use the identity

Σ_{i+j+k=n} n! / (3^n i! j! k!) = 1.

Where does this come from? Suppose we have three urns, and throw n balls into them. Then the probability of getting i balls in the first, j in the second and k in the third is exactly n! / (3^n i! j! k!). Summing over all possible combinations of i, j and k gives the total probability of getting any configuration, which is 1. So we can bound this by

p_{0,0}(2n) ≤ (1/2)^{2n} C(2n, n) · max( n! / (3^n i! j! k!) ) · Σ n! / (3^n i! j! k!)

= (1/2)^{2n} C(2n, n) · max( n! / (3^n i! j! k!) ).

To find the maximum, we could replace the factorials by the gamma function and use Lagrange multipliers. However, we will just argue that the maximum is achieved when i, j and k are as close to each other as possible. So we get

p_{0,0}(2n) ≤ (1/2)^{2n} C(2n, n) · n! / (3^n (⌊n/3⌋!)^3) ≤ C n^{−3/2}

for some constant C, using Stirling's formula. So Σ p_{0,0}(2n) < ∞ and the chain is transient. We can prove similarly for higher dimensions.

Let's get back to why the two-dimensional probability is the square of the one-dimensional probability. This square might remind us of independence. However, it is obviously not true that horizontal movement and vertical movement are independent — if we go sideways in one step, then we cannot move vertically. So we need something more sophisticated.

We write X_n = (A_n, B_n). What we do is that we try to rotate our space. We are going to record our coordinates in a pair of axes that is rotated by 45°.

(The original notes picture the (A, B) axes together with the rotated (U, V) axes, on which the point (A_n, B_n) has coordinates U_n/√2 and V_n/√2.)

We can define the new coordinates as

U_n = A_n − B_n

V_n = A_n + B_n.

In each step, either A_n or B_n changes by one step. So U_n and V_n both change by 1. Moreover, they are independent. So we have

p_{0,0}(2n) = P(A_{2n} = B_{2n} = 0)

= P(U_{2n} = V_{2n} = 0)

= P(U_{2n} = 0) P(V_{2n} = 0)

= [ C(2n, n) (1/2)^{2n} ]^2.
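The rotation argument predicts that the two-dimensional return probability is the square of the one-dimensional one; the sketch below (my addition) checks this against the direct multinomial sum from the proof, using (2n)!/(m!² ((n − m)!)²) = C(2n, n) C(n, m)²:

```python
from math import comb

def p00_1d(n):
    # one-dimensional return probability C(2n, n) (1/2)^{2n}
    return comb(2 * n, n) / 4 ** n

def p00_2d(n):
    # direct sum over m = r = l right/left pairs (and n - m up/down pairs);
    # the multinomial (2n)!/(m! m! (n-m)! (n-m)!) equals C(2n,n) C(n,m)^2
    total = sum(comb(2 * n, n) * comb(n, m) ** 2 for m in range(n + 1))
    return total / 4 ** (2 * n)

vals = [(p00_2d(n), p00_1d(n) ** 2) for n in (1, 2, 5, 10)]
```

The two columns agree exactly, which is the Vandermonde identity Σ_m C(n, m)² = C(2n, n) in disguise.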

2.3 Hitting probabilities

Recurrence and transience tell us if we are going to return to the original state with (almost) certainty. Often, we would like to know something more quantitative. What is the actual probability of returning to the state i? If we return, what is the expected duration of returning?

We can formulate this in a more general setting. Let S be our state space, and let A ⊆ S. We want to know how likely and how long it takes for us to reach A. For example, if we are in a casino, we want to win, say, a million, and don't want to go bankrupt. So we would like to know the probability of reaching A = {1 million} and A = {0}.

Definition (Hitting time). The hitting time of A ⊆ S is the random variable H_A = min{n ≥ 0 : X_n ∈ A}. In particular, if we start in A, then H_A = 0. We also have

h_i^A = P_i(H_A < ∞) = P_i(ever reach A).

To determine hitting probabilities, we mostly rely on the following result:

Theorem. The vector (h_i^A : i ∈ S) satisfies

h_i^A = 1 if i ∈ A,
h_i^A = Σ_{j∈S} p_{i,j} h_j^A if i ∉ A,

and is minimal in that for any non-negative solution (x_i : i ∈ S) to these equations, we have h_i^A ≤ x_i for all i.

It is easy to show that h_i^A satisfies the formula given, but it takes some more work to show that h_i^A is minimal. Recall, however, that we have proved a similar result for random walks in IA Probability, and the proof is more-or-less the same.

Proof. By definition, h_i^A = 1 if i ∈ A. Otherwise, we have

h_i^A = P_i(H_A < ∞) = Σ_{j∈S} P_i(H_A < ∞ | X_1 = j) p_{i,j} = Σ_{j∈S} h_j^A p_{i,j}.

So h_i^A is indeed a solution to the equations.

To show that h_i^A is the minimal solution, suppose x = (x_i : i ∈ S) is a non-negative solution, i.e.

x_i = 1 if i ∈ A,
x_i = Σ_{j∈S} p_{i,j} x_j if i ∉ A.

If i ∈ A, we have h_i^A = x_i = 1. Otherwise, we can write

x_i = Σ_j p_{i,j} x_j = Σ_{j∈A} p_{i,j} x_j + Σ_{j∉A} p_{i,j} x_j = Σ_{j∈A} p_{i,j} + Σ_{j∉A} p_{i,j} x_j ≥ Σ_{j∈A} p_{i,j} = P_i(H_A = 1).

By iterating this process, we can write

x_i = Σ_{j∈A} p_{i,j} + Σ_{j∉A} p_{i,j} ( Σ_k p_{j,k} x_k )

= Σ_{j∈A} p_{i,j} + Σ_{j∉A} p_{i,j} ( Σ_{k∈A} p_{j,k} x_k + Σ_{k∉A} p_{j,k} x_k )

≥ P_i(H_A = 1) + Σ_{j∉A, k∈A} p_{i,j} p_{j,k}

= P_i(H_A = 1) + P_i(H_A = 2)

= P_i(H_A ≤ 2).

By induction, we obtain

x_i ≥ P_i(H_A ≤ n)

for all n. Taking the limit as n → ∞, we get

x_i ≥ P_i(H_A < ∞) = h_i^A.

So h_i^A is minimal.
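The minimality in the theorem suggests a computation: iterating the equations starting from the zero vector produces exactly the bounds P_i(H_A ≤ n) from the proof, which increase to the minimal solution. The code below is a sketch of this idea (my own example, not from the notes), on a symmetric gambler's-ruin chain truncated at 4:

```python
def hitting_probabilities(P, A, n_iter=2000):
    """Minimal non-negative solution of
        h_i = 1                      for i in A,
        h_i = sum_j p_{i,j} h_j      for i not in A,
    found by iterating the equations from h = 0 off A;
    the n-th iterate is P_i(H_A <= n), increasing to h_i."""
    n = len(P)
    h = [1.0 if i in A else 0.0 for i in range(n)]
    for _ in range(n_iter):
        h = [1.0 if i in A else sum(P[i][j] * h[j] for j in range(n))
             for i in range(n)]
    return h

# Symmetric gambler's ruin on {0, ..., 4}: states 0 and 4 absorbing.
P = [[1.0, 0, 0, 0, 0],
     [0.5, 0, 0.5, 0, 0],
     [0, 0.5, 0, 0.5, 0],
     [0, 0, 0.5, 0, 0.5],
     [0, 0, 0, 0, 1.0]]
h = hitting_probabilities(P, A={0})   # probability of ever hitting 0
```

Here h_i = 1 − i/4 for the interior states, and the absorbing state 4 (outside A) correctly gets h_4 = 0: iterating from zero picks out the minimal solution, whereas e.g. the constant vector 1 also solves the equations.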

The next question we want to ask is how long it will take for us to hit A. We want to find E_i(H_A) = k_i^A. Note that we have to be careful — if there is a chance that we never hit A, then H_A can be infinite, and E_i(H_A) = ∞. This occurs if h_i^A < 1. So often we are only interested in the case where h_i^A = 1 (note that h_i^A = 1 does not imply that k_i^A < ∞. It is merely a necessary condition).

We get a similar result characterizing the expected hitting time.

Theorem. $(k_i^A : i \in S)$ is the minimal non-negative solution to
\[
  k_i^A =
  \begin{cases}
    0 & i \in A\\
    1 + \sum_j p_{i,j} k_j^A & i \not\in A.
  \end{cases}
\]
Note that we have this "$1 +$" since when we move from $i$ to $j$, one step has already passed.

The proof is almost the same as the proof we had above.

Proof. The proof that $(k_i^A)$ satisfies the equations is the same as before.

Now let $(y_i : i \in S)$ be a non-negative solution. We show that $y_i \geq k_i^A$. If $i \in A$, we get $y_i = k_i^A = 0$. Otherwise, suppose $i \not\in A$. Then we have
\begin{align*}
  y_i &= 1 + \sum_j p_{i,j} y_j\\
      &= 1 + \sum_{j \in A} p_{i,j} y_j + \sum_{j \not\in A} p_{i,j} y_j\\
      &= 1 + \sum_{j \not\in A} p_{i,j} y_j\\
      &= 1 + \sum_{j \not\in A} p_{i,j} \left(1 + \sum_{k \not\in A} p_{j,k} y_k\right)\\
      &\geq 1 + \sum_{j \not\in A} p_{i,j}\\
      &= \mathbb{P}_i(H^A \geq 1) + \mathbb{P}_i(H^A \geq 2).
\end{align*}
By induction, we know that
\[
  y_i \geq \mathbb{P}_i(H^A \geq 1) + \cdots + \mathbb{P}_i(H^A \geq n)
\]
for all $n$. Let $n \to \infty$. Then we get
\[
  y_i \geq \sum_{m \geq 1} \mathbb{P}_i(H^A \geq m) = \sum_{m \geq 1} m\, \mathbb{P}_i(H^A = m) = k_i^A.
\]
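On a finite state space, both characterizations above are plain linear systems, so they are easy to check numerically. The following sketch (the 4-state chain is a made-up example, not one from the notes) solves for the hitting probabilities and mean hitting times with $A = \{0\}$; on a finite chain in which $A$ is reachable from every state, the solution of each system is unique, so a direct linear solve recovers the minimal solution.

```python
import numpy as np

# A made-up 4-state chain (states 0..3) with A = {0} absorbing; this example
# is for illustration and does not come from the notes.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
A = [0]
rest = [i for i in range(len(P)) if i not in A]   # states outside A

# Hitting probabilities: for i not in A,
#   h_i - sum_{j not in A} p_ij h_j = sum_{j in A} p_ij.
Q = P[np.ix_(rest, rest)]                  # transitions among states outside A
b = P[np.ix_(rest, A)].sum(axis=1)         # one-step probability of jumping into A
h = np.linalg.solve(np.eye(len(rest)) - Q, b)

# Mean hitting times: for i not in A,  k_i - sum_{j not in A} p_ij k_j = 1.
k = np.linalg.solve(np.eye(len(rest)) - Q, np.ones(len(rest)))

print(h)   # hitting probabilities h_1, h_2, h_3 (all 1 here: absorption is certain)
print(k)   # mean hitting times k_1, k_2, k_3
```

For this particular chain, absorption at $0$ is certain from every state, so all three hitting probabilities come out as $1$ while the mean hitting times are finite.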

Example (Gambler's ruin). This time, we will consider a random walk on $\mathbb{N}$. In each step, we either move to the right with probability $p$, or to the left with probability $q = 1 - p$. What is the probability of ever hitting $0$ from a given initial point? In other words, we want to find $h_i = h_i^{\{0\}}$.

We know $h_i$ is the minimal solution to
\[
  h_i =
  \begin{cases}
    1 & i = 0\\
    qh_{i-1} + ph_{i+1} & i \neq 0.
  \end{cases}
\]

What are the solutions to these equations? We can view this as a difference equation
\[
  ph_{i+1} - h_i + qh_{i-1} = 0, \quad i \geq 1,
\]
with the boundary condition that $h_0 = 1$. We all know how to solve difference equations, so let's just jump to the solution.

If $p \neq q$, i.e. $p \neq \frac{1}{2}$, then the solution has the form
\[
  h_i = A + B\left(\frac{q}{p}\right)^i
\]
for $i \geq 0$. If $p < q$, then for large $i$, $\left(\frac{q}{p}\right)^i$ is very large and blows up. However, since $h_i$ is a probability, it can never blow up. So we must have $B = 0$. So $h_i$ is constant. Since $h_0 = 1$, we have $h_i = 1$ for all $i$. So we always get to $0$.

If $p > q$, since $h_0 = 1$, we have $A + B = 1$. So
\[
  h_i = \left(\frac{q}{p}\right)^i + A\left(1 - \left(\frac{q}{p}\right)^i\right).
\]
This is in fact a solution for all $A$. So we want to find the smallest solution. As $i \to \infty$, we get $h_i \to A$. Since $h_i \geq 0$, we know that $A \geq 0$. Subject to this constraint, the minimum is attained when $A = 0$ (since $(q/p)^i$ and $(1 - (q/p)^i)$ are both positive). So we have
\[
  h_i = \left(\frac{q}{p}\right)^i.
\]
There is another way to solve this. We can give ourselves a ceiling, i.e. we also stop when we hit some $k > 0$, with boundary condition $h_k = 0$. We now have two boundary conditions and can find a unique solution. Then we take the limit as $k \to \infty$. This is the approach taken in IA Probability.

Here if $p = q$, then by the same arguments, we get $h_i = 1$ for all $i$.

Example (Birth-death chain). Let $(p_i : i \geq 1)$ be an arbitrary sequence such that $p_i \in (0, 1)$. We let $q_i = 1 - p_i$. We let $\mathbb{N}$ be our state space and define the transition probabilities to be
\[
  p_{i,i+1} = p_i, \quad p_{i,i-1} = q_i.
\]
This is a more general case of the random walk — in the random walk we have a constant $p_i$ sequence.

This is a general model for population growth, where the change in population depends on what the current population is. Here each "step" does not correspond to some unit time, since births and deaths occur rather randomly. Instead, we just make a "step" whenever some birth or death occurs, regardless of what time they occur.

Here, if we have no people left, then it is impossible for us to reproduce and get more population. So we have
\[
  p_{0,0} = 1.
\]

We say $0$ is absorbing, in that $\{0\}$ is closed. We let $h_i = h_i^{\{0\}}$. We know that
\[
  h_0 = 1, \quad p_i h_{i+1} - h_i + q_i h_{i-1} = 0, \quad i \geq 1.
\]
This is no longer a difference equation, since the coefficients depend on the index $i$. To solve this, we need magic. We rewrite this as
\[
  p_i h_{i+1} - h_i + q_i h_{i-1} = p_i h_{i+1} - (p_i + q_i)h_i + q_i h_{i-1} = p_i(h_{i+1} - h_i) - q_i(h_i - h_{i-1}).
\]
We let $u_i = h_{i-1} - h_i$ (picking $h_i - h_{i-1}$ might seem more natural, but this definition makes $u_i$ positive). Then our equation becomes
\[
  u_{i+1} = \frac{q_i}{p_i} u_i.
\]
We can iterate this to obtain
\[
  u_{i+1} = \frac{q_i}{p_i} \cdot \frac{q_{i-1}}{p_{i-1}} \cdots \frac{q_1}{p_1} u_1.
\]

We let
\[
  \gamma_i = \frac{q_1 q_2 \cdots q_i}{p_1 p_2 \cdots p_i}.
\]
Then we get $u_{i+1} = \gamma_i u_1$. For convenience, we let $\gamma_0 = 1$. Now we want to retrieve our $h_i$. We can do this by summing the equation $u_i = h_{i-1} - h_i$. So we get
\[
  h_0 - h_i = u_1 + u_2 + \cdots + u_i.
\]
Using the fact that $h_0 = 1$, we get
\[
  h_i = 1 - u_1(\gamma_0 + \gamma_1 + \cdots + \gamma_{i-1}).
\]
Here we have a parameter $u_1$, and we need to find out what this is. Our theorem tells us that the value of $u_1$ minimizes $h_i$. This all depends on the value of
\[
  S = \sum_{i=0}^\infty \gamma_i.
\]
By the law of excluded middle, $S$ either diverges or converges. If $S = \infty$, then we must have $u_1 = 0$. Otherwise, $h_i$ blows up for large $i$, but we know that $0 \leq h_i \leq 1$. If $S$ is finite, then $u_1$ can be non-zero. We know that the $\gamma_i$ are all positive. So to minimize $h_i$, we need to maximize $u_1$. We cannot make $u_1$ arbitrarily large, since this will make $h_i$ negative. To find the maximum possible value of $u_1$, we take the limit as $i \to \infty$. Then we know that the maximum value of $u_1$ satisfies
\[
  0 = 1 - u_1 S.
\]
In other words, $u_1 = 1/S$. So we have
\[
  h_i = \frac{\sum_{k=i}^\infty \gamma_k}{\sum_{k=0}^\infty \gamma_k}.
\]
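As a quick numerical check (not from the notes), we can truncate the two infinite sums. In the special case of a constant sequence $p_i = p$ we have $\gamma_i = (q/p)^i$, and the formula should reproduce the gambler's ruin answer $(q/p)^i$:

```python
# Truncated evaluation of h_i = (sum_{k >= i} gamma_k) / (sum_{k >= 0} gamma_k),
# in the special case of a constant sequence p_i = 2/3 (so gamma_i = (1/2)^i),
# where the gambler's ruin formula gives h_i = (1/2)^i. Truncating at N terms
# is an approximation; here the neglected tail is geometrically small.
N = 200
p, q = 2 / 3, 1 / 3
gamma = [(q / p) ** i for i in range(N)]   # gamma_i = (q_1 ... q_i)/(p_1 ... p_i)
S = sum(gamma)
h = [sum(gamma[i:]) / S for i in range(10)]
print(h[0], h[3])   # close to 1 and (1/2)^3 = 0.125
```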

2.4 The strong Markov property and applications

We are going to state the strong Markov property and see applications of it.

Before this, we should know what the weak Markov property is. We have, in fact,

already seen the weak Markov property. It’s just that we called it the “Markov

property” instead.

In probability, we often have “strong” and “weak” versions of things. For

example, we have the strong and weak law of large numbers. The difference is

that the weak versions are expressed in terms of probabilities, while the strong

versions are expressed in terms of random variables.

Initially, when people first started developing probability theory, they just

talked about probability distributions like the Poisson distribution or the normal

distribution. However, later it turned out it is often nicer to talk about random

variables instead. After messing with random variables, we can just take ex-

pectations or evaluate probabilities to get the corresponding statement about

probability distributions. Hence usually the “strong” versions imply the “weak”

version, but not the other way round.

In this case, recall that we defined the Markov property in terms of the probabilities at some fixed time. We have some fixed time $t$, and we want to know the probabilities of events after $t$ in terms of what happened before $t$. In the strong Markov property, we will allow the time $T$ to be a random variable, and say something slightly more involved. However, we will not allow $T$ to be any random variable, but just some nice ones.

Definition (Stopping time). Let $X$ be a Markov chain. A random variable $T$ (which is a function $\Omega \to \mathbb{N} \cup \{\infty\}$) is a stopping time for the chain $X = (X_n)$ if for $n \geq 0$, the event $\{T = n\}$ is given in terms of $X_0, \cdots, X_n$.

For example, suppose we are in a casino and gambling. We let $X_n$ be the amount of money we have at time $n$. Then we can set our stopping time as "the time when we have $10 left". This is a stopping time, in the sense that we can use this as a guide to when to stop — it is certainly possible to set yourself a guide that you should leave the casino when you have $10 left. However, it does not make sense to say "I will leave if the next game will make me bankrupt", since there is no way to tell if the next game will make you bankrupt (it certainly will not if you win the game!). Hence this is not a stopping time.

Example. The hitting time $H^A$ is a stopping time, since $\{H^A = n\} = \{X_i \not\in A \text{ for } i < n\} \cap \{X_n \in A\}$. We also know that $H^A + 1$ is a stopping time, since $\{H^A + 1 = n\}$ depends only on $X_i$ for $i \leq n - 1$. However, $H^A - 1$ is not a stopping time, since $\{H^A - 1 = n\}$ depends on $X_{n+1}$.

We can now state the strong Markov property, which is expressed in a rather complicated manner but can be useful at times.

Theorem (Strong Markov property). Let $X$ be a Markov chain with transition matrix $P$, and let $T$ be a stopping time for $X$. Given $T < \infty$ and $X_T = i$, the chain $(X_{T+k} : k \geq 0)$ is a Markov chain with transition matrix $P$ and initial state $X_{T+0} = i$, and this Markov chain is independent of $X_0, \cdots, X_T$.

The proof is omitted.

Example (Gambler's ruin). Again, this is the Markov chain taking values on the non-negative integers, moving to the right with probability $p$ and left with probability $q = 1 - p$. $0$ is an absorbing state, since we have no money left to bet if we are broke.

Instead of computing the probability of hitting zero, we want to find the time it takes to get to $0$, i.e.
\[
  H = \inf\{n \geq 0 : X_n = 0\}.
\]
Note that the infimum of the empty set is $+\infty$, i.e. if we never hit zero, we say it takes infinite time. What is the distribution of $H$? We define the generating function
\[
  G_i(s) = \mathbb{E}_i(s^H) = \sum_{n=0}^\infty s^n \mathbb{P}_i(H = n), \quad |s| < 1.
\]
Note that we need the requirement that $|s| < 1$, since it is possible that $H$ is infinite, and we would not want to think about whether the sum converges when $s = 1$. However, we know that it does converge for $|s| < 1$.

We have
\[
  G_1(s) = \mathbb{E}_1(s^H) = p\mathbb{E}_1(s^H \mid X_1 = 2) + q\mathbb{E}_1(s^H \mid X_1 = 0).
\]
How can we simplify this? The second term is easy, since if $X_1 = 0$, then we must have $H = 1$. So $\mathbb{E}_1(s^H \mid X_1 = 0) = s$. The first term is more tricky. We are now at $2$. To get to $0$, we need to pass through $1$. So the time needed to get to $0$ is the time to get from $2$ to $1$ (say $H'$), plus the time to get from $1$ to $0$ (say $H''$). We know that $H'$ and $H''$ have the same distribution as $H$, and by the strong Markov property, they are independent. So
\[
  G_1 = p\mathbb{E}_1(s^{H' + H'' + 1}) + qs = psG_1^2 + qs. \tag{$*$}
\]
Solving this quadratic, we get two solutions
\[
  G_1(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps}.
\]

We have to be careful here. This result says that for each value of $s$, $G_1(s)$ is either $\frac{1 + \sqrt{1 - 4pqs^2}}{2ps}$ or $\frac{1 - \sqrt{1 - 4pqs^2}}{2ps}$. It does not say that there is some consistent choice of $+$ or $-$ that works for everything.

However, we know that if we suddenly change the sign, then $G_1(s)$ will be discontinuous at that point, but $G_1$, being a power series, has to be continuous. So the solution must be either $+$ for all $s$, or $-$ for all $s$.

To determine the sign, we can look at what happens as $s \to 0$. We see that the numerator becomes $1 \pm 1$, while the denominator tends to $0$. We know that $G_1$ converges at $s = 0$. Hence the numerator must also tend to $0$. So we must pick $-$, i.e.
\[
  G_1(s) = \frac{1 - \sqrt{1 - 4pqs^2}}{2ps}.
\]

We can find $\mathbb{P}_1(H = k)$ by expanding the Taylor series.

What is the probability of ever hitting $0$? This is
\[
  \mathbb{P}_1(H < \infty) = \sum_{n=1}^\infty \mathbb{P}_1(H = n) = \lim_{s \to 1} G_1(s) = \frac{1 - \sqrt{1 - 4pq}}{2p}.
\]
We can rewrite this using the fact that $q = 1 - p$. So $1 - 4pq = 1 - 4p(1 - p) = (1 - 2p)^2 = |q - p|^2$. So we can write
\[
  \mathbb{P}_1(H < \infty) = \frac{1 - |p - q|}{2p} =
  \begin{cases}
    1 & p \leq q\\
    \frac{q}{p} & p > q.
  \end{cases}
\]

Using this, we can also find $\mu = \mathbb{E}_1(H)$. Firstly, if $p > q$, then it is possible that $H = \infty$. So $\mu = \infty$. If $p \leq q$, we can find $\mu$ by differentiating $G_1(s)$ and evaluating at $s = 1$. Doing this directly would result in horrible and messy algebra, which we want to avoid. Instead, we can differentiate $(*)$ and obtain
\[
  G_1' = pG_1^2 + 2psG_1 G_1' + q.
\]
We can rewrite this as
\[
  G_1'(s) = \frac{pG_1(s)^2 + q}{1 - 2psG_1(s)}.
\]
By Abel's lemma, we have
\[
  \mu = \lim_{s \to 1} G_1'(s) =
  \begin{cases}
    \infty & p = \frac{1}{2}\\
    \frac{1}{q - p} & p < \frac{1}{2}.
  \end{cases}
\]
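As a sanity check (not in the notes), we can estimate $\mathbb{E}_1(H)$ by simulation for some $p < \frac{1}{2}$ and compare with $1/(q - p)$:

```python
import random

# Monte Carlo estimate of mu = E_1(H) for p = 0.3 < 1/2, to compare against
# 1/(q - p) = 1/0.4 = 2.5. A plain simulation, not part of the notes.
random.seed(1)
p, trials = 0.3, 20_000
total = 0
for _ in range(trials):
    x, steps = 1, 0
    while x != 0:                     # for p < 1/2 the walk hits 0 almost surely
        x += 1 if random.random() < p else -1
        steps += 1
    total += steps
print(total / trials)                 # close to 2.5
```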

2.5 Further classification of states

So far, we have classified chains as, say, irreducible or reducible. We

have also seen that states can be recurrent or transient. By definition, a state

is recurrent if the probability of returning to it is 1. However, we can further

classify recurrent states. Even if a state is recurrent, it is possible that the

expected time of returning is infinite. So while we will eventually return to the

original state, this is likely to take a long, long time. The opposite behaviour is

also possible — the original state might be very attracting, and we are likely to

return quickly. It turns out this distinction can affect the long-term behaviour

of the chain.

First we have the following proposition, which tells us that if a state is

recurrent, then we are expected to return to it infinitely many times.

Theorem. Suppose $X_0 = i$. Let $V_i = |\{n \geq 1 : X_n = i\}|$ and let $f_{i,i} = \mathbb{P}_i(T_i < \infty)$. Then
\[
  \mathbb{P}_i(V_i = r) = f_{i,i}^r(1 - f_{i,i}),
\]
since we have to return $r$ times, each with probability $f_{i,i}$, and then never return.

Hence, if $f_{i,i} = 1$, then $\mathbb{P}_i(V_i = r) = 0$ for all $r$. So $\mathbb{P}_i(V_i = \infty) = 1$. Otherwise, $\mathbb{P}_i(V_i = r)$ is a genuine geometric distribution, and we get $\mathbb{P}_i(V_i < \infty) = 1$.

Proof. Exercise, using the strong Markov property.

Definition (Mean recurrence time). Let $T_i$ be the returning time to a state $i$. Then the mean recurrence time of $i$ is
\[
  \mu_i = \mathbb{E}_i(T_i) =
  \begin{cases}
    \infty & i \text{ transient}\\
    \sum_{n=1}^\infty n f_{i,i}(n) & i \text{ recurrent}.
  \end{cases}
\]
Definition (Null and positive state). If $i$ is recurrent, we call $i$ a null state if $\mu_i = \infty$. Otherwise $i$ is non-null, or positive.

This is mostly all we care about. However, there is still one more technical

consideration. Recall that in the random walk starting from the origin, we can

only return to the origin after an even number of steps. This causes problems for

a lot of our future results. For example, we will later look at the “convergence”

of Markov chains. If we are very likely to return to 0 after an even number of

steps, but it is impossible after an odd number of steps, we don’t get convergence.

Hence we would like to prevent this from happening.

Definition (Period). The period of a state $i$ is
\[
  d_i = \gcd\{n \geq 1 : p_{i,i}(n) > 0\}.
\]
A state is aperiodic if $d_i = 1$.

In general, we like aperiodic states. This is not a very severe restriction. For

example, in the random walk, we can get rid of periodicity by saying there is a

very small chance of staying at the same spot in each step. We can make this chance so small that we can ignore its existence for most practical purposes, while still getting rid of the technical problem of periodicity.

Definition (Ergodic state). A state $i$ is ergodic if it is aperiodic and positive recurrent.

These are the really nice states.

Recall that recurrence is a class property — if two states are in the same

communicating class, then they are either both recurrent, or both transient. Is

this true for the properties above as well? The answer is yes.

Theorem. If $i \leftrightarrow j$, then

(i) $d_i = d_j$.

(ii) $i$ is recurrent iff $j$ is recurrent.

(iii) $i$ is positive recurrent iff $j$ is positive recurrent.

(iv) $i$ is ergodic iff $j$ is ergodic.

Proof.

(i) Assume $i \leftrightarrow j$. Then there are $m, n \geq 1$ with $p_{i,j}(m), p_{j,i}(n) > 0$. By the Chapman-Kolmogorov equation, we know that
\[
  p_{i,i}(m + r + n) \geq p_{i,j}(m)p_{j,j}(r)p_{j,i}(n) \geq \alpha p_{j,j}(r),
\]
where $\alpha = p_{i,j}(m)p_{j,i}(n) > 0$. Now let $D_j = \{r \geq 1 : p_{j,j}(r) > 0\}$. Then by definition, $d_j = \gcd D_j$.

We have just shown that if $r \in D_j$, then $m + r + n \in D_i$. We also know that $m + n \in D_i$, since $p_{i,i}(m + n) \geq p_{i,j}(m)p_{j,i}(n) > 0$. Hence for any $r \in D_j$, we know that $d_i \mid m + r + n$, and also $d_i \mid m + n$. So $d_i \mid r$. Hence $\gcd D_i \mid \gcd D_j$. By symmetry, $\gcd D_j \mid \gcd D_i$ as well. So $\gcd D_i = \gcd D_j$.

(ii) Proved before.

(iii) This is deferred to a later time.

(iv) Follows directly from (i), (ii) and (iii) by definition.
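For a finite chain, the period of a state can be computed straight from the definition by scanning powers of the transition matrix. A small sketch (not from the notes; the fixed cut-off on how many powers to scan is a heuristic that is ample for small examples):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, n_max=50):
    """d_i = gcd{ n >= 1 : p_ii(n) > 0 }, computed by scanning powers of P.
    Scanning up to a fixed n_max is a heuristic cut-off, plenty for small chains."""
    Pn = np.eye(len(P))
    lengths = []
    for n in range(1, n_max + 1):
        Pn = Pn @ P                 # Pn now holds the n-step probabilities
        if Pn[i, i] > 0:
            lengths.append(n)
    return reduce(gcd, lengths, 0)

# Simple random walk on a 4-cycle: every return to a state takes an even
# number of steps, so each state has period 2.
C4 = np.array([[0, .5, 0, .5],
               [.5, 0, .5, 0],
               [0, .5, 0, .5],
               [.5, 0, .5, 0]])
print(period(C4, 0))                          # 2
print(period(0.9 * C4 + 0.1 * np.eye(4), 0))  # 1: a lazy step kills periodicity
```

The second call illustrates the remark above: mixing in a small probability of staying put makes every state aperiodic.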

We also have the following proposition, which we will use later on:

Proposition. If the chain is irreducible and $j \in S$ is recurrent, then
\[
  \mathbb{P}(X_n = j \text{ for some } n \geq 1) = 1,
\]
regardless of the distribution of $X_0$.

Note that this is not just the definition of recurrence, since recurrence says that if we start at $j$, then we will return to $j$. Here we are saying that wherever we start, we will eventually visit $j$.

Proof. Let
\[
  f_{i,j} = \mathbb{P}_i(X_n = j \text{ for some } n \geq 1).
\]
Since $j \to i$, there exists a least integer $m \geq 1$ with $p_{j,i}(m) > 0$. Since $m$ is least, we know that
\[
  p_{j,i}(m) = \mathbb{P}_j(X_m = i,\ X_r \neq j \text{ for } r < m).
\]
This is since we cannot visit $j$ in the path, or else we can truncate our path and get a shorter path from $j$ to $i$. Then
\[
  p_{j,i}(m)(1 - f_{i,j}) \leq 1 - f_{j,j}.
\]
This is since the left hand side is the probability that we first go from $j$ to $i$ in $m$ steps, and then never go from $i$ to $j$ again; while the right is just the probability of never returning to $j$ starting from $j$; and we know that it is easier to just not get back to $j$ than to go to $i$ in exactly $m$ steps and never return to $j$. Hence if $f_{j,j} = 1$, then $f_{i,j} = 1$.

Now let $\lambda_k = \mathbb{P}(X_0 = k)$ be our initial distribution. Then
\[
  \mathbb{P}(X_n = j \text{ for some } n \geq 1) = \sum_i \lambda_i \mathbb{P}_i(X_n = j \text{ for some } n \geq 1) = 1.
\]

3 Long-run behaviour

3.1 Invariant distributions

We want to look at what happens in the long run. Recall that at the very beginning of the course, we calculated the transition probabilities of the two-state Markov chain, and saw that in the long run, as $n \to \infty$, the probability distribution of the $X_n$ will "converge" to some particular distribution. Moreover, this limit does not depend on where we started. We'd like to see if this is true for all Markov chains.

First of all, we want to make it clear what we mean by the chain "converging" to something. When we are dealing with real sequences $x_n$, we have a precise definition of what it means for $x_n \to x$. How can we define the convergence of a sequence of random variables $X_n$? These are not proper numbers, so we cannot just apply our usual definitions.

For the purposes of our investigation of Markov chains here, it turns out that the right way to think about convergence is to look at the probability mass function. For each $k \in S$, we ask if $\mathbb{P}(X_n = k)$ converges to anything.

In most cases, $\mathbb{P}(X_n = k)$ converges to something. Hence, this is not an interesting question to ask. What we would really want to know is whether the limit is a probability mass function. It is, for example, possible that $\mathbb{P}(X_n = k) \to 0$ for all $k$, in which case the limit is not a distribution.

From Analysis, we know there are, in general, two ways to prove something converges — we either "guess" a limit and then prove it does indeed converge to that limit, or we prove the sequence is Cauchy. It would be rather silly to prove that these probabilities form a Cauchy sequence, so we will attempt to guess the limit. The limit will be known as an invariant distribution, for reasons that will become obvious shortly.

The main focus of this section is to study the existence and properties of invariant distributions, and we will provide sufficient conditions for convergence to occur in the next.

Recall that if we have a starting distribution $\lambda$, then the distribution of the $n$th step is given by $\lambda P^n$. We have the following trivial identity:
\[
  \lambda P^{n+1} = (\lambda P^n)P.
\]
If the distribution converges, then we have $\lambda P^n \to \pi$ for some $\pi$, and also $\lambda P^{n+1} \to \pi$. Hence the limit $\pi$ satisfies
\[
  \pi P = \pi.
\]
We call these invariant distributions.
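On a finite state space, $\pi P = \pi$ together with the normalisation $\sum_k \pi_k = 1$ is a linear system we can solve directly. A minimal sketch (the two-state chain with parameters $\alpha, \beta$ is a standard example chosen for checkability, not taken verbatim from this section):

```python
import numpy as np

# Two-state chain: 0 -> 1 with probability alpha, 1 -> 0 with probability beta.
# The known invariant distribution is pi = (beta, alpha) / (alpha + beta).
alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# pi is a left eigenvector of P with eigenvalue 1: solve (P^T - I) pi = 0,
# replacing one redundant equation by the normalisation sum(pi) = 1.
M = P.T - np.eye(2)
M[1] = 1.0                      # overwrite last row with the constraint row (1, 1)
pi = np.linalg.solve(M, np.array([0.0, 1.0]))
print(pi)                       # [0.25, 0.75] = (beta, alpha) / (alpha + beta)
```

Multiplying `pi` by `P` returns `pi` again, which is exactly the fixed-point property derived above.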

Definition (Invariant distribution). Let $X$ be a Markov chain with transition probabilities $P$. The distribution $\pi = (\pi_k : k \in S)$ is an invariant