3 Discrete random variables

3.1 Discrete random variables
Definition (Random variable). A random variable $X$ taking values in a set $\Omega_X$ is a function $X : \Omega \to \Omega_X$. $\Omega_X$ is usually a set of numbers, e.g. $\mathbb{R}$ or $\mathbb{N}$.
Intuitively, a random variable assigns a "number" (or a thing in $\Omega_X$) to each outcome (e.g. assign 6 to the outcome "dice roll gives 6").
Definition (Discrete random variables). A random variable is discrete if $\Omega_X$ is finite or countably infinite.
Notation. Let $T \subseteq \Omega_X$. Define
\[ P(X \in T) = P(\{\omega \in \Omega : X(\omega) \in T\}), \]
i.e. the probability that the outcome is in $T$.

Here, instead of talking about the probability of getting a particular outcome or event, we are concerned with the probability of a random variable taking a particular value. If $\Omega$ is itself countable, then we can write this as
\[ P(X \in T) = \sum_{\omega \in \Omega:\, X(\omega) \in T} p_\omega. \]
Example. Let $X$ be the value shown by rolling a fair die. Then $\Omega_X = \{1, 2, 3, 4, 5, 6\}$. We know that
\[ P(X = i) = \frac{1}{6}. \]
We call this the discrete uniform distribution.
Definition (Discrete uniform distribution). A discrete uniform distribution
is a discrete distribution with finitely many possible outcomes, in which each
outcome is equally likely.
Example. Suppose we roll two dice, and let the values obtained be $X$ and $Y$. Then the sum can be represented by $X + Y$, with $\Omega_{X+Y} = \{2, 3, \cdots, 12\}$.
This shows that we can add random variables to get a new random variable.
Notation. We write $P_X(x) = P(X = x)$.

We can also write $X \sim B(n, p)$ to mean
\[ P(X = r) = \binom{n}{r} p^r (1-p)^{n-r}, \]
and similarly for the other distributions we have come up with before.
Definition (Expectation). The expectation (or mean) of a real-valued $X$ is equal to
\[ E[X] = \sum_{\omega \in \Omega} p_\omega X(\omega), \]
provided this is absolutely convergent. Otherwise, we say the expectation doesn't exist. Alternatively,
\[ E[X] = \sum_{x \in \Omega_X} \sum_{\omega:\, X(\omega) = x} p_\omega X(\omega) = \sum_{x \in \Omega_X} x \sum_{\omega:\, X(\omega) = x} p_\omega = \sum_{x \in \Omega_X} x\, P(X = x). \]
We are sometimes lazy and just write $EX$.

This is the "average" value of $X$ we expect to get. Note that this definition only holds in the case where the sample space $\Omega$ is countable. If $\Omega$ is continuous (e.g. the whole of $\mathbb{R}$), then we have to define the expectation as an integral.
Example. Let $X$ be the sum of the outcomes of two dice. Then
\[ E[X] = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + \cdots + 12 \cdot \frac{1}{36} = 7. \]
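As a quick sanity check (a Python sketch, not part of the notes), we can tabulate the pmf of the sum and compute $\sum x\, P(X = x)$ directly:

```python
# Tabulate the pmf of X = sum of two fair dice and compute E[X] = sum of x * P(X = x).
from fractions import Fraction
from collections import Counter

pmf = Counter()
for d1 in range(1, 7):
    for d2 in range(1, 7):
        pmf[d1 + d2] += Fraction(1, 36)  # each ordered pair is equally likely

print(sum(x * p for x, p in pmf.items()))  # 7
```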
Note that $E[X]$ can be non-existent if the sum is not absolutely convergent. However, it is possible for the expected value to be infinite:
Example (St. Petersburg paradox). Suppose we play a game in which we keep tossing a coin until you get a tail. If you get a tail on the $i$th round, then I pay you $\$2^i$. The expected value is
\[ E[X] = \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 4 + \frac{1}{8} \cdot 8 + \cdots = \infty. \]
This means that on average, you can expect to get an infinite amount of money! In real life, though, people would hardly be willing to pay $20 to play this game. There are many ways to resolve this paradox, such as taking into account the fact that the host of the game has only finitely much money, and thus your real expected gain is much smaller.
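A small simulation (an illustration, not part of the notes) shows the paradox in action: the empirical average payout keeps drifting upwards instead of settling down, as one would expect when the expectation is infinite.

```python
import random

def play_once():
    """Toss a fair coin until the first tail; the payout is 2^i for a tail on round i."""
    i = 1
    while random.random() < 0.5:  # heads with probability 1/2
        i += 1
    return 2 ** i

random.seed(0)
for n in (10**2, 10**4, 10**6):
    avg = sum(play_once() for _ in range(n)) / n
    print(n, avg)  # the running average grows with n instead of converging
```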
Example. We calculate the expected values of different distributions:

(i) Poisson $P(\lambda)$. Let $X \sim P(\lambda)$. Then
\[ P_X(r) = \frac{\lambda^r e^{-\lambda}}{r!}. \]
So
\begin{align*}
E[X] &= \sum_{r=0}^\infty r\, P(X = r) \\
&= \sum_{r=0}^\infty \frac{r \lambda^r e^{-\lambda}}{r!} \\
&= \sum_{r=1}^\infty \frac{\lambda \cdot \lambda^{r-1} e^{-\lambda}}{(r-1)!} \\
&= \lambda \sum_{r=0}^\infty \frac{\lambda^r e^{-\lambda}}{r!} \\
&= \lambda.
\end{align*}
(ii) Let $X \sim B(n, p)$. Then
\begin{align*}
E[X] &= \sum_{r=0}^n r\, P(X = r) \\
&= \sum_{r=0}^n r \binom{n}{r} p^r (1-p)^{n-r} \\
&= \sum_{r=0}^n r \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r} \\
&= np \sum_{r=1}^n \frac{(n-1)!}{(r-1)![(n-1)-(r-1)]!} p^{r-1} (1-p)^{(n-1)-(r-1)} \\
&= np \sum_{r=0}^{n-1} \binom{n-1}{r} p^r (1-p)^{n-1-r} \\
&= np.
\end{align*}
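Both computations are easy to confirm numerically; a minimal Python sketch (the parameter values are arbitrary choices for illustration):

```python
import math

# Poisson P(lambda): truncate the series sum of r * lambda^r e^(-lambda) / r!
lam = 3.7
poisson_mean = sum(r * lam**r * math.exp(-lam) / math.factorial(r) for r in range(200))
print(poisson_mean)  # ~3.7, i.e. lambda

# Binomial B(n, p): the exact finite sum of r * C(n, r) p^r (1 - p)^(n - r)
n, p = 20, 0.3
binom_mean = sum(r * math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))
print(binom_mean)  # ~6.0, i.e. np
```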
Given a random variable $X$, we can create new random variables such as $X + 3$ or $X^2$. Formally, let $f : \mathbb{R} \to \mathbb{R}$ and $X$ be a real-valued random variable. Then $f(X)$ is a new random variable that maps $\omega \mapsto f(X(\omega))$.
Example. If $a, b, c$ are constants, then $a + bX$ and $(X - c)^2$ are random variables, defined as
\[ (a + bX)(\omega) = a + bX(\omega), \qquad (X - c)^2(\omega) = (X(\omega) - c)^2. \]
Theorem.
(i) If $X \geq 0$, then $E[X] \geq 0$.
(ii) If $X \geq 0$ and $E[X] = 0$, then $P(X = 0) = 1$.
(iii) If $a$ and $b$ are constants, then $E[a + bX] = a + bE[X]$.
(iv) If $X$ and $Y$ are random variables, then $E[X + Y] = E[X] + E[Y]$. This is true even if $X$ and $Y$ are not independent.
(v) $E[X]$ is the constant that minimizes $E[(X - c)^2]$ over $c$.
Proof.
(i) $X \geq 0$ means that $X(\omega) \geq 0$ for all $\omega$. Then
\[ E[X] = \sum_\omega p_\omega X(\omega) \geq 0. \]
(ii) If there exists $\omega$ such that $X(\omega) > 0$ and $p_\omega > 0$, then $E[X] > 0$. So $X(\omega) = 0$ whenever $p_\omega > 0$, i.e. $P(X = 0) = 1$.
(iii)
\[ E[a + bX] = \sum_\omega (a + bX(\omega)) p_\omega = a + b \sum_\omega p_\omega X(\omega) = a + bE[X]. \]
(iv)
\[ E[X + Y] = \sum_\omega p_\omega [X(\omega) + Y(\omega)] = \sum_\omega p_\omega X(\omega) + \sum_\omega p_\omega Y(\omega) = E[X] + E[Y]. \]
(v)
\begin{align*}
E[(X - c)^2] &= E[(X - E[X] + E[X] - c)^2] \\
&= E[(X - E[X])^2 + 2(E[X] - c)(X - E[X]) + (E[X] - c)^2] \\
&= E[(X - E[X])^2] + 0 + (E[X] - c)^2.
\end{align*}
This is clearly minimized when $c = E[X]$. Note that we obtained the zero in the middle because $E[X - E[X]] = E[X] - E[X] = 0$.
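To see the minimization in (v) concretely, here is a small sketch (an illustration only) that scans candidate values of $c$ for a fair die, where $E[X] = 3.5$:

```python
# E[(X - c)^2] for a fair die, minimized over a grid of candidate values of c.
pmf = {x: 1 / 6 for x in range(1, 7)}

def mse(c):
    return sum(p * (x - c) ** 2 for x, p in pmf.items())

best = min((k / 100 for k in range(100, 601)), key=mse)  # grid 1.00, 1.01, ..., 6.00
print(best)  # 3.5, i.e. E[X]
```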
An easy generalization of (iv) above is
Theorem. For any random variables $X_1, X_2, \cdots, X_n$ for which the following expectations exist,
\[ E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i]. \]
Proof.
\[ \sum_\omega p(\omega)[X_1(\omega) + \cdots + X_n(\omega)] = \sum_\omega p(\omega) X_1(\omega) + \cdots + \sum_\omega p(\omega) X_n(\omega). \]
Definition (Variance and standard deviation). The variance of a random variable $X$ is defined as
\[ \operatorname{var}(X) = E[(X - E[X])^2]. \]
The standard deviation is the square root of the variance, $\sqrt{\operatorname{var}(X)}$.
This is a measure of how "dispersed" the random variable $X$ is. If we have a low variance, then the value of $X$ is very likely to be close to $E[X]$.
Theorem.
(i) $\operatorname{var} X \geq 0$. If $\operatorname{var} X = 0$, then $P(X = E[X]) = 1$.
(ii) $\operatorname{var}(a + bX) = b^2 \operatorname{var}(X)$. This can be proved by expanding the definition and using the linearity of the expected value.
(iii) $\operatorname{var}(X) = E[X^2] - E[X]^2$, also proven by expanding the definition.
Example (Binomial distribution). Let $X \sim B(n, p)$ be a binomial distribution. Then $E[X] = np$. We also have
\begin{align*}
E[X(X-1)] &= \sum_{r=0}^n r(r-1) \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r} \\
&= n(n-1)p^2 \sum_{r=2}^n \binom{n-2}{r-2} p^{r-2} (1-p)^{(n-2)-(r-2)} \\
&= n(n-1)p^2.
\end{align*}
The sum goes to 1 since it is the sum of all probabilities of a binomial $B(n-2, p)$ distribution. So $E[X^2] = n(n-1)p^2 + E[X] = n(n-1)p^2 + np$, and
\[ \operatorname{var}(X) = E[X^2] - (E[X])^2 = np(1-p) = npq. \]
Example (Poisson distribution). If $X \sim P(\lambda)$, then $E[X] = \lambda$ and $\operatorname{var}(X) = \lambda$, since $P(\lambda)$ is the limit of $B(n, p)$ with $n \to \infty$, $p \to 0$, $np \to \lambda$.
Example (Geometric distribution). Suppose $P(X = r) = q^r p$ for $r = 0, 1, 2, \cdots$. Then
\begin{align*}
E[X] &= \sum_{r=0}^\infty r p q^r \\
&= pq \sum_{r=0}^\infty r q^{r-1} \\
&= pq \sum_{r=0}^\infty \frac{d}{dq} q^r \\
&= pq \frac{d}{dq} \sum_{r=0}^\infty q^r \\
&= pq \frac{d}{dq} \left(\frac{1}{1-q}\right) \\
&= \frac{pq}{(1-q)^2} \\
&= \frac{q}{p}.
\end{align*}
Then
\begin{align*}
E[X(X-1)] &= \sum_{r=0}^\infty r(r-1) p q^r \\
&= pq^2 \sum_{r=0}^\infty r(r-1) q^{r-2} \\
&= pq^2 \frac{d^2}{dq^2} \left(\frac{1}{1-q}\right) \\
&= \frac{2pq^2}{(1-q)^3}.
\end{align*}
So the variance is
\[ \operatorname{var}(X) = \frac{2pq^2}{(1-q)^3} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q}{p^2}. \]
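A Monte Carlo check (an illustration only) of both formulas, sampling from the geometric distribution by counting failures before the first success:

```python
import random

random.seed(1)
p = 0.4
q = 1 - p

def sample():
    """Number of failures before the first success, so P(X = r) = q^r * p."""
    r = 0
    while random.random() > p:
        r += 1
    return r

n = 200_000
xs = [sample() for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(mean, q / p)    # both ~1.5
print(var, q / p**2)  # both ~3.75
```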
Definition (Indicator function). The indicator function or indicator variable $I[A]$ (or $I_A$) of an event $A$ is
\[ I[A](\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases} \]
This indicator random variable is not interesting by itself. However, it is a
rather useful tool to prove results.
It has the following properties:
Proposition.
- $E[I[A]] = \sum_\omega p(\omega) I[A](\omega) = P(A)$.
- $I[A^C] = 1 - I[A]$.
- $I[A \cap B] = I[A]\, I[B]$.
- $I[A \cup B] = I[A] + I[B] - I[A]\, I[B]$.
- $I[A]^2 = I[A]$.

These are easy to prove from definition. In particular, the last property comes from the fact that $I[A]$ is either 0 or 1, and $0^2 = 0$, $1^2 = 1$.
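These identities can also be checked mechanically; a tiny sketch (an illustration only) over a ten-point uniform sample space:

```python
# Verify the indicator identities pointwise on Omega = {0, ..., 9}.
omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
I = lambda S: {w: 1 if w in S else 0 for w in omega}  # indicator as a dict

IA, IB = I(A), I(B)
assert all(I(omega - A)[w] == 1 - IA[w] for w in omega)                  # I[A^c] = 1 - I[A]
assert all(I(A & B)[w] == IA[w] * IB[w] for w in omega)                  # I[A n B] = I[A]I[B]
assert all(I(A | B)[w] == IA[w] + IB[w] - IA[w] * IB[w] for w in omega)  # I[A u B]
assert all(IA[w] ** 2 == IA[w] for w in omega)                           # I[A]^2 = I[A]
print(sum(IA[w] for w in omega) / len(omega))  # 0.4 = P(A), i.e. E[I[A]] = P(A)
```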
Example. Let $2n$ people ($n$ husbands and $n$ wives, with $n > 2$) sit alternately man-woman around a table at random. Let $N$ be the number of couples sitting next to each other.

Let $A_i = [i\text{th couple sits together}]$. Then
\[ N = \sum_{i=1}^n I[A_i]. \]
Then
\[ E[N] = E\left[\sum I[A_i]\right] = \sum_{i=1}^n E[I[A_i]] = nE[I[A_1]] = nP(A_1) = n \cdot \frac{2}{n} = 2. \]
We also have
\begin{align*}
E[N^2] &= E\left[\left(\sum I[A_i]\right)^2\right] \\
&= E\left[\sum_i I[A_i]^2 + 2\sum_{i<j} I[A_i] I[A_j]\right] \\
&= nE[I[A_1]] + n(n-1)E[I[A_1]I[A_2]].
\end{align*}
We have
\[ E[I[A_1]I[A_2]] = P(A_1 \cap A_2) = \frac{2}{n}\left(\frac{1}{n-1} \cdot \frac{1}{n-1} + \frac{n-2}{n-1} \cdot \frac{2}{n-1}\right). \]
Plugging in, we ultimately obtain
\[ \operatorname{var}(N) = \frac{2(n-2)}{n-1}. \]
In fact, as $n \to \infty$, $N \to P(2)$, the Poisson distribution with parameter 2.
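The couples example is a nice target for simulation; the sketch below (an illustration only) seats the $n$ husbands on the even seats and the $n$ wives on the odd seats of a circular table, uniformly at random, and estimates $E[N]$ and $\operatorname{var}(N)$:

```python
import random

def couples_together(n):
    """Count couples sitting adjacently in one random alternating seating."""
    men, wives = list(range(n)), list(range(n))
    random.shuffle(men)
    random.shuffle(wives)
    seats = [None] * (2 * n)
    seats[0::2] = men    # men on even seats
    seats[1::2] = wives  # wives on odd seats
    count = 0
    for s in range(0, 2 * n, 2):  # husband at seat s has wives at s - 1 and s + 1
        if seats[s] == seats[(s + 1) % (2 * n)] or seats[s] == seats[s - 1]:
            count += 1
    return count

random.seed(2)
n, trials = 10, 100_000
samples = [couples_together(n) for _ in range(trials)]
m = sum(samples) / trials
v = sum((x - m) ** 2 for x in samples) / trials
print(m, 2.0)                    # E[N] = 2
print(v, 2 * (n - 2) / (n - 1))  # var(N) = 2(n-2)/(n-1)
```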
We can use these to prove the inclusion-exclusion formula:
Theorem (Inclusion-exclusion formula).
\begin{align*}
P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) &- \sum_{i_1 < i_2} P(A_{i_1} \cap A_{i_2}) + \sum_{i_1 < i_2 < i_3} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) \\
&- \cdots + (-1)^{n-1} P(A_1 \cap \cdots \cap A_n).
\end{align*}
Proof. Let $I_j$ be the indicator function for $A_j$. Write
\[ S_r = \sum_{i_1 < i_2 < \cdots < i_r} I_{i_1} I_{i_2} \cdots I_{i_r}, \]
and
\[ s_r = E[S_r] = \sum_{i_1 < \cdots < i_r} P(A_{i_1} \cap \cdots \cap A_{i_r}). \]
Then
\[ 1 - \prod_{j=1}^n (1 - I_j) = S_1 - S_2 + S_3 - \cdots + (-1)^{n-1} S_n. \]
So
\[ P\left(\bigcup_{1}^n A_j\right) = E\left[1 - \prod_{1}^n (1 - I_j)\right] = s_1 - s_2 + s_3 - \cdots + (-1)^{n-1} s_n. \]
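The formula is also easy to verify numerically for small concrete events; a minimal sketch (the sample space and events are arbitrary illustrations):

```python
from itertools import combinations

omega_size = 12
events = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {0, 4, 6, 7, 8}]  # arbitrary A_1, A_2, A_3
P = lambda S: len(S) / omega_size  # uniform measure on {0, ..., 11}

lhs = P(set().union(*events))
rhs = sum(
    (-1) ** (r - 1) * sum(P(set.intersection(*c)) for c in combinations(events, r))
    for r in range(1, len(events) + 1)
)
print(lhs, rhs)  # both 0.75
```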
We can extend the idea of independence to random variables. Two random
variables are independent if the value of the first does not affect the value of the
second.
Definition (Independent random variables). Let $X_1, X_2, \cdots, X_n$ be discrete random variables. They are independent iff for any $x_1, x_2, \cdots, x_n$,
\[ P(X_1 = x_1, \cdots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n). \]
Theorem. If $X_1, \cdots, X_n$ are independent random variables, and $f_1, \cdots, f_n$ are functions $\mathbb{R} \to \mathbb{R}$, then $f_1(X_1), \cdots, f_n(X_n)$ are independent random variables.
Proof. Note that given a particular $y_i$, there can be many different values of $x_i$ for which $f_i(x_i) = y_i$. When finding $P(f_i(X_i) = y_i)$, we need to sum over all $x_i$ such that $f_i(x_i) = y_i$. Then
\begin{align*}
P(f_1(X_1) = y_1, \cdots, f_n(X_n) = y_n) &= \sum_{x_1: f_1(x_1) = y_1} \cdots \sum_{x_n: f_n(x_n) = y_n} P(X_1 = x_1, \cdots, X_n = x_n) \\
&= \sum_{x_1: f_1(x_1) = y_1} \cdots \sum_{x_n: f_n(x_n) = y_n} \prod_{i=1}^n P(X_i = x_i) \\
&= \prod_{i=1}^n \sum_{x_i: f_i(x_i) = y_i} P(X_i = x_i) \\
&= \prod_{i=1}^n P(f_i(X_i) = y_i).
\end{align*}
Note that the switch from the second to third line is valid since they both expand to the same mess.
Theorem. If $X_1, \cdots, X_n$ are independent random variables and all the following expectations exist, then
\[ E\left[\prod X_i\right] = \prod E[X_i]. \]
Proof. Write $R_i$ for the range of $X_i$. Then
\begin{align*}
E\left[\prod_{1}^n X_i\right] &= \sum_{x_1 \in R_1} \cdots \sum_{x_n \in R_n} x_1 x_2 \cdots x_n \times P(X_1 = x_1, \cdots, X_n = x_n) \\
&= \prod_{i=1}^n \sum_{x_i \in R_i} x_i\, P(X_i = x_i) \\
&= \prod_{i=1}^n E[X_i].
\end{align*}
Corollary. Let $X_1, \cdots, X_n$ be independent random variables, and $f_1, f_2, \cdots, f_n$ be functions $\mathbb{R} \to \mathbb{R}$. Then
\[ E\left[\prod f_i(X_i)\right] = \prod E[f_i(X_i)]. \]
Theorem. If $X_1, X_2, \cdots, X_n$ are independent random variables, then
\[ \operatorname{var}\left(\sum X_i\right) = \sum \operatorname{var}(X_i). \]
Proof.
\begin{align*}
\operatorname{var}\left(\sum X_i\right) &= E\left[\left(\sum X_i\right)^2\right] - \left(E\left[\sum X_i\right]\right)^2 \\
&= E\left[\sum X_i^2 + \sum_{i \neq j} X_i X_j\right] - \left(\sum E[X_i]\right)^2 \\
&= \sum E[X_i^2] + \sum_{i \neq j} E[X_i]E[X_j] - \sum (E[X_i])^2 - \sum_{i \neq j} E[X_i]E[X_j] \\
&= \sum \left(E[X_i^2] - (E[X_i])^2\right).
\end{align*}
Corollary. Let $X_1, X_2, \cdots, X_n$ be independent identically distributed random variables (iid rvs). Then
\[ \operatorname{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\operatorname{var}(X_1). \]
Proof.
\begin{align*}
\operatorname{var}\left(\frac{1}{n}\sum X_i\right) &= \frac{1}{n^2}\operatorname{var}\left(\sum X_i\right) \\
&= \frac{1}{n^2}\sum \operatorname{var}(X_i) \\
&= \frac{1}{n^2} \cdot n \operatorname{var}(X_1) \\
&= \frac{1}{n}\operatorname{var}(X_1).
\end{align*}
This result is important in statistics: if we want to reduce the variance of our experimental results, we can repeat the experiment many times (corresponding to a large $n$), and the sample average will then have a small variance.
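A quick simulation (an illustration only) of this effect for die rolls, where $\operatorname{var}(X_1) = 35/12$:

```python
import random

random.seed(3)

def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (1, 10, 100):
    means = [sample_mean(n) for _ in range(20_000)]
    m = sum(means) / len(means)
    v = sum((x - m) ** 2 for x in means) / len(means)
    print(n, v, (35 / 12) / n)  # empirical variance of the sample mean vs. var(X_1)/n
```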
Example. Let $X_i$ be iid $B(1, p)$, i.e. $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Then
\[ Y = X_1 + X_2 + \cdots + X_n \sim B(n, p). \]
Since $\operatorname{var}(X_i) = E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1-p)$, we have $\operatorname{var}(Y) = np(1-p)$.
Example. Suppose we have two rods of unknown lengths $a, b$. We can measure the lengths, but the measurements are not accurate. Let $A$ and $B$ be the measured values. Suppose
\[ E[A] = a, \quad \operatorname{var}(A) = \sigma^2, \qquad E[B] = b, \quad \operatorname{var}(B) = \sigma^2. \]
We can measure more accurately by instead measuring $X = A + B$ (the rods laid end to end) and $Y = A - B$, each again an independent measurement with variance $\sigma^2$. Then we estimate $a$ and $b$ by
\[ \hat{a} = \frac{X + Y}{2}, \qquad \hat{b} = \frac{X - Y}{2}. \]
Then $E[\hat{a}] = a$ and $E[\hat{b}] = b$, i.e. they are unbiased. Also,
\[ \operatorname{var}(\hat{a}) = \frac{1}{4}\operatorname{var}(X + Y) = \frac{1}{4} \cdot 2\sigma^2 = \frac{1}{2}\sigma^2, \]
and similarly for $\hat{b}$. So we can measure the rods more accurately by measuring them together instead of separately.
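A final simulation sketch (an illustration only; the true lengths and the Gaussian noise model are assumptions chosen for concreteness) confirms that $\hat{a}$ is unbiased with variance $\sigma^2/2$:

```python
import random

random.seed(4)
a, b, sigma = 5.0, 3.0, 0.1  # assumed true lengths and measurement noise

def estimate_a():
    X = (a + b) + random.gauss(0, sigma)  # measure the rods laid end to end
    Y = (a - b) + random.gauss(0, sigma)  # measure the difference of the rods
    return (X + Y) / 2

n = 100_000
est = [estimate_a() for _ in range(n)]
m = sum(est) / n
v = sum((x - m) ** 2 for x in est) / n
print(m, a)             # ~5.0: unbiased
print(v, sigma**2 / 2)  # ~0.005, half the variance of a single measurement
```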