3 Discrete random variables
3.1 Discrete random variables
Definition (Random variable). A random variable $X$ taking values in a set $\Omega_X$ is a function $X : \Omega \to \Omega_X$. $\Omega_X$ is usually a set of numbers, e.g. $\mathbb{R}$ or $\mathbb{N}$.
Intuitively, a random variable assigns a “number” (or a thing in $\Omega_X$) to each event (e.g. assign 6 to the event “dice roll gives 6”).
Definition (Discrete random variables). A random variable is discrete if $\Omega_X$ is finite or countably infinite.
Notation. Let $T \subseteq \Omega_X$, define
\[
  P(X \in T) = P(\{\omega \in \Omega : X(\omega) \in T\}),
\]
i.e. the probability that the outcome is in $T$.
Here, instead of talking about the probability of getting a particular outcome or event, we are concerned with the probability of a random variable taking a particular value. If $\Omega$ is itself countable, then we can write this as
\[
  P(X \in T) = \sum_{\omega \in \Omega: X(\omega) \in T} p_\omega.
\]
Example. Let $X$ be the value shown by rolling a fair die. Then $\Omega_X = \{1, 2, 3, 4, 5, 6\}$. We know that
\[
  P(X = i) = \frac{1}{6}.
\]
We call this the discrete uniform distribution.
Definition (Discrete uniform distribution). A discrete uniform distribution
is a discrete distribution with finitely many possible outcomes, in which each
outcome is equally likely.
Example. Suppose we roll two dice, and let the values obtained be $X$ and $Y$. Then the sum can be represented by $X + Y$, with
\[
  \Omega_{X+Y} = \{2, 3, \cdots, 12\}.
\]
This shows that we can add random variables to get a new random variable.
Notation. We write
\[
  P_X(x) = P(X = x).
\]
We can also write $X \sim B(n, p)$ to mean
\[
  P(X = r) = \binom{n}{r} p^r (1 - p)^{n-r},
\]
and similarly for the other distributions we have come up with before.
Definition (Expectation). The expectation (or mean) of a real-valued $X$ is equal to
\[
  E[X] = \sum_{\omega \in \Omega} p_\omega X(\omega),
\]
provided this is absolutely convergent. Otherwise, we say the expectation doesn't exist. Alternatively,
\begin{align*}
  E[X] &= \sum_{x \in \Omega_X} \sum_{\omega: X(\omega) = x} p_\omega X(\omega)\\
       &= \sum_{x \in \Omega_X} x \sum_{\omega: X(\omega) = x} p_\omega\\
       &= \sum_{x \in \Omega_X} x P(X = x).
\end{align*}
We are sometimes lazy and just write $EX$.
This is the “average” value of $X$ we expect to get. Note that this definition only holds in the case where the sample space $\Omega$ is countable. If $\Omega$ is continuous (e.g. the whole of $\mathbb{R}$), then we have to define the expectation as an integral.
Example. Let $X$ be the sum of the outcomes of two dice. Then
\[
  E[X] = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + \cdots + 12 \cdot \frac{1}{36} = 7.
\]
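This kind of computation is easy to sanity-check by enumerating the sample space directly. Here is a minimal Python sketch (my own illustration, not part of the notes) that computes $E[X]$ for the two-dice sum exactly with rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair dice: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))  # p_omega = 1/36 for each outcome

# E[X] = sum over omega of p_omega * X(omega), with X = sum of the two dice
expectation = sum(p * (x + y) for x, y in outcomes)
print(expectation)  # 7
```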
Note that $E[X]$ can be non-existent if the sum is not absolutely convergent.
However, it is possible for the expected value to be infinite:
Example (St. Petersburg paradox). Suppose we play a game in which we keep tossing a coin until you get a tail. If you get a tail on the $i$th round, then I pay you $\$2^i$. The expected value is
\[
  E[X] = \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 4 + \frac{1}{8} \cdot 8 + \cdots = \infty.
\]
This means that on average, you can expect to get an infinite amount of money! In real life, though, people would hardly be willing to pay \$20 to play this game. There are many ways to resolve this paradox, such as taking into account the fact that the host of the game has only finitely much money, and thus your real expected gain is much smaller.
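To see the paradox concretely, one can simulate the game: the running average of the payouts keeps drifting upwards as the number of plays grows, rather than settling down to a limit. A rough sketch (illustration only):

```python
import random

def play():
    """Toss a fair coin until a tail; the payout is 2^i if the tail is on round i."""
    i = 1
    while random.random() < 0.5:  # heads with probability 1/2
        i += 1
    return 2 ** i

random.seed(0)
for n in [10**3, 10**5, 10**6]:
    mean = sum(play() for _ in range(n)) / n
    print(n, mean)  # the sample mean grows with n instead of converging
```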
Example. We calculate the expected values of different distributions:

(i) Poisson $P(\lambda)$. Let $X \sim P(\lambda)$. Then
\[
  P_X(r) = \frac{\lambda^r e^{-\lambda}}{r!}.
\]
So
\begin{align*}
  E[X] &= \sum_{r=0}^\infty r P(X = r)\\
       &= \sum_{r=0}^\infty \frac{r \lambda^r e^{-\lambda}}{r!}\\
       &= \lambda \sum_{r=1}^\infty \frac{\lambda^{r-1} e^{-\lambda}}{(r-1)!}\\
       &= \lambda \sum_{r=0}^\infty \frac{\lambda^r e^{-\lambda}}{r!}\\
       &= \lambda.
\end{align*}
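Because the Poisson tail decays very fast, truncating the series gives a quick numerical check that the mean is $\lambda$; a small sketch with the arbitrary choice $\lambda = 3$:

```python
import math

lam = 3.0  # an arbitrary choice of lambda for the check
# E[X] = sum over r of r * lambda^r e^{-lambda} / r!, truncated at a large r
mean = sum(r * lam**r * math.exp(-lam) / math.factorial(r) for r in range(100))
print(mean)  # approximately 3.0
```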
(ii) Let $X \sim B(n, p)$. Then
\begin{align*}
  E[X] &= \sum_{r=0}^n r P(X = r)\\
       &= \sum_{r=0}^n r \binom{n}{r} p^r (1 - p)^{n-r}\\
       &= \sum_{r=0}^n r \frac{n!}{r!(n - r)!} p^r (1 - p)^{n-r}\\
       &= np \sum_{r=1}^n \frac{(n-1)!}{(r-1)![(n-1) - (r-1)]!} p^{r-1} (1 - p)^{(n-1)-(r-1)}\\
       &= np \sum_{r=0}^{n-1} \binom{n-1}{r} p^r (1 - p)^{n-1-r}\\
       &= np.
\end{align*}
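The same check works for the binomial, where the sum is finite; with the arbitrary choice $n = 10$, $p = 0.3$, the sum below returns $np = 3$ (up to floating-point error):

```python
from math import comb

n, p = 10, 0.3  # arbitrary parameters for the check
mean = sum(r * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))
print(mean)  # 3.0 = n * p
```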
Given a random variable $X$, we can create new random variables such as $X + 3$ or $X^2$. Formally, let $f : \mathbb{R} \to \mathbb{R}$ and $X$ be a real-valued random variable. Then $f(X)$ is a new random variable that maps $\omega \mapsto f(X(\omega))$.
Example. If $a, b, c$ are constants, then $a + bX$ and $(X - c)^2$ are random variables, defined as
\[
  (a + bX)(\omega) = a + bX(\omega), \quad (X - c)^2(\omega) = (X(\omega) - c)^2.
\]
Theorem.

(i) If $X \geq 0$, then $E[X] \geq 0$.

(ii) If $X \geq 0$ and $E[X] = 0$, then $P(X = 0) = 1$.

(iii) If $a$ and $b$ are constants, then $E[a + bX] = a + bE[X]$.

(iv) If $X$ and $Y$ are random variables, then $E[X + Y] = E[X] + E[Y]$. This is true even if $X$ and $Y$ are not independent.

(v) $E[X]$ is a constant that minimizes $E[(X - c)^2]$ over $c$.
Proof.

(i) $X \geq 0$ means that $X(\omega) \geq 0$ for all $\omega$. Then
\[
  E[X] = \sum_\omega p_\omega X(\omega) \geq 0.
\]

(ii) If there exists $\omega$ such that $X(\omega) > 0$ and $p_\omega > 0$, then $E[X] > 0$. So $X(\omega) = 0$ whenever $p_\omega > 0$, i.e. $P(X = 0) = 1$.

(iii)
\[
  E[a + bX] = \sum_\omega (a + bX(\omega)) p_\omega = a + b \sum_\omega p_\omega X(\omega) = a + b E[X].
\]

(iv)
\[
  E[X + Y] = \sum_\omega p_\omega [X(\omega) + Y(\omega)] = \sum_\omega p_\omega X(\omega) + \sum_\omega p_\omega Y(\omega) = E[X] + E[Y].
\]

(v)
\begin{align*}
  E[(X - c)^2] &= E[(X - E[X] + E[X] - c)^2]\\
               &= E[(X - E[X])^2 + 2(E[X] - c)(X - E[X]) + (E[X] - c)^2]\\
               &= E[(X - E[X])^2] + 0 + (E[X] - c)^2.
\end{align*}
This is clearly minimized when $c = E[X]$. Note that we obtained the zero in the middle because $E[X - E[X]] = E[X] - E[X] = 0$.
An easy generalization of (iv) above is

Theorem. For any random variables $X_1, X_2, \cdots, X_n$, for which the following expectations exist,
\[
  E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i].
\]
Proof.
\[
  \sum_\omega p(\omega)[X_1(\omega) + \cdots + X_n(\omega)] = \sum_\omega p(\omega) X_1(\omega) + \cdots + \sum_\omega p(\omega) X_n(\omega).
\]
Definition (Variance and standard deviation). The variance of a random variable $X$ is defined as
\[
  \operatorname{var}(X) = E[(X - E[X])^2].
\]
The standard deviation is the square root of the variance, $\sqrt{\operatorname{var}(X)}$.
This is a measure of how “dispersed” the random variable $X$ is. If we have a low variance, then the value of $X$ is very likely to be close to $E[X]$.
Theorem.

(i) $\operatorname{var} X \geq 0$. If $\operatorname{var} X = 0$, then $P(X = E[X]) = 1$.

(ii) $\operatorname{var}(a + bX) = b^2 \operatorname{var}(X)$. This can be proved by expanding the definition and using the linearity of the expected value.

(iii) $\operatorname{var}(X) = E[X^2] - E[X]^2$, also proven by expanding the definition.
Example (Binomial distribution). Let $X \sim B(n, p)$ be a binomial distribution. Then $E[X] = np$. We also have
\begin{align*}
  E[X(X - 1)] &= \sum_{r=0}^n r(r - 1) \frac{n!}{r!(n - r)!} p^r (1 - p)^{n-r}\\
              &= n(n - 1)p^2 \sum_{r=2}^n \binom{n-2}{r-2} p^{r-2} (1 - p)^{(n-2)-(r-2)}\\
              &= n(n - 1)p^2.
\end{align*}
The sum goes to 1 since it is the sum of all probabilities of a binomial $B(n - 2, p)$. So, since $X^2 = X(X - 1) + X$, we have $E[X^2] = n(n-1)p^2 + E[X] = n(n-1)p^2 + np$. So
\[
  \operatorname{var}(X) = E[X^2] - (E[X])^2 = np(1 - p) = npq.
\]
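Again a direct summation confirms this; for the arbitrary choice $n = 10$, $p = 0.3$ the following returns $npq = 2.1$:

```python
from math import comb

n, p = 10, 0.3
pmf = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
EX = sum(r * pmf[r] for r in range(n + 1))
EX2 = sum(r * r * pmf[r] for r in range(n + 1))
print(EX2 - EX**2)  # 2.1 = n * p * (1 - p)
```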
Example (Poisson distribution). If $X \sim P(\lambda)$, then $E[X] = \lambda$, and $\operatorname{var}(X) = \lambda$, since $P(\lambda)$ is the limit of $B(n, p)$ with $n \to \infty$, $p \to 0$, $np \to \lambda$.
Example (Geometric distribution). Suppose $P(X = r) = q^r p$ for $r = 0, 1, 2, \cdots$. Then
\begin{align*}
  E[X] &= \sum_{r=0}^\infty r p q^r = pq \sum_{r=0}^\infty r q^{r-1}\\
       &= pq \sum_{r=0}^\infty \frac{d}{dq} q^r = pq \frac{d}{dq} \sum_{r=0}^\infty q^r\\
       &= pq \frac{d}{dq} \frac{1}{1 - q} = \frac{pq}{(1 - q)^2}\\
       &= \frac{q}{p}.
\end{align*}
Then
\begin{align*}
  E[X(X - 1)] &= \sum_{r=0}^\infty r(r - 1) p q^r = pq^2 \sum_{r=0}^\infty r(r - 1) q^{r-2}\\
              &= pq^2 \frac{d^2}{dq^2} \frac{1}{1 - q} = \frac{2pq^2}{(1 - q)^3}.
\end{align*}
So the variance is
\[
  \operatorname{var}(X) = \frac{2pq^2}{(1 - q)^3} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q}{p^2}.
\]
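A truncated sum checks both $E[X] = q/p$ and $\operatorname{var}(X) = q/p^2$ numerically; a sketch with the arbitrary choice $p = 0.25$:

```python
p = 0.25
q = 1 - p
# P(X = r) = q^r * p for r = 0, 1, 2, ...; truncate where the tail is negligible
EX = sum(r * q**r * p for r in range(1000))
EX2 = sum(r * r * q**r * p for r in range(1000))
print(EX, q / p)              # both 3.0
print(EX2 - EX**2, q / p**2)  # both 12.0
```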
Definition (Indicator function). The indicator function or indicator variable $I[A]$ (or $I_A$) of an event $A \subseteq \Omega$ is
\[
  I[A](\omega) =
  \begin{cases}
    1 & \omega \in A\\
    0 & \omega \not\in A
  \end{cases}
\]
This indicator random variable is not interesting by itself. However, it is a rather useful tool to prove results.

It has the following properties:

Proposition.

– $E[I[A]] = \sum_\omega p(\omega) I[A](\omega) = P(A)$.

– $I[A^C] = 1 - I[A]$.

– $I[A \cap B] = I[A]I[B]$.

– $I[A \cup B] = I[A] + I[B] - I[A]I[B]$.

– $I[A]^2 = I[A]$.

These are easy to prove from the definition. In particular, the last property comes from the fact that $I[A]$ is either 0 or 1, and $0^2 = 0$, $1^2 = 1$.
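These identities can also be verified mechanically over all outcomes of a small example $\Omega$; a short sketch (illustration only):

```python
Omega = set(range(10))
A, B = {1, 2, 3}, {3, 4, 5}  # two arbitrary events

def I(S):
    """Indicator function of the event S, as a function on Omega."""
    return lambda w: 1 if w in S else 0

for w in Omega:
    assert I(Omega - A)(w) == 1 - I(A)(w)                        # I[A^c] = 1 - I[A]
    assert I(A & B)(w) == I(A)(w) * I(B)(w)                      # I[A n B] = I[A]I[B]
    assert I(A | B)(w) == I(A)(w) + I(B)(w) - I(A)(w) * I(B)(w)  # I[A u B]
    assert I(A)(w) ** 2 == I(A)(w)                               # I[A]^2 = I[A]
print("all identities hold on this Omega")
```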
Example. Let $2n$ people ($n$ husbands and $n$ wives, with $n > 2$) sit alternately man-woman around the table, randomly. Let $N$ be the number of couples sitting next to each other.

Let $A_i = [i\text{th couple sits together}]$. Then
\[
  N = \sum_{i=1}^n I[A_i].
\]
Then
\[
  E[N] = E\left[\sum I[A_i]\right] = \sum_1^n E[I[A_i]] = nE[I[A_1]] = nP(A_1) = n \cdot \frac{2}{n} = 2.
\]
We also have
\begin{align*}
  E[N^2] &= E\left[\left(\sum I[A_i]\right)^2\right]\\
         &= E\left[\sum_i I[A_i]^2 + 2\sum_{i < j} I[A_i]I[A_j]\right]\\
         &= nE[I[A_1]] + n(n - 1)E[I[A_1]I[A_2]].
\end{align*}
We have
\[
  E[I[A_1]I[A_2]] = P(A_1 \cap A_2) = \frac{2}{n}\left(\frac{1}{n-1} \cdot \frac{1}{n-1} + \frac{n-2}{n-1} \cdot \frac{2}{n-1}\right).
\]
Plugging in, we ultimately obtain
\[
  \operatorname{var}(N) = \frac{2(n-2)}{n-1}.
\]
In fact, as $n \to \infty$, $N \sim P(2)$.
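A Monte Carlo check of $E[N] = 2$ and $\operatorname{var}(N) = \frac{2(n-2)}{n-1}$: fix the husbands in alternate seats (rotations do not matter) and seat the wives uniformly at random in the remaining seats. This is a sketch of mine, not part of the notes:

```python
import random

def count_adjacent_couples(n):
    """Husbands occupy the even seats 0, 2, ..., 2n-2 of a round table with
    2n seats; the wives are placed uniformly at random in the odd seats."""
    wives = list(range(n))
    random.shuffle(wives)  # wives[k] is the wife sitting in odd seat 2k + 1
    count = 0
    for k in range(n):
        # Odd seat 2k+1 is adjacent to even seats 2k and 2k+2 (mod 2n),
        # i.e. to husbands k and (k + 1) mod n.
        if wives[k] == k or wives[k] == (k + 1) % n:
            count += 1
    return count

random.seed(1)
n, trials = 10, 100_000
samples = [count_adjacent_couples(n) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, var)  # approx 2 and 2*(n - 2)/(n - 1) = 16/9
```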
We can use these to prove the inclusion-exclusion formula:
Theorem (Inclusion-exclusion formula).
\begin{align*}
  P\left(\bigcup_{i=1}^n A_i\right) ={}& \sum_{i=1}^n P(A_i) - \sum_{i_1 < i_2} P(A_{i_1} \cap A_{i_2}) + \sum_{i_1 < i_2 < i_3} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots\\
  &+ (-1)^{n-1} P(A_1 \cap \cdots \cap A_n).
\end{align*}
Proof. Let $I_j$ be the indicator function for $A_j$. Write
\[
  S_r = \sum_{i_1 < i_2 < \cdots < i_r} I_{i_1} I_{i_2} \cdots I_{i_r},
\]
and
\[
  s_r = E[S_r] = \sum_{i_1 < \cdots < i_r} P(A_{i_1} \cap \cdots \cap A_{i_r}).
\]
Then
\[
  1 - \prod_{j=1}^n (1 - I_j) = S_1 - S_2 + S_3 - \cdots + (-1)^{n-1} S_n.
\]
So
\[
  P\left(\bigcup_1^n A_j\right) = E\left[1 - \prod_1^n (1 - I_j)\right] = s_1 - s_2 + s_3 - \cdots + (-1)^{n-1} s_n.
\]
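The formula is easy to test exhaustively on small examples; the sketch below compares both sides for a few random events in a finite uniform sample space:

```python
import random
from itertools import combinations

random.seed(0)
Omega = set(range(12))             # uniform: each outcome has probability 1/12
P = lambda S: len(S) / len(Omega)

events = [set(random.sample(sorted(Omega), 5)) for _ in range(4)]  # n = 4 events

lhs = P(set().union(*events))
rhs = 0.0
for r in range(1, len(events) + 1):
    for idx in combinations(range(len(events)), r):
        inter = set(Omega)
        for i in idx:
            inter &= events[i]
        rhs += (-1) ** (r - 1) * P(inter)
print(lhs, rhs)  # the two sides agree
```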
We can extend the idea of independence to random variables. Two random
variables are independent if the value of the first does not affect the value of the
second.
Definition (Independent random variables). Let $X_1, X_2, \cdots, X_n$ be discrete random variables. They are independent iff for any $x_1, x_2, \cdots, x_n$,
\[
  P(X_1 = x_1, \cdots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n).
\]
Theorem. If $X_1, \cdots, X_n$ are independent random variables, and $f_1, \cdots, f_n$ are functions $\mathbb{R} \to \mathbb{R}$, then $f_1(X_1), \cdots, f_n(X_n)$ are independent random variables.
Proof. Note that given a particular $y_i$, there can be many different $x_i$ for which $f_i(x_i) = y_i$. When finding $P(f_i(X_i) = y_i)$, we need to sum over all $x_i$ such that $f_i(x_i) = y_i$. Then
\begin{align*}
  P(f_1(X_1) = y_1, \cdots, f_n(X_n) = y_n) &= \sum_{x_1: f_1(x_1) = y_1} \cdots \sum_{x_n: f_n(x_n) = y_n} P(X_1 = x_1, \cdots, X_n = x_n)\\
  &= \sum_{x_1: f_1(x_1) = y_1} \cdots \sum_{x_n: f_n(x_n) = y_n} \prod_{i=1}^n P(X_i = x_i)\\
  &= \prod_{i=1}^n \sum_{x_i: f_i(x_i) = y_i} P(X_i = x_i)\\
  &= \prod_{i=1}^n P(f_i(X_i) = y_i).
\end{align*}
Note that the switch from the second to the third line is valid since they both expand to the same mess.
Theorem. If $X_1, \cdots, X_n$ are independent random variables and all the following expectations exist, then
\[
  E\left[\prod X_i\right] = \prod E[X_i].
\]

Proof. Write $R_i$ for the range of $X_i$. Then
\begin{align*}
  E\left[\prod_1^n X_i\right] &= \sum_{x_1 \in R_1} \cdots \sum_{x_n \in R_n} x_1 x_2 \cdots x_n \times P(X_1 = x_1, \cdots, X_n = x_n)\\
  &= \prod_{i=1}^n \sum_{x_i \in R_i} x_i P(X_i = x_i)\\
  &= \prod_{i=1}^n E[X_i].
\end{align*}
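For two independent fair dice, for instance, $E[XY] = E[X]E[Y]$ can be confirmed by exact enumeration:

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)  # each ordered pair of die faces is equally likely
EXY = sum(p * x * y for x, y in product(range(1, 7), repeat=2))
EX = Fraction(sum(range(1, 7)), 6)  # E[X] = 7/2 for one fair die
print(EXY, EX * EX)  # both 49/4
```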
Corollary. Let $X_1, \cdots, X_n$ be independent random variables, and $f_1, f_2, \cdots, f_n$ be functions $\mathbb{R} \to \mathbb{R}$. Then
\[
  E\left[\prod f_i(X_i)\right] = \prod E[f_i(X_i)].
\]
Theorem. If $X_1, X_2, \cdots, X_n$ are independent random variables, then
\[
  \operatorname{var}\left(\sum X_i\right) = \sum \operatorname{var}(X_i).
\]

Proof.
\begin{align*}
  \operatorname{var}\left(\sum X_i\right) &= E\left[\left(\sum X_i\right)^2\right] - \left(E\left[\sum X_i\right]\right)^2\\
  &= E\left[\sum X_i^2 + \sum_{i \neq j} X_i X_j\right] - \left(\sum E[X_i]\right)^2\\
  &= \sum E[X_i^2] + \sum_{i \neq j} E[X_i]E[X_j] - \sum (E[X_i])^2 - \sum_{i \neq j} E[X_i]E[X_j]\\
  &= \sum \left(E[X_i^2] - (E[X_i])^2\right)\\
  &= \sum \operatorname{var}(X_i).
\end{align*}
Corollary. Let $X_1, X_2, \cdots, X_n$ be independent identically distributed random variables (iid rvs). Then
\[
  \operatorname{var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\operatorname{var}(X_1).
\]

Proof.
\begin{align*}
  \operatorname{var}\left(\frac{1}{n}\sum X_i\right) &= \frac{1}{n^2}\operatorname{var}\left(\sum X_i\right)\\
  &= \frac{1}{n^2}\sum \operatorname{var}(X_i)\\
  &= \frac{1}{n^2} \cdot n \operatorname{var}(X_1)\\
  &= \frac{1}{n}\operatorname{var}(X_1).
\end{align*}
This result is important in statistics. This means that if we want to reduce the variance of our experimental results, then we can repeat the experiment many times (corresponding to a large $n$), and then the sample average will have a small variance.
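The $1/n$ decay is visible in simulation: below, the empirical variance of the sample mean of $n$ fair-die rolls is estimated for increasing $n$ (a sketch of mine, not from the notes):

```python
import random

random.seed(2)

def sample_mean(n):
    """Average of n fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in [1, 10, 100]:
    means = [sample_mean(n) for _ in range(20_000)]
    m = sum(means) / len(means)
    v = sum((x - m) ** 2 for x in means) / len(means)
    print(n, v)  # roughly (35/12)/n: about 2.92, 0.29, 0.029
```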
Example. Let $X_i$ be iid $B(1, p)$, i.e. $P(1) = p$ and $P(0) = 1 - p$. Then
\[
  Y = X_1 + X_2 + \cdots + X_n \sim B(n, p).
\]
Since $\operatorname{var}(X_i) = E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1 - p)$, we have $\operatorname{var}(Y) = np(1 - p)$.
Example. Suppose we have two rods of unknown lengths $a, b$. We can measure the lengths, but the measurements are not accurate. Let $A$ and $B$ be the measured values. Suppose
\[
  E[A] = a, \quad \operatorname{var}(A) = \sigma^2, \quad E[B] = b, \quad \operatorname{var}(B) = \sigma^2.
\]
We can measure more accurately by measuring $X = A + B$ and $Y = A - B$. Then we estimate $a$ and $b$ by
\[
  \hat{a} = \frac{X + Y}{2}, \quad \hat{b} = \frac{X - Y}{2}.
\]
Then $E[\hat{a}] = a$ and $E[\hat{b}] = b$, i.e. they are unbiased. Also
\[
  \operatorname{var}(\hat{a}) = \frac{1}{4}\operatorname{var}(X + Y) = \frac{1}{4} \cdot 2\sigma^2 = \frac{1}{2}\sigma^2,
\]
and similarly for $\hat{b}$. So we can measure more accurately by measuring the sticks together instead of separately.
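A quick simulation illustrates the gain. The notes do not specify an error distribution, so the sketch below assumes independent Gaussian errors with variance $\sigma^2$ on the two measurements; $\operatorname{var}(\hat{a})$ comes out near $\sigma^2/2$ rather than $\sigma^2$:

```python
import random

random.seed(3)
a, b, sigma = 10.0, 7.0, 0.5  # assumed true lengths and measurement noise
trials = 100_000

a_hats = []
for _ in range(trials):
    X = (a + b) + random.gauss(0, sigma)  # measure the rods together
    Y = (a - b) + random.gauss(0, sigma)  # measure their difference
    a_hats.append((X + Y) / 2)            # estimator a-hat

m = sum(a_hats) / trials
v = sum((x - m) ** 2 for x in a_hats) / trials
print(m, v)  # approx a = 10 (unbiased) and sigma^2 / 2 = 0.125
```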