3.2 Inequalities
Here we prove a lot of different inequalities which may be useful for certain
calculations. In particular, Chebyshev’s inequality will allow us to prove the
weak law of large numbers.
Definition (Convex function). A function f : (a, b) → R is convex if for all
x_1, x_2 ∈ (a, b) and λ_1, λ_2 ≥ 0 such that λ_1 + λ_2 = 1,

    λ_1 f(x_1) + λ_2 f(x_2) ≥ f(λ_1 x_1 + λ_2 x_2).
It is strictly convex if the inequality above is strict (except when x_1 = x_2,
or λ_1 or λ_2 = 0).
[Figure: graph of a convex function, with x_1, x_2 and λ_1 x_1 + λ_2 x_2 marked on
the x-axis; the value λ_1 f(x_1) + λ_2 f(x_2) on the chord lies above the graph at
λ_1 x_1 + λ_2 x_2.]
A function is concave if −f is convex.
A useful criterion for convexity is
Proposition. If f is differentiable and f''(x) ≥ 0 for all x ∈ (a, b), then it is
convex. It is strictly convex if f''(x) > 0.
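As a purely illustrative sanity check of the definition (not needed for anything
that follows), one can test the chord inequality numerically for a convex function
such as f(x) = x^2, using only the Python standard library; the sample points and
weights below are arbitrary:

    import random

    f = lambda x: x * x                      # f''(x) = 2 > 0, so f is (strictly) convex

    for _ in range(10_000):
        x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
        lam = random.random()                # λ_1; take λ_2 = 1 - λ_1
        lhs = lam * f(x1) + (1 - lam) * f(x2)
        rhs = f(lam * x1 + (1 - lam) * x2)
        assert lhs >= rhs - 1e-9             # chord lies on or above the graph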
Theorem (Jensen’s inequality). If f : (a, b) → R is convex, then
    ∑_{i=1}^n p_i f(x_i) ≥ f( ∑_{i=1}^n p_i x_i )

for all p_1, p_2, ··· , p_n such that p_i ≥ 0 and ∑ p_i = 1, and x_i ∈ (a, b).
This says that E[f(X)] ≥ f(E[X]) (where P(X = x_i) = p_i).
If f is strictly convex, then equality holds only if all x_i are equal, i.e. X
takes only one possible value.
Proof. Induct on n. It is true for n = 2 by the definition of convexity. Then
    f(p_1 x_1 + ··· + p_n x_n) = f( p_1 x_1 + (p_2 + ··· + p_n) · (p_2 x_2 + ··· + p_n x_n)/(p_2 + ··· + p_n) )
                               ≤ p_1 f(x_1) + (p_2 + ··· + p_n) f( (p_2 x_2 + ··· + p_n x_n)/(p_2 + ··· + p_n) )
                               ≤ p_1 f(x_1) + (p_2 + ··· + p_n) [ p_2 f(x_2)/(p_2 + ··· + p_n) + ··· + p_n f(x_n)/(p_2 + ··· + p_n) ]
                               = p_1 f(x_1) + ··· + p_n f(x_n),

where the first inequality uses convexity (the n = 2 case) and the second uses the
induction hypothesis with weights p_i/(p_2 + ··· + p_n).
The strictly convex case is proved in the same way, with ≤ replaced by <, using
the definition of strict convexity.
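In expectation form, Jensen's inequality is easy to test numerically. The minimal
sketch below (illustrative only, standard library Python) checks E[f(X)] ≥ f(E[X])
for a randomly chosen finite distribution and the convex function f(x) = e^x; the
number of values and their range are arbitrary:

    import math, random

    f = math.exp                                     # convex: f''(x) = e^x > 0

    x = [random.uniform(-2, 2) for _ in range(6)]    # possible values of X
    w = [random.random() for _ in range(6)]
    p = [wi / sum(w) for wi in w]                    # probabilities: p_i >= 0, sum = 1

    E_fX = sum(pi * f(xi) for pi, xi in zip(p, x))   # E[f(X)]
    f_EX = f(sum(pi * xi for pi, xi in zip(p, x)))   # f(E[X])
    assert E_fX >= f_EX - 1e-9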
Corollary (AM-GM inequality). Given x_1, ··· , x_n positive reals, then

    ( ∏ x_i )^{1/n} ≤ (1/n) ∑ x_i.
Proof. Take f(x) = −log x. This is convex since its second derivative is
x^{−2} > 0. Take P(X = x_i) = 1/n. Then

    E[f(X)] = (1/n) ∑ (−log x_i) = −log GM

and

    f(E[X]) = −log( (1/n) ∑ x_i ) = −log AM.

Since f(E[X]) ≤ E[f(X)], we have −log AM ≤ −log GM, i.e. AM ≥ GM. Since −log x is
strictly convex, AM = GM only if all x_i are equal.
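A quick numerical check of AM ≥ GM (again purely illustrative; math.prod assumes
Python 3.8 or later, and the sample values are arbitrary):

    import math, random

    x = [random.uniform(0.1, 10) for _ in range(8)]  # positive reals
    am = sum(x) / len(x)                             # arithmetic mean
    gm = math.prod(x) ** (1 / len(x))                # geometric mean
    assert gm <= am + 1e-9

    # with all x_i equal, the two means coincide
    assert math.isclose(math.prod([3.0] * 5) ** (1 / 5), sum([3.0] * 5) / 5)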
Theorem (Cauchy-Schwarz inequality). For any two random variables X, Y,

    (E[XY])^2 ≤ E[X^2] E[Y^2].
Proof. If Y = 0, then both sides are 0. Otherwise, E[Y^2] > 0. Let

    w = X − Y · E[XY]/E[Y^2].

Then

    E[w^2] = E[ X^2 − 2XY · E[XY]/E[Y^2] + Y^2 · (E[XY])^2/(E[Y^2])^2 ]
           = E[X^2] − 2 (E[XY])^2/E[Y^2] + (E[XY])^2/E[Y^2]
           = E[X^2] − (E[XY])^2/E[Y^2].

Since E[w^2] ≥ 0, the Cauchy-Schwarz inequality follows.
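For random variables taking n values with equal probabilities, the inequality
reduces to the familiar Cauchy-Schwarz inequality for vectors; a brief sketch
(illustrative only, with arbitrarily chosen sample values):

    import random

    n = 10
    X = [random.gauss(0, 1) for _ in range(n)]       # X, Y on a uniform n-point space
    Y = [random.gauss(0, 1) for _ in range(n)]

    E = lambda Z: sum(Z) / n                         # expectation with weights 1/n
    lhs = E([a * b for a, b in zip(X, Y)]) ** 2      # (E[XY])^2
    rhs = E([a * a for a in X]) * E([b * b for b in Y])
    assert lhs <= rhs + 1e-9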
Theorem (Markov inequality). If X is a random variable with E|X| < ∞ and ε > 0,
then

    P(|X| ≥ ε) ≤ E|X|/ε.
Proof. We make use of the indicator function. We have
    I[|X| ≥ ε] ≤ |X|/ε.

This is proved by exhaustion: if |X| ≥ ε, then LHS = 1 and RHS ≥ 1; if |X| < ε,
then LHS = 0 and RHS is non-negative.
Take the expected value to obtain
    P(|X| ≥ ε) ≤ E|X|/ε.
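To get a feel for how tight the bound is, here is a minimal simulation (not part
of the proof; the exponential distribution and the threshold are arbitrary
choices) comparing the Markov bound with the actual tail probability:

    import random

    N, eps = 100_000, 3.0
    xs = [random.expovariate(1.0) for _ in range(N)]  # |X| ~ Exp(1), so E|X| = 1
    tail = sum(x >= eps for x in xs) / N              # estimate of P(|X| >= eps)
    bound = sum(xs) / N / eps                         # Markov bound E|X|/eps (estimated)
    print(tail, "<=", bound)                          # roughly 0.05 <= 0.33

The bound holds, but it can be far from the true probability.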
Similarly, we have
Theorem (Chebyshev inequality). If X is a random variable with E[X^2] < ∞ and
ε > 0, then

    P(|X| ≥ ε) ≤ E[X^2]/ε^2.
Proof. Again, we have
    I[|X| ≥ ε] ≤ X^2/ε^2.
Then take the expected value and the result follows.
Note that these are really powerful results, since they do not make any
assumptions about the distribution of X. On the other hand, if we know something
about the distribution, we can often get a better bound.
An important corollary is that if µ = E[X], then

    P(|X − µ| ≥ ε) ≤ E[(X − µ)^2]/ε^2 = (var X)/ε^2.
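As an illustration of how much better we can do when the distribution is known,
the sketch below (illustrative only; the standard normal and the threshold are
arbitrary choices) compares var X/ε^2 with an estimate of the true tail
probability:

    import random

    N, eps = 100_000, 2.0
    xs = [random.gauss(0, 1) for _ in range(N)]       # X ~ N(0, 1): mu = 0, var X = 1
    tail = sum(abs(x) >= eps for x in xs) / N         # estimate of P(|X - mu| >= eps)
    print(tail, "<=", 1 / eps ** 2)                   # roughly 0.05 <= 0.25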