1 Classical statistics
This is a course on modern statistical methods. Before we study methods, we
give a brief summary of what we are not going to talk about, namely classical
statistics.
So suppose we are doing regression. We have some predictors $x_i \in \mathbb{R}^p$ and
responses $Y_i \in \mathbb{R}$, and we hope to find a model that describes $Y$ as a function of
$x$. For convenience, define the vectors
\[
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}.
\]
The linear model then assumes there is some $\beta_0 \in \mathbb{R}^p$ such that
\[
Y = X\beta_0 + \varepsilon,
\]
where $\varepsilon$ is some (hopefully small) error random variable. Our goal is then to
estimate $\beta_0$ given the data we have.
If $X$ has full column rank, so that $X^T X$ is invertible, then we can use ordinary
least squares to estimate $\beta_0$, with estimate
\[
\hat\beta_{\mathrm{OLS}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 = (X^T X)^{-1} X^T Y.
\]
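As a quick illustration (not part of the notes themselves), here is a minimal numerical sketch of ordinary least squares in Python with NumPy; the dimensions, true coefficient vector and noise level are arbitrary choices made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions and true coefficient vector (not from the notes).
n, p = 100, 3
beta_0 = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, p))           # rows are the predictors x_i^T
eps = rng.normal(scale=0.1, size=n)   # mean-zero errors
Y = X @ beta_0 + eps                  # the linear model Y = X beta_0 + eps

# OLS estimate (X^T X)^{-1} X^T Y, computed via a linear solve for stability.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)                       # should be close to beta_0
```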
This assumes nothing about $\varepsilon$ itself, but if we assume that $\mathbb{E}\varepsilon = 0$ and
$\operatorname{var}(\varepsilon) = \sigma^2 I$, then this estimate satisfies
\[
\mathbb{E}_{\beta_0}(\hat\beta_{\mathrm{OLS}}) = (X^T X)^{-1} X^T X \beta_0 = \beta_0,
\]
\[
\operatorname{var}_{\beta_0}(\hat\beta_{\mathrm{OLS}}) = (X^T X)^{-1} X^T \operatorname{var}(\varepsilon) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.
\]
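Both facts can be checked by simulation. The following sketch is my own illustration (again with made-up toy values): it holds $X$ fixed, redraws $\varepsilon$ many times, and compares the empirical mean and covariance of $\hat\beta_{\mathrm{OLS}}$ with $\beta_0$ and $\sigma^2 (X^T X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 3, 0.5                 # assumed toy values
beta_0 = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))               # keep X fixed across repetitions

reps = 5000
estimates = np.empty((reps, p))
for r in range(reps):
    Y = X @ beta_0 + rng.normal(scale=sigma, size=n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(estimates.mean(axis=0))             # ~ beta_0 (unbiasedness)
print(np.cov(estimates, rowvar=False))    # empirical covariance of the estimates
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical sigma^2 (X^T X)^{-1}
```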
In particular, this is an unbiased estimator. Even better, it is the best linear
unbiased estimator. More precisely, the Gauss–Markov theorem says that any other
linear unbiased estimator $\tilde\beta = AY$ has $\operatorname{var}(\tilde\beta) - \operatorname{var}(\hat\beta_{\mathrm{OLS}})$ positive semi-definite.
Of course, ordinary least squares is not the only way to estimate $\beta_0$. Another
common method for estimating parameters is maximum likelihood estimation,
and this works for more general models than linear regression. For people who are
already sick of meeting likelihoods, this will be the last time we meet likelihoods
in this course.
Suppose we want to estimate a parameter $\theta$ via knowledge of some data $Y$.
We assume $Y$ has density $f(y; \theta)$. We define the log-likelihood by
\[
\ell(\theta) = \log f(Y; \theta).
\]
The maximum likelihood estimator then maximizes $\ell(\theta)$ over $\theta$ to get $\hat\theta$.
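To make this concrete, here is a small sketch (not from the notes) that computes a maximum likelihood estimate numerically for i.i.d. $N(\theta, 1)$ data by minimising the negative log-likelihood with SciPy; the sample size and true $\theta$ are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
theta_true = 1.5                           # assumed true parameter
Y = rng.normal(loc=theta_true, size=200)   # i.i.d. N(theta, 1) observations

def neg_log_lik(theta):
    # -l(theta) for the N(theta, 1) model, up to an additive constant.
    return 0.5 * np.sum((Y - theta) ** 2)

theta_hat = minimize(neg_log_lik, x0=np.array([0.0])).x[0]
print(theta_hat, Y.mean())                 # for this model the MLE is just the sample mean
```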
Similar to ordinary least squares, there is a theorem that says maximum
likelihood estimation is the "best". To state it, we introduce the Fisher information
matrix. This is a family of $d \times d$ matrices indexed by $\theta \in \mathbb{R}^d$, defined by
\[
I_{jk}(\theta) = \mathbb{E}_\theta\left[-\frac{\partial^2}{\partial\theta_j\, \partial\theta_k} \ell(\theta)\right].
\]
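Before stating the theorem, here is a quick numerical illustration of this definition (my own example, with made-up values): for $n$ i.i.d. Bernoulli($\theta$) observations, averaging $-\partial^2 \ell / \partial\theta^2$ over simulated data should recover the known closed form $n/(\theta(1-\theta))$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 50, 10000            # assumed toy values

def neg_second_deriv(Y, theta):
    # -d^2/dtheta^2 of l(theta) = sum_i [Y_i log(theta) + (1 - Y_i) log(1 - theta)]
    return np.sum(Y / theta**2 + (1 - Y) / (1 - theta)**2)

samples = rng.binomial(1, theta, size=(reps, n))
monte_carlo = np.mean([neg_second_deriv(Y, theta) for Y in samples])
print(monte_carlo, n / (theta * (1 - theta)))   # both estimate I(theta)
```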
The relevant theorem is
Theorem (Cramér–Rao bound). If $\tilde\theta$ is an unbiased estimator for $\theta$, then
$\operatorname{var}(\tilde\theta) - I^{-1}(\theta)$ is positive semi-definite.
Moreover, asymptotically, as $n \to \infty$, the maximum likelihood estimator is
unbiased and achieves the Cramér–Rao bound.
Another wonderful fact about the maximum likelihood estimator is that,
asymptotically, it is normally distributed, and so it is something we understand
well.
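Continuing the Bernoulli example above (again my own sketch, not from the notes), one can check by simulation that the MLE $\hat\theta = \bar Y$ has variance close to the Cramér–Rao bound $I^{-1}(\theta) = \theta(1-\theta)/n$ and looks approximately normal.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 200, 20000           # assumed toy values

samples = rng.binomial(1, theta, size=(reps, n))
mle = samples.mean(axis=1)                 # the Bernoulli MLE is the sample mean

print(mle.var())                           # empirical variance of the MLE
print(theta * (1 - theta) / n)             # Cramer-Rao bound I^{-1}(theta)

# Rough normality check: the standardised MLE should behave like N(0, 1).
z = (mle - theta) / np.sqrt(theta * (1 - theta) / n)
print(np.mean(np.abs(z) > 1.96))           # should be close to 0.05
```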
This might all seem very wonderful, but there are a few problems. The
results we stated are asymptotic, but what we actually see in real life is that
as $n \to \infty$, the number of parameters $p$ also increases. In such settings, the asymptotic
properties do not tell us much. Another issue is that these results all concern unbiased
estimators; in a lot of situations of interest, it turns out biased methods do
much, much better.
Another thing we might be interested in is that as $n$ gets large, we might want
to use more complicated models than simple parametric models, since we have
much more data to work with. This is not something ordinary least squares
provides us with.