1 Classical statistics
This is a course on modern statistical methods. Before we study methods, we
give a brief summary of what we are not going to talk about, namely classical
statistics.
So suppose we are doing regression. We have some predictors $x_i \in \mathbb{R}^p$ and
responses $Y_i \in \mathbb{R}$, and we hope to find a model that describes $Y$ as a function of
$x$. For convenience, define the vectors
\[
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}.
\]
The linear model then assumes there is some $\beta_0 \in \mathbb{R}^p$ such that
\[
Y = X\beta_0 + \varepsilon,
\]
where $\varepsilon$ is some (hopefully small) error random variable. Our goal is then to
estimate $\beta_0$ given the data we have.
If $X$ has full column rank, so that $X^T X$ is invertible, then we can use ordinary
least squares to estimate $\beta_0$, with estimate
\[
\hat\beta_{\mathrm{OLS}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 = (X^T X)^{-1} X^T Y.
\]
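As a quick illustration (not part of the notes themselves), here is a minimal numerical sketch of ordinary least squares in Python with NumPy; the dimensions, true coefficient vector and noise level are arbitrary choices made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions and true coefficient vector (not from the notes).
n, p = 100, 3
beta_0 = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, p))           # rows are the predictors x_i^T
eps = rng.normal(scale=0.1, size=n)   # mean-zero errors
Y = X @ beta_0 + eps                  # the linear model Y = X beta_0 + eps

# OLS estimate (X^T X)^{-1} X^T Y, computed via a linear solve for stability.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)                       # should be close to beta_0
```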
This assumes nothing about $\varepsilon$ itself, but if we assume that $\mathbb{E}\varepsilon = 0$ and
$\operatorname{var}(\varepsilon) = \sigma^2 I$, then this estimate satisfies
\[
\mathbb{E}_{\beta_0}(\hat\beta_{\mathrm{OLS}}) = (X^T X)^{-1} X^T X \beta_0 = \beta_0,
\]
\[
\operatorname{var}_{\beta_0}(\hat\beta_{\mathrm{OLS}}) = (X^T X)^{-1} X^T \operatorname{var}(\varepsilon) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.
\]
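Both facts can be checked by simulation. The following sketch is my own illustration (again with made-up toy values): it holds $X$ fixed, redraws $\varepsilon$ many times, and compares the empirical mean and covariance of $\hat\beta_{\mathrm{OLS}}$ with $\beta_0$ and $\sigma^2 (X^T X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 3, 0.5                 # assumed toy values
beta_0 = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))               # keep X fixed across repetitions

reps = 5000
estimates = np.empty((reps, p))
for r in range(reps):
    Y = X @ beta_0 + rng.normal(scale=sigma, size=n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(estimates.mean(axis=0))             # ~ beta_0 (unbiasedness)
print(np.cov(estimates, rowvar=False))    # empirical covariance of the estimates
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical sigma^2 (X^T X)^{-1}
```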
In particular, this is an unbiased estimator. Even better, it is the best linear
unbiased estimator. More precisely, the Gauss–Markov theorem says that any other
linear unbiased estimator $\tilde\beta = AY$ has $\operatorname{var}(\tilde\beta) - \operatorname{var}(\hat\beta_{\mathrm{OLS}})$ positive semi-definite.
Of course, ordinary least squares is not the only way to estimate $\beta_0$. Another
common method for estimating parameters is maximum likelihood estimation,
and this works for more general models than linear regression. For people who are
already sick of meeting likelihoods, this will be the last time we meet likelihoods
in this course.
Suppose we want to estimate a parameter $\theta$ via knowledge of some data $Y$.
We assume $Y$ has density $f(y; \theta)$. We define the log-likelihood by
\[
\ell(\theta) = \log f(Y; \theta).
\]
The maximum likelihood estimator then maximizes $\ell(\theta)$ over $\theta$ to get $\hat\theta$.
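To make this concrete, here is a small sketch (not from the notes) that computes a maximum likelihood estimate numerically for i.i.d. $N(\theta, 1)$ data by minimising the negative log-likelihood with SciPy; the sample size and true $\theta$ are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
theta_true = 1.5                           # assumed true parameter
Y = rng.normal(loc=theta_true, size=200)   # i.i.d. N(theta, 1) observations

def neg_log_lik(theta):
    # -l(theta) for the N(theta, 1) model, up to an additive constant.
    return 0.5 * np.sum((Y - theta) ** 2)

theta_hat = minimize(neg_log_lik, x0=np.array([0.0])).x[0]
print(theta_hat, Y.mean())                 # for this model the MLE is just the sample mean
```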
Similar to ordinary least squares, there is a theorem that says maximum
likelihood estimation is the "best". To state it, we introduce the Fisher information
matrix. This is a family of $d \times d$ matrices indexed by $\theta \in \mathbb{R}^d$, defined by
\[
I_{jk}(\theta) = \mathbb{E}_\theta\left[-\frac{\partial^2}{\partial\theta_j\, \partial\theta_k} \ell(\theta)\right].
\]
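Before stating the theorem, here is a quick numerical illustration of this definition (my own example, with made-up values): for $n$ i.i.d. Bernoulli($\theta$) observations, averaging $-\partial^2 \ell / \partial\theta^2$ over simulated data should recover the known closed form $n/(\theta(1-\theta))$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 50, 10000            # assumed toy values

def neg_second_deriv(Y, theta):
    # -d^2/dtheta^2 of l(theta) = sum_i [Y_i log(theta) + (1 - Y_i) log(1 - theta)]
    return np.sum(Y / theta**2 + (1 - Y) / (1 - theta)**2)

samples = rng.binomial(1, theta, size=(reps, n))
monte_carlo = np.mean([neg_second_deriv(Y, theta) for Y in samples])
print(monte_carlo, n / (theta * (1 - theta)))   # both estimate I(theta)
```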
The relevant theorem is
Theorem (Cramér–Rao bound). If $\tilde\theta$ is an unbiased estimator for $\theta$, then
$\operatorname{var}(\tilde\theta) - I^{-1}(\theta)$ is positive semi-definite.
Moreover, asymptotically, as $n \to \infty$, the maximum likelihood estimator is
unbiased and achieves the Cramér–Rao bound.
Another wonderful fact about the maximum likelihood estimator is that,
asymptotically, it is normally distributed, and so it is something we understand
well.
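Continuing the Bernoulli example above (again my own sketch, not from the notes), one can check by simulation that the MLE $\hat\theta = \bar Y$ has variance close to the Cramér–Rao bound $I^{-1}(\theta) = \theta(1-\theta)/n$ and looks approximately normal.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 200, 20000           # assumed toy values

samples = rng.binomial(1, theta, size=(reps, n))
mle = samples.mean(axis=1)                 # the Bernoulli MLE is the sample mean

print(mle.var())                           # empirical variance of the MLE
print(theta * (1 - theta) / n)             # Cramer-Rao bound I^{-1}(theta)

# Rough normality check: the standardised MLE should behave like N(0, 1).
z = (mle - theta) / np.sqrt(theta * (1 - theta) / n)
print(np.mean(np.abs(z) > 1.96))           # should be close to 0.05
```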
This might all seem very wonderful, but there are a few problems. The
results we stated are asymptotic, but what we actually see in real life is that
as $n \to \infty$, the number of parameters $p$ also increases. In such settings, the asymptotic
properties do not tell us much. Another issue is that these results all concern unbiased
estimators; in a lot of situations of interest, it turns out biased methods do
much, much better.
Another thing we might be interested in is that as $n$ gets large, we might want
to use more complicated models than simple parametric models, since we have
much more data to work with. This is not something ordinary least squares
provides us with.