A report for my Stat 200B course

By Laura Balzano

In estimation, we want to do the best we can to find the truth behind the data we have observed. One way to pose this question statistically is: what is the probability of seeing the observed values, given the true value of the unknowns? There is a common framework that forms a class of estimators, one of which is the Maximum Likelihood Estimator (MLE). Let's examine this framework.

An estimator $\hat{\theta}$ in this class is defined as a solution of an estimating equation,

$$\sum_{i=1}^{N} \psi(x_i;\theta) = 0,$$

where $\psi$ is a function chosen by the statistician.

The MLE and many other estimators fit into the above common
framework. In this section I would like to answer two questions: 1. Why is this
framework an important one? 2. Is the MLE optimal among this class of
estimators?

There is a concept in statistics called *consistency*. The idea is this: as we gather more and more data, we have more and more information about our unknown parameters. Under these circumstances, our estimator should converge to the true value of our unknowns. One way to show this is to show that the sample average defining our estimating equation converges to its expected value. In turn then, since our estimating equation is equal to zero, $E[\psi(X;\theta)]$ should also be zero in the limit of large data.

$$\frac{1}{N}\sum_{i=1}^{N} \psi(x_i;\theta) \;\xrightarrow{P}\; E[\psi(X;\theta)] = 0$$

Let’s show that this is the case for the MLE. We want to prove that $\hat{\theta}_{MLE} \xrightarrow{P} \theta_0$, or in other words, that our estimator converges to the true value of $\theta$. We have defined the MLE as follows.

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \prod_{i=1}^{N} f(x_i;\theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log f(x_i;\theta)$$
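To make this concrete, here is a small numerical sketch (my own example, not part of the derivation): for an Exponential(rate) sample the MLE has the closed form $\hat{\lambda} = 1/\bar{x}$, and we can check that a brute-force maximization of the log-likelihood lands on the same value.

```python
import math
import random

# Hypothetical example (mine, not from the report): the MLE for the rate of an
# Exponential(rate) sample has the closed form rate_hat = 1 / sample_mean.
# Here we check that against a brute-force grid search over the log-likelihood.

random.seed(0)
true_rate = 2.0
data = [random.expovariate(true_rate) for _ in range(2000)]

def log_likelihood(rate, xs):
    # log f(x; rate) = log(rate) - rate * x, summed over the sample
    return sum(math.log(rate) - rate * x for x in xs)

closed_form = len(data) / sum(data)            # 1 / sample mean
grid = [0.02 * k for k in range(1, 500)]       # candidate rates 0.02 .. 9.98
grid_mle = max(grid, key=lambda r: log_likelihood(r, data))
print(closed_form, grid_mle)
```

Because the log-likelihood is strictly concave in the rate, the grid maximizer must land within one grid step of the closed-form maximizer.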

First we should examine our assumption. Why does the regularity condition $E\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = 0$ hold true? This is simple. First look at the following basic yet very useful property.

$$\frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{1}{f(x;\theta)}\,\frac{\partial}{\partial\theta} f(x;\theta)$$

Now we have

$$E\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = \int \frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\, f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\,(1) = 0$$
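As a quick numerical sanity check of this regularity condition (my own illustration), we can integrate the score against the density for the exponential family, where the score works out to $1/\lambda - x$, and verify the result is numerically zero.

```python
import math

# Numerical sanity check (my own illustration) that E[d/dtheta log f(X;theta)] = 0
# for the Exponential(rate) family: f(x; r) = r * exp(-r * x), score = 1/r - x.

rate = 1.7
dx = 0.001
# rectangle-rule integration of score(x) * f(x) over a grid carrying
# essentially all of the probability mass (x in (0, 40])
expected_score = sum(
    (1.0 / rate - x) * rate * math.exp(-rate * x) * dx
    for x in (dx * k for k in range(1, 40000))
)
print(expected_score)
```

The exact integral is zero; the small residual here is just discretization error from the rectangle rule.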

Now that we know that the regularity condition is satisfied, we want to continue understanding the consistency of the MLE. There exists some truth: let's say the true values are distributed like $g(x)$. We have our model, a family of densities $f(x;\theta)$. The very best we can do is to find the point in our model that is closest to the truth, much like a perpendicular projection of the true distribution onto the model.

We want to minimize the distance between $g$ and $f(\cdot;\theta)$, and to assess this we will look at the Kullback-Leibler distance, $D(g \,\|\, f_\theta)$ (not a true metric, since it is not symmetric). It is defined as follows.

$$D(g \,\|\, f_\theta) = \int g(x)\,\log\frac{g(x)}{f(x;\theta)}\,dx = E_g\left[\log\frac{g(X)}{f(X;\theta)}\right]$$

With infinite data, the average log-likelihood becomes

$$\frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta) \;\xrightarrow{P}\; E_g[\log f(X;\theta)]$$

Thus the Kullback-Leibler distance here is

$$D(g \,\|\, f_\theta) = E_g[\log g(X)] - E_g[\log f(X;\theta)] = -H(g) - E_g[\log f(X;\theta)],$$

where $H(g)$ is the entropy of $g$.

The entropy depends only on the truth, so it is a value we cannot change. Therefore, maximizing the expected log-likelihood is equivalent to minimizing the Kullback-Leibler distance. The Kullback-Leibler distance plays the role of an error here, and the likelihood is what drives it down.
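A small numerical illustration (mine, with made-up distributions) of the Kullback-Leibler distance for discrete distributions, showing that it is nonnegative, zero only when the two distributions coincide, and not symmetric:

```python
import math

# Illustration (my own made-up distributions): properties of the
# Kullback-Leibler distance D(g || f) = sum_x g(x) * log(g(x) / f(x)).

def kl(g, f):
    return sum(gx * math.log(gx / fx) for gx, fx in zip(g, f))

g = [0.5, 0.3, 0.2]
f = [0.4, 0.4, 0.2]

d_gf = kl(g, f)   # positive: g != f
d_gg = kl(g, g)   # zero: distance from g to itself
d_fg = kl(f, g)   # differs from d_gf: KL is not symmetric
print(d_gf, d_gg, d_fg)
```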

For the MLE, we substitute $g$ and $f_\theta$ into the Kullback-Leibler inequality (a consequence of Jensen's inequality) as follows.

$$D(g \,\|\, f_\theta) = E_g\left[\log\frac{g(X)}{f(X;\theta)}\right] \ge 0,$$

with equality only when $f(x;\theta) = g(x)$. With the log-likelihood, we are equivalently maximizing

$$\frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta)$$

and the Law of Large Numbers says that

$$\frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta) \;\xrightarrow{P}\; E_g[\log f(X;\theta)]$$

We know from the above inequality that

$$E_g[\log f(X;\theta)] \le E_g[\log g(X)]$$

This tells us that $E_g[\log f(X;\theta)]$ is maximized when $f(\cdot;\theta) = g$. If the truth lies in our model, that is, $g(x) = f(x;\theta_0)$ for some $\theta_0$, then this expectation is maximized at $\theta = \theta_0$. Therefore, $\frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta)$ must also (asymptotically) be maximized at $\theta_0$, which means that as $N \to \infty$,

$$\hat{\theta}_{MLE} \;\xrightarrow{P}\; \theta_0$$

Thus we have proved that the MLE is consistent!
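A simulation sketch of this consistency (my own example): for a Bernoulli($p$) model the MLE is the sample mean, and its error shrinks as $N$ grows.

```python
import random

# Simulation sketch (my example): consistency of the MLE for a Bernoulli(p)
# parameter, where the MLE is simply the sample mean. As N grows, the
# estimate approaches the true p.

random.seed(1)
true_p = 0.3

def bernoulli_mle(n):
    sample = [1 if random.random() < true_p else 0 for _ in range(n)]
    return sum(sample) / n

errors = {n: abs(bernoulli_mle(n) - true_p) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(n, err)
```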

Now we want to show that the MLE is the optimal estimator among all unbiased estimators. We want to show that it has minimum variance, so we must first derive a lower bound on the variance of any unbiased estimator, and then show that the MLE achieves that bound.

First, we need to recall some basic properties. Denote the score of the data set $x = (x_1, \dots, x_N)$ by $s(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta)$, where $f$ now stands for the joint density of the whole sample. We will again use the property:

$$\frac{\partial}{\partial\theta} f(x;\theta) = f(x;\theta)\,\frac{\partial}{\partial\theta}\log f(x;\theta) = f(x;\theta)\, s(x;\theta)$$

This property can be used to decompose $\frac{\partial}{\partial\theta}E[s(X;\theta)]$.

$$\frac{\partial}{\partial\theta}E[s(X;\theta)] = \int \frac{\partial s(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx + \int s(x;\theta)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = E\left[\frac{\partial s}{\partial\theta}\right] + E\left[s(X;\theta)^2\right]$$

The left-hand side of this equation is the derivative of an expected value that equals zero for every $\theta$ (by the regularity condition above), and therefore it is zero. Also, because the mean of $s(X;\theta)$ is zero, the term $E[s^2]$ is in fact a variance.

$$0 = E\left[\frac{\partial s}{\partial\theta}\right] + \operatorname{Var}\big(s(X;\theta)\big) \quad\Longrightarrow\quad \operatorname{Var}\big(s(X;\theta)\big) = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]$$

We can now utilize the Cauchy-Schwarz bound on covariance.

$$\operatorname{Cov}(U,V)^2 \le \operatorname{Var}(U)\,\operatorname{Var}(V)$$

To be more general for a moment, let $\hat{\theta}$ be any unbiased estimator. Because $E[\hat{\theta}] = \theta$, we have

$$E[\hat{\theta}] = \int \hat{\theta}(x)\, f(x;\theta)\,dx = \theta$$

and taking the derivative of both
sides, we have

$$\int \hat{\theta}(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \int \hat{\theta}(x)\, s(x;\theta)\, f(x;\theta)\,dx = E\big[\hat{\theta}\, s(X;\theta)\big] = \operatorname{Cov}\big(\hat{\theta},\, s(X;\theta)\big) = 1$$

Again denoting $I_N(\theta) = \operatorname{Var}\big(s(X;\theta)\big)$, the Cauchy-Schwarz bound applied to $\hat{\theta}$ and $s(X;\theta)$ gives

$$\operatorname{Var}(\hat{\theta}) \ge \frac{\operatorname{Cov}\big(\hat{\theta},\, s(X;\theta)\big)^2}{\operatorname{Var}\big(s(X;\theta)\big)} = \frac{1}{I_N(\theta)}$$

This denominator $I_N(\theta)$ is called the Fisher Information for the data set.

We have achieved the result of the Cramér-Rao Lower Bound. No estimator can beat this lower bound on variance. An estimator that achieves this bound is called an *efficient estimator*, or a Minimum Variance Unbiased Estimator. This result holds for any unbiased estimator $\hat{\theta}$.
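As a numerical check (my own example): for a Bernoulli($p$) sample the per-observation Fisher information is $I(p) = 1/(p(1-p))$, so the Cramér-Rao bound for $N$ samples is $p(1-p)/N$, and the sample mean, being unbiased, attains it exactly.

```python
import random
import statistics

# Numerical check (my example) of the Cramer-Rao lower bound for Bernoulli(p).
# Per-sample Fisher information: I(p) = 1 / (p(1-p)); for N samples the bound
# on an unbiased estimator's variance is 1 / (N * I(p)) = p(1-p) / N.
# The sample mean is unbiased and attains this bound exactly.

random.seed(2)
p, n, trials = 0.3, 50, 20_000

estimates = []
for _ in range(trials):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)

empirical_var = statistics.pvariance(estimates)   # variance of the estimator
crlb = p * (1 - p) / n                            # Cramer-Rao lower bound
print(empirical_var, crlb)
```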

Now to examine the optimality of the ML estimator, first recall

$$\sum_{i=1}^{N} \psi(x_i;\hat{\theta}) = 0$$

Substituting $\hat{\theta}_{MLE}$, the ML estimator, for which $\psi$ is the score of a single observation, we have:

$$\sum_{i=1}^{N} \frac{\partial}{\partial\theta}\log f(x_i;\theta)\Big|_{\theta = \hat{\theta}_{MLE}} = 0$$

This will be useful after we do a bit more math. Remember, what we really want to find is how the variance of our ML estimator behaves asymptotically. Does it converge to the Cramér-Rao Lower Bound?

Let’s approximate our class of estimators with a linear function by doing a Taylor expansion around the true value $\theta_0$. If we let $M_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\psi(x_i;\theta)$, we have:

$$0 = M_N(\hat{\theta}) = M_N(\theta_0) + (\hat{\theta} - \theta_0)\, M_N'(\theta_0) + R_N,$$

where $R_N$ is a very small remainder that goes to zero as $N$ tends to infinity. From this linear perspective, $M_N(\theta_0)$ is the intercept and $M_N'(\theta_0)$ is the slope. Rearranging (and ignoring the remainder),

$$\sqrt{N}\,(\hat{\theta} - \theta_0) = \frac{-\sqrt{N}\, M_N(\theta_0)}{M_N'(\theta_0)}$$

By the central limit theorem, the numerator on the right hand side converges to a normal distribution;

$$\sqrt{N}\, M_N(\theta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\psi(x_i;\theta_0) \;\rightsquigarrow\; \mathcal{N}\Big(0,\; E\big[\psi(X;\theta_0)^2\big]\Big),$$

since $E[\psi(X;\theta_0)] = 0$;

and by the Law of Large Numbers, the denominator converges to the expected value.

$$M_N'(\theta_0) = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial\psi}{\partial\theta}(x_i;\theta_0) \;\xrightarrow{P}\; E\left[\frac{\partial\psi}{\partial\theta}(X;\theta_0)\right]$$

By Slutsky's theorem, this ratio then converges as follows.

$$\sqrt{N}\,(\hat{\theta} - \theta_0) \;\rightsquigarrow\; \mathcal{N}\left(0,\; \frac{E\big[\psi(X;\theta_0)^2\big]}{\left(E\left[\frac{\partial\psi}{\partial\theta}(X;\theta_0)\right]\right)^2}\right)$$

This is great news! For MLE, we have:

$$E\big[\psi(X;\theta_0)^2\big] = I(\theta_0), \qquad E\left[\frac{\partial\psi}{\partial\theta}(X;\theta_0)\right] = -I(\theta_0),$$

where $I(\theta_0)$ denotes the Fisher information of a single observation, so that $I_N(\theta) = N\, I(\theta)$.

So, our convergence becomes

$$\sqrt{N}\,(\hat{\theta}_{MLE} - \theta_0) \;\rightsquigarrow\; \mathcal{N}\left(0,\; \frac{I(\theta_0)}{I(\theta_0)^2}\right) = \mathcal{N}\left(0,\; \frac{1}{I(\theta_0)}\right)$$

Of course, we recognize this limiting variance as the inverse of the Fisher information: asymptotically, $\operatorname{Var}(\hat{\theta}_{MLE}) \approx \frac{1}{N\, I(\theta_0)} = \frac{1}{I_N(\theta_0)}$, which is exactly the Cramér-Rao Lower Bound. Very nice!! This is the asymptotic distribution of the MLE, and it proves that the MLE is asymptotically an efficient estimator.
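Finally, a simulation sketch (my own example) of this asymptotic distribution: for Bernoulli($p$), $1/I(p) = p(1-p)$, and the empirical variance of $\sqrt{N}(\hat{p} - p)$ over many replicates should be close to it.

```python
import math
import random
import statistics

# Simulation sketch (my example) of the asymptotic distribution of the MLE
# for Bernoulli(p): sqrt(N) * (p_hat - p) should have variance close to
# 1 / I(p) = p(1-p), the inverse of the per-sample Fisher information.

random.seed(3)
p, n, trials = 0.4, 500, 5_000

scaled = []
for _ in range(trials):
    p_hat = sum(1 if random.random() < p else 0 for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (p_hat - p))

empirical_var = statistics.pvariance(scaled)
inverse_fisher = p * (1 - p)   # 1 / I(p)
print(empirical_var, inverse_fisher)
```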