A report for my Stat 200B course
By Laura Balzano
In estimation, we want to do the best we can to find the truth behind the data we have observed. One way of posing this question statistically is: what is the probability of seeing the observed values, given the true value of the unknowns? There is a common framework that forms a class of estimators, defined by an estimating equation of the form
$$\sum_{i=1}^{N} \psi(x_i; \theta) = 0,$$
one member of which is the Maximum Likelihood Estimator (MLE). Let's examine this framework.
The MLE and many other estimators fit into the above common framework. In this section I would like to answer two questions: 1. Why is this framework an important one? 2. Is the MLE optimal among this class of estimators?
There is a concept in statistics called consistency. The idea is this: as we gather more data, we have more and more information about our unknown parameters. Under these circumstances, our estimator should converge to the true value of our unknowns. One way to show this is to show that the estimating function $\psi(x;\theta)$ (for the MLE, $\psi(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta)$) converges to its true expected value. In turn, since our estimating equation sets $\frac{1}{N}\sum_{i=1}^{N}\psi(x_i;\hat\theta)$ to zero, $E_{\theta}[\psi(X;\theta)]$ should also be zero in the limit of large data.
Let's show that this is the case for the MLE. We want to prove that $\hat\theta_{MLE} \xrightarrow{P} \theta_0$, or in other words, that our estimator converges to the true value $\theta_0$ of $\theta$. We have defined the MLE as follows:
$$\hat\theta_{MLE} = \arg\max_{\theta} \sum_{i=1}^{N}\log f(x_i;\theta), \quad\text{which satisfies}\quad \sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log f(x_i;\theta)\Big|_{\theta=\hat\theta_{MLE}} = 0.$$
First we should examine our assumption. Why does the regularity condition
$$E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = 0$$
hold true? This is simple. First look at the following basic yet very useful property:
$$\frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial\theta}.$$
Now we have
$$E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = \int \frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial\theta}\,f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\,1 = 0,$$
assuming the derivative and the integral can be interchanged.
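To see the regularity condition numerically, here is a small Monte Carlo sketch of my own (the Normal model and the numbers are my illustration, not from the report). For a Normal$(\theta, 1)$ model the score is $x - \theta$, and its sample average under the true $\theta$ should be near zero:

```python
import numpy as np

# Hypothetical example: Normal(theta, 1) model, whose score function is
# d/dtheta log f(x; theta) = x - theta. The regularity condition says the
# expectation of the score under the true theta is zero.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = x - theta          # score evaluated at the true parameter
print(abs(score.mean()))   # small: a Monte Carlo estimate of E[score] = 0
```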
Now that we know the regularity condition is satisfied, we want to continue understanding the consistency of the MLE. There exists some truth: let's say the data are truly distributed according to $f(x;\theta_0)$. We have our model, a family of distributions $f(x;\theta)$. The very best we can do is to find the point in our model family that is closest to the truth, much like a perpendicular projection onto the model.
We want to minimize the distance between $f(x;\theta_0)$ and $f(x;\theta)$, and to assess this we will look at the Kullback-Leibler distance $D(f_{\theta_0}\,\|\,f_{\theta})$. It is defined as follows:
$$D(f_{\theta_0}\,\|\,f_{\theta}) = \int f(x;\theta_0)\log\frac{f(x;\theta_0)}{f(x;\theta)}\,dx.$$
With infinite data, the normalized log likelihood becomes
$$\frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta) \longrightarrow E_{\theta_0}\left[\log f(X;\theta)\right] = \int f(x;\theta_0)\log f(x;\theta)\,dx.$$
Thus the Kullback-Leibler distance here is
$$D(f_{\theta_0}\,\|\,f_{\theta}) = \int f(x;\theta_0)\log f(x;\theta_0)\,dx - \int f(x;\theta_0)\log f(x;\theta)\,dx = -H(f_{\theta_0}) - E_{\theta_0}[\log f(X;\theta)],$$
where $H(f_{\theta_0})$ is the entropy of the true distribution.
The entropy term does not depend on $\theta$ and is a value we cannot change. So in order to maximize the expected log likelihood, we need to minimize the Kullback-Leibler distance. In this sense, the negative log likelihood can be seen as a kind of error: the Kullback-Leibler deviation.
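To make this equivalence concrete, here is a numerical sketch of my own (the Normal family and all numbers are my illustration, not from the report). For Normal$(\theta, 1)$ models the KL distance from the truth $\theta_0$ has the closed form $(\theta - \theta_0)^2/2$, so it is minimized at $\theta_0$, exactly where the expected log likelihood is maximized:

```python
import numpy as np

# Illustration: compare the KL distance and the expected log likelihood
# over a grid of candidate theta values, for Normal(theta, 1) models.
theta0 = 1.5
grid = np.linspace(-2, 5, 701)
kl = (grid - theta0) ** 2 / 2          # closed-form KL from truth to candidate

# Monte Carlo estimate of E_{theta0}[log f(X; theta)] for each candidate theta.
rng = np.random.default_rng(1)
x = rng.normal(theta0, 1.0, size=200_000)
loglik = np.array(
    [-0.5 * np.log(2 * np.pi) - 0.5 * np.mean((x - t) ** 2) for t in grid]
)

# Both the KL minimizer and the expected-log-likelihood maximizer sit at theta0.
print(grid[np.argmin(kl)], grid[np.argmax(loglik)])
```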
For the MLE, we substitute into the Kullback-Leibler inequality as follows:
$$D(f_{\theta_0}\,\|\,f_{\theta}) = \int f(x;\theta_0)\log\frac{f(x;\theta_0)}{f(x;\theta)}\,dx \ge 0,$$
with equality only when $f(x;\theta) = f(x;\theta_0)$. With the log likelihood, we are equivalently maximizing
$$\ell_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\log f(x_i;\theta),$$
and the Law of Large Numbers says that
$$\ell_N(\theta) \xrightarrow{P} E_{\theta_0}\left[\log f(X;\theta)\right].$$
We know from the above inequality that
$$E_{\theta_0}\left[\log f(X;\theta)\right] \le E_{\theta_0}\left[\log f(X;\theta_0)\right].$$
This tells us that $E_{\theta_0}[\log f(X;\theta)]$ is maximized for $\theta = \theta_0$. Therefore, $\ell_N(\theta)$ must also be maximized for $\theta = \theta_0$ in the limit, which means that as $N\to\infty$,
$$\hat\theta_{MLE} \xrightarrow{P} \theta_0.$$
Thus we have proved that the MLE is consistent!
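Consistency is easy to watch in a simulation. Here is a toy example of mine (not from the report): the MLE of the rate of an Exponential sample is $1/\bar{x}$, and the estimate settles onto the true rate as $N$ grows:

```python
import numpy as np

# Toy consistency demo: MLE of an Exponential rate is 1 / sample mean.
rng = np.random.default_rng(2)
lam_true = 3.0
for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1 / lam_true, size=n)
    lam_hat = 1 / x.mean()       # the MLE for this sample size
    print(n, lam_hat)            # approaches lam_true = 3.0 as n grows
```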
Now we want to show that the MLE is optimal among all unbiased estimators. We want to show that it has minimum variance, so we must first derive a lower bound on the variance of any unbiased estimator; then we can show that the MLE achieves that bound.
First, we need to recall some basic properties. Again denote the score $s(\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta)$. We will again use the property:
$$\frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial\theta}.$$
This property can be used to decompose $\frac{\partial}{\partial\theta}E_{\theta}[s(\theta)]$:
$$\frac{\partial}{\partial\theta}E_{\theta}[s(\theta)] = E_{\theta}\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right] + E_{\theta}\left[s(\theta)^2\right].$$
The left-hand side of this equation is the derivative of an expected value that is identically zero by the regularity condition, and therefore it is zero. Also, because the mean of $s(\theta)$ is zero, $E_{\theta}[s(\theta)^2] = \mathrm{Cov}(s,s) = \mathrm{Var}(s(\theta))$, so we have a covariance on the right-hand side:
$$\mathrm{Var}(s(\theta)) = -E_{\theta}\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right].$$
We can now utilize the bound on covariance (the Cauchy-Schwarz inequality):
$$\mathrm{Cov}(U,V)^2 \le \mathrm{Var}(U)\,\mathrm{Var}(V).$$
To be more general for a moment, let $\hat\theta$ be any unbiased estimator. Because $E_{\theta}[\hat\theta] = \theta$, we have
$$\int \hat\theta(x)\,f(x;\theta)\,dx = \theta,$$
and taking the derivative of both sides, we have
$$\int \hat\theta(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \int \hat\theta(x)\,s(\theta)\,f(x;\theta)\,dx = E_{\theta}\left[\hat\theta\,s(\theta)\right] = 1.$$
Since $E_{\theta}[s(\theta)] = 0$, this says $\mathrm{Cov}(\hat\theta, s(\theta)) = 1$. Again denoting $s(\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta)$, the covariance bound gives
$$1 = \mathrm{Cov}(\hat\theta, s(\theta))^2 \le \mathrm{Var}(\hat\theta)\,\mathrm{Var}(s(\theta)), \quad\text{so}\quad \mathrm{Var}(\hat\theta) \ge \frac{1}{E_{\theta}[s(\theta)^2]}.$$
This denominator $I(\theta) = E_{\theta}[s(\theta)^2]$ is called the Fisher Information for the data set; for $N$ i.i.d. samples it equals $N$ times the information in a single sample.
We have achieved the result of the Cramér-Rao Lower Bound:
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{I(\theta)}.$$
No unbiased estimator can beat this lower bound on variance. An estimator that reaches this bound is called an efficient estimator, or a Minimum Variance Unbiased Estimator. This result holds for any unbiased estimator $\hat\theta$.
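The bound can be checked by simulation. In this sketch of my own (the model and numbers are my illustration, not from the report), the model is Normal$(\theta, \sigma^2)$ with $\sigma$ known, so the per-sample information is $1/\sigma^2$, the bound for $N$ samples is $\sigma^2/N$, and the sample mean is an unbiased estimator that attains it:

```python
import numpy as np

# Monte Carlo check of the Cramer-Rao bound for the Normal mean.
rng = np.random.default_rng(3)
theta, sigma, N, trials = 0.7, 2.0, 50, 200_000

# Repeatedly draw N samples and compute the sample mean (an unbiased estimator).
estimates = rng.normal(theta, sigma, size=(trials, N)).mean(axis=1)

crlb = sigma ** 2 / N                 # the Cramer-Rao lower bound for N samples
print(estimates.var(), crlb)          # nearly equal: the sample mean is efficient
```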
Now, to examine the optimality of the ML estimator, first recall the estimating equation
$$\sum_{i=1}^{N}\psi(x_i;\theta) = 0, \quad\text{where for the MLE}\quad \psi(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta).$$
Substituting $\hat\theta_{MLE}$, the ML estimator, we have:
$$\sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log f(x_i;\theta)\Big|_{\theta=\hat\theta_{MLE}} = 0.$$
This will be useful after we do a bit more math. Remember, what we really want to find is how the variance of our ML estimator looks asymptotically. Does it converge to the Cramér-Rao Lower Bound?
Let's approximate our class of estimators with a linear function by doing a Taylor expansion around the true value $\theta_0$. If we let $\psi_i(\theta) = \psi(x_i;\theta)$, we have:
$$0 = \frac{1}{N}\sum_{i=1}^{N}\psi_i(\hat\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\psi_i(\theta_0) + (\hat\theta-\theta_0)\,\frac{1}{N}\sum_{i=1}^{N}\psi_i'(\theta_0) + \epsilon,$$
so that
$$\sqrt{N}\,(\hat\theta-\theta_0) = \frac{-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\psi_i(\theta_0)}{\frac{1}{N}\sum_{i=1}^{N}\psi_i'(\theta_0)} + \epsilon',$$
where $\epsilon'$ is a very small remainder that goes to zero as $N$ tends to infinity. From this linear perspective, $\frac{1}{N}\sum_{i}\psi_i(\theta_0)$ is the intercept and $\frac{1}{N}\sum_{i}\psi_i'(\theta_0)$ is the slope.
By the central limit theorem, the numerator on the right-hand side converges to a normal distribution,
$$-\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\psi_i(\theta_0) \xrightarrow{d} \mathcal{N}\left(0,\; E_{\theta_0}\!\left[\psi(X;\theta_0)^2\right]\right),$$
since $E_{\theta_0}[\psi(X;\theta_0)] = 0$, and by the Law of Large Numbers, the denominator converges to its expected value,
$$\frac{1}{N}\sum_{i=1}^{N}\psi_i'(\theta_0) \xrightarrow{P} E_{\theta_0}\!\left[\psi'(X;\theta_0)\right].$$
By Slutsky's theorem, this ratio then converges as follows:
$$\sqrt{N}\,(\hat\theta-\theta_0) \xrightarrow{d} \mathcal{N}\left(0,\; \frac{E_{\theta_0}\!\left[\psi(X;\theta_0)^2\right]}{\left(E_{\theta_0}\!\left[\psi'(X;\theta_0)\right]\right)^2}\right).$$
This is great news! For the MLE, $\psi = \frac{\partial}{\partial\theta}\log f$, and the information identity gives:
$$E_{\theta_0}\!\left[\psi(X;\theta_0)^2\right] = I_1(\theta_0), \qquad E_{\theta_0}\!\left[\psi'(X;\theta_0)\right] = E_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta_0)\right] = -I_1(\theta_0),$$
where $I_1$ is the Fisher Information in a single sample. So, our convergence becomes
$$\sqrt{N}\,(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} \mathcal{N}\left(0,\; \frac{1}{I_1(\theta_0)}\right).$$
Of course, we recognize this as a variance equal to the inverse of the Fisher Information. Very nice!! This is the asymptotic distribution of the MLE, and it shows that the MLE is asymptotically efficient: its variance attains the Cramér-Rao Lower Bound as $N \to \infty$.
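The asymptotic distribution can also be verified by simulation. In this sketch of mine (the Exponential model and numbers are my illustration, not from the report), the per-sample information of an Exponential rate $\lambda$ is $1/\lambda^2$, so $\sqrt{N}(\hat\lambda - \lambda)$ should be approximately Normal with variance $\lambda^2$:

```python
import numpy as np

# Verify the asymptotic variance of the Exponential-rate MLE, lam_hat = 1/mean(x).
# Per-sample Fisher information is 1/lam^2, so sqrt(N)*(lam_hat - lam) should
# have variance close to lam^2 = 1 / I_1(lam).
rng = np.random.default_rng(4)
lam, N, trials = 2.0, 1_000, 10_000
x = rng.exponential(scale=1 / lam, size=(trials, N))
lam_hat = 1 / x.mean(axis=1)            # the MLE in each trial
z = np.sqrt(N) * (lam_hat - lam)
print(z.var(), lam ** 2)                # close: variance matches 1 / I_1(lam)
```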