Appendix: Maximum Likelihood Method

A report for my Stat 200B course

By Laura Balzano


In estimation, we want to do the best we can at finding the truth behind the data we have observed. One way to pose this question statistically is: what is the probability of seeing the observed values, given the true value of the unknowns? There is a common framework that forms a class of estimators, one of which is the Maximum Likelihood Estimator (MLE). Let's examine this framework. An estimator $\hat{\theta}$ in this framework is defined as the root of an estimating equation,

    \frac{1}{N} \sum_{i=1}^N g(x_i; \hat{\theta}) = 0,

where $g$ is a function of the data and the parameter chosen so that $E_\theta[g(X; \theta)] = 0$. For the MLE, $g$ is the score function, $g(x; \theta) = \frac{\partial}{\partial\theta} \log p(x; \theta)$.

The MLE and many other estimators fit into the above common framework. In this section I would like to answer two questions: 1. Why is this framework an important one? 2. Is the MLE optimal among this class of estimators?


There is a concept in statistics called consistency. The idea is this: as we gather more and more data, we have more and more information about our unknown parameters, and under these circumstances our estimator should converge to the true value of our unknowns. One way to show this is to show that the estimating function converges to its true expected value. In turn, since our estimating equation $\frac{1}{N} \sum_i g(x_i; \hat{\theta}) = 0$, the expectation $E_{\theta_0}[g(X; \theta)]$ should also be zero in the limit of large data.


Let's show that this is the case for the MLE. We want to prove that $\hat{\theta} \to \theta_0$, or in other words, that our estimator converges to the true value of $\theta$. We have defined the MLE as follows.

    \hat{\theta} = \arg\max_\theta \prod_{i=1}^N p(x_i; \theta) = \arg\max_\theta \sum_{i=1}^N \log p(x_i; \theta)

Equivalently, $\hat{\theta}$ solves the score equation $\sum_{i=1}^N \frac{\partial}{\partial\theta} \log p(x_i; \theta) \big|_{\theta = \hat{\theta}} = 0$.
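As a concrete sketch (my own example, not part of the original derivation), consider a Bernoulli($\theta$) model, where the MLE has a closed form: the sample mean. The parameter values below are illustrative.

```python
import math
import random

random.seed(0)

# Illustrative values: Bernoulli(theta_true) data; the MLE of theta is the
# sample mean, which maximizes the log likelihood sum_i log p(x_i; theta).
theta_true = 0.3
N = 10_000
x = [1 if random.random() < theta_true else 0 for _ in range(N)]
theta_hat = sum(x) / N  # closed-form maximizer

def log_likelihood(theta, data):
    """Sum of log p(x_i; theta) under the Bernoulli model."""
    return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta)
               for xi in data)

# The sample mean should beat nearby parameter values.
assert log_likelihood(theta_hat, x) >= log_likelihood(theta_hat + 0.01, x)
assert log_likelihood(theta_hat, x) >= log_likelihood(theta_hat - 0.01, x)
print(f"theta_hat = {theta_hat:.3f}")
```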
First we should examine our assumption. Why does the regularity condition $E_\theta\big[\frac{\partial}{\partial\theta} \log p(X; \theta)\big] = 0$ hold true? This is simple. First look at the following basic yet very useful property.

    \frac{\partial}{\partial\theta} \log p(x; \theta) = \frac{1}{p(x; \theta)} \frac{\partial}{\partial\theta} p(x; \theta)
Now we have

    E_\theta\left[\frac{\partial}{\partial\theta} \log p(X; \theta)\right]
      = \int \frac{\partial}{\partial\theta} \log p(x; \theta) \, p(x; \theta) \, dx
      = \int \frac{\partial}{\partial\theta} p(x; \theta) \, dx
      = \frac{\partial}{\partial\theta} \int p(x; \theta) \, dx
      = \frac{\partial}{\partial\theta} 1 = 0,

where exchanging the derivative and the integral is justified by the usual regularity conditions.
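A quick numerical sanity check of this regularity condition (my own sketch, assuming a Gaussian model with known unit variance): the score should average to zero under the model.

```python
import random

random.seed(1)

# Model assumed for illustration: X ~ N(theta, 1), whose score is
# d/dtheta log p(x; theta) = x - theta.
theta = 2.0
N = 200_000
score_mean = sum(random.gauss(theta, 1.0) - theta for _ in range(N)) / N

print(f"Monte Carlo mean of the score: {score_mean:.4f}")
assert abs(score_mean) < 0.02  # near zero, up to sampling noise
```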

Now that we know that the regularity condition is satisfied, we want to continue understanding the consistency of the MLE. There exists some truth, and let's say the true data are distributed according to $p(x; \theta_0)$. We have our model, the family of distributions $p(x; \theta)$. The very best we can do is to find the point in our model that is closest to the truth, a kind of perpendicular projection of the truth onto the model family.


We want to minimize the distance between $p(x; \theta)$ and $p(x; \theta_0)$, and to assess this we will look at the Kullback-Leibler distance, $D(p_{\theta_0} \| p_\theta)$. It is defined as follows.

    D(p_{\theta_0} \| p_\theta) = \int p(x; \theta_0) \log \frac{p(x; \theta_0)}{p(x; \theta)} \, dx = E_{\theta_0}\left[\log \frac{p(X; \theta_0)}{p(X; \theta)}\right]

With infinite data, the normalized log likelihood becomes

    \frac{1}{N} \sum_{i=1}^N \log p(x_i; \theta) \to E_{\theta_0}[\log p(X; \theta)]

Thus the Kullback-Leibler distance here is

    D(p_{\theta_0} \| p_\theta) = E_{\theta_0}[\log p(X; \theta_0)] - E_{\theta_0}[\log p(X; \theta)],

where the first term is the negative entropy of the true distribution.
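This decomposition is easy to verify numerically. Here is a small sketch with two Bernoulli distributions (illustrative parameter values, not from the report):

```python
import math

# KL distance between Bernoulli(a) and Bernoulli(b)
def kl_bernoulli(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# E_{theta_0}[log p(X; theta)] for the Bernoulli model
def expected_loglik(a, b):
    return a * math.log(b) + (1 - a) * math.log(1 - b)

p0, p = 0.3, 0.5  # illustrative true and model parameters

d = kl_bernoulli(p0, p)
decomposition = expected_loglik(p0, p0) - expected_loglik(p0, p)

assert abs(d - decomposition) < 1e-12
assert d > 0  # nonnegative, and zero only when p == p0
print(f"KL = {d:.4f}")
```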

The entropy is a value we cannot change. So in order to maximize the expected log likelihood, we need to minimize the Kullback-Leibler distance. The negative log likelihood can be seen as a kind of error here: the Kullback-Leibler deviation.


For the MLE, we substitute into the Kullback-Leibler inequality $D(p_{\theta_0} \| p_\theta) \geq 0$ as follows.

    E_{\theta_0}[\log p(X; \theta)] \leq E_{\theta_0}[\log p(X; \theta_0)]

with equality only when $\theta = \theta_0$ (assuming the model is identifiable). With the log likelihood, we are equivalently maximizing

    \frac{1}{N} \sum_{i=1}^N \log p(x_i; \theta),

and the Law of Large Numbers says that

    \frac{1}{N} \sum_{i=1}^N \log p(x_i; \theta) \xrightarrow{P} E_{\theta_0}[\log p(X; \theta)].

We know from the above inequality that

    E_{\theta_0}[\log p(X; \theta)] \leq E_{\theta_0}[\log p(X; \theta_0)].

This tells us that $E_{\theta_0}[\log p(X; \theta)]$ is maximized for $\theta = \theta_0$. Therefore, for large $N$, $\frac{1}{N} \sum_i \log p(x_i; \theta)$ must also be maximized near $\theta = \theta_0$, which means that as $N \to \infty$,

    \hat{\theta} \xrightarrow{P} \theta_0.

Thus we have proved that the MLE is consistent!
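A simulation sketch of this consistency result (assumed setup: Bernoulli data, for which the MLE is the sample mean): the estimation error should shrink as N grows.

```python
import random

random.seed(2)

theta0 = 0.7  # illustrative true parameter
errors = {}
for N in [100, 10_000, 1_000_000]:
    x = [1 if random.random() < theta0 else 0 for _ in range(N)]
    theta_hat = sum(x) / N  # the Bernoulli MLE
    errors[N] = abs(theta_hat - theta0)
    print(f"N = {N:>9}: |theta_hat - theta0| = {errors[N]:.5f}")

# The error is on the order of 1/sqrt(N), so large N should be very accurate.
assert errors[1_000_000] < 0.01
```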


Now we want to show that the MLE is the optimal estimator among all unbiased estimators. We want to show that it has minimum variance, so we must first prove a lower bound on the variance, and then we can show that the MLE achieves that variance.

First, we need to recall some basic properties. Again denote the score $s(X; \theta) = \frac{\partial}{\partial\theta} \log p(X; \theta)$. We will again use the property:

    \frac{\partial}{\partial\theta} p(x; \theta) = p(x; \theta) \, \frac{\partial}{\partial\theta} \log p(x; \theta)
This property can be used to decompose $\frac{\partial}{\partial\theta} E_\theta[\hat{\theta} - \theta]$.

    \frac{\partial}{\partial\theta} E_\theta[\hat{\theta} - \theta]
      = \int (\hat{\theta} - \theta) \frac{\partial}{\partial\theta} p(x; \theta) \, dx - \int p(x; \theta) \, dx
      = E_\theta[(\hat{\theta} - \theta) \, s(X; \theta)] - 1
The left hand side of this equation is the derivative of the bias, which is identically zero for an unbiased estimator, and therefore the left hand side is zero. Also, because the means of $\hat{\theta} - \theta$ and $s(X; \theta)$ are both zero, we have a covariance on the right hand side:

    \mathrm{Cov}(\hat{\theta}, s(X; \theta)) = E_\theta[(\hat{\theta} - \theta) \, s(X; \theta)] = 1.
We can now utilize the Cauchy-Schwarz bound on covariance.

    1 = \mathrm{Cov}(\hat{\theta}, s)^2 \leq \mathrm{Var}(\hat{\theta}) \, \mathrm{Var}(s) = \mathrm{Var}(\hat{\theta}) \, E_\theta[s(X; \theta)^2]
To be more general for a moment, let $\hat{\theta}$ be any unbiased estimator. Because we have

    E_\theta[\hat{\theta}] = \int \hat{\theta} \, p(x; \theta) \, dx = \theta,
and taking the derivative of both sides, we have

    \int \hat{\theta} \, \frac{\partial}{\partial\theta} p(x; \theta) \, dx = 1.
Again denoting $s(X; \theta) = \frac{\partial}{\partial\theta} \log p(X; \theta)$, this says $E_\theta[\hat{\theta} \, s(X; \theta)] = 1$, and the same Cauchy-Schwarz argument gives

    \mathrm{Var}(\hat{\theta}) \geq \frac{1}{E_\theta[s(X; \theta)^2]}.
The denominator, $I(\theta) = E_\theta[s(X; \theta)^2]$, is called the Fisher Information for the data set.


We have arrived at the Cramer-Rao Lower Bound: no unbiased estimator can beat this lower bound on variance. An estimator that reaches the bound is called an efficient estimator, or a Minimum Variance Unbiased Estimator. This result holds for any unbiased estimator $\hat{\theta}$.
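Here is a sketch of the bound in action (my own example, assuming Gaussian data with known variance): for $X \sim \mathcal{N}(\theta, \sigma^2)$, the per-sample Fisher information is $1/\sigma^2$, so the bound for $N$ samples is $\sigma^2/N$, and the sample mean, an unbiased estimator, attains it.

```python
import random
from statistics import pvariance

random.seed(3)

theta, sigma, N, trials = 0.0, 2.0, 50, 5_000  # illustrative values
crlb = sigma**2 / N  # Cramer-Rao Lower Bound for the mean of N samples

# Simulate many data sets and measure the variance of the sample mean.
estimates = [sum(random.gauss(theta, sigma) for _ in range(N)) / N
             for _ in range(trials)]
var_hat = pvariance(estimates)

print(f"empirical Var = {var_hat:.4f}, CRLB = {crlb:.4f}")
assert abs(var_hat - crlb) / crlb < 0.1  # matches up to sampling noise
```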


Now to examine the optimality of the ML estimator, first recall the estimating equation

    \frac{1}{N} \sum_{i=1}^N g(x_i; \hat{\theta}) = 0.
Substituting the score function for $g$, which defines the ML estimator, we have:

    \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial\theta} \log p(x_i; \theta) \Big|_{\theta = \hat{\theta}} = 0
This will be useful after we do a bit more math. Remember, what we really want to find is how the variance of our ML estimator behaves asymptotically. Does it converge to the Cramer-Rao Lower Bound?


Let's approximate our class of estimators with a linear function by doing a Taylor expansion around the true value $\theta_0$. If we let $G_N(\theta) = \frac{1}{N} \sum_{i=1}^N g(x_i; \theta)$, we have:

    0 = G_N(\hat{\theta}) = G_N(\theta_0) + (\hat{\theta} - \theta_0) \, G_N'(\theta_0) + r_N,
where $r_N$ is a very small remainder that goes to zero as $N$ tends to infinity. From this linear perspective, $G_N(\theta_0)$ is the intercept and $G_N'(\theta_0)$ is the slope. Rearranging and ignoring the remainder,

    \hat{\theta} - \theta_0 \approx -\frac{G_N(\theta_0)}{G_N'(\theta_0)}.

By the central limit theorem, the numerator on the right hand side converges to a normal distribution (recall $E_{\theta_0}[g(X; \theta_0)] = 0$);

    \sqrt{N} \, G_N(\theta_0) \xrightarrow{d} \mathcal{N}\big(0, \; E_{\theta_0}[g(X; \theta_0)^2]\big),
and by the Law of Large Numbers, the denominator converges to its expected value.

    G_N'(\theta_0) \xrightarrow{P} E_{\theta_0}\left[\frac{\partial}{\partial\theta} g(X; \theta_0)\right]

By Slutsky's theorem, this ratio then converges as follows.

    \sqrt{N} \, (\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, \; \frac{E_{\theta_0}[g(X; \theta_0)^2]}{\left(E_{\theta_0}\left[\frac{\partial}{\partial\theta} g(X; \theta_0)\right]\right)^2}\right)
This is great news! For the MLE, $g$ is the score $s$, and the information identity gives:

    E_{\theta_0}[s(X; \theta_0)^2] = -E_{\theta_0}\left[\frac{\partial^2}{\partial\theta^2} \log p(X; \theta_0)\right] = I_1(\theta_0),

where $I_1(\theta_0)$ is the Fisher information per sample.
So, our convergence becomes

    \sqrt{N} \, (\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, \; \frac{I_1(\theta_0)}{I_1(\theta_0)^2}\right) = \mathcal{N}\left(0, \; \frac{1}{I_1(\theta_0)}\right).
Of course, we recognize this variance as the inverse of the Fisher Information. Very nice!! This is the asymptotic distribution of the MLE, and it shows that the MLE is asymptotically efficient: it attains the Cramer-Rao Lower Bound in the limit.
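A simulation sketch of this asymptotic result (assumed setup: Bernoulli data, for which $I_1(\theta_0) = 1/(\theta_0(1-\theta_0))$): the scaled error $\sqrt{N}(\hat{\theta} - \theta_0)$ should have variance close to $1/I_1(\theta_0)$.

```python
import math
import random
from statistics import pvariance

random.seed(4)

theta0, N, trials = 0.4, 500, 5_000  # illustrative values
inv_fisher = theta0 * (1 - theta0)   # 1 / I_1(theta0) for the Bernoulli model

zs = []
for _ in range(trials):
    x = [1 if random.random() < theta0 else 0 for _ in range(N)]
    theta_hat = sum(x) / N  # the Bernoulli MLE
    zs.append(math.sqrt(N) * (theta_hat - theta0))

var_hat = pvariance(zs)
print(f"empirical Var = {var_hat:.4f}, 1/I_1 = {inv_fisher:.4f}")
assert abs(var_hat - inv_fisher) / inv_fisher < 0.1
```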