Estimation and Regression

A report[i] for Statistical Theory class by Laura Balzano, written January/February 2005


click here for this report in pdf format


In science, engineering and statistics, we are often interested in predicting the future based on the past. When you have data and you would like to best fit a curve to that data, you do a regression. When you have collected samples and you would like to estimate parameterization of those samples, you do estimation. They boil down to the same thing: creating a model for the natural world from data samples such that you can interpolate and extrapolate more information.




During the 17th and 18th centuries, many prominent mathematicians were interested in understanding the orbit of the planets. Isaac Newton developed a now well-known equation for the gravitational pull between two masses, . The orbit of the planets can be fairly well described by using this equation to calculate the gravitational pull between the planet and the sun, which is the body in our solar system with the most mass. However, this equation did not give an exact orbit; the mass of some of the planets, particularly the largest two, Jupiter and Saturn, had an additional affect on how the orbit traced itself out in the sky.


In 1749 Euler, after setting out to solve the problem of Jupiter’s influence on Saturn, came up with an equation solution. There were many variables and some of them could be calculated from observations.



were given, and could be easily calculated. There was a set of 75 observations that he used to measure these quantities. Still, that left him with … six more variables that remained unknown, and he wanted to estimate them. So with 75 equations and 6 unknowns, mathematicians were now faced with the problem of how best to solve for the 6 unknown variables.


Linear Regression and Combinations of Equations


It turned out that Euler’s equation was linear in the unknown variables. We can therefore formulate the question in terms of a linear regression—how do we estimate those 6 unknowns in order to best fit the data?


Rewrite the equation as




or,  with




Around this time, Laplace had a set of 24 equations with 4 unknown variables that he was trying to solve. His method for solving the equations was to choose 24 of the equations and combine them so as to get only 4 equations. His combinations were made up of additions and subtractions of various of the 24 equations in such a way that Laplace thought would find the best estimate for the unknown variables. It seems, though, that we could come up with many ways of combining equations to solve for the unknowns in a problem. What is the best way, and how do we know we are using the best way?


Least Squares


In 1805, Legendre decided to take a different approach. Instead of trying to find the best way to combine equations, he wanted to start from some criterion which he could optimize. He created an objective equation which finds the difference between what is observed and what is predicted. He decided to minimize this difference. Legendre designed this objective equation to be scalar-- it takes the observed equations and returns one number.




This method is named Least Squares. Gauss later claimed that he had been using this method since 1795.


LS Estimating Equation


Because Legendre’s objective equation is convex, we can minimize this difference by taking the derivative and setting it to zero. This leaves us with the estimating equation for beta. Watch the matrix and vector notation very carefully—it is tricky to keep it straight.


This last step is the trickiest matrix-wise and even I have not convinced myself of it …




If you can be convinced of that… or even if not: We can generalize this estimating equation as follows. Assume we have n equations and d unknowns, and we would like to have d equations.



To turn our n equations into d, we can multiply both sides by a matrix A of dimensions d by n, and then solve for beta. Now we see that Least Square is of this general form with.


Projection and the Cauchy Inequality


We can view the estimate,, as a projection of Y into the space of the model X. We can see that  will always be smaller than Y from Cauchy’s inequality.



I have never had a complete intuition for Cauchy-Schwartz, and thinking of things geometrically has never been my cup of tea. So this short section is not complete, but hopefully, if you are good with this sort of thing, it might give you some insight that I don’t have.


Hypothetical Repeated Sampling


Take the following thought experiment: If we can fix the conditions for all n observations (X), and then measure Y repeatedly, the Y will be different from all previous observations due to random error. If we repeat this 1000 times, each time will yield a different Y. X is fixed and the Y we get is random.


If we then have new conditions, how do we measure the accuracy? Recall that Y=X*beta. We can look at the difference between the new conditions with our estimated values for beta and the new conditions with the actual true value of beta.




If we assume ergodicity, we can take the sample average and assume it will converge to the ensemble average over repeated sampling… that is, the sample average will converge to the statistical expected value. In order to then measure the accuracy, here we choose to look at the mean squared error. We have the MSE between a random variable  and a constant. Let.





By first using the trick of adding and subtracting, and by then seeing that the cross-term goes to zero with the term, we can see from this derivation that the MSE can be decomposed into the variance and the bias-squared.


If we then assume that the expectation of the noise error is 0, we can show that the Least Squares Estimator is unbiased, that is bias = 0. So in order to minimize the mean squared error, we simply have left to minimize the variance of the estimator.


Optimality of Least Squares


How can we discover whether Least Squares Estimator has an optimally minimal variance and minimizes the mean square error? Let’s apply the variance + bias-squared decomposition to the general combination-of-equations form from above. This time assume that, where  is measurement error.




Now, let



This concludes that the bias = 0, or that this matrix-A estimator is an unbiased estimator. Recall Least Square is therefore also in this category of unbiased estimators, because for Least Square. Next we look at the variance. Recall that the first term on the right hand side is a constant. The variance is therefore as follows.




For Least Square, and




Can we show that (Eqn 1) >= (Eqn 2), that is, in general any A will have equal or higher variance than the Least Square? Again, we refer to Cauchy-Schwartz.




And again, this is completely hand-wavy to me. Where did  come from on the fourth line? I do not know. But still, we see that since, Least Square achieves the lowest value here. In conclusion, under the assumption that we have a correct model, a constant variance, and independent error distributions, the Least Squares Estimator is optimal.


Model Capacity, Training and Testing errors, Bias and Variance


Orthonormal Regressors


Let us assume our are orthonormal, that is, that the inner product between two X will give us zero, whereas the inner product of an X with itself will give us 1. Think: What does it mean for Xs to be orthonormal? I’m not sure, but let’s go on. Now because we assume orthonormal basis, we can do individual projections of Y onto each X.





We can then see that Least Square can decompose our estimate into a signal part and a noise part.




The Model


At this point we would like to take a look at our model. Our best model can only approximate the truth. We will have training data and testing data, and from our training data we hope to predict our testing data accurately. Enter the trade-off between accuracy and simplicity. If we make our model too accurate to our training data, the model will be very complex and will be a poor predictor of our testing data. On the other hand, if our model is simple but not accurate enough, we will miss the patterns in the training data.




To understand this let’s first look at the training error of Least Square.



Here we see that when examining the training data, if we increase the dimensionality d, both bias and variance of the estimator drop. My questions: Why does the cross-term cancel on line 6? And how do we get line 8 from line 6? The steps for the testing error are much clearer. Now let’s examine the testing error.



Now we see that larger d will come back to hurt us. With the testing data, a higher d gives us a lower bias… but it also increases the variance of our estimator. We have two extremes. Too simple a model is dumb, whereas too complex a model is “superstitious,” and captures pure coincidence.


In general, our testing error for estimator can be derived as follows.




We can reduce the center term with some intelligent manipulations on our estimator. One method is called “thresholding,” where we set small  to zero. This is like zeroing out the signal in places where it is much smaller than the noise.




Another method is called “shrinkage,” or the Ridge Estimator. A shrinking parameter for our estimator leads to a smaller error.




The Ridge Estimator is an example of the tradeoff between variance and bias. Instead of minimizing, we are minimizing. With the first term, we do our best to explain the data and with the second term we curb the model from choosing noise. As an example, consider the spline regression. In linear spline, represents a change in slope. We then minimize  to make the curve smooth. Wavelet regression is another example of a more complicated regressor.


Finally, here is a brief note on an interesting result that I have not explored in detail. In normal noise with, the Least Square method (i.e., Maximum Likelihood because of normal noise) can always be beaten by the Stein Estimator.





From these last few sections we must take away that we have an important tradeoff between complexity of the model, degree d, and simplicity of the model in predicting future data. We can take the approach of reducing the actual degree of freedom in the model to make it simpler, even when we have a complex model with high degree d.





My notes from Professor Wu’s STAT 200B course, winter quarter 2005, University of California at Los Angeles. His webpage can be found here:


This homework assignment page:




Kay, Steven. Fundamentals of Statistical Signal Processing: Estimation Theory. 1993.





[i] I would like to apologize for the incomplete proofs and lack of rigor in this report. I plan to use it as a basic reference and to improve the proofs and arguments as time passes. I welcome any suggestions to