**Estimation and Regression**

A report[i] for Statistical Theory class by Laura Balzano, written January/February 2005

In science, engineering, and statistics, we are often interested in predicting the future based on the past. When you have data and you would like to fit a curve to that data as well as possible, you do a regression. When you have collected samples and you would like to estimate the parameters that generated those samples, you do estimation. The two boil down to the same thing: creating a model of the natural world from data samples so that you can interpolate and extrapolate more information.

**Astronomy**

During the 17^{th} and 18^{th} centuries, many prominent mathematicians were interested in understanding the orbits of the planets. Isaac Newton developed a now well-known equation for the gravitational pull between two masses, $F = G\frac{m_1 m_2}{r^2}$. The orbits of the planets can be fairly well described by using this equation to calculate the gravitational pull between each planet and the sun, which is the body in our solar system with the most mass. However, this equation did not give an exact orbit; the masses of some of the planets, particularly the largest two, Jupiter and Saturn, had an additional effect on how each orbit traced itself out in the sky.

In 1749 Euler, after setting out to solve the problem of Jupiter's influence on Saturn, derived an equation describing the motion. The equation involved many variables, some of which could be calculated from observations.

Several of the variables were given, and others could be easily calculated; Euler had a set of 75 observations with which to measure these quantities. Still, that left six variables that remained unknown, and he wanted to estimate them. So with 75 equations and 6 unknowns, mathematicians were now faced with the problem of how best to solve for the 6 unknown variables.

**Linear Regression and Combinations of Equations**

It turned out that Euler’s equation was linear in the unknown variables. We can therefore formulate the question in terms of a linear regression—how do we estimate those 6 unknowns in order to best fit the data?

Rewrite the equation as

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_6 x_{i6} + \epsilon_i, \qquad i = 1, \ldots, 75$

where $\epsilon_i$ is the error in the $i^{th}$ observation,

or, $Y = X\beta + \epsilon$ with

$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_{75} \end{bmatrix}, \qquad X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,6} \\ \vdots & \ddots & \vdots \\ x_{75,1} & \cdots & x_{75,6} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_6 \end{bmatrix}, \qquad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{75} \end{bmatrix}$
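
To make the notation concrete, here is a small synthetic setup in NumPy mirroring the dimensions of Euler's problem: 75 observations and 6 unknowns. The data is randomly generated for illustration, not Euler's actual observations.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 75, 6                        # 75 observations, 6 unknowns, as in Euler's problem
X = rng.normal(size=(n, d))         # design matrix: one row of known quantities per observation
beta_true = np.arange(1.0, d + 1)   # the 6 unknowns we pretend nature chose
Y = X @ beta_true + 0.1 * rng.normal(size=n)  # observations: Y = X*beta + error

print(X.shape, Y.shape)             # (75, 6) (75,)
```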

Around this time, the usual approach to such overdetermined systems was to combine the n observational equations into as many aggregate equations as there were unknowns, by adding or averaging carefully chosen subsets of them, and then to solve the resulting square system.

**Least Squares**

In 1805, Legendre took a different approach. Instead of trying to find the best way to combine equations, he wanted to start from some criterion which he could optimize. He created an objective function which measures the difference between what is observed and what is predicted, and he decided to minimize this difference. Legendre designed this objective to be scalar: it takes the observed equations and returns one number.

$Q(\beta) = \sum_{i=1}^{75} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_6 x_{i6} \right)^2$

$= (Y - X\beta)^T (Y - X\beta) = \|Y - X\beta\|^2$

This method is named Least Squares. Gauss later claimed that he had been using this method since 1795.

**LS Estimating Equation**

Because Legendre’s objective equation is convex, we can minimize this difference by taking the derivative and setting it to zero. This leaves us with the estimating equation for beta. Watch the matrix and vector notation very carefully—it is tricky to keep it straight.

$\frac{\partial}{\partial \beta} \|Y - X\beta\|^2 = \frac{\partial}{\partial \beta} \left( Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta \right) = 0$

This last step is the trickiest matrix-wise and even I have not convinced myself of it …

$-2 X^T Y + 2 X^T X \beta = 0 \quad \Rightarrow \quad \hat{\beta} = (X^T X)^{-1} X^T Y$
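
As a numerical check of the estimating equation, we can solve the normal equations directly on synthetic data and compare against NumPy's built-in least-squares solver, which minimizes the same objective:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 75, 6
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
Y = X @ beta_true + 0.05 * rng.normal(size=n)

# Estimating equation: X^T X beta_hat = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```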

If you can be convinced of that… or even if not: We can generalize this estimating equation as follows. Assume we have n equations and d unknowns, and we would like to have d equations.

$Y = X\beta, \qquad Y \in \mathbb{R}^n, \quad X \in \mathbb{R}^{n \times d}, \quad \beta \in \mathbb{R}^d$

To turn our n equations into d, we can multiply both sides by a matrix A of dimensions d by n, and then solve for beta. Now we see that $AY = AX\beta$, so $\hat{\beta} = (AX)^{-1} A Y$. Least squares is the special case $A = X^T$.
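
A quick sketch of this combination-of-equations idea: with noiseless data, any d-by-n matrix A with AX invertible recovers beta exactly, whether A is the least-squares choice or an arbitrary one.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 75, 6
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
Y = X @ beta_true                   # noiseless, so any valid A recovers beta exactly

def combo_estimator(A, X, Y):
    # beta_hat = (A X)^{-1} A Y for a d-by-n matrix A with A X invertible
    return np.linalg.solve(A @ X, A @ Y)

A_ls = X.T                          # the least-squares choice of A
A_other = rng.normal(size=(d, n))   # an arbitrary combination of the 75 equations

print(np.allclose(combo_estimator(A_ls, X, Y), beta_true))     # True
print(np.allclose(combo_estimator(A_other, X, Y), beta_true))  # True
```

With noisy data the two estimators no longer agree, and the question of which A is best is exactly the optimality question addressed below.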

**Projection and the Cauchy Inequality**

We can view the estimate, $\hat{Y} = X\hat{\beta}$, as a projection of Y onto the space spanned by the columns of the model X. We can see that $\|\hat{Y}\|$ will never be larger than $\|Y\|$ from Cauchy's inequality.

$\hat{Y} = X (X^T X)^{-1} X^T Y = H Y, \qquad \|\hat{Y}\|^2 = Y^T H Y = \langle \hat{Y}, Y \rangle \le \|\hat{Y}\| \, \|Y\| \quad \Rightarrow \quad \|\hat{Y}\| \le \|Y\|$
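
We can also verify the projection picture numerically on synthetic data: the fitted vector is never longer than Y, and the residual is orthogonal to the fit.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 75, 6
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# The hat matrix H projects any vector onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y

# A projection never lengthens a vector, and the residual is orthogonal to the fit
print(np.linalg.norm(Y_hat) <= np.linalg.norm(Y))         # True
print(np.isclose(Y_hat @ (Y - Y_hat), 0.0, atol=1e-8))    # True
```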

I have never had a complete intuition for Cauchy-Schwarz, and thinking of things geometrically has never been my cup of tea. So this short section is not complete, but hopefully, if you are good with this sort of thing, it might give you some insight that I don't have.

**Hypothetical Repeated Sampling**

Take the following thought experiment: if we fix the conditions for all n observations (X) and then measure Y repeatedly, each new Y will differ from all previous observations due to random error. If we repeat this 1000 times, each repetition will yield a different Y. X is fixed, and the Y we get is random.

If we then have new conditions $x_0$, how do we measure the accuracy? Recall that $Y = X\beta$. We can look at the difference between the prediction under our estimated values for beta and the prediction under the actual true value of beta.

$x_0^T \hat{\beta} - x_0^T \beta$

If we assume ergodicity, we can take the sample average and assume it will converge to the ensemble average over repeated sampling… that is, the sample average will converge to the statistical expected value. In order to then measure the accuracy, here we choose to look at the mean squared error. We have the MSE between a random variable $\hat{\theta} = x_0^T \hat{\beta}$ and a constant $c = x_0^T \beta$. Let $\mu = E[\hat{\theta}]$.

$E[(\hat{\theta} - c)^2] = E[(\hat{\theta} - \mu + \mu - c)^2] = E[(\hat{\theta} - \mu)^2] + 2(\mu - c)\,E[\hat{\theta} - \mu] + (\mu - c)^2 = \mathrm{Var}(\hat{\theta}) + (\mu - c)^2$

By first using the trick of adding and subtracting $\mu$, and by then seeing that the cross-term vanishes because $E[\hat{\theta} - \mu] = 0$, we can see from this derivation that the MSE can be decomposed into the variance and the bias-squared.
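
The decomposition is easy to verify numerically. Here is a sketch with a deliberately biased estimator of a constant; the estimator and its shrinkage factor are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 200_000
c = 2.0                                      # the constant (true value)
# A deliberately biased estimator: shrink noisy measurements of c by 0.8
theta_hat = 0.8 * (c + rng.normal(size=trials))

mse = np.mean((theta_hat - c) ** 2)
var = np.var(theta_hat)                      # sample variance
bias_sq = (theta_hat.mean() - c) ** 2        # squared sample bias
print(np.isclose(mse, var + bias_sq))        # True: MSE = variance + bias^2
```

For sample moments this identity holds exactly, not just in expectation, which makes it a convenient sanity check.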

If we then assume that the expectation of the noise error is 0, we can show that the Least Squares Estimator is unbiased, that is, bias = 0. So in order to minimize the mean squared error, all that is left is to minimize the variance of the estimator.
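
The repeated-sampling thought experiment can be simulated directly: fix X once, re-measure Y many times with fresh noise, and watch the average of the least-squares estimates settle on the true beta.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, trials = 75, 6, 2000
X = rng.normal(size=(n, d))                 # conditions fixed once
beta_true = rng.normal(size=d)
solver = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T, computed once

# Re-measure Y repeatedly with fresh noise and record each LS estimate
estimates = np.empty((trials, d))
for t in range(trials):
    Y = X @ beta_true + rng.normal(size=n)
    estimates[t] = solver @ Y

# The average estimate is very close to the truth: the estimator is unbiased
print(np.abs(estimates.mean(axis=0) - beta_true).max() < 0.05)  # True
```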

**Optimality of Least Squares**

How can we discover whether the Least Squares Estimator has an optimally minimal variance and minimizes the mean squared error? Let's apply the variance + bias-squared decomposition to the general combination-of-equations form from above. This time assume that $Y = X\beta + \epsilon$, where $\epsilon$ is measurement error with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$.

$\hat{\beta}_A = (AX)^{-1} A Y = (AX)^{-1} A (X\beta + \epsilon) = \beta + (AX)^{-1} A \epsilon$

Now, let $B = (AX)^{-1} A$, so that $\hat{\beta}_A = \beta + B\epsilon$. Taking expectations,

$E[\hat{\beta}_A] = \beta + B \, E[\epsilon] = \beta$

This shows that the bias = 0, or that this matrix-A estimator is an unbiased estimator. Recall that $\hat{\beta}_A = \beta + B\epsilon$. Next we look at the variance. The first term on the right-hand side is a constant, so the variance is as follows.

$\mathrm{Var}(\hat{\beta}_A) = \mathrm{Var}(B\epsilon) = B \, \mathrm{Var}(\epsilon) \, B^T = \sigma^2 (AX)^{-1} A A^T (AX)^{-T} \qquad \text{(Eqn 1)}$

For $A = X^T$, this reduces to

$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} \qquad \text{(Eqn 2)}$

Can we show that (Eqn 1) >= (Eqn 2), that is, that in general any A will have equal or higher variance than the choice $A = X^T$? Again, we refer to Cauchy-Schwarz.

For any vector $c$, apply Cauchy-Schwarz, $(u^T v)^2 \le (u^T u)(v^T v)$, with $u = A^T (AX)^{-T} c$ and $v = X (X^T X)^{-1} c$:

$u^T v = c^T (AX)^{-1} A X (X^T X)^{-1} c = c^T (X^T X)^{-1} c$

$\left( c^T (X^T X)^{-1} c \right)^2 \le \left( c^T (AX)^{-1} A A^T (AX)^{-T} c \right) \left( c^T (X^T X)^{-1} c \right)$

$\Rightarrow \quad c^T (X^T X)^{-1} c \le c^T (AX)^{-1} A A^T (AX)^{-T} c$

And again, this is completely hand-wavy to me. Where did the choice $v = X (X^T X)^{-1} c$ come from? I do not know. But still, we see that since $A = X^T$ attains equality, Least Squares achieves the lowest value here. In conclusion, under the assumptions that we have a correct model, a constant variance, and independent error distributions, the Least Squares Estimator is optimal.
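
This Gauss-Markov conclusion can be checked numerically: for a perturbed choice of A, the covariance from (Eqn 1) minus the least-squares covariance (Eqn 2) should be positive semidefinite. The perturbation below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma2 = 75, 6, 1.0
X = rng.normal(size=(n, d))

def estimator_cov(A):
    # Var(beta_hat_A) = sigma^2 (A X)^{-1} A A^T (A X)^{-T}   (Eqn 1)
    AX_inv = np.linalg.inv(A @ X)
    return sigma2 * AX_inv @ A @ A.T @ AX_inv.T

cov_ls = estimator_cov(X.T)                   # reduces to sigma^2 (X^T X)^{-1}  (Eqn 2)
cov_other = estimator_cov(X.T + 0.3 * rng.normal(size=(d, n)))

# The difference should be positive semidefinite: every eigenvalue >= 0
eigvals = np.linalg.eigvalsh(cov_other - cov_ls)
print(eigvals.min() > -1e-8)  # True
```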

**Model Capacity, Training and Testing Errors, Bias and Variance**

**Orthonormal Regressors**

Let us assume our regressors $X_1, \ldots, X_d$ (the columns of X) are orthonormal, that is, the inner product between two different columns gives us zero, whereas the inner product of a column with itself gives us 1. Think: What does it mean for the Xs to be orthonormal? I'm not sure of the intuition, but algebraically it means $X^T X = I$. Now because we assume an orthonormal basis, we can do individual projections of Y onto each X.

$\hat{\beta} = (X^T X)^{-1} X^T Y = X^T Y, \qquad \text{so} \quad \hat{\beta}_j = X_j^T Y$

We can then see that

$\hat{Y} = X \hat{\beta} = \sum_{j=1}^{d} (X_j^T Y) \, X_j$
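
A short check of this: build an orthonormal design with a QR factorization, and the full least-squares solution coincides with the coordinate-wise inner products.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 75, 6
# QR factorization gives orthonormal columns: X^T X = I
X, _ = np.linalg.qr(rng.normal(size=(n, d)))
Y = rng.normal(size=n)

beta_full = np.linalg.solve(X.T @ X, X.T @ Y)   # the usual LS solution
beta_indiv = X.T @ Y                            # one inner product X_j . Y per regressor

print(np.allclose(X.T @ X, np.eye(d)))          # True
print(np.allclose(beta_full, beta_indiv))       # True
```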

**The Model**

At this point we would like to take a look at our model. Our best model can only approximate the truth. We will have training data and testing data, and from our training data we hope to predict our testing data accurately. Enter the trade-off between accuracy and simplicity. If we fit our training data too closely, the model will be very complex and will be a poor predictor of our testing data. On the other hand, if our model is simple but not accurate enough, we will miss the patterns in the training data.

$Y = f + \epsilon, \qquad E[\epsilon] = 0, \qquad \mathrm{Var}(\epsilon) = \sigma^2 I,$

where $f$ denotes the truth that our model can only approximate.

To understand this, let's first look at the training error of the fit $\hat{Y}$.

$E\|Y - \hat{Y}\|^2 = E\|(I - H)(f + \epsilon)\|^2 = \|(I - H) f\|^2 + E\|(I - H)\epsilon\|^2 = \|(I - H) f\|^2 + (n - d)\,\sigma^2,$

where $f = E[Y]$ is the truth and $H = X (X^T X)^{-1} X^T$ is the projection from above.

Here we see that when examining the training data, if we increase the dimensionality d, both terms drop: the bias term shrinks as the model grows richer, and the residual noise term $(n - d)\sigma^2$ shrinks as well. The cross-term cancels because the noise has mean zero, and the noise term follows from $E\|(I - H)\epsilon\|^2 = \sigma^2 \, \mathrm{tr}(I - H) = (n - d)\sigma^2$. Now let's examine the testing error, measured on fresh observations taken under the same conditions.

$E\|Y^* - \hat{Y}\|^2 = \|(I - H) f\|^2 + E\|\epsilon^*\|^2 + E\|H \epsilon\|^2 = \|(I - H) f\|^2 + n\sigma^2 + d\,\sigma^2,$

where $Y^* = f + \epsilon^*$ is a fresh set of observations with noise independent of $\epsilon$.

Now we see that larger d will come back to hurt us. With the testing data, a higher d gives us a lower bias… but it also increases the variance of our estimator. We have two extremes. Too simple a model is dumb, whereas too complex a model is “superstitious,” and captures pure coincidence.
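
This train/test behavior shows up clearly in a small simulation. Below, polynomial models of increasing size are fit to noisy samples of a sine curve; the curve, noise level, and range of d are arbitrary choices for illustration. Training error falls monotonically with d, while test error typically turns back up once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma = 40, 0.5
x = np.linspace(-1, 1, n)
truth = np.sin(np.pi * x)                      # the "truth" our model can only approximate
y_train = truth + sigma * rng.normal(size=n)   # training measurements
y_test = truth + sigma * rng.normal(size=n)    # fresh measurements, same conditions

train_err, test_err = [], []
for d in range(1, 16):
    Xd = np.vander(x, d)                       # polynomial model with d coefficients
    beta = np.linalg.lstsq(Xd, y_train, rcond=None)[0]
    fit = Xd @ beta
    train_err.append(np.mean((y_train - fit) ** 2))
    test_err.append(np.mean((y_test - fit) ** 2))

# Training error only improves as d grows; test error does not
print(train_err[0] > train_err[-1])  # True
```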

In general, our testing error for an estimator $\hat{\beta}$ can be derived as follows.

$E\|Y^* - X\hat{\beta}\|^2 = n\sigma^2 + \sum_{j=1}^{d} E\left[(\hat{\beta}_j - \beta_j)^2\right] + \sum_{j > d} \beta_j^2,$

writing the truth in the basis of the orthonormal regressors, so that the $\beta_j$ for $j > d$ are the components our model leaves out.

We can reduce the center term with some intelligent manipulations of our estimator. One method is called "thresholding," where we set small $\hat{\beta}_j$ to zero. This is like zeroing out the signal in places where it is much smaller than the noise.

$\tilde{\beta}_j = \begin{cases} \hat{\beta}_j & \text{if } |\hat{\beta}_j| > \tau \\ 0 & \text{otherwise} \end{cases}$
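
A sketch of thresholding at work, in the orthonormal setting where each coefficient estimate is just the true coefficient plus noise. The signal pattern and the cutoff of three noise standard deviations are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(9)
d, sigma = 50, 1.0
beta = np.zeros(d)
beta[:5] = 10.0                          # a few strong coefficients; the rest are zero
z = beta + sigma * rng.normal(size=d)    # noisy coefficient estimates

tau = 3 * sigma                          # cutoff: a few noise standard deviations
beta_thresh = np.where(np.abs(z) > tau, z, 0.0)

err_raw = np.sum((z - beta) ** 2)
err_thresh = np.sum((beta_thresh - beta) ** 2)
print(err_thresh < err_raw)              # thresholding wins when most coefficients are zero
```

The raw estimate pays a full noise penalty on every coefficient, while the thresholded estimate pays it only on the few coefficients it keeps.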

Another method is called "shrinkage," or the Ridge Estimator. A shrinking parameter $\lambda$ for our estimator leads to a smaller error.

$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y$

The Ridge Estimator is an example of the tradeoff between variance and bias. Instead of minimizing $\|Y - X\beta\|^2$, we are minimizing $\|Y - X\beta\|^2 + \lambda \|\beta\|^2$. With the first term, we do our best to explain the data, and with the second term we curb the model from chasing noise. As an example, consider spline regression. In a linear spline, each $\beta_j$ represents a change in slope, and penalizing $\sum_j \beta_j^2$ makes the curve smooth. Wavelet regression is another example of a more complicated regressor.
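
Here is a minimal sketch of the ridge estimator next to plain least squares; the synthetic data and the penalty $\lambda = 5$ are arbitrary choices. The penalty visibly shrinks the coefficient vector toward zero.

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, lam = 75, 6, 5.0
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(size=n)

# Ridge: the minimizer of ||Y - X beta||^2 + lam * ||beta||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)   # plain least squares (lam = 0)

# For lam > 0 the ridge solution is strictly shorter than the LS solution
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls))  # True
```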

Finally, here is a brief note on an interesting result that I have not explored in detail: in normal noise, thresholding at a level proportional to $\sigma \sqrt{2 \log d}$ comes within a logarithmic factor of the risk of an oracle that knows in advance which coefficients to keep.

From these last few sections we must take away that there is an important tradeoff between the complexity of the model, degree d, and the simplicity of the model in predicting future data. We can take the approach of reducing the effective degrees of freedom in the model to make it simpler, even when we have a complex model with high degree d.

References:

My notes from Professor Wu’s STAT 200B course, winter quarter 2005, University of California at Los Angeles. His webpage can be found here: http://www.stat.ucla.edu/~ywu

This homework assignment page: http://www.stat.ucla.edu/~ywu/teaching/200BHW1.pdf

Wikipedia:

http://en.wikipedia.org/wiki/Gauss-Markov_theorem

http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss

http://en.wikipedia.org/wiki/Least_squares

Kay, Steven. Fundamentals of Statistical Signal Processing: Estimation Theory. 1993.

[i] I would like to apologize for the incomplete proofs and lack of rigor in this report. I plan to use it as a basic reference and to improve the proofs and arguments as time passes. I welcome any suggestions to sunbeam@ee.ucla.edu.