%matplotlib inline
%pylab inline
from scipy import stats
from IPython.html.widgets import interact
Regression means fitting data to a line or a polynomial. The simplest case is a line, where the model is the following:
$$ y = a x + b + \epsilon $$

where $\epsilon \sim \mathcal{N}(0,\sigma^2)$ and we have to determine $a$ and $b$ from the data pairs $\{(Y_i,X_i)\}_{i=1}^n$. There is a subtle point here. The variable $x$ changes, but not as a random variable in this model. Thus, for fixed $x$, $y$ is a random variable generated by $\epsilon$. To be thorough, perhaps we should denote $\epsilon$ as $\epsilon_x$ to make this clear, but because $\epsilon$ is an identically distributed random variable at each fixed $x$, we leave it out. Because of the linearity and the Gaussian additive noise, the distribution of $y$ is completely characterized by its mean and variance.
$$ \mathbb{E}(y) = a x + b $$

Using the maximum likelihood procedure we discussed earlier, we write out the log-likelihood function as
$$ \mathcal{L}(a,b) = \sum_{i=1}^n \log \mathcal{N}(a X_i + b, \sigma^2) \propto -\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - a X_i - b)^2 $$

Note that I just threw out most of the terms that are irrelevant to the maximum-finding; maximizing this is the same as minimizing the sum of squared errors. Taking the derivative with respect to $a$ gives the following equation:

$$ \frac{\partial \mathcal{L}(a,b)}{\partial a} \propto \sum_{i=1}^n X_i (b + a X_i - Y_i) = 0 $$

Likewise, we do the same for the $b$ parameter:

$$ \frac{\partial \mathcal{L}(a,b)}{\partial b} \propto \sum_{i=1}^n (b + a X_i - Y_i) = 0 $$

Solving these two equations for $a$ and $b$ gives the parameters for the line we seek.
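Before turning to the code cell below, it may help to see these two conditions solved directly. This is a minimal check of my own (the helper name `check_normal_equations` is illustrative, and only numpy is assumed): it sets up the 2-by-2 linear system from the two derivative conditions and compares the solution against np.polyfit.

import numpy as np

def check_normal_equations(n=100, a_true=6, b_true=1, seed=0):
    rng = np.random.RandomState(seed)
    xd = np.linspace(0, 1, n)
    yd = a_true*xd + b_true + rng.randn(n)
    # the two conditions: sum(X_i*(a*X_i + b - Y_i)) = 0 and sum(a*X_i + b - Y_i) = 0
    A = np.array([[np.sum(xd**2), np.sum(xd)],
                  [np.sum(xd),    n        ]])
    rhs = np.array([np.sum(xd*yd), np.sum(yd)])
    a_hat, b_hat = np.linalg.solve(A, rhs)
    return (a_hat, b_hat), np.polyfit(xd, yd, 1)  # both give (slope, intercept)

print(check_normal_equations())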
a = 6;b = 1 # parameters to estimate
x = linspace(0,1,100)
# x = rand(100)*100 # x does not have to be monotone for this to work
y = a*x + np.random.randn(len(x)) +b
p,var_=np.polyfit(x,y,1,cov=True) # fit data to line
y_ = np.polyval(p,x) # estimated by linear regression
# draw comparative fits and histogram of errors
fig,axs=subplots(1,2)
fig.set_size_inches((10,2))
ax =axs[0]
ax.plot(x,y,'o',alpha=.5)
ax.plot(x,y_)
ax.set_xlabel("x",fontsize=18)
ax.set_ylabel("y",fontsize=18)
ax.set_title("linear regression; a =%3.3g\nb=%3.3g"%(p[0],p[1]))
ax = axs[1]
ax.hist(y_-y)
ax.set_xlabel(r"$\Delta y$",fontsize=18)
ax.set_title(r"$\hat{y}-y$",fontsize=18)
#ax.set_aspect(1/2)
The graph on the left shows the regression line plotted against the data, with the estimated parameters noted in the title. You can tweak the code in the cell above to try other values for $a$ and $b$ if you want. The histogram on the right shows the errors in the model. Note that the $x$ values do not have to be evenly spaced or even monotone; you can tweak the code above as indicated to try uniformly random points along the x-axis. It is also interesting to see how the plots change with more or fewer data points.
We said earlier that we can fix the index and have separate problems of the form
$$ y_i = a x_i + b + \epsilon $$

where $\epsilon \sim \mathcal{N}(0,\sigma^2)$. What could I do with just this one component of the problem? In other words, suppose we had $m$ samples of this component, as in $\{y_{i,k}\}_{k=1}^m$. Following the usual procedure, we could obtain an estimate of the mean of $y_i$ as

$$ \hat{y}_i = \frac{1}{m}\sum_{k=1}^m y_{i,k} $$

However, this tells us nothing about the individual parameters $a$ and $b$ because they are not separable in the terms that are computed. Namely, we have

$$ \mathbb{E}(y_i) = a x_i + b $$

but we still have only one equation and two unknowns. How about if we consider and fix another component $j$, as in

$$ y_j = a x_j + b + \epsilon $$

Likewise, we have

$$ \mathbb{E}(y_j) = a x_j + b $$

so at least now we have two equations and two unknowns, and we know how to estimate the left-hand sides of these equations from the data using the estimators $\hat{y}_i$ and $\hat{y}_j$. Let's see how this works in the code sample below.
x0 = x[0]
xn = x[80]
# twenty samples at each of the two fixed x-coordinates
y_0 = a*x0 + np.random.randn(20)+b
y_1 = a*xn + np.random.randn(20)+b
fig,ax = subplots()
ax.plot(x0*ones(len(y_0)),y_0,'o')
ax.plot(xn*ones(len(y_1)),y_1,'o')
ax.axis(xmin=-.02,xmax=1)
# solve the two equations in the two unknowns using the sample means at each location
a_,b_ = inv(np.matrix([[x0,1],[xn,1]])).dot(vstack([y_0,y_1]).mean(axis=1)).flat
x_array = np.array([x0,xn])
ax.plot(x_array,a_*x_array+b_,'-ro') # estimated line through the two sample means
ax.plot(x,a*x+b,'k--');              # true line
def plot_2_estimator(n=50):
    x0 = x[0]
    xn = x[n]
    y_0 = a*x0 + np.random.randn(20)+b
    y_1 = a*xn + np.random.randn(20)+b
    fig,ax = subplots()
    ax.plot(x0*ones(len(y_0)),y_0,'o',alpha=.3)
    ax.plot(xn*ones(len(y_1)),y_1,'o',alpha=.3)
    ax.axis(xmin=-.25,xmax=1,ymax=10,ymin=-2)
    a_,b_ = inv(np.matrix([[x0,1],[xn,1]])).dot(vstack([y_0,y_1]).mean(axis=1)).flat
    x_array = np.array([x0,x[-1]])
    ax.grid()
    ax.plot(x_array,a_*x_array+b_,'-ro')
    ax.plot(x,a*x+b,'k--')

interact(plot_2_estimator,n=(1,99,1));
We can write out the solution for the estimated parameters for this case where $x_0 = 0$:

$$ \hat{a} = \frac{\hat{y}_i - \hat{y}_0}{x_i} $$

The expectation of this estimator is

$$ \mathbb{E}(\hat{a}) = \frac{a x_i}{x_i} = a $$

so it is unbiased, and because $\hat{y}_i$ and $\hat{y}_0$ are each averages of $m$ independent samples, its variance is

$$ \mathbb{V}(\hat{a}) = \frac{2\sigma^2}{m x_i^2} $$

This shows that the estimator $\hat{a}$ has a variance that decreases as larger points $x_i$ are selected, which is what we observed in the interactive graph above.
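The following short simulation is my own sanity check of that variance claim (the helper name is illustrative): for several choices of $x_i$ it compares the empirical variance of $\hat{a} = (\hat{y}_i - \hat{y}_0)/x_i$ against the predicted $2\sigma^2/(m x_i^2)$.

import numpy as np

def check_two_point_variance(xi, m=20, a_true=6, b_true=1, sigma=1.0, trials=5000, seed=1):
    rng = np.random.RandomState(seed)
    y0 = b_true + sigma*rng.randn(trials, m)              # samples at x0 = 0
    yi = a_true*xi + b_true + sigma*rng.randn(trials, m)  # samples at xi
    a_hats = (yi.mean(axis=1) - y0.mean(axis=1))/xi       # two-point estimator per trial
    return a_hats.var(), 2*sigma**2/(m*xi**2)             # empirical vs. predicted

for xi in (0.2, 0.5, 0.8):
    print(xi, check_two_point_variance(xi))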
So, where are we now? We have a way of doing regression when there is only one sample at each $x_i$ point, and we can do the same when we have many samples at each of two x-coordinates. Is there a way to combine the two approaches? This is something that is not usually considered in discussions of regression, which deal only with the first case.
In this light, let us reconsider the first regression problem. We have only one sample at each $x_i$ coordinate, and we want to somehow combine these into unbiased estimators of $a$ and $b$. The well-known explicit formulas for this case are the following:
$$ \hat{a} = \frac{\mathbf{x}^T \mathbf{y} - \left(\sum X_i\right)\left(\sum Y_i\right)/n}{\mathbf{x}^T \mathbf{x} - \left(\sum X_i\right)^2/n} $$

Note that you can get $\hat{b}$ from $\hat{a}$ by observing that the second derivative condition above forces the residuals $Y_i - \hat{a} X_i - \hat{b}$ to sum to zero, so that summing over the $n$ components gives

$$ \sum_{i=1}^n Y_i = \hat{a}\sum_{i=1}^n X_i + n\hat{b} $$

Then, solving this equation for $\hat{b}$ with the $\hat{a}$ above gives

$$ \hat{b} = \frac{\sum_{i=1}^n Y_i - \hat{a}\sum_{i=1}^n X_i}{n} $$

In the case where there are only two $X_i$ values, namely $X_0 = 0$ (with $n-m$ samples) and $X_j$ (with $m$ samples), we obtain

$$ \hat{a} = \frac{X_j \sum_{\{X_j\}} Y_i - m X_j \sum Y_i/n}{m X_j^2 - m^2 X_j^2/n} = \frac{\sum_{\{X_j\}} Y_i/m - \left(\sum_{\{X_0\}} Y_i + \sum_{\{X_j\}} Y_i\right)/n}{X_j(1 - m/n)} = \frac{\hat{y}_j - \hat{y}_0}{X_j} $$

which is the same as our first estimator. This means that the general theory is capable of handling the case of multiple samples at the same $X_j$. So what does this mean? We saw that having samples further out along the x-coordinate reduced the variance of the estimator in the two-sample study. Does this mean that in the regression model, where there is only one sample per x-coordinate, the samples further along the x-coordinate are likewise more valuable in terms of variance reduction?
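Before pursuing that question, here is a numerical cross-check of my own (under the assumed setup of $n-m$ samples at $X_0 = 0$ and $m$ at $X_j$; the helper name is illustrative) showing that the general formula and the two-point estimator give identical answers on the same data.

import numpy as np

def check_two_point_reduction(n=50, m=20, Xj=0.8, a_true=6, b_true=1, seed=2):
    rng = np.random.RandomState(seed)
    xd = np.r_[np.zeros(n - m), Xj*np.ones(m)]   # n-m points at X0=0, m points at Xj
    yd = a_true*xd + b_true + rng.randn(n)
    a_general = (xd.dot(yd) - xd.sum()*yd.sum()/n)/(xd.dot(xd) - xd.sum()**2/n)
    a_two_point = (yd[xd == Xj].mean() - yd[xd == 0].mean())/Xj
    return a_general, a_two_point                # identical up to round-off

print(check_two_point_reduction())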
This is tricky to see in the above, but if we consider the simplified model without the y-intercept, we can obtain the following:
$$ \hat{a} = \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}} $$

with corresponding variance

$$ \mathbb{V}(\hat{a}) = \frac{\sigma^2}{\|\mathbf{x}\|^2} $$

And now the relative value of large $\mathbf{x}$ is more explicit.
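Here is a quick simulation of my own (assuming the zero-intercept model; the helper name is illustrative) to confirm the $\sigma^2/\|\mathbf{x}\|^2$ scaling: stretching the $x$ values outward shrinks the empirical variance of $\hat{a}$ accordingly.

import numpy as np

def check_no_intercept_variance(scale, n=50, a_true=6, sigma=1.0, trials=5000, seed=3):
    rng = np.random.RandomState(seed)
    xd = scale*np.linspace(0.1, 1, n)
    yd = a_true*xd + sigma*rng.randn(trials, n)   # trials-by-n noisy observations
    a_hats = yd.dot(xd)/xd.dot(xd)                # a_hat = x^T y / x^T x per trial
    return a_hats.var(), sigma**2/xd.dot(xd)      # empirical vs. sigma^2/||x||^2

for scale in (1, 2, 4):
    print(scale, check_no_intercept_variance(scale))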
In vector notation, we can write the following:

$$ \mathbf{y} = a\mathbf{x} + b\mathbf{1} + \boldsymbol{\epsilon} $$

Then, by taking the inner product with some $\mathbf{x}_1 \in \mathbf{1}^\perp$, we obtain

$$ \langle \mathbf{y}, \mathbf{x}_1 \rangle = a\langle \mathbf{x}, \mathbf{x}_1 \rangle + \langle \boldsymbol{\epsilon}, \mathbf{x}_1 \rangle $$

Taking the expectation of this (recall that $\mathbb{E}(\boldsymbol{\epsilon}) = \mathbf{0}$) gives

$$ \mathbb{E}\langle \mathbf{y}, \mathbf{x}_1 \rangle = a\langle \mathbf{x}, \mathbf{x}_1 \rangle $$

which suggests the following estimator for $a$:

$$ \hat{a} = \frac{\langle \mathbf{y}, \mathbf{x}_1 \rangle}{\langle \mathbf{x}, \mathbf{x}_1 \rangle} $$

That was pretty neat, but now we have the mysterious $\mathbf{x}_1$ vector. Where does this come from? If we project $\mathbf{x}$ onto $\mathbf{1}^\perp$, then we get the minimum-distance (in the $L_2$ sense) approximation to $\mathbf{x}$ in the $\mathbf{1}^\perp$ space. Thus, we take

$$ \mathbf{x}_1 = P_{\mathbf{1}^\perp}(\mathbf{x}) $$

Remember that $P_{\mathbf{1}^\perp}$ is a projection matrix, so the length of $\mathbf{x}_1$ is smaller than that of $\mathbf{x}$. This means that the denominator in the $\hat{a}$ equation above is really just the squared length of the $\mathbf{x}$ vector in the coordinate system of $P_{\mathbf{1}^\perp}$. Because the projection is orthogonal (namely, of minimum length), the Pythagorean theorem gives this length as the following:

$$ \langle \mathbf{x}, \mathbf{x}_1 \rangle = \|\mathbf{x}_1\|^2 = \langle \mathbf{x}, \mathbf{x} \rangle - \frac{(\mathbf{1}^T\mathbf{x})^2}{n} $$

The first term on the right is the squared length of the $\mathbf{x}$ vector, and the last term is the squared length of $\mathbf{x}$ in the coordinate system orthogonal to $P_{\mathbf{1}^\perp}$, namely that of $\mathbf{1}$. This is the same as the denominator of the $\hat{a}$ estimator we originally wrote. After all that work, we can use this geometric interpretation to understand what is going on in typical linear regression in much more detail. The fact that the denominator uses the orthogonal projection of $\mathbf{x}$ tells us that this choice of $\mathbf{x}_1$ has the strongest effect (i.e., the largest value of $\langle \mathbf{x}, \mathbf{x}_1 \rangle$) on reducing the variance of $\hat{a}$, because that term enters in the denominator, which, as we observed earlier, is what reduces the variance of $\hat{a}$. We already know that $\hat{a}$ is an unbiased estimator, and now we also know that it is of minimum variance. Such estimators are known as minimum-variance unbiased estimators (MVUE).
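The geometric construction is easy to check numerically. This sketch is mine, not from the text (the helper name is illustrative): it builds $P_{\mathbf{1}}$ explicitly, forms $\mathbf{x}_1 = \mathbf{x} - P_{\mathbf{1}}\mathbf{x}$, and recovers the same slope as np.polyfit.

import numpy as np

def check_projection_slope(n=100, a_true=6, b_true=1, seed=4):
    rng = np.random.RandomState(seed)
    xd = np.linspace(0, 1, n)
    yd = a_true*xd + b_true + rng.randn(n)
    P1 = np.ones((n, n))/n                  # projection onto span{1}
    x1 = xd - P1.dot(xd)                    # x_1 = P_{1-perp}(x)
    a_hat = x1.dot(yd)/x1.dot(xd)           # <y,x_1>/<x,x_1>
    return a_hat, np.polyfit(xd, yd, 1)[0]  # matches the least-squares slope

print(check_projection_slope())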
In the same spirit, let's examine the numerator of $\hat{a}$. We can write $\mathbf{x}_1$ as the following:

$$ \mathbf{x}_1 = \mathbf{x} - P_{\mathbf{1}}\mathbf{x} $$

where $P_{\mathbf{1}}$ is the projection matrix onto the $\mathbf{1}$ vector. Using this, the numerator of $\hat{a}$ becomes

$$ \langle \mathbf{y}, \mathbf{x}_1 \rangle = \langle \mathbf{y}, \mathbf{x} \rangle - \langle \mathbf{y}, P_{\mathbf{1}}\mathbf{x} \rangle $$

Note that $P_{\mathbf{1}}$ is built from the outer product of $\mathbf{1}$, as in the following:

$$ P_{\mathbf{1}} = \frac{\mathbf{1}\mathbf{1}^T}{n} $$

so that writing this out explicitly gives

$$ \langle \mathbf{y}, P_{\mathbf{1}}\mathbf{x} \rangle = \frac{(\mathbf{y}^T\mathbf{1})(\mathbf{1}^T\mathbf{x})}{n} $$

which is the same as

$$ \left(\sum Y_i\right)\left(\sum X_i\right)/n $$

So, plugging all of this together gives what we have seen before:

$$ \hat{a} \propto \mathbf{y}^T\mathbf{x} - \left(\sum Y_i\right)\left(\sum X_i\right)/n $$

The variance of $\hat{a}$ is the following:
$$ \mathbb{V}(\hat{a}) = \sigma^2 \frac{\|\mathbf{x}_1\|^2}{\langle \mathbf{x}, \mathbf{x}_1 \rangle^2} $$

Doing the exact same thing with $\hat{b}$ gives

$$ \hat{b} = \frac{\langle \mathbf{y}, \mathbf{x}^\perp \rangle}{\langle \mathbf{1}, \mathbf{x}^\perp \rangle} = \frac{\langle \mathbf{y}, \mathbf{1} - P_{\mathbf{x}}(\mathbf{1}) \rangle}{\langle \mathbf{1}, \mathbf{1} - P_{\mathbf{x}}(\mathbf{1}) \rangle} $$

where

$$ P_{\mathbf{x}} = \frac{\mathbf{x}\mathbf{x}^T}{\|\mathbf{x}\|^2} $$

This is unbiased with variance

$$ \mathbb{V}(\hat{b}) = \sigma^2 \frac{\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle}{\langle \mathbf{1}, \boldsymbol{\xi} \rangle^2} \quad \text{where} \quad \boldsymbol{\xi} = \mathbf{1} - P_{\mathbf{x}}(\mathbf{1}) $$
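The same kind of check works for $\hat{b}$ and for both variance formulas. The sketch below is my own (it uses the true $\sigma^2$ in the formulas, and the helper name is illustrative); it compares against np.polyfit and its covariance output, which should be close since polyfit estimates the noise variance from the residuals.

import numpy as np

def check_intercept_and_variances(n=100, a_true=6, b_true=1, sigma=1.0, seed=5):
    rng = np.random.RandomState(seed)
    xd = np.linspace(0, 1, n)
    yd = a_true*xd + b_true + sigma*rng.randn(n)
    one = np.ones(n)
    x1 = xd - one*(one.dot(xd)/n)                  # P_{1-perp}(x)
    xi = one - xd*(xd.dot(one)/xd.dot(xd))         # xi = 1 - P_x(1)
    a_hat = x1.dot(yd)/x1.dot(xd)
    b_hat = xi.dot(yd)/xi.dot(one)
    var_a = sigma**2*x1.dot(x1)/x1.dot(xd)**2      # V(a_hat) formula above
    var_b = sigma**2*xi.dot(xi)/xi.dot(one)**2     # V(b_hat) formula above
    p, C = np.polyfit(xd, yd, 1, cov=True)
    return (a_hat, b_hat), p, (var_a, var_b), np.diag(C)

print(check_intercept_and_variances())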
Now, after working through all of that machinery, we should have a very clear understanding of what is going on in every step of the linear regression equation. The payoff for all this work is that we now know where to intervene in the construction to satisfy other requirements we might have when dealing with real-world data problems.
n = len(x)
one = ones((n,))
# perturbed (regularized) projection matrices; the eye(n)*1.35 and eye(n)*1.3
# terms deliberately bias the estimators (cf. the lam slider below)
P_1 = ones((n,n))/n-eye(n)*1.35
#P_1 = ones((n,n))/n
P_x = outer(x,x)/dot(x,x)-eye(n)*1.3
x_1 = x-dot(P_1,x)
sumx = sum(x)
o=[]
ofit=[]
for i in range(500):
    y = a*x + np.random.randn(n)+b
    a_hat = dot(x_1,y)/dot(x_1,x)                             # projection-based slope estimate
    b_hat = dot(y,one-dot(P_x,one))/dot(one,one-dot(P_x,one)) # projection-based intercept estimate
    o.append((a_hat,b_hat))
    ofit.append(tuple(polyfit(x,y,1)))                        # ordinary least-squares (MVUE) for comparison
ofit = array(ofit)
o = array(o)
fig,axs=subplots(2,2)
fig.set_size_inches((6,5))
ax=axs[0,0]
ax.set_title('Trading bias and variance')
ax.hist(o[:,0],20,alpha=.3)
ax.hist(ofit[:,0],20,alpha=.3)
ax=axs[0,1]
ax.plot(o[:,0],ofit[:,0],'.')
ax.plot(linspace(4,10,2),linspace(4,10,2))
ax.set_aspect(1)
ax.set_title('var=%3.3g vs. %3.3g'%(var(o[:,0]),var(ofit[:,0])))
ax=axs[1,0]
ax.hist(o[:,1],20,alpha=.3)
ax.hist(ofit[:,1],20,alpha=.3)
ax=axs[1,1]
ax.plot(o[:,1],ofit[:,1],'.')
ax.plot(linspace(0,1,3),linspace(0,1,3))
ax.set_aspect(1)
ax.set_title('var=%3.3g vs. %3.3g'%(var(o[:,1]),var(ofit[:,1])))
one = ones((n,))
xi = one-dot(P_x,one)                   # xi = 1 - P_x(1)
(a_,b_),var_ = polyfit(x,y,1,cov=True)
sigma2_est = var(polyval([a_,b_],x)-y)  # noise variance estimated from the OLS residuals
b_hat_var = sigma2_est*dot(xi,xi)/dot(one,xi)**2   # variance formula for b_hat
a_hat_var = sigma2_est*dot(x_1,x_1)/dot(x_1,x)**2  # variance formula for a_hat
a_hat_lo,a_hat_hi = stats.norm(a_hat,sqrt(a_hat_var)).interval(.95)  # 95% interval for a_hat
b_hat_lo,b_hat_hi = stats.norm(b_hat,sqrt(b_hat_var)).interval(.95)  # 95% interval for b_hat
plot(x,y,'o',alpha=.5,ms=5.,lw=4.,label='data')
plot(x,polyval([a_hat,b_hat],x),lw=4.,label='regularized',alpha=.5)
plot(x,polyval([a,b],x),lw=3.,label='true',alpha=.5)
plot(x,polyval(polyfit(x,y,1),x),lw=3.,label='MVUE',alpha=.5,color='k')
plot(x,polyval([a_hat_hi,b_hat_hi],x),'--c')
plot(x,polyval([a_hat_lo,b_hat_lo],x),'--c')
legend(loc=0);
x = linspace(0,1,n)
sumx=sum(x)
y = a*x + np.random.randn(n)+b
sumy = sum(y)
def plot_lin_regularizer(lam=0.1):
    P_1 = ones((n,n))/n-eye(n)*lam
    x_1 = x-dot(P_1,x)
    a_hat = dot(x_1,y)/dot(x_1,x)
    b_hat = (sumy - a_hat*(sumx))/n
    y_hat = polyval([a_hat,b_hat],x)
    plot(x,y,'o',alpha=.5,ms=5.,lw=4.,label='data')
    plot(x,y_hat,lw=4.,label='regularized',alpha=.5)
    title(r'$\lambda$ = %3.3g; MSE=%3.2g, ahat=%3.2f'%(lam,linalg.norm(y_hat-y)**2,a_hat),
          fontsize=14)
    plot(x,polyval([a,b],x),lw=3.,label='true',alpha=.5)
    plot(x,polyval(polyfit(x,y,1),x),lw=3.,label='MVUE',alpha=.5,color='k')
    legend(loc=0)
    axis((0,1,0,10))

interact(plot_lin_regularizer,lam=(-1.,3.,.05));
Although the procedure for estimating the parameters is straightforward, we can approach it from another angle to get more insight using a multi-dimensional Gaussian model. We denote the vector of the $\{Y_i\}$ values as $\mathbf{y}$ and likewise for $\mathbf{x}$. This means we can write the multi-dimensional Gaussian model as $\mathcal{N}(a\mathbf{x}+b\mathbf{1}, \sigma^2\mathbf{I})$, where $\mathbf{I}$ is the identity matrix and $\mathbf{1}$ is the vector of all ones.
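To make that concrete, here is a small sketch of my own (the helper name is illustrative) using scipy.stats.multivariate_normal: the whole data vector is treated as a single draw from $\mathcal{N}(a\mathbf{x}+b\mathbf{1}, \sigma^2\mathbf{I})$, and a coarse grid search over $(a,b)$ peaks near the true values.

import numpy as np
from scipy import stats

def gaussian_model_grid_mle(n=50, a_true=6, b_true=1, sigma=1.0, seed=6):
    rng = np.random.RandomState(seed)
    xd = np.linspace(0, 1, n)
    yd = a_true*xd + b_true + sigma*rng.randn(n)
    def loglike(a_, b_):
        # y ~ N(a*x + b*1, sigma^2 * I), evaluated as one multivariate draw
        return stats.multivariate_normal(mean=a_*xd + b_, cov=sigma**2*np.eye(n)).logpdf(yd)
    grid = [(a_, b_) for a_ in np.linspace(4, 8, 21) for b_ in np.linspace(0, 2, 21)]
    return max(grid, key=lambda ab: loglike(*ab))   # grid point with maximum log-likelihood

print(gaussian_model_grid_mle())   # close to the true (6, 1)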
def lin_regress(x,y,lam=0,kap=0,alpha=0.95):
    'linear regression with optional regularization'
    n = len(x)
    sumx = sum(x)
    sumy = sum(y)
    one = ones((n,))
    P_1 = ones((n,n))/n-eye(n)*lam
    P_x = outer(x,x)/dot(x,x)-eye(n)*kap
    xi = one-dot(P_x,one)
    x_1 = x-dot(P_1,x)
    a_hat = dot(x_1,y)/dot(x_1,x)
    b_hat = dot(y,one-dot(P_x,one))/dot(one,one-dot(P_x,one))
    (a_,b_) = polyfit(x,y,1)
    sigma2_est = var(polyval([a_,b_],x)-y) # OLS for noise estimate
    b_hat_var = sigma2_est*dot(xi,xi)/dot(one,xi)**2
    a_hat_var = sigma2_est*dot(x_1,x_1)/dot(x_1,x)**2
    a_hat_lo,a_hat_hi = stats.norm(a_hat,sqrt(a_hat_var)).interval(alpha)
    b_hat_lo,b_hat_hi = stats.norm(b_hat,sqrt(b_hat_var)).interval(alpha)
    return (a_hat,b_hat,a_hat_hi-a_hat_lo,b_hat_hi-b_hat_lo)
def plot_lin_regularizer_band(lam=0.0,kap=0.0):
    fig,ax = subplots()
    ax.plot(x,y,'o',alpha=.3)
    a_hat,b_hat,adelta,bdelta = lin_regress(x,y,lam=lam,kap=kap)
    ax.plot(x,polyval([a_hat,b_hat],x),color='k',lw=3.)
    ax.plot(x,polyval([a_hat+adelta/2,b_hat+bdelta/2],x),'--k')
    ax.plot(x,polyval([a_hat-adelta/2,b_hat-bdelta/2],x),'--k')
    ax.fill_between(x,polyval([a_hat+adelta/2,b_hat+bdelta/2],x),
                      polyval([a_hat-adelta/2,b_hat-bdelta/2],x),
                      color='gray',
                      alpha=.3)
    ax.set_title('95% confidence band')
    ax.axis(xmin=x[0],xmax=x[-1],ymin=-1,ymax=10)

interact(plot_lin_regularizer_band,lam=(0,0.3,.05),kap=(0,2,.1));