



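Given sampled data $(X,Y)$, every observed response $y^i$ has a corresponding model prediction $\hat{y}^i=h(x^i, \theta)$. From a probabilistic point of view, the model does not estimate a single value but a distribution $P(\hat{h}^i|x^i,\theta)$, where $\hat{h}^i$ denotes the random model output at $x^i$; there is one such distribution for each response $y^i$. The so-called prediction is the expectation of that distribution, $\hat{y}^i=E(\hat{h}^i)$. The residual is the deviation of the sampled value from that expectation under $P(\hat{h}^i|x^i,\theta)$, namely $y^i-E(\hat{h}^i)=y^i-\hat{y}^i$, which is simply the difference between the true value and the predicted value.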

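The distribution of the residual is the distribution of this deviation under $P(\hat{h}^i|x^i,\theta)$. Note that all of these distributions refer to a single sample point $y^i$: although we observe only one value at that point, we assume the value was drawn from a distribution. As the figure below illustrates, a distribution sits around every predicted point, and one $y^i$ is sampled from it. A single sample point cannot reveal the whole distribution, of course, but the distribution can be approximated from the values that fall in a small interval around it. Alternatively, by the law of large numbers, and provided the residuals at all sample points share the same distribution, the residuals pooled over all sample points can approximate the residual distribution at the current point (this last step is tentative).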


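The error (residual) can be split into two kinds: error caused by the model underfitting, and irreducible error caused by sampling. That is, the true value is regarded as a draw from a distribution while the prediction is the expectation of that distribution, so the residual between the two cannot be eliminated. If a model is optimal, only the irreducible error remains. By the central limit theorem (rather than the law of large numbers), when this error is the sum of many small independent effects it is approximately Gaussian: $\epsilon\sim N(0, \sigma^{2})$. Since $\epsilon=y^i-\hat{y}^i$, and since $\hat{y}^i=E(\hat{h}^i)$ is a constant for the distribution $P(y^i|x^i,\theta)$, the properties of the Gaussian distribution give $y^i\sim N(\hat{y}^i, \sigma^{2})$, i.e., the response $y^i$ follows a Gaussian distribution:

$$P(y^i|x^i,\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^i-\hat{y}^i)^2}{2\sigma^{2}}\right)$$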
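The claim that an optimal model leaves only zero-mean Gaussian noise can be checked with a small simulation. The sketch below is illustrative only: the linear model, the parameter values, the noise level `sigma`, and the function name `simulate_residuals` are all assumptions made for this example, not part of the text above.

```python
import numpy as np

def simulate_residuals(n=5000, theta=(2.0, -1.0), sigma=0.5, seed=0):
    """Draw y = theta0 + theta1 * x + Gaussian noise, fit by least squares,
    and return the fitted parameters and the residuals y - y_hat."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=n)
    X = np.column_stack([np.ones(n), x])      # design matrix with intercept
    y = X @ np.asarray(theta) + rng.normal(0.0, sigma, size=n)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta_hat             # y^i - y_hat^i
    return theta_hat, residuals

theta_hat, residuals = simulate_residuals()
print("estimated theta:", theta_hat)
print("residual mean:", residuals.mean())
print("residual std:", residuals.std(ddof=2))  # ddof=2: two fitted parameters
```

With a correctly specified model, the residual mean should be essentially zero and the residual standard deviation should be close to the `sigma` used to generate the data.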





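The generalized linear model (GLM) is a flexible generalization of ordinary linear regression. It uses a linear predictor function of the independent variables to estimate the dependent variable, relating the two through a link function. The target variable $y$ is also called the response variable in a GLM.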




the exponential family

the prototype:

$$p(y;\eta)=b(y)\exp\left(\eta^TT(y)-a(\eta)\right)$$

Here, $\eta$ is called the natural parameter (or canonical parameter); $T(y)$ is the sufficient statistic (in many cases, $T(y)=y$); $a(\eta)$ is the log partition function. The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant that makes sure the distribution $p(y;\eta)$ sums/integrates over $y$ to 1. A fixed choice of $T$, $a$ and $b$ defines a family of distributions that is parameterized by $\eta$; as we vary $\eta$, we get different distributions within this family.

We now show that some common distributions belong to the exponential family.

Gaussian distributions:
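As a sketch of the Gaussian case, assume unit variance, $\sigma^2=1$ (a simplifying assumption introduced here, not stated above), so that the only parameter is the mean $\mu$. The density can then be rearranged into the exponential-family form:

$$
\begin{aligned}
p(y;\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(y-\mu)^2\right) \\
&= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^2\right)\exp\left(\mu y-\frac{1}{2}\mu^2\right)
\end{aligned}
$$

Under that assumption one can read off $\eta=\mu$, $T(y)=y$, $a(\eta)=\eta^2/2$ and $b(y)=\frac{1}{\sqrt{2\pi}}\exp(-y^2/2)$.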

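Bernoulli distribution (with mean $\phi$ and $y\in\{0,1\}$):

$$p(y;\phi)=\phi^y(1-\phi)^{1-y}=\exp\left(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\right)$$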

Thus, the natural parameter is given by $\eta=\log(\phi/(1-\phi))$. If we invert this definition for $\eta$ by solving for $\phi$, we obtain $\phi=\frac{1}{1+e^{-\eta}}$. This is the familiar sigmoid function. In fact, ordinary least squares is based on the Gaussian distribution, and logistic regression on the Bernoulli distribution.
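The Bernoulli case can be verified numerically. The following minimal sketch compares the usual probability mass function with its exponential-family form, taking $T(y)=y$, $b(y)=1$ and $a(\eta)=\log(1+e^{\eta})$ (which equals $-\log(1-\phi)$), and checks that the sigmoid inverts the log-odds. The function names are chosen for this example only.

```python
import math

def bernoulli_pmf(y, phi):
    """Standard form: phi^y * (1 - phi)^(1 - y), for y in {0, 1}."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def logit(phi):
    """Natural parameter eta = log(phi / (1 - phi))."""
    return math.log(phi / (1.0 - phi))

def sigmoid(eta):
    """Inverse of logit: phi = 1 / (1 + e^{-eta})."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_expfam(y, eta):
    """Exponential-family form b(y) * exp(eta * T(y) - a(eta)),
    with b(y) = 1, T(y) = y and a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log1p(math.exp(eta)))

for phi in (0.1, 0.5, 0.9):
    eta = logit(phi)
    for y in (0, 1):
        print(phi, y, bernoulli_pmf(y, phi), bernoulli_expfam(y, eta))
```

Both columns of probabilities should agree up to floating-point rounding.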

constructing GLM

three assumptions:

  • $y\vert x;\theta \sim \text{ExponentialFamily}(\eta)$. I.e., given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution, with parameter $\eta$.
  • $h(x)=E[T(y)|x]$. I.e., given $x$, the hypothesis predicts the expected value of the sufficient statistic; when $T(y)=y$ this is simply $E[y|x]$.
  • The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta=\theta^Tx$. This is better thought of as a "design choice" than as an assumption.
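The three assumptions can be sketched in code for the two distributions discussed earlier. The snippet assumes $T(y)=y$, a unit-variance Gaussian (so $E[y|x]=\eta$) and a Bernoulli response (so $E[y|x]$ is the sigmoid of $\eta$); the function names and example values are hypothetical.

```python
import numpy as np

def natural_parameter(theta, x):
    """Assumption 3: eta = theta^T x."""
    return float(np.dot(theta, x))

def h_gaussian(theta, x):
    """Gaussian response with unit variance: E[y|x] = mu = eta,
    which recovers ordinary linear regression."""
    return natural_parameter(theta, x)

def h_bernoulli(theta, x):
    """Bernoulli response: E[y|x] = phi = 1 / (1 + e^{-eta}),
    which recovers logistic regression."""
    eta = natural_parameter(theta, x)
    return 1.0 / (1.0 + np.exp(-eta))

theta = np.array([0.5, -1.0])   # example parameters (illustrative values)
x = np.array([2.0, 1.0])        # example input, so eta = 0.5*2 - 1*1 = 0
print(h_gaussian(theta, x), h_bernoulli(theta, x))
```

The same linear predictor yields a different hypothesis depending only on which exponential-family distribution is chosen for the response.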

softmax regression

Consider a classification problem in which the response variable $y$ can take any one of $k$ values, so $y\in\{1,2,\dots,k\}$. We could use $k$ parameters $\phi_1, \cdots,\phi_k$ to specify the probability of each outcome, but these are redundant because they must sum to 1. We instead parameterize the multinomial with only $k-1$ parameters, $\phi_1, \cdots,\phi_{k-1}$, so that $p(y=k)=1-\sum_{i=1}^{k-1}\phi_i$; for convenience, we write $\phi_k$ for this quantity.

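To express the multinomial as an exponential family distribution, we introduce one more very useful piece of notation: the indicator function $1\{\cdot\}$, with $1\{\text{true}\}=1$ and $1\{\text{false}\}=0$. We define $T(y)\in\mathbb{R}^{k-1}$ by $(T(y))_i=1\{y=i\}$ for $i=1,\dots,k-1$. Then we have $E[(T(y))_i]=p(y=i)=\phi_i$. Writing $\phi_k$ for $1-\sum_{i=1}^{k-1}\phi_i$, we get:

$$
\begin{aligned}
p(y;\phi) &= \phi_1^{1\{y=1\}}\phi_2^{1\{y=2\}}\cdots\phi_k^{1\{y=k\}} \\
&= \phi_1^{(T(y))_1}\cdots\phi_{k-1}^{(T(y))_{k-1}}\,\phi_k^{1-\sum_{i=1}^{k-1}(T(y))_i} \\
&= \exp\left(\sum_{i=1}^{k-1}(T(y))_i\log\frac{\phi_i}{\phi_k}+\log\phi_k\right) \\
&= b(y)\exp\left(\eta^TT(y)-a(\eta)\right)
\end{aligned}
$$

where $\eta=\left[\log(\phi_1/\phi_k),\dots,\log(\phi_{k-1}/\phi_k)\right]^T$, $a(\eta)=-\log\phi_k$ and $b(y)=1$.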


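the link function is given (for $i=1,\dots,k$) by:

$$\eta_i=\log\frac{\phi_i}{\phi_k}$$

so that $\eta_k=\log(\phi_k/\phi_k)=0$. Inverting it, $e^{\eta_i}=\phi_i/\phi_k$, and summing over $i$ gives $\phi_k\sum_{j=1}^{k}e^{\eta_j}=\sum_{j=1}^{k}\phi_j=1$. Hence the response function is:

$$\phi_i=\frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}}$$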


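By assumption 3, $\eta_i=\theta_i^Tx$ for $i=1,\dots,k-1$ (and we set $\theta_k=0$ so that $\eta_k=\theta_k^Tx=0$):

$$p(y=i|x;\theta)=\phi_i=\frac{e^{\theta_i^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}$$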

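Our hypothesis will output:

$$
h(x)=E[T(y)|x;\theta]=\begin{bmatrix}\phi_1\\ \vdots\\ \phi_{k-1}\end{bmatrix}
=\begin{bmatrix}\dfrac{e^{\theta_1^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\\ \vdots\\ \dfrac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\end{bmatrix}
$$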

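In other words, the hypothesis outputs the estimated probability $p(y=i|x;\theta)$ for every $i=1,\dots,k-1$; the remaining probability follows from $p(y=k|x;\theta)=1-\sum_{i=1}^{k-1}p(y=i|x;\theta)$. For $m$ training examples $(x^{(j)},y^{(j)})$, the log-likelihood is:

$$\ell(\theta)=\sum_{j=1}^{m}\log p(y^{(j)}|x^{(j)};\theta)=\sum_{j=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{\theta_l^Tx^{(j)}}}{\sum_{s=1}^{k}e^{\theta_s^Tx^{(j)}}}\right)^{1\{y^{(j)}=l\}}$$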
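A minimal NumPy sketch of the softmax hypothesis and its log-likelihood is given below. It makes a few assumptions for illustration: the class parameters are stacked as rows of a matrix `Theta` of shape `(k, n)`, labels are coded `0..k-1` rather than `1..k`, and the per-row maximum is subtracted before exponentiating purely for numerical stability (this does not change the probabilities). The function names are hypothetical.

```python
import numpy as np

def softmax_probs(Theta, X):
    """Return an (m, k) matrix whose entry (j, i) is p(y = i | x_j; Theta).

    Theta: (k, n) array, one parameter vector per class.
    X:     (m, n) array, one input per row.
    """
    scores = X @ Theta.T                                  # eta_i = theta_i^T x
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """Sum over examples of log p(y_j | x_j; Theta), labels in 0..k-1."""
    probs = softmax_probs(Theta, X)
    return float(np.log(probs[np.arange(len(y)), y]).sum())

# Tiny example: with all-zero parameters every class has probability 1/k.
Theta = np.zeros((3, 2))
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]])
y = np.array([0, 1, 2, 0])
print(softmax_probs(Theta, X))
print(log_likelihood(Theta, X, y))
```

Fixing the last row of `Theta` to zero corresponds to the $k-1$ parameterization used above; leaving all rows free gives the same family of distributions with one redundant parameter vector.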
