05_learning_theory

  1. Which model should we choose to minimize the error?
  2. Given the best model, how large should its parameters be?

generalization error: the expected error on new, unseen examples drawn from the underlying distribution

  • bias: error from a model that is too simple (e.g., too few features); leads to underfitting
  • variance: error from a model that is too flexible (e.g., too many features); leads to overfitting
  • tradeoff: the compromise between these two sources of error

preliminaries

  • the union bound: for any events $A_1,\dots,A_k$, $P(A_1\cup \dots \cup A_k)\le P(A_1)+\dots+P(A_k)$

  • hoeffding inequality (Chernoff bound): let $Z_1,\dots,Z_m$ be i.i.d. Bernoulli random variables with $P(Z_i=1)=\phi$ and $P(Z_i=0)=1-\phi$. Let $\hat{\phi}=(1/m)\sum_{i=1}^{m}Z_i$ and let any $\lambda>0$ be fixed. Then $P(|\phi-\hat{\phi}|>\lambda)\le 2\exp(-2\lambda^2 m)$ (a small simulation follows this list)

  • training error/empirical error: $\hat{\varepsilon}(h)=\frac{1}{m}\sum_{i=1}^{m}1\{h(x^{(i)})\neq y^{(i)}\}$, the fraction of training examples that $h$ misclassifies

  • the generalization error: $\varepsilon(h)=P_{(x,y)\sim\mathcal{D}}(h(x)\neq y)$, the probability that $h$ misclassifies a new example drawn from the distribution $\mathcal{D}$
  • empirical risk minimization (ERM): during training, minimize the empirical error to choose the parameter values, i.e. pick $\hat{h}=\arg\min_{h\in\mathcal{H}}\hat{\varepsilon}(h)$
  • $\mathcal{H}$, the hypothesis class: the set of all classifiers considered by the learning algorithm
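
A quick simulation can make the Hoeffding bound concrete: the fraction of trials where $|\hat{\phi}-\phi|>\lambda$ stays below $2\exp(-2\lambda^2 m)$. This is only a sketch; the values of $\phi$, $\lambda$, $m$ and the number of trials are arbitrary choices, not from the notes.

```python
# Minimal sketch: empirically check the Hoeffding bound for Bernoulli(phi) samples.
# phi, lam, m and `trials` are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
phi, lam, m, trials = 0.3, 0.1, 200, 10_000

samples = rng.binomial(1, phi, size=(trials, m))   # trials x m Bernoulli draws
phi_hat = samples.mean(axis=1)                     # empirical estimate per trial

deviation_freq = np.mean(np.abs(phi_hat - phi) > lam)
hoeffding_bound = 2 * np.exp(-2 * lam**2 * m)

print(f"empirical P(|phi_hat - phi| > {lam}) = {deviation_freq:.4f}")
print(f"Hoeffding bound                     = {hoeffding_bound:.4f}")
```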

case of finite $\mathcal{H}$

The strategy for giving guarantees on the generalization error of $\hat{h}$ has two parts: first, show that $\hat{\varepsilon}(h)$ is a reliable estimate of $\varepsilon(h)$ for all $h\in\mathcal{H}$ simultaneously (uniform convergence); second, use this to derive an upper bound on the generalization error of $\hat{h}$.

Let $\sigma>0$. How large must $m$ be before we can guarantee that, with probability at least $1-\sigma$, the training error is within $\lambda$ of the generalization error for every $h\in\mathcal{H}$? Combining the Hoeffding inequality with the union bound over the $|\mathcal{H}|$ hypotheses gives $m\ge\frac{1}{2\lambda^2}\log\frac{2|\mathcal{H}|}{\sigma}$.
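
This sample-complexity bound is easy to evaluate numerically; the sketch below uses arbitrary values for $|\mathcal{H}|$, $\lambda$ and $\sigma$ purely for illustration.

```python
# Minimal sketch: evaluate m >= (1 / (2 * lam**2)) * log(2 * |H| / sigma).
# H_size, lam and sigma are arbitrary illustrative choices.
import math

def sample_complexity(H_size, lam, sigma):
    return math.ceil(math.log(2 * H_size / sigma) / (2 * lam ** 2))

print(sample_complexity(H_size=10_000, lam=0.05, sigma=0.05))  # -> 2580 examples
```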

Similarly, if $m$ and $\sigma$ are fixed and we solve for $\lambda$ in the previous equation, then with probability at least $1-\sigma$, for all $h\in\mathcal{H}$: $|\hat{\varepsilon}(h)-\varepsilon(h)|\le\sqrt{\frac{1}{2m}\log\frac{2|\mathcal{H}|}{\sigma}}$. Define $\hat{h}=\arg\min_{h\in\mathcal{H}}\hat{\varepsilon}(h)$ (the ERM output) and $h^*=\arg\min_{h\in\mathcal{H}}\varepsilon(h)$, the best possible hypothesis in $\mathcal{H}$.

If uniform convergence holds, then the generalization error of $\hat{h}$ is at most $2\lambda$ worse than that of the best possible hypothesis in $\mathcal{H}$: $\varepsilon(\hat{h})\le\varepsilon(h^*)+2\lambda$.
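
The $2\lambda$ bound follows from applying uniform convergence twice and the fact that $\hat{h}$ minimizes the training error:

$$\varepsilon(\hat{h})\le\hat{\varepsilon}(\hat{h})+\lambda\le\hat{\varepsilon}(h^*)+\lambda\le\varepsilon(h^*)+2\lambda$$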

case of infinite $\mathcal{H}$

  • growth function: the maximum number of distinct labelings that the hypothesis space $\mathcal{H}$ can assign to $m$ examples
  • dichotomy: a split of the examples into two classes; for binary classification the growth function counts dichotomies
  • Vapnik-Chervonenkis dimension: the size of the largest example set (dataset) that can be shattered by $\mathcal{H}$ (see the sketch after this list)
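
A minimal sketch of shattering (the helper functions are hypothetical, not from the notes): for the 1D threshold class $h_t(x)=1\{x\ge t\}$, any single point can be shattered but no two points can, so its VC dimension is 1.

```python
# Minimal sketch: check whether the 1D threshold class {h_t(x) = 1[x >= t]}
# shatters a given point set. Helper names are hypothetical.
from itertools import product

def threshold_predict(t, xs):
    return tuple(1 if x >= t else 0 for x in xs)

def shatters(xs):
    """True if every labeling of xs is realized by some threshold."""
    # Only thresholds below, between, or above the points matter.
    thresholds = [min(xs) - 1.0] + sorted(xs) + [max(xs) + 1.0]
    realizable = {threshold_predict(t, xs) for t in thresholds}
    return all(labels in realizable for labels in product((0, 1), repeat=len(xs)))

print(shatters([0.0]))       # True  -> one point can be shattered
print(shatters([0.0, 1.0]))  # False -> labeling (1, 0) is impossible, so VC dim = 1
```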

PAC

  • version space: a hypothesis $h$ is consistent with a set of training examples $D$ if and only if $h(x)=y(x)$ for every training example. The version space is the set of all hypotheses in $\mathcal{H}$ that are consistent with $D$, denoted $VS_{H,D}$
  • consistent learner: a learner that outputs a hypothesis which perfectly fits the training data
  • the version space is said to be $\varepsilon$-exhausted if every hypothesis $h$ in $VS_{H,D}$ has true error less than $\varepsilon$

The probability that the version space is not $\varepsilon$-exhausted is at most $|\mathcal{H}|e^{-\varepsilon m}$.

Suppose we want this probability to be at most $\sigma$: setting $|\mathcal{H}|e^{-\varepsilon m}\le\sigma$ and solving for $m$ gives $m\ge\frac{1}{\varepsilon}\left(\ln|\mathcal{H}|+\ln\frac{1}{\sigma}\right)$.

A learning algorithm is PAC learnable if it:

  • requires no more than polynomial computation per training example
  • requires no more than a polynomial number of training samples

In agnostic learning the target concept need not lie in $\mathcal{H}$ ($y\notin \mathcal{H}$); applying the Hoeffding bound in place of the exact-consistency argument gives: $m\ge\frac{1}{2\varepsilon^2}\left(\ln|\mathcal{H}|+\ln\frac{1}{\sigma}\right)$.

algorithm diagnosing

  • training set 60%: to train the models
  • cross validation set 20%: to choose the model with the lowest cross-validation error
  • test set 20%: to assess the generalization of the chosen model (a split sketch follows this list)
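
The 60/20/20 split can be done with two successive random splits. This is only a sketch assuming scikit-learn is available; `X` and `y` are placeholder arrays.

```python
# Minimal sketch of a 60/20/20 train / cross-validation / test split.
# X and y are placeholders; scikit-learn's train_test_split is assumed available.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)
y = np.random.rand(100)

# First carve off 20% as the test set, then 25% of the remaining 80% (= 20% overall) as CV.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_cv), len(X_test))  # 60 20 20
```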

bias vs. variance

  • high bias: underfitting; $J_{train}$ and $J_{cv}$ are both high, with $J_{train}\approx J_{cv}$
  • high variance: overfitting; $J_{train}$ is low while $J_{cv}$ is much higher
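
A small synthetic experiment (arbitrary data and polynomial degrees, purely illustrative) shows both regimes: a low-degree fit leaves $J_{train}$ and $J_{cv}$ both high, while a very high degree drives $J_{train}$ down but $J_{cv}$ up.

```python
# Minimal sketch: compare J_train and J_cv across polynomial degrees on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.size)  # true signal + noise

x_tr, y_tr = x[:40], y[:40]   # training set
x_cv, y_cv = x[40:], y[40:]   # cross-validation set

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    j_train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    j_cv = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2)
    print(f"degree={degree:2d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```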

regularization parameter

As $\lambda$ increases, the fit becomes more rigid and may underfit; as $\lambda$ decreases, we tend to overfit the data. The solution is to train the model with a list of candidate $\lambda$ values and pick the one whose $J_{cv}$, computed without the regularization term, is lowest.
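
The sweep over candidate $\lambda$ values is straightforward to code. The sketch below assumes scikit-learn and uses ridge regression with an arbitrary synthetic dataset and $\lambda$ grid; the cross-validation error is plain mean squared error, with no regularization term.

```python
# Minimal sketch: pick the regularization strength with the lowest J_cv,
# where J_cv is plain MSE on the cross-validation set (no penalty term).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = [
    mean_squared_error(y_cv, Ridge(alpha=lam).fit(X_train, y_train).predict(X_cv))
    for lam in lambdas
]

best_lambda = lambdas[int(np.argmin(cv_errors))]
print("best lambda:", best_lambda)
```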

set size