Week 2
1. Suppose m=4 students have taken some class, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows:
| midterm exam | (midterm exam)² | final exam |
|---|---|---|
| 89 | 7921 | 96 |
| 72 | 5184 | 74 |
| 94 | 8836 | 87 |
| 69 | 4761 | 78 |
You'd like to use polynomial regression to predict a student's final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂, where x₁ is the midterm score and x₂ is (midterm score)². Further, you plan to use both feature scaling (dividing by the "max − min", or range, of a feature) and mean normalization.
What is the normalized feature x₁^(1)?
Analysis: Normalization means transforming the data (by some rule) so that it falls within a desired range. It makes later processing more convenient and, more importantly, speeds up convergence during training. Common methods:

Feature scaling (min-max normalization): rescales a feature into [0, 1] via x' = (x − min(x)) / (max(x) − min(x)). For the first midterm score this would give x' = (89 − 69)/(94 − 69) = 0.8. I think the first method Andrew Ng describes is actually this one, just with min = 0 in his example.

Mean normalization: shifts a feature so it has zero mean, via x' = (x − u)/s, where u is the mean. Many sources take s to be the standard deviation, but this question takes s = max(x) − min(x), i.e. the range, which is again just a linear transform of x in the min-max style.

Following the lecture's convention: u = (89 + 72 + 94 + 69)/4 = 81, s = 94 − 69 = 25, so x₁^(1) = (89 − 81)/25 = 0.32.
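The calculation above can be sketched in a few lines of Python (a minimal illustration of range-based mean normalization, using the midterm scores from the table):

```python
# Mean normalization with the range (max - min) as the scale,
# as specified in the question: x' = (x - u) / s.
midterm = [89, 72, 94, 69]

u = sum(midterm) / len(midterm)      # mean: (89 + 72 + 94 + 69) / 4 = 81
s = max(midterm) - min(midterm)      # range: 94 - 69 = 25
normalized = [(x - u) / s for x in midterm]

print(normalized[0])  # 0.32
```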
2. Which of the following are reasons for using feature scaling?
A. It prevents the matrix XᵀX (used in the normal equation) from being non-invertible (singular/degenerate).
B. It speeds up gradient descent by making it require fewer iterations to get to a good solution.
C. It speeds up gradient descent by making each iteration of gradient descent less expensive to compute.
D. It is necessary to prevent the normal equation from getting stuck in local optima.
Analysis:
A: Whether XᵀX is invertible depends on linear dependence among its rows/columns (e.g. redundant features) or on having too many features relative to training examples; it has nothing to do with feature scaling.
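This point about option A can be illustrated with a quick numpy sketch (the matrix values here are made up): XᵀX becomes singular when the feature columns are linearly dependent, no matter how the features are scaled.

```python
import numpy as np

# Toy design matrix whose third column is exactly 2x the second:
# the columns are linearly dependent, so X^T X is singular,
# regardless of any scaling applied to the features.
X = np.array([[1.0, 2.0,  4.0],
              [1.0, 3.0,  6.0],
              [1.0, 5.0, 10.0]])

print(np.linalg.matrix_rank(X.T @ X))  # 2 (< 3, so X^T X is not invertible)
```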
B: Correct; see the definition of normalization above. The intuition comes from the contour plots in the lecture (not reproduced here): with unscaled features the contours are elongated ellipses, the gradient points perpendicular to the contours, so descent zig-zags and takes many small steps; with scaled features the contours are nearly circular and descent heads almost straight to the minimum.
C: Scaling reduces the number of iterations, but the cost of computing each individual iteration is the same on the same machine.
D: The normal equation yields the optimal θ in closed form, so there is no local-optimum problem for it at all. Moreover, the cost function of linear regression is always convex: it has no local optima, only the global optimum, so gradient descent on linear regression always converges to the global optimum.
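The speed-up in B can be checked numerically. Below is a sketch on made-up data (the feature ranges, learning rate, sample size, and iteration count are all assumptions, not part of the quiz): with one feature about a thousand times larger than the other, a learning rate that converges fine after range-based mean normalization makes unscaled gradient descent blow up.

```python
import numpy as np

def gd(X, y, lr, iters=500):
    """Plain batch gradient descent for linear regression."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        theta -= lr * X.T @ (X @ theta - y) / m
    return theta

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)        # small-range feature
x2 = rng.uniform(0, 1000, 50)     # feature ~1000x larger
y = 3 * x1 + 0.01 * x2 + rng.normal(0, 0.1, 50)

X_raw = np.column_stack([np.ones(50), x1, x2])
norm = lambda v: (v - v.mean()) / (v.max() - v.min())   # mean normalization
X_scaled = np.column_stack([np.ones(50), norm(x1), norm(x2)])

lr = 0.5
print(np.isfinite(gd(X_raw, y, lr)).all())     # False: unscaled run diverges
print(np.isfinite(gd(X_scaled, y, lr)).all())  # True: scaled run stays stable
```

In practice one would also shrink the learning rate for the unscaled case, but then many more iterations are needed, which is exactly the point of option B.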