Machine Learning by Stanford University Week 2

Week 2 of Andrew Ng's Machine Learning course on Coursera covers multivariate linear regression.

This note is for the Stanford University online course “Machine Learning” taught by Andrew Ng on Coursera.org, March 2016 session.

Environment Setup

The course recommends Octave or MATLAB as the prototyping language for the programming exercises; either one works for the assignments.

For more information about Octave and MATLAB, see:
https://www.coursera.org/learn/machine-learning/supplement/Mlf3e/more-octave-matlab-resources

Multivariate Linear Regression

Multiple Features

Why multiple features? For many problems, the prediction is influenced by more than one factor, so multiple variables are needed to capture the different influences.

How can a hypothesis with multiple features be expressed as a linear equation?

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T x$ (where $x_0 = 1$)
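
As a quick illustration, the hypothesis can be computed for all training examples at once in Octave by stacking the examples as rows of a design matrix. A minimal sketch (the variable names and the toy numbers are my own, not from the lecture):

  theta = [1; 2; 3];               % (n+1) x 1 parameter vector (here n = 2 features)
  X = [1 0.5 1.2; 1 0.8 0.3];      % m x (n+1) design matrix; first column is x_0 = 1
  h = X * theta;                   % m x 1 vector of predictions h_theta(x^(i))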

Cost Function for Multiple Variables

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where $\theta$ is an $(n+1)$-dimensional vector.
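
A minimal Octave sketch of this cost function (the function name computeCost and its signature are assumptions, modeled on the formula above):

  function J = computeCost(X, y, theta)
    % X: m x (n+1) design matrix, y: m x 1 targets, theta: (n+1) x 1 parameters
    m = length(y);                       % number of training examples
    errors = X * theta - y;              % h_theta(x^(i)) - y^(i) for every example
    J = (errors' * errors) / (2 * m);    % J(theta) = 1/(2m) * sum of squared errors
  end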

Gradient Descent for Multiple Variables

Repeat {

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

(simultaneously update $\theta_j$ for $j = 0, \dots, n$; notice that $x_0 = 1$)

}
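
In vectorized form the whole update is $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$. A minimal Octave sketch (the function name gradientDescent and its signature are my own choice, not from the lecture):

  function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y);
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
      % simultaneous update of every theta_j; x_0 = 1 is the first column of X
      theta = theta - (alpha / m) * (X' * (X * theta - y));
      errors = X * theta - y;
      J_history(iter) = (errors' * errors) / (2 * m);   % track J(theta) each iteration
    end
  end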

Feature Scaling

  • Idea: get every feature into approximately a $-1 \leq x_i \leq 1$ range.

  • Mean normalization: replace $x_i$ with $x_i - \mu_i$ so that features have approximately zero mean (do not apply this to $x_0 = 1$); in practice the result is usually also divided by the feature's range or standard deviation $s_i$, as in the sketch below.
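
A minimal Octave sketch of mean normalization with scaling by the standard deviation (the raw feature values are made-up housing data for illustration; X holds only the raw features, so the $x_0 = 1$ column is added afterwards):

  X = [2104 5; 1416 3; 1534 3; 852 2];     % m x n raw features (size, #bedrooms), no x_0 yet
  mu = mean(X);                            % 1 x n row vector of feature means
  sigma = std(X);                          % 1 x n row vector of standard deviations
  X_norm = (X - mu) ./ sigma;              % broadcast: zero mean, roughly unit scale
  X = [ones(size(X_norm, 1), 1) X_norm];   % add the x_0 = 1 intercept column afterwards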

Learning Rate $\alpha$

  • If $\alpha$ is too small: slow convergence.
  • If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and may not converge (slow convergence is also possible).
  • To choose $\alpha$, try a range of values such as …, 0.001, 0.01, 0.1, 1, … and plot $J(\theta)$ against the number of iterations; see the sketch after this list.
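
A minimal Octave sketch of comparing candidate learning rates by plotting $J(\theta)$ per iteration (it reuses the normalized design matrix X from the feature-scaling sketch above; the target vector y and the candidate values are my own example choices):

  y = [460; 232; 315; 178];                % example m x 1 targets matching X above
  m = length(y);
  alphas = [0.001 0.01 0.1 1];             % candidate learning rates to compare
  num_iters = 50;
  hold on;
  for a = alphas
    theta = zeros(size(X, 2), 1);          % restart from theta = 0 for each alpha
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
      theta = theta - (a / m) * (X' * (X * theta - y));    % vectorized update
      errors = X * theta - y;
      J_history(iter) = (errors' * errors) / (2 * m);
    end
    plot(1:num_iters, J_history);          % J(theta) should decrease on every iteration
  end
  xlabel('number of iterations'); ylabel('J(theta)');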

Features and Polynomial Regression

  • Polynomial regression example: $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$; let $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$.

  • Other possibilities: $\theta_0 + \theta_1 x + \theta_2 \sqrt{x}$; let $x_1 = x$, $x_2 = \sqrt{x}$ (see the Octave sketch below).
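
A minimal Octave sketch of building polynomial features from a single raw feature (the variable names and toy values are my own); feature scaling matters here because $x^2$ and $x^3$ have very different ranges:

  x = (1:10)';                          % single raw feature, e.g. size of the house
  X_poly = [x, x.^2, x.^3];             % x_1 = x, x_2 = x^2, x_3 = x^3
  mu = mean(X_poly);
  sigma = std(X_poly);
  X_poly = (X_poly - mu) ./ sigma;      % scale the features, since x^2 and x^3 grow quickly
  X = [ones(length(x), 1) X_poly];      % add the intercept column x_0 = 1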

Computing Parameters Analytically

Suppose there are $m$ training examples and $n$ features.

$\theta = (X^T X)^{-1} X^T y$
where $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$.

If $m < n$ (fewer examples than features), or if some features are redundant (linearly dependent),
$X^T X$ may be non-invertible.

Therefore use pinv in Octave:
pinv(X'*X)*X'*y
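
Putting it together, a minimal Octave sketch of solving for $\theta$ with the normal equation (the housing numbers are example data for illustration):

  X = [1 2104 5; 1 1416 3; 1 1534 3; 1 852 2];   % m x (n+1) design matrix; first column is x_0 = 1
  y = [460; 232; 315; 178];                      % m x 1 target values (example house prices)
  theta = pinv(X' * X) * X' * y;                 % pinv still works if X'X is singular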

Gradient Descent vs. Normal Equation

  • Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large.
  • Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^T X)^{-1}$, which is $O(n^3)$ and slow if $n$ is very large.

Therefore, as a rough rule of thumb, when $n$ is less than about 1000, use the normal equation; otherwise, use gradient descent.