Machine Learning Notes Course 3

Course 3 Regression

Example

Stock market forecast: input data about a company and regress an output such as tomorrow's Dow Jones Industrial Average

Self-driving car: input sensor data and output a steering direction

Example Application

Estimate a Pokémon’s Combat Power value after evolution

Input: a Pokémon, where x_cp is its combat power before evolution, x_s its species, x_hp its hit points, x_w its weight, and x_h its height.

Output: y, which represents the Combat Power after evolution

Step 1: Model

Find a function from a set of candidate functions (the model).

Suppose we choose a linear model:
$$y=b+w \times x_{cp}$$
where w and b are parameters.

More generally, a linear model can be written as:
$$y=b+\sum_i w_i \times x_i$$
where each $x_i$ is an attribute of the input x, called a feature, b is called the bias, and the $w_i$ are called the weights.
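The general linear model can be sketched in a few lines of Python; all numbers below are made-up illustration values, not real Pokémon data:

```python
# A minimal sketch of the linear model y = b + sum_i(w_i * x_i).
# Feature values and parameters are made up for illustration.

def predict(x, w, b):
    """Bias plus the weighted sum of the features."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

x = [100.0, 10.0, 2.0]   # e.g. [x_cp, x_w, x_h] for one Pokémon (made up)
w = [1.25, 0.5, -0.5]    # one weight per feature
b = 5.0                  # bias

print(predict(x, w, b))  # 5 + 125 + 5 - 1 = 134.0
```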

Step 2: Goodness of Function

Use $x^1$ to denote one complete training input and $\hat y^1$ its corresponding output.

Collect many such inputs and outputs in pairs
$$(x^{n},\hat y^{n})$$
which can be plotted in a graph.


With all the training data, we can define the goodness of a function using a loss function:
Loss function:

Input: a function

Output: how bad it is, called the estimated error

For the model $y=b+w \times x_{cp}$:

$$L(f)=L(w,b)=\sum_{n=1}^{10}\left(\hat y^n-(b+w \times x^{n}_{cp})\right)^2$$
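The loss can be computed directly; a small sketch with made-up training pairs (not the actual ten Pokémon):

```python
# Squared-error loss L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2.
# The (x_cp, y_hat) pairs below are placeholders, not real data.

def loss(w, b, xs, ys):
    """Sum of squared errors of y = b + w * x_cp over the training pairs."""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys))

xs = [10, 20, 30]   # x_cp of each training Pokémon (made up)
ys = [25, 45, 65]   # observed CP after evolution (made up)

print(loss(2.0, 5.0, xs, ys))   # perfect fit on this data: 0.0
print(loss(1.0, 0.0, xs, ys))   # a worse function: 2075.0
```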

Step 3: Best Function

Choose the best function from the set via the loss function:
$$f^*=\arg\min_f L(f)$$

$$w^*,b^*=\arg\min_{w,b}L(w,b)$$

which means choosing the w and b (and hence the f) that minimize L(w, b).

Using the method: Gradient Descent

Consider a loss function L(w) with only one parameter w:

Randomly choose an initial value $w^0$

Compute
$$\frac{dL}{dw}\bigg|_{w=w^0}$$
If it is negative, increase w; if it is positive, decrease w.


Then compute the next value $w^1$:
$$w^1=w^0-\eta\frac{dL}{dw}\bigg|_{w=w^0}$$
Repeat the process above until it reaches a local optimum, which is not necessarily the global optimum.

With two parameters, the same idea applies: compute the partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ and update w and b simultaneously.

PS: In the linear case the loss function is convex, so gradient descent has no local-optimum problem.

Generalization

Choose another 10 Pokémon as test data and calculate the error on them.

Another Model

$$y=b+w_1\times x_{cp}+w_2 \times (x_{cp})^2$$

The same gradient descent method is used to find the best function in this model.

Other, more complex models can also be tried, e.g.:
$$y=b+w_1\times x_{cp}+w_2 \times (x_{cp})^2+w_3 \times (x_{cp})^3$$
A more complex model fits the training data at least as well, but may result in a larger testing error.
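This effect can be demonstrated with `numpy.polyfit`, used here as a convenient least-squares solver standing in for gradient descent; the data points are made up:

```python
# Fitting polynomial models of increasing degree to the same (made-up) data:
# training error can only shrink as the degree grows, since each model
# contains the previous one, but testing error may rise (overfitting).
import numpy as np

xs = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
ys = np.array([27.0, 44.0, 66.0, 84.0, 106.0])   # roughly y = 5 + 2x plus noise

errs = []
for degree in (1, 2, 3):
    coeffs = np.polyfit(xs, ys, degree)                       # least-squares fit
    errs.append(float(np.sum((np.polyval(coeffs, xs) - ys) ** 2)))
    print(degree, round(errs[-1], 3))
# The printed training errors are non-increasing in the degree.
```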

Some other factors

Considering that a Pokémon's species may have an influence, the model can be redesigned based on it:
Choose a different linear function for each species.
E.g., with $x_s$ representing the species:
$$\text{if}\quad x_s=\text{Pidgey},\quad y=b_1+w_1 \times x_{cp}$$

$$\text{if}\quad x_s=\text{Weedle},\quad y=b_2+w_2 \times x_{cp}$$

All of the branches above can be combined into a single linear function using indicator features:
$$y=b_1\cdot\delta(x_s=\text{Pidgey})+w_1\cdot\delta(x_s=\text{Pidgey})\times x_{cp}+b_2\cdot\delta(x_s=\text{Weedle})+w_2\cdot\delta(x_s=\text{Weedle})\times x_{cp}$$
where $\delta(\text{condition})$ is 1 when the condition holds and 0 otherwise.
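The indicator construction can be sketched as follows; the per-species parameters $(b_s, w_s)$ are hypothetical values, not fitted ones:

```python
# Folding per-species linear functions into one linear function using
# indicator features delta(x_s = species):
#   y = sum_s delta(x_s = s) * (b_s + w_s * x_cp)
# The parameter values below are hypothetical illustration values.

PARAMS = {
    "Pidgey": (5.0, 2.0),   # (b_1, w_1), made up
    "Weedle": (3.0, 1.5),   # (b_2, w_2), made up
}

def predict(species, x_cp):
    """One linear model; the indicator selects the branch for this species."""
    return sum((b + w * x_cp) * (1 if species == s else 0)
               for s, (b, w) in PARAMS.items())

print(predict("Pidgey", 10.0))   # 5 + 2 * 10 = 25.0
print(predict("Weedle", 10.0))   # 3 + 1.5 * 10 = 18.0
```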

More factors, such as weight, height, and others, could also be taken into consideration. This will probably lead to a lower training error but a higher testing error, since overfitting can happen.

To avoid overfitting, a strategy called regularization can be applied to the model.
E.g.:
$$L=\sum_{n}\left(\hat y^n-(b+\sum_i w_i \times x_i)\right)^2+\lambda \sum_i (w_i)^2$$
The added term
$$\lambda \sum_i (w_i)^2$$
penalizes large weights: smaller weights make the function smoother, i.e. less sensitive to noise in the input. A function with a smaller value of this term is therefore often better; λ is a parameter.
A larger λ makes the size of the $w_i$ count for more relative to the difference between the outputs and the training data, which means the training error is considered less.
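The effect of λ can be sketched by minimizing the regularized loss with gradient descent (the bias b is left unregularized, as is common); data, η, and the λ values are illustrative:

```python
# Gradient descent on the regularized loss
#   L = sum_n (y^n - (b + w*x^n))^2 + lam * w^2
# Larger lam pulls w toward 0 (a smoother, less noise-sensitive function).
# Data, learning rate, and lam values are made up for illustration.

xs = [10.0, 20.0, 30.0]
ys = [25.0, 45.0, 65.0]   # generated by y = 5 + 2x

def fit(lam, eta=0.0005, steps=50000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        errs = [y - (b + w * x) for x, y in zip(xs, ys)]
        grad_w = sum(-2 * x * e for x, e in zip(xs, errs)) + 2 * lam * w
        grad_b = sum(-2 * e for e in errs)   # bias is usually not regularized
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

results = {lam: fit(lam) for lam in (0.0, 10.0, 100.0)}
for lam, (w, b) in results.items():
    print(lam, round(w, 3), round(b, 3))   # w shrinks as lam grows
```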