Notes for Deep Learning by Prof. Hung-yi Lee (1)


I will try to use English to write down the knowledge learned in this class, just to make sure I do not forget this important language tool. (We do not have any English classes this semester, so my worry is, emmmm, reasonable, right?) I hope I can achieve this target. Maybe I will give up one day, haha. If there is anything wrong in my notes, not only in the knowledge but also in the English grammar, please point it out without hesitation. I will be very thankful for that.

1. Tip 1 for Gradient Descent

The first tip taught by Prof. Lee is tuning the learning rate of our gradient descent program. Judging from the following figures, a learning rate that is too small makes the program run slowly, while a learning rate that is too large makes it fail to find the minimum point of the curve. When we use gradient descent to find the minimum point, we can draw a figure, similar to the right part of the following picture, showing the relationship between the loss and the number of parameter updates. This picture tells us which learning rate is the most suitable.

[Lecture slides: effect of the learning rate; loss vs. number of parameter updates]
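As a quick illustration of this tip, here is a minimal sketch of recording the loss after every update for several candidate learning rates and plotting the curves side by side. The data, model, and helper functions are made up for illustration; they are not from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

def compute_loss(w, X, y):
    # mean squared error of a simple linear model y ≈ X @ w
    return np.mean((X @ w - y) ** 2)

def compute_gradient(w, X, y):
    # gradient of the mean squared error with respect to w
    return 2 * X.T @ (X @ w - y) / len(y)

# toy regression data (assumed only for this illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

for lr in [1e-3, 1e-2, 1e-1]:            # candidate learning rates
    w = np.zeros(2)
    losses = []
    for _ in range(200):                  # number of parameter updates
        w = w - lr * compute_gradient(w, X, y)
        losses.append(compute_loss(w, X, y))
    plt.plot(losses, label=f"lr={lr}")

plt.xlabel("number of parameter updates")
plt.ylabel("loss")
plt.legend()
plt.show()
```

If the curve stays flat and high, the learning rate is too small; if it explodes or oscillates, it is too large.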

1.1 Adaptive Learning Rates

However, it is very hard to find the most suitable value of the learning rate. One popular and simple idea is to reduce the learning rate by some factor every few epochs. There are two targets we want to achieve.

  1. The learning rate cannot be one-size-fits-all.
  2. Different parameters should be given different learning rates.

Prof. Lee gives us a more detailed explanation.
[Lecture slide]
To fit in with the above idea, a popular method called Adagrad was proposed.

[Lecture slides: the Adagrad update rule]
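In its standard form, the Adagrad update presented on those slides can be written as follows, where $g^t$ is the gradient of the loss with respect to the parameter $w$ at step $t$ and $\eta$ is the initial learning rate:

$$
w^{t+1} = w^{t} - \frac{\eta^{t}}{\sigma^{t}}\, g^{t},
\qquad
\eta^{t} = \frac{\eta}{\sqrt{t+1}},
\qquad
\sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t} \left(g^{i}\right)^{2}}.
$$

The $\sqrt{t+1}$ factors cancel, which leaves the simplified form

$$
w^{t+1} = w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t} \left(g^{i}\right)^{2}}}\, g^{t}.
$$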
However, there exists a contradiction, which is shown in the following slide: one part of the formula says that a larger gradient leads to a larger step, while the other part says the opposite.
[Lecture slide: the apparent contradiction in the Adagrad formula]
Comparing different parameters shows us the reason for the above formula. Intuitively, a larger first derivative means we are farther from the minimum. However, if we compare point a on the $w_1$ curve with point c on the $w_2$ curve (the first derivative at a is obviously smaller than that at c), point c is actually closer to its minimum. This phenomenon tells us we should also take the second derivative into consideration, which is similar to the statistical distance we learn in Multivariate Statistical Analysis. So the quantity that measures how far a point is from the minimum, i.e., the best step size, is:

$$
\frac{\text{First Derivative}}{\text{Second Derivative}}
$$
[Lecture slide]
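As a quick sanity check of this ratio (a standard one-dimensional example, not taken from the notes above): for a quadratic $y = ax^2 + bx + c$ with minimum at $x = -\frac{b}{2a}$, the distance from a point $x_0$ to the minimum is

$$
\left| x_0 + \frac{b}{2a} \right|
= \frac{\left| 2ax_0 + b \right|}{2a}
= \frac{\left| y'(x_0) \right|}{y''(x_0)},
$$

which is exactly the first derivative divided by the second derivative.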
In practice, calculating the second derivative is always a difficult task. So we use $\sqrt{\sum_i (g^i)^2}$, the denominator of Adagrad, to take the place of the second derivative. Why does this make sense? Looking at the following slide, we randomly sample lots of points from two different loss curves. For the curve with the larger second derivative, the sampled first derivatives tend to be larger in magnitude, so $\sqrt{\sum_i (g^i)^2}$ is larger; for the curve with the smaller second derivative, it is smaller. In light of this observation, we can use $\sqrt{\sum_i (g^i)^2}$ to take the place of the second derivative without any extra computation.
[Lecture slide]
With all of the above, we can successfully explain the Adagrad formula.
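To make the update concrete, here is a minimal NumPy sketch of the simplified Adagrad rule $w^{t+1} = w^{t} - \frac{\eta}{\sqrt{\sum_i (g^i)^2}}\, g^{t}$. The toy quadratic loss and its gradient are made up for illustration, and `eps` is a small constant for numerical stability that the formula itself omits.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, steps=500, eps=1e-8):
    """Simplified Adagrad: w <- w - lr / sqrt(sum of squared past gradients) * grad."""
    w = np.asarray(w0, dtype=float)
    sum_sq_grad = np.zeros_like(w)           # running sum of (g^i)^2, kept per parameter
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq_grad += g ** 2
        w = w - lr / (np.sqrt(sum_sq_grad) + eps) * g
    return w

# toy example: minimize L(w) = w1^2 + 10 * w2^2 (made up for illustration)
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])
print(adagrad(grad, w0=[3.0, 3.0]))          # moves toward the minimum at [0, 0]
```

Note how each parameter gets its own effective learning rate: the parameter with consistently larger gradients accumulates a larger denominator and therefore takes smaller steps.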

2. Tip 2 for Gradient Descent

The second tip for gradient descent taught by Prof. Lee is using Stochastic Gradient Descent, commonly known as SGD, to make the training faster.

[Lecture slide]

We can also see the difference in the following slide, which shows the different processes of updating the parameters for (batch) gradient descent and SGD.
[Lecture slide: parameter updates of gradient descent vs. SGD]
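The contrast can be sketched in a few lines. This is a minimal illustration with a made-up squared-error loss: ordinary gradient descent computes the gradient over all examples before each update, while SGD updates after looking at a single example, so it performs many more (noisier) updates per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -1.0])                 # toy linear-regression data (assumed)

def batch_gd(w, lr=0.1, epochs=20):
    for _ in range(epochs):
        g = 2 * X.T @ (X @ w - y) / len(y)    # gradient over ALL examples, one update per epoch
        w = w - lr * g
    return w

def sgd(w, lr=0.01, epochs=20):
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # one update PER EXAMPLE, many updates per epoch
            g = 2 * X[i] * (X[i] @ w - y[i])
            w = w - lr * g
    return w

print(batch_gd(np.zeros(2)))
print(sgd(np.zeros(2)))
```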

3. Tip 3 for Gradient Descent

The third tip we learn in the class is feature scaling. What is the meaning of feature scaling? In this class, feature scaling means making different features have the same scale, which is similar to the definition of normalization. So the next question is why we do this. The following figure tells us the reason. In the left part, $x_1$ and $x_2$ have different scales, so a change in $w_2$ influences the loss much more strongly than a change in $w_1$. The red line stands for the path of $w_1$ and $w_2$ during gradient descent, and it shows two main problems: 1) the update direction does not point straight towards the minimum (which makes the program waste a lot of time), and 2) a single learning rate shared by $w_1$ and $w_2$ is apparently not suitable (if you do not use Adagrad).
[Lecture slide: why feature scaling helps gradient descent]
The method for feature scaling is quite easy, as shown below:
[Lecture slide: the feature scaling method]
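The usual recipe here is per-dimension standardization: for every feature dimension, subtract the mean over all examples and divide by the standard deviation. A minimal NumPy sketch (the example matrix is made up):

```python
import numpy as np

def feature_scaling(X):
    """Standardize each feature dimension to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)          # mean m_i of each dimension over all examples
    std = X.std(axis=0)            # standard deviation sigma_i of each dimension
    return (X - mean) / std

# two features with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
print(feature_scaling(X))          # both columns now have comparable scale
```

After scaling, the loss surface is much closer to round, so the negative gradient points roughly towards the minimum and a single learning rate works for all parameters.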

4. Theory Behind Gradient Descent

Given a random starting point $\theta^0$, we can draw a little circle with center $\theta^0$ and radius $\epsilon$. Then we can find a new point inside this circle with a smaller loss value and move from $\theta^0$ to this new point $\theta^1$.
[Lecture slide]
So how do we find this new point? Before we explain that, we have to introduce the Taylor series, which is shown in the following slide. (We need to note that the radius of the circle, which plays the role of the learning rate, should be small enough for the approximation to hold.)
[Lecture slide]
Writing the first-order Taylor expansion of the loss around the current point $(a, b)$ as $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$, each of $s$, $u$ and $v$ is a constant (a number, vector or matrix) because their values only depend on the current point $(a, b)$. So the value of $L(\theta)$ is determined by the inner product of the vector $(u, v)$ and the vector $(\theta_1 - a, \theta_2 - b)$. Finding the $\theta_1$ and $\theta_2$ in the red circle that minimize $L(\theta)$ can therefore be converted into finding a new point $(\theta_1, \theta_2)$ that lies in the red circle and makes the inner product of the vector $(u, v)$ and the vector $(\theta_1 - a, \theta_2 - b)$ as small as possible. It is easy to find the final answer, right? It is the direction opposite to the vector $(u, v)$, that is, the opposite of the gradient direction.
[Lecture slide]
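To restate the argument compactly (with the learning rate $\eta$ absorbing the radius of the red circle):

$$
L(\theta) \approx s + u\,(\theta_1 - a) + v\,(\theta_2 - b),
\qquad
s = L(a, b),\quad
u = \left.\frac{\partial L}{\partial \theta_1}\right|_{(a,b)},\quad
v = \left.\frac{\partial L}{\partial \theta_2}\right|_{(a,b)}.
$$

To make the inner product of $(u, v)$ and $(\theta_1 - a, \theta_2 - b)$ as small as possible inside the circle, we point $(\theta_1 - a, \theta_2 - b)$ in the direction opposite to $(u, v)$ and stretch it to the boundary, which gives

$$
\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}
=
\begin{bmatrix} a \\ b \end{bmatrix}
- \eta
\begin{bmatrix} u \\ v \end{bmatrix},
$$

exactly the gradient descent update rule.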
