# Notes for Deep Learning Lessons of Prof. Hung-yi Lee (2)

1737-汪同学

Today, knowledge concerning the optimization of deep learning is written here. What is the meaning of optimization? The following ppt shows us the answer.


# 1. SGD with Momentum (SGDM)

Just as the name shows us, SGDM is invented by combining SGD with momentum. For SGDM, the process of updating parameters is shown in the following ppt. What we should pay attention to, or in other words what makes us better understand the meaning of SGDM, is $v^i$: $v^i$ is actually the weighted sum of all the previous gradients, and the closer a gradient is, the more influence it has on the current momentum.

What is the advantage of adding momentum to SGD? Plain SGD easily leads us to a local minimum rather than the global minimum. However, adding momentum takes the history information into account, which means, to explain it in a more vivid way, that SGDM offers us the ability to tell whether we are merely standing at a local minimum: the accumulated movement can carry us out of it.
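The update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration assuming the common formulation $v^i = \lambda v^{i-1} - \eta g^{i-1}$ and $\theta^i = \theta^{i-1} + v^i$; the names `sgdm_update`, `lr`, and `momentum` are mine, not from the lecture slides.

```python
import numpy as np

def sgdm_update(theta, grad, v, lr=0.01, momentum=0.9):
    # Movement v is a weighted sum of all past gradients:
    # v^i = momentum * v^{i-1} - lr * g^{i-1}
    v = momentum * v - lr * grad
    theta = theta + v          # theta^i = theta^{i-1} + v^i
    return theta, v

# Usage: minimize the toy loss L(theta) = theta^2, whose gradient is 2 * theta.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = sgdm_update(theta, 2 * theta, v)
```

Because `v` keeps a decaying memory of earlier steps, the parameters keep moving even where the current gradient is small, which is exactly the "escape a local minimum" intuition above.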

# 2. Adagrad

It has been introduced in the last blog, so I do not want to explain it again. (I am a little bit lazy, haha.)

# 3. RMSProp

In Adagrad, $v_t$ is the sum of the squares of all the past gradients. In RMSProp, however,

$$v_t = \alpha v_{t-1} + (1-\alpha)(g_{t-1})^2$$

We can change the value of $\alpha$ to make $v_{t-1}$ have more or less influence on the current update. In common situations, we always set $\alpha$ to a large number, for fear that a single too-large $g_{t-1}$ makes the step size $\frac{\eta}{\sqrt{v_t}}$ too close to zero.
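The exponential moving average above can be sketched directly in NumPy. The names `rmsprop_update`, `lr`, `alpha`, and `eps` are mine, and `eps` is the usual small constant added only to keep the division numerically stable.

```python
import numpy as np

def rmsprop_update(theta, grad, v, lr=0.1, alpha=0.9, eps=1e-8):
    # v_t = alpha * v_{t-1} + (1 - alpha) * g_{t-1}^2:
    # an exponential moving average of squared gradients,
    # so one huge gradient cannot permanently shrink the step size.
    v = alpha * v + (1 - alpha) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # step = eta / sqrt(v_t)
    return theta, v

# Usage: minimize L(theta) = theta^2 again.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = rmsprop_update(theta, 2 * theta, v)
```

With a large `alpha`, `v` changes slowly, so the denominator $\sqrt{v_t}$ reacts gently to a single outlier gradient, as the paragraph above argues.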

# 4. Adam

If we ignore some little differences, Adam can be seen as the combination of SGDM and RMSProp. The little change is that we modify the form of $m_t$, which is called de-biasing. The reason for this change is that the value of $m_t$ is too close to zero at the beginning of updating.
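The combination can be sketched as follows, assuming the standard Adam formulation (an SGDM-style first moment $m_t$ plus an RMSProp-style second moment $v_t$, both de-biased); the function and parameter names are mine.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.1,
                beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # SGDM-style momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSProp-style squared-gradient term
    # De-biasing: m and v start at zero, so early averages are biased
    # toward zero; dividing by (1 - beta^t) corrects this.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize L(theta) = theta^2; note t starts from 1.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 301):
    theta, m, v = adam_update(theta, 2 * theta, m, v, t)
```

Without the `(1 - beta ** t)` correction, the first few steps would be tiny, since `m` and `v` are both initialized at zero, which is exactly the problem the paragraph above describes.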
