# Gradient descent

## Description

${\displaystyle \mathbf {a} _{n+1}=\mathbf {a} _{n}-\gamma \nabla F(\mathbf {a} _{n})}$

${\displaystyle \mathbf {x} _{n+1}=\mathbf {x} _{n}-\gamma _{n}\nabla F(\mathbf {x} _{n}),\ n\geq 0.}$

${\displaystyle F(\mathbf {x} _{0})\geq F(\mathbf {x} _{1})\geq F(\mathbf {x} _{2})\geq \cdots ,}$

${\displaystyle \gamma _{n}={\frac {\left|\left(\mathbf {x} _{n}-\mathbf {x} _{n-1}\right)^{T}\left[\nabla F(\mathbf {x} _{n})-\nabla F(\mathbf {x} _{n-1})\right]\right|}{\left\|\nabla F(\mathbf {x} _{n})-\nabla F(\mathbf {x} _{n-1})\right\|^{2}}}}$
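The last formula above is the Barzilai–Borwein adaptive step size. A minimal Python sketch of gradient descent with this step is given below; the quadratic test function, starting point, and first fixed step are illustrative assumptions, not taken from the text:

```python
import numpy as np

def bb_gradient_descent(grad_f, x0, gamma0=1e-3, iters=50):
    """Gradient descent with the Barzilai-Borwein step size
    gamma_n = |dx^T dg| / ||dg||^2 computed from successive iterates."""
    x_prev = x0
    x = x0 - gamma0 * grad_f(x0)  # one fixed step to obtain two iterates
    for _ in range(iters):
        dx = x - x_prev
        dg = grad_f(x) - grad_f(x_prev)
        denom = dg @ dg
        if denom == 0.0:  # gradients identical: already converged
            break
        gamma = abs(dx @ dg) / denom
        x_prev, x = x, x - gamma * grad_f(x)
    return x

# Illustrative example: minimize F(x) = x1^2 + 10 * x2^2.
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_min = bb_gradient_descent(grad_f, np.array([1.0, 1.0]))
```

On such quadratics the Barzilai–Borwein step adapts automatically to the curvature, with no step-size tuning required.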

### Example code

```python
next_x = 6  # We start the search at x = 6
gamma = 0.01  # Step size multiplier
precision = 0.00001  # Desired precision of result
max_iters = 10000  # Maximum number of iterations

# Derivative of f(x) = x**4 - 3 * x**3 + 2
def df(x):
    return 4 * x ** 3 - 9 * x ** 2

for _ in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x)  # Step downhill along the slope

    step = next_x - current_x
    if abs(step) <= precision:
        break

print("Minimum at", next_x)

# The output for the above will be something like
# "Minimum at 2.2499646074278457"
```


### Example functions

${\displaystyle f(x_{1},x_{2})=(1-x_{1})^{2}+100(x_{2}-x_{1}^{2})^{2}.}$

The Rosenbrock function has a narrow curved valley which contains the minimum. The bottom of the valley is very flat. Because the valley is curved and flat, the optimization zigzags slowly with small step sizes towards the minimum.
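This slow, zigzagging behaviour can be observed directly with fixed-step gradient descent on the Rosenbrock function; the starting point, step size, and iteration count below are illustrative choices:

```python
import numpy as np

def rosenbrock_grad(x):
    """Gradient of f(x1, x2) = (1 - x1)^2 + 100 * (x2 - x1^2)^2."""
    x1, x2 = x
    return np.array([
        -2.0 * (1.0 - x1) - 400.0 * x1 * (x2 - x1 ** 2),
        200.0 * (x2 - x1 ** 2),
    ])

x = np.array([-1.0, 1.0])
gamma = 1e-3  # larger fixed steps diverge across the steep valley walls
for _ in range(150_000):
    x = x - gamma * rosenbrock_grad(x)
# x creeps along the valley floor towards the minimum at (1, 1)
```

Even after many thousands of iterations the iterate is still inching along the valley, which is why this function is a standard stress test for optimization methods.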

${\displaystyle F(x,y)=\sin \left({\frac {1}{2}}x^{2}-{\frac {1}{4}}y^{2}+3\right)\cos \left(2x+1-e^{y}\right).}$

### Choosing the step size and descent direction

${\displaystyle \mathbf {a} _{n+1}=\mathbf {a} _{n}-\gamma _{n}\,\mathbf {p} _{n}}$

${\displaystyle F(\mathbf {a} _{n+1})\leq F(\mathbf {a} _{n})-\gamma _{n}\|\nabla F(\mathbf {a} _{n})\|_{2}\|\mathbf {p} _{n}\|_{2}\left[\cos \theta _{n}-\max _{t\in [0,1]}{\frac {\|\nabla F(\mathbf {a} _{n}-t\gamma _{n}\mathbf {p} _{n})-\nabla F(\mathbf {a} _{n})\|_{2}}{\|\nabla F(\mathbf {a} _{n})\|_{2}}}\right]}$

(1)

• Forgoing the benefits of a clever descent direction, by setting ${\displaystyle \mathbf {p} _{n}=\nabla F(\mathbf {a} _{n})}$, and then using a line search to find a suitable step size ${\displaystyle \gamma _{n}}$, such as one satisfying the Wolfe conditions.
• Assuming that ${\displaystyle F}$ is twice differentiable, using its Hessian ${\displaystyle \nabla ^{2}F}$ to estimate ${\displaystyle \|\nabla F(\mathbf {a} _{n}-t\gamma _{n}\mathbf {p} _{n})-\nabla F(\mathbf {a} _{n})\|_{2}\approx \|t\gamma _{n}\nabla ^{2}F(\mathbf {a} _{n})\mathbf {p} _{n}\|}$, and then choosing ${\displaystyle \mathbf {p} _{n}}$ and ${\displaystyle \gamma _{n}}$ by optimizing inequality (1).
• Assuming that ${\displaystyle \nabla F}$ is Lipschitz, using its Lipschitz constant ${\displaystyle L}$ to bound ${\displaystyle \|\nabla F(\mathbf {a} _{n}-t\gamma _{n}\mathbf {p} _{n})-\nabla F(\mathbf {a} _{n})\|_{2}\leq Lt\gamma _{n}\|\mathbf {p} _{n}\|}$, and then choosing ${\displaystyle \mathbf {p} _{n}}$ and ${\displaystyle \gamma _{n}}$ by optimizing inequality (1).
• Building a custom model of ${\displaystyle \max _{t\in [0,1]}{\frac {\|\nabla F(\mathbf {a} _{n}-t\gamma _{n}\mathbf {p} _{n})-\nabla F(\mathbf {a} _{n})\|_{2}}{\|\nabla F(\mathbf {a} _{n})\|_{2}}}}$ for ${\displaystyle F}$, and then choosing ${\displaystyle \mathbf {p} _{n}}$ and ${\displaystyle \gamma _{n}}$ by optimizing inequality (1).
• Under stronger assumptions on the function ${\displaystyle F}$, such as convexity, more advanced techniques may be possible, such as the fast gradient method.
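As a concrete instance of the first strategy, a backtracking line search that enforces the sufficient-decrease (first Wolfe) condition can be sketched as follows; the shrink factor `rho`, constant `c`, and the quadratic test function are conventional illustrative choices:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, gamma0=1.0, rho=0.5, c=1e-4):
    """Shrink gamma until the sufficient-decrease (Armijo) condition holds:
        f(x - gamma * p) <= f(x) - c * gamma * grad_f(x) @ p
    Here we step along -p, matching a_{n+1} = a_n - gamma_n * p_n above,
    so p must satisfy grad_f(x) @ p > 0 (a descent direction)."""
    g = grad_f(x)
    fx = f(x)
    gamma = gamma0
    while f(x - gamma * p) > fx - c * gamma * (g @ p):
        gamma *= rho
    return gamma

# Illustrative example: steepest-descent direction p = grad f(x).
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -4.0])
gamma = backtracking_line_search(f, grad_f, x, grad_f(x))
```

For this quadratic the very first trial step `gamma0 = 1.0` already satisfies the condition and lands exactly on the minimizer.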

## Solution of a linear system

${\displaystyle A\mathbf {x} -\mathbf {b} =0}$

${\displaystyle F(\mathbf {x} )=\mathbf {x} ^{T}A\mathbf {x} -2\mathbf {x} ^{T}\mathbf {b} ,}$

${\displaystyle \nabla F(\mathbf {x} )=2(A\mathbf {x} -\mathbf {b} ).}$

${\displaystyle F(\mathbf {x} )=\left\|A\mathbf {x} -\mathbf {b} \right\|^{2}.}$

${\displaystyle \nabla F(\mathbf {x} )=2A^{T}(A\mathbf {x} -\mathbf {b} ).}$
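For the least-squares form with a general (not necessarily symmetric or square) matrix, gradient descent with the gradient $2A^{T}(A\mathbf{x}-\mathbf{b})$ might be sketched as below; the test matrix, step size, and iteration count are illustrative, and the step must stay below $1/\lambda_{\max }(A^{T}A)$ for the iteration to be stable:

```python
import numpy as np

def gd_least_squares(A, b, gamma, iters):
    """Gradient descent on F(x) = ||A x - b||^2 with grad F = 2 A^T (A x - b)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - gamma * 2.0 * A.T @ (A @ x - b)
    return x

# Illustrative overdetermined system (3 equations, 2 unknowns).
A = np.array([[3.0, 2.0], [2.0, 6.0], [1.0, 1.0]])
b = np.array([2.0, -8.0, 0.5])
x = gd_least_squares(A, b, gamma=0.01, iters=5_000)
# x approaches the least-squares solution, where A^T (A x - b) = 0
```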

${\displaystyle {\begin{aligned}&{\text{repeat in the loop:}}\\&\qquad \mathbf {r} :=\mathbf {b} -\mathbf {Ax} \\&\qquad \gamma :={\mathbf {r} ^{\mathsf {T}}\mathbf {r} }/{\mathbf {r} ^{\mathsf {T}}\mathbf {Ar} }\\&\qquad \mathbf {x} :=\mathbf {x} +\gamma \mathbf {r} \\&\qquad {\hbox{if }}\mathbf {r} ^{\mathsf {T}}\mathbf {r} {\text{ is sufficiently small, then exit loop}}\\&{\text{end repeat loop}}\\&{\text{return }}\mathbf {x} {\text{ as the result}}\end{aligned}}}$

${\displaystyle {\begin{aligned}&\mathbf {r} :=\mathbf {b} -\mathbf {Ax} \\&{\text{repeat in the loop:}}\\&\qquad \gamma :={\mathbf {r} ^{\mathsf {T}}\mathbf {r} }/{\mathbf {r} ^{\mathsf {T}}\mathbf {Ar} }\\&\qquad \mathbf {x} :=\mathbf {x} +\gamma \mathbf {r} \\&\qquad {\hbox{if }}\mathbf {r} ^{\mathsf {T}}\mathbf {r} {\text{ is sufficiently small, then exit loop}}\\&\qquad \mathbf {r} :=\mathbf {r} -\gamma \mathbf {Ar} \\&{\text{end repeat loop}}\\&{\text{return }}\mathbf {x} {\text{ as the result}}\end{aligned}}}$
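The second loop above, which reuses the product $\mathbf{Ar}$ to update the residual instead of recomputing it, translates directly into Python; the symmetric positive-definite test matrix is an illustrative example:

```python
import numpy as np

def steepest_descent_spd(A, b, x0, tol=1e-12, max_iters=10_000):
    """Steepest descent for A x = b with symmetric positive-definite A,
    mirroring the loop above: one matrix-vector product A @ r per iteration."""
    x = x0.astype(float).copy()
    r = b - A @ x
    for _ in range(max_iters):
        Ar = A @ r
        gamma = (r @ r) / (r @ Ar)
        x = x + gamma * r
        if r @ r < tol:  # residual sufficiently small: exit loop
            break
        r = r - gamma * Ar  # recursive residual update, no extra A @ x
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent_spd(A, b, np.zeros(2))
```

Because this matrix is well conditioned, only a handful of iterations are needed before the exit test fires.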

## Solution of a non-linear system

${\displaystyle {\begin{cases}3x_{1}-\cos(x_{2}x_{3})-{\tfrac {3}{2}}=0\\4x_{1}^{2}-625x_{2}^{2}+2x_{2}-1=0\\\exp(-x_{1}x_{2})+20x_{3}+{\tfrac {10\pi -3}{3}}=0\end{cases}}}$

${\displaystyle G(\mathbf {x} )={\begin{bmatrix}3x_{1}-\cos(x_{2}x_{3})-{\tfrac {3}{2}}\\4x_{1}^{2}-625x_{2}^{2}+2x_{2}-1\\\exp(-x_{1}x_{2})+20x_{3}+{\tfrac {10\pi -3}{3}}\\\end{bmatrix}},}$

${\displaystyle \mathbf {x} ={\begin{bmatrix}x_{1}\\x_{2}\\x_{3}\\\end{bmatrix}}.}$

${\displaystyle F(\mathbf {x} )={\frac {1}{2}}G^{\mathrm {T} }(\mathbf {x} )G(\mathbf {x} )={\frac {1}{2}}\left[\left(3x_{1}-\cos(x_{2}x_{3})-{\frac {3}{2}}\right)^{2}+\left(4x_{1}^{2}-625x_{2}^{2}+2x_{2}-1\right)^{2}+\left(\exp(-x_{1}x_{2})+20x_{3}+{\frac {10\pi -3}{3}}\right)^{2}\right],}$

${\displaystyle \mathbf {x} ^{(0)}=\mathbf {0} ={\begin{bmatrix}0\\0\\0\\\end{bmatrix}}.}$

${\displaystyle \mathbf {x} ^{(1)}=\mathbf {0} -\gamma _{0}\nabla F(\mathbf {0} )=\mathbf {0} -\gamma _{0}J_{G}(\mathbf {0} )^{\mathrm {T} }G(\mathbf {0} ),}$

where the Jacobian matrix ${\displaystyle J_{G}}$ is given by:

${\displaystyle J_{G}(\mathbf {x} )={\begin{bmatrix}3&\sin(x_{2}x_{3})x_{3}&\sin(x_{2}x_{3})x_{2}\\8x_{1}&-1250x_{2}+2&0\\-x_{2}\exp {(-x_{1}x_{2})}&-x_{1}\exp(-x_{1}x_{2})&20\\\end{bmatrix}}.}$

${\displaystyle J_{G}(\mathbf {0} )={\begin{bmatrix}3&0&0\\0&2&0\\0&0&20\end{bmatrix}},\qquad G(\mathbf {0} )={\begin{bmatrix}-2.5\\-1\\10.472\end{bmatrix}}.}$

${\displaystyle \mathbf {x} ^{(1)}=\mathbf {0} -\gamma _{0}{\begin{bmatrix}-7.5\\-2\\209.44\end{bmatrix}},}$
${\displaystyle F(\mathbf {0} )=0.5\left((-2.5)^{2}+(-1)^{2}+(10.472)^{2}\right)=58.456.}$

${\displaystyle F\left(\mathbf {x} ^{(1)}\right)\leq F\left(\mathbf {x} ^{(0)}\right)=F(\mathbf {0} ).}$

${\displaystyle \mathbf {x} ^{(1)}={\begin{bmatrix}0.0075\\0.002\\-0.20944\\\end{bmatrix}}.}$

${\displaystyle F\left(\mathbf {x} ^{(1)}\right)=0.5\left((-2.48)^{2}+(-1.00)^{2}+(6.28)^{2}\right)=23.306.}$

The objective decreases from

${\displaystyle F(\mathbf {0} )=58.456}$

to the value at the next step,

${\displaystyle F\left(\mathbf {x} ^{(1)}\right)=23.306.}$
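The whole worked example can be reproduced numerically; the step size $\gamma _{0}=0.001$ below is inferred from the printed value of $\mathbf{x}^{(1)}$ and is assumed rather than stated in the text:

```python
import numpy as np

def G(x):
    """The vector-valued function whose root is sought."""
    x1, x2, x3 = x
    return np.array([
        3 * x1 - np.cos(x2 * x3) - 1.5,
        4 * x1 ** 2 - 625 * x2 ** 2 + 2 * x2 - 1,
        np.exp(-x1 * x2) + 20 * x3 + (10 * np.pi - 3) / 3,
    ])

def JG(x):
    """Jacobian matrix of G."""
    x1, x2, x3 = x
    return np.array([
        [3, np.sin(x2 * x3) * x3, np.sin(x2 * x3) * x2],
        [8 * x1, -1250 * x2 + 2, 0],
        [-x2 * np.exp(-x1 * x2), -x1 * np.exp(-x1 * x2), 20],
    ])

def F(x):
    """Merit function F(x) = G(x)^T G(x) / 2."""
    g = G(x)
    return 0.5 * (g @ g)

x0 = np.zeros(3)
gamma0 = 0.001  # assumed: this value reproduces x^(1) above
x1 = x0 - gamma0 * JG(x0).T @ G(x0)  # gradient of F is J_G(x)^T G(x)
```

Evaluating `F(x0)` and `F(x1)` recovers the values 58.456 and 23.306 quoted above.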

## Modifications

### Fast gradient methods

Yurii Nesterov proposed[18] a simple modification that achieves faster convergence for convex problems, and it has since been further generalized. For unconstrained smooth problems, the method is called the fast gradient method (FGM) or the accelerated gradient method (AGM). Specifically, if the differentiable function ${\displaystyle F}$ is convex and ${\displaystyle \nabla F}$ is Lipschitz, and it is not assumed that ${\displaystyle F}$ is strongly convex, then the error in the objective value produced by gradient descent at each step ${\displaystyle k}$ is bounded by ${\textstyle {\mathcal {O}}\left({\tfrac {1}{k}}\right)}$. With Nesterov's acceleration technique, the error instead decreases at ${\textstyle {\mathcal {O}}\left({\tfrac {1}{k^{2}}}\right)}$.[19] It is known that the rate ${\displaystyle {\mathcal {O}}\left({k^{-2}}\right)}$ for the decrease of the cost function is optimal for first-order optimization methods. Nevertheless, the algorithm can still be improved by reducing the constant factor. The optimized gradient method (OGM)[20] reduces that constant by a factor of two and is an optimal first-order method for large-scale problems.[21]
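A minimal sketch of the accelerated scheme, using the standard momentum sequence ${\displaystyle t_{k+1}=\left(1+{\sqrt {1+4t_{k}^{2}}}\right)/2}$ and step size ${\displaystyle 1/L}$; the ill-conditioned quadratic test function and iteration count are illustrative choices:

```python
import numpy as np

def nesterov_agm(grad_f, x0, L, iters):
    """Nesterov's accelerated gradient method for an L-smooth convex f."""
    x = x0.copy()
    y = x0.copy()  # look-ahead point
    t = 1.0
    for _ in range(iters):
        x_next = y - grad_f(y) / L  # gradient step from the look-ahead point
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# Illustrative ill-conditioned quadratic: f(x) = 0.5 * (x1^2 + 100 * x2^2).
d = np.array([1.0, 100.0])
f = lambda x: 0.5 * x @ (d * x)
grad_f = lambda x: d * x
x = nesterov_agm(grad_f, np.array([1.0, 1.0]), L=100.0, iters=1000)
```

The standard guarantee ${\displaystyle f(x_{k})-f^{*}\leq 2L\|x_{0}-x^{*}\|^{2}/(k+1)^{2}}$ bounds the error here well below that of plain gradient descent with the same step size.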

## References

1. Hill Climbing Algorithms (and gradient descent variants) IRL. Archived at the Internet Archive on 27 March 2020.
2. Lemaréchal, C. (2012). "Cauchy and the Gradient Method" (PDF). Doc Math Extra: 251–254. Archived from the original (PDF) on 29 December 2018. Retrieved 1 June 2021.
3. Hadamard, Jacques (1908). "Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées". Mémoires présentés par divers savants étrangers à l'Académie des Sciences de l'Institut de France. 33.
4. Courant, Richard (January 1943). "Variational methods for the solution of problems of equilibrium and vibrations". Bull. Amer. Math. Soc. 49: 1–23. doi:10.1090/S0002-9904-1943-07818-4 – via Project Euclid.
5. Curry, Haskell B. (1944). "The Method of Steepest Descent for Non-linear Minimization Problems". Quart. Appl. Math. 2 (3): 258–261. doi:10.1090/qam/10667.
6. Polyak, Boris (1987). Introduction to Optimization.
7. Akilov, G. P.; Kantorovich, L. V. (1982). Functional Analysis (2nd ed.). Pergamon Press. ISBN 0-08-023036-9.
8. Barzilai, Jonathan; Borwein, Jonathan M. (1988). "Two-Point Step Size Gradient Methods". IMA Journal of Numerical Analysis. 8 (1): 141–148. doi:10.1093/imanum/8.1.141.
9. Fletcher, R. (2005). "On the Barzilai–Borwein Method". In Qi, L.; Teo, K.; Yang, X. (eds.). Optimization and Control with Applications. Applied Optimization. Vol. 96. Boston: Springer. pp. 235–256. ISBN 0-387-24254-6.
10. Wolfe, Philip (1969-04-01). "Convergence Conditions for Ascent Methods". SIAM Review. 11 (2): 226–235. doi:10.1137/1011036. ISSN 0036-1445.
11. Bernstein et al. "On the distance between two neural networks and the stability of learning". arXiv preprint.
12. Haykin, Simon S. Adaptive filter theory. Pearson Education India, 2008. - p. 108-142, 217-242
13. Saad, Yousef (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). Philadelphia, Pa.: Society for Industrial and Applied Mathematics. p. 195. ISBN 978-0-89871-534-7.
14. Bouwmeester, Henricus; Dougherty, Andrew; Knyazev, Andrew V. (2015). "Nonsymmetric Preconditioning for Conjugate Gradient and Steepest Descent Methods". Procedia Computer Science. 51: 276–285. doi:10.1016/j.procs.2015.05.241.
15. Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). New York: Cambridge University Press. ISBN 0-521-43108-5.
16. Strutz, T. (2016). Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond (2nd ed.). Springer Vieweg. ISBN 978-3-658-11455-8.
17. Ross, I. M. (2019-07-01). "An optimal control theory for nonlinear optimization". Journal of Computational and Applied Mathematics. 354: 39–51. doi:10.1016/j.cam.2018.12.044. ISSN 0377-0427.
18. Nesterov, Yu. (2004). Introductory Lectures on Convex Optimization : A Basic Course. Springer. ISBN 1-4020-7553-7.
19. Vandenberghe, Lieven (2019). "Fast Gradient Methods" (PDF). Lecture notes for EE236C at UCLA.
20. Kim, D.; Fessler, J. A. (2016). "Optimized First-order Methods for Smooth Convex Minimization". Math. Prog. 151 (1–2): 81–107. arXiv:1406.5468. doi:10.1007/s10107-015-0949-3. PMC 5067109. PMID 27765996.
21. Drori, Yoel (2017). "The Exact Information-based Complexity of Smooth Convex Minimization". Journal of Complexity. 39: 1–16. arXiv:1606.01424. doi:10.1016/j.jco.2016.11.001.
22. Qian, Ning (January 1999). "On the momentum term in gradient descent learning algorithms" (PDF). Neural Networks. 12 (1): 145–151. CiteSeerX 10.1.1.57.5612. doi:10.1016/S0893-6080(98)00116-6. PMID 12662723. Archived from the original (PDF) on 8 May 2014. Retrieved 17 October 2014.
23. "Momentum and Learning Rate Adaptation". Willamette University. Retrieved 17 October 2014.
24. Combettes, P. L.; Pesquet, J.-C. (2011). "Proximal splitting methods in signal processing". In Bauschke, H. H.; Burachik, R. S.; Combettes, P. L.; Elser, V.; Luke, D. R.; Wolkowicz, H. (eds.). Fixed-Point Algorithms for Inverse Problems in Science and Engineering. New York: Springer. pp. 185–212. arXiv:0912.3522. ISBN 978-1-4419-9568-1.