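Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network that can retain information over long spans of time. An ordinary recurrent network can process sequential information, such as a sentence written in natural language, but real-world problem solving often requires an agent to recognise causes and effects that lie far apart in time: fully understanding the ending of a story, for example, may require recalling events from its beginning. Ordinary recurrent neural networks have been shown to handle such problems poorly in many cases^[1].

The defining feature of an LSTM network is its built-in gated neurons: special artificial neurons with "gates" that decide what information to store and when to read, write, or erase it. In the code of an LSTM network, an algorithm helps each gated cell decide when to change its own activation level; an ordinary recurrent neural network, by contrast, has no means of controlling whether the past information it holds is replaced when new information arrives. This property lets an LSTM network hold information for a long time and pick out, from its input values, causes and effects separated by variable time lags^[1]^[2]. Research has shown that LSTM networks can be used to teach computers tasks that demand long-term memory, such as reading comprehension^[1]^[3].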

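Structure

Basic LSTM

The accompanying diagram is a schematic of the LSTM memory unit. The unit consists mainly of a pipeline, the state path (the horizontal line along the top, which branches off to the output at the lower right), and a control path (the lower horizontal line at the bottom left). The state path represents the flow of the cell state $c$ from one time step to the next (the memory) and the output of that state (used for prediction); the control path represents how the previous output and the new input control the pipeline. From left to right, the three " $\sigma \rightarrow \otimes$ " pairs are the forget gate, the input gate, and the output gate, each controlling its own part of the unit^[1]. Each of the three gates can be viewed as a small sub-cell with its own activation value. That value is determined by the previous output $h_{t-1}$ (the "hidden state") and the current input $x_{t}$ (here through the sigmoid function, written $\sigma$, whose output lies between 0 and 1), and the gate then acts on the pipeline according to its activation:^[4]^[5]

Forget gate ( $f$ ): in effect a "keep" gate. Near 0 it shuts off the previous state ( $c_{t-1}$ ) so that it is forgotten (zeroed); near 1 it opens so that the state is retained; values between 0 and 1 set how much of the state is retained.
Input gate ( $i$ ): in effect a "memorise" gate. It decides whether to write in the candidate state update ( ${\tilde {c}}_{t}$ ), and how much of it.
Output gate ( $o$ ): decides whether to emit the output $h_{t}$ , and how much of it.

The ungated values, namely the write value ${\tilde {c}}_{t}$ and the output state value ${c}_{t}^{o}$ , come from the $\tanh$ function. The write value is obtained by multiplying the previous output $h_{t-1}$ and the new input $x_{t}$ by their weights, adding a bias, and passing the result through $\tanh$ ; the output state value is obtained by passing the new state ${c}_{t}$ through $\tanh$ directly, with no weights applied.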
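As a rough illustration of how the gates act on the pipeline, the sketch below applies a forget gate and an input gate to a single scalar state, assuming the state update $c_{t}=f_{t}c_{t-1}+i_{t}{\tilde {c}}_{t}$ . The function name `update_cell` and the pre-activation values are hypothetical, chosen only to push each gate close to 0 or 1.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def update_cell(c_prev: float, c_candidate: float, f_pre: float, i_pre: float) -> float:
    """Blend the old state and the candidate using the two gate activations."""
    f = sigmoid(f_pre)  # forget gate: near 1 keeps c_prev, near 0 erases it
    i = sigmoid(i_pre)  # input gate: near 1 writes the candidate, near 0 ignores it
    return f * c_prev + i * c_candidate


# Forget gate wide open, input gate shut: the old state is carried over.
kept = update_cell(c_prev=2.0, c_candidate=-1.0, f_pre=10.0, i_pre=-10.0)
# Forget gate shut, input gate wide open: the old state is overwritten.
overwritten = update_cell(c_prev=2.0, c_candidate=-1.0, f_pre=-10.0, i_pre=10.0)
print(round(kept, 3), round(overwritten, 3))  # prints: 2.0 -1.0
```

Intermediate gate values blend the two extremes, which is what "how much" means in the gate descriptions.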

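All the formulas can be written as follows, where $\odot$ denotes the Hadamard (element-wise) product and sigm is the sigmoid function $\sigma$ :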

$$
\begin{aligned}
f_{t}&=\operatorname{sigm}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\
i_{t}&=\operatorname{sigm}(W_{i}x_{t}+U_{i}h_{t-1}+b_{i})\\
o_{t}&=\operatorname{sigm}(W_{o}x_{t}+U_{o}h_{t-1}+b_{o})\\
{\tilde {c}}_{t}&=\tanh(W_{c}x_{t}+U_{c}h_{t-1}+b_{c})\\
c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot {\tilde {c}}_{t}\\
{c}_{t}^{o}&=\tanh(c_{t})\\
h_{t}&=o_{t}\odot {c}_{t}^{o}
\end{aligned}
$$
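The equations above can be sketched as a single time step in dependency-free Python. This is a minimal sketch, not a reference implementation: the helper names, the dictionary layout of the parameters (`"Wf"`, `"Uf"`, `"bf"` and so on), the random initialisation, and the layer sizes are all assumptions made for illustration.

```python
import math
import random


def sigm(z):
    """Element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def tanh(z):
    """Element-wise tanh."""
    return [math.tanh(v) for v in z]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def hadamard(a, b):
    """Element-wise (Hadamard) product."""
    return [x * y for x, y in zip(a, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the basic LSTM; p maps names such as 'Wf' to parameters."""
    def pre(g):  # W_g x_t + U_g h_{t-1} + b_g
        return add(matvec(p["W" + g], x_t), matvec(p["U" + g], h_prev), p["b" + g])

    f_t = sigm(pre("f"))      # forget gate
    i_t = sigm(pre("i"))      # input gate
    o_t = sigm(pre("o"))      # output gate
    c_tilde = tanh(pre("c"))  # candidate (write) value
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, c_tilde))
    h_t = hadamard(o_t, tanh(c_t))
    return h_t, c_t


def random_params(n_in, n_hid, rng):
    """Hypothetical initialisation: small uniform weights, zero biases."""
    p = {}
    for g in "fioc":
        p["W" + g] = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
        p["U" + g] = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_hid)]
        p["b" + g] = [0.0] * n_hid
    return p


params = random_params(n_in=3, n_hid=2, rng=random.Random(0))
h, c = [0.0, 0.0], [0.0, 0.0]
# Feed a short sequence, carrying h and c from one step to the next.
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because $h_{t}$ is a gate value in (0, 1) times a $\tanh$ output, every component of `h` stays strictly between -1 and 1, whereas `c` is not bounded in this way.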

More commonly, the previous output $h_{t-1}$ and the new input $x_{t}$ are concatenated and then multiplied by a single weight matrix^[6]:

$$
\begin{aligned}
f_{t}&=\operatorname{sigm}(W_{f}[h_{t-1},x_{t}]+b_{f})\\
i_{t}&=\operatorname{sigm}(W_{i}[h_{t-1},x_{t}]+b_{i})\\
o_{t}&=\operatorname{sigm}(W_{o}[h_{t-1},x_{t}]+b_{o})\\
{\tilde {c}}_{t}&=\tanh(W_{c}[h_{t-1},x_{t}]+b_{c})\\
c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot {\tilde {c}}_{t}\\
{c}_{t}^{o}&=\tanh(c_{t})\\
h_{t}&=o_{t}\odot {c}_{t}^{o}
\end{aligned}
$$
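The two notations describe the same computation if the single matrix in the concatenated form is read as the two separate matrices placed side by side. The check below illustrates this for the forget gate's pre-activation, assuming that reading; all numbers are hypothetical, and `U_f` and `W_f` here are the separate-form matrices.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical small parameters: 2 hidden units, 3 inputs.
U_f = [[0.1, -0.2], [0.3, 0.4]]            # acts on h_{t-1}
W_f = [[0.5, 0.0, -0.5], [0.2, 0.1, 0.0]]  # acts on x_t
b_f = [0.01, -0.02]
h_prev = [0.6, -0.1]
x_t = [1.0, 2.0, 3.0]

# Separate form: W_f x_t + U_f h_{t-1} + b_f
separate = [wx + uh + b
            for wx, uh, b in zip(matvec(W_f, x_t), matvec(U_f, h_prev), b_f)]

# Concatenated form: join the rows as [U_f | W_f] and the inputs as [h_{t-1}, x_t].
W_cat = [u_row + w_row for u_row, w_row in zip(U_f, W_f)]
concatenated = [s + b for s, b in zip(matvec(W_cat, h_prev + x_t), b_f)]

same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(separate, concatenated))
print(same)  # prints: True
```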

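Peephole LSTM

In the peephole LSTM variant, the cell state $c_{t-1}$ stands in for $h_{t-1}$ : the unit is in effect given a "peephole" through which it looks back at the previous state $c_{t-1}$ , which is then combined with the new input $x_{t}$ in the computation. In a peephole LSTM, $c_{t}$ also becomes ${c}_{t}^{o}$ without passing through the $\tanh$ function, that is, ${c}_{t}^{o}=c_{t}$ ^[8]^[9]. The full set of formulas is therefore: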

$$
\begin{aligned}
f_{t}&=\operatorname{sigm}(W_{f}[c_{t-1},x_{t}]+b_{f})\\
i_{t}&=\operatorname{sigm}(W_{i}[c_{t-1},x_{t}]+b_{i})\\
o_{t}&=\operatorname{sigm}(W_{o}[c_{t-1},x_{t}]+b_{o})\\
{\tilde {c}}_{t}&=\tanh(W_{c}x_{t}+b_{c})\\
c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot {\tilde {c}}_{t}\\
h_{t}&=o_{t}\odot c_{t}
\end{aligned}
$$
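A peephole step can be sketched in the same way. The sketch assumes, following the prose above, that each gate sees the previous cell state $c_{t-1}$ together with the new input $x_{t}$ (concatenated), that the candidate value depends on $x_{t}$ alone, and that the output skips $\tanh$ . The function name, the parameter layout, and the sizes are hypothetical.

```python
import math
import random


def sigm(z):
    """Element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def peephole_step(x_t, c_prev, p):
    """One peephole-LSTM step: the gates read c_{t-1} instead of h_{t-1}."""
    z = c_prev + x_t  # list concatenation, i.e. [c_{t-1}, x_t]

    def gate(g):
        return sigm([s + b for s, b in zip(matvec(p["W" + g], z), p["b" + g])])

    f_t, i_t, o_t = gate("f"), gate("i"), gate("o")
    # The candidate value uses the new input only.
    c_tilde = [math.tanh(s + b) for s, b in zip(matvec(p["Wc"], x_t), p["bc"])]
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * c for o, c in zip(o_t, c_t)]  # no tanh on the way out
    return h_t, c_t


rng = random.Random(1)
n_in, n_hid = 3, 2
params = {"Wc": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)],
          "bc": [0.0] * n_hid}
for g in "fio":
    params["W" + g] = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + n_in)]
                       for _ in range(n_hid)]
    params["b" + g] = [0.0] * n_hid

c = [0.0] * n_hid
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = peephole_step(x, c, params)
print(h, c)
```

Since the output gate lies in (0, 1) and there is no $\tanh$ on the output, each component of `h` is a scaled-down copy of the corresponding component of `c`.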

References

[1] Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780.
[2] Bayer, Justin; Wierstra, Daan; Togelius, Julian; Schmidhuber, Jürgen (14 September 2009). "Evolving Memory Cell Structures for Sequence Learning". Artificial Neural Networks – ICANN 2009. Lecture Notes in Computer Science. 5769. Springer, Berlin, Heidelberg. pp. 755–764.
[3] Gers, F. A.; Schraudolph, N. N.; Schmidhuber, J. (2002). "Learning precise timing with LSTM recurrent networks". Journal of Machine Learning Research. 3: 115–143.
[4] Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "Sequence labelling in structured domains with hierarchical recurrent neural networks". In Proc. 20th Int. Joint Conf. on Artificial Intelligence, IJCAI 2007: 774–779.
[5] Graves, Alex; Fernández, Santiago; Gomez, Faustino (2006). "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks". In Proceedings of the International Conference on Machine Learning, ICML 2006: 369–376.
[6] Olah, Christopher (2015). "Understanding LSTM Networks". colah's blog.
[7] Greff, Klaus; Srivastava, Rupesh Kumar; Koutník, Jan; Steunebrink, Bas R.; Schmidhuber, Jürgen (2015). "LSTM: A Search Space Odyssey". IEEE Transactions on Neural Networks and Learning Systems. 28 (10): 2222–2232. arXiv:1503.04069. doi:10.1109/TNNLS.2016.2582924. PMID 27411231. S2CID 3356463.
[8] Gers, F.; Schraudolph, N.; Schmidhuber, J. (2002). "Learning precise timing with LSTM recurrent networks". Journal of Machine Learning Research. 3: 115–143.
[9] Gers, F.; Schraudolph, N.; Schmidhuber, J. (2002). "Learning precise timing with LSTM recurrent networks". Journal of Machine Learning Research. 3: 115–143.
