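Recurrent neural network

A recurrent neural network (RNN) is a type of artificial neural network dating from the 1980s^[1] whose computation graph contains directed cycles. Put simply, the activation of at least some artificial neurons can influence their own later activation, or that of neurons in earlier layers, so that information from one time step can affect future computation. Taking a single artificial neuron, its activation $t$ can be computed with a formula along the following lines^[2]^[3]:

t=W_{1}A_{1}+W_{2}A_{2}+\dots

(the activation function, shown here as a plain weighted sum)

In this formula, $A_{n}$ is the activation of the $n$-th neuron that influences $t$, and $W_{n}$ is the weight of that $n$-th neuron (how strongly it influences $t$). The $A_{n}$ may include neurons in the same layer or in later layers, so information the network receives at one time step can influence its behaviour at future time steps. This is the main difference between a recurrent network and a feedforward network (FNN)^[3]^[4].

Because information from one time step can influence later computation, a recurrent network can handle sequential information^{[note 1]} that a feedforward network struggles with. This makes it useful in fields such as natural language processing and machine translation, where the meaning of a word can depend on the words that came before it in the sentence^[5]^[6]. Neural networks have also proved useful for teaching AI to play games: AlphaGo, the program that became famous in 2016 for defeating the 9-dan Go player Lee Sedol, relied on deep neural networks combined with tree search^[7], although its published architecture used convolutional rather than recurrent networks. Recurrent networks in game AI are discussed under Applications.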

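Basic concepts

A recurrent neural network builds on the feedforward neural network. In a feedforward network, each artificial neuron's activation function involves only the activations of the neurons in the preceding layer: in $t=W_{1}A_{1}+W_{2}A_{2}+\dots$, the terms $A_{1},A_{2},\dots ,A_{n}$ cover only the preceding layer. Each input-to-output computation is therefore independent: on receiving an input, the program computes the output from the network's activation functions and weights, and when the next input arrives the network has completely forgotten whatever the previous input left behind. In a recurrent network, by contrast, at least some neurons can influence the activation of neurons in earlier layers: in $t=W_{1}A_{1}+W_{2}A_{2}+\dots$, the terms $A_{1},A_{2},\dots ,A_{n}$ may include the activations, at earlier time steps, of neurons in the same layer or in later layers, so that information can "recur" inside the network^[2].

[Figure: an unrolled (unfolded) recurrent neural network^[8]]

In the figure, $h$ is the network's hidden layers, $x_{t}$ is the input at time step $t$, $o_{t}$ is the output at time step $t$, and $v$ is the information the network hands from one time step to itself at the next. In mathematical terms, the input-output relationship of a recurrent network is:

o_{t}=f(h_{t};\theta )\qquad [1]

h_{t}=g(h_{t-1},x_{t};\theta )\qquad [2]

$h_{t}$ is the state of the hidden layers at time step $t$, and $\theta$ is the network's parameters. Equation $[1]$ is the same as in a feedforward network (the output depends on the hidden state and the parameters), whereas equation $[2]$ captures the difference between the two: it includes $h_{t-1}$ (the hidden state at the previous time step), a variable that a feedforward network does not consider^[4].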
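Equations $[1]$ and $[2]$ can be sketched in a few lines of Python. This is a minimal illustration with a scalar hidden state; the choice of `tanh` for $g$, a linear readout for $f$, and the parameter names `w_x`, `w_h`, `w_o` are assumptions made for the example, not part of the definitions above.

```python
import math

def g(h_prev, x, theta):
    """State update h_t = g(h_{t-1}, x_t; theta). tanh is an assumed choice."""
    return math.tanh(theta["w_x"] * x + theta["w_h"] * h_prev)

def f(h, theta):
    """Readout o_t = f(h_t; theta). A linear readout is an assumed choice."""
    return theta["w_o"] * h

def run(xs, theta, h0=0.0):
    """Feed a sequence through the recurrence and collect the outputs."""
    h, outputs = h0, []
    for x in xs:
        h = g(h, x, theta)          # equation [2]
        outputs.append(f(h, theta)) # equation [1]
    return outputs

theta = {"w_x": 1.0, "w_h": 0.5, "w_o": 2.0}
recurrent = run([1.0, 0.0, 0.0], theta)
# Setting w_h to zero removes the dependence on h_{t-1}, as in a feedforward net.
memoryless = run([1.0, 0.0, 0.0], {**theta, "w_h": 0.0})
```

With the recurrent weight in place, the input at the first time step still affects the output at the second; with `w_h` set to zero, the second output depends on the second input alone.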

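This structural feature lets a recurrent network handle information whose parts depend on one another in time. Language is an example: the meaning of a word depends on the words around it (the Cantonese words 「家吓」, "now", and 「家庭」, "family", share a first character yet mean entirely different things). So when analysing the word heard at time step $t$, the words heard at the earlier time steps $(t-1),(t-2),\dots$ must also be taken into account before the whole utterance can be understood. In theory a recurrent network can temporarily remember the information received in its last few computations, so it should be better at this than a feedforward network. Indeed, by the early twenty-first century researchers had successfully used recurrent networks to get computers to process what people say in natural language (languages used in everyday conversation, such as Cantonese and Hokkien), leading to technologies such as machine translation; such tasks are very difficult for a feedforward network^[9].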

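Architecture types

Simple recurrent networks

A simple recurrent network (SRN) is the most basic kind of recurrent neural network and comes in two varieties: the Elman network and the Jordan network^[10]. An Elman network has three layers, $x$, $y$ and $z$, plus a set of context units $u$ whose connection weights are all fixed at 1. At each time step $t$, the network computes its output from the input and its parameters and updates its weights using a learning rule. Each hidden neuron ($y_{i}$, $i=1,2,\dots ,l$) has a corresponding context unit that records that neuron's activation at time step $t$; at time step $t+1$ the recorded value can influence the activation of the hidden neurons, so an Elman network can handle sequential information to some extent^[11]^[12].

An Elman network can be written as the following equations^[13]^[14]: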

{\begin{aligned}h_{t}&=\sigma _{h}(W_{h}x_{t}+U_{h}h_{t-1}+b_{h})\\y_{t}&=\sigma _{y}(W_{y}h_{t}+b_{y})\end{aligned}}

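Here $h_{t}$ is the state of the hidden layer at time step $t$, $x_{t}$ is the input at time step $t$, and $y_{t}$ is the output at time step $t$; $W$ and $U$ are sets of weights; $b$ is the bias, that is, the neuron's own tendency to activate (a neuron whose $b$ is large and positive tends to activate strongly whatever its input); and $\sigma _{h}$ and $\sigma _{y}$ are the corresponding activation functions^[15].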
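One step of an Elman network, $h_{t}=\sigma _{h}(W_{h}x_{t}+U_{h}h_{t-1}+b_{h})$ and $y_{t}=\sigma _{y}(W_{y}h_{t}+b_{y})$, can be sketched in plain Python. This is an illustration only: using `tanh` for $\sigma _{h}$ and the logistic function for $\sigma _{y}$, the helper names, and the weight values are all assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def elman_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """Return (h_t, y_t) for one time step of an Elman network."""
    pre_h = [a + b + c for a, b, c in
             zip(matvec(W_h, x_t), matvec(U_h, h_prev), b_h)]
    h_t = [math.tanh(v) for v in pre_h]            # sigma_h assumed to be tanh
    pre_y = [a + b for a, b in zip(matvec(W_y, h_t), b_y)]
    y_t = [sigmoid(v) for v in pre_y]              # sigma_y assumed logistic
    return h_t, y_t

# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output.
W_h = [[0.5, -0.3], [0.8, 0.2]]
U_h = [[0.1, 0.4], [-0.2, 0.3]]
b_h = [0.0, 0.0]
W_y = [[1.0, -1.0]]
b_y = [0.0]

h = [0.0, 0.0]   # context starts empty
ys = []
for x in ([1.0, 0.0], [1.0, 0.0]):   # the same input twice
    h, y = elman_step(x, h, W_h, U_h, b_h, W_y, b_y)
    ys.append(y[0])
```

The same input is presented twice, yet the two outputs differ, because the second step also sees the hidden state left behind by the first.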

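A Jordan network is a variant of the Elman network. The difference is that the context units $u$ of a Jordan network record the state of the output layer at the previous time step rather than the state of the hidden layer, so a Jordan network can be written as the following equations^[11]:

{\begin{aligned}h_{t}&=\sigma _{h}(W_{h}x_{t}+U_{h}y_{t-1}+b_{h})\\y_{t}&=\sigma _{y}(W_{y}h_{t}+b_{y})\end{aligned}}

The only difference between these two equations and the Elman pair is that $h_{t}$ depends on $y_{t-1}$ (the output at the previous time step) instead of $h_{t-1}$ (the hidden state at the previous time step)^[16].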

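Bidirectional recurrent networks

A bidirectional recurrent network (BRN) is a more advanced kind of recurrent neural network. Its distinguishing feature is a pair of hidden layers that are not connected to each other, called

forward states ($h_{f}$) and
backward states ($h_{b}$).

A bidirectional recurrent network reads $n$ inputs at a time, and for each $t$, $t\in 1,2,\dots ,n$ ^{[note 2]}^[17]^[18]:

{\begin{aligned}h_{f}&=\sigma _{h}(W_{h}x_{t}+U_{h}h_{f_{t-1}}+b_{h})\\h_{b}&=\sigma _{h}(W_{h}x_{t}+U_{h}h_{b_{t+1}}+b_{h})\\y_{t}&=\sigma _{y}(W_{y_{f}}h_{f}+W_{y_{b}}h_{b}+b_{y})\end{aligned}}

The $h_{f_{t-1}}$ of the first time step and the $h_{b_{t+1}}$ of the last time step can be treated as preset constants.

[Figure: structure of a bidirectional recurrent network (right)^[17]]

The output layer of a bidirectional recurrent network draws on information from both the past and the future. Imagine a recurrent network doing machine translation: it takes a sequence of English words as input and outputs the corresponding Cantonese text. Each time, it reads 10 words, and for each word it produces an output that depends on both the words before it and the words after it. The defining strength of a bidirectional network is that it also takes "future" information into account, and because future information is often useful for prediction (especially in language processing), a bidirectional network can make some predictions that an ordinary recurrent network cannot^[19]^[20].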
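A bidirectional pass over one input window can be sketched as follows. This is a minimal illustration with scalar hidden states; the use of `tanh` for $\sigma _{h}$, an identity output activation, the shared parameters `w`, `u`, `b` for both directions, and the numeric values are assumptions made for the example.

```python
import math

def bidirectional_pass(xs, w, u, b, w_yf, w_yb, b_y, h_init=0.0):
    """Return one output per time step, combining forward and backward states."""
    n = len(xs)
    h_f, h_b = [0.0] * n, [0.0] * n

    prev = h_init                       # preset constant before the first step
    for t in range(n):                  # forward states: t = 1 .. n
        prev = math.tanh(w * xs[t] + u * prev + b)
        h_f[t] = prev

    nxt = h_init                        # preset constant after the last step
    for t in reversed(range(n)):        # backward states: t = n .. 1
        nxt = math.tanh(w * xs[t] + u * nxt + b)
        h_b[t] = nxt

    # Identity output activation (an assumption made for brevity).
    return [w_yf * f + w_yb * bk + b_y for f, bk in zip(h_f, h_b)]

params = dict(w=1.0, u=0.5, b=0.0, w_yf=1.0, w_yb=1.0, b_y=0.0)
out_a = bidirectional_pass([1.0, 0.0, 0.0], **params)
out_b = bidirectional_pass([1.0, 0.0, 5.0], **params)   # only the LAST input differs

forward_only = dict(params, w_yb=0.0)
fwd_a = bidirectional_pass([1.0, 0.0, 0.0], **forward_only)
fwd_b = bidirectional_pass([1.0, 0.0, 5.0], **forward_only)
```

Changing only the last input changes the first output of the bidirectional network, while a forward-only network (backward weight set to zero) produces the same first output in both cases.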

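Training algorithms

Backpropagation through time

Gradient descent is one of the most commonly used algorithms for training artificial neural networks. It is iterative: in each iteration the parameters are changed according to which change makes the error fall fastest, in the hope of eventually reaching the best parameter values^[21]^[22]:

Imagine a recurrent network with $i$ weights (parameters), represented by a vector $\mathbf {w}$, whose performance is measured by some indicator $z$ for which smaller is better (for example the network's error).
$\mathbf {w}$ is set to some values when the program is initialised;
At each step the algorithm may add to or subtract from any one value in $\mathbf {w}$, so there are $i\times 2$ possible moves; the algorithm computes $z_{1},z_{2},z_{3},z_{4},\dots$, where $z_{j}$ is the value of $z$ obtained by moving in the $j$-th direction, and then picks the direction in which $z$ falls the most^{[note 3]}^[23]^[24].

If all goes well, this slowly changes the weights into values that give a small error^[25].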
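The "try every direction and take the steepest drop" procedure can be sketched directly. This follows the simplified description above rather than how gradient descent is usually implemented (in practice the partial derivatives are computed analytically); the error function `z`, the step size, and the function name are hypothetical.

```python
def best_neighbor_step(w, z, step=0.1):
    """Try +step and -step on every weight (2 * len(w) moves) and keep the best."""
    best_w, best_z = w, z(w)
    for j in range(len(w)):
        for delta in (+step, -step):
            candidate = list(w)
            candidate[j] += delta
            z_candidate = z(candidate)
            if z_candidate < best_z:        # lower error is better
                best_w, best_z = candidate, z_candidate
    return best_w, best_z

def z(w):
    """Hypothetical error surface with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

w = [0.0, 0.0]              # initial weights
for _ in range(100):
    w, err = best_neighbor_step(w, z)
```

On this smooth, single-minimum surface the weights settle at the minimum; once no move lowers the error, the weights stop changing.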

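In plain gradient descent, each weight change depends only on the error at that time step (that is, $\Delta w_{t}=f(z_{t})$), whereas backpropagation through time (BPTT) makes each weight change depend on the accumulated error at the current and the previous $n$ time steps (that is, $\Delta w_{t}=f(z_{t},z_{t-1},z_{t-2},\dots ,z_{t-n})$). BPTT is a generalisation of ordinary backpropagation, since ordinary backpropagation can be viewed as BPTT with $n=0$^[26].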

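Basic algorithm

The basic procedure of backpropagation through time is as follows^[26]^[27]:

1. Show the network $n$ input-output pairs, each with a specific time step (its position in the sequence of inputs);
2. For each input-output pair, compute an error value; these error values accumulate;
3. Update the weights according to the accumulated error;
4. Return to step 1.

For a simple RNN:

h_{t}=\sigma (W\cdot x_{t}+U\cdot h_{t-1}+b)

Backpropagating $n$ steps through $h_{t-1}\dots h_{t-n}$, the derivatives of the state $h_{t}$ with respect to $W$, $U$ and $x_{t-n}$ can be expressed as:

{\begin{aligned}{\dfrac {\partial h_{t}}{\partial U}}&=\sum _{i=1}^{n}{\dfrac {\partial \sigma \left(z_{t-i+1}\right)}{\partial z_{t-i+1}}}\cdot U^{i-1}\cdot h_{t-i}\\{\dfrac {\partial h_{t}}{\partial W}}&=\sum _{i=1}^{n}{\dfrac {\partial \sigma \left(z_{t-i+1}\right)}{\partial z_{t-i+1}}}\cdot U^{i-1}\cdot x_{t-i+1}\\{\dfrac {\partial h_{t}}{\partial x_{t-n}}}&={\dfrac {\partial \sigma \left(z_{t-n}\right)}{\partial z_{t-n}}}\cdot W\cdot \prod _{i=1}^{n}{\dfrac {\partial \sigma \left(z_{t-i+1}\right)}{\partial z_{t-i+1}}}\cdot U\end{aligned}}

Here $z_{t-i+1}=W\cdot x_{t-i+1}+U\cdot h_{t-i}+b$ is the input to the activation function at step $t-i+1$, and ${\dfrac {\partial \sigma \left(z_{t-i+1}\right)}{\partial z_{t-i+1}}}$ is the derivative of the activation function's output with respect to its input at that step. Each step further back multiplies in one more factor of $U$.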
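The backward walk can be made concrete for a scalar RNN $h_{t}=\tanh(w x_{t}+u h_{t-1}+b)$. The sketch below accumulates $\partial h_{T}/\partial u$ exactly by the chain rule, multiplying by the local derivative and by $u$ once per step back, and compares it with a finite-difference estimate. The scalar setting, the choice of `tanh`, and all names and numbers are assumptions made for the example.

```python
import math

def forward(xs, w, u, b, h0=0.0):
    """Run the scalar RNN; hs[k] is the state BEFORE step k, hs[-1] the final state."""
    hs, zs = [h0], []
    for x in xs:
        z = w * x + u * hs[-1] + b
        zs.append(z)
        hs.append(math.tanh(z))
    return hs, zs

def dh_du_bptt(xs, w, u, b, h0=0.0):
    """d(final state)/du, accumulated by walking backwards through time."""
    hs, zs = forward(xs, w, u, b, h0)
    grad, carry = 0.0, 1.0
    for t in reversed(range(len(xs))):
        carry *= 1.0 - math.tanh(zs[t]) ** 2   # tanh'(z_t): now d h_T / d z_t
        grad += carry * hs[t]                  # z_t depends on u through u * h_{t-1}
        carry *= u                             # one more factor of u per step back
    return grad

def dh_du_numeric(xs, w, u, b, eps=1e-6):
    """Central finite difference, used only as a check."""
    hi = forward(xs, w, u + eps, b)[0][-1]
    lo = forward(xs, w, u - eps, b)[0][-1]
    return (hi - lo) / (2 * eps)

xs = [0.5, -0.3, 0.8, 0.1]
analytic = dh_du_bptt(xs, w=0.7, u=0.9, b=0.1)
numeric = dh_du_numeric(xs, w=0.7, u=0.9, b=0.1)
```

The repeated multiplication by `u` and by a derivative no larger than 1 is also why contributions from distant time steps tend to shrink or blow up as the sequence grows.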

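Global optimisation

Genetic algorithms are the approach most commonly used for global optimisation of recurrent neural networks^[28]^[29]. Global optimisation means using mathematical methods to find the maximum or minimum of a whole function. Gradient descent and similar methods often leave $z$ stuck at a local minimum: the error has reached a low point whose neighbouring points all have higher error, yet the function has a global minimum with an even smaller error somewhere else. Because the global minimum is not near the local one, the algorithm does not know to move there (see also the vanishing gradient problem). Global optimisation methods such as genetic algorithms are different: they do not rely only on the values of $z$ around the current point when deciding how to change the parameters^[30].

[Figure: the Y axis represents the error and the X axis the weight vector. A local minimum is the lowest point in its neighbourhood, yet a lower global minimum exists elsewhere.]

A genetic algorithm is a machine learning algorithm based on the idea of natural selection from the theory of evolution. In evolution, the individuals within a population (for example a group of humans) differ from one another genetically to some degree, and these differences lead to individual differences in phenotype (appearance, behaviour, constitution and so on). Some individuals are better at surviving and reproducing and so are more likely to pass their genes on to the next generation. If the environment stays the same, the population changes genetically from generation to generation and becomes ever better suited to surviving and reproducing in that environment. A genetic algorithm is inspired by this theory and works as follows^[31]^[32]:

Create a large batch of mathematical models of the same kind, each with different parameters;
Have each model make a number of predictions and give it a score $s$ according to its accuracy, where a higher score means better performance;
Select the models with the highest scores $s$ and eliminate the rest;
Carry out "reproduction": use the highest-scoring models as "parents" to produce the next generation of models. The offspring resemble their parents in their parameters (each parameter of each child is a function of the corresponding parameters of its parents);
Repeat the process above for a number of generations;
If all goes well, after a number of generations the models in hand will be ones that predict accurately^[33]^[34].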
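The steps above can be sketched as a small genetic algorithm over parameter vectors. This is an illustration under stated assumptions: the toy `score` function stands in for "prediction accuracy", children are the average of two parents plus Gaussian noise (one of many possible reproduction rules), and the population size, noise level and seed are arbitrary.

```python
import random

def evolve(score, n_params, pop_size=30, n_keep=10,
           generations=60, noise=0.1, seed=0):
    """Return the best parameter vector found after a number of generations."""
    rng = random.Random(seed)
    # Create a batch of models with different parameters.
    pop = [[rng.uniform(-1, 1) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Score every model and keep only the highest scorers.
        pop.sort(key=score, reverse=True)
        parents = pop[:n_keep]
        # Reproduction: each child parameter is a function of its parents'.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(pa + pb) / 2 + rng.gauss(0, noise)
                     for pa, pb in zip(a, b)]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

def score(params):
    """Hypothetical score: higher is better, best possible at (0.5, -0.25)."""
    return -((params[0] - 0.5) ** 2 + (params[1] + 0.25) ** 2)

best = evolve(score, n_params=2)
```

Because the parents survive into the next generation, the best score never decreases from one generation to the next.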

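Hopfield networks

A Hopfield network is a recurrent network with a high degree of internal connectivity: every neuron is connected to every other neuron (although a neuron does not influence its own activation). If the network has $K$ neurons, it contains $K(K-1)$ connections, counting each direction separately. Writing $w_{ij}$ for the weight between neuron $i$ and neuron $j$^{[note 4]}^[35]:

w_{ii}=0,\forall i

(no neuron can influence its own activation)

w_{ij}=w_{ji},\forall i,j

(all connections between neurons are symmetric)

A Hopfield network is dynamic: once one of its neurons is activated, it causes other neurons to activate, and those neurons can in turn activate it again^[35]. That is:

s_{i}\leftarrow \left\{{\begin{array}{ll}+1&{\mbox{if }}\sum _{j}{w_{ij}s_{j}}\geq \theta _{i},\\-1&{\mbox{otherwise.}}\end{array}}\right.

where:

$s_{i}$ is the state of neuron $i$;
$\theta _{i}$ is the threshold of neuron $i$ (if the input to neuron $i$ is at least $\theta _{i}$ its state becomes $+1$; otherwise its state becomes $-1$).

The neurons therefore keep passing signals to one another and changing their activation until the network reaches a stable state, one in which the activation of every neuron is constant or roughly constant^[36]^[37]. This is somewhat similar to the human brain, and research in cognitive science has shown that variants of the Hopfield network can be used to model certain memory functions of the brain^[38].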
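The update rule and the two weight constraints can be sketched as follows. The update in `recall` follows the rule $s_{i}\leftarrow +1$ if $\sum _{j}w_{ij}s_{j}\geq \theta _{i}$, else $-1$; the Hebbian rule used in `train_hebbian` to set the weights is a common choice that the text above does not specify, so it is an assumption here, as are the function names and the stored pattern.

```python
def train_hebbian(patterns):
    """Build symmetric weights with a zero diagonal (Hebbian rule, an assumption)."""
    k = len(patterns[0])
    w = [[0.0] * k for _ in range(k)]
    for p in patterns:
        for i in range(k):
            for j in range(k):
                if i != j:                       # w_ii stays 0
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, state, theta=None, max_sweeps=10):
    """Update neurons one at a time until no state changes (a stable state)."""
    s = list(state)
    k = len(s)
    theta = theta or [0.0] * k
    for _ in range(max_sweeps):
        changed = False
        for i in range(k):
            total = sum(w[i][j] * s[j] for j in range(k))
            new = 1 if total >= theta[i] else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

stored = [1, -1, 1, -1, 1, -1]
w = train_hebbian([stored])
noisy = [1, -1, 1, -1, 1, 1]        # last neuron flipped
recovered = recall(w, noisy)
```

Starting from a corrupted copy of the stored pattern, the network settles back into the stored pattern, which is the sense in which it behaves like a memory.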

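Bidirectional associative memory

A Hopfield network can be extended into a bidirectional associative memory (BAM). The concept builds on the associative memory observed in humans (see classical conditioning and operant conditioning). A BAM network has two layers of neurons, $X$ and $Y$, that are fully connected to each other: for every neuron $x_{ij}$ in $X$, every $y_{ij}\in Y$ (any neuron of $Y$) is connected to $x_{ij}$, and vice versa^[39]. Research has shown that this architecture can be used to train a computer to remember which inputs are associated with which other inputs^{[note 5]}^[40].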

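Boltzmann machines

A Boltzmann machine is a stochastic recurrent network closely related to the Hopfield network. Its neurons have symmetric connections, and each neuron uses a randomised procedure to decide whether to activate^[41]^[42]. A Boltzmann machine follows equations like these^[43]^{[note 6]}:

z_{i}=b_{i}+\sum _{j}s_{j}w_{ij}

\Pr(s_{i}=1)={\frac {1}{1+e^{-z_{i}}}}

Here $z_{i}$ is the input received by the $i$-th neuron; $b_{i}$ is its bias (neuron $i$'s own tendency to activate); $s_{j}$ is the state of the $j$-th of the other neurons (which can only be 1 or 0); and $w_{ij}$ is the weight from neuron $j$ to neuron $i$. $\Pr(s_{i}=1)$ is the probability that $s_{i}$ (the state of neuron $i$) equals 1, and according to the second equation its value depends on $z_{i}$^[41].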
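The two equations translate into a short stochastic update for a single neuron. This is a minimal sketch: the weights, biases, the two-neuron network and the function names are hypothetical, and the sum skips $j=i$ on the assumption that a neuron has no connection to itself.

```python
import math
import random

def unit_on_probability(i, s, w, b):
    """Pr(s_i = 1) = 1 / (1 + exp(-z_i)), with z_i = b_i + sum_j s_j * w_ij."""
    z_i = b[i] + sum(s[j] * w[i][j] for j in range(len(s)) if j != i)
    return 1.0 / (1.0 + math.exp(-z_i))

def sample_unit(i, s, w, b, rng):
    """Draw the new state of neuron i: 1 with probability Pr(s_i = 1), else 0."""
    return 1 if rng.random() < unit_on_probability(i, s, w, b) else 0

w = [[0.0, 2.0],
     [2.0, 0.0]]          # symmetric weights, zero diagonal
b = [0.0, -1.0]
s = [1, 1]

p0 = unit_on_probability(0, s, w, b)      # z_0 = 0 + 1 * 2 = 2
rng = random.Random(0)
draws = [sample_unit(0, s, w, b, rng) for _ in range(10000)]
frequency = sum(draws) / len(draws)
```

Unlike the threshold rule of a Hopfield network, the same input can produce either state; over many draws the fraction of 1s approaches `p0`.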

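Boltzmann machines have both theoretical and practical value. On the theoretical side, many cognitive science researchers are interested in their behaviour and properties, because in many respects (for example in not being organised into layers) they resemble biological neural networks more closely than ordinary artificial neural networks do; many cognitive scientists therefore use Boltzmann machines to model mind-related phenomena in theoretical research^[44]. On the practical side, machine learning research has shown that, given a suitable training algorithm, Boltzmann machines can be efficient enough to be used in some real applications^[45]^[46].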

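Restricted Boltzmann machines

A restricted Boltzmann machine (RBM) is a variant of the Boltzmann machine. Like an ordinary Boltzmann machine it has hidden units and visible units; the point is that the connections between its units are "restricted". Hidden units may not be connected to one another, nor may visible units; every connection links one hidden unit to one visible unit, and every visible unit is connected to all hidden units.

[Figure: a restricted Boltzmann machine with four visible units ($v_{i}$) and three hidden units ($h_{i}$)^[47]^[48]]

Imagine a researcher doing the following. They feed a set of inputs to the visible units, putting them in state $\mathbf {v} _{0}$, and let the hidden units activate according to $\mathbf {v} _{0}$ and the weights into state $\mathbf {h} _{0}$ (the forward pass). The second step is reconstruction: the hidden state $\mathbf {h} _{0}$ is used as input, and the visible units change to state $\mathbf {v} _{1}$ according to this input and the weights. Because the weights are usually initialised to random values, the difference between $\mathbf {v} _{1}$ and $\mathbf {v} _{0}$ (the reconstruction error) is at first quite large, and the task of reconstructing the original input fails. During this process the restricted Boltzmann machine yields two pieces of information^[47]^[49]:

During the forward pass, the network gives information about $\Pr(\mathbf {h} _{0}|\mathbf {v} _{0})$ (the probability distribution of $\mathbf {h} _{0}$ estimated from $\mathbf {v} _{0}$); and
During reconstruction, the network gives information about $\Pr(\mathbf {v} _{1}|\mathbf {h} _{0})$ (the probability distribution of $\mathbf {v} _{1}$ estimated from $\mathbf {h} _{0}$).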
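The forward pass and the reconstruction can be sketched for a machine with four visible and three hidden units. This is an illustration only: the logistic form of the conditional probabilities is the usual choice for binary RBMs, and the bias vectors `a` and `c`, the random initial weights, and the input `v0` are assumptions made for the example.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def hidden_probs(v, W, c):
    """Pr(h_j = 1 | v) for every hidden unit j (forward pass)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, a):
    """Pr(v_i = 1 | h) for every visible unit i (reconstruction)."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    """Turn probabilities into binary states."""
    return [1 if rng.random() < p else 0 for p in probs]

rng = random.Random(0)
n_visible, n_hidden = 4, 3
# Weights start as small random values; W[i][j] links visible i and hidden j.
W = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible      # visible biases (hypothetical name)
c = [0.0] * n_hidden       # hidden biases (hypothetical name)

v0 = [1, 0, 1, 1]
p_h0 = hidden_probs(v0, W, c)       # information about Pr(h0 | v0)
h0 = sample(p_h0, rng)
p_v1 = visible_probs(h0, W, a)      # information about Pr(v1 | h0)
v1 = sample(p_v1, rng)
reconstruction_error = sum((x - y) ** 2 for x, y in zip(v0, v1))
```

With untrained weights the probabilities all sit near 0.5, so `v1` is close to a coin flip per unit; training would adjust `W`, `a` and `c` to bring `v1` closer to `v0`.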

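Restricted Boltzmann machines can be used for deep learning. Suppose a learning algorithm has been run and the machine can now reliably reconstruct the original input, so that $\mathbf {v} _{1}$ ≈ $\mathbf {v} _{0}$ every time (see also autoencoder). Its input is an image with a number of pixels ($\mathbf {v}$), and the hidden units ($\mathbf {h}$) represent which body parts appear in the image. Next, the researcher uses the hidden layer $\mathbf {h}$ as input and stacks a further hidden layer ($\mathbf {\text{new h}}$) on top, which represents what animal the image shows. The resulting network can do the following:

Accurately compute $\Pr(\mathbf {h} _{0}|\mathbf {v} _{0})$, equivalent to "seeing this image, estimate which body parts it contains";
Accurately compute $\Pr(\mathbf {v} _{1}|\mathbf {h} _{0})$, equivalent to "thinking of these body parts, what would the image roughly look like";
Accurately compute $\Pr(\mathbf {\text{new h}} _{0}|\mathbf {h} _{0})$, equivalent to "given the body parts in the image, estimate what animal it is" (for example, if the image shows four legs, the creature is probably not an insect);
Accurately compute $\Pr(\mathbf {h} _{0}|\mathbf {\text{new h}} _{0})$, equivalent to "given this kind of animal, which body parts does it have".

Such a network achieves a layered knowledge representation (see also deep belief network)^[49]^[50].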

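Long short-term memory

Long short-term memory (LSTM) is a recurrent neural network that can hold long-term memories. An ordinary recurrent network can process sequential information such as sentences written in natural language, but real-world problem solving often requires an agent to spot causes and effects that are far apart in time; fully understanding the ending of a story, for example, may require recalling events from its beginning. Ordinary recurrent networks have been shown to struggle with problems of this kind^[51].

An LSTM network is characterised by built-in gated neurons. These special artificial neurons have "gates" that decide what information to store and when to read, write or delete it: in the code of an LSTM network, an algorithm helps the gated cells decide when to change their own activation. An ordinary recurrent network, by contrast, cannot control whether existing past information is overwritten when new information arrives. This feature lets an LSTM network store information for a long time and find causes and effects separated by variable time lags in its input^[51]^[52].

[Figure: a single cell of a basic LSTM. The box is the whole cell, $x_{t}$ is the input at time step $t$, and $h_{t}$ is the cell's output at time step $t$. An arrow from A to B means A can influence B, and a cross marks a point where a gate acts.]

Besides the main cell ($c$), an LSTM neuron has several components^[51]:

the output gate ($o$),
the input gate ($i$), and
the forget gate ($f$).

Each of these three gates can be thought of as a sub-neuron with its own activation value. It is influenced by other artificial neurons, which determines its activation, and its activation then decides whether it acts. When the output gate acts, it modulates or even blocks the cell's output; when the input gate acts, it modulates or even blocks the cell's input; and when the forget gate acts, it can completely erase the information in the main cell, that is, reset the value the main cell is storing^[53]^[54]. Research has shown that LSTM networks can be used to teach computers tasks that require long-term memory, such as reading comprehension^[51]^[55].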
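The interplay of the three gates can be sketched with a scalar LSTM cell. The gate equations below are the commonly used formulation (logistic gates, `tanh` candidate), which the text above describes only in words, so the exact equations, the parameter names and the bias values are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; returns (h_t, c_t)."""
    i = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_c"] * x_t + p["u_c"] * h_prev + p["b_c"])  # candidate value
    c_t = f * c_prev + i * g        # keep part of the old memory, write part of the new
    h_t = o * math.tanh(c_t)        # the output gate scales what leaves the cell
    return h_t, c_t

def make_params(**overrides):
    """All parameters default to zero; overrides set the ones of interest."""
    names = ["w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
             "w_o", "u_o", "b_o", "w_c", "u_c", "b_c"]
    p = dict.fromkeys(names, 0.0)
    p.update(overrides)
    return p

# Forget gate held open (f close to 1), input gate closed: the memory persists.
remember = make_params(b_f=10.0, b_i=-10.0)
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, remember)
kept = c

# Forget gate shut (f close to 0): the stored value is erased in one step.
forget = make_params(b_f=-10.0, b_i=-10.0)
_, wiped = lstm_step(0.0, 0.0, 1.0, forget)
```

After 50 steps the stored value is still close to its starting value of 1 when the forget gate stays open, whereas a single step with the forget gate shut drives it close to 0.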

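Gated recurrent units

In 2014, long short-term memory inspired the gated recurrent unit (GRU). A GRU is very similar to an LSTM but uses only two gates^[56]^[57]:

The reset gate sits between cell $i$ and the same cell at the next moment, and decides whether the cell should be reset (forget its state from the previous moment);
The update gate decides how strongly the subsequent activation of $i$ is influenced by the current input.

A GRU is essentially an LSTM without the output gate.

[Figure: a gated recurrent unit^[57]]

In the figure, $o_{t}$ is the output at time step $t$, $x_{t}$ is the input at time step $t$, $h_{t}$ is the hidden-layer state at time step $t$, $R_{t}$ is the reset gate, and $Z_{t}$ is the update gate.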
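The two gates can be sketched with a scalar GRU step. The equations below follow a commonly used formulation in which the new state is a blend $h_{t}=Z_{t}h_{t-1}+(1-Z_{t}){\tilde {h}}_{t}$ and the reset gate scales $h_{t-1}$ inside the candidate ${\tilde {h}}_{t}$; the text above gives no equations, so this form, the parameter names and the numeric values are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x_t, h_prev, p):
    """One step of a scalar gated recurrent unit; returns h_t."""
    r = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])   # reset gate R_t
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])   # update gate Z_t
    # Candidate state: the reset gate decides how much of h_prev is visible.
    candidate = math.tanh(p["w_h"] * x_t + p["u_h"] * (r * h_prev) + p["b_h"])
    # Update gate blends the old state with the candidate.
    return z * h_prev + (1.0 - z) * candidate

base = {"w_r": 0.0, "u_r": 0.0, "b_r": 0.0,
        "w_z": 0.0, "u_z": 0.0, "b_z": 0.0,
        "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}

# Update gate near 1: the old state is carried over almost unchanged.
carried = gru_step(0.3, 0.8, {**base, "b_z": 20.0})

# Update gate near 0 and reset gate near 0: the old state is ignored,
# so the new state depends on the current input alone.
fresh = gru_step(0.3, 0.8, {**base, "b_z": -20.0, "b_r": -20.0})
```

In the first call the result stays at the previous state of 0.8; in the second it is essentially `tanh(0.3)`, whatever the previous state was.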

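Neural Turing machines

The neural Turing machine (NTM) is a memory-augmented recurrent neural network architecture introduced in 2014 ("memory-augmented" meaning that the network is connected to a memory unit outside itself). A neural Turing machine has the following main parts^[58]^[59]:

A neural network controller; this part is a multilayer perceptron or a recurrent neural network^{[note 7]}. It interacts with the outside world (taking vector inputs from it and giving vector outputs to it) and controls the whole system by issuing instructions to the other parts.
A memory; this part stores the memories.
One or more read heads, and
One or more write heads.

[Figure: structure of a neural Turing machine^[60]]

Reading from memory: think of the memory as an $N\times M$ matrix, where $N$ is the number of memory locations and $M$ is the dimension of each location (simply put, how many values that location can hold). Let $\mathbf {M} _{t}$ be the state of this matrix at time step $t$, and let $\mathbf {w} _{t}$ be a weight vector from the read head over the $N$ locations, normalised so that

\sum _{i}w_{t}(i)=1,\qquad 0\leq w_{t}(i)\leq 1,\forall i

Each time a read is performed, the controller receives an $M$-dimensional vector $\mathbf {r} _{t}$, where

\mathbf {r} _{t}\leftarrow \sum _{i}w_{t}(i)\mathbf {M} _{t}(i)

Put simply: there are weights between the read head and the memory, and these weights determine which memories are retrieved when the read head requests a read; the retrieved memory is $\mathbf {r} _{t}$^[58].

Writing to memory: a write has two parts, an erase and an add. At each time step $t$, the write head emits three vectors:

an $N$-dimensional weight vector $\mathbf {w} _{t}$,
an $M$-dimensional erase vector $\mathbf {e} _{t}$, and
an $M$-dimensional add vector $\mathbf {a} _{t}$.

Each memory vector $\mathbf {M} _{t-1}(i)$ is first modified by the erase vector:

{\text{temp}}\mathbf {M} _{t}(i)=\mathbf {M} _{t-1}(i)[\mathbf {1} -w_{t}(i)\mathbf {e} _{t}]

(where $\mathbf {1}$ is a vector consisting entirely of 1s and the multiplication is element-wise);

the add vector is then applied:

\mathbf {M} _{t}(i)={\text{temp}}\mathbf {M} _{t}(i)+w_{t}(i)\mathbf {a} _{t}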
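Reading and writing can be sketched on a small memory matrix. The sketch assumes the erase step multiplies each location element-wise by $\mathbf {1} -w_{t}(i)\mathbf {e} _{t}$ before $w_{t}(i)\mathbf {a} _{t}$ is added, as in the neural Turing machine paper cited above; the function names and the example values are hypothetical.

```python
def read(memory, w):
    """r_t = sum_i w_t(i) * M_t(i): a weighted blend of the N memory rows."""
    n_dims = len(memory[0])
    return [sum(w[i] * memory[i][k] for i in range(len(memory)))
            for k in range(n_dims)]

def write(memory, w, erase, add):
    """Erase then add, location by location; returns the new memory matrix."""
    new_memory = []
    for i, row in enumerate(memory):
        new_row = [row[k] * (1.0 - w[i] * erase[k])   # erase (element-wise)
                   + w[i] * add[k]                    # add
                   for k in range(len(row))]
        new_memory.append(new_row)
    return new_memory

# N = 3 locations, each holding M = 2 values.
memory = [[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]]

focused = read(memory, [0.0, 1.0, 0.0])     # all weight on location 1
blended = read(memory, [0.5, 0.5, 0.0])     # half of location 0, half of location 1

# Fully erase location 1 and store a new vector there.
updated = write(memory, w=[0.0, 1.0, 0.0], erase=[1.0, 1.0], add=[9.0, 8.0])
```

A weight vector concentrated on one location reads or overwrites just that row, while a spread-out weight vector reads a blend and writes a little to several rows at once.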

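In applications, neural Turing machines can be used to model the cognitive processes involved in working memory (a kind of memory function in which the mind temporarily holds information it needs to use)^[58]^[60].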

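Applications

Game AI

[Figure: a 1982 Pac-Man game; the ghosts (NPCs) cannot anticipate the player's behaviour and act only on simple rules.]

Recurrent neural networks are also useful in video game AI:

A cognitive system (for example a human brain or an AI) can build an internal model of the world from the information it perceives. An internal model is a mathematical model describing how the world works, and the cognitive system can use it to predict the future^{[note 8]}^[61]. Many game-AI researchers want to teach game AI to build internal models and to use them to predict changes in the game world and make decisions, so that the AI in video games behaves more like real human intelligence.
Events in a video game are sequential. In a shooter, for instance, the probability that a player reloads should rise after they have fired a number of shots. Some game-AI researchers have therefore argued that recurrent networks, which are particularly good at sequential information, should be an effective way to teach NPCs to predict the future. A simple example is a recurrent network that takes the current state of the game world as input and outputs the predicted state of the game world one second later^[62]^[63].

Example: the VMC model

Researchers have built a recurrent neural network system for video game AI with the following parts^[62]:

V (vision): a visual perception system that takes the image seen by the camera of the NPC being controlled as input and outputs the current state of the game world;
M (memory): a memory system, which is a recurrent neural network whose input is the current state of the game world and whose output is the predicted future state of the game world;
C (controller): a decision system that takes the outputs of V and M as its input and decides what action to take (the action is its output).

[Figure: diagram of the system]

At each time step $t$, the system does the following:

1. The environment passes the image seen by the camera (in practice a high-dimensional vector ${\text{image}}_{t}$) to V;
2. V can be a neural network (see convolutional neural network and deep learning) whose output ($v_{t}$) encodes what is seen (for example, 1 for an obstacle, 2 for an enemy, and so on); assume for now that V has been trained with supervised learning and makes accurate judgements;
3. M takes its own state at the previous time step ($m_{t-1}$) and $v_{t}$ as input, and its output vector ($m_{t}$) represents the predicted ${\text{image}}_{t+1}$ (what will be seen at the next moment); if the randomness of the game world and of player behaviour is taken into account, $m_{t}$ can instead be a probability distribution over the predicted ${\text{image}}_{t+1}$;
4. C is another neural network that takes $v_{t}$ and $m_{t}$ as its input, works out which action brings the greatest payoff, and decides which action to take (its output). In a video game the number of possible actions (which button to press) is usually quite limited, so they can be represented by a low-dimensional vector; see also reinforcement learning.
5. The output of C changes the environment; return to step 1.

Research has shown that a neural network system of this kind can teach an AI to play games^[62].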

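Other applications

Machine translation^[64]
Robot control^[65]
Time series prediction^[66]
Anomaly detection in time series^[67]
Teaching AI to compose music^[68]
Teaching AI to learn grammar^[69]^[70]
Handwriting recognition^[71]^[72]

... and so on.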

Literature

Elman, Jeffrey L. (1990). "Finding Structure in Time" (PDF). Cognitive Science. 14 (2): 179–211.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines (PDF). arXiv preprint arXiv:1410.5401.
Hu, S. G., Liu, Y., Liu, Z., Chen, T. P., Wang, J. J., Yu, Q., ... & Hosaka, S. (2015). Associative memory realized by a reconfigurable memristive Hopfield neural network (PDF). Nature communications, 6(1), 1-8.
Li, Shuai; Li, Wanqing; Cook, Chris; Zhu, Ce; Yanbo, Gao (2018). "Independently Recurrent Neural Network (IndRNN): Building a Longer and Deeper RNN" (PDF).
Siegelmann, H. T., & Sontag, E. D. (1992, July). On the computational power of neural nets (PDF). In Proceedings of the fifth annual workshop on Computational learning theory (pp. 440-449).

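Notes

[note 1] Sequential means that earlier information influences later information.
[note 2] "$a\in (...)$" means "$a$ is among the things in $(...)$."
[note 3] That is, computing the partial derivative of the error with respect to each weight ($\partial z/\partial w_{i}$).
[note 4] The mathematical symbol "$\forall$" means "for all"; "$a,\forall i$" means "$a$ holds whichever $i$ is chosen."
[note 5] Example: once the weights are set, a certain activation pattern in $X$ tends to bring about a particular activation pattern in $Y$.
[note 6] Rather than, as in a Hopfield network, deciding whether to activate by checking whether the input exceeds a threshold.
[note 7] Even if the controller is a feedforward network, the neural Turing machine as a whole has a memory function, so the machine as a whole is still a recurrent network.
[note 8] That is, computing $\Pr({\text{future}}\mid {\text{past}})$. See Bayesian network.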

Citations

↑ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.
↑ ^2.0 ^2.1 Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117.
↑ ^3.0 ^3.1 Mandic, Danilo P. & Chambers, Jonathon A. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley.
↑ ^4.0 ^4.1 Recurrent Neural Network. Brilliant.org.
↑ Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition".
↑ Graves, Alex; Liwicki, Marcus; Fernandez, Santiago; Bertolami, Roman; Bunke, Horst; Schmidhuber, Jürgen (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868.
↑ Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.
↑ Siegelmann, H.T.; Sontag, E.D. (1991). "Turing computability with neural nets". Appl. Math. Lett. 4 (6): 77–80.
↑ Sak, Hasim; Senior, Andrew; Beaufays, Francoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling" (PDF). Archived at the Internet Archive on 24 April 2018.
↑ Miikkulainen R. (2011) Simple Recurrent Network. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA.
↑ ^11.0 ^11.1 Cruse, H. (1996). Neural networks as cybernetic systems. Stuttgart: Thieme.
↑ Cheng, Y. C., Qi, W. M., & Cai, W. Y. (2002, November). Dynamic properties of Elman and modified Elman neural network. In Proceedings. International Conference on Machine Learning and Cybernetics (Vol. 2, pp. 637-640). IEEE.
↑ Elman, Jeffrey L. (1990). "Finding Structure in Time". Cognitive Science. 14 (2): 179–211.
↑ Haselsteiner, E. (1998, May). What Elman networks cannot do. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227) (Vol. 2, pp. 1245-1249). IEEE.
↑ Ren, G., Cao, Y., Wen, S., Huang, T., & Zeng, Z. (2018). A modified Elman neural network with a new learning rate scheme. Neurocomputing, 286, 11-18.
↑ Jordan, Michael I. (1997-01-01). Serial Order: A Parallel Distributed Processing Approach. Advances in Psychology. Neural-Network Models of Cognition. 121. pp. 471–495.
↑ ^17.0 ^17.1 Schuster, Mike, and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." Signal Processing, IEEE Transactions on 45.11 (1997): 2673-2681.
↑ Understanding Bidirectional RNN in PyTorch. Towards Data Science.
↑ T. Robinson, M. Hochberg, and S. Renals, "The use of recurrent neural networks in continuous speech recognition," in Automatic Speech Recognition: Advanced Topics, C. H. Lee, F. K. Soong, and K. K. Paliwal, Eds. Boston, MA: Kluwer, 1996, pp. 233–258.
↑ Graves, Alex; Schmidhuber, Jürgen (2005-07-01). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks. IJCNN 2005. 18 (5): 602–610.
↑ Mei, Song (2018). "A mean field view of the landscape of two-layer neural networks". Proceedings of the National Academy of Sciences. 115 (33): E7665–E7671.
↑ Hochreiter, Sepp; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons.
↑ Dreyfus, Stuart (1962). "The numerical solution of variational problems". Journal of Mathematical Analysis and Applications. 5 (1): 30–45.
↑ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.
↑ Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, pp. 111–114.
↑ ^26.0 ^26.1 Werbos, Paul J. (1988). "Generalization of backpropagation with application to a recurrent gas market model". Neural Networks. 1 (4): 339–356.
↑ Williams, Ronald J.; Zipser, D. (1 February 2013). "Gradient-based learning algorithms for recurrent networks and their computational complexity". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications. Psychology Press.
↑ Gomez, Faustino J.; Schmidhuber, Jürgen; Miikkulainen, Risto (June 2008). "Accelerated Neural Evolution Through Cooperatively Coevolved Synapses". Journal of Machine Learning Research. 9: 937–965.
↑ Gomez, Faustino J.; Miikkulainen, Risto (1999), "Solving non-Markovian control tasks with neuroevolution" (PDF), IJCAI 99, Morgan Kaufmann.
↑ Global Optimization. Wolfram MathWorld.
↑ Goldberg, David E.; Holland, John H. (1988). "Genetic algorithms and machine learning". Machine Learning. 3 (2): 95–99.
↑ Michie, D.; Spiegelhalter, D. J.; Taylor, C. C. (1994). "Machine Learning, Neural and Statistical Classification". Ellis Horwood Series in Artificial Intelligence.
↑ Zhang, Jun; Zhan, Zhi-hui; Lin, Ying; Chen, Ni; Gong, Yue-jiao; Zhong, Jing-hui; Chung, Henry S.H.; Li, Yun; Shi, Yu-hui (2011). "Evolutionary Computation Meets Machine Learning: A Survey" (PDF). Computational Intelligence Magazine. 6 (4): 68–75.
↑ Syed, Omar (May 1995). "Applying Genetic Algorithms to Recurrent Neural Networks for Learning Network Parameters and Architecture". M.Sc. thesis, Department of Electrical Engineering, Case Western Reserve University, Advisor Yoshiyasu Takefuji.
↑ ^35.0 ^35.1 Hopfield Networks are useless. Here's why you should learn them. Towards Data Science.
↑ Park, J. H., Kim, Y. S., Eom, I. K., & Lee, K. Y. (1993). Economic load dispatch for piecewise quadratic cost function using Hopfield neural network. IEEE transactions on power systems, 8(3), 1030-1038.
↑ Zhu, Y., & Yan, Z. (1997). Computerized tumor boundary detection using a Hopfield neural network. IEEE transactions on medical imaging, 16(1), 55-67.
↑ Hu, S. G., Liu, Y., Liu, Z., Chen, T. P., Wang, J. J., Yu, Q., ... & Hosaka, S. (2015). Associative memory realized by a reconfigurable memristive Hopfield neural network (PDF). Nature communications, 6(1), 1-8.
↑ Rojas, Rául (1996). Neural networks: a systematic introduction. Springer. p. 336.
↑ Kosko, Bart (1988). "Bidirectional associative memories". IEEE Transactions on Systems, Man, and Cybernetics. 18 (1): 49–60.
↑ ^41.0 ^41.1 Ackley, David H; Hinton Geoffrey E; Sejnowski, Terrence J (1985), "A learning algorithm for Boltzmann machines" (PDF), Cognitive Science, 9 (1): 147–169,
↑ Boltzmann Machines (PDF).
↑ Boltzmann machine. Scholarpedia.
↑ Aarts, E. H., & Korst, J. H. (1989). Boltzmann machines for travelling salesman problems (PDF). European Journal of Operational Research, 39(1), 79-95.
↑ Hjelm, R. D., Calhoun, V. D., Salakhutdinov, R., Allen, E. A., Adali, T., & Plis, S. M. (2014). Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. NeuroImage, 96, 245-260.
↑ Sui, C., Bennamoun, M., & Togneri, R. (2015). Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines (PDF). In Proceedings of the IEEE International Conference on Computer Vision (pp. 154-162).
↑ ^47.0 ^47.1 A Beginner's Guide to Restricted Boltzmann Machines (RBMs). Pathmind.
↑ Larochelle, H.; Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines (PDF). Proceedings of the 25th international conference on Machine learning - ICML '08.
↑ ^49.0 ^49.1 Restricted Boltzmann Machines - Simplified. Towards Data Science.
↑ Bengio, Y. (2009). "Learning Deep Architectures for AI". Foundations and Trends in Machine Learning. 2 (1): 1–127.
↑ ^51.0 ^51.1 ^51.2 ^51.3 Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation. 9(8): 1735–1780.
↑ Bayer, Justin; Wierstra, Daan; Togelius, Julian; Schmidhuber, Jürgen (14 September 2009). Evolving Memory Cell Structures for Sequence Learning (PDF). Artificial Neural Networks – ICANN 2009. Lecture Notes in Computer Science. 5769. Springer, Berlin, Heidelberg. pp. 755–764.
↑ Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "Sequence labelling in structured domains with hierarchical recurrent neural networks". In Proc. 20th Int. Joint Conf. On Artificial Intelligence, Ijcai 2007: 774–779.
↑ Graves, Alex; Fernández, Santiago; Gomez, Faustino (2006). "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks". In Proceedings of the International Conference on Machine Learning, ICML 2006: 369–376.
↑ Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of machine learning research, 3(Aug), 115-143.
↑ GRUs vs. LSTMs.
↑ ^57.0 ^57.1 Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation".
↑ ^58.0 ^58.1 ^58.2 Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014). "Neural Turing Machines" (PDF). arXiv:1410.5401.
↑ Collier, Mark; Beel, Joeran (2018), "Implementing Neural Turing Machines" (PDF), Artificial Neural Networks and Machine Learning - ICANN 2018, Springer International Publishing, pp. 94–104,
↑ ^60.0 ^60.1 NTM: Neural Turing Machines. Medium.
↑ G. W. Maus, J. Fischer, and D. Whitney. Motion-dependent representation of space in area MT+. Neuron, 78(3):554–562, 2013.
↑ ^62.0 ^62.1 ^62.2 Ha, D., & Schmidhuber, J. (2018). Recurrent world models facilitate policy evolution (PDF). In Advances in Neural Information Processing Systems (pp. 2450-2462).
↑ J. Schmidhuber. Making the world differentiable: On using supervised learning fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technische Universität München Tech. Report: FKI-126-90, 1990.
↑ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". Electronic Proceedings of the Neural Information Processing Systems Conference. 27: 5346.
↑ Mayer, Hermann; Gomez, Faustino J.; Wierstra, Daan; Nagy, Istvan; Knoll, Alois; Schmidhuber, Jürgen (October 2006). A System for Robotic Heart Surgery that Learns to Tie Knots Using Recurrent Neural Networks. 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 543–548.
↑ Wierstra, Daan; Schmidhuber, Jürgen; Gomez, Faustino J. (2005). "Evolino: Hybrid Neuroevolution/Optimal Linear Search for Sequence Learning". Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh: 853–858.
↑ Malhotra, Pankaj; Vig, Lovekesh; Shroff, Gautam; Agarwal, Puneet (April 2015). "Long Short Term Memory Networks for Anomaly Detection in Time Series". European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning — ESANN 2015.
↑ Eck, Douglas; Schmidhuber, Jürgen (2002-08-28). Learning the Long-Term Structure of the Blues. Artificial Neural Networks — ICANN 2002. Lecture Notes in Computer Science. 2415. Berlin, Heidelberg: Springer. pp. 284–289.
↑ Schmidhuber, Jürgen; Gers, Felix A.; Eck, Douglas (2002). "Learning nonregular languages: A comparison of simple recurrent networks and LSTM". Neural Computation. 14 (9): 2039–2041.
↑ Gers, Felix A.; Schmidhuber, Jürgen (2001). "LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages" (PDF). IEEE Transactions on Neural Networks. 12 (6): 1333–1340. Archived at the Internet Archive on 10 July 2020.
↑ Graves, Alex; Schmidhuber, Jürgen (2009). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". Advances in Neural Information Processing Systems 22, NIPS'22. Vancouver (BC): MIT Press: 545–552.
↑ Graves, Alex; Fernández, Santiago; Liwicki, Marcus; Bunke, Horst; Schmidhuber, Jürgen (2007). Unconstrained Online Handwriting Recognition with Recurrent Neural Networks. Proceedings of the 20th International Conference on Neural Information Processing Systems. NIPS'07. Curran Associates Inc. pp. 577-584.

Links

Seq2SeqSharp LSTM/BiLSTM/Transformer recurrent neural networks framework running on CPUs and GPUs for sequence-to-sequence tasks.

[5] 有連串性（sequential）指前面嘅資訊會影響後面嘅資訊嘅意思。

[18] 「 $a\in (...)$ 」意思係「 $a$ 喺 $(...)$ 呢柞嘢當中。」

[25] 即係計誤差隨每個權重嘅偏導數（ $\partial z/\partial w_{i}$ ）。

[38] 「 $\forall$ 」呢個數學符號意思係「for all」；「 $a,\forall i$ 」意思就係「無論係邊個 $i$ ， $a$ 都係真確。」

[44] 例：權重設好咗， $X$ 有某個啟動規律傾向引致 $Y$ 又有某個特定嘅啟動規律。

[49] 而唔係好似霍菲特網絡噉，睇輸入大唔大過門檻值決定啟唔啟動。

[66] 就算個控制器係一個前饋網絡都好，因為成部神經圖靈機整體有記憶功能，所以部神經圖靈機整體依然係一個遞迴網絡。

[68] 即係計算 $\Pr({\text{meih lòih }}|{\text{gwo heui}})$ 。可以睇吓貝葉斯網絡。

[1] Williams, Ronald J.; Hinton, Geoffrey E.; Rumelhart, David E. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.

[sch-2] 2.0 ^2.1 Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117.

[mandic-3] 3.0 ^3.1 Mandic, Danilo P. & Chambers, Jonathon A. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley.

[brill-4] 4.0 ^4.1 Recurrent Neural Network. Brilliant.org.

[6] Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition".

[7] Graves, Alex; Liwicki, Marcus; Fernandez, Santiago; Bertolami, Roman; Bunke, Horst; Schmidhuber, Jürgen (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868.

[8] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.

[9] Siegelmann, H.T.; Sontag, E.D. (1991). "Turing computability with neural nets". Appl. Math. Lett. 4 (6): 77–80.

[sak-10] Sak, Hasim; Senior, Andrew; Beaufays, Francoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling 互聯網檔案館嘅歸檔，歸檔日期2018年4月24號，." (PDF).

[11] Miikkulainen R. (2011) Simple Recurrent Network. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA.

[cruse1996-12] 11.0 ^11.1 Cruse, H. (1996). Neural networks as cybernetic systems. Stuttgart: Thieme.

[13] Cheng, Y. C., Qi, W. M., & Cai, W. Y. (2002, November). Dynamic properties of Elman and modified Elman neural network. In Proceedings. International Conference on Machine Learning and Cybernetics (Vol. 2, pp. 637-640). IEEE.

[14] Elman, Jeffrey L. (1990). "Finding Structure in Time". Cognitive Science. 14 (2): 179–211.

[15] Haselsteiner, E. (1998, May). What Elman networks cannot do. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227) (Vol. 2, pp. 1245-1249). IEEE.

[16] Ren, G., Cao, Y., Wen, S., Huang, T., & Zeng, Z. (2018). A modified Elman neural network with a new learning rate scheme. Neurocomputing, 286, 11-18.

[17] Jordan, Michael I. (1997-01-01). Serial Order: A Parallel Distributed Processing Approach. Advances in Psychology. Neural-Network Models of Cognition. 121. pp. 471–495.

[schuster1997-19] 17.0 ^17.1 Schuster, Mike, and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." Signal Processing, IEEE Transactions on 45.11 (1997): 2673-2681.2. Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan

[20] Understanding Bidirectional RNN in PyTorch. Towards Data Science.

[21] T. Robinson, M. Hochberg, and S. Renals, "The use of recurrent neural networks in continuous speech recognition," in Automatic Speech Recognition: Advanced Topics, C. H. Lee, F. K. Soong, and K. K. Paliwal, Eds. Boston, MA: Kluwer, 1996, pp. 233–258.

[22] Graves, Alex; Schmidhuber, Jürgen (2005-07-01). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks. IJCNN 2005. 18 (5): 602–610.

[23] Mei, Song (2018). "A mean field view of the landscape of two-layer neural networks". Proceedings of the National Academy of Sciences. 115 (33): E7665–E7671.

[24] Hochreiter, Sepp; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons.

[26] Dreyfus, Stuart (1962). "The numerical solution of variational problems". Journal of Mathematical Analysis and Applications. 5 (1): 30–45.

[27] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.

[28] Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, pp. 111–114.

[werbos1988-29] 26.0 ^26.1 Werbos, Paul J. (1988). "Generalization of backpropagation with application to a recurrent gas market model". Neural Networks. 1 (4): 339–356.

[30] Williams, Ronald J.; Zipser, D. (1 February 2013). "Gradient-based learning algorithms for recurrent networks and their computational complexity". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications. Psychology Press.

[31] Gomez, Faustino J.; Schmidhuber, Jürgen; Miikkulainen, Risto (June 2008). "Accelerated Neural Evolution Through Cooperatively Coevolved Synapses". Journal of Machine Learning Research. 9: 937–965.

[32] Gomez, Faustino J.; Miikkulainen, Risto (1999), "Solving non-Markovian control tasks with neuroevolution" (PDF), IJCAI 99, Morgan Kaufmann.

[33] Global Optimization. Wolfram MathWorld.

[34] Goldberg, David E.; Holland, John H. (1988). "Genetic algorithms and machine learning". Machine Learning. 3 (2): 95–99.

[35] Michie, D.; Spiegelhalter, D. J.; Taylor, C. C. (1994). "Machine Learning, Neural and Statistical Classification". Ellis Horwood Series in Artificial Intelligence.

[36] Zhang, Jun; Zhan, Zhi-hui; Lin, Ying; Chen, Ni; Gong, Yue-jiao; Zhong, Jing-hui; Chung, Henry S.H.; Li, Yun; Shi, Yu-hui (2011). "Evolutionary Computation Meets Machine Learning: A Survey" (PDF). Computational Intelligence Magazine. 6 (4): 68–75.

[37] Syed, Omar (May 1995). "Applying Genetic Algorithms to Recurrent Neural Networks for Learning Network Parameters and Architecture". M.Sc. thesis, Department of Electrical Engineering, Case Western Reserve University, Advisor Yoshiyasu Takefuji.

[39] Hopfield Networks are useless. Here's why you should learn them. Towards Data Science.

[40] Park, J. H., Kim, Y. S., Eom, I. K., & Lee, K. Y. (1993). Economic load dispatch for piecewise quadratic cost function using Hopfield neural network. IEEE transactions on power systems, 8(3), 1030-1038.

[41] Zhu, Y., & Yan, Z. (1997). Computerized tumor boundary detection using a Hopfield neural network. IEEE transactions on medical imaging, 16(1), 55-67.

[42] Hu, S. G., Liu, Y., Liu, Z., Chen, T. P., Wang, J. J., Yu, Q., ... & Hosaka, S. (2015). Associative memory realized by a reconfigurable memristive Hopfield neural network (PDF). Nature communications, 6(1), 1-8.

[43] Rojas, Raúl (1996). Neural networks: a systematic introduction. Springer. p. 336.

[45] Kosko, Bart (1988). "Bidirectional associative memories". IEEE Transactions on Systems, Man, and Cybernetics. 18 (1): 49–60.

[46] Ackley, David H.; Hinton, Geoffrey E.; Sejnowski, Terrence J. (1985). "A learning algorithm for Boltzmann machines" (PDF). Cognitive Science. 9 (1): 147–169.

[47] Boltzmann Machines (PDF).

[48] Boltzmann machine. Scholarpedia.

[50] Aarts, E. H., & Korst, J. H. (1989). Boltzmann machines for travelling salesman problems (PDF). European Journal of Operational Research, 39(1), 79-95.

[51] Hjelm, R. D., Calhoun, V. D., Salakhutdinov, R., Allen, E. A., Adali, T., & Plis, S. M. (2014). Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. NeuroImage, 96, 245-260.

[52] Sui, C., Bennamoun, M., & Togneri, R. (2015). Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines (PDF). In Proceedings of the IEEE International Conference on Computer Vision (pp. 154-162).

[53] A Beginner's Guide to Restricted Boltzmann Machines (RBMs). Pathmind.

[54] Larochelle, H.; Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines (PDF). Proceedings of the 25th international conference on Machine learning - ICML '08.

[55] Restricted Boltzmann Machines - Simplified. Towards Data Science.

[56] Bengio, Y. (2009). "Learning Deep Architectures for AI". Foundations and Trends in Machine Learning. 2 (1): 1–127.

[57] Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780.

[58] Bayer, Justin; Wierstra, Daan; Togelius, Julian; Schmidhuber, Jürgen (14 September 2009). Evolving Memory Cell Structures for Sequence Learning (PDF). Artificial Neural Networks – ICANN 2009. Lecture Notes in Computer Science. 5769. Springer, Berlin, Heidelberg. pp. 755–764.

[59] Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "Sequence labelling in structured domains with hierarchical recurrent neural networks". In Proc. 20th Int. Joint Conf. On Artificial Intelligence, Ijcai 2007: 774–779.

[60] Graves, Alex; Fernández, Santiago; Gomez, Faustino (2006). "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks". In Proceedings of the International Conference on Machine Learning, ICML 2006: 369–376.

[61] Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of machine learning research, 3(Aug), 115-143.

[62] GRUs vs. LSTMs.

[63] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation".

[64] Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014). "Neural Turing Machines" (PDF). arXiv:1410.5401.

[65] Collier, Mark; Beel, Joeran (2018), "Implementing Neural Turing Machines" (PDF), Artificial Neural Networks and Machine Learning - ICANN 2018, Springer International Publishing, pp. 94–104.

[67] NTM: Neural Turing Machines. Medium.

[69] G. W. Maus, J. Fischer, and D. Whitney. Motion-dependent representation of space in area MT+. Neuron, 78(3):554–562, 2013.

[70] Ha, D., & Schmidhuber, J. (2018). Recurrent world models facilitate policy evolution (PDF). In Advances in Neural Information Processing Systems (pp. 2450-2462).

[71] J. Schmidhuber. Making the world differentiable: On using supervised learning fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technische Universität München Tech. Report: FKI-126-90, 1990.

[72] Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". Electronic Proceedings of the Neural Information Processing Systems Conference. 27: 5346.

[73] Mayer, Hermann; Gomez, Faustino J.; Wierstra, Daan; Nagy, Istvan; Knoll, Alois; Schmidhuber, Jürgen (October 2006). A System for Robotic Heart Surgery that Learns to Tie Knots Using Recurrent Neural Networks. 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 543–548.

[74] Wierstra, Daan; Schmidhuber, Jürgen; Gomez, Faustino J. (2005). "Evolino: Hybrid Neuroevolution/Optimal Linear Search for Sequence Learning". Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh: 853–858.

[75] Malhotra, Pankaj; Vig, Lovekesh; Shroff, Gautam; Agarwal, Puneet (April 2015). "Long Short Term Memory Networks for Anomaly Detection in Time Series". European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning — ESANN 2015.

[76] Eck, Douglas; Schmidhuber, Jürgen (2002-08-28). Learning the Long-Term Structure of the Blues. Artificial Neural Networks — ICANN 2002. Lecture Notes in Computer Science. 2415. Berlin, Heidelberg: Springer. pp. 284–289.

[77] Schmidhuber, Jürgen; Gers, Felix A.; Eck, Douglas (2002). "Learning nonregular languages: A comparison of simple recurrent networks and LSTM". Neural Computation. 14 (9): 2039–2041.

[78] Gers, Felix A.; Schmidhuber, Jürgen (2001). "LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages" (PDF). IEEE Transactions on Neural Networks. 12 (6): 1333–1340. Archived at the Internet Archive on 10 July 2020.

[79] Graves, Alex; Schmidhuber, Jürgen (2009). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". Advances in Neural Information Processing Systems 22, NIPS'22. Vancouver (BC): MIT Press: 545–552.

[80] Graves, Alex; Fernández, Santiago; Liwicki, Marcus; Bunke, Horst; Schmidhuber, Jürgen (2007). Unconstrained Online Handwriting Recognition with Recurrent Neural Networks. Proceedings of the 20th International Conference on Neural Information Processing Systems. NIPS'07. Curran Associates Inc. pp. 577-584.
