BLEU

Bilingual evaluation understudy (BLEU; Cantonese: 雙語替換評測, Jyutping: soeng1 jyu5 tai3 wun6 ping4 caak1) is a benchmark used in machine translation (MT) to evaluate MT algorithms.

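Suppose there is an MT algorithm. A researcher can take the algorithm's output and compare it with the translations that professional translators give for the same input. The more closely the algorithm's output resembles the professional translations, the better the algorithm is taken to be.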

Overview

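Evaluating MT systems by hand has one major drawback: it is expensive and slow. People who know both the source language and the target language, and who also have some knowledge of translation, are hard to find, and hiring them to evaluate an MT system costs money. On top of that, having people read through thousands of output sentences inevitably takes a lot of time^[1]. For this reason, some MT researchers want to adopt automated MT evaluation. Put simply, this means devising calculations for a computer to carry out: the computer takes the text produced by an MT system as input and computes a number of values from it, and these values reflect how high the quality of the system's translations is.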

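Among these methods, BLEU can be said to be the most commonly used automatic evaluation method of the early twenty-first century^{[note 1]}. To begin with, BLEU is built on one idea^[2]: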

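"Given a translation made by an expert (a reference translation), the output of a high-quality MT system should resemble that reference translation reasonably closely."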

Evaluation with BLEU goes roughly as follows^[2]^[3]:

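1. Take
   - an input sentence,
   - several reference translations $(y^{(1)},...,y^{(N)})$ ^{[note 2]}, and
   - the output that the system under evaluation gives for that sentence ( ${\hat {y}}$ ).
2. Work out what percentage of the words in ${\hat {y}}$ also appear in $(y^{(1)},...,y^{(N)})$ .
   - The number from step 2 cannot be used for evaluation on its own. For example, if ${\hat {y}}$ is 貓貓貓貓 ("cat cat cat cat") and $y^{(1)}$ is 隻貓瞓緊 ("the cat is sleeping"), every word of ${\hat {y}}$ appears in the reference, so ${\hat {y}}$ gets a high score in step 2, yet it is clearly not a good translation.
3. For each word in ${\hat {y}}$ , find the largest number of times it appears in any single one of $(y^{(1)},...,y^{(N)})$ ( $m_{max}$ ), and count that word at most $m_{max}$ times.
   - For example, if ${\hat {y}}$ is 貓貓貓貓, $y^{(1)}$ is 隻貓瞓緊 and $y^{(2)}$ is 隻貓喺度瞓 ("the cat is sleeping here"), the word 貓 appears at most once in any of the references, so $m_{max}=1$ .
   - Only 1 of the 4 words in ${\hat {y}}$ is counted, so the 1-gram precision score $={\frac {1}{4}}=0.25$ .
4. Repeat steps 2 and 3 with 2-grams, 3-grams and so on (see the concept of the N-gram).
5. An ${\hat {y}}$ that is too short is penalised:
   - Let $r$ be the length of the reference and $c$ the length of ${\hat {y}}$ ; if $c<r$ , the system loses points.
   - In the most basic version, the penalty is a factor $e^{1-r/c}$ that the score is multiplied by, where $e$ is the base of the natural logarithm.
6. After repeating up to (for example) 4-grams, take the geometric mean of the precision scores to get the system's average score. The higher the score, the better the system.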
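
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`ngrams`, `clipped_precision`, `brevity_penalty`, `bleu`) are made up for this example, sentences are assumed to be already split into lists of tokens, all n-gram precisions are weighted equally, and $r$ is assumed to be the length of the reference closest in length to ${\hat {y}}$ .

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Steps 2-3: n-gram precision with each count capped at m_max."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # m_max: the most times each n-gram occurs in any single reference.
    m_max = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            m_max[gram] = max(m_max[gram], count)
    clipped = sum(min(count, m_max[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(c, r):
    """Step 5: factor e^(1 - r/c) when the candidate is shorter than the reference."""
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Step 6: brevity penalty times the geometric mean of the precisions."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    c = len(candidate)
    # Assumption: r is the reference length closest to c (shorter one on ties).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(c, r) * geo_mean
```

With the example from step 3 split into single characters, `clipped_precision(list("貓貓貓貓"), [list("隻貓瞓緊"), list("隻貓喺度瞓")], 1)` gives 0.25, while `bleu` on the same input gives 0 because none of the candidate's 2-grams appears in a reference.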

Notes

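[note 1] Research also indicates that BLEU's judgments have a strong positive statistical correlation with the judgments of human experts.
[note 2] These reference translations can come from a corpus.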

References

[1] Hovy, E. H. (1999). Toward finely differentiated evaluation metrics for machine translation. In Proceedings of the EAGLES Workshop on Standards and Evaluation, Pisa, Italy, 1999.
[2] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).
[3] Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006 (pp. 249-256).
