# Okapi BM25

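Okapi BM25 (where BM stands for best matching) is a ranking function used in information retrieval. The algorithm takes the user's query (${\displaystyle Q}$) as input and computes, for each document (${\displaystyle D}$), a score (${\displaystyle {\text{score}}}$) reflecting how relevant that document is to the query.[1][2]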

## Formula

The Okapi BM25 formula is as follows:

${\displaystyle {\text{score}}(D,Q)=\sum _{i=1}^{n}{\text{IDF}}(q_{i})\cdot {\frac {f(q_{i},D)\cdot (k_{1}+1)}{f(q_{i},D)+k_{1}\cdot \left(1-b+b\cdot {\frac {|D|}{\text{avgdl}}}\right)}}}$, where[note 1]
• ${\displaystyle q_{1},q_{2},\ldots ,q_{n}}$ are the keywords in ${\displaystyle Q}$;
• ${\displaystyle f(q_{i},D)}$ is the number of times ${\displaystyle q_{i}}$ occurs in ${\displaystyle D}$ (its term frequency); the length of ${\displaystyle D}$ is accounted for separately by the ${\displaystyle |D|/{\text{avgdl}}}$ term;
• ${\displaystyle |D|}$ is the length of ${\displaystyle D}$ (in words);
• ${\displaystyle {\text{avgdl}}}$ is the average length of the documents in the collection being searched.

${\displaystyle k_{1}}$ and ${\displaystyle b}$ are free parameters. In the absence of any optimization they are often set to ${\displaystyle k_{1}\in [1.2,2.0]}$ and ${\displaystyle b=0.75}$.[3]
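To see what ${\displaystyle k_{1}}$ and ${\displaystyle b}$ do, it can help to isolate the fraction in the formula above, leaving the IDF aside. The following is only a minimal Python sketch: the function name `bm25_term_weight` and its argument names are illustrative assumptions, and the defaults are simply the commonly quoted values ${\displaystyle k_{1}=1.2}$ and ${\displaystyle b=0.75}$.

```python
def bm25_term_weight(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency factor of BM25 for one query term (IDF not included).

    tf      -- f(q_i, D), occurrences of the term in the document
    doc_len -- |D|, document length in words
    avgdl   -- average document length in the collection
    """
    # Length normalisation: documents longer than average get a larger
    # denominator, so the same tf counts for less.
    norm = 1.0 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# For a document of exactly average length the factor grows with tf
# but levels off below k1 + 1.
for tf in (0, 1, 2, 5, 20):
    print(tf, round(bm25_term_weight(tf, doc_len=100, avgdl=100), 3))
```

Under these assumptions, ${\displaystyle k_{1}}$ controls how quickly repeated occurrences stop adding to the score, while ${\displaystyle b}$ controls how strongly document length is penalised (${\displaystyle b=0}$ switches length normalisation off).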

${\displaystyle {\text{IDF}}(q_{i})}$ is computed as follows:

${\displaystyle {\text{IDF}}(q_{i})=\ln \left({\frac {N-n(q_{i})+0.5}{n(q_{i})+0.5}}+1\right)}$
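• where ${\displaystyle N}$ is the number of documents in the collection being searched,
• and ${\displaystyle n(q_{i})}$ is the number of those documents that contain ${\displaystyle q_{i}}$.
• If ${\displaystyle q_{i}}$ is a common word (for example, in or of in English), its ${\displaystyle {\text{IDF}}(q_{i})}$ score should be low (${\displaystyle N-n(q_{i})}$ is small); the ${\displaystyle {\text{IDF}}(q_{i})}$ factor is there to keep common words from distorting the search results.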
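Putting the two formulas together, a score can be computed over a small tokenised collection. The sketch below is an illustration only, not a reference implementation: the names `idf` and `bm25_score`, the whitespace tokenisation, and the three-document toy corpus are all assumptions made for the example, and the document-frequency counts are recomputed on every call for simplicity rather than speed.

```python
import math
from collections import Counter


def idf(num_docs, docs_with_term):
    """IDF as defined above: ln((N - n + 0.5) / (n + 0.5) + 1)."""
    return math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query over a tokenised corpus."""
    num_docs = len(corpus)                              # N
    avgdl = sum(len(d) for d in corpus) / num_docs      # average document length
    tf = Counter(doc_tokens)                            # f(q_i, D) for every term
    norm = 1.0 - b + b * (len(doc_tokens) / avgdl)      # length normalisation
    score = 0.0
    for term in query_terms:
        n_q = sum(1 for d in corpus if term in d)       # n(q_i)
        score += idf(num_docs, n_q) * tf[term] * (k1 + 1.0) / (tf[term] + k1 * norm)
    return score


# Hypothetical toy collection, tokenised by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "okapi bm25 ranks documents in a collection".split(),
]
query = ["cat", "mat"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

In this toy collection the first document contains both query terms, the second contains only one, and the third contains neither, so their scores come out in that order. A word that occurs in every document still gets a small positive IDF, because of the ${\displaystyle +1}$ inside the logarithm.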

## Notes

1. ${\displaystyle \sum }$ denotes summation.

## References

1. Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management. 36 (6): 779–808.
2. Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 2". Information Processing & Management. 36 (6): 809–840.
3. Manning, C. D.; Raghavan, P.; Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press. p. 233.