🔍 What is BM25Retriever?
BM25Retriever is a type of retriever used in information retrieval systems like question answering (QA) and search. It helps find the most relevant documents from a collection based on a user’s query.
It is based on a classic ranking function called BM25 (Best Matching 25), which scores how relevant a document is to a given query.
📚 Use Case Example
Imagine you’re building a Q&A system using your private notes. You store all your notes as documents. When someone asks a question, BM25Retriever fetches the most relevant notes based on word matches.
🤖 How BM25 Works
- Tokenization: Break query and documents into words.
- Scoring: Compute a score between the query and each document.
- Ranking: Return the top N documents with the highest scores (see the sketch below).
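To make the three steps concrete, here is a minimal sketch using the third-party rank_bm25 package (one common BM25 implementation, assumed installed via `pip install rank_bm25`); the corpus and query are made-up examples:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a ranking function used by search engines",
    "dense retrievers embed text into vectors",
    "my notes on quantum gravity and black holes",
]
query = "how do search engines rank documents"

# 1. Tokenization: break the query and documents into words
tokenized_corpus = [doc.lower().split() for doc in corpus]
tokenized_query = query.lower().split()

# 2. Scoring: one BM25 score per document for this query
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(tokenized_query)

# 3. Ranking: keep the top N documents by score
top_docs = bm25.get_top_n(tokenized_query, corpus, n=2)
print(scores, top_docs)
```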
🧪 BM25 Scoring Formula
For a document D and query Q = {q₁, q₂, ..., qₙ}, the score is:
Score(D, Q) = Σ IDF(qᵢ) * [(f(qᵢ, D) * (k + 1)) / (f(qᵢ, D) + k * (1 - b + b * (|D| / avgDL)))]
Where:
- f(qᵢ, D) = frequency of term qᵢ in document D
- |D| = length of document D
- avgDL = average document length
- k and b = hyperparameters (usually k ≈ 1.2 and b ≈ 0.75)
- IDF(qᵢ) = inverse document frequency of qᵢ:
IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
Where:
- N = total number of documents
- n(q) = number of documents containing q
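The formula maps almost line-for-line onto code. Here is an illustrative from-scratch scorer (not the implementation any particular library uses), with the usual k ≈ 1.2 and b ≈ 0.75 as defaults and a made-up tokenized corpus:

```python
import math

def idf(term, docs):
    """IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)"""
    N = len(docs)
    n_q = sum(1 for d in docs if term in d)
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25_score(query_terms, doc, docs, k=1.2, b=0.75):
    """Score(D, Q) as written above; doc and docs are lists of tokens."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in query_terms:
        f = doc.count(q)                       # f(q, D): term frequency in D
        norm = 1 - b + b * (len(doc) / avgdl)  # length normalization
        score += idf(q, docs) * (f * (k + 1)) / (f + k * norm)
    return score

# tiny demo on a made-up tokenized corpus
docs = [
    "quantum gravity is cool".split(),
    "notes about cooking pasta at home".split(),
    "gravity affects every falling object".split(),
]
query = "quantum gravity".split()
for d in docs:
    print(" ".join(d), "->", round(bm25_score(query, d, docs), 3))
```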
🧠 Intuition Behind the Formula
- Rare words (high IDF) are more valuable.
- Words that appear more often in a document get more weight (TF).
- Long documents are penalized slightly so they don't win simply by containing more words.
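To put numbers on the first point, plug a made-up corpus of 1,000 documents into the IDF formula above: a term found in only 2 documents earns a much larger weight than one found in 800.

```python
import math

def idf(N, n_q):
    # IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

print(round(idf(1000, 2), 2))    # rare term, in 2 of 1000 docs    -> ~5.99
print(round(idf(1000, 800), 2))  # common term, in 800 of 1000 docs -> ~0.22
```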
📝 Example
Query: "quantum gravity"
- Doc A: "quantum gravity is cool" → short and focused.
- Doc B: a long document with many unrelated words that mentions "quantum" 20× and "gravity" 10×.
BM25 may give Doc A a higher score because it is short and directly relevant, while Doc B's repeated mentions add less and less (term frequency saturates) and its extra length is penalized by the normalization factor.
🛠️ Using BM25Retriever in Code (e.g., LangChain)
```python
from langchain.retrievers import BM25Retriever

# Build the retriever from an existing list of Document objects
retriever = BM25Retriever.from_documents(docs)

# Fetch the documents most relevant to the query
results = retriever.get_relevant_documents(query)
```
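For a slightly fuller, end-to-end sketch, here is one way the snippet above might look with the documents defined inline. Import paths and method names vary across LangChain versions (newer releases ship BM25Retriever in langchain_community and require the rank_bm25 package), so treat this as illustrative; the document contents are made up.

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

# your "private notes", wrapped as Document objects
docs = [
    Document(page_content="BM25 ranks documents by keyword overlap with the query"),
    Document(page_content="dense retrievers compare embedding vectors instead"),
    Document(page_content="notes on quantum gravity and general relativity"),
]

retriever = BM25Retriever.from_documents(docs)

# invoke() is the newer entry point; get_relevant_documents() still works
results = retriever.invoke("what is quantum gravity")
for doc in results:
    print(doc.page_content)
```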
📌 BM25 vs Other Retrievers
| Feature | BM25 | Dense Embedding (e.g., FAISS, Pinecone) |
|---|---|---|
| Based on | Word frequency | Vector similarity |
| Requires training | ❌ No | ✅ Yes (usually needs a pre-trained embedding model) |
| Synonym handling | ❌ No | ✅ Yes (to some extent) |
| Performance | ⚡ Fast | 🧠 More accurate (but slower) |
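The synonym row is easy to see in practice: BM25 only matches surface tokens, so a query phrased with a synonym can score zero against a perfectly relevant document. A small illustration with the rank_bm25 package and a made-up corpus:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "the automobile engine would not start this morning",
    "recipes for quick weeknight dinners",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# "car" never appears literally, so BM25 sees no match at all
print(bm25.get_scores("car repair".split()))  # prints an all-zero score array
# a dense/embedding retriever could still relate "car" to "automobile"
```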
🧩 Summary of BM25 Scoring
| Component | Description |
|---|---|
| TF | Boosts score for frequent terms |
| IDF | Boosts score for rare terms |
| Length factor | Penalizes long documents |
| k, b | Tune how much to weigh frequency and length |
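To see what k and b actually do, here is the bracketed term-frequency factor from the scoring formula in isolation, evaluated with made-up numbers:

```python
def tf_factor(f, k=1.2, b=0.75, doc_len=100, avg_len=100):
    """The bracketed factor from the BM25 formula for a single query term."""
    return (f * (k + 1)) / (f + k * (1 - b + b * (doc_len / avg_len)))

# higher k lets raw term frequency matter more before it saturates
for f in (1, 5, 20):
    print(f, round(tf_factor(f, k=1.2), 2), round(tf_factor(f, k=2.0), 2))

# higher b penalizes long documents more (here the doc is 5x the average length)
print(round(tf_factor(5, b=0.0, doc_len=500), 2),
      round(tf_factor(5, b=0.75, doc_len=500), 2))
```

With k = 1.2 the twentieth occurrence of a term barely moves the score beyond the fifth, and with b = 0.75 the same term count is worth much less in a document five times longer than average.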
BM25 is a great baseline retriever—simple, fast, and effective for many real-world applications.