🔍 What is BM25Retriever?
BM25Retriever is a type of retriever used in information retrieval systems like question answering (QA) and search. It helps find the most relevant documents from a collection based on a user’s query.
It is based on a classic ranking function called BM25 (Best Matching 25), which scores how relevant a document is to a given query.
📚 Use Case Example
Imagine you’re building a Q&A system using your private notes. You store all your notes as documents. When someone asks a question, BM25Retriever fetches the most relevant notes based on word matches.
🤖 How BM25 Works
- Tokenization: Break query and documents into words.
- Scoring: Compute a score between the query and each document.
- Ranking: Return the top N documents with the highest scores.
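The three steps above can be sketched as a small pipeline. This is only an illustration: the regex tokenizer, the `retrieve` helper, and the `overlap_score` stand-in scorer are assumptions made for this example (the stand-in just counts matching words and is not BM25); a real BM25 scorer would be passed in as `score_fn`.

```python
import re
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Step 1: lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query_tokens: List[str], doc_tokens: List[str]) -> float:
    """Placeholder scorer (not BM25): count occurrences of query terms in the document."""
    query_terms = set(query_tokens)
    return float(sum(1 for tok in doc_tokens if tok in query_terms))


def retrieve(
    query: str,
    docs: List[str],
    top_n: int = 2,
    score_fn: Callable[[List[str], List[str]], float] = overlap_score,
) -> List[Tuple[float, str]]:
    """Steps 2 and 3: score every document against the query, then keep the top_n."""
    q = tokenize(query)
    scored = [(score_fn(q, tokenize(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:top_n]


# Hypothetical notes used only for this demonstration.
notes = [
    "Quantum gravity is cool",
    "Grocery list: eggs, milk, bread",
    "Notes on gravity and orbits",
]
top = retrieve("quantum gravity", notes)
```

Here `top` holds the two best-matching notes with their scores; swapping `score_fn` changes the ranking logic without touching tokenization or ranking.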
🧪 BM25 Scoring Formula
For a document D and query Q = {q₁, q₂, ..., qₙ}, the score is:
Score(D, Q) = Σ IDF(qᵢ) * [(f(qᵢ, D) * (k + 1)) / (f(qᵢ, D) + k * (1 - b + b * (|D| / avgDL)))]
Where:
- f(qᵢ, D) = frequency of term qᵢ in document D
- |D| = length of document D
- avgDL = average document length
- k and b = hyperparameters (usually k ≈ 1.2 and b ≈ 0.75)
- IDF(qᵢ) = inverse document frequency of qᵢ:
IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
Where:
- N = total number of documents
- n(q) = number of documents containing q
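The two formulas can be translated almost directly into Python. The sketch below assumes documents are already tokenized into lists of words and measures |D| as the number of tokens; the function names and the tiny corpus are made up for illustration and are not taken from any particular library.

```python
import math
from collections import Counter
from typing import List


def idf(term: str, corpus: List[List[str]]) -> float:
    """IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)."""
    n_docs = len(corpus)                                # N
    n_q = sum(1 for doc in corpus if term in doc)       # n(q)
    return math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)


def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]],
               k: float = 1.2, b: float = 0.75) -> float:
    """Sum the per-term BM25 contributions of `query` for one tokenized `doc`."""
    avg_dl = sum(len(d) for d in corpus) / len(corpus)  # avgDL
    freqs = Counter(doc)                                # f(q, D) for every term in D
    length_norm = 1 - b + b * (len(doc) / avg_dl)
    score = 0.0
    for q in query:
        f = freqs[q]
        if f == 0:
            continue  # terms missing from the document contribute nothing
        score += idf(q, corpus) * (f * (k + 1)) / (f + k * length_norm)
    return score


# Hypothetical toy corpus, whitespace-tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum gravity is hard".split(),
]
query = "quantum cat".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

In this toy corpus "quantum" occurs in one document and "cat" in two, so "quantum" gets the larger IDF and the third document ends up with the highest score.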
🧠 Intuition Behind the Formula
- Rare words (high IDF) are more valuable.
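- Words that appear more often in a document get more weight (TF), but with diminishing returns: each extra occurrence adds less than the one before, and k controls how quickly the gain levels off.
- Scores are normalized by document length (controlled by b), so long documents do not win merely because they contain more words.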
📝 Example
Query: "quantum gravity"
- Doc A: "quantum gravity is cool" → short and focused.
- Doc B: Long doc with many unrelated words, mentions "quantum" 20× and "gravity" 10×.
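Which document wins depends on how long Doc B is relative to the average document length. Term frequency saturates, so 20 mentions count for far less than 20 times a single mention, and if Doc B is many times longer than average, length normalization can pull its score below Doc A’s. If Doc B is only moderately long, its repeated matches can still put it ahead.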
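To make the comparison concrete, here is a rough numeric sketch of the example. Several things are assumed for illustration and do not come from the example itself: an average document length of 100 words, three hypothetical lengths for Doc B, and an equal IDF of 1.0 for both query terms (IDF is the same for both documents, so it does not decide the comparison on its own).

```python
def term_weight(f: int, doc_len: int, avg_dl: float,
                k: float = 1.2, b: float = 0.75) -> float:
    """Frequency/length part of the BM25 formula for a single query term."""
    return f * (k + 1) / (f + k * (1 - b + b * doc_len / avg_dl))


AVG_DL = 100  # assumed average document length, in words


def score(term_freqs, doc_len: int) -> float:
    """Sum the term weights, treating every query term's IDF as 1.0 (assumption)."""
    return sum(term_weight(f, doc_len, AVG_DL) for f in term_freqs)


# Doc A: 4 words, "quantum" and "gravity" once each.
score_a = score([1, 1], doc_len=4)

# Doc B: "quantum" 20x and "gravity" 10x, at several hypothetical total lengths.
score_b_by_len = {n: score([20, 10], doc_len=n) for n in (200, 1000, 5000)}

for n, s in score_b_by_len.items():
    winner = "Doc B" if s > score_a else "Doc A"
    print(f"Doc B length {n:>5}: A={score_a:.2f}  B={s:.2f}  -> {winner} ranks higher")
```

Under these assumptions Doc B ranks higher at 200 words, while Doc A ranks higher once Doc B reaches 1,000 words; the crossover point shifts with avgDL, k, and b.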
🛠️ Using BM25Retriever in Code (e.g., LangChain)
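from langchain_community.retrievers import BM25Retriever  # older releases: from langchain.retrievers; needs the rank_bm25 package
retriever = BM25Retriever.from_documents(docs)  # docs: a list of LangChain Document objects
results = retriever.invoke(query)  # older releases: retriever.get_relevant_documents(query)
📌 BM25 vs Other Retrievers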
| Feature | BM25 | Dense embeddings (e.g., stored in FAISS or Pinecone) |
|---|---|---|
| Based on | Term frequency and rarity (exact word matches) | Vector similarity |
| Requires training | ❌ No | ✅ Usually requires a pre-trained embedding model |
| Synonym handling | ❌ No | ✅ Yes (to some extent) |
| Performance | ⚡ Fast | 🧠 Often more accurate on semantic queries (but slower) |
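The "Synonym handling" row follows from the scoring formula: BM25 only sums over query terms that literally occur in the document, so a document that uses different words for the same idea scores zero. A minimal sketch, using a hypothetical query and documents and a simple regex tokenizer chosen for this example:

```python
import re
from typing import Set


def terms(text: str) -> Set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_terms(query: str, doc: str) -> Set[str]:
    """The only terms a lexical scorer such as BM25 can match on."""
    return terms(query) & terms(doc)


query = "car repair"
doc_synonym = "Automobile maintenance guide"  # same topic, different words
doc_literal = "Car repair manual"             # same words

no_overlap = shared_terms(query, doc_synonym)  # nothing to sum over, so the BM25 score is 0
overlap = shared_terms(query, doc_literal)     # both query terms can contribute
```

A dense retriever may still place `doc_synonym` near the query in vector space, which is why the two approaches are sometimes combined.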
🧩 Summary of BM25 Scoring
| Component | Description |
|---|---|
| TF | Boosts score for frequent terms |
| IDF | Boosts score for rare terms |
| Length factor | Penalizes long documents |
| k, b | Tune how much to weigh frequency and length |
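The effect of k and b is easiest to see at their extremes. The sketch below isolates the frequency/length part of the formula for one term; the function name `tf_weight` and the specific lengths are assumptions chosen for illustration.

```python
def tf_weight(f: int, doc_len: int, avg_dl: float,
              k: float = 1.2, b: float = 0.75) -> float:
    """f*(k+1) / (f + k*(1 - b + b*doc_len/avg_dl)) for a single term."""
    return f * (k + 1) / (f + k * (1 - b + b * doc_len / avg_dl))


# k = 0: frequency is ignored; any matching term contributes exactly 1.
binary_once = tf_weight(1, 100, 100, k=0)
binary_many = tf_weight(10, 100, 100, k=0)

# b = 0: document length is ignored; short and long documents get the same weight.
short_doc = tf_weight(3, 10, 100, b=0)
long_doc = tf_weight(3, 1000, 100, b=0)

# Default k: repeated terms help, but each extra occurrence helps less,
# and the weight never reaches k + 1.
gain_early = tf_weight(2, 100, 100) - tf_weight(1, 100, 100)
gain_late = tf_weight(11, 100, 100) - tf_weight(10, 100, 100)
ceiling = 1.2 + 1
```

Raising k lets repeated terms keep adding weight for longer, and raising b toward 1 makes the length penalty stronger.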
BM25 is a great baseline retriever—simple, fast, and effective for many real-world applications.