🔍 What is BM25Retriever?

BM25Retriever is a type of retriever used in information retrieval systems like question answering (QA) and search. It helps find the most relevant documents from a collection based on a user’s query.

It is based on a classic ranking function called BM25 (Best Matching 25), which scores how relevant a document is to a given query.


📚 Use Case Example

Imagine you’re building a Q&A system using your private notes. You store all your notes as documents. When someone asks a question, BM25Retriever fetches the most relevant notes based on word matches.


🤖 How BM25 Works

  1. Tokenization: Break query and documents into words.
  2. Scoring: Compute a score between the query and each document.
  3. Ranking: Return the top N documents with the highest scores.
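
The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `tokenize`, `rank`, and `overlap_score` are hypothetical names, the tokenizer is deliberately naive, and the stand-in scorer just counts shared words rather than computing BM25.

```python
import re
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query_tokens: List[str], doc_tokens: List[str]) -> float:
    """Stand-in scorer (an assumption, not BM25): count query-term occurrences."""
    return float(sum(doc_tokens.count(t) for t in set(query_tokens)))


def rank(
    query: str,
    docs: List[str],
    top_n: int = 2,
    score: Callable[[List[str], List[str]], float] = overlap_score,
) -> List[Tuple[float, str]]:
    """Steps 2 and 3: score every document, then keep the top_n highest."""
    q = tokenize(query)
    scored = [(score(q, tokenize(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


notes = [
    "Quantum gravity is cool",
    "Shopping list: milk, eggs, bread",
    "Gravity bends light near massive objects",
]
top = rank("quantum gravity", notes)
```

Passing the scorer as a parameter keeps the tokenization and ranking steps unchanged when the simple word count is swapped for a real BM25 function.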

🧪 BM25 Scoring Formula

For a document D and query Q = {q₁, q₂, ..., qₙ}, the score is:


Score(D, Q) = Σ IDF(qᵢ) * [(f(qᵢ, D) * (k + 1)) / (f(qᵢ, D) + k * (1 - b + b * (|D| / avgDL)))]

Where:

  • f(qᵢ, D) = frequency of term qᵢ in document D
  • |D| = length of document D
  • avgDL = average document length
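  • k and b = hyperparameters (usually k ≈ 1.2 and b ≈ 0.75); k, often written k₁, controls how quickly term frequency saturates, and b controls how strongly document length is normalized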
  • IDF(qᵢ) = inverse document frequency of qᵢ:

IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)

Where:

  • N = total number of documents
  • n(q) = number of documents containing q
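
To make the formula concrete, here is a minimal pure-Python sketch that follows the Score and IDF definitions above term by term. The function name `bm25_scores` and the toy corpus are assumptions for illustration; real libraries add details such as smarter tokenization and precomputed indexes.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], corpus: List[List[str]], k: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score each tokenized document in a non-empty corpus against the query."""
    n_docs = len(corpus)                                  # N
    avg_dl = sum(len(doc) for doc in corpus) / n_docs     # avgDL
    # n(q): number of documents that contain each term at least once
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)                                 # f(q, D) for every term
        length_norm = 1 - b + b * (len(doc) / avg_dl)
        total = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            n_q = doc_freq[term]
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            total += idf * (f * (k + 1)) / (f + k * length_norm)
        scores.append(total)
    return scores


corpus = [
    "quantum gravity is cool".split(),
    "gravity bends light".split(),
    "milk eggs bread".split(),
]
scores = bm25_scores("quantum gravity".split(), corpus)
```

In this toy corpus the first document matches both query terms, the second matches only the more common term "gravity", and the third matches neither, so the scores come out in that order.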

🧠 Intuition Behind the Formula

  • Rare words (high IDF) are more valuable.
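  • Words that appear more often in a document get more weight (TF), but with diminishing returns: the TF part of the score can never exceed k + 1.
  • Documents longer than average are penalized, so they do not win simply by containing more words; b sets the strength of this penalty.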
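
The effect of term frequency and document length can be checked numerically. The sketch below isolates the bracketed part of the scoring formula (IDF left out); `tf_component` is a hypothetical helper name, and the lengths and counts are made-up inputs.

```python
def tf_component(
    f: float, doc_len: float, avg_dl: float, k: float = 1.2, b: float = 0.75
) -> float:
    """The term-frequency part of the BM25 formula, without the IDF factor."""
    return (f * (k + 1)) / (f + k * (1 - b + b * (doc_len / avg_dl)))


# Saturation: each extra occurrence adds less, and the value stays below k + 1.
saturation = [tf_component(f, doc_len=100, avg_dl=100) for f in (1, 2, 5, 20, 100)]

# Length normalization: same term count, longer document, lower value.
by_length = [tf_component(3, doc_len=n, avg_dl=100) for n in (50, 100, 400)]

for f, value in zip((1, 2, 5, 20, 100), saturation):
    print(f"tf={f:>3}  component={value:.3f}")
for n, value in zip((50, 100, 400), by_length):
    print(f"len={n:>3}  component={value:.3f}")
```

Raising k delays the saturation, and setting b to 0 switches the length penalty off entirely.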

📝 Example

Query: "quantum gravity"

  • Doc A: "quantum gravity is cool" → short and focused.
  • Doc B: Long doc with many unrelated words, mentions "quantum" 20× and "gravity" 10×.

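BM25 may give Doc A a higher score because it is short and focused, while Doc B's score is reduced by length normalization. The outcome depends on how long Doc B is relative to the average document: if it is only moderately longer than average, its high term counts can still put it ahead.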
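
Whether Doc A actually wins depends on the numbers, which a short calculation makes visible. This sketch assumes Doc A is 4 words long, Doc B is 1,000 words long, and both query terms have an IDF of 1.0; all three values and the helper names are assumptions chosen for illustration.

```python
def bm25_doc_score(term_freqs, idfs, doc_len, avg_dl, k=1.2, b=0.75):
    """BM25 score of one document from per-term frequencies and IDF weights."""
    norm = 1 - b + b * (doc_len / avg_dl)
    return sum(idf * f * (k + 1) / (f + k * norm) for f, idf in zip(term_freqs, idfs))


# Assumed IDF weights for "quantum" and "gravity" (they depend on the real corpus).
idfs = [1.0, 1.0]


def compare(doc_b_len, avg_dl):
    """Return (Doc A score, Doc B score) for a given average document length."""
    doc_a = bm25_doc_score([1, 1], idfs, doc_len=4, avg_dl=avg_dl)
    doc_b = bm25_doc_score([20, 10], idfs, doc_len=doc_b_len, avg_dl=avg_dl)
    return doc_a, doc_b


# Collection of short notes (average 10 words): Doc B is 100x the average length.
a_short, b_short = compare(doc_b_len=1000, avg_dl=10)

# Collection of long documents (average 500 words): Doc B is only 2x the average.
a_long, b_long = compare(doc_b_len=1000, avg_dl=500)
```

Under these assumptions Doc A ranks first in the collection of short notes, while Doc B ranks first in the collection of long documents, because the length penalty is relative to the average length.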


🛠️ Using BM25Retriever in Code (e.g., LangChain)

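# Import path for older LangChain releases; newer ones expose it as
# langchain_community.retrievers (the retriever relies on the rank_bm25 package).
from langchain.retrievers import BM25Retriever

# docs: a list of LangChain Document objects; query: the user's question as a string.
retriever = BM25Retriever.from_documents(docs)
results = retriever.get_relevant_documents(query)  # newer releases prefer retriever.invoke(query)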

📌 BM25 vs Other Retrievers

| Feature | BM25 | Dense embeddings (e.g., indexed in FAISS or Pinecone) |
|---|---|---|
| Based on | Word frequency | Vector similarity |
| Requires training | ❌ No | ✅ Usually requires pre-trained models |
| Synonym handling | ❌ No | ✅ Yes (to some extent) |
| Performance | ⚡ Fast | 🧠 Often more accurate on semantic matches (but slower) |

🧩 Summary of BM25 Scoring

| Component | Description |
|---|---|
| TF | Boosts score for frequent terms |
| IDF | Boosts score for rare terms |
| Length factor | Penalizes long documents |
| k, b | Tune how much to weigh frequency and length |

BM25 is a great baseline retriever: simple, fast, and effective for many real-world applications.