🔍 What is BM25Retriever?
BM25Retriever is a type of retriever used in information retrieval systems like question answering (QA) and search. It helps find the most relevant documents from a collection based on a user’s query.
It is based on a classic ranking function called BM25 (Best Matching 25), which scores how relevant a document is to a given query.
📚 Use Case Example
Imagine you’re building a Q&A system using your private notes. You store all your notes as documents. When someone asks a question, BM25Retriever fetches the most relevant notes based on word matches.
🤖 How BM25 Works
- Tokenization: Break query and documents into words.
- Scoring: Compute a score between the query and each document.
- Ranking: Return the top N documents with the highest scores (see the sketch below).
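To make the three steps concrete, here is a minimal sketch using the third-party rank_bm25 package (one common BM25 implementation, assumed installed via `pip install rank_bm25`); the corpus and query are made-up examples:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a ranking function used by search engines",
    "dense retrievers embed text into vectors",
    "my notes on quantum gravity and black holes",
]
query = "how do search engines rank documents"

# 1. Tokenization: break the query and documents into words
tokenized_corpus = [doc.lower().split() for doc in corpus]
tokenized_query = query.lower().split()

# 2. Scoring: one BM25 score per document for this query
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(tokenized_query)

# 3. Ranking: keep the top N documents by score
top_docs = bm25.get_top_n(tokenized_query, corpus, n=2)
print(scores, top_docs)
```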
🧪 BM25 Scoring Formula
For a document D and query Q = {q₁, q₂, ..., qₙ}, the score is:
Score(D, Q) = Σ IDF(qᵢ) * [(f(qᵢ, D) * (k + 1)) / (f(qᵢ, D) + k * (1 - b + b * (|D| / avgDL)))]
Where:
- f(qᵢ, D) = frequency of term qᵢ in document D
- |D| = length of document D
- avgDL = average document length
- k and b = hyperparameters (usually k ≈ 1.2 and b ≈ 0.75)
- IDF(qᵢ) = inverse document frequency of qᵢ:
IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
Where:
- N = total number of documents
- n(q) = number of documents containing q
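The formula maps almost line-for-line onto code. Here is an illustrative from-scratch scorer (not the implementation any particular library uses), with the usual k ≈ 1.2 and b ≈ 0.75 as defaults and a made-up tokenized corpus:

```python
import math

def idf(term, docs):
    """IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)"""
    N = len(docs)
    n_q = sum(1 for d in docs if term in d)
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25_score(query_terms, doc, docs, k=1.2, b=0.75):
    """Score(D, Q) as written above; doc and docs are lists of tokens."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in query_terms:
        f = doc.count(q)                       # f(q, D): term frequency in D
        norm = 1 - b + b * (len(doc) / avgdl)  # length normalization
        score += idf(q, docs) * (f * (k + 1)) / (f + k * norm)
    return score

# tiny demo on a made-up tokenized corpus
docs = [
    "quantum gravity is cool".split(),
    "notes about cooking pasta at home".split(),
    "gravity affects every falling object".split(),
]
query = "quantum gravity".split()
for d in docs:
    print(" ".join(d), "->", round(bm25_score(query, d, docs), 3))
```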
🧠 Intuition Behind the Formula
- Rare words (high IDF) are more valuable.
- Words that appear more often in a document get more weight (TF).
- Long documents are penalized slightly so they don't win simply by containing more words.
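To put numbers on the first point, plug a made-up corpus of 1,000 documents into the IDF formula above: a term found in only 2 documents earns a much larger weight than one found in 800.

```python
import math

def idf(N, n_q):
    # IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

print(round(idf(1000, 2), 2))    # rare term, in 2 of 1000 docs    -> ~5.99
print(round(idf(1000, 800), 2))  # common term, in 800 of 1000 docs -> ~0.22
```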
📝 Example
Query: "quantum gravity"
- Doc A: "quantum gravity is cool" → short and focused.
- Doc B: a long document with many unrelated words that mentions "quantum" 20× and "gravity" 10×.
BM25 may give Doc A a higher score because it is short and directly relevant, while Doc B's repeated mentions add less and less (term frequency saturates) and its extra length is penalized by the normalization factor.
🛠️ Using BM25Retriever in Code (e.g., LangChain)
```python
from langchain.retrievers import BM25Retriever

# Build the retriever from an existing list of Document objects
retriever = BM25Retriever.from_documents(docs)

# Fetch the documents most relevant to the query
results = retriever.get_relevant_documents(query)
```
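For a slightly fuller, end-to-end sketch, here is one way the snippet above might look with the documents defined inline. Import paths and method names vary across LangChain versions (newer releases ship BM25Retriever in langchain_community and require the rank_bm25 package), so treat this as illustrative; the document contents are made up.

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

# your "private notes", wrapped as Document objects
docs = [
    Document(page_content="BM25 ranks documents by keyword overlap with the query"),
    Document(page_content="dense retrievers compare embedding vectors instead"),
    Document(page_content="notes on quantum gravity and general relativity"),
]

retriever = BM25Retriever.from_documents(docs)

# invoke() is the newer entry point; get_relevant_documents() still works
results = retriever.invoke("what is quantum gravity")
for doc in results:
    print(doc.page_content)
```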
📌 BM25 vs Other Retrievers
| Feature | BM25 | Dense Embedding (e.g., FAISS, Pinecone) |
|---|---|---|
| Based on | Word frequency | Vector similarity |
| Requires training | ❌ No | ✅ Yes (usually needs a pre-trained embedding model) |
| Synonym handling | ❌ No | ✅ Yes (to some extent) |
| Performance | ⚡ Fast | 🧠 More accurate (but slower) |
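The synonym row is easy to see in practice: BM25 only matches surface tokens, so a query phrased with a synonym can score zero against a perfectly relevant document. A small illustration with the rank_bm25 package and a made-up corpus:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "the automobile engine would not start this morning",
    "recipes for quick weeknight dinners",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# "car" never appears literally, so BM25 sees no match at all
print(bm25.get_scores("car repair".split()))  # prints an all-zero score array
# a dense/embedding retriever could still relate "car" to "automobile"
```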
🧩 Summary of BM25 Scoring
| Component | Description |
|---|---|
| TF | Boosts score for frequent terms |
| IDF | Boosts score for rare terms |
| Length factor | Penalizes long documents |
| k, b | Tune how much to weigh frequency and length |
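To see what k and b actually do, here is the bracketed term-frequency factor from the scoring formula in isolation, evaluated with made-up numbers:

```python
def tf_factor(f, k=1.2, b=0.75, doc_len=100, avg_len=100):
    """The bracketed factor from the BM25 formula for a single query term."""
    return (f * (k + 1)) / (f + k * (1 - b + b * (doc_len / avg_len)))

# higher k lets raw term frequency matter more before it saturates
for f in (1, 5, 20):
    print(f, round(tf_factor(f, k=1.2), 2), round(tf_factor(f, k=2.0), 2))

# higher b penalizes long documents more (here the doc is 5x the average length)
print(round(tf_factor(5, b=0.0, doc_len=500), 2),
      round(tf_factor(5, b=0.75, doc_len=500), 2))
```

With k = 1.2 the twentieth occurrence of a term barely moves the score beyond the fifth, and with b = 0.75 the same term count is worth much less in a document five times longer than average.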
BM25 is a great baseline retriever—simple, fast, and effective for many real-world applications.