What is RAGScope

RAGScope is a benchmarking harness for Retrieval-Augmented Generation pipelines. Its purpose is measurement: you upload a document corpus, ask a question, run up to four retrieval strategies, and see which one produces the most faithful and relevant answer on your specific data.

Most RAG systems are evaluated qualitatively ("it seems to work") or with proxy metrics that do not capture the full picture. RAGScope uses RAGAS, an open-source evaluation framework that applies an LLM judge (GPT-4o-mini) to score three properties simultaneously: whether the answer is grounded in the retrieved context, whether the retrieved context was relevant, and whether the answer addressed the question.

RAGScope is not a production RAG system. It is a measurement instrument. You use it to answer "which strategy should I use for this corpus?" before committing to an architecture.

How it works

RAGScope has three phases. The ingest phase processes your documents once. The benchmark phase runs your question through one or more retrieval strategies and evaluates the results. The live chat phase lets you query the corpus interactively using any strategy.

Phase 1: Ingest

Ingest flow: Upload (PDF or TXT) -> Chunk (split text) -> Embed (text-embedding-3-small) -> Store (pgvector index)

Your uploaded files are passed to the appropriate ingestor (PDF or plain text), which extracts raw text. The text is split into chunks by the chunker strategy you choose. Each chunk is embedded using OpenAI text-embedding-3-small and stored in a pgvector index alongside the original text. A BM25 sparse index is also built over the same chunks for hybrid retrieval. This happens once per corpus. Re-uploading the same files returns the cached result immediately.
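
The chunking step can be illustrated with a minimal sketch. It assumes a fixed-size character window with overlap, which is only one possible chunker strategy; the function name `chunk_text` and its default sizes are hypothetical and not taken from RAGScope.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`.

    Hypothetical chunker: RAGScope's actual strategies are not shown here.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing and embedding some text twice.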

Phase 2: Benchmark

Benchmark flow: Question (user query) -> Retrieve (top-k chunks) -> Generate (GPT-4o-mini answer) -> Evaluate (RAGAS metrics)

You submit a question and select one or more retrieval strategies. Each strategy searches the pgvector index for the most relevant chunks, optionally applies contextual compression, then passes the chunks to GPT-4o-mini which generates an answer constrained to the retrieved context. RAGAS evaluates the question, answer, and context together and produces three scores. Each strategy runs as a separate FastAPI background task so the HTTP response returns immediately with a list of run IDs that you poll independently. Selecting N strategies counts as N runs against the guest daily limit.

Phase 3: Live chat

After benchmarking, you can query your corpus interactively using the winning strategy or any strategy you choose. This is a lightweight retrieval and generation step, not a full-scale chatbot. There is no conversation memory and no multi-turn context: each message is answered using only your question and the chunks retrieved for it. Use it to explore how different strategies answer follow-up questions on your corpus, not to hold an ongoing dialogue.

Retrieval strategies

RAGScope benchmarks four retrieval methods. Each method is a distinct approach to finding relevant chunks from your corpus. They differ in how the query is constructed and how chunks are ranked.

Naive RAG

Baseline

Use when

Default starting point. Use when query vocabulary closely matches document vocabulary.

Avoid when

Queries phrased differently from the documents they target.

LLM calls: 1 (embed). Latency: Fast.

HyDE

Hypothesis-driven

Use when

Questions where the query phrasing is very different from how the answer is written in the corpus.

Avoid when

Factual lookups where the query uses the same terms as the document.

LLM calls: 2 (complete + embed). Latency: Moderate.

Multi-Query

Multi-perspective

Use when

Ambiguous questions or when you suspect the user may phrase the question differently from the corpus.

Avoid when

Tight latency budgets where extra LLM calls are not acceptable.

LLM calls: 1 complete + N embeds. Latency: Moderate.

Hybrid BM25 + Dense

Hybrid

Use when

Technical corpora with precise identifiers, product names, codes, or rare terms that semantic search alone misses.

Avoid when

Purely narrative or conversational corpora with no domain-specific keywords.

LLM calls: 1 (embed). Latency: Fast to Moderate.

Post-retrieval processor

Contextual Compression

Contextual compression is not a retrieval method and is not in the retrieval registry. It is a post-retrieval processor that runs after any of the four methods above have selected their chunks.

When enabled, each retrieved chunk is passed through GPT-4o-mini, which extracts only the sentences directly relevant to your question. This reduces noise in the context window and tends to improve faithfulness scores at the cost of one additional LLM call per chunk.
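
A minimal sketch of this per-chunk step follows. The `extract` callable stands in for the GPT-4o-mini call, and `keyword_extract` is a toy replacement so the example runs offline; both names are hypothetical. Dropping chunks whose extraction comes back empty is also an assumption.

```python
from typing import Callable


def compress_chunks(
    question: str,
    chunks: list[str],
    extract: Callable[[str, str], str],
) -> list[str]:
    """Call `extract` once per chunk and keep only the non-empty results."""
    compressed = []
    for chunk in chunks:
        kept = extract(question, chunk).strip()  # one LLM call per chunk
        if kept:
            compressed.append(kept)
    return compressed


def keyword_extract(question: str, chunk: str) -> str:
    """Toy stand-in for the LLM: keep sentences that share a longer word with the question."""
    terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    relevant = [
        s for s in sentences
        if terms & {w.lower().strip("?.,") for w in s.split()}
    ]
    return ". ".join(relevant)
```

The loop makes the cost model visible: with k retrieved chunks there are k extraction calls, which is why latency scales with k.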

Contextual compression can be toggled on top of any of the four retrieval methods. Enabling or disabling it does not consume a benchmark run and does not affect the guest daily limit.

Use when

Corpus chunks are long and contain many irrelevant sentences alongside the relevant ones.

Avoid when

Short chunks or when every part of a chunk is relevant to the question.

LLM calls: 1 per chunk (complete). Latency: Slow (scales with k).

Understanding metrics

All three metrics are scored by an LLM judge (GPT-4o-mini) and return a value between 0.0 and 1.0. Higher is better for all three. Scores should be interpreted comparatively across strategies rather than as absolute quality thresholds.

Faithfulness

For each claim in the generated answer, RAGAS asks the LLM judge whether it is supported by the retrieved chunks. Faithfulness is the fraction of claims that are supported. A score of 1.0 means the answer contains no statements that go beyond what the retrieved context says.

\[
\text{Faithfulness} = \frac{|C_s|}{|C|}
\]

|C_s| = number of answer claims traceable to a retrieved chunk
|C| = total distinct factual claims extracted from the generated answer
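
As a rough illustration of the ratio defined above, the sketch below takes the judge's per-claim verdicts as a list of booleans. The function name and the choice to reject an empty claim list are assumptions; claim extraction and the judging call itself are left out.

```python
def faithfulness(verdicts: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context.

    `verdicts[i]` is True when claim i is traceable to a retrieved chunk.
    """
    if not verdicts:
        raise ValueError("the answer produced no claims to judge")
    return sum(verdicts) / len(verdicts)
```

For example, an answer with four extracted claims of which three are supported scores 3 / 4 = 0.75.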

Low score means

The model is hallucinating. It is making claims not supported by the retrieved documents. This is the most dangerous failure mode in production RAG.

High score means

Every statement in the answer is traceable to the retrieved context. The retrieval strategy is doing its job.

Context Utilization

Measures how much of the retrieved context was actually used when generating the answer. If you retrieved 5 chunks but only 1 contributed to the answer, context utilization is low. If all 5 were used, it is high. Unlike context precision, this metric requires no ground-truth reference answer.

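\[
\text{Context Utilization@}K = \frac{\sum_{k=1}^{K} \left( \text{Precision@}k \times v_k \right)}{\sum_{k=1}^{K} v_k}
\]

K = total number of retrieved chunks
Precision@k = fraction of the top k chunks that were used in generating the answer
v_k = 1 if the chunk at position k was used in the answer, 0 otherwise (judged by GPT-4o-mini)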
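
The three definitions above suggest a rank-weighted average: Precision@k is summed over the positions whose chunk was used, then divided by the number of used chunks. The sketch below implements that reading. The denominator and the function name are assumptions, and returning 0.0 when no chunk was used is a convention chosen here.

```python
def context_utilization(used: list[bool]) -> float:
    """Rank-weighted utilization over retrieved chunks, in retrieval order.

    `used[k-1]` is True when the chunk at position k contributed to the answer.
    """
    total_used = sum(used)
    if total_used == 0:
        return 0.0  # convention chosen for this sketch
    score = 0.0
    hits = 0
    for k, was_used in enumerate(used, start=1):
        if was_used:
            hits += 1
            score += hits / k  # Precision@k, counted only where v_k = 1
    return score / total_used
```

Under this reading the position of a used chunk matters: one used chunk in fifth place gives 0.2, while five used chunks out of five give 1.0.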

Low score means

The retriever is returning chunks the LLM ignored. The context window is noisy and the model had to filter it internally.

High score means

Almost everything retrieved was referenced when composing the answer. The retriever is returning exactly what is needed.

Answer Relevancy

Measures whether the answer addresses the question that was actually asked. RAGAS generates several synthetic questions from the answer and checks whether they resemble the original question using embedding cosine similarity. An answer that is factually correct but off-topic scores low.

\[
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} E_{q_i} \cdot E_{q}
\]

N = number of synthetic questions generated from the answer (typically 3)
q = the original user question
q_i = the i-th synthetic question generated by GPT-4o-mini from the answer alone
E = unit-normalised embedding vector; cosine similarity equals their dot product
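
The averaging step can be sketched with plain lists standing in for embedding vectors. The function names are hypothetical, and the vectors in any real run would come from the embedding model rather than being written by hand.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def answer_relevancy(question_vec: list[float], synthetic_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and each synthetic question."""
    if not synthetic_vecs:
        raise ValueError("need at least one synthetic question")
    q = normalize(question_vec)
    sims = [
        sum(a * b for a, b in zip(q, normalize(s)))
        for s in synthetic_vecs
    ]
    return sum(sims) / len(sims)
```

A synthetic question pointing in the same direction as the original contributes 1.0 and an unrelated (orthogonal) one contributes 0.0, so one of each averages to 0.5.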

Low score means

The answer contains accurate information but does not directly address the question. The model may have retrieved good context but misread the intent.

High score means

The answer is on-topic and directly responds to what was asked.

Access tiers

RAGScope has three access levels. All tiers can run benchmarks and compare results. The tiers differ in how many runs you can make and where the LLM calls come from.

Tier 1 - Guest

  • 12 strategy runs per day, reset at midnight UTC (selecting all 4 strategies counts as 4 runs)
  • 5 live chat questions per day across all strategies combined
  • 10 MB combined corpus upload limit
  • Uses the RAGScope shared OpenAI API key
  • No account or API key required

The daily limit protects the shared API key quota. It resets every midnight UTC regardless of your local timezone. Enabling or disabling contextual compression does not count as a run.
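
One way to model the guest limit is a counter keyed by guest and UTC calendar date, so the reset at midnight UTC happens implicitly. This is a sketch under assumptions: the class name, the in-memory storage, and the choice to reject a submission that would exceed the limit as a whole are not taken from RAGScope.

```python
from datetime import datetime, timezone
from typing import Optional

GUEST_DAILY_RUNS = 12  # limit stated in the guest tier above


class GuestQuota:
    """Tracks strategy runs per guest per UTC day (in memory, for illustration)."""

    def __init__(self, limit: int = GUEST_DAILY_RUNS) -> None:
        self.limit = limit
        self._used: dict[tuple[str, str], int] = {}

    def try_consume(self, guest_id: str, n_strategies: int,
                    now: Optional[datetime] = None) -> bool:
        """Charge n_strategies runs; return False if that would exceed today's limit."""
        now = now or datetime.now(timezone.utc)
        day = now.astimezone(timezone.utc).date().isoformat()
        key = (guest_id, day)
        used = self._used.get(key, 0)
        if used + n_strategies > self.limit:
            return False
        self._used[key] = used + n_strategies
        return True
```

Toggling contextual compression never reaches `try_consume` in this model, which matches the rule that it does not count as a run.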

Tier 2 - BYOK

  • Unlimited benchmark runs
  • Unlimited chat questions
  • Full corpus size (limited only by the embedding model context window)
  • LangSmith trace export enabled
  • Your API key stays in browser localStorage and is never sent to RAGScope servers
  • Compatible with OpenAI and Anthropic keys

To activate BYOK, click the settings icon in the top navigation bar and paste your API key. You can remove it at any time.

Tier 0 - Developer

  • Unlimited runs with no rate limiting
  • Uses a hashed token in the X-Dev-Token request header
  • Bypasses all daily limits at the backend level
  • Intended for contributors and the project maintainer

Developer access is by invitation. Contact ImtiazX on LinkedIn to request a token.
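
The text above does not say where the hashing happens. The sketch below assumes one plausible arrangement: the client sends the raw token in `X-Dev-Token`, and the backend stores only SHA-256 hashes of issued tokens and compares them in constant time. The function names and that arrangement are assumptions.

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """SHA-256 hex digest of a developer token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def is_developer(headers: dict[str, str], stored_hashes: set[str]) -> bool:
    """Return True when the X-Dev-Token header hashes to a known developer token."""
    token = headers.get("X-Dev-Token")
    if not token:
        return False
    candidate = hash_token(token)
    # compare_digest avoids leaking how many leading characters matched
    return any(hmac.compare_digest(candidate, known) for known in stored_hashes)
```

Storing hashes rather than raw tokens means a leaked token table cannot be replayed directly as a valid header.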

FAQ

How does RAGScope differ from a standalone RAGAS evaluation script?

RAGScope is a comparative harness, not just a scorer. You run multiple retrieval strategies against the same question and corpus in one session and see the ranked result. A standalone RAGAS script gives you a score for one pipeline run. RAGScope gives you scores for all four retrieval methods and a visual comparison.

How many benchmark runs do I get as a guest?

Guest users receive 12 strategy runs per day, reset at midnight UTC. Selecting all 4 retrieval strategies in one submission counts as 4 runs. Enabling or disabling contextual compression does not count as a run and does not affect this limit.

How many live chat questions do I get as a guest?

5 chat questions per day across all strategies combined, reset at midnight UTC. This is a separate limit from benchmark runs. Add your own API key to remove both limits.

Does enabling contextual compression count as an extra run?

No. Contextual compression is a post-retrieval processor, not a retrieval method. Toggling it on or off does not consume a run and does not affect the guest daily limit.

Is contextual compression a fifth retrieval strategy?

No. There are exactly 4 retrieval methods: Naive RAG, HyDE, Multi-Query, and Hybrid BM25+Dense. Contextual compression is a separate orthogonal post-retrieval step that can be applied on top of any of those 4 methods. It is not in the retrieval registry and does not appear alongside the 4 methods in benchmarking.

Does RAGScope store my uploaded documents?

Yes. Document chunks and their embeddings are stored in a Postgres database with the pgvector extension. They are keyed by a SHA-256 hash of the original file bytes. Uploading the same files twice returns the cached corpus without re-processing.
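
The caching behaviour can be sketched as a lookup keyed by the SHA-256 of the file bytes. The class below uses an in-memory dict in place of Postgres, and `CorpusCache`, `ingest`, and the `process` callable are hypothetical names.

```python
import hashlib
from typing import Callable


class CorpusCache:
    """Maps SHA-256(file bytes) to processed chunks (dict stands in for Postgres)."""

    def __init__(self) -> None:
        self._by_hash: dict[str, list[str]] = {}

    def ingest(self, file_bytes: bytes,
               process: Callable[[bytes], list[str]]) -> tuple[list[str], bool]:
        """Return (chunks, cache_hit); `process` runs only on a cache miss."""
        key = hashlib.sha256(file_bytes).hexdigest()
        if key in self._by_hash:
            return self._by_hash[key], True
        chunks = process(file_bytes)
        self._by_hash[key] = chunks
        return chunks, False
```

Because the key is derived from content, renaming a file does not cause re-processing, while changing a single byte does.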

What API key is used for guest evaluation runs?

Guest runs use a shared OpenAI API key provisioned for the RAGScope service. The key is never exposed to the browser. Guest users are limited to 12 strategy runs per day to protect the shared quota.

Can I add a new retrieval strategy without changing the backend?

Yes. Create a class in backend/retrieval/ that extends BaseRetriever, implement the retrieve() method, and decorate with @register. The API and frontend discover it automatically via the registry. No other files need to change.
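
The registry pattern described above might be sketched as follows. Only the names `BaseRetriever`, `retrieve`, and `register` come from the text; the `REGISTRY` dict, the `name` attribute, the method signature, and the example `KeywordRetriever` are assumptions for illustration.

```python
from abc import ABC, abstractmethod

# Assumed registry shape: strategy name -> retriever class
REGISTRY: dict[str, type] = {}


class BaseRetriever(ABC):
    name: str  # assumed identifier used by the API and frontend

    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        """Return up to k chunks relevant to the query."""


def register(cls: type) -> type:
    """Class decorator that adds a retriever to the registry under its name."""
    REGISTRY[cls.name] = cls
    return cls


@register
class KeywordRetriever(BaseRetriever):
    """Hypothetical example strategy: rank chunks by shared words with the query."""

    name = "keyword"

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(c.lower().split())), c) for c in self.chunks]
        ranked = sorted((p for p in scored if p[0] > 0),
                        key=lambda p: p[0], reverse=True)
        return [chunk for _, chunk in ranked[:k]]
```

Anything that iterates over `REGISTRY` picks up the new class as soon as its module is imported, which is what lets the rest of the code stay unchanged.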

How accurate are the RAGAS scores?

RAGAS uses GPT-4o-mini as an LLM judge. LLM-as-judge evaluations correlate well with human judgement on average but have variance on individual items. Treat scores as directional signals across strategies, not as absolute ground truth. Consistent differences of more than 0.1 between strategies are meaningful; differences smaller than that may be noise.

Is BYOK key usage logged anywhere?

No. BYOK keys are stored exclusively in your browser localStorage and used directly in API calls made from your browser. The RAGScope backend never receives or logs your key. You can verify this by inspecting the network tab in your browser developer tools.