Hybrid RAG and Long-context

Code for our preprint, Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization.

Setup

Data

We provide the HuggingFace dataset loading script in artifacts/misc/summhay/. Before loading the dataset, use the original release to download the source data (see README).

Experiments

See src/configs/ for the full list of retrievers, summarizers and datasets used in our experiments.

Downloading models

Download retriever and embedding models from HuggingFace using bash_scripts/download.slurm.

Baselines

The script to run baselines is provided at bash_scripts/optimal_baselines.sh. It covers the full-context baseline and RAG baselines based on RULER and HELMET task-based estimates. For HELMET, we use two task-based estimates: LongQA and summarization. We use two retrievers: GTE-Qwen-1.5B and GTE-Qwen-7B.

Proposed

To get optimal estimates for retrieval context length, we use a randomly sampled subset of the evaluation data. We replace the gold summaries in the HuggingFace dataset with silver candidates from our LLM panel (see src/sample_examples.py).
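
The replacement step can be pictured as follows. This is a minimal sketch, not the code in src/sample_examples.py: the field names (`id`, `summary`, `summaries`), the function name `build_silver_subset`, the panel model names, and the sample size and seed are all assumptions made for illustration.

```python
import random


def build_silver_subset(examples, panel_outputs, sample_size, seed=0):
    """Sample examples and swap each gold summary for silver candidates.

    examples: list of dicts with hypothetical keys "id" and "summary".
    panel_outputs: dict mapping a panel model name to {example id: summary}.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = rng.sample(examples, min(sample_size, len(examples)))
    subset = []
    for example in sampled:
        # Collect one silver candidate per panel model that covers this example.
        silver = [
            outputs[example["id"]]
            for outputs in panel_outputs.values()
            if example["id"] in outputs
        ]
        # Copy the record, drop the gold summary, attach the silver candidates.
        record = {k: v for k, v in example.items() if k != "summary"}
        record["summaries"] = silver
        subset.append(record)
    return subset


# Toy data (hypothetical): ten examples and a two-model panel.
examples = [{"id": i, "summary": f"gold {i}"} for i in range(10)]
panel_outputs = {
    "panel_model_a": {i: f"a-{i}" for i in range(10)},
    "panel_model_b": {i: f"b-{i}" for i in range(10)},
}
subset = build_silver_subset(examples, panel_outputs, sample_size=3)
```

Keeping every panel output as a separate candidate, rather than merging them, lets the scoring step pool over multiple references later.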

For the full experimental setup, see bash_scripts/all_scripts.sh. This includes the following steps:

- sample a subset of the evaluation dataset for context length estimation
- run retrieval using GTE embeddings
- collect silver references using an LLM panel
- estimate the optimal context length by evaluating systems (retriever + summarizer) on the randomly sampled subset (bash_scripts/estimate.sh)
- run prediction on the full evaluation dataset using the optimal context length
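
The example commands in this README all pass a configuration name built from the summarizer, the retrieval context length, and the retriever. A small sketch of that naming pattern, assuming it is formed exactly as in those commands (`Flat_<context>_GTE_1p5B` for retrieval, with the summarizer name prefixed for prediction); the helper names `retriever_config` and `system_config` are hypothetical and do not come from the repository:

```python
# Context lengths swept in the example commands: 8K, 16K, ..., 80K.
CONTEXTS = [f"{k}K" for k in range(8, 81, 8)]


def retriever_config(context, retriever="GTE_1p5B"):
    """Name of a retrieval configuration, e.g. Flat_8K_GTE_1p5B."""
    return f"Flat_{context}_{retriever}"


def system_config(summarizer, context, retriever="GTE_1p5B"):
    """Name of a retriever + summarizer system configuration."""
    return f"{summarizer}_{retriever_config(context, retriever)}"


configs = [system_config("Qwen25_7B", c) for c in CONTEXTS]
print(configs[0])
```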

Below, we provide the outline for one experimental configuration (Qwen 2.5 7B summarizer with GTE 1.5B retriever).

# NOTE: in our experiments, we compare against full context and baseline RAG setups
# therefore, we first run retrieval and get silver references on the full dataset before sampling a subset.
# however, this order can be reversed by first sampling a subset before running retrieval and getting silver references.

# first, retrieval using GTE 1.5B embeddings
DATASET=SummHay; SPLIT=test; for CONTEXT in 8K 16K 24K 32K 40K 48K 56K 64K 72K 80K; do bash bash_scripts/retrieve.sh Flat_${CONTEXT}_GTE_1p5B $DATASET $SPLIT; done;

# now we prepare silver references using our LLM panel
DATASET=SummHay; SPLIT=test; bash bash_scripts/sum.sh Llama33_70B $DATASET $SPLIT;
DATASET=SummHay; SPLIT=test; bash bash_scripts/sum.sh Qwen25_14B_1M $DATASET $SPLIT;
DATASET=SummHay; SPLIT=test; bash bash_scripts/sum.sh Qwen25_72B_128K $DATASET $SPLIT;
DATASET=SummHay; SPLIT=test; bash bash_scripts/sum.sh Jamba15Mini $DATASET $SPLIT;
DATASET=SummHay; SPLIT=test; bash bash_scripts/sum.sh ProLong512K $DATASET $SPLIT;

# now, we sample a subset of the evaluation dataset for context length estimation
# for each retriever context length, we generate a new dataset (SummHay_5SysPool_Silver) by replacing the gold summaries with the silver candidates from our LLM panel
DATASET=SummHay; SPLIT=test; for CONTEXT in 8K 16K 24K 32K 40K 48K 56K 64K 72K 80K; do bash bash_scripts/sample.sh Flat_${CONTEXT}_GTE_1p5B $DATASET $SPLIT; done;

# prediction on the sampled subset
# for this illustration, we use Qwen25_7B as the summarizer
DATASET=SummHay_5SysPool_Silver; SPLIT=test; for MODEL in Qwen25_7B; do for CONTEXT in 8K 16K 24K 32K 40K 48K 56K 64K 72K 80K; do bash bash_scripts/sum.sh ${MODEL}_Flat_${CONTEXT}_GTE_1p5B $DATASET $SPLIT; done; done;
# score against the silver references
# we use max pooling over target summaries
DATASET=SummHay_5SysPool_Silver; SPLIT=test; POOLING=max; for MODEL in Qwen25_7B; do for CONTEXT in 8K 16K 24K 32K 40K 48K 56K 64K 72K 80K; do bash bash_scripts/score.sh ${MODEL}_Flat_${CONTEXT}_GTE_1p5B $DATASET $SPLIT $POOLING; done; done;

# collect scores
EMB=GTE_1p5B; FILEPATH=outputs/SummHay_5SysPool_Silver_${EMB}_test_scores.csv
head -n1 outputs/SummHay_5SysPool_Silver_Qwen25_7B_Flat_8K_GTE_1p5B_test_scores.txt | sed 's/ \+/,/g' > $FILEPATH
DATASET=SummHay_5SysPool_Silver; SPLIT=test; for MODEL in Qwen25_7B; do for CONTEXT in 8K 16K 24K 32K 40K 48K 56K 64K 72K 80K; do sed 's/ \+/,/g' <(tail -n1 outputs/${DATASET}_${MODEL}_Flat_${CONTEXT}_GTE_1p5B_${SPLIT}_scores.txt) >> $FILEPATH; done; done;

# get estimates
python src/get_estimates.py \
    --file-path outputs/SummHay_5SysPool_Silver_GTE_1p5B_test_scores.csv

# this gives the script to run on the full evaluation dataset
cat outputs/SummHay_5SysPool_Silver_GTE_1p5B_test_scores_most_efficient.sh;
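
To make the scoring and selection steps concrete, here is a minimal sketch under stated assumptions. `toy_score` is a stand-in unigram-F1 metric, not the metric used by bash_scripts/score.sh, and the selection rule in `pick_context` (highest mean score, ties going to the shorter context) is only one plausible choice. The actual logic lives in src/get_estimates.py and may differ.

```python
from collections import Counter
from statistics import mean


def toy_score(candidate, reference):
    """Unigram F1 between two strings (illustrative stand-in metric)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def max_pooled_score(candidate, references):
    """Max pooling: keep the best score over all silver references."""
    return max(toy_score(candidate, ref) for ref in references)


def pick_context(scores_by_context):
    """Return the context length (in K tokens) with the highest mean score.

    Ties are broken toward the shorter context (an assumption).
    """
    means = {ctx: mean(scores) for ctx, scores in scores_by_context.items()}
    return min(means, key=lambda ctx: (-means[ctx], ctx))


# Toy inputs (hypothetical): one candidate summary, two silver references.
references = ["the cat sat on the mat", "a dog barked"]
pooled = max_pooled_score("the cat sat", references)

# Per-example pooled scores for three context lengths (made-up numbers).
best = pick_context({8: [0.40, 0.50], 16: [0.50, 0.60], 24: [0.60, 0.50]})
```

With max pooling, a system summary is rewarded for matching any one panel member closely, so it is not penalized for disagreeing with the others.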

License

This project is licensed under the MIT License. See the LICENSE file.

Reference

If you find this work useful, please consider citing our paper.

@misc{pratapa-mitamura-2025-estimating,
    title={Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization},
    author={Adithya Pratapa and Teruko Mitamura},
    year={2025},
    eprint={2504.12972},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.12972},
}
