AI Engineer · Agentic RAG & Reranking · LLM Fine-Tuning & RL · Domain-Specific AI
I work on LLM systems for domain-specific applications in Finance, Bio-Medical, and Legal AI, spanning retrieval, agents, and model training. I’ve contributed to Haystack, MTEB, HuggingFace, and scikit-learn, and co-authored MMTEB, published at ICLR 2025. I develop open-source AI at AVNLP.
Developing Open-Source AI @ AVNLP
| Repository | Description |
|---|---|
| BioThink | Self-Reflective Bio-Medical Question Answering system - trains Qwen3-1.7B with QLoRA + GRPO using 5 custom reward functions (Relevance, Grounding, Utility token enforcement + XML structure + GEval correctness); evaluated across 7 metrics including faithfulness and answer correctness via LLM-as-a-Judge. |
| LLM-Finetuning | Fine-tuning pipelines covering SFT, DPO, ORPO, KTO, and PPO; comparative benchmarking of QLoRA, LoRA, DoRA, P-Tuning, and Prefix-Tuning across ARC, FactScore, TriviaQA, and PopQA. |
| GRPO | GRPO implementations comparing reward functions (format/correctness), training frameworks (DeepSpeed and PyTorch), and reference-model handling strategies. |
| RAG-Model-Training | Fine-tuning LLMs for 6 RAG paradigms - Adaptive-RAG, Corrective RAG, RQ-RAG, Self-RAG, Agentic RAG, ReZero - via SFT and GRPO; uses Llama-3.2 and Llama-3-8B across finance, biomedical, and open-domain QA datasets. |
| Repository | Description |
|---|---|
| RAG-Pipelines | Agentic RAG pipelines with metadata enrichment, contextual reranking and structured generation. |
| DSPy-Optimizers | DSPy-based RAG optimization framework using MIPRO, COPRO, and BootstrapFewShot on FreshQA, HotpotQA, TriviaQA, and PubMedQA. |
| VectorDB | Production Haystack and LangChain pipelines for Hybrid Search, Parent-Child Retrieval, MMR, Metadata Filtering, Multi-Tenancy, and Re-ranking across Pinecone, Weaviate, Milvus, Qdrant, and Chroma - with benchmarks on TriviaQA, ARC, PopQA, FactScore, and Earnings Calls. |
| Repository | Description |
|---|---|
| LLM Rankers | LLM re-ranking library for IR and RAG. Implements Pairwise, Setwise, and Listwise ranking with RankZephyr and RankLlama; supports sliding windows, efficient sorting, and zero-shot inference. |
| Pairwise Ranking Prompting | Zero-shot pairwise reranking library (Heapsort, Sliding Window, All-Pairs strategies) with bidirectional comparison for position-bias mitigation; Pydantic-validated. |
| Reciprocal Rank Fusion and LLM Rankers | Hybrid retrieval with Reciprocal Rank Fusion (RRF); evaluates Diversity, Lost-in-the-Middle, and Similarity rankers against the BEIR suite (NDCG, MAP, Recall, Precision). |
| LLM-Blender | Ensembling framework combining PairRanker (pairwise ranking) and GenFuser (output merging) to synthesize superior responses from multiple open-source models. |
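
As a rough illustration of the Reciprocal Rank Fusion step mentioned above, here is a minimal sketch. It is not the repository's implementation: the function name, the document IDs, and the example rankings are hypothetical, and `k=60` is only the commonly used default constant. Each document's fused score is the sum of `1 / (k + rank)` over every ranking in which it appears.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed default; tune per use case).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword retriever and a dense retriever.
keyword_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([keyword_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists (`d1`, `d3`) end up ahead of documents that appear in only one, which is the behavior hybrid retrieval relies on.
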
| Project | Contributions |
|---|---|
| Haystack | Evaluation Framework: Designed and built Haystack's pipeline evaluation from scratch - StatisticalEvaluator, EvaluationResult, and six metrics (Exact Match, F1, Semantic Answer Similarity, Recall, MRR, and MAP); HuggingFace TEI Embedders: Components supporting self-hosted Docker, free Inference API, and paid HF Inference Endpoints; Diversity Ranker: Document reranker optimizing for maximum semantic diversity via sentence-transformer embeddings. |
| Haystack Core Integrations | INSTRUCTOR Embedders: Task- and domain-specific embedding components with instructable prompt prefixes; HF Optimum: Embedding inference with ONNX and TensorRT runtimes; Llama.cpp Generator: Text generation with quantized models; Pinecone: Vector DB integration with advanced metadata filtering. |
| voyage-embedders-haystack | Haystack integration for Voyage AI embedding and reranking models |
| MTEB | LegalBench: Added the complete LegalBench benchmark suite - 160+ legal domain classification and retrieval datasets; integrated Japanese embedding benchmarks JMTEB and JSICK. |
| HuggingFace Transformers | BioGPTForSequenceClassification implementation; ViT pre-training scripts without the Trainer class; HuggingFace Evaluate + scikit-learn integration docs. |
| scikit-learn, imbalanced-learn | Out-of-bag scores for Gradient Boosting; sparse matrix support for Silhouette Score; multi-class Average Precision (One-vs-Rest). |
MMTEB: Massive Multilingual Text Embedding Benchmark (ICLR 2025)
Largest multilingual text embedding benchmark: 500+ tasks across 250+ languages and 10 task categories. Contributed the complete LegalBench suite - 160+ legal domain classification and retrieval datasets.




