Arjun Jagdale - AI Engineer & Open Source Contributor

👋 About Me

I'm Arjun Jagdale, an AI engineer who turns research into production-ready ML systems. I work at the intersection of deep learning research and production engineering, building everything from anti-spoofing CNNs to parameter-efficient transformers, while actively contributing to core Hugging Face libraries.

class ArjunJagdale:
    def __init__(self):
        self.role = "AI Engineer & Open Source Contributor"
        self.focus_areas = ["Deep Learning", "Computer Vision", "NLP", "MLOps"]
        self.current_work = "Contributing to Hugging Face core libraries"
        self.interests = ["RAG Systems", "Model Compression", "Cloud-Native ML"]
    
    def get_expertise(self):
        return {
            "frameworks": ["PyTorch", "HuggingFace", "scikit-learn", "TensorFlow"],
            "specializations": ["Parameter-Efficient Fine-Tuning", "CNNs", "Transformers"],
            "cloud": ["IBM Cloud", "Google Cloud", "Docker", "Kubernetes"],
            "tools": ["LangChain", "LlamaIndex", "Gradio", "Git"]
        }

🔥 What I'm Working On

🚀 Contributing to Hugging Face — datasets & dataset-viewer libraries (7 merged PRs)
🧠 Research — Published paper on Retrieval-Augmented Systems with Dynamic Learning
🛠️ Building — Production ML pipelines with real-time inference and GPU optimization
📚 Learning — Parameter-efficient methods, vision-language models, cloud-native deployments

🛠️ Tech Arsenal

Languages & Core

🐍 Python

📜 JavaScript

ML & AI Frameworks

🔥 PyTorch

🤗 HuggingFace

📊 scikit-learn

🧠 TensorFlow

🦜 LangChain

🦙 LlamaIndex

Cloud & DevOps

☁️ IBM Cloud

🌐 Google Cloud

🐳 Docker

☸️ Kubernetes

🌟 Open Source Contributions

7 Merged PRs • 2 Repositories • 500+ Lines Changed

📦 huggingface/datasets

MERGED

#7831 • Fix ValueError in train_test_split with NumPy 2.0+

Resolved compatibility issue with NumPy 2.0+ by wrapping stratify column array access with np.asarray(). Maintains backward compatibility with NumPy 1.x while fixing array copy errors.

bug-fix compatibility numpy
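
A minimal sketch of the idea behind this fix, not the actual patch: NumPy 2.0 made `np.array(obj, copy=False)` raise a `ValueError` whenever a copy cannot be avoided, while `np.asarray(obj)` copies only when it has to and behaves the same on NumPy 1.x. The helper name `to_label_array` and the sample labels are hypothetical.

```python
import numpy as np


def to_label_array(column):
    """Return the column as an ndarray, copying only when NumPy has to."""
    # np.asarray converts lists and other array-likes, and hands an existing
    # ndarray back unchanged; it does not raise when a copy is required.
    return np.asarray(column)


# Hypothetical stratify column: class labels stored as a plain Python list.
labels = to_label_array([0, 1, 0, 1, 1, 0])
classes, counts = np.unique(labels, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 3, 1: 3}
```

The per-class counts are the kind of information a stratified split needs from that column.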

MERGED

#7648 • Fix misleading docstring examples across multiple methods

Updated docstrings for add_column(), select_columns(), select(), filter(), shard(), and flatten() to clarify that these methods return new datasets instead of modifying in-place. Significantly improves API documentation clarity.

documentation api-improvement datasets

MERGED

#7623 • Fix: Raise error when data_dir and data_files are missing

Added validation check in FolderBasedBuilder to prevent silent fallback to current directory when loading folder-based datasets without required parameters. Improves user experience by catching errors early.

bug-fix validation datasets
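
The check can be pictured roughly as follows. This is an illustrative stand-in, not the code in `FolderBasedBuilder`; the function name `check_folder_inputs` and the error message are assumptions.

```python
from typing import Optional, Sequence, Union


def check_folder_inputs(
    data_dir: Optional[str],
    data_files: Union[None, str, Sequence[str]],
) -> None:
    """Fail fast when a folder-based dataset has nothing to load from."""
    # Without this guard a loader could silently fall back to the current
    # working directory, which is rarely what the caller intended.
    if not data_dir and not data_files:
        raise ValueError(
            "Folder-based datasets need at least one of `data_dir` or `data_files`."
        )


check_folder_inputs("images/", None)         # passes: data_dir is set
check_folder_inputs(None, ["train/a.png"])   # passes: data_files is set

try:
    check_folder_inputs(None, None)
except ValueError as err:
    print(f"Rejected early: {err}")
```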

🔍 huggingface/dataset-viewer

MERGED

#3223 • Add support for Date features in Croissant schema

Implemented support for Date, UTCDate, and UTCTime features in Croissant schema generation. Automatically infers correct dataType (sc:Date, sc:Time, or sc:DateTime) based on format string.

feature croissant schema
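
One way to infer the dataType from a format string is to look at which strftime-style directives it contains. The sketch below is a guess at the general approach, not the merged implementation; the function name, the directive lists, and the `sc:Date` fallback are all assumptions.

```python
# strftime-style directives grouped by the kind of information they carry.
DATE_DIRECTIVES = ("%Y", "%y", "%m", "%d", "%j", "%b", "%B")
TIME_DIRECTIVES = ("%H", "%I", "%M", "%S", "%f", "%p")


def infer_croissant_datatype(fmt: str) -> str:
    """Map a format string to sc:Date, sc:Time, or sc:DateTime."""
    has_date = any(directive in fmt for directive in DATE_DIRECTIVES)
    has_time = any(directive in fmt for directive in TIME_DIRECTIVES)
    if has_date and has_time:
        return "sc:DateTime"
    if has_time:
        return "sc:Time"
    # Assumed fallback: treat anything without time directives as a date.
    return "sc:Date"


for fmt in ("%Y-%m-%d", "%H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
    print(fmt, "->", infer_croissant_datatype(fmt))
```

The substring checks are case-sensitive, so `%m` (month) and `%M` (minute) land in different groups.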

MERGED

#3219 • Refactor: Replace get_empty_str_list with CONSTANT.copy

Eliminated shared mutable default values in dataclass fields by replacing helper functions with explicit constant copies. Makes configuration behavior more explicit and prevents subtle bugs.

refactor best-practices config
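
The pattern looks roughly like this in isolation. `DEFAULT_BLOCKED_DATASETS`, `ExampleConfig`, and the field name are hypothetical and only illustrate passing a constant's bound `.copy` method as the `default_factory`.

```python
from dataclasses import dataclass, field
from typing import List

# Module-level constant that documents the default explicitly.
DEFAULT_BLOCKED_DATASETS: List[str] = []


@dataclass
class ExampleConfig:
    # The bound method `.copy` is called once per instance, so every config
    # object gets its own fresh list instead of sharing one mutable default.
    blocked_datasets: List[str] = field(default_factory=DEFAULT_BLOCKED_DATASETS.copy)


first = ExampleConfig()
second = ExampleConfig()
first.blocked_datasets.append("user/private-set")

print(second.blocked_datasets, DEFAULT_BLOCKED_DATASETS)  # [] []
```

Mutating one instance leaves both the other instance and the constant untouched.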

MERGED

#3218 • Test: Add unit tests for get_previous_step_or_raise

Implemented comprehensive unit tests for cache retrieval function covering successful cache hits, missing cache scenarios, and error status handling. Improves code coverage and reliability.

testing unit-tests coverage

MERGED

#3206 • Refactor: Use HfApi.update_repo_settings for gated datasets

Removed redundant custom implementations of update_repo_settings() across test utilities by leveraging official huggingface_hub API. Cleaned up 222 lines of code while maintaining full functionality.

refactor code-cleanup testing

View All Contributions →

🚀 Featured Projects

🎭 Anti-Spoof Face Classification

Custom CNN Architecture for Real vs. Spoof Detection

results = {
    "validation_accuracy": "99.39%",
    "test_accuracy": "99.29%",
    "dataset_size": "38K images",
    "parameters": "4.88M"
}

Built SpoofNet from scratch with BatchNorm and Dropout2d
Real-time inference with MediaPipe + OpenCV on GPU
Production-ready with only 2 false negatives and 40 false positives

Tech: PyTorch • OpenCV • MediaPipe • Gradio

🔬 LoRA Fine-Tuning for BERT

Low Rank Adaptation Fine-Tuning for Named Entity Recognition

efficiency = {
    "parameter_reduction": "99.49%",
    "trainable_params": "641,347",
    "f1": "0.6744",
    "precision": "0.6539",
    "speedup": "750x faster training"
}

Fine-tuned RoBERTa-base for Named Entity Recognition on the Few-NERD dataset
Reduced trainable parameters by 194× while maintaining a test F1 of 0.67 and a test loss of 0.24
Deployed the model end-to-end on an AWS EC2 t3.micro instance

Tech: PyTorch • HuggingFace • LoRA • NER
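
The LoRA configuration is not stated above, so the following is a back-of-the-envelope check under assumptions: rank 16 adapters on the query and value projections of all 12 RoBERTa-base layers, plus a trainable classification head over 67 labels (66 fine-grained Few-NERD types and "O"). Those settings happen to reproduce the 641,347 trainable parameters quoted above, but they are a guess, not a record of the actual run.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one weight matrix: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 768            # RoBERTa-base hidden size
NUM_LAYERS = 12         # RoBERTa-base encoder layers
RANK = 16               # assumed LoRA rank
ADAPTED_PER_LAYER = 2   # assumed targets: query and value projections
NUM_LABELS = 67         # assumed: 66 fine-grained entity types + "O"

adapter_params = NUM_LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
head_params = HIDDEN * NUM_LABELS + NUM_LABELS  # classifier weights + biases
trainable = adapter_params + head_params

print(adapter_params, head_params, trainable)  # 589824 51523 641347
```

Multiplying 641,347 by the quoted 194× gives about 124.4M parameters, which is in the range usually cited for RoBERTa-base (about 125M).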

💬 YouTube RAG Chatbot

Semantic Retrieval & QA Over YouTube Comment Corpora

pipeline = {
    "embedding_model": "all-MiniLM-L6-v2",
    "index": "FAISS",
    "retriever_k": 10,
    "splitter": "RecursiveCharacterTextSplitter"
}

RAG pipeline built with LangChain using scraped YouTube comments
FAISS vector search with HuggingFaceEmbeddings for fast semantic lookup
Custom RetrievalQA chain with tuned prompts + ChatOpenAI (OpenRouter)

Tech: LangChain • FAISS • HuggingFaceEmbeddings • RAG • OpenRouter
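
To show what the retrieval step does conceptually, here is a brute-force cosine-similarity search in NumPy. It is a stand-in for the FAISS index, not the project's code: the function name `top_k` and the 3-dimensional toy vectors are hypothetical, and real sentence embeddings would be much higher-dimensional.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=10):
    """Return indices and cosine scores of the k documents closest to the query."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalise so that a dot product equals cosine similarity.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = docs @ query
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order.tolist(), scores[order].tolist()


# Toy "comment embeddings"; index 2 points almost the same way as index 0.
comments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
indices, scores = top_k([1.0, 0.0, 0.0], comments, k=2)
print(indices)  # [0, 2]
```

A vector index serves the same purpose at scale; the retrieved chunks are then passed to the QA chain as context.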

🌐 Multi-URL RAG Pipeline

Graph-Orchestrated Retrieval & QA Across Arbitrary Web Sources

workflow = {
    "orchestrator": "LangGraph StateGraph",
    "memory": "MemorySaver",
    "vector_store": "FAISS",
    "ui": "Gradio (configurable lengths)"
}

Conditional routing + retry logic via LangGraph StateGraph
Chunking with RecursiveCharacterTextSplitter & FAISS-based retrieval
LCEL pipeline with ChatOpenAI + prompt templates for adaptive responses

Tech: LangChain • LangGraph • FAISS • HuggingFaceEmbeddings • Gradio • RAG

🎬 LSTM Sentiment Analysis

Bidirectional Sequence Modeling for IMDB Reviews

performance = {
    "test_accuracy": "85.19%",
    "train_test_gap": "4%",
    "architecture": "2-layer BiLSTM",
    "embedding_dim": 128
}

Bidirectional LSTM with 50% dropout
Gradient clipping for stable training
Gradio deployment for real-time inference

Tech: PyTorch • LSTM • Gradio • NLP
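
Gradient clipping by global norm can be sketched in plain Python. This is a simplified illustration of the idea behind `torch.nn.utils.clip_grad_norm_`, not the project's training loop; the function name and the toy gradient values are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Already within bounds: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm


# One parameter group whose gradient has norm 5.0 (a 3-4-5 triangle).
clipped, norm_before = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its magnitude is capped, which is what keeps recurrent training from diverging on an occasional large gradient.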

🍷 Wine Quality Prediction

Classical ML with Advanced Feature Engineering

model_comparison = {
    "Random_Forest": "97.2%",
    "Logistic_Regression": "96.5%",
    "SVM": "95.8%",
    "after_tuning": "+1-2%"
}

Comprehensive EDA with correlation heatmaps
GridSearchCV hyperparameter optimization
Multi-class classification on 178 samples

Tech: scikit-learn • Pandas • Matplotlib • NumPy
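
A small sketch of the GridSearchCV step, assuming the data is the classic 178-sample, 3-class Wine dataset that ships with scikit-learn (the project above does not name its data source). The pipeline, parameter grid, and split settings are illustrative choices, not the project's actual configuration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed dataset: 178 samples, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling inside the pipeline keeps each CV fold free of test-fold statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

With so few samples, a single held-out split is noisy, so the cross-validated score in `search.best_score_` is worth reading alongside the test accuracy.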

📚 Research & Publications

Retrieval-Augmented System with Dynamic Learning from Web Content

Published research on RAG systems that dynamically learn from web content, combining retrieval mechanisms with adaptive learning strategies for improved information access and knowledge synthesis.

🎓 Certifications

🏆 IBM Machine Learning Specialist

🔥 PyTorch Certification

📚 Git & GitHub

💬 Let's Connect

Building something interesting? I'm always open to collaborating on ML research, open source contributions, or production ML systems.

💼 LinkedIn ✉️ Email Me 📁 Portfolio 💻 LeetCode

💡 Currently Exploring: RAG Systems • Model Compression • Cloud-Native ML Deployments

📍 Location: Pune, Maharashtra, India

🎓 Education: B.E. Electronics & Telecommunication @ Savitribai Phule Pune University

Made with ❤️ by Arjun Jagdale