soumilshah1995/README.md

👋 Hey, I'm Soumil Nitin Shah

Lead Software Engineer • Data Lakes Expert • Lakehouse Architect • Tech Educator

Website • LinkedIn • YouTube • Medium • GitHub • Email


🚀 About Me

I'm a Lead Software Engineer at Zeta Global and a recognized Data Lakes & Lakehouse Architecture Expert with 6+ years of hands-on experience building production-grade data platforms. I specialize in Apache Hudi, Apache Iceberg, AWS EMR, and Spark, architecting solutions that process terabytes of data daily while achieving significant cost reductions.

"Making sophisticated data engineering accessible to everyone through code, content, and community"

🏆 Key Achievements

🏗️ Creator of LakeBoost: a production framework integrating Apache Hudi with AWS Glue ETL, powering 50+ tables with 466 daily jobs (11,832 monthly), with a 40% reduction in operational overhead and 4-5x cost savings

📝 Featured on AWS Storage Blog: How Zeta Global scales multi-tenant data ingestion with Amazon S3 Tables

💰 Delivered $200K+ Cost Savings: led Lakehouse adoption, achieving ~50% cost reduction while processing 55TB+/month with sub-10-minute refresh latency

⚡ Performance Optimization Expert: reduced data processing from hours to minutes; achieved 80x faster searches through Elasticsearch optimization (4.4s → 125ms)


💼 What I Do at Zeta Global

🎯 Role & Impact

interface Engineer {
  role: "Lead Software Engineer";
  focus: ["Data Lakes", "Lakehouse", "Platform"];
  stack: ["Hudi", "Iceberg", "Spark", "AWS"];
}

Leading enterprise-scale data transformation:

  • 🏗️ Architecting Lakehouse-as-a-Service platform
  • 📊 Processing 60-120 GB/hour → 55TB+/month
  • 🗃️ Managing 10,000+ Iceberg tables in production
  • ⚡ Hours → Minutes processing (4-5x cost reduction)
  • 💰 $200K+ savings through platform optimization

🚀 Technical Achievements

metrics = {
    "data_volume": "1.3TB daily | 55TB monthly",
    "table_count": "10,000+ Iceberg tables",
    "job_throughput": "466 daily | 11,832 monthly",
    "cost_reduction": "~50% | $200K+ saved",
    "performance": "2× faster ingestion",
    "latency": "<10 min refresh"
}

Core expertise:

  • Multi-tenant data ingestion at scale
  • Incremental ETL pipeline architecture
  • Real-time analytics & streaming processing
  • Cost-efficient cloud-native solutions

🛠️ Tech Stack & Expertise

Data Lakehouse & Processing

Apache Hudi • Apache Iceberg • Apache Spark • Delta Lake

AWS Services

AWS • AWS Glue • Amazon EMR • S3 • Step Functions

Languages & Tools

Python • PySpark • SQL • Kafka • Elasticsearch


📺 Content Creator & Educator: Making Data Engineering Accessible

🌟 2025 Community Impact

| Platform | Reach & Engagement |
| --- | --- |
| 🎬 YouTube | 46,188 Subscribers • 432.9K Views • 16.4K Hours Watch Time |
| 📹 Videos Created | 1,800+ Tutorials (3.9M Impressions, 5.5% CTR) |
| ✍️ Medium Blog | 422 Followers • 18.4K Presentations • 5.2K Views • 3K Reads |
| 📝 Blog Posts | 100+ Articles on Data Engineering |
| 💼 LinkedIn | 10,922 Followers • 1,000+ Posts |
| 📂 GitHub | 300+ Open Source Repositories |
| 🌐 Website | Technical Resources & Portfolio |
| ⏱️ Avg View Duration | 2:21 minutes (High engagement) |

📚 Content I Create & Teach

  • 🏗️ Data Lake & Lakehouse Architecture: Apache Hudi, Iceberg, Delta Lake
  • ☁️ AWS Big Data Engineering: Glue, EMR, S3, Step Functions, Lambda
  • ⚡ Real-time Data Pipelines: Kafka, Kinesis, Streaming ETL
  • 🔧 Performance & Cost Optimization: Query tuning, incremental processing
  • 🎓 End-to-End Production Projects: From ingestion to analytics
  • 💾 Data Management Best Practices: Table formats, partitioning, clustering
  • 🐍 PySpark & Python: Distributed computing, big data processing

🎯 Mission

Creating 1,800+ videos and 100+ blog posts to democratize data engineering knowledge and help engineers worldwide build better data platforms.


🏆 Featured Work, Publications & Impact

📰 AWS Storage Blog Feature

How Zeta Global scales multi-tenant data ingestion with Amazon S3 Tables

Deep dive into production architecture handling 10,000+ Iceberg tables, 55TB+ monthly processing, and multi-tenant data ingestion at scale with sub-10-minute refresh latency

📝 Recent Technical Articles on Medium


🎓 Education & Continuous Learning

| Degree | Field | Institution | Achievement |
| --- | --- | --- | --- |
| 🎓 M.S. | Electrical Engineering | University of Bridgeport | 4.0 GPA |
| 🎓 M.S. | Computer Engineering | University of Bridgeport | Best Academic Award |
| 🎓 B.S. | Electronic Engineering | K.J. Somaiya Institute | |

🏅 Honors: Best Academic Achievement Award (4.0 GPA) • 3rd Place UB Hackathon • The Builder Award


🤝 Let's Connect & Collaborate

💬 I'm Available to Discuss

Data Lakes & Lakehouses • Apache Hudi & Iceberg • AWS Data Platforms • Spark Optimization • Multi-tenant Architecture • Cost Optimization • Content Creation • Technical Speaking • Open Source Collaboration • Mentorship & Career Guidance

🎤 Open to Speaking & Collaboration Opportunities

I'm actively seeking speaking engagements, technical collaborations, content partnerships, and consulting opportunities around:

  • Data Engineering & Lakehouse Architecture
  • AWS Big Data Technologies & Cost Optimization
  • Building Production Data Platforms at Scale
  • Content Creation & Developer Education

📞 Get in Touch

Email • LinkedIn • YouTube • Medium • Website • GitHub

Location: New York City Metropolitan Area


⚡ Fun Facts About My Journey

  • 📹 Created 1,800+ educational videos in a single year, an average of about 5 videos per day
  • 🎬 Accumulated 432.9K views and 16.4K hours of watch time helping engineers worldwide
  • 📝 Authored 100+ technical blog posts reaching 18.4K presentations and 5.2K views
  • 💻 Built 300+ GitHub repositories with production-ready code and tutorials
  • 🌍 Teaching data engineering in 4 languages: English, Hindi, Gujarati, and Marathi
  • 🏆 Cut data processing from hours to minutes and saved companies $200K+ in cloud costs
  • 🚀 Managing 10,000+ Iceberg tables processing 55TB+ monthly in production
  • 🎯 Created the LakeBoost framework, now running 466 jobs daily (11,832/month) in production

📈 By the Numbers

432.9K     YouTube Views
16.4K      Hours Watch Time
46,188     YouTube Subscribers
1,800+     Videos Published
10,922     LinkedIn Followers
5.2K       Medium Blog Views
300+       GitHub Repositories
100+       Technical Blog Posts
55TB       Data Processed Monthly
$200K+     Cost Savings Delivered

Made with ❤️ and countless hours of code by Soumil Shah

"Building data platforms by day, creating educational content by night, making data engineering accessible 24/7"


Last updated: January 2025

Pinned

  1. Smart-way-to-Capture-Jobs-and-Process-Meta-Data-Using-DynamoDB-Project-Demo-Python-Templates (Public)

    Smart way to Capture Jobs and Process Meta Data Using DynamoDB | Project Demo | Python Templates

    CSS • 4 stars • 2 forks

  2. Project-Using-Apache-Hudi-Deltastreamer-and-AWS-DMS-Hands-on-Lab (Public)

    Project: Using Apache Hudi Deltastreamer and AWS DMS Hands on Labs

    4 stars • 1 fork

  3. Python-Flask-Redis-Celery-Docker (Public)

    Learn how to use Python with Flask, Redis, and Celery, and pack everything into a Docker container

    Python • 78 stars • 65 forks

  4. An-easy-to-use-Python-utility-class-for-accessing-incremental-data-from-Hudi-Data-Lakes (Public)

    An easy-to-use Python utility class for accessing incremental data from Hudi Data Lakes

    Python • 3 stars

  5. LakeBoost (Public)

    Python • 9 stars • 3 forks

  6. emr-apache-iceberg-workshop (Public)

    Python • 4 stars