I'm a Lead Software Engineer at Zeta Global and a recognized Data Lakes & Lakehouse Architecture Expert with 6+ years of hands-on experience building production-grade data platforms. I specialize in Apache Hudi, Apache Iceberg, AWS EMR, and Spark, architecting solutions that process terabytes of data daily while achieving significant cost reductions.

"Making sophisticated data engineering accessible to everyone through code, content, and community"

- Creator of LakeBoost: Production framework integrating Apache Hudi with AWS Glue ETL, powering 50+ tables with 466 daily jobs (11,832 monthly), achieving 40% operational overhead reduction and 4-5x cost savings
- Featured on AWS Storage Blog: How Zeta Global scales multi-tenant data ingestion with Amazon S3 Tables
- Delivered $200K+ Cost Savings: Led Lakehouse adoption achieving ~50% cost reduction while processing 55TB+/month with sub-10-minute refresh latency
- Performance Optimization Expert: Reduced data processing from hours to minutes; achieved 80x faster searches through Elasticsearch optimization (4.4s → 125ms)

```typescript
interface Engineer {
  role: "Lead Software Engineer";
  focus: ["Data Lakes", "Lakehouse", "Platform"];
  stack: ["Hudi", "Iceberg", "Spark", "AWS"];
}
```

Leading enterprise-scale data transformation:

```python
metrics = {
    "data_volume": "1.3TB daily | 55TB monthly",
    "table_count": "10,000+ Iceberg tables",
    "job_throughput": "466 daily | 11,832 monthly",
    "cost_reduction": "~50% | $200K+ saved",
    "performance": "2× faster ingestion",
    "latency": "<10 min refresh"
}
```

Core expertise:

| Platform | Reach & Engagement |
|---|---|
| YouTube | 46,188 Subscribers • 432.9K Views • 16.4K Hours Watch Time |
| Videos Created | 1,800+ Tutorials (3.9M Impressions, 5.5% CTR) |
| Medium Blog | 422 Followers • 18.4K Presentations • 5.2K Views • 3K Reads |
| Blog Posts | 100+ Articles on Data Engineering |
| LinkedIn | 10,922 Followers • 1,000+ Posts |
| GitHub | 300+ Open Source Repositories |
| Website | Technical Resources & Portfolio |
| Avg View Duration | 2:21 minutes (High engagement) |

- Data Lake & Lakehouse Architecture: Apache Hudi, Iceberg, Delta Lake
- AWS Big Data Engineering: Glue, EMR, S3, Step Functions, Lambda
- Real-time Data Pipelines: Kafka, Kinesis, Streaming ETL
- Performance & Cost Optimization: Query tuning, incremental processing
- End-to-End Production Projects: From ingestion to analytics
- Data Management Best Practices: Table formats, partitioning, clustering
- PySpark & Python: Distributed computing, big data processing

Creating 1,800+ videos and 100+ blog posts to democratize data engineering knowledge and help engineers worldwide build better data platforms.
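
As a small illustration of the Hudi and incremental-processing topics listed above, here is a minimal sketch of the write options a Hudi upsert from PySpark typically needs. The helper name `hudi_upsert_options` and the example table and field names (`orders`, `order_id`, `updated_at`, `order_date`) are hypothetical, and the option keys are the ones documented for Hudi 0.x releases. This is an illustration only, not LakeBoost's actual code.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build a dict of Hudi datasource write options for an upsert.

    All arguments are illustrative; swap in your own table and column names.
    """
    return {
        "hoodie.table.name": table_name,
        # "upsert" updates rows whose record key already exists and inserts the rest.
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two incoming rows share a key, the one with the larger value here wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Column used to lay out partitions on storage.
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Hypothetical table and columns, for illustration only.
options = hudi_upsert_options("orders", "order_id", "updated_at", "order_date")

# With a Spark session that has the Hudi bundle on its classpath, the options
# would be handed to the writer roughly like this (not executed here):
#   df.write.format("hudi").options(**options).mode("append").save(target_path)
```

Keeping the options in a plain dict makes them easy to unit test and reuse across jobs without starting a Spark session.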
How Zeta Global scales multi-tenant data ingestion with Amazon S3 Tables
Deep dive into production architecture handling 10,000+ Iceberg tables, 55TB+ monthly processing, and multi-tenant data ingestion at scale with sub-10-minute refresh latency
- Experiment: S3 Tables with Incremental Loads up to 520GB - Performance benchmarking at scale
- S3 Tables: Table Maintenance Flexibility with Spark - Advanced maintenance strategies
- Stream Real-Time Data to AWS S3 Tables using Kafka - Production streaming patterns
- Writing to S3 Tables Managed Iceberg Tables Using DuckDB - Interoperability demonstrations
- ElasticSearch Performance Tuning: 80X Faster Searches Case Study - 4.4s → 125ms optimization

| Degree | Field | Institution | Achievement |
|---|---|---|---|
| M.S. | Electrical Engineering | University of Bridgeport | 4.0 GPA |
| M.S. | Computer Engineering | University of Bridgeport | Best Academic Award |
| B.S. | Electronic Engineering | K.J. Somaiya Institute | |

Honors: Best Academic Achievement Award (4.0 GPA) • 3rd Place UB Hackathon • The Builder Award

Data Lakes & Lakehouses • Apache Hudi & Iceberg • AWS Data Platforms • Spark Optimization
Multi-tenant Architecture • Cost Optimization • Content Creation • Technical Speaking
Open Source Collaboration • Mentorship & Career Guidance

I'm actively seeking speaking engagements, technical collaborations, content partnerships, and consulting opportunities around:
- Data Engineering & Lakehouse Architecture
- AWS Big Data Technologies & Cost Optimization
- Building Production Data Platforms at Scale
- Content Creation & Developer Education
Location: New York City Metropolitan Area

- Created 1,800+ educational videos in a single year (that's averaging 5 videos every single day!)
- Accumulated 432.9K views and 16.4K hours of watch time helping engineers worldwide
- Authored 100+ technical blog posts reaching 18.4K presentations and 5.2K views
- Built 300+ GitHub repositories with production-ready code and tutorials
- Teaching data engineering in 4 languages: English, Hindi, Gujarati, and Marathi
- Turned data processing from hours to minutes and saved companies $200K+ in cloud costs
- Managing 10,000+ Iceberg tables processing 55TB+ monthly in production
- Created LakeBoost framework now running 466 jobs daily (11,832/month) in production

- 432.9K YouTube Views
- 16.4K Hours Watch Time
- 46,188 YouTube Subscribers
- 1,800+ Videos Published
- 10,922 LinkedIn Followers
- 5.2K Medium Blog Views
- 300+ GitHub Repositories
- 100+ Technical Blog Posts
- 55TB Data Processed Monthly
- $200K+ Cost Savings Delivered

Made with ❤️ and countless hours of code by Soumil Shah
"Building data platforms by day, creating educational content by night, making data engineering accessible 24/7"
Last updated: January 2025


