In 2014, if I were analyzing a 33GB file, I would have jumped to writing a MapReduce job. Today, it takes 56s to load a CSV file with 292 million rows of hospital price data *and* only 734ms to find the average price paid by patients per ICD code. On my laptop, using DuckDB. 🤯
The prescription for my life in databases was written early. With my 8086 Hyundai PC, I created custom MS-DOS databases to index my books, movies, toys & software. It was local, easy, fast; even if not beautiful in UI, it was beautiful in simplicity.
Wow, the responses to this are just ugly and representative of exactly the types of things it sounds like you've had to deal with at your former job.
👏 Congrats to you for standing up and doing what you believe was right for you. Best wishes as you search for a supportive env.
My colleague @mehd_io has built up an amazing collection of videos on data engineering, DuckDB and MotherDuck. He even interviews many of the GOATs in data. We’ve now vastly improved the discoverability of these videos on our website. Check it out: motherduck.com/videos/
Fake data > real data? I had fun generating fake data w/ Python Faker & DuckDB. This post walks through 3 ways (pandas DataFrames, Parquet, CSV). Culminates in generating 1 billion fake people and running a full-table substring filter + SUM() in 1s. Whoa!!