#status/done/processed

![Cover Image](https://motherduck-com-web-prod.s3.amazonaws.com/assets/img/big_data_is_dead_692251bb36.jpg)

## Metadata
Author:: [[Jordan Tigani]]
Title:: Big Data Is Dead
Full Title:: Big Data Is Dead
Import Date:: 2023-05-13
Source:: #source/readwise/reader
Source URL:: [Source URL](https://motherduck.com/blog/big-data-is-dead/)
Review URL:: [Review URL](https://readwise.io/bookreview/25107245)

## Document Tags:: [[Rated ⭐⭐⭐⭐⭐]]

## Highlights
- For more than a decade now, the fact that people have a hard time gaining actionable insights from their data has been blamed on its size. “Your data is too big for your puny systems,” was the diagnosis, and the cure was to buy some new fancy technology that can handle massive scale.
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0djzrxd4qz2gay66cd0phq)
    - Tags: [[Big Data]] [[Favorite]]
- In 2018, I switched to product management, and my job was split between talking to customers, many of whom were the largest enterprises in the world, and analyzing product metrics.
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0dmvsrfd2gm4t66aqdh1gy)
    - Tags: [[Career Pivots]]
- The most surprising thing that I learned was that most of the people using “Big Query” don’t really have Big Data. Even the ones who do tend to use workloads that only use a small fraction of their dataset sizes.
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0dr3na049qwbd5wey9en8z)
- This post will make the case that the era of Big Data is over. It had a good run, but now ==we can stop worrying about data size and focus on how we’re going to use it to make better decisions==
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0drk2z34pc9mm7g9gcbhrx)
- MongoDB is the highest ranked NoSQL or otherwise scale-out database, and while it had a nice run-up over the years, it has been declining slightly recently, and hasn’t really made much headway against MySQL or Postgres, two resolutely monolithic databases. If Big Data were really taking over, you’d expect to see something different after all these years.
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0dv3fng1s74c10z80z7js1)
    - Tags: [[Database Management Systems]] [[Favorite]]
- ==He found that the largest B2B companies in his portfolio had around a terabyte of data, while the largest B2C companies had around 10 Terabytes of data==
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0x3r8q9jvgd5wkdea5n4nk)
    - Note: Interesting statistic that generally B2C companies have 10 times as much data as B2B companies.
    - Link: [[B2B vs B2C Companies]]
- It is hard to see how this adds to massive data sets under reasonable scaling assumptions.
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0xvq79ehm3zzex2g83bftk)
    - Note: The only counterpoint to this would be videos or photos.
- Modern cloud data platforms all separate storage and compute, which means that customers are not tied to a single form factor. This, more than scale out, is likely the single most important change in data architectures in the last 20 years. Instead of “shared nothing” architectures which are hard to manage in real world conditions, shared disk architectures let you grow your storage and your compute independently. ==The rise of scalable and reasonably fast object storage like S3 and GCS meant that you could relax a lot of the constraints on how you built a database.==
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0ymwec0vr09q4xw8ted35n)
    - Note: Well except [[Redshift]]
- Misunderstanding of this point leads to a lot of the discussion of Big Data, because techniques for dealing with large compute requirements are different from dealing with large data
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0ynpn611zfj2bwjypfpqjn)
- ==But compute needs will likely not need to change very much over time; most analysis is done over the recent data.== Scanning old data is pretty wasteful; it doesn’t change, so why would you spend money reading it over and over again? True, you might want to keep it around just in case you want to ask a new question of the data, but it is pretty trivial to build aggregations containing the important answers
    - Date:: [[2023-03-08]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv0ytpyr5ymkhv5k6fzw32vv)
    - Note: This is what XV should be doing
- ==This bias towards storage size over compute size has a real impact in system architecture. It means that if you use scalable object stores, you might be able to use far less compute than you had anticipated. You might not even need to use distributed processing at all.==
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv10e2j5g97tn3367g7redff)
- Even when querying giant tables, you rarely end up needing to process very much data. ==Modern analytical databases can do column projection to read only a subset of fields, and partition pruning to read only a narrow date range. They can often go even further with segment elimination to exploit locality in the data via clustering or automatic micro partitioning. Other tricks like computing over compressed data, projection, and predicate pushdown are ways that you can do less IO at query time.== And less IO turns into less computation that needs to be done, which turns into lower costs and latency.
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv10grytez1280h1sdcpqy8p)
    - Tags: [[Cloud Data Warehouses]]
- One definition of “Big Data” is “whatever doesn’t fit on a single machine.” By that definition, the number of workloads that qualify has been decreasing every year
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv11d7a0r454h5c90df43pqs)
- If you think about many data lakes that organizations collect, they fit this bill entirely: ==giant, messy swamps where no one really knows what they hold or whether it is safe to clean them up==
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv11f945dz7k6mmcq99c59yw)
- Just as many organizations enforce limited email retention policies in order to reduce potential liability, ==the data in your data warehouse can likewise be used against you.==
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv11fta2k7yg61rap5skefzj)
    - Tags: [[Favorite]]
- The longer you keep data around, the harder it is to keep track of these special cases. And not all of them can be easily worked around, especially if there is missing data.
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv11h20rk13n1n1xg9bdj20a)
- If you answer no to any of these questions, you might be a good candidate for a new generation of data tools that help you handle data at the size you actually have, not the size that people try to scare you into thinking that you might have someday.
    - Date:: [[2023-03-09]]
    - Find: [View Highlight](https://read.readwise.io/read/01gv11hjyhdcbsqxqv4ct8b9k8)

#status/done/processed

- Are you really generating a huge amount of data? ^f2406b
    - If so, do you really need to use a huge amount of data at once?
    - If so, is the data really too big to fit on one machine?
    - If so, are you sure you’re not just a data hoarder?
    - If so, are you sure you wouldn’t be better off summarizing?
    - Date:: [[2023-05-28]]
    - Find: [View Highlight](https://read.readwise.io/read/01h1ezts3snv93ew2cb9exbx0h)
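
A rough sketch of the idea in the highlight on column projection and partition pruning, in plain Python. The table layout, the `PARTITIONS` sample data, and the `scan` helper are all made up for illustration; none of it comes from the article or from any real database engine. It only shows why a query over one column and one day touches a small fraction of what is stored.

```python
from datetime import date

# Hypothetical table: one entry per date partition, each column stored
# separately (a toy stand-in for a columnar, date-partitioned layout).
PARTITIONS = {
    date(2023, 3, 1): {"user_id": [1, 2], "amount": [10.0, 5.0], "notes": ["a", "b"]},
    date(2023, 3, 2): {"user_id": [1, 3], "amount": [7.5, 2.5], "notes": ["c", "d"]},
    date(2023, 3, 3): {"user_id": [2, 3], "amount": [4.0, 6.0], "notes": ["e", "f"]},
}


def scan(partitions, columns, start, end):
    """Return rows for `columns` in [start, end] plus the count of values read."""
    rows = []
    values_read = 0
    for day, cols in partitions.items():
        # Partition pruning: skip whole partitions outside the date range.
        if not (start <= day <= end):
            continue
        # Column projection: touch only the requested columns.
        selected = [cols[name] for name in columns]
        values_read += sum(len(values) for values in selected)
        rows.extend(zip(*selected))
    return rows, values_read


rows, values_read = scan(PARTITIONS, ["amount"], date(2023, 3, 3), date(2023, 3, 3))
total_values = sum(len(v) for cols in PARTITIONS.values() for v in cols.values())
print(f"read {values_read} of {total_values} stored values")
# prints "read 2 of 18 stored values"
```

Real engines get the same effect from file formats and metadata rather than Python dictionaries, so treat the numbers here as a toy proportion and not a benchmark.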