#status/done/processed

![Cover Image](https://readwise-assets.s3.amazonaws.com/static/images/article0.00998d930354.png)

## Metadata
Author:: [[Randy Au]]
Title:: It’s OK to Use Spreadsheets in Data Science – Towards Data Science
Full Title:: It’s OK to Use Spreadsheets in Data Science – Towards Data Science
Import Date:: 2023-05-13
Source:: #source/readwise/instapaper
Source URL:: [Source URL](https://towardsdatascience.com/its-ok-to-use-spreadsheets-in-data-science-c1d0eff95b8b)
Review URL:: [Review URL](https://readwise.io/bookreview/26338667)

## Document Tags:: [[Data Science]] [[Spreadsheets]]

## Highlights
- But it’s probably the greatest Swiss army chainsaw for data for the sorts of ugly work that no one ever wants to admit they have to do every day. ==In an ideal world they wouldn’t be necessary, but when there’s a combination of tech debt, time pressure, poor data quality, and stakeholders who don’t know anything but spreadsheets, they’re invaluable.==
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484473)
- There’s even a whole “European Spreadsheet Risks Interest Group (EuSpRiG)” (founded in 1999!) that’s dedicated to Spreadsheet Risk Management a.k.a. how not to ruin your business via spreadsheet snafu.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484475)
    - Note: Amazing
- The majority of other issues is when people attempt to make a spreadsheet do too much, like becoming a database, data warehouse, project management tool, when more powerful and user-friendly dedicated solutions exist.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484481)
- The only real way to get a good sense of the data is to look at distributions, visualizations, and directly sampling it in raw form. Spreadsheets are generally great for this. I tend to find it less clunky than using pandas to poke around at arbitrary chunks of rows.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484487)
- ==The trick to know when to stop is if you’re seriously considering writing a macro or something, stop.==
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484499)
- Many times, ==there’s no other way to deal with data sets like the above other than writing some kind of brittle hard-coded mapping function of some kind==. It’s honestly a challenge to keep everything consistent and documented over years of production, the mix of camelCase and underscores points to that. Doing a simple aggregation for meaningful analysis is an utter pain in the butt.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484500)
- There’s a reason why lots of BI tools of all levels have a kind of “export to CSV/Excel” feature. Lots of very smart analytic people don’t know much about coding in Python or R.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484502)
- So why not have just a CSV, the universal data transfer format? You can, but it makes leaving a data source trail more work. You can package all the relevant information needed to pull a data set into a tab in the spreadsheet, whether it’s relevant queries, links to scripts, whatever.
    - Date:: [[2019-03-30]]
    - Find: [View Highlight](https://instapaper.com/read/1177860845/10484503)