Why All Your Data Should Be Raw
By SCUBA Insights
Your company generates a ton of data—so much that it's essential to pare it down and only store the most relevant stats, right?
Wrong. Data warehouses date back to the 1970s, when a gigabyte of storage cost around $200,000. Today, one GB of storage will cost you around 2 cents. With storage that cheap, companies can stop worrying about compression and focus on making sure they fully understand their data.
What does this mean for brands? Heavily “cooking” (processing) data may have been necessary a few decades ago, but now its few benefits are far outweighed by the advantages of keeping data raw.
Read on to find out why all your data should be raw, and how it can benefit your brand.
What is "cooked data"?
“Cooked data" essentially refers to processed data. Meaning, that data has been taken from its raw format and processed, reorganized, or compressed. Traditionally, companies heavily cook their data in order to optimize storage space and query times. Three major ways to cook data are:
- Fitting data warehouses with compression schemas: One common schema is the star schema, which saves space by splitting each event across a central fact table and shared dimension tables. When an event, such as a click, occurs, information like the timestamp and user ID is collected; in a star schema, the fact table stores a slim record of the event along with keys that point to dimension tables holding those details (see the sketch after this list).
- Fitting tables with indices: Schemas are usually paired with indices, like bitmaps and B-trees, so information can be found again quickly.
- Only storing aggregates or subsets of the data: Companies may choose to store pre-computed aggregates, like averages, or just pick a few dimensions of the data to store in an OLAP cube, instead of keeping the raw data.
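To make the contrast concrete, here is a minimal sketch in Python of how a star schema might split a single click event into a fact row and dimension tables, compared with simply appending the event to a raw log. All table and field names are hypothetical and exist only for illustration.

```python
# A minimal sketch of how a star schema "cooks" a raw click event.
# Every table and field name here is made up, for illustration only.

raw_event = {
    "event_type": "click",
    "timestamp": "2023-05-01T12:34:56Z",
    "user_id": "u_123",
    "device": "iPhone 14",
    "country": "US",
    "page_url": "https://example.com/pricing",
}

# Star schema: the event is split into a slim fact row plus rows in
# shared dimension tables, referenced by surrogate keys.
dim_user = {1: {"user_id": "u_123", "country": "US"}}
dim_device = {1: {"device": "iPhone 14"}}
dim_page = {1: {"page_url": "https://example.com/pricing"}}

fact_clicks = [
    {"timestamp": "2023-05-01T12:34:56Z",
     "user_key": 1, "device_key": 1, "page_key": 1}
]

# Keeping it raw: the event is simply appended to an immutable log, unchanged.
raw_event_log = []
raw_event_log.append(raw_event)
```

Reassembling the original event from the star schema requires joining the fact row back against every dimension table, which is exactly the kind of indirection where subtle bugs hide.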
However, cooking your data with any of these methods is no longer the best choice. These methods were created because they let the data fit on a machine and let people answer queries quickly, not because they made the data any easier to understand. Subtle bugs, like an email automation pulling information from the wrong table, are exceedingly difficult to find when data has been processed this way. And the original motivation for cooking data no longer exists, because storage prices have dropped so far.
Better understand your data by keeping it raw
“The Sushi Principle” says that raw data is better than cooked data because it keeps your data analysis fast, secure, and easily comprehensible. There are three steps you need to take to keep your data raw.
1. Use a simple, well-tested pipeline.
When your data pipeline already has to read every line of your data, it's tempting to make it perform some fancy transformations. However, brands should steer clear of these add-ons to avoid:
- Flawed calculations: If you have thousands of machines transforming data in real time as it flows through your pipeline, collecting the data is easy, but telling whether every one of those machines is performing the right calculations is not.
- Limiting yourself to the aggregates you decided on in the past: If you're performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward—your old data is already set in stone.
- Breaking the pipeline: If you start doing fancy stuff on the pipeline, you're eventually going to break it. So you may have a great idea for a new calculation, but if you implement it you're putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data—which would be damaging to your company.
Of course, there are a few circumstances where you will need business logic in your pipeline. Regulations may require you to purge old user accounts and drop IP addresses. But every time you think about pushing a piece of business logic into your pipeline, you need to weigh the risks. We're all still relatively bad at writing software, and every complicated bit you add increases your chances of an error. And since storage is so much cheaper now, you have every incentive to just perform those calculations later.
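As a rough illustration of how little a pipeline needs to do, here is a minimal sketch, assuming JSON events and a made-up ip_address field, where the only transformation is the compliance rule above and everything else passes through untouched:

```python
# A minimal sketch of a deliberately "dumb" pipeline that writes events through
# almost unchanged. The only transformation is the business logic regulations
# may force on you (dropping IP addresses here); field names are hypothetical.
import json

FIELDS_TO_DROP = {"ip_address"}  # e.g., a field privacy rules require you to drop

def process(event: dict) -> dict:
    """Strip only the fields we are obliged to drop; keep everything else raw."""
    return {k: v for k, v in event.items() if k not in FIELDS_TO_DROP}

def run_pipeline(raw_lines, sink):
    """Read raw JSON events, apply the one required rule, and append to storage."""
    for line in raw_lines:
        event = json.loads(line)
        sink.append(process(event))  # no aggregation, no sampling, no reshaping

# Example usage with an in-memory "sink" standing in for durable storage:
events = ['{"user_id": "u_1", "ip_address": "203.0.113.7", "action": "click"}']
storage = []
run_pipeline(events, storage)
print(storage)  # [{'user_id': 'u_1', 'action': 'click'}]
```

Because nothing is aggregated or reshaped on the way in, any calculation you dream up later can still be run against the stored events.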
2. Keep all of your original data.
Once you've gone through the trouble of collecting all your data, you shouldn't toss out portions of it. With data storage costs so low, there's no reason not to keep all of your data—but a bunch of reasons to do so:
- You can easily trace the lineage of any statistic: Imagine trying to figure out exactly how your daily active users (DAU) figure was calculated. If your stored data is in the same format it was generated in, you can simply ask the developers of the service that generated it what each field means. If your data has been heavily processed, it's much harder to backtrack through all the transformations to find the original values.
- You can perform any query you want: The beauty of data is in how it leads you to further questions. If the number of users subscribing through email is shockingly low, you're going to want to look into the attributes of users who do actually sign up through that channel. You don't lose any detail when you have all your data on hand, which means you can iterate on your questions at any time (see the sketch at the end of this section). If you've pared your data down to an OLAP cube, you can only measure the dimensions you defined up front; everything else is lost.
- You don't have to waste time deciding what stats you want: If you decide to precompute stats, you're going to need to spend a whole lot of time planning out what those will be—and even that's no guarantee that you'll have everything you need.
Keeping your original data cuts out unnecessary work, so you can get to the parts that actually add value. It removes the need for extensive up-front planning and for reverse-engineering where your stats came from, leaving more time to fully explore your data.
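For example, here is a minimal sketch, using made-up fields, of answering a question you only thought of after the data was collected; with raw events on hand, nothing had to be precomputed:

```python
from collections import Counter

# Raw events, kept in full; every field and value here is hypothetical.
raw_events = [
    {"user_id": "u_1", "action": "signup", "channel": "email",  "device": "mobile"},
    {"user_id": "u_2", "action": "signup", "channel": "social", "device": "desktop"},
    {"user_id": "u_3", "action": "signup", "channel": "email",  "device": "desktop"},
]

# A question asked long after collection: what devices do the (surprisingly few)
# users who sign up through email tend to use? No cube dimension was planned for this.
email_signups = [
    e for e in raw_events
    if e["action"] == "signup" and e["channel"] == "email"
]
device_breakdown = Counter(e["device"] for e in email_signups)
print(device_breakdown)  # Counter({'mobile': 1, 'desktop': 1})
```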
3. Summarize and sample at query time.
You may be tempted to summarize and sample your data early in the pipeline. The thinking goes, “I'm going to have to do these things anyway, so why not shrink my data now and make it easier to process?” But sampling and summarizing early on can harm the accuracy of your data. It's much less risky to do both at query time:
- You can ensure that your summary statistics aren't skewed: If you calculate the average number of edits a Wikipedia user makes per week, that figure is going to be outrageously high unless you exclude bots. While this may seem like a mistake you'd never make, little things slip through the cracks all the time.
- You can sample once you know who's interesting: You can't simply keep every 100th event that's logged; that doesn't give you a coherent picture of how users, accounts, and devices are behaving. You need to sample by actor, not by event (see the sketch at the end of this section). But you won't know which actors are interesting to look at until you've started writing queries, and the types of users you want to look at will change from query to query.
- You'll get statistically significant results: Much of the time you're going to want to look into the behavior of small segments of your user population. But if you sample before query time, you may not have enough data on that small population to get statistically significant answers to your queries.
Yes, you will likely need to sample your data at some point to get answers to your queries quickly. But doing that sampling at query time ensures that you have the appropriate, representative sample you need for every query.
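Here is a minimal sketch of what that can look like in practice: bots are excluded at the moment the average is computed, and sampling is done deterministically per actor rather than per event. The fields, the bot flag, and the 10% rate are all assumptions for illustration.

```python
import hashlib
from statistics import mean

# Raw edit events; field names and the bot flag are assumptions for this sketch.
raw_events = [
    {"user_id": "u_1",   "is_bot": False, "edits": 4},
    {"user_id": "u_2",   "is_bot": False, "edits": 7},
    {"user_id": "bot_9", "is_bot": True,  "edits": 5000},
]

def in_sample(user_id: str, rate: float = 0.10) -> bool:
    """Deterministic per-actor sampling: a user is either fully in or fully out."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rate * 100

# Summarize at query time: exclude bots now, when you know they would skew the average.
human_edits = [e["edits"] for e in raw_events if not e["is_bot"]]
print(mean(human_edits))  # 5.5, instead of a bot-inflated figure

# Sample at query time, by actor: each query can pick its own sample of users.
sampled = [e for e in raw_events if not e["is_bot"] and in_sample(e["user_id"])]
```

Because the raw events are still there, a different query is free to choose a different sample, a different actor definition, or no sample at all.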
Do less to your data, so you can do more with it
Data is necessary to grow any business—so stop wasting it.
At Scuba Analytics, we believe data works best when brands can iterate on queries continuously instead of having to craft the perfect question up front; if you throw business logic into your pipeline, you lose that ability. By keeping your data raw, you can run any query you want without having to plan for it in advance.
With Scuba's continuous intelligence platform, brands don't have to worry about tedious ETL or constantly updating their data, and they get the agency to explore and understand their data with ease.
Ready to learn how Scuba can help you optimize your data? Request a demo today or talk to a Scuba expert.