A Brief History of Data, Part 1

By: Molecula

From Filing Cabinet to Feature-First

Unless you’ve been living under a rock, you know about big data and the impact it has made on our world. However, the term “data” can be confusing, in part because it is so vague and is used in such a wide variety of contexts. Data can also be hard to grasp because of its abstract nature. Since we can’t really see or touch data, we frequently lean on metaphors to help us understand its role and the value it brings: the new oil, your company’s most valuable asset, something flowing through pipes, and so on.

The Filing Cabinet

Humans have collected data for millennia. In ancient times, a tally mark on a stone might represent the sale of a goat. Fast forward to the 1890s, and data was recorded on paper, placed into folders, and stored in filing cabinets. This more functional way to store and retrieve data mimicked the organization in our brains, but only on a single axis. For example, a general store’s sales could be organized into folders by customer name. 

Compared to haphazard stacks of loose papers, this system made it easy to retrieve and analyze all the sales to Charles Thackeray or Lois Higgenbottam, for instance. But what if the store manager wanted to retrieve sales information on another axis, such as by department, price, or product ID? We’ll come back to that concept in a minute.

The Traditional Database

Let’s fast forward to a time when computers are ubiquitous. We now have machines that can store, compute, and retrieve data unimaginably fast. Again, we lean on tangible metaphors to help us understand: we have “documents,” “files,” and “folders” to store our data. These terms help humans grasp the structure of data, but the constructs actually slow down a computer’s ability to retrieve and compute over complex data.

Just as filing cabinets were an improvement over tally marks, our databases serve us well for storing all manner of data, whether in a datacenter or in the cloud. Returning to the general store example above, if we needed to retrieve all the hammer sales, we would have to open every folder, customer by customer, to locate each hammer sale record. That would still be better than sorting through random stacks of papers, but not as efficient as pulling a single folder already filled with all of the hammer sales records.

If we knew we would frequently need to access sales records by product as well as by customer name, it would be more efficient to make a copy of every transaction record and keep two filing cabinets: one organized by product ID and the other by customer name. This is effectively how big data is managed in databases today. Since databases don’t have physical folders full of paper, and storage is cheap, it is relatively easy to make duplicate copies and store them in additional “filing cabinets,” or servers. In fact, for every piece of enterprise data stored today, an average of nine additional copies are being stored and managed.
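To make the filing-cabinet analogy concrete, here is a minimal Python sketch (the records, field names, and structures are purely illustrative, not any particular database’s implementation) of what keeping that second cabinet amounts to: a full duplicate of every record, organized along a different axis.

```python
from collections import defaultdict

# Each sale is one "piece of paper" in the store's records.
sales = [
    {"customer": "Charles Thackeray", "product_id": "HAMMER-01", "price": 12.50},
    {"customer": "Lois Higgenbottam", "product_id": "NAILS-2IN", "price": 4.25},
    {"customer": "Lois Higgenbottam", "product_id": "HAMMER-01", "price": 12.50},
]

# Filing cabinet #1: folders organized by customer name.
by_customer = defaultdict(list)
# Filing cabinet #2: a duplicate of every record, organized by product.
by_product = defaultdict(list)

for sale in sales:
    by_customer[sale["customer"]].append(sale)  # copy #1
    by_product[sale["product_id"]].append(sale)  # copy #2

# Fast retrieval on both axes now -- but every new axis means another full copy.
print(by_customer["Lois Higgenbottam"])
print(by_product["HAMMER-01"])
```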

While storing data with computers is obviously an improvement over physical filing cabinets, databases are still structured in a way that mimics how humans think about data. When you need to analyze data on a particular axis, or combine it with other datasets, the computer must go through the database row by row or column by column to find the values you want to work with. Copies of the relevant data are made, and computations are run as preprocessing batches. For each new use case (like analyzing all the sales greater than $1,000), a new “filing cabinet” is created to preaggregate the data so it can be accessed faster in the future.
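In database terms, those per-use-case “filing cabinets” are precomputed batch views. A hedged sketch, again with toy records of our own invention, of preaggregating “sales greater than $1,000” so later queries can skip the row-by-row scan:

```python
# The full transaction table (toy records, continuing the example above).
sales = [
    {"customer": "Charles Thackeray", "product_id": "TRACTOR-88", "price": 2_400.00},
    {"customer": "Lois Higgenbottam", "product_id": "HAMMER-01", "price": 12.50},
]

# Batch step, run on a schedule: scan every row once and copy the
# matching records into a precomputed view -- yet another "filing cabinet."
big_sales_view = [s for s in sales if s["price"] > 1_000]

# Query time: read the prebuilt view instead of rescanning the table.
# Fast, but it is a copy, and only as fresh as the last batch run.
print(big_sales_view)
```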

We’ve simplified this example to make a point, but imagine if the general store was a nationwide establishment with millions of “filing cabinets” and billions of transactional data points continuously being updated. Sorting through row after row and column after column every time a manager needed a report or every time a customer added something to an online shopping cart would bog down the system.


The relatively recent widespread adoption of cloud storage appears to alleviate this problem to some degree by providing one expandable place to store all your data. What you might not realize, though, is that the same filing-cabinet-style copying is still happening; it has simply been relocated to the cloud. Even if you know it is technically inefficient, cloud storage isn’t all that expensive, and you may not really care what’s happening behind the clouds as long as you have the data you need when you need it. However, once your data gets big enough, or you need complex analytical queries returned fast enough, you will face the same performance challenges no matter where your traditionally structured database is stored.

When dealing with extremely large volumes of data, there are ways to mitigate the problem of sifting and preparing data for quick retrieval. One approach is to run the database on bigger and faster computers. This is effective for relational databases, but only to a point: it is a resource-intensive solution with significant costs. Another is to distribute the work and run the database across more computers. This works better with non-relational databases, but because records are stored and retrieved by key across partitions, any question not phrased in terms of that key can take time. There is an entire industry devoted to clever techniques that improve performance and balance cost against delivery. But at the end of the day, we are still working with variations on how to sort, store, and copy folders and filing cabinets that feel analogous to the physical world. If your data is big enough and your application needs real-time access, chances are you’ve experienced roadblocks to achieving the business outcomes your team is seeking. What if you could stop making copies, stop building filing cabinets, and pull out of the resulting data death spiral?
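To illustrate the scale-out tradeoff, here is a minimal sketch assuming a simple hash-partitioning scheme (the helper names and record layout are our own, not a specific database’s API): records land on a partition chosen by hashing a key, so keyed lookups touch one partition, while any question not phrased in terms of that key must visit all of them.

```python
import hashlib

NUM_PARTITIONS = 4
# Each inner list stands in for a separate server holding one partition.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Hash the partition key so records spread evenly across servers.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def put(record: dict) -> None:
    partitions[partition_for(record["customer"])].append(record)

put({"customer": "Charles Thackeray", "product_id": "HAMMER-01", "price": 12.50})
put({"customer": "Lois Higgenbottam", "product_id": "NAILS-2IN", "price": 4.25})

# Keyed lookup: the hash points at exactly one partition to search.
shard = partitions[partition_for("Charles Thackeray")]
charles = [r for r in shard if r["customer"] == "Charles Thackeray"]

# A question not keyed that way ("all hammer sales") must visit every partition.
hammers = [r for shard in partitions for r in shard
           if r["product_id"] == "HAMMER-01"]
print(charles, hammers)
```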


In Part 2, we’ll dive into an unconventional approach to storing data that allows for ultra-low latency access to the freshest data, enabling organizations to achieve true real-time insights and actions.

Can’t wait to find out more? Watch our on-demand webinar, “Data Aggregation is Unavoidable! (And Other Big Data Lies).”
