A Brief History of Data: Part 2

 

In Part 1 of “A Brief History of Data,” we walked through the creation of the traditional database and how it mirrors the physical ways we’ve stored data since the filing cabinet was introduced in the 1890s. As we covered, the traditional ways of storing data are great for human-centric purposes, but they are not ideal for machines or for the extremely large-scale data analysis required for advanced analytics and machine learning.

At Molecula, we invented a new way to store and process data that was built specifically for lightning-fast analysis of incredibly large data sets of varying cardinality and sparsity.

The Feature-Oriented Database

A feature-oriented database is an unconventional approach to storing data. In data science, a feature is a measurable piece of data that can be used for analysis. Rather than evolving from the way human brains organize and retrieve information, a feature-oriented database is purpose-built for computers to consume and compute data. 

Computers are made up of billions of transistors that exist in one of two states: on or off. Machines therefore “think” in 1’s and 0’s, which is both efficient and simple. The beauty of that simplicity is that computers can process immense amounts of data extremely quickly when that data is in the form of 1’s and 0’s. A feature-oriented approach exploits this fundamental aspect of computers.

With a feature-oriented database, all your underlying data is automatically and continuously converted into the most compute-efficient format. Once the data is in this new format, it occupies up to 99% less space than traditional data formats while remaining 100% computable. Computations with a feature-oriented database are so much faster that there is no need to create copies or build additional “filing cabinets.” Once your data is prepared for machine processing in the feature-oriented format, nearly everything you do with your data becomes more efficient: storing, accessing, computing, securing, and so on.
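To make the idea of data as 1’s and 0’s more concrete, here is a minimal sketch of one way a column of values could be turned into bitmaps: one bitmap per distinct value, with one bit per record. This is an illustration only, not a description of Molecula’s actual storage format; the function name `encode_bitmaps` and the sample data are hypothetical.

```python
def encode_bitmaps(values):
    """Return {value: int}, where bit i of the int is 1 if record i has that value."""
    bitmaps = {}
    for record_id, value in enumerate(values):
        # Set the bit for this record in the bitmap that belongs to its value.
        bitmaps[value] = bitmaps.get(value, 0) | (1 << record_id)
    return bitmaps


# Hypothetical sample column: one payment method per record.
payments = ["credit", "cash", "credit", "check"]
bitmaps = encode_bitmaps(payments)

for value, bits in bitmaps.items():
    # Reverse the binary string so that record 0 is printed first (leftmost).
    print(value, format(bits, "04b")[::-1])
```

Each bitmap answers a yes/no question for every record at once, which is the kind of representation that processors can combine with simple bitwise operations.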

Figure 1 shows four possible formats for communicating the same basic data. Keep in mind that this example is oversimplified to drive home the point that different formats are better suited to different uses, by both humans and computers. Further, some formats inherently enable much faster analysis than others. Refer to the customer sales data in figure 1 to answer the following: How many people used a credit card to pay for their transaction, and which payment method was least popular?

Figure 1: Examples of different data formats

Which data format allowed you to obtain the answers most quickly? You likely found the format in the second quadrant, “Visual Graphic,” the most efficient format to find the answers: “4 people used a credit card,” and “written checks were least popular.” This makes sense because visual charts are specifically designed for humans to quickly analyze and extract meaning from data. If you were a computer, however, you would perform faster by using the “FeatureBase” format, as it was specifically designed for computers to process data at the fastest rate possible.
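As a rough sketch of how a machine might answer the same two questions from a bitmap-style encoding, the example below counts set bits per payment method. The records are hypothetical: they are not the data from figure 1 and were chosen only to be consistent with the answers stated above. The helper `popcount` and all variable names are assumptions made for illustration.

```python
# Hypothetical records, one payment method per transaction.
payments = [
    "credit card", "cash", "credit card", "debit card", "credit card",
    "check", "cash", "credit card", "debit card",
]


def popcount(bits):
    """Count the number of 1 bits in a non-negative integer."""
    return bin(bits).count("1")


# Build one bitmap per payment method: bit i is 1 if transaction i used it.
bitmaps = {}
for i, method in enumerate(payments):
    bitmaps[method] = bitmaps.get(method, 0) | (1 << i)

# Each answer reduces to counting bits, with no scan over the full records.
counts = {method: popcount(bits) for method, bits in bitmaps.items()}
credit_card_count = counts["credit card"]
least_popular = min(counts, key=counts.get)

print(credit_card_count)
print(least_popular)
```

With this sample data, both questions are answered from the per-method counts alone, which mirrors why a format designed for machines can be faster for them than one designed for human eyes.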

Data Storage Formats

When it comes to database storage, different formats have emerged as more advanced use cases have required different access patterns. Historically, the two major database formats have been online transactional processing (OLTP) and online analytical processing (OLAP). Because of how its data is physically stored (row-wise), OLTP works best for simple, fast, high-volume, record-oriented transactions. OLAP works best for analytical workloads, but to perform well it requires pre-aggregation based on an understanding of the specific query patterns it will serve. If your application requires large numbers of complex ad hoc queries for analytical purposes such as machine learning, both fall short in setup time, performance, or capability. The feature-oriented online machine processing (OLMP) format is best suited for high-volume, complex analytical workloads in real time.

Figure: Differences between row-wise, columnar, and feature-oriented databases
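One simplified way to picture the three layouts is to hold the same records in three different in-memory shapes and run the same ad hoc question against each. The sketch below is an assumption-laden toy, not how any particular OLTP, OLAP, or OLMP product stores data; the field names, records, and variable names are all hypothetical.

```python
# Row-wise: each record is kept together, which suits record-oriented transactions.
rows = [
    {"payment": "credit card", "state": "TX"},
    {"payment": "cash", "state": "CA"},
    {"payment": "credit card", "state": "CA"},
    {"payment": "check", "state": "TX"},
]

# Columnar: each field is kept together, which suits scans over a few fields.
fields = ("payment", "state")
columns = {field: [row[field] for row in rows] for field in fields}

# Feature-oriented (toy version): one bitmap per (field, value) pair,
# where bit i is 1 if record i has that value.
features = {}
for position, row in enumerate(rows):
    for field in fields:
        key = (field, row[field])
        features[key] = features.get(key, 0) | (1 << position)

# Ad hoc question: how many transactions paid by credit card in CA?
row_count = sum(
    1 for row in rows
    if row["payment"] == "credit card" and row["state"] == "CA"
)
column_count = sum(
    1 for payment, state in zip(columns["payment"], columns["state"])
    if payment == "credit card" and state == "CA"
)
# With bitmaps, the filter is a single bitwise AND followed by a bit count.
matches = features[("payment", "credit card")] & features[("state", "CA")]
feature_count = bin(matches).count("1")

print(row_count, column_count, feature_count)
```

All three layouts give the same answer; what differs is the work needed to get there. The row and column versions inspect every record, while the bitmap version combines two precomputed bitmaps without knowing the query pattern in advance.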

Modern Analytical Workloads

As the world shifts towards automation, machine learning, and generalized artificial intelligence, new technologies are not only inevitable but required. While most tech companies have attempted to meet these needs by optimizing existing ways of storing, processing, and analyzing data, the world is finding that it’s not enough to allow for the scalable implementation of advanced analytics and AI. At Molecula, we’ve built technology from the ground up that ensures data is findable, accessible, and usable throughout organizations, in both human-readable and machine-readable formats.

 

Download our complete Feature-First Field Guide to continue reading.
