How to Solve a Too-Much-Data Crisis


We Invented a New Data Format so You Don’t Have To

It’s crazy to think about: while an estimated 33% of data holds potential value, only half of one percent of it is ever actually analyzed. It’s not that we don’t want to use all of our data; it’s that current technologies have real physical limitations (compute power, pipeline throughput, etc.). For decades, we’ve been taught to collect and store data, and technologies have become quite adept at data storage. However, we haven’t seen a similar level of innovation around taking that data and making it model-ready, a crucial step to realizing the value of data.

Not all applications need access to massive amounts of data in real time, but some of the most exciting uses of data are happening in AI and ML projects, which are increasingly stifled not by data engineers’ abilities but by the physical limits of data access. The founding team at Molecula lived this experience.

Too Much Data

Our previous company was a customer data platform serving the biggest names in sports, media, and entertainment. Every time we signed a new customer, we needed to ingest data about hundreds of millions of fans with hundreds of millions of attributes (think social graphs, behavioral graphs, in-arena WiFi data, etc., from hundreds of different data sources). It was our job to make all of this data instantly accessible with real-time joining and querying so that departments across our clients’ organizations could make decisions on the same, most up-to-date data.

As the datasets grew larger, it became apparent that the existing systems couldn’t support the amount of data we needed to access at the low query latencies we required. We had huge Cassandra and Elasticsearch clusters, and our most important queries were taking longer and longer. We began to explore pre-processing, pre-aggregating, and all of the “things that you do” to attempt to make bigger data faster. However, we found the tradeoff was losing our ability to deliver ad hoc query results on up-to-the-second data, which was non-negotiable for our business.

A Purpose-Built Feature Storage Platform is Born

A couple of our engineers, inspired by their backgrounds in black-box stock trading systems, had a “crazy” idea: use feature extraction to dimensionally reduce the data we were ingesting. They recognized that if we could losslessly reduce the data, and do it fast enough, it would impact every aspect of the data and ML stack. Without many alternatives on the table, management approved the new approach, and the engineers went to work.

They devised a way to automatically extract features from data at its source into a machine-ready format so that data was instantly and continuously readable by machines. Since the new format was so small, it could travel fast enough to deliver query results without pre-processing. We weren’t too concerned with classifying or naming it at the time, but this is what came to be known as FeatureBase.
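To make the idea concrete, one common way to losslessly reduce record-shaped data into a machine-readable form is a bitmap index: each attribute value maps to a bitmap over record IDs, so ad hoc filters become cheap bitwise operations instead of joins. The sketch below is a hypothetical illustration of that general technique, not Molecula’s actual format; all names in it are invented for the example.

```python
# Hypothetical sketch of a bitmap-index style encoding (not FeatureBase's
# actual implementation): each attribute value becomes a bitmap of record IDs,
# stored here as a Python int used as a bitmask.
from collections import defaultdict

def build_bitmaps(records, attribute):
    """Map each distinct value of `attribute` to a bitmap of record IDs."""
    bitmaps = defaultdict(int)
    for record_id, record in enumerate(records):
        bitmaps[record[attribute]] |= 1 << record_id
    return bitmaps

def matching_ids(bitmap):
    """Decode a bitmap back into the sorted list of record IDs it contains."""
    ids, i = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids

fans = [
    {"team": "A", "channel": "mobile"},
    {"team": "B", "channel": "mobile"},
    {"team": "A", "channel": "web"},
]
teams = build_bitmaps(fans, "team")
channels = build_bitmaps(fans, "channel")

# An ad hoc query like "team A AND mobile" is a single bitwise AND,
# with no pre-aggregation and no join over the raw records.
result = teams["A"] & channels["mobile"]
print(matching_ids(result))  # [0]
```

Because each bitmap is tiny relative to the raw records, this style of representation can travel and be intersected far faster than the source data, which is the intuition behind querying without pre-processing.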

So Good, We Didn’t Believe It

We put it up against our massive clusters of Elasticsearch servers, and our initial reaction was “this is broken, it didn’t register.” We continued to test, and the results were so fast that we couldn’t believe it was working properly: our traces were down to milliseconds. Once we proved to ourselves that just two servers using our newly engineered data format were completing in milliseconds the same queries that 40–50 Elasticsearch servers were taking 10–20 seconds to run, we realized that we had stumbled onto something truly unique and potentially groundbreaking.

Automatic Features for Everyone

After making some adjustments so that our feature extraction and storage platform could be implemented as an overlay on virtually any existing infrastructure, we launched the technology as an open source project under the name Pilosa. Today, we are focused on making Molecula the absolute best enterprise product for extracting value from data. We do this through an on-prem and fully managed, purpose-built feature storage solution called FeatureBase.

As a company born into the rapidly evolving ML world, we are constantly working to improve our product while also getting the word out to like-minded data engineers, systems architects, software engineers, ML engineers, and anyone else responsible for creating and delivering big data products. They can benefit from the lessons of our “crazy” (or maybe desperate) times, when we had to have instant access to massive data without compromise. If this is a story you can relate to, we would love to chat about new approaches to solving the “too-much-data” problem. Schedule a demo here.

Learn more about Molecula’s origin story, the technology behind it, and where we’re headed in the coming months on Episode 175 of The Data Engineering Podcast with Tobias Macey.