How to Escape the Data Death Spiral
By: H.O. Maycotte
In 2010, GE engineer Dave McCrory made an astute analogy between data and the force of gravity. Data Gravity describes the increasing gravitational pull that develops as data grows in size. More data begets more data and so on. In my job, I get to work with brilliant professionals who urgently need access to all of this data. So often, the problem is that specialized teams are retrieving big data for various projects across multiple silos using disparate technologies. Rather than following an overarching, coordinated data access strategy, each effort is too narrowly focused, causing further fragmentation as data gets copied and moved from system to system. I call this out-of-control data trap the Data Death Spiral. In nature, army ants will form a similar “death spiral” as they literally exhaust themselves to their demise while following only the ant in front of them without looking at the bigger picture. I’ve seen far too many companies whose data is spiraling out of control, swirling, growing, and becoming more and more inaccessible.
We all know by now that data is growing bigger and being generated faster by the second. Some data is vital for the long term, while other data is instantly valuable but becomes worthless within a few seconds of being born. We have built all kinds of databases, formats, stores, data lakes, data warehouses and cloud services to handle transactional data, time-series data, and events, and now we are even building specialized silos for ML data. All of this fragmentation and specialization has left data distributed in ways that are difficult to reconcile. In the average Fortune 500 company, dimensions of the most important data such as customers, patients, merchants, and devices live in dozens and even hundreds of systems and silos. Harnessing a complete, up-to-the-moment picture of our most important entities has become almost impossible.
With every attempt to solve our data access woes, we are inadvertently making the problem worse because the solution requires us to copy, pre-process, and move data. Of all the data being managed by companies today, the original data makes up only 15%. Yes, the average company manages roughly six copies for every original piece of data. The other 85% is what fuels the Data Death Spiral. Modern cloud data lakes and data warehouses are solving the human-scale data access problems by automating much of the work needed to move and manage all the copies. In fact, there is an entire industry category for “copy data management.” (This reminds me of the pharmaceutical commercials advertising drugs that are sold to treat the side effects of other drugs.) Because they do not address the underlying problems, most conventional data access approaches suffer when it comes to performance and agility. Real-time data access at scale remains an enigma. With the most exciting and promising technology trends of IoT, ML and AI requiring massive, real-time data access for machine-scale analytics and decision engines, this access barrier must be solved. This is the precise problem we are solving at Molecula with our enterprise feature store.
Without getting too far into the technical weeds, I want to summarize how the Data Death Spiral and Molecula’s enterprise feature store affect the following roles: data engineers, data scientists, business executives, and consumers.
Data engineers are under immense pressure to maintain business continuity while meeting the needs of the business teams who are rapidly demanding more and more access to data. Unfortunately, while they are solving data problems in one area, they often inadvertently contribute to the Data Death Spiral more than any other role in the company. Each project for each application, for each business unit across the entire company, ends in a monolith of complex infrastructure that is incredibly difficult to disentangle. In an effort to get out of the spiral, IT departments are increasingly moving everything from on-prem and edge to the cloud. The cloud definitely provides benefits in resource management and offers instant access to a vast array of services. However, moving big data to the cloud does not solve the data access, pre-aggregation and copying problems; it just moves them to the cloud. An analogy I use is: Just because you take your clothes to the laundromat doesn’t mean they don’t still need to get washed.
Imagine if data engineers could move from architecting, provisioning, optimizing, and managing infrastructure for every single project to maintaining systems that automatically make data AI-ready. That is what Molecula’s feature store does. As an overlay, Molecula’s feature store can be quickly implemented to provide a centralized, updated view of a company’s important data. With Molecula, data engineers transition from being the unintentional gatekeepers of data into empowering data access stewards, helping their stakeholders unlock use cases that were never possible before.
Data scientists are perhaps feeling the effects of the Data Death Spiral more than any other role in the organization. They need access to data for training and putting models into production. However, they rarely have access to the core IT data systems and are dependent on IT to provide it. It is not unusual to hear about data scientists spending upwards of 85% of their time just preparing data. This puts their education, their purpose, and their potential on hold while the data access process creeps along. To make matters worse, when they finally do get access to the data and they have built and trained their models, putting those in production becomes another IT acrobatics headache.
Imagine if a data scientist could begin every project or task with access to not just all of the data, but all of the data updated to the second. Models could be trained on the data and productionized on the data. One hundred percent of data science time could be spent actually extracting value and innovating from the data as opposed to sweet-talking IT to get moved up in the queue for the access they need. This is the world that Molecula’s feature store provides for data scientists.
Business executives have the most to lose in the Data Death Spiral saga. Eroding margins, encroaching competition, and technical challenges can keep even the best up at night. While you would be hard pressed to find an executive who doesn’t claim to be data-driven, the truth is that most are not. Most often, BI reports created with week-old data are the best they ever see. There is a tremendous amount of business value being obscured. With access to enough up-to-the-moment data, machine learning applications could literally be predicting and prescribing transformational business outcomes. When execs employ data scientists, product managers and IT teams to put their data to use, they don’t realize that, on average, less than one percent of their data is actually being used.
Molecula is setting out to change that statistic. With Molecula’s automated feature store, data engineers can help business executives achieve their most ambitious goals. Having access to all of the organization’s data in real time enables new use cases and reveals opportunities that have yet to be conceived.
One of the most conspicuous, and notorious, ways consumers have been impacted by modern advances in ML, analytics and algorithms is through interactions with companies like Facebook, Netflix, and Amazon. While those companies are harnessing data to attract consumer attention and sell more advertising, there are innumerable industries that have the potential to materially improve our everyday lives by utilizing data as expertly as the tech giants. In healthcare, there are lives to be saved by helping doctors make more accurate diagnoses faster, by preventing adverse drug interactions, and by identifying global health trends. In life sciences, developing customized, targeted medicines, speeding and improving the efficacy of clinical trials, and eradicating genetic diseases are all goals that will be achieved with the use of machine-scale big data analysis. In financial services, consumers will get approved for tailored loans faster and more flexibly, and the average person could have access to stock trading strategies that were previously only available to the elite. There is almost no industry that won’t be revolutionized by leveraging data to make smarter decisions, in the moment, at scale. And there is almost no consumer who won’t profoundly benefit in some way.
Molecula is focused on helping industries escape their Data Death Spirals and access their most important data for transformative artificial intelligence, machine learning, and other machine-scale big data analytics. The ultimate goal is to unlock human potential through data.