Data Engineers Face New Demands

By: Molecula

New Demands Require New Technology 

By now, we all know that “data engineer” is one of the fastest-growing and most in-demand positions. DICE’s recent job report showed a 50% increase in data engineer job listings in 2019, making it the highest-growth tech occupation by a margin of more than 10%. Why are data engineers so in demand? The world is finally starting to understand that machine learning is more than just models. It is often cited that data scientists spend approximately 80% of their time effectively performing data engineering. Machine learning and AI projects can never be reliably successful until businesses have solved “the data problem,” and data engineers are key to solving it.

To better understand this, let’s start by examining what a data engineer does.

What Does a Data Engineer Do?

A data engineer is responsible for building and maintaining an organization’s data pipeline systems. They make sure that the data an organization uses is clean, reliable, and prepared for whatever use case it needs to serve. To put it simply, a data engineer creates a usable data product.
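
To make that concrete, here is a minimal extract-transform-load sketch in Python. The file names, column names, and cleaning rules are hypothetical placeholders rather than any particular system; it simply illustrates what “clean, reliable, and prepared” looks like in code.

```python
# Minimal extract-transform-load sketch. The file names, columns, and
# cleaning rules below are hypothetical placeholders for illustration.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Pull raw events from a source system (here, a CSV export).
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Make the data clean and reliable: deduplicate, enforce types,
    # and drop records that downstream consumers cannot use.
    df = raw.drop_duplicates(subset=["event_id"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    return df.dropna(subset=["timestamp", "user_id"])

def load(df: pd.DataFrame, path: str) -> None:
    # Publish the usable data product for analysts and models.
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "events_clean.parquet")
```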

To the uninitiated, this doesn’t sound too difficult (at the very least, it seems highly repeatable). But with today’s data volumes, collected by the millisecond from distributed sources, and with businesses pushing to become predictive, data engineering is extremely complex.

According to Open Data Science, “Experts estimate that it takes two to three data engineer jobs per data science job to help maintain that pipeline, driving the high demand for these engineers.”

Given the constant data wrangling required to create a usable data product, one might think data engineers could be singularly focused on the pipeline itself. But with so much data available today, that data is only valuable if the data engineer understands the desired business outcome and is working toward it.

Today’s Data Engineer

Today, organizations are aware that advanced analytics (including machine learning and AI) can create major business value. A recent Ventana Research study notes that, “By 2023, more than three-quarters of analytics processes will be enhanced by artificial intelligence and machine learning.” 

As data volumes continue to grow and machine learning becomes an expectation, the role of the data engineer is shifting. Data engineers are making massive strides in stitching together and optimizing legacy data technologies to minimize delays, costs, and overall inefficiencies, particularly at scale. But these Information Era technologies were not built to handle the huge volumes of data needed to power the machine learning lifecycle.

Today’s data engineers are responsible for unleashing the power of data science and machine learning within organizations, while maintaining an efficient and scalable data product. The expectations put on data engineers are higher than ever before and continue to grow.

So is it reasonable to expect data engineers to continue “turning lemons into lemonade” on the regular as the already-gargantuan volumes of data continue to grow?

Short answer: No.

Data Engineer Skill Requirements

According to O’Reilly, a typical data engineer is expected to be proficient in at least ten data technologies and to know at least 30. These can include file formats, ingestion engines, stream processing, batch processing, SQL, Python, data storage, cluster management, transactional databases, cloud-based warehouses, data visualization, machine learning tools, and on and on…

Creating data pipelines is not an easy task: it involves advanced programming skills, an understanding of big data frameworks, and knowledge of how to build reliable systems, as the sketch below suggests.
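
As one small illustration of that systems knowledge, the sketch below retries a flaky extraction step with exponential backoff instead of letting the whole job crash. The fetch_batch function and its failure behavior are hypothetical stand-ins for a real distributed source.

```python
# One example of the systems thinking pipelines demand: a flaky source
# must be retried with backoff rather than crashing the job.
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Give up after the final attempt.
            # Back off exponentially so a struggling source can recover.
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def fetch_batch():
    # Hypothetical source that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return ["event-1", "event-2"]

print(with_retries(fetch_batch))  # succeeds after two retries
```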

To date, one of the strongest skills a data engineer can have is knowing which of the many available tools are needed to achieve a desired pipeline solution. When a data scientist is tasked with solving the same problem, they will often rely on as few tools as possible, reusing a single tool in ways that lose out on major efficiencies (or could even break the entire pipeline).

Juggling multiple tools has been the reality and will remain a facet of data engineering. But even with every tool currently available in their arsenal, data engineers will still hit problems when attempting to scale machine learning operations.

New Technology is Necessary

The core technologies that power data engineering today were built before the advent of big data. Yes, new platforms and companies have emerged within the data and ML/AI space, but these new platforms are largely built on reference architectures, and the reference architectures are incremental optimizations of legacy technologies.

Step-function improvements from new technologies are needed to realize the true potential of ML and AI. For example, FeatureBase is purpose-built for the extraction and storage of features. It taps directly into data sources and converts all of a business’s raw data into a model-ready data format (features) that it then stores. While it is compatible with legacy solutions, it is not constrained by them.
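
To illustrate what a “model-ready data format (features)” means in practice, here is a generic sketch that turns raw categorical records into a binary feature matrix. The schema is invented, and the code shows only the concept of feature extraction, not FeatureBase’s actual API or storage engine.

```python
# Generic illustration of feature extraction: raw categorical records
# become a binary, model-ready feature matrix. The schema is invented;
# this shows the concept of "features," not FeatureBase's own API.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "plan":        ["free", "pro", "pro"],
    "region":      ["us", "eu", "us"],
})

# One-hot encode: each (column, value) pair becomes a binary feature
# that models can consume directly, with no per-project re-wrangling.
features = pd.get_dummies(raw, columns=["plan", "region"])
print(features)
```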

A new technology such as this enables data engineers to stop building individual pipelines project by project, and it eases the back-and-forth struggles with data science teams. FeatureBase lays the “feature” groundwork needed to power all machine learning workflows within an organization, empowering data engineering teams to deliver instantly provisioned, up-to-date, ultra-low-latency data to data science teams without having to architect, deploy, secure, and monitor infrastructure for every project.

The data engineers of tomorrow will no doubt find new ways to strike the balance of keeping the lights on with legacy products while making revolutionary strides with next-generation technology to meet the ambitious business goals of the organization.