How to Maximize Your Scarce Data Engineering Resources

By: Molecula

Strategies for Coping When You Can’t Clone Yourself

In the last few years, we’ve seen a lot of hype around the rise of data science, followed by enthusiastic hiring of data scientists. As it turns out, it may have been a little too enthusiastic. Data scientist hiring has outpaced data engineer hiring to the point where the headlines about data scientists are about why they are quitting, frustrated, and disillusioned, while data engineers are getting some overdue attention as the new most in-demand hires.

Data engineers are already under tremendous pressure to keep the data “lights on” and are now being asked to provide the foundational support for ever more ambitious AI and ML initiatives. If there is no infrastructure, no data pipelines, and no “data product,” there can be no data science. This is a reality many organizations are coming to realize the hard way. Data scientists shouldn’t be expected to magically fill the engineering gap, and data engineers are now positioned to become the key that unlocks the advanced, real-time data products everyone is so excited about.

To help your company recover from its data engineering debt, here are three strategies to maximize data engineering resources:

1. Implement an Efficient Operational AI Process

Ad hoc, duct-taped systems and one-off requests don’t provide the efficiencies required to realize repeatable business value. Setting up a stable process where data scientists can train on self-serve, production-scale data will improve efficiency (not to mention quality of life) for data engineers.

With some claims that only 13% of data science projects make it into production, there is an efficacy gap between the experimentation and production phases of the ML lifecycle. One way to close this gap is to adopt a “feature-first” approach to data preparation. For example, Molecula’s FeatureBase automatically generates features from data at the source, giving data scientists the ability to experiment with real data without making manual requests every time they need new features. As a result, more models make it to production, and the resources required to take a project to production are significantly reduced, saving hours, weeks, and sometimes months.
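To make the idea concrete, here is a minimal, hypothetical sketch of what a self-serve, feature-first workflow could look like in Python. The `FeatureStoreClient` class, its endpoint, and the feature names are illustrative placeholders, not FeatureBase’s actual API.

```python
# Hypothetical sketch of a "feature-first" workflow: a data scientist pulls
# ready-made, production-scale features instead of filing a manual request.
# FeatureStoreClient, its endpoint, and the feature names are placeholders.
from typing import List

import pandas as pd


class FeatureStoreClient:
    """Placeholder client for a feature storage platform."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def get_features(self, entity: str, feature_names: List[str]) -> pd.DataFrame:
        # A real platform would return features materialized at the data
        # source; here we just return an empty frame with the right columns.
        return pd.DataFrame(columns=["entity_id", *feature_names])


# Self-serve experimentation: no per-request engineering work needed.
store = FeatureStoreClient(endpoint="https://features.example.internal")
training_frame = store.get_features(
    entity="customer",
    feature_names=["days_since_last_purchase", "lifetime_value", "churn_flag"],
)
print(training_frame.columns.tolist())
```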

Additionally, if you can get your data scientists operating in hosted environments such as hosted Jupyter notebooks, you can reduce the time and effort wasted transferring data to and from laptops. This also reduces the security concerns that come with tracking untethered data across individuals’ computers (yes, this actually happens; it’s called “laptop data science,” and it is not good).

Aim to develop the same CI/CD processes and orchestration of resources for both development and production. A good foundation here will allow each phase of the ML lifecycle to be more efficient. 
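As a rough illustration of “same process in dev and prod,” the sketch below parameterizes a single pipeline entry point by environment, so the code path CI/CD exercises is identical in both phases and only the configuration changes. All names, URIs, and settings are hypothetical.

```python
# Illustrative sketch: one pipeline entry point, selected configuration per
# environment, so dev runs are a faithful rehearsal of production runs.
import sys
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    environment: str        # "dev" or "prod"
    source_uri: str         # where raw events are read from (placeholder)
    feature_store_uri: str  # where computed features are written (placeholder)


CONFIGS = {
    "dev": PipelineConfig("dev", "kafka://dev-broker/events",
                          "https://features-dev.example.internal"),
    "prod": PipelineConfig("prod", "kafka://prod-broker/events",
                           "https://features.example.internal"),
}


def run_pipeline(config: PipelineConfig) -> None:
    # The transformation logic is the same everywhere; only endpoints differ.
    print(f"[{config.environment}] reading {config.source_uri} "
          f"-> writing {config.feature_store_uri}")


if __name__ == "__main__":
    run_pipeline(CONFIGS[sys.argv[1] if len(sys.argv) > 1 else "dev"])
```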

Implementing feature storage with a focus on providing hardened, production-scale data on the front end will require fewer day-to-day engineering resources, provide more control, and set data science teams up for success.

2. Partner with SIs to Bolster Your Efforts

In some cases it makes sense to tap outside systems integrators (SIs). Global consulting firms like E&Y, specialists like Pleco Systems, and boutique firms like F33 can bring much-needed manpower and niche expertise. Their experience means they have seen a myriad of use cases and know the tried-and-true programs and processes that work. They can also help you scale up and down as required. Hiring outside firms can be expensive, but they can often pay for themselves by decreasing your time-to-value, and they can provide a level of accountability and security that might not make sense for your company to provide in-house. You can also use them to set up a system while you build the team that will maintain it for the long term.

In addition to being able to build out operational AI infrastructure, SIs can help with digital transformation initiatives across your organization. In some cases, the roadblocks to getting sufficient data engineering resources are political. A good consulting firm can aid with change management from the top down and educate decision-makers on tying compensation and resources to the right metrics and departments.

Further, your company may already have a contract with a consulting firm, which could make it politically and financially easier to access budget for bringing in outside help.

3. Simplify Your Architecture

Simplifying your architecture can set off a domino effect of efficiency improvements across your organization. Simplification can lower costs and complexity in terms of footprint, vendor management, human resources, security, upgradeability, speed-to-data, scalability, and more. 

One way to simplify is to eliminate batch processing and replace it with real-time systems that don’t require pre-processing of data. Batching requires ETL pipelines that need to be architected, managed, and secured. It also requires intermediate storage and the creation of data warehouses and materialized views that are performant enough to power models and applications.

Avoid the inefficiencies of batch processing by moving to real-time systems powered by streaming backbones like Apache Kafka. Steer away from data lakes and other types of storage that require you to post-process data in batch. Data warehouses are more efficient and actionable because they are structured. To make this work in a real-time environment, you’ll need to maintain compute-ready access in a feature storage platform alongside them.
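As a minimal sketch of what consuming a streaming backbone can look like, the snippet below uses the kafka-python client to read events from a topic as they arrive; the topic name, broker address, and the downstream feature-update step are placeholders.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume events as they arrive instead of waiting for a batch window.
consumer = KafkaConsumer(
    "customer-events",                     # placeholder topic
    bootstrap_servers=["localhost:9092"],  # placeholder broker
    group_id="feature-updater",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Placeholder for the real work: update compute-ready features in the
    # feature storage platform as each event streams in.
    print(f"updating features for customer {event.get('customer_id')}")
```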

Another obvious, but important, way to simplify is to fully leverage the cloud with platforms like Snowflake for warehousing and human-scale analytics, so that you have access to elastic workloads that don’t require keeping engineering resources on standby.
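For illustration, here is a minimal sketch of querying Snowflake from Python with the snowflake-connector-python package; the account, credentials, warehouse, and query are placeholders, and in practice credentials would come from a secrets manager.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection details; an appropriately sized, auto-suspending
# warehouse lets the workload scale elastically without engineers on standby.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",  # use a secrets manager in practice
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM customers")  # placeholder query
    print(cur.fetchone())
finally:
    conn.close()
```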

Centralizing security is a worthwhile way to simplify architecture. Providing a single system to secure old and new data-driven applications saves your team from building and maintaining one-off security models or, worse, suffering a security breach. Okta is one provider that can help with implementing a centralized security solution.
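As one hedged example of what centralized authentication can look like in code, the sketch below validates an OIDC access token issued by an identity provider such as Okta, using the PyJWT library’s JWKS client; the issuer URL, audience, and token are placeholders for your own org’s values.

```python
import jwt  # pip install PyJWT
from jwt import PyJWKClient

# Placeholder values for your identity provider's authorization server.
ISSUER = "https://your-org.okta.com/oauth2/default"
AUDIENCE = "api://default"
JWKS_URL = f"{ISSUER}/v1/keys"


def validate_access_token(token: str) -> dict:
    """Verify a centrally issued token so every app shares one security model."""
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )
```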

Leveraging data catalogs such as data.world to maintain an informative, searchable inventory of all of an organization’s data assets can also improve efficiency for data engineers by letting data scientists help themselves to the most appropriate data for a particular purpose.
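For instance, a catalog with a programmatic interface lets a data scientist pull a cataloged dataset directly; the sketch below assumes the datadotworld Python client, and the dataset key and query are placeholders.

```python
import datadotworld as dw  # pip install datadotworld

# Placeholder dataset key and query; the catalog entry tells the data
# scientist what the dataset contains before they pull it.
results = dw.query(
    "your-org/customer-profiles",
    "SELECT customer_id, segment FROM customer_profiles LIMIT 10",
)
print(results.dataframe.head())
```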

Finally, modernizing your integration strategy by implementing dynamic data pipelining systems with a platform like Meroxa can reduce complexity and enable self-serve capabilities for your team.

It’s an exciting time to be in the fields of AI and ML, but keep in mind that to be successful, your AI projects must ultimately deliver business value from data. We have been putting the AI and ML cart ahead of the data engineering horse, and it is time to pause and rethink our priorities. Many companies are just figuring out how to do this for the first time. That means data engineers may find themselves making the most of the resources they have while rallying the organization to build more balanced teams that prioritize scalable, repeatable, and successful operational AI systems in support of data science and business goals.