Data Council 2022 Recap: Vendors, VCs, and Visionaries, Oh My!
Data Council 2022 came to Molecula’s hometown of Austin, TX last week, bringing with it a plethora of vendors, practitioners, and thought leaders. First, a quick overview if you’re unfamiliar: Data Council is a community-driven technical conference that bridges the gap between data science, engineering & analytics. The event includes six tracks: Data Engineering & Beyond, Data Science & Analytics, ML Infrastructure, AI Products, Lightning Talks, and peer-led workshops, drawing a global audience that joins together to share and learn about the latest and greatest in data engineering, MLOps, and more.
This year was the Molecula team’s first time attending the conference, and overall, it was not entirely what we expected. However, it was still a wonderful experience meeting and hearing from all involved (and being back in person, face-to-face with fellow humans!). While we’ve heard that Data Council was dominated in previous years by practitioners sharing educational “how-to”s and learnings, this year felt overwhelmingly vendor- and VC-dominated (likely a reflection of the explosion of data infrastructure and MLOps tools over the last couple of years).
The central theme of the conference? Analytics and machine learning at scale are HARD. Hence all of the vendors. Almost every session started with some iteration of “Hi, we’re here because we faced a very challenging problem. Here’s how we solved it at <insert Uber, Airbnb, Pinterest, Lyft, LinkedIn, etc.> and the company we’ve spun out of that.” While it can feel a little like Groundhog Day, it’s further evidence that dealing with massive-scale data, maintaining the quality of that data, and delivering it to customers at the speeds required is still a problem with no easy button and no single solution.
It’s interesting to note that the solutions spun out of major companies were created for very custom environments; many felt like echoes of each other (similar throughout, with slight differences in perspective on how to solve a problem). Very few seemed utterly original, which was exciting and validating to the Molecula team because, in a field full of best-in-breed data horses (Thoroughbreds, Quarter Horses, Clydesdales, Appaloosas), we are a data zebra. We’re all of the same genus, but Molecula FeatureBase and our feature-oriented format are a different species altogether, with mind-blowing benefits and capabilities.
Read on for a few of our key takeaways from Data Council 2022 and some of the solution trends we’ll be keeping an eye on in the future.
- Preaggregation is Everywhere:
- Many of the companies that presented or exhibited as sponsors were built specifically on preaggregation as a concept. There was even a (highly entertaining) discussion centered on OLAP cubes that determined preaggregation is a necessity, but the industry could do a better job presenting preaggregated data to users. At Molecula, we fundamentally disagree with preaggregation as a concept – it creates stale data and is only a requirement today because column-oriented data formats are less CPU-friendly and have inherent latency.
- Ease of Use and Complexity Reduction are Key:
- There was a recurring theme around the ideal state of having “tightly integrated but loosely coupled” technology.
- Diego Oppenheimer of DataRobot shared the similarities between the ML lifecycle and the software lifecycle, noting that a main difference is that the ML lifecycle happens at about 10X the speed of the software lifecycle. So when building these systems, one has to be aware of “one-way doors” and “two-way doors” (i.e., you must understand which decisions lock you into specific paths and which can be rolled back with ease).
- Context Matters:
- Data is an abstraction of human or machine behavior. Diversity of views and experiences helps to layer this context into an analytics use case.
- The idea of a ‘single pane of glass’ for everyone to view data/data products is not feasible; there are multiple overlapping panes of glass. Different stakeholders need different data views.
- A single source of truth is on executives’ minds, and a majority think definitions of data should live as close to the code as possible.
- Metadata Layers/Semantic Layers were Heavily Discussed:
- “Thin” semantic layers that sit on top of databases and let different data types and metadata flow through without change management are critical.
- It’s SQL’s World, and We’re All Just Living in It:
- SQL was the talk of the town, and many of the companies presenting made sure to emphasize how easily one could adapt to their product (and how easily it integrates with others in the ecosystem) because of the presence of SQL.
- At one point, a speaker mentioned “in the future, we may see a world where business leaders and analysts are expected to know SQL so that they can be self-sufficient,” and the audience began to clap and cheer.
- The Rise of Real Time:
- The rise of real-time analytics is here (again…we know, we know…), but it’s not fully baked, and batch processes will continue to exist – mainly because of the technical difficulties that arise when combining streaming and historical data (if this is a problem for your organization, reach out to us!).
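To make the staleness concern behind the preaggregation and real-time takeaways concrete, here is a minimal Python sketch. It is purely illustrative and is not how FeatureBase or any vendor mentioned here works; the `events` data and the `build_rollup` and `query_on_demand` helpers are hypothetical names made up for this example. It shows a rollup computed ahead of time missing an event that arrives later, while a query-time aggregation picks it up.

```python
from collections import defaultdict

# Hypothetical event stream of (region, amount) pairs; illustrative data only.
events = [("us", 10), ("eu", 5), ("us", 7)]


def build_rollup(rows):
    """Preaggregate: compute per-region totals once, ahead of query time."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def query_on_demand(rows, region):
    """Aggregate at query time by scanning the rows as they are right now."""
    return sum(amount for r, amount in rows if r == region)


rollup = build_rollup(events)   # snapshot taken when the rollup is built
events.append(("us", 3))        # a new event arrives after the rollup exists

stale_total = rollup["us"]                    # reflects only the old snapshot
fresh_total = query_on_demand(events, "us")   # includes the new event

print(stale_total, fresh_total)
```

The rollup is cheap to read but only as current as its last rebuild; the on-demand query is always current but pays the scan cost on every read. That trade-off is the tension the sessions kept circling back to.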
Solution Trends to Watch:
- Reactive notebooks are trying to make notebooks even more adoptable by wider audiences (e.g., Hex).
- Conversational analytics is growing, empowering more users to generate insights and reports by asking questions in English and getting data and visualizations in return (e.g., Unscramble and Whiz.ai).
- Data quality solutions are here and are being integrated (e.g., Great Expectations and Soda).
- Using GitHub as the place of commit and as a metastore makes sense (essentially as a NoSQL database). It’s what it does already. Bringing this into the UI is new and exciting (e.g., Y42).
- Responsible AI is of concern to those who are incorporating AI into products. Tools are looking to incorporate ethics into the ML lifecycle, making each step measurable, though this will likely also increase time to production (e.g., Credo AI).
If you were unable to attend Data Council 2022, the organizers are kind enough to publish speaker decks (already live, requires a form fill) and video recordings of each talk (coming in the next two to four weeks) on their website.