What's New In Data

Crafting the Blueprint for Decentralized Data Systems with Hubert Dulay

Striim

Unlock the secrets to transforming your data engineering strategies with Hubert Dulay, the mastermind behind "Streaming Data Mesh," in a riveting exploration of how the field is evolving beyond monolithic systems. As we converse with this data integration expert, you'll gain unparalleled insights into the decentralization wave sweeping across data management. Hubert unveils the power of domain-centric stewardship, where domain engineers are empowered with SQL to revolutionize analytics. Meanwhile, he advocates for a paradigm shift—one that involves harnessing data right from its origin, ensuring a seamless journey to analytics that aligns perfectly with Data Mesh's core tenets.

Venture into the future with John and Hubert as they dissect the burgeoning appeal of Postgres and its ascension in the database echelon. Discover how the fusion of operational and analytical databases is redefining the industry, with a spotlight on how companies like American Airlines leverage real-time data for critical decisions in aircraft maintenance. As Hubert and John navigate the intricacies of implementing a data mesh without the pitfalls of data duplication or spiraling costs, you'll be equipped with knowledge on crafting effective data management strategies. This is an episode for those who recognize that mastering operational analytics is not just a competitive edge but an essential cornerstone for every forward-thinking enterprise.

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data architectures, and analytics success stories.

Hello, everyone. Thank you for tuning into today's episode of What's New in Data. Really excited about our guest today: we have Hubert Dulay. Hubert, how are you doing today?
I'm very good. Thanks for having me, John.
Yeah, absolutely. We've been talking about doing this episode for a while now. I'm a big fan of your book, Streaming Data Mesh. And you gave me that special signed copy over at the Apache Kafka conference in San Jose last year, which I was honored to receive. So we can dive into that book today, but first, Hubert, tell the listeners a bit about yourself.
Yeah. I started off as an engineer, heads-down programming, mostly data and integration, over 20 years ago. It was called integration then; now it's called pipelines and so on. But yeah, over 20 years of engineering, most of it heads down, with the latter half more in big data and streaming data. I found myself at Cloudera, then at Confluent, then a company called Decodable, and now StarTree, which is a real-time OLAP database. I've written one book so far, which is, as you said, Streaming Data Mesh, and I'm halfway through another book called Streaming Databases, which I'm co-authoring with Ralph Debusmann from Germany.
It's really great you're able to spend some time putting your thoughts into literature that all of us can read and learn from, because you have a lot of great experience in the industry, spanning the types of companies you just mentioned. Before we get into all the concepts in your book, I'd love to hear the story behind writing Streaming Data Mesh.
Yeah, it was all an accident, really. When I first read Zhamak's blog on data mesh, I was inspired by it because it resonated with me. Part of my career was as a heads-down data engineer, and I call that a monolithic role, because everybody's sending their data at you and you somehow have to secure it, process it, govern it, and then provide it to consumers who expect a certain SLA, and you're just like, what? You know? When I read this blog, and it was a decentralized approach to data management, I thought, that's decentralizing the role of the data engineer, which I'm completely fine with. I think it's a role that has too many responsibilities. If you think of the requirements of a data engineer nowadays, it's a huge spectrum of tools, technologies, and programming languages that you need to know. So bringing some of those responsibilities back to the domains that produce the data made a lot of sense to me as far as ownership of the data: they know how to secure it, they know how to transform it. Why not give them the responsibility, or at least the tools, to help them publish it? So when I read that blog, I thought, this makes a lot of sense. At the time I was working for Confluent, and not even a month after I read that blog, I was engaged with a prospect asking about data mesh. I knew for a fact that some of the other SCs at Confluent probably weren't read up on what a data mesh is. So I took it upon myself to run a kind of reverse go-to-market, right? I trained my fellow colleagues in the field on what data mesh is and how Confluent can contribute to it, and that got me in touch with people in marketing at Confluent, and eventually the CMO, to whom I did my own pitch on what a data mesh is and how Confluent supports it. A couple of months later, Jay Kreps was actually using a lot of my quotes.
So, it inspired me to get into the space and to write a book related to data mesh, specifically in streaming.
Yeah. Like you mentioned, there's conceptual overlap from data mesh into technology overlap in data streaming: the way you can decouple and decentralize your architecture, and then of course map that into the business, where you can decentralize the business domains of data consumption. It would be great if you could walk through how streaming and data mesh are directly related.
I thought a lot about what the experience would be for a domain engineer, right? Like an application engineer. So she would be a coder, maybe coding in Go or something, but responsible for sending events or some data to the analytical plane so that we could build analytics from it. What would that experience be? She probably wouldn't know Python, or Java or Scala, the typical languages needed to be a data engineer. But there's a common language that everybody uses for data, and it's called SQL, and typically application engineers will know at least a little bit of SQL to be able to render, or at least look up, data that's in a database. So what that got me thinking is: how can we simplify that experience? It goes back to the self-service aspect of what a data mesh is. How can we simplify the experience for the domain engineer so that he or she can not only produce data, but productize it so that it can be consumed by many? So how is streaming related to that? Sending data, sharing data, really has a lot to do with replicating data across regions, from your operational to your analytical planes. You're replicating data, and by doing so you're already streaming by default. A lot of your sources are going to be streams, and even if they're not, the data somehow got to that site, and it doesn't come to you in batches. It actually comes at you in real time, in bits and pieces that you have to accumulate, so why not tap it at the very beginning? If the experience for the developer in the domain is to be able to use data and produce data, why have them also batch it up, put it into files, and bulk load it somewhere? Why not just tap it when the source event happens and then let the consumers of that data accumulate, aggregate, and report off of it? That's really the experience: take it from the beginning, and don't give them the responsibility of having to accumulate that data themselves before it's shared.
Absolutely. And now we're seeing a lot of data teams become either adjacent to AI teams or absorb the AI responsibilities in their organizations. How does streaming data mesh support teams that are trying to get up and running with AI initiatives?
Yeah, so I just had this conversation, I think with Kai Waehner from Confluent, and I've had it with a few other people as well. He brought up the RAG pattern, retrieval-augmented generation, which is a pattern to help you get real-time data into these models, right? To have the AI models produce more recent answers versus stale answers. So RAG provides a really nice pattern: you can source new data, a new corpus of data, create the embeddings that you need to put in your vector database, and then have these large language models use that vector database to supplement and add more context to their answers.
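To make the RAG pattern described above concrete, here is a minimal, self-contained Python sketch. The bag-of-words "embedding", the in-memory list standing in for a vector database, and the prompt-building step are deliberate toy stand-ins rather than any specific product's API; in practice you would use a real embedding model, a vector store, and an LLM client.

```python
# Toy retrieval-augmented generation (RAG) flow:
# 1) ingest documents as they stream in, 2) embed and store them,
# 3) retrieve the closest documents for a question, 4) build a prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector.
    A real system would call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "vector store": pairs of (embedding, original document).
vector_store: list[tuple[Counter, str]] = []

def ingest(document: str) -> None:
    """Called for each new document arriving on the stream."""
    vector_store.append((embed(document), document))

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(vector_store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the LLM answers with fresh information."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

ingest("Flight 212 was delayed 40 minutes due to a maintenance check.")
ingest("The loyalty program now awards double points on weekend flights.")
print(build_prompt("Why was flight 212 delayed?"))
# The resulting prompt would then be sent to an LLM of your choice.
```

The same shape holds at scale: a streaming pipeline keeps the vector store continuously up to date, so the retrieval step always surfaces recent context rather than stale answers.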
So you get these large language models, you have a stream of data converting your unstructured data into embeddings, and you're putting that into some kind of vector store. Then you ask a question of your LLM, and the LLM incorporates that latest information into the answer. It's really a nice pattern. I haven't seen it live; I'd love to be able to implement it someday, even just playing with it, right?
I'm also sort of wading into these RAG architectures, and it seems like very transformational technology, but at the same time it's a bit incremental in terms of the actual technologies you use today, with, let's say, Postgres adding vector extensions or MongoDB adding vector extensions, et cetera. A lot of databases are going to have their own flavors of this, and it really gives each product an opportunity to highlight its strengths. For instance, Mongo's object model is a strength there; Elasticsearch and OpenSearch, the open-source version of Elasticsearch, are highlighting their unstructured storage layer through that; and of course Postgres, with its wide adoption now. For data teams, coming back to the types of patterns you're talking about is very foundational. Having a streaming data mesh under the hood helps with the governance and control around domains, because you don't want your AI to hallucinate and you don't want to give it incorrect answers, and that really requires a good amount of preprocessing of the data and enforcing data quality. And with AI, it's not like the more data you throw at it, the better it's going to do. For all the data you're throwing at it, there's fine-tuning work that needs to be done to make sure it's really resolving the types of problems the users want to solve in their chat experience, the intents and the entities and things along those lines. So I think for that, streaming data mesh would be a powerful foundational concept to apply.
Yeah. Yeah, absolutely.
And I also want to ask you: how does data mesh compare to data fabric?
Right. This is a very popular question, and at the time of writing Streaming Data Mesh it was still a bit fuzzy as to the difference. What I was able to discern was that data fabric is a set of technical requirements that enable sharing of data, while data mesh is more like a social relationship between domains that share data. So there are multiple layers there. What I wrote in the book is that data mesh actually encompasses what a data fabric is, so you can't do a data mesh without a data fabric. Data fabric gives you the technical requirements, while the wrapping data mesh provides the experience, the accessibility, and the governance that are required for that ecosystem of sharing data.
It's very interesting, I see the terms used interchangeably by enterprise teams, but they're so different, right? When you think about it, data fabric is very technology oriented, whereas data mesh is almost more ergonomic. It's a way of structuring your organization to act on and organize data. Now, Microsoft naming their new core analytics product Fabric is going to add some additional fun to industry terminology. Always great to hear about these concepts. Jumping to another topic, what are the technologies you're most excited about in our industry?
I'd say there's a lot of buzz around Postgres. There are a lot of startups coming out of the woodwork that are really making a lot of noise around Postgres.
I think it was Database Insights, or, I can't remember the name of the website, but they named Postgres the number one database. I don't know if I'm excited; I'm a little excited about what's happening with Postgres, but I think I'm a bit more excited about where real-time analytics is going, because there's a spectrum of analytical use cases where you need immediate, low-latency, real-time data versus maybe hour-old "real-time" data. Is that still considered real time? I don't know. And there's a spectrum of systems providing solutions to these problems, and a lot of them are Postgres. Typically the rule is you don't report out of an OLTP database, you report out of an OLAP database, but now you're starting to see Postgres databases, or HTAP databases, that have these columnar OLAP capabilities. So analytics is starting to get really close to the edge. I think about DuckDB, and then ClickHouse came out with an embeddable ClickHouse called chDB. I see analytics being stretched all the way back to the operational plane. So when you think about the operational and analytical planes, as Zhamak kind of talked about, there's a divide there, right? That division is starting to get more and more blurry. In fact, I see analytics stretching a lot further than it has before, because everybody wants analytics. They want to see it in their apps, they want to see it in everything, right? And a lot of these may end up being embeddable analytical solutions, depending on how much data you need to analyze in those applications. I think edge analytics, or operational analytics, is going to be something really interesting, and I'm not sure how those solutions are going to be put together just yet, right?
It's definitely exciting; I'm seeing a lot of that in the industry as well. What are some examples you've seen of analytics going back to the operational plane? And why does real time make an impact there?
I wrote a blog maybe a few months ago called the Streaming Plane. Basically, a lot of the streaming technologies, and even the analytical technologies and workloads, are very hard to implement in the operational plane, because you're not going to have your historical data there. A lot of times you're not going to have the streaming processes there to take the data out and transform it into something that can serve analytical queries. It goes back to the data mesh approach, right? When you start to take data away from the analytical plane and start presenting it at the operational plane, there's a decentralization again, of analytics. You're bringing analytics back, not just data. You're not just giving data ownership back to the domains, you're also giving them some of the responsibility for their own analytics. And if they need historical data, you need the streaming plane to provide the tools that the operational plane doesn't have: CDC, for instance, transformations in real time, aggregating the data products from multiple domains together and presenting a real-time view in that particular domain. It gets really fuzzy, and it creates a cloud of data between the domains and the analytical plane, and the neatest place to organize it is back in the operational plane, where we don't yet have the tools to manipulate and present that data. I think we're still figuring out how that's all going to work, but again, it goes back to decentralization.
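As a small illustration of the embedded analytics Hubert mentions here (DuckDB, chDB), the sketch below runs an in-process OLAP query directly inside an application using DuckDB's Python API, with no separate warehouse involved. The page_views table and its rows are invented for the example; this is a sketch of the idea, not a recommended production setup.

```python
# Embedded analytics: an in-process columnar engine answering an
# aggregate query right next to the application code.
import duckdb

con = duckdb.connect()  # in-memory database living inside the app process

con.execute("CREATE TABLE page_views (user_id INTEGER, page VARCHAR)")
con.execute("""
    INSERT INTO page_views VALUES
        (1, '/pricing'),
        (1, '/docs'),
        (2, '/pricing')
""")

# An OLAP-style aggregation served locally, with no round trip to a warehouse.
result = con.execute("""
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC
""").fetchall()

print(result)  # [('/pricing', 2), ('/docs', 1)]
```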
When you have an operational workload, someone's going to act on the data, right? And that's when latency becomes more critical, because if stale data is feeding an action, it's very likely to be the wrong action, or just delayed. I think about a presentation I did with American Airlines at the Data and AI Summit: they're using data analytics, our product Striim, and Databricks to feed these operational workloads around servicing aircraft. And for them, latency is absolutely critical, because time is money when those aircraft are on the ground waiting for maintenance. That needs to be something they take action on as soon as possible; they can't have some batch job that runs for 12 hours, aggregates all the data, and then tells the crews, "Hey, we've had this aircraft sitting on the tarmac or in the maintenance center for the last day, go service it." These are the types of real-time operations I think about, where the latency of the data is critical. And of course the way you prepare the data and decentralize it for access within a company matters, because multiple consumers might need the same data for their operational use cases. Maybe they need different materializations or schemas or views of the data, but ultimately it's the same underlying data. You don't want to duplicate it; you don't want to drive up costs in your organization by saying, if I have this data set, I'm going to duplicate it n times for the n business groups that are using it. So there are smart ways, and your book Streaming Data Mesh gets into this: basically, broadcast the data to multiple business consumers while keeping storage and compute costs to a minimum. As data becomes more operational, it's going to be very critical to think about these things. Teams were sort of able to get away with it in the modern data stack world: "Yeah, we'll load the data into the warehouse; what we do with it and when, I don't know." I think now, as we're going into this next decade with AI and operational analytical use cases, that's going to be very much the norm, right? Because when you look at how enterprises operate, their data is the core piece of how they're operationalizing a lot of their workloads in a smart way, and that data is sitting in a data lake or data warehouse and they're trying to find ways to operationalize it. And that's where latency, governance, all of this stuff is more critical than ever. I think your book is very timely, and of course it's going to be fun to look out for your new book on streaming databases. That being said, Hubert, what's your future vision for the data industry?
I've been writing this book called Streaming Databases, and if you just look at that term, there are two ideas at odds with one another, like real time and batch, right? Databases are typically associated with batch, while streaming is obviously real time. But streaming databases are a convergence, I feel, of real-time streams with data at rest, and I think there's going to be more of that in the near future, not just because I think it's good to think of them in those terms, but because I think it increases the adoption of real time and streams if you provide a database experience on your real-time streams, right?
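As a toy illustration of that database experience over streams, the sketch below maintains a tiny "materialized view" in plain Python, folding each event into an aggregate as it arrives rather than recomputing it in a batch. The event shape and field names are hypothetical and not drawn from any particular streaming database; it is only meant to show why the view stays queryable and fresh at all times.

```python
# A toy incrementally maintained materialized view over a stream of events.
from collections import defaultdict

# The "materialized view": running order totals per customer.
order_totals: dict[str, float] = defaultdict(float)

def apply_event(event: dict) -> None:
    """Fold one stream event into the view; no batch recomputation needed."""
    order_totals[event["customer"]] += event["amount"]

stream = [
    {"customer": "acme", "amount": 120.0},
    {"customer": "globex", "amount": 75.5},
    {"customer": "acme", "amount": 30.0},
]

for event in stream:
    apply_event(event)
    # At any point the view can be queried like a table, with up-to-date results.
    print(dict(order_totals))
```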
And being able to see these materialized views, these tabular views of data that could come from anywhere in the globe and that you're consuming locally, is, I think, really advantageous. It promotes decentralization, and it gives you real time by default. I think this is a really hard problem to solve, but it feels really easy. The vision is there, and it's a nice vision. I'm not sure it's really that easy to implement, but I think the technology is going to help get us there in the future.
Definitely. Hubert, where can people follow along with your work?
I have a Substack account; you can follow all the content there. It's free for three months, and then you can pay $5 a month if you want. It's hubertdulay.substack.com. You can also find me at a couple of conferences coming up; the first one will be Data Texas, I believe it's called, in mid-January. So I'll see you there.
Yes, absolutely. And those links will be down in the description for those of you who are listening. Hubert Dulay, thank you so much for joining today's episode of What's New in Data, and thank you to everyone who tuned in!