What's New In Data

Joe Reis at Big Data LDN

Striim

Join us as we sit down with Joe Reis, live at Big Data LDN (London) 2024. Joe shares his partnership with DeepLearning.ai and AWS through his new course on Data Engineering. Joe's new course promises to elevate your data skills with hands-on exercises that marry foundational knowledge with cutting-edge practices. We dive into how this course complements his seminal book, "Fundamentals of Data Engineering," and why certification is valuable for those looking for foundational, hands-on knowledge to be a data practitioner. 

But that's not all; we also dissect the hurdles of adopting modern data architectures like data mesh in traditionally siloed companies. Using Conway's Law as a lens, Joe discusses why businesses struggle to transition from outdated infrastructures to decentralized systems and why cross-disciplinary skills are crucial in this endeavor, a concept inspired by mixed martial arts that he cleverly calls 'Mixed Model Arts'.

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common real-world data patterns, and analytics success stories.

John Kutay:

Hey everybody. We are live here at Big Data London. Not technically live; it is pre-recorded.

Joe Reis:

But we're here live.

John Kutay:

Yeah, we're here in person, live, which is a rare thing these days. I'm here with Joe Reis, the guy in data. Joe, you've been doing so many cool things; glad to run into you here. You recently had Andrew Ng on your show, and Andrew is one of the great thought leaders in machine learning.

Joe Reis:

He's on the Mount Rushmore, as far as I'm concerned.

John Kutay:

So that was an awesome episode. You're doing a lot of great work in educating the masses on AI and machine learning as well. Yeah, just catch us up on what you have going on.

Joe Reis:

Well, it's funny, today we're announcing the launch of the data engineering specialization on Coursera. So I've been working with the fine folks at DeepLearning.AI and AWS for the last year on this. DeepLearning.AI is Andrew and his team. So, yeah, it drops today. Pretty stoked about that.

John Kutay:

Wow, can anyone sign up?

Joe Reis:

Anyone can sign up. You can sign up.

John Kutay:

Can I sign up? Okay, most importantly, I can sign up. Yeah, those who are watching this in person can probably sign up too. We can probably get you on the wait list. Go to Coursera.org. That's super exciting. So tell me about this specialization.

Joe Reis:

Well, so Matt Housley and I wrote a book on data engineering a couple of years ago, and I really felt like that was a great intro to the fundamentals of data engineering. But it was a very technology-agnostic book that didn't go into a lot of code or any examples. That was by design. I feel like a book really isn't the place to have code examples. Some people may disagree with me, but once something goes into print it's sort of out of date already. So you want something that's going to last.

Joe Reis:

We really felt like a course was the real complement to the book, where you can give a lot of hands-on examples and, I think, dive deeper into material that you really couldn't get into in the book. So the course really, I think, is sort of the yin to the yang of the book. It covers the data engineering lifecycle and the undercurrents, but in a way that gives you a lot of challenging hands-on exercises. I mean, even when I took some of the exercises and did the final versions, it's like, God, yep, that's legit.

John Kutay:

So anyone who goes through your specialization, do they get a certification?

Joe Reis:

Yeah, you get a certification at the end by Coursera. Yeah, signed by myself and, I think, a few other people.

John Kutay:

Excellent. So we know that people who go through Joe Reis's data engineering specialization are tried and true at implementing data pipelines that can feed real-world machine learning, AI operations, and analytics. There's still generally a lot of focus on analytics. Your book, Fundamentals of Data Engineering, is a big cornerstone. It's definitely on the top shelf of my book cabinet for data engineering, along with Martin Kleppmann's book, Designing Data-Intensive Applications. And you mentioned yin and yang. I feel like those two are a great yin and yang, because your book is so practical. You say all the things that data practitioners need to know. People who've been on the job read it and say, okay, it says the stuff that's mostly unsaid but that everyone knows from learning on the job.

John Kutay:

So I always recommend it to people who are breaking in. And then I recommend Martin Kleppmann's book for the people who do their jobs in analytics and engineering but don't know too much about how it works under the covers. And now it sounds like your specialization, your Coursera class, is really for technical operators to level up and get certified.

Joe Reis:

Yeah, I mean, you could start from zero. The expectation is you might know a bit of Python. I think that's about it, not even that much. It walks you step by step through everything. But yeah, it will take you to being at least technically capable of doing stuff. I wouldn't say you'd kill it the first day of your job, but at least you'll know what you're doing. So yeah, I think it fulfills a lot of things.

Joe Reis:

And Martin's book's also awesome. We originally wrote our book to sort of be the prequel to his book, which, I would say, very much throws you off the deep end into the innards of distributed systems. I think that's a great book if you're working in it. But what we also noticed is that not a lot of people actually do the low-level engineering that that book tells you about. I think it's great to know how things work, but if you're not doing it day-to-day, especially if you're more junior in your career, I think there needed to be sort of an introduction to being able to think about how to reconcile using those tools in production.

Joe Reis:

So that's where we came in.

Joe Reis:

And then, obviously, Martin's working on a second edition of his book too, with Chris Riccomini. He's an awesome engineer, so I'm excited to see what they're doing. I hope they're doing it.

John Kutay:

Yeah, it sounds like he's updating his book to probably reflect more of the modern tools. He mentions things like Riak that were more like late 2000s, early 2010s open source tooling, but now there's been this flood of tooling that's come in. And what I'm really excited about, especially for your Coursera class, is that it's all coming together. I feel like the stack is sort of stabilizing. The tool chain is stabilizing.

Joe Reis:

Yeah, it is. It feels like that. I mean, if I look around, it's a lot of the same vendors from last year, you know. So it feels like the vendor space is at least stable. I don't see anything here that's completely out of left field. There might be, I don't know. I mean, you have to walk around and see.

John Kutay:

No, I agree with you. Yeah, I did a quick walk around. That's how I felt: it's stabilizing. We're all sort of agreeing on what the layers of indirection are, right?

Joe Reis:

What do you think about that, though? Do you think that that's a good thing, or do you feel like that's when the industry is ripe for its next inflection point?

John Kutay:

No, it's a good thing, for sure. I mean, if you ask my opinion, pre-COVID there was your modern data stack. The modern data stack sort of blew up during COVID. You saw so much funding, like billions of dollars of funding. All these unicorns came up. We all came back to conferences in 2022 being like, what the heck is all this stuff, right? And now it's sort of matured, stabilized. I have a good sense of what you should use now if you're trying to build a data or AI stack.

Joe Reis:

What do you think? What would be John Kutay's stack of choice right now?

John Kutay:

Yeah, you've got to have your cloud vendor that really drives things. I think that's sort of going to be your gravity, where you're going to decide, as your core, what tools am I going to use that are cloud vendor native, and then what things are especially value add. So let's just pick AWS, for instance. Obviously, cold storage is going to be on S3. I'm going to use an AWS managed database. Now Oracle is on every cloud, so I don't have to sacrifice having a super performant, scalable database just because I'm on AWS. And then Oracle has its own cloud now. So that's the other thing: the companies that were deemed legacy have actually modernized with their own awesome cloud versions.

John Kutay:

So my stack of choice is: okay, start there, whether you're on Oracle Cloud, AWS, Google Cloud, whatever. And then you're going to have your adjacent tools. So you're going to have your ingest products like Striim or whatever. I'm not going to name names, but you could probably guess. Data modeling, right: you're going to have your dbt or your specialized data modeling tool. Then your analytics tool, which is also evolving quite a bit. So now I'm going to turn that question back to you.

Joe Reis:

Well, my stack of choice, I think, is the stack that works for you, right? The only difference I would say is that nowadays I'm also seeing people sort of repatriate from the cloud, and that's often onto their laptop. You're starting to see that because, I think, DuckDB is getting pretty awesome, so people are maybe doing workloads there. It kind of reminds me of what notebooks used to be back in the day, where everyone did local development on notebooks, and I think you're starting to see that now with DuckDB. Polars is another big one. I think that's an interesting one to look out for. But yeah, I mean, in the cloud it's an interesting one.

Joe Reis:

You could still call it the modern data stack, but I think we've kind of moved beyond that terminology to just the analytics stack. But now everybody's also an AI vendor. So that's one of the things I'll be talking about: the lines between analytics, AI, and even applications are blurring a lot. And so I would say pick what works for you, because I think the notion of where we're moving is changing as well. The tooling is mature, and now you know what you need to plug and play for certain use cases. What I'm looking at next is what happens when these things start fusing into data-driven products, AI-driven products, and it's not just about dashboards anymore. It's about how do I power real applications to provide analytics to users, which again provides a feedback loop to the app, and the same with machine learning and AI, right? So I think that's the thing I'm most excited about.

John Kutay:

And you cover all those core components really well in your book, Fundamentals of Data Engineering. Now, the other big thing: you mentioned people are moving workloads onto their laptops, which is a great point. I'm actually going to be at Small Data SF, run by the MotherDuck people, next week, and that idea is like, okay, your laptop can store a terabyte of data and has hyperthreading and multi-core processing, and you can do so many ad hoc workloads on your own machine without running up cloud compute bills. So that's also going to be super interesting. But how do you still centralize and have governance around it, not have copies of data and things like that?

Joe Reis:

This is the big crux, I think: figuring out what's the governance of the data sets that are created. That's why I'm talking about data modeling, because I feel like practitioners have to understand how to use these various forms of data. We've moved beyond tables, right? Semi-structured is even an old story, but now it's about text, even images, video, audio. And how do you blend all these data sources together? Even Mike Ferguson was talking about it during his keynote, where it's like, you're going to be doing analytics combining structured and unstructured data sets. I mean, it's been happening at some companies for a while, but this is sort of what I'm excited about. But then you've got to know how to handle all these different data sets. Right, like if you think that everything needs to be in a table and you approach everything from a relational modeling standpoint. Or even nested data, right? Semi-structured data is a first-class citizen in almost every database now. But then, okay, that would definitely violate the third normal form of the relational model, because you can't have nested data. So how do you reconcile all these concepts together and create data sets that make sense, not just for you but for other users? Because if you're working in a decentralized world, the data ideally has to have some useful form and shape, right? So it's been on my mind a lot: how do you make this work?

Joe Reis:

Because I think decentralization is sort of the goal. I mean, even now, right, Zhamak talked here two years ago, and I still hear conversations of data mesh, data fabric, and everything else. That conversation isn't going away. I think that's an ideal that everyone wants to get to. But ironically, the centralization of standards and the federated computational governance is sort of how you're going to get there. So it's almost a paradox in a way: you need a centralization of practices, or at least an understanding of cross-team practices, to make it work. So it is interesting.

John Kutay:

That part: every organization has to figure out what decentralization means to them. We'll talk tomorrow with National Grid on how they've decentralized their analytics with data products, and there's always this sort of messy middle where you can't just read a book and say, this is how we're going to do decentralization. I think it's like trying to fit a round peg into a square hole. So what are your recommendations for teams that are trying to go down this journey of decentralization?

Joe Reis:

I think you hit it pretty clearly, though. You've got to understand what that means to you, because inevitably you're going to run up against Conway's Law, especially within your own organization. Conway's Law says that you'll design systems according to the way that your company communicates. So if you're very siloed, the systems you build are going to be very siloed. And so one of the cruxes is really understanding, in your organization, what's a tolerable level of decentralization, or is it tolerable at all? That's the other thing.

Joe Reis:

Maybe you don't need to, because it's physically impossible anyway. So you may as well not lie to yourself, which I think a lot of companies do, because they're like, oh, we've got to do data mesh. I'm like, yeah, there's no chance in hell that'll ever happen at your company. Everything is very rigidly hierarchical and siloed, and it wouldn't work. The very org structure is the epitome of centralization. So it's the inviolable rule. There's sort of a corollary that I jokingly called Reis's Law, which is: you'll design your data models according to Conway's Law and the architecture that supports the company. So it's interesting.

John Kutay:

Okay, yeah, I think Joe's Law is certainly something that everyone needs to coin at this point. And I agree exactly: the way you do data modeling, even the way you name your tables, is a reflection of how your company operates. Are you really, truly meant for scale, where business users can go click a button and get some insight, or are you always going to need to ask the data engineer to decipher, decode, and prepare data for you?

Joe Reis:

Oh, yeah, I mean, I've seen it in some companies. One company I remember, they were still using a 1980s-era mainframe, and they're like, oh, we're going to decentralize it. No, not until that thing's gone. You had to design all your architecture according to how that database was designed back in the day. You had seven- or eight-character-limit columns, and it's pretty awesome. But that's the reality of a lot of businesses, right? You have a lot of infrastructure that needs to be revamped, and it's like, okay, you're just going to gut all that today and go move to something else? No, that's not how that works. That's just the reality of it. So I guess, to answer your question, it's like, what does that mean to you? You've got to look at the cold hard facts of, oh, what can we support?

John Kutay:

Yeah, right. So, as of the time of this recording, today you have a keynote here at Big Data London. What are you going to talk about?

Joe Reis:

I'm going to talk about Mixed Model Arts. I'm a big MMA fan, mixed martial arts fan. Have been for a long time. The notion is really that, in a lot of ways, I think the data world is moving forward. I think AI was sort of the kick in the pants that everyone needed to move on. But the discussions have been, um, have you watched Napoleon Dynamite? Remember that character, Uncle Rico, the guy who was living in the past, when he almost won state for football in high school, and he could throw a ball over the mountain, right?

Joe Reis:

That's what I feel like a lot of the data industry is. I feel like we're Uncle Rico, where we're just reminiscing about the past, and we're stuck in the 80s and the 90s. And data modeling, this is true: when you mention data modeling, people still talk about relational. It's like, oh, you mean Kimball or Data Vault, right? We're still stuck in this tabular world. Nothing wrong with that; it works for what it works for. But the thing is, with applications, NoSQL has been around 20-plus years already, right? And streaming is another thing. Are any of those tables? Sort of, sort of not. There's a blend of data. Now we're introducing text and all this other stuff into it.

Joe Reis:

The notion of Mixed Model Arts is about adopting what works across disciplines: machine learning, analytics, and apps. Because there's a convergence happening of all these disciplines as we speak. When you open up Uber, right, I think at one point, when it started, that was a Rails app, actually. But now, if you look at the number of Kafka events it brings in, and a number of other things, it's insane how much data this thing ingests. It's a data-driven app that also uses machine learning. All these things work together seamlessly, and this is the direction products are going. So it's a recognition that there's more than one way to think about and model data. But you have to know all these things, and increasingly, especially when we live in an era of constraints and budget cuts, more is asked of people. Whether you're a software engineer, a data analyst, or a scientist, you're going to have to become, I think, full stack with your data modeling skills.

Joe Reis:

So that's the notion of the talk.

John Kutay:

That's incredible. And does this tie seamlessly into your certification?

Joe Reis:

Somewhat. We do talk about data modeling, and we do talk about all these use cases across different areas. So we talk about analytics, machine learning, even working with application data in the course, for sure.

John Kutay:

Yeah.

Joe Reis:

So I feel like this is one of those things where you just need to understand all the different ways of handling data. It's sort of like mixed martial arts, right? It's a cross-disciplinary sport, and this is the same way I feel about data, where we need to catch up to where reality is instead of dwelling on the past and staying fixated on the old approaches.

John Kutay:

I always quote you specifically on this when you talk about, okay, what direction is data going in? Look at software engineering. Because more and more, I find that really good, productive data engineers and people who work with data stacks have the skills of a software engineer. And if you hire a software engineer, they can probably learn that pretty quickly, right? They can probably take your certification, read your book, and just hit the ground running and build these scalable, fault-tolerant data pipelines. Whereas if you hire someone who's too specialized, just in analytics, like their only knowledge is SQL, they have to work in a very narrow scope, which ultimately makes it hard for them to be successful. So I always mention this to folks: data engineering is software engineering, but it's specialized software engineering.

John Kutay:

I think your point about Mixed Model Arts is that, yeah, you have to have this multidisciplinary skill set, and ultimately that's what being a software engineer is. You can't be a software engineer who just says, oh no, I only use this one part of one language and I don't deploy it, I just write the code. No one thinks like that. I only write structs, just for loops.

Joe Reis:

No, but if you're taking the mixed martial arts equivalent of software engineers, I always felt like software engineers are the wrestlers. If you were to pick one discipline, if you were to do nothing else and go into martial arts, it's wrestling: you can dictate where it goes, you can keep it standing up, you can go to the ground. And I feel like software engineers have the chops to dictate where things go, because they have the technical ability to deploy things in production. That's unlike analysts who just make dashboards. Nothing against that, but it's just a different skill set. But again, with the way everything's going, what I notice is that my software engineering friends, and I have a ton of them, a lot of them are interested in analytics and machine learning. It's like, I need to start bringing this into my stack now.

John Kutay:

Yeah.

Joe Reis:

Right, especially ML and AI. This is the stuff that you weren't talking about before. But now a lot of software engineers are conversant in vector databases, for example. Two years ago, that wasn't even a conversation piece, right? So now they're expected to bring in all these different workloads, like, I need to make an OpenAI plug-in call or something. So yeah, different world.

John Kutay:

One thing I am really impressed with is just how ergonomic the cloud providers have made using LLMs and integrating them into the same cron jobs that are doing data processing, data modeling, and transformations. I can go into Google Vertex AI's Model Garden and pick Anthropic's models or Gemini. AWS has Bedrock, and it's super integrated. I just did a talk here right before this with Kramp, which is an amazing company. They supply a lot of the parts for agriculture companies in Europe. They operate at tremendous scale, and they've already adopted AI and machine learning. Not because they were just super gung-ho on doing AI, but because they were like, this is the next practical step to make the experience better for our customers.

Joe Reis:

Bingo, and that's just it. It's about making experiences better, and that's what a lot of these products do. Better chat interfaces, for example: that's low-hanging fruit, right? Okay, but then what does that depend on? Well, it probably depends on training on some text data that you have. You've got to know how that works. So, yeah, it's an exciting time.

Joe Reis:

I feel the inflection point that we're in right now is that the stack has stabilized, but now it's about adopting new approaches to solve new types of problems, not just shoehorning the existing stuff to solve old problems. I think we've done a good job at that. We've solved a lot of those problems, at least from a technical standpoint. I still think the people-process-technology arm is always in the picture, and there are always challenges there. But from a technical aspect, when I look at the vendors here, it's hard to find a tool that you would say is a really bad tool. There are a lot of great tools these days. The bar is very high, and competition's intense. So now that you've solved a lot of, I'd say, the pretty standard problems, that pie widens, right? And now you can solve more problems.

John Kutay:

Yeah, absolutely, Joe. You're the man in data. That's the best way to describe you. You're also the author of Fundamentals of Data Engineering. We'll have a link to that book down in the show description, and also to his new Coursera class. Joe, have a great time at your keynote today. I'm definitely going to be there. I'll be your biggest fan in the back. Maybe heckling, I mean, depending on what you say. Joe, great to see you. Thanks, everyone, for tuning in.