What's New In Data

The Secret to Becoming a Great Data Engineer with Zach Wilson (DataExpert.io, Facebook, Netflix)

Striim

Zach Wilson, an industry virtuoso with experience as a data engineering leader at Facebook, Netflix, and Airbnb, pulls back the curtain on his journey through the world of data in our latest episode. With tales of his ascent from the ranks of Think Big Analytics to pioneering educational practices with DataEngineer.IO, Zach's narrative is a treasure trove for aspiring tech professionals. He not only demystifies the progression from data engineering to software engineering but also shares the trials of career elevation—all served with a hearty side of SQL and backend development insights.

The conversation then shifts gears to the buzz around AI's role in data engineering. While LLMs like ChatGPT are adept at churning out SQL queries, Zach asserts that they haven't usurped the throne from human engineers just yet. Distilling the essence of stakeholder communication and conceptual data modeling, he reminds us that the human element is the linchpin in a landscape increasingly guided by algorithms. It’s an eye-opening exploration of how AI might be the trusty sidekick, but data engineers—as the heroes of their own stories—still save the day with their indispensable human touch.

Wrapping up, Zach takes us on a tour of the latest innovations shaping the data engineering domain. From analytical patterns to the significance of community within the data sphere, his enthusiasm for the field is infectious. The episode underscores the vital role of collective wisdom and personal experience in navigating the toolkits and methodologies of data engineering. So, buckle up for a ride with a mentor whose insights illuminate the path forward in the ever-evolving tech landscape.

Follow Zach Wilson for his insights and educational material on Data Engineering

Zach Wilson's Data Expert Academy - DataExpert.io

Zach Wilson on Twitter 

Zach Wilson on LinkedIn

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common real-world data patterns, and analytics success stories.

Hi everyone. Thank you for tuning in to today's special episode of What's New in Data. I'm here with the Zach Wilson. Zach, how are you doing today?
Doing great. Doing great. Excited to be here.
Yeah, for sure. We're up here pretty high in San Francisco. We got the bay behind us. It was your suggestion to get that as the backdrop. So, yeah, appreciate your support, your creativity, and your insights as always, Zach.
Yeah.
Yeah, so I'm sure you all follow Zach already. He has hundreds of thousands of followers who go to Zach for his insights on data engineering. Zach, you've worked at Facebook, you've worked at Netflix, you've worked at Airbnb, you name it, and you've been in the trenches building the data engineering pipelines, and you're so generous sharing that knowledge with the data community. But tell the listeners a bit about yourself.
Yeah, for sure. So I actually started my data engineering career back in 2015. I worked at this company called Think Big Analytics. That's where I got into Hadoop. Back then, Hadoop was the craze. Everyone was like, you gotta know Hadoop, you gotta know MapReduce, you gotta know all these things. This is the future. And I learned a lot of that, and those skills were actually what got my foot in the door at Facebook, even though I only worked at Think Big Analytics for like six months. That's it. And I didn't really go to a fancy school or anything like that. I went to Weber State University. It's like a tier three school in Utah, so it's not that fancy. When I got in at Facebook, I was the only Weber State alumni who worked at Facebook. Even though they had 40,000 employees, it was just me. And I was like, wow, that's crazy. That's a really crazy thing to figure out.
But then after that, I got in, and I started really finding a groove with it, right? And it was a good time at Facebook. I really, really liked it there. One of the things I learned at Facebook, though, was that the way they do data engineering is very SQL oriented, very analytics and SQL oriented, and I wanted to be more of a builder. My archetype of data engineering is more like data engineer and software engineer; I kind of do both. So when I'm doing just SQL and Python, after a while I get really bored, because I want to build other things. That's why I left Facebook to go to Netflix, because Netflix is like a free-for-all. They let you do whatever you want there. Freedom and responsibility is their number one cultural tenet, and it was great. I learned a lot there. I shipped code in like 11 languages at Netflix. It was crazy. I learned so much there and actually transitioned from a data engineer to a software engineer at Netflix, because I wanted to do more backend development and systems and stuff like that. But when I was doing that transition, I realized that I wanted to do it, but I also didn't want to stall out my career in terms of leveling. So I essentially got a promotion to staff engineer at Netflix and changed from data engineer to software engineer. And that was a lot. Getting promoted and changing tracks, I don't recommend it. You should do one then the other, not at the same time. That was a big thing that happened for me, and that increased responsibility, I feel like, was the thing that ultimately got me kind of burnt out at that company. And then I quit, and I was just like, I need to take a break. I can't do this anymore.
And now, Zach, you're the founder of this awesome program called DataEngineer.io. It's really a product when you think about it. I personally signed up for one of the classes, and I'm going to talk about my experience there a bit. But tell the listeners a bit about DataEngineer.io.
Yeah, for sure. So DataEngineer.io is a way to learn not the basics of data engineering, but the learnings that you get from experience. It's a six-week program where I bake in all of the hard-earned knowledge that I got from experience, whether that be around data modeling or real-time data or Spark or data quality or communication. We have all sorts of different weeks where we cover things. And you can take it two ways. You can either take it live over a six-week block, or you can take it self-paced if you don't want to do it in such an intense way, because it is an intense program if you take it live.
Yeah, absolutely. And that was one of my favorite things about the course. It's not your elementary, like, hey, what is SQL? What is a database? All these things that are extremely fundamental, but that you can now really learn anywhere. Your course is so cool in that, like you said, it's from that hard-earned knowledge, where you just have to be on the job to know that you'll come across those problems. And you just have people dive into those problems, right? Like, do the dimensional modeling with a real Postgres instance that you set up on your laptop. Set up Flink, set up Spark, build those initial pipelines, and really get people's feet wet.
So when they do go interview for a job, they actually have those skills already. It's one of my favorite things about your course.
Yeah. Awesome. Thank you. It's been awesome. I've been building it out, especially in the iterations. We're currently in V3. From V2 to V3, one of the big things I'm shifting is taking the content and moving it to be more cloud oriented. In V2 we used local Postgres, and now in V3 we're using Trino and Iceberg in the cloud. So we're making it even more real world in that way, where we're going to be using cloud tooling exclusively for this version of the bootcamp.
Oh, very cool. Very cool. And I think that ultimately makes the process a bit faster for a lot of the students, right? They don't have to spin up as much infrastructure on their laptops. Maybe their laptops aren't beefy enough to run Spark.
And it's great. That's actually one of the things that I showed: especially for the first two weeks on data modeling, it's so accessible now that you can actually learn data engineering on the phone. You could do all the coursework on the phone if you wanted to. It's freaking crazy.
And honestly, that's the future. I had Zach Henlon on the show, and he has a mobile-first BI product, and we were just thinking, all the stuff you do in your life in terms of productivity is on your iPhone now. Why don't you do data on it as well? Get your insights fed to you with push notifications, and then learn about data engineering too.
Yeah, for sure. And especially for data viz stuff, I could definitely see a product that's kind of like Tableau: drag and drop, move things around, build a chart.
The Tableau experience on mobile seems like a killer idea, you know?
Yeah, yeah, for sure. And what you're bringing to the table, mobile education in data engineering, is super cool too. I saw your tweet about it with Trino. Just making that expert, real knowledge of data engineering accessible on your phone, on the go, and then through these live bootcamps as well, gives data engineers who are breaking into the space a lot of options. One of the ways that I think about comparing it is an existing course called Reforge that Brian Balfour, who used to run marketing at HubSpot, started. He has thought leaders there who have been in the trenches doing growth and marketing and product for these big unicorn companies and publicly traded companies. And it's very similar. It's real boots-on-the-ground stuff that you actually do on the job, the stuff you actually face. I think that's the type of knowledge that's the most valuable right now, because the elementary stuff's all out there now. You can learn it anywhere. The stuff you're teaching is so critical to actually doing this on the job.
Oh yeah. Oh yeah. Because all that elementary stuff, ChatGPT and AI are going to teach all of that, and that's going to be the future. One of my visions is, I actually own the domain zachgpt.com. I bought that one, and I have this vision of taking all my LinkedIn posts and everything and making my own LLM, where people can ask it questions, and it's trained off of all the content that I've written, so that you can learn my opinion on things. But it can be in its own remixed sort of way.
Yeah. Yeah. Yeah.
So, I mean, my stance on this is, AI is going to redefine what it means to be human. What does it mean to be yourself? Because if you can index all your thoughts to date in an LLM and give people chat experiences that make it feel like, hey, I'm learning from Zach, this all comes back to the individual and what it means to actually be human, because AI is going to do something that mimics a human really well. So, speaking of AI and data engineering specifically, you have a really great insight on how data engineering is being disrupted by LLMs.
Yeah.
Can you speak to that a bit?
Yeah, for sure. That's a big one. I'm sure some of y'all have seen those articles about the death of the data analyst from LLMs, where people are showing, hey, look, you can ask ChatGPT to write a SQL query for you. But in all those examples, it's always like, give me the number of sales in New Zealand, or something like that, and it's three lines of SQL. It's like, wow, congratulations, LLM, you can write three lines of SQL. But if you actually try to ask it for something a little bit more complicated, it's not going to do a good job. For complex SQL, it's still not quite there. I want to try GPT-4 Turbo. I definitely want to try some of the new stuff and see what it does, because there's a big increment from ChatGPT to GPT-4. I've noticed that, but there's still a bit for it to go. I think that's one area that's at risk: writing SQL, writing pipelines, the technical pieces of data engineering. I think those pieces are more at risk. But then there's this other area of data engineering that is more or less safe, and that's going to be talking with people, discovering the needs.
Because I remember one of my friends, Sundas, she's a content creator on Instagram, and she was saying that data analytics jobs are safe from ChatGPT, and there's only one reason why they're safe: stakeholders do not know how to actually say what they want, right? Or describe what they need. If stakeholders could correctly describe exactly what they need, then yeah, sure, we can get there, but that's a difficult problem. Being able to describe exactly what you need comes with experience, analytics experience. You have to know what is possible, and you have to have that analytical mindset already to be able to describe those things. That's what data engineers and data scientists and all those people do: they have that analytical mindset, they know what's possible, they know what's available to them. And those things are difficult for the more business-oriented people to do. I don't see that changing, because that's more of a human problem, of humans and skills and stuff like that. That's why in my class I talk a lot about data modeling. A lot of people think of data modeling as, okay, you define a schema in the cloud or a schema in Postgres or something like that. But that's physical data modeling; that's the actual data. You also have this higher-level concept called conceptual data modeling, which is where you talk with the stakeholder about: what questions do you want answered? What are the things that you want insight into?
And that empathetic process of going through and having that back and forth with the stakeholder is how you can build a data model that works for today's needs and tomorrow's needs. But probably not the next day, because it's very hard to get a completely future-proof data model; data models evolve over time. But if you can get today and tomorrow, and understand that you will probably have to edit your data model in the future, but you can capture it so it has some runway, that's the skill that really separates good data engineers from great data engineers: that ability to be forward looking, and also backward looking, so you can have a data model that you can apply backwards on your historical data. That way you have a continuous model from whenever the data starts through to a forward-looking future. That process is one that I think is pretty safe from LLMs and ChatGPT and all that stuff, because it's mostly conversation based. It's very human-driven, conversation based. So yeah.
Yeah. Yeah. And absolutely. Some people try to reduce this too much, where it's like, well, all that data will be in the vector database or the LLM, and anyone can go ask a question. And that's where real data engineers know, no, that data's not always in the warehouse or in the vector store, and you have to build new pipelines and things like that. So what you're alluding to, that human conceptual problem, was always going to be there. I previously had Patrick Miller on the show. He ran enterprise AI at Google, and now he leads data and AI at New Front, and he was talking about, look, I have experience in AI, but I know that there needs to be a human in the loop.
So, what are some of the ways that AI can empower data engineers?
That's great. I think there's a bunch of different ways. One is a thing that I hate about data engineering that I have already used AI for a bunch of times. When you're building a pipeline, say in Spark (it's best for Spark pipelines especially), there's two types of tests that you need to do. You need to do your unit tests, and you need to do your production data quality checks. Those are two places that you need to write tests. Otherwise, the quality of the pipeline is not going to be very good. One of the things about the unit test part of a Spark pipeline is that you have to generate fake data. And oh man, every time I'm generating the fake data, I'm just like, I hate this so much. So what you can do with ChatGPT, which is so amazing, is you say, okay, here's a schema with these columns, generate me 20 fake rows for this schema. And ChatGPT's pretty good. It gives you a pretty good output data set that you can then plug into your pipeline, and that can be the fake data that you use for your input and expected output. I remember at Airbnb, before ChatGPT, I was always generating CSV files by hand, adding all the IDs and the commas and making sure the nulls were right, and I'm like, why am I writing data right now? It just felt like such a waste of time, right? I had such pain with that process that at Netflix and Facebook, when I worked there, I just didn't write those tests. I was like, they're not worth it. Not worth it, right?
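As an editorial aside, the "schema in, fake rows out" step described above can be sketched in plain Python. This is a minimal illustration only, not the tooling used at any of the companies mentioned: the schema, the column names, the type tags, and the toy `spend_by_country` function standing in for a pipeline are all assumptions made up for the example.

```python
import random
from datetime import date, timedelta

# Hypothetical schema for illustration: column name -> simple type tag.
SCHEMA = {
    "user_id": "int",
    "country": "str",
    "signup_date": "date",
    "lifetime_spend": "float",
}

COUNTRIES = ["US", "NZ", "DE", "IN", "BR"]


def fake_value(column, type_tag, rng, null_rate=0.1):
    """Return one fake value; non-key columns are sometimes None to exercise null handling."""
    if column != "user_id" and rng.random() < null_rate:
        return None
    if type_tag == "int":
        return rng.randint(1, 10_000)
    if type_tag == "float":
        return round(rng.uniform(0, 500), 2)
    if type_tag == "date":
        return date(2023, 1, 1) + timedelta(days=rng.randint(0, 364))
    if type_tag == "str":
        return rng.choice(COUNTRIES)
    raise ValueError(f"unsupported type tag: {type_tag}")


def fake_rows(schema, n, seed=42):
    """Generate n fake rows; a fixed seed keeps the test input reproducible."""
    rng = random.Random(seed)
    return [
        {col: fake_value(col, tag, rng) for col, tag in schema.items()}
        for _ in range(n)
    ]


def spend_by_country(rows):
    """Toy stand-in for the pipeline under test: total spend per country, skipping nulls."""
    totals = {}
    for row in rows:
        if row["country"] is None or row["lifetime_spend"] is None:
            continue
        country = row["country"]
        totals[country] = round(totals.get(country, 0.0) + row["lifetime_spend"], 2)
    return totals


rows = fake_rows(SCHEMA, 20)
print(len(rows), "fake rows;", len(spend_by_country(rows)), "countries with spend")
```

In a real unit test, the generated rows would be the input and a hand-checked result would be the expected output; the seed is there so both stay stable between runs.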
And when I was at Airbnb, I changed my mind about that, where I was like, okay, these tests do catch errors. They're actually amazing, because they catch errors before they enter production, which is awesome. That's the best time to catch it, because then you never even have bad data enter production. It's freaking really good, but just writing those tests is so tedious and painful. And that's a great spot where ChatGPT does a really good job. I also think that it does great if you know different analytical patterns. For example, slowly changing dimensions, cumulative table design, retention curves like a J curve, survivorship analysis. There's, I don't know, maybe five or six different analytical patterns where you can ask ChatGPT, hey, give me a skeleton DAG that will generate this analytical pattern. Then you can just start with that in Airflow, and you get the skeleton. And that's amazing, because then you don't have to think about how the tasks all depend on each other, because that analytical pattern is always the same. Those are very powerful analytical patterns, and you can use ChatGPT to generate that skeleton for you. Then you just have to plug in the SQL details.
That's great. Yeah, all those little annoying things that data engineers shouldn't have to deal with that are super redundant. It seems like ChatGPT, and AI in general, is a great candidate to automate that, to ultimately empower the person and make them more productive. Now, what's something at a principle level, not just because the technology is not there, that data engineers should not use AI for?
Ooh, that's good. So, I think there's a couple things that data engineers should not use AI for, at least in my experience.
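As an editorial aside on the skeleton-DAG idea mentioned a moment earlier: the point is that the task dependencies of a known pattern are fixed, and only the SQL changes. The sketch below expresses one such skeleton with Python's standard-library `graphlib` rather than Airflow, so that it runs without any installation. The task names and the shape chosen for a cumulative table pipeline are assumptions for illustration, not a DAG taken from the episode.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for a cumulative-table pipeline.
# Each task maps to the set of tasks that must finish before it starts.
CUMULATIVE_TABLE_SKELETON = {
    "wait_for_yesterday_cumulative": set(),
    "wait_for_today_daily": set(),
    "merge_daily_into_cumulative": {
        "wait_for_yesterday_cumulative",
        "wait_for_today_daily",
    },
    "run_data_quality_checks": {"merge_daily_into_cumulative"},
    "publish_today_cumulative": {"run_data_quality_checks"},
}


def execution_order(skeleton):
    """Return one valid run order that respects every dependency."""
    return list(TopologicalSorter(skeleton).static_order())


order = execution_order(CUMULATIVE_TABLE_SKELETON)
for step, task in enumerate(order, start=1):
    # The SQL for each task is what would still need to be written by hand.
    print(step, task)
```

The same dependency dictionary could be translated into Airflow operators; only the bodies of the tasks would differ from one table to the next.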
This is, I don't know, maybe a controversial take, because other people, I would say, would disagree with me on this. SQL generation is an interesting one, where I feel like it's getting there, and this is something I might change my mind about. Maybe GPT-5 comes out or something like that, and in the future I might change my mind. But right now, if you ask ChatGPT, instead of give me the skeleton of a slowly changing dimension pipeline, you say, here's the schema of my table, these are the dimensions that are slowly changing, write all the SQL and the pipeline, it does a bad job. There's always weird mistakes. There's that meme, right? Where it says if you use ChatGPT, you spend 10 minutes coding and 24 hours debugging, whereas in the past you spent 8 hours coding and 1 hour debugging. So you're spending about three times more now. Because once you're dealing with more than 5 or 10 lines of code, ChatGPT is just going to make a mistake. There's going to be one or two little small things that are hard to correct. Especially if you don't have that much experience, if you're a new data engineer, you might not even know that it's wrong, because the query will run. It's just the data's wrong. That's why I don't like using ChatGPT for SQL generation. It will give you a query, and it will have data, and then you will maybe feel like, hey, that query ran. There was not a syntax error.
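As an editorial aside, here is a small sketch of the failure mode described above, where a query runs cleanly but the data is wrong. It uses Python's built-in `sqlite3` module with two made-up tables (`orders` and `shipments` are assumptions for illustration). The first query joins to a table with several rows per order, so one order's amount is silently counted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 'NZ', 100.0), (2, 'NZ', 50.0);
    -- Order 1 shipped in two parcels, so it has two shipment rows.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
    """
)

# Runs without any error, but the join repeats order 1 once per shipment.
wrong = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
    WHERE o.country = 'NZ'
    """
).fetchone()[0]

# Checking that a shipment exists, instead of joining, counts each order once.
right = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders o
    WHERE o.country = 'NZ'
      AND EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.order_id)
    """
).fetchone()[0]

print("shipped NZ sales, join version:", wrong)    # 250.0, inflated
print("shipped NZ sales, exists version:", right)  # 150.0, correct
```

Neither query raises an error, which is why this kind of mistake is easy to miss without a test that compares the result against a known expected value.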
But then when you look at the data, the data might even look mostly right, but a lot of times there's these nuances around the edges that are off, and that's where you don't really want to use it for SQL generation. Unless you're asking for a query that's like five lines. But then, I guess my perspective as a SQL practitioner is, I can write those five lines of code just as quickly as I could write the prompt for ChatGPT, right? So I don't find it impressive that it can do a five-line query. Alright, that's all.
Yeah. In data, what does the community mean to you?
Oh, that's great. Community is a very, very important part of data. Especially now, as I've kind of branched away from Big Tech and I'm in this new environment, community is so important, for a couple of reasons. I know when I post on LinkedIn, sometimes I'll post a hot take literally just because I want to have people tell me I'm wrong. I want to see the other side, right? I want other people to say, no Zach, it should be this way or that way. I really love the wisdom of crowds. I remember my dad told me this story one time. He's a mechanical engineer, and in one of his classes the professor put a line on the board and asked everyone in the class to estimate the length of the line. They all put in their estimates, and when you average it all together, it's within two or three percent, even though they have no measuring tools. They're just looking at it. But if you take all the data points together, you get pretty close.
So you get that kind of crowdsourcing of knowledge, and that is something that I think is really cool, because we are all just on this journey, and we all have blind spots. Even me, I have blind spots in data engineering. For example, people are always talking about dbt and stuff like that, and I've literally never used dbt, not even one time, because in big tech we just don't use it. They have other tools that they use instead of dbt. And then all these people are talking about it, like, oh, it's the next big thing, it's so hot, and I'm like, okay, I need to learn from these people. Then I added it to my bootcamp, and that's a great example of where community's awesome, because I added it to my bootcamp, but I'm having someone else teach it. I don't feel like I should teach that class, because I would essentially be teaching not from a place of deep experience, and that's the whole point of my bootcamp: every class should be deep-experience based, right? In the last one I had experimentation, where I taught it, and then I realized I needed to not teach that class. That's a data scientist, that's not me. So I have a data scientist teaching it. I don't know if you know Tim Chan at Statsig. He's going to be teaching that class. That's the whole idea. It's beautiful because we get this community rolling. Ultimately, I feel like these bootcamps are going to be easier on me. Because in V2, I taught all of them. I did 22 lectures, filmed 65 hours of content in 6 weeks, and I almost died. It was too much. At the end of that, I was like, I need to sleep for like a year.
But this time around, instead of doing 22 lectures, I think I'm going to do like 16, and then I'm going to have the other 6 be done by guest lecturers, because they'll do it better. There are edges around this data engineering puzzle that other people know better than me, for sure. They know it way better than me, and I want to bring those people in, so that we can have that same sort of deep-experience-based teaching, but for all the aspects, so that people can really get that. Just like what you're saying with Reforge, they can get that same sort of idea, but for the whole data analytics stack.
Wow. And that community is amazing. Yeah, absolutely. And the best part about community is just how generous people are with their earned insights, and how eager the people who are earlier in their career are to learn from the people who know what they're talking about and know how to navigate this stuff in a professional environment. So the work you're doing there is incredible. What's next for Zach Wilson?
That's a good one. So I have a couple things that I'm working on right now. So, DataEngineer.io. I own a couple domains, actually. It's interesting, because when I quit my job, DataEngineer.io was not the first product I was working on. I didn't start working on DataEngineer.io until two or three months into quitting my job. I was working on a different product, which I still am working on. I just realized that that product needs a lot more iteration before I can get it to market. So I have another domain called techcreator.io. DataEngineer.io is one example of how to run a bootcamp, a cohort-based class. I don't use any platforms for my content serving at all.
There's content platforms like Maven, where you can sell your course. Maven takes 10%, though, and when I saw Maven taking 10%, I was like, I know I'm sitting on a million-dollar business, and I'm not giving them 100,000. Not doing it. No way. They're not providing enough value to give them 100 grand. Right? And so I was like, I'm gonna make the content. For me, as an entrepreneur, I think there's only one company that deserves a percent cut of my business, and that is Stripe. Stripe gets their three percent, or whatever, right? Their three percent and seventeen cents, or whatever their weird monetization model is. But everyone else needs to be a monthly fee, right? Like a server fee, or a monthly membership fee, or whatever. It's gotta be flat, right? I like that. So anyways, TechCreator is the idea. DataEngineer.io actually uses the TechCreator platform. TechCreator is a way to empower other influencers and creators who want to build a course to launch their own course, and it also allows you to index all your social media content. So I actually have an archive: if you go to ZachWilson.tech/search, you can search all of my social media content. Say you search for Airflow: you can find the 30 posts I've made on Airflow and go through all of them if you want. It gives you an indexed archive of everything that I've ever written. I want to integrate Substack as well. You gotta integrate one platform at a time, right? But that's what TechCreator is going to do: it's going to enable people to create cohort-based courses, and it's kind of like a super Linktree as well as cohort-based courses. So, yeah, I'm excited for that.
That's going to be launching probably sometime like March or April next year.
Wow. So other thought leaders and educators in the community are going to have more great technology coming their way from someone who's done it before. That's super exciting. Zach Wilson, creator of DataEngineer.io, you've taught so many people in the community. I was pleased to be able to learn from you today. Thank you for joining What's New in Data.
Yeah, thank you so much. This is great being here.