How AI Will Transform DevOps and SRE Practices | Better Stack Podcast Ep. 12


Transcript

00:00:00I'll put it this way: open up a Claude Code project and ask it to design you a bridge
00:00:05or a skyscraper.
00:00:08And then I want you to take that design and I want you to go and build it.
00:00:12And then I want you to go and sit in it. Are you going to be comfortable doing that?
00:00:16Are you going to be comfortable driving over that bridge that it designed?
00:00:19Right? Until you stop shaking your head, we need those engineers in place.
00:00:25Welcome to the Better Stack podcast, where we have conversations about software development, AI, and all kinds of new technology.
00:00:32I am one of your hosts, Andris, and I'm joined today by Richard. Hello. And
00:00:38Amin Astani. Hi, Amin. How's it going? Happy to have you here.
00:00:44Yeah, uh, thank you for having me. This is a pretty great pleasure. So from what I know about you,
00:00:51I kind of think that you are
00:00:53kind of like an SRE wizard. You're very deep into that realm. You host your own podcast on that topic.
00:01:01So we're just curious: how did you start in that area? And what's been your journey through your
00:01:08software development career so far?
00:01:11Yeah, good question. Again, Andris, Richard, thank you for having me. Yeah, how did I get started?
00:01:17Well, when I was in high school, my girlfriend's mom introduced me to Red Hat Linux.
00:01:23She was taking a computer science course in community college and she
00:01:29just showed me her desktop, which wasn't Windows. It wasn't Mac. I was very confused. What it was was a KDE desktop running on
00:01:37Red Hat Linux, and I was enthralled, and I started to get into it all the way back in high school, junior year
00:01:44of high school, and then got my, um,
00:01:47CS degree, and I've been very interested in infrastructure, in operating systems, and in teaching a computer to do things
00:01:55that previously we didn't have control over. So that was the beginning.
00:02:00So you've always been, like, a proponent of open source software, I imagine, if you're using Linux?
00:02:06Oh, 100%.
00:02:08Linux has been my daily driver
00:02:10for over two decades now, like since
00:02:122005.
00:02:15Okay. Yeah, and when was the point where you got very serious about SRE?
00:02:20Yeah, that was around 10 years ago. So prior to that I was a classic operations person at a rapidly growing
00:02:29PaaS/SaaS company called Acquia. If you've ever heard of Drupal: the founder of the Drupal open source project,
00:02:34he founded a company. They did the professional services and the hosting and the support for Drupal websites.
00:02:38So I was on their operations team, but as the company grew, and it was rocket-ship growth,
00:02:45lots of big clients whose names you'd recognize, with that scale came a lot of manual toil. There we go.
00:02:51Now we're talking about SRE. There was a lot of manual toil. There were a lot of incidents.
00:02:55We were starting to come up against the reality of operating at scale versus the architecture that we originally
00:03:01started with and the processes that we started with. So I started reading The Phoenix Project. I got, um, into
00:03:09The Practice of Cloud System Administration, which was the first book on SRE.
00:03:12It wasn't the Google SRE book. In it, SRE was like a few pages, and it had 12 practices in it.
00:03:17And again, I was very, very intrigued and started to practice it.
00:03:21And I don't actually know what The Phoenix Project is. Oh my goodness.
00:03:26Yeah, yeah. Yeah, so The Phoenix Project was like the original DevOps book.
00:03:31Um, and it's a really good book.
00:03:34Yeah, so it's a story, and it's about
00:03:39like the VP of IT operations dealing with some significant organizational problems, and
00:03:45he, through the guidance of this mysterious mentor, who you later find out is like a board member of the company,
00:03:53he's teaching him about
00:03:55like
00:03:56old-school plant manufacturing concepts and how they directly apply to, um, software engineering and IT ops. And
00:04:05it introduced a lot of concepts that I still use today and talk to clients about today, because it's extremely relevant.
00:04:11So yeah, that was the original book, but then
00:04:13beyond that, I read a whole bunch of stuff, a lot of it in the world of management.
00:04:18So in addition to all the technical stuff that I'd be reading just to skill up as an SRE and as a software engineer,
00:04:24there's also the management stuff, because DevOps and SRE is a sociotechnical practice.
00:04:29You're bridging the gap between what technology can do and how humans use it to make good business outcomes.
00:04:36And I imagine there's a lot of human error involved in that, and in dealing with it, right?
00:04:42I mean, yeah. Yes, absolutely. I mean, shoot, there's a book about human error that you should probably read too.
00:04:50Yeah, uh, there's a gentleman, his name unfortunately escapes me right now, but he is a
00:04:57professor in Sydney, Australia, and all he talks about is, like, airline accidents and human error, right?
00:05:05So it's hugely useful in postmortem culture: human error isn't where we end.
00:05:10Like, if you say, oh yeah, the reason why we had this incident, this production incident, was human error,
00:05:16no, that's the beginning of the discussion. That's not the end of it. Like, what factors contributed to,
00:05:21you know, that person in that place fat-fingering that command and blowing out the production database?
00:05:27Like, should he have had access to the production database? Were the tools ergonomic? Did he sleep last night?
00:05:34How much did he get paged? You know, there's all kinds of things.
00:05:38But yeah, when people say human error, I immediately think of that book. I'll have to give you a link for the show notes.
00:05:45Yeah, I think I remember my first
00:05:49introduction to end-to-end testing. I was listening to this course where someone talked
00:05:54about this incident where there was some kind of software developed for
00:05:58surgeons, and you had to, like, click the buttons in a certain order.
00:06:03But the surgeons got so used to clicking them that they clicked them
00:06:07so fast that the software started glitching, and nobody had expected that that might happen.
00:06:13So that's kind of interesting.
00:06:15Yeah, definitely. The intersection between humans and automation is something that we always have to pay attention to in the products that we operate.
00:06:25Yeah
00:06:26And then I also heard that you started
00:06:28with DevOps and SRE during the beeper era, where you had to use pagers. Is that correct?
00:06:35Yeah, actually, yes. The first time I was on call I was issued a physical
00:06:42pager. And now, granted... Oh, wow. What year was that?
00:06:46It was 2005.
00:06:48Oh, wow. Okay. So my first tech job. Oh, this is great.
00:06:55Uh, I was working in a garage, as every successful startup does, in Plant City, Florida.
00:07:02A good friend of mine,
00:07:04her father had a side hustle where he was running web hosting and email hosting and a dial-up ISP out of his garage.
00:07:12His name is Curtis Fellaini. I have mad respect for him, and he, um,
00:07:18he taught me everything I knew. But yeah, we were running this on-call tool called WhatsUp Gold.
00:07:24And all it did... name, I know. And what it did, it didn't do HTTP checks.
00:07:33It was doing ICMP pings to the servers. So just old-school ping, and if it didn't get a result back,
00:07:40it would send
00:07:42a page. And I had a pager. I carried a pager for a little while when I was, uh, working at that place.
00:07:48Immediately after, when I was, uh, working at my alma mater doing supercomputing, HPC, it was Nagios
00:07:54and, you know, text messages to your cell phone.
00:07:57Was that, like, a deliberate choice, or by that point had newer technologies already emerged, like you said, SMS messages and stuff like that?
00:08:07I mean, yeah, back in those days there were just more tools,
00:08:11granted limited, but there were more tools available to monitor
00:08:15infrastructure at the time. And when I was at, um, the HPC department at the University of South Florida,
00:08:22we were already dealing with hundreds-of-servers scale, because it's HPC.
00:08:27You have racks of servers that are doing compute jobs. Um, so we needed
00:08:31something a little bit more than
00:08:34a pager. And the content in the page also mattered, because if you just got,
00:08:40you know, a message from the beeper, it's like, oh, okay, I need to go check to see what the thing is.
00:08:45But, um, with Nagios you got at least some context, like this server
00:08:50is down, or this service attached to this server is down. So even then,
00:08:56back in those days, you had some sophistication in terms of being able to configure,
00:09:01you know,
00:09:03what you wanted to monitor on a per-host or per-service basis.
00:09:06Like, definitely pre-SLO. We're not even thinking about that.
00:09:13We're just thinking about: is this port open and giving me a response that I expect?
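The kind of check described here ("is this port open?") can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nagios or WhatsUp Gold implement their checks; the function name `check_tcp_port`, the timeout, and the example host and port are assumptions made for the sketch.

```python
import socket


def check_tcp_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection completes the TCP handshake, so success means
        # something is listening and accepting connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical target; replace with a host and port you operate.
    status = "UP" if check_tcp_port("127.0.0.1", 22) else "DOWN"
    print(f"127.0.0.1:22 is {status}")
```

A check like this only says whether the port accepts connections. Verifying "a response that I expect" would additionally mean reading from the socket and comparing the reply with a known-good value.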
00:09:17Yeah, wow, 2005. I think I was in secondary school at that time. I wasn't even thinking about, like,
00:09:26computer
00:09:28science back then.
00:09:30Yeah, I'm older than I look. Yeah, I realize that. Well, you look very good. Thank you very much.
00:09:37So yeah, doing a bit of research, it seems like you worked at Meta. What did you do?
00:09:43And what was it like working at Meta?
00:09:45Oh my goodness. Uh, so I was a production engineer
00:09:48and a production engineering manager, so
00:09:52what that basically means is it's their flavor of SRE. And I worked on two projects. The first project, which was just a few months,
00:10:00I worked on a team called Conveyor. There are actually public papers, I think for IEEE, about
00:10:06Conveyor. Conveyor is probably one of the largest CD systems on the planet.
00:10:11So all of the build artifacts that come out for the backend services at Meta
00:10:18go through Conveyor, and there's progressive rollout of all the changes to, like, thousands upon thousands upon thousands of backend services.
00:10:25Are they strictly internal to Meta?
00:10:29Internal only, yeah.
00:10:31But yeah, there's a paper by Boris Grubic that you should go read about
00:10:37what they built, and it's very, very interesting.
00:10:39So I was there for a few months when they were scaling it
00:10:43and helping them with their alerting, because they were early game.
00:10:46It's like anything: you get hundreds of alerts a week, and I'm like, wait,
00:10:49let's clean this up and,
00:10:51you know, get their observability into a place where at least they knew when failures were happening. And then I also
00:10:58put together some,
00:11:00what would you say, some emergency procedures to shut off key features using their feature flagging tool,
00:11:06um, just to make sure that we were able to recover quickly from incidents.
00:11:10So I did that for a few months, but most of my tenure was spent
00:11:13on another team that
00:11:16was an internal Heroku of sorts. So you have very simple stateless services,
00:11:23um, and, you know, teams are turning these out all the time and you need to run them somewhere, and that
00:11:29team and service existed to make that process simple, because in the old days standing up a service was very difficult. Um, so I was the first
00:11:37production engineer
00:11:40on that team, and it was pretty wild stuff. But that's what I did there.
00:11:44It was definitely a lot of fun. There are people that I worked with that
00:11:48just had brains five times
00:11:51mine in size, you know. Like, the people you work with at companies like that are just
00:11:55super brilliant, and I learned a lot there. I was gonna say, I think there's a company called Honeycomb, and
00:12:02the person who started it, Charity Majors, was at Meta.
00:12:06Do you know her? Did you meet her?
00:12:09Um, I have. That's very interesting
00:12:12you asked that. I met her at an Observability Days event a year and a half ago in Boston. I met her in person.
00:12:20Yes, she did work at Meta, um, and
00:12:22the inspiration for Honeycomb came from a
00:12:26planetary-scale observability tool called Scuba.
00:12:30Right, so I used Scuba and really loved it, and I'm delighted that she, um,
00:12:37founded a company to do that. She and I have been talking on LinkedIn on and off a little bit lately
00:12:44because of our common interest in
00:12:47how agentic coding is going to influence reliability. So
00:12:51we've been talking on and off about that subject.
00:12:54But yeah, she's a really cool person, and it's a privilege to chat with her from time to time.
00:12:59Yeah, yeah, so actually that's a nice segue into what we also wanted to ask you. Where do you think AI
00:13:06will take SRE and DevOps and all these practices?
00:13:13Oh goodness, okay. So amazing plug. Thank you very much. I'll get right into the thesis here. So
00:13:19we have software engineers that have been given
00:13:24an amazing tool, which is agentic development. Everybody's, like, getting code magically generated from, uh, from tools like Claude and whatnot.
00:13:35So what this means is that there is going to be an order of magnitude more code
00:13:39that's going to be flowing through our
00:13:42value stream: our CI/CD pipelines, our testing, our review, our code deployment, our incident response, our change management procedures. Like,
00:13:50all of this code is going to go through that system that each of us has built for our companies.
00:13:56And that means we're going to stress test
00:13:58all of those capabilities. So for example,
00:14:01there is a linear
00:14:04relationship
00:14:06between the number of changes you make to a piece of software and the number of pages or the number of incidents that you receive.
00:14:12That is established fact.
00:14:14So
00:14:15it's reasonable to assume that if you increase the flow of code through your pipeline by 10x,
00:14:20you're going to have 10x alerts.
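The arithmetic behind that claim can be made concrete with a small sketch. It takes the linear model stated above at face value; the change volumes and the per-change incident rate are made-up numbers for illustration, not measurements from any real team.

```python
def expected_incidents(changes_per_week: float, incidents_per_change: float) -> float:
    """Expected incidents per week if incidents scale linearly with changes."""
    return changes_per_week * incidents_per_change


# Hypothetical figures: 50 changes a week, 2% of changes cause an incident.
baseline = expected_incidents(changes_per_week=50, incidents_per_change=0.02)

# Same per-change rate, but ten times the flow of changes.
after_10x = expected_incidents(changes_per_week=500, incidents_per_change=0.02)

print(f"baseline: {baseline:.1f} incidents/week")
print(f"after 10x: {after_10x:.1f} incidents/week")
print(f"ratio: {after_10x / baseline:.1f}x")
```

Under this model the only ways to keep incident volume flat while shipping ten times as many changes are to lower the per-change incident rate (testing, review, safer rollouts) or to lower the cost of each incident.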
00:14:23So how
00:14:26would your organization respond to that, right, is really the question. So from the SRE side, I anticipate
00:14:32that in the future we operations people, we're going to be very, um,
00:14:39very popular, because we're going to be where the constraint is now.
00:14:42It's no longer a matter of writing code, um, and shipping it. It's now: how do we operate it?
00:14:49Is it working? Is it meeting customer expectations?
00:14:51Um,
00:14:53and
00:14:55and so I anticipate that being the major shift.
00:14:58It's a huge challenge, because a lot can go wrong during this transition.
00:15:04Like I mentioned, you can have a lot more incidents. How are you learning from failure? And,
00:15:09you know, are you continuing to do the postmortems? Are you building a CI/CD pipeline that actually works, you know, and
00:15:15basically makes it so that human effort is always high leverage?
00:15:20So the funny bit is that we've been preparing for this event for decades. It's just that, you know,
00:15:27we tend to think about it in terms of big tech, where you have 20,000 engineers,
00:15:31but
00:15:33now we have smaller organizations that could do 10 times the code, and therefore the flow is the same.
00:15:38So all of those fundamentals
00:15:40that those larger companies, you know, have been writing about and talking about for the past decade now,
00:15:47we're all going to need to actually do.
00:15:49Like, all the fundamentals, you actually have to do them. And so there's that.
00:15:53There's also the matter of
00:15:56organizations, like software development organizations, having organized themselves around the idea that, oh, the thing that takes the most amount of time,
00:16:03like the biggest time suck, is coding,
00:16:05so we need to make sure the engineers are spending as much time coding as possible. If that's no longer true,
00:16:09we have to reorganize ourselves around those activities that are now taking the most amount of time.
00:16:15So there is going to be a bit of a shift in terms of how we run these organizations. And then finally,
00:16:20and I think you've seen this in terms of the AI SRE trend,
00:16:23which I've talked about often in my podcast, we are going to need to skillfully introduce
00:16:31that technology into appropriate places where we can get a lot of leverage out of it to do those operations, right?
00:16:37Because, for example, there are companies out there, I think a good example is incident.io, where you get paged,
00:16:44and before you even get out of bed,
00:16:47you know, there's already an agentic workflow that's like: okay, what changes were made to the code base recently?
00:16:53What do the alerts actually say? What do the metrics actually say? And it tries to
00:16:59do a first-pass diagnosis, again, before you even got out of bed and got the sleepy out of your eyes.
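One small piece of such a first pass, finding "what changes were made recently", can be sketched as follows. This is a hypothetical illustration and does not reflect how incident.io or any other product works; the `Deploy` record, the `recent_changes` function, and the two-hour window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Deploy:
    """Hypothetical deploy record; real systems would carry more fields."""
    service: str
    commit: str
    deployed_at: datetime


def recent_changes(deploys: List[Deploy], alert_time: datetime,
                   window: timedelta = timedelta(hours=2)) -> List[Deploy]:
    """Return deploys that landed within `window` before the alert, newest first."""
    candidates = [
        d for d in deploys
        if alert_time - window <= d.deployed_at <= alert_time
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


if __name__ == "__main__":
    alert = datetime(2024, 1, 1, 3, 0)  # made-up alert time
    history = [
        Deploy("api", "a1", datetime(2024, 1, 1, 2, 30)),
        Deploy("api", "b2", datetime(2023, 12, 31, 20, 0)),
        Deploy("web", "c3", datetime(2024, 1, 1, 1, 15)),
    ]
    for d in recent_changes(history, alert):
        print(d.service, d.commit, d.deployed_at.isoformat())
```

The output is a list of suspects for a human to look at first, not a diagnosis. Deciding what to do about them stays with the on-call engineer.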
00:17:04So there are some places in the operational set of responsibilities that we have that can
00:17:09get a lot of benefit from AI, but you don't just, like, adopt everything.
00:17:12Like, you need to think about what the actual problems are, what the ROI is.
00:17:16But anyway, that's my thesis. I'm actually doing a webinar in a couple weeks about this very subject.
00:17:23That's so interesting, because, um, the use case that you mentioned before,
00:17:29getting out of bed and it's already paging you about something,
00:17:32has that already been implemented in practice? Because
00:17:36one thing that I see can go wrong is that, like, let's say you have these triage levels, where at the first triage level,
00:17:43maybe the AI can solve it on its own. But then at the second we need a human in the loop. And sometimes, what I,
00:17:51just to make a comparison with coding, sometimes I see Claude Code, when it's reasoning, it's like,
00:17:56I should do this, and then it's like, hold on a second.
00:17:59No,
00:17:59I should probably delete all of this and start over and do this.
00:18:02And I feel like the same thing could happen with these incident management bots. They're like, oh, this is an easy fix.
00:18:10Hold on. Maybe I can fix it on my own. You know, a situation like that.
00:18:13No, you're 100% correct.
00:18:15I think this quote from IBM from, like, the 1970s has been making its rounds lately: never have a computer make a
00:18:22management decision. So what I mean by that is there are two ways that you can use
00:18:27this
00:18:29AI tooling, or automation in general. There are two philosophies around automation
00:18:32that people have written about. The first is called the leftover principle, where you basically say: oh, okay, we're gonna have software,
00:18:38it doesn't have to be AI, it can just be tools, that does all of these things for us
00:18:43autonomously on our behalf. We don't have to think about it.
00:18:46And then we, the humans, are given what's left over, which is typically the stuff you need a PhD to solve.
00:18:51Um,
00:18:53A, that's
00:18:54not super wieldy, because then you're spending all of your budget on PhDs.
00:18:57And then you're also placing too much trust in a system that's running autonomously, and if it makes the wrong call, you're in trouble.
00:19:04The more
00:19:07wieldy philosophy is what's called the compensatory principle, where you are force-multiplying human effort.
00:19:15You're allowing computers and automation to do the things that they do well, and then you're allowing the humans to do the things
00:19:22that humans do well,
00:19:25which is:
00:19:26we understand the context of our systems. We understand the problem that we're trying to solve. We understand, you know, the customer experience.
00:19:33There's a whole lot of context in our brain that makes it a lot
00:19:38more advantageous for us to use that rather than have AI do all of it. So to answer your question more directly,
00:19:44I think this AI SRE tooling really shouldn't be
00:19:50making changes to production unless it is on a well-constrained, well-defined path.
00:19:58I'll give you an example: automatic reverts of code. We do that automatically without AI. If you're thinking about Kubernetes,
00:20:06right, if you have a deployment and you're changing, you know, the version of your container image, and your
00:20:13readiness checks don't succeed, guess what?
00:20:16It's not going to roll out the change.
00:20:19Right. So those mechanisms are very simple, easy to reason about, and they allow us to perform changes safely.
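The gating idea described here can be modeled in a few lines of Python. This is a deliberately simplified toy, not Kubernetes' actual Deployment controller logic (which also involves surge and unavailability limits, probes over time, and progress deadlines); `rolling_update` and the `is_ready` callback are hypothetical names introduced for the sketch.

```python
from typing import Callable, List


def rolling_update(replicas: List[str], new_version: str,
                   is_ready: Callable[[str], bool]) -> List[str]:
    """Replace replicas one at a time, halting if the new version is not ready.

    `replicas` lists the version each replica currently runs. `is_ready`
    stands in for a readiness check against a freshly started replica.
    """
    current = list(replicas)
    for i in range(len(current)):
        if not is_ready(new_version):
            # Readiness failed: stop here and leave the remaining
            # replicas on the old version.
            break
        current[i] = new_version
    return current


if __name__ == "__main__":
    fleet = ["v1", "v1", "v1"]
    # A healthy release replaces every replica.
    print(rolling_update(fleet, "v2", is_ready=lambda version: True))
    # A release that never becomes ready changes nothing.
    print(rolling_update(fleet, "v2-bad", is_ready=lambda version: False))
```

The rule is narrow and mechanical: no readiness, no rollout. That narrowness is what makes it safe to leave to automation.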
00:20:26You know, those types of well-defined problems I think you can leave automation to do, and automation is doing it already.
00:20:32But having
00:20:35AI right now just say, yeah, we should make this change to our infrastructure, without human input,
00:20:41I think is irresponsible. I think we as humans, back to the point about constraints, I think
00:20:46our time is going to be shifted to:
00:20:49is this really the right decision to make?
00:20:52You know, making those, uh, approvals, code review, and review of any infrastructure change,
00:20:59I think is going to be where
00:21:00we're going to be spending a lot of our effort.
00:21:03And yes, we're going to want to automate that where we can, but it requires discipline and a track record of safety.
00:21:09Yeah, and then another question from what you said earlier, what you said about this linear relationship, that the more code you have,
00:21:19the more incidents you're going to get. So you've been working as a consultant
00:21:22in this field. Have you heard that companies are experiencing it right now? Like,
00:21:28since they're
00:21:30outputting
00:21:31or offloading more code to AI, they're experiencing more incidents as a result?
00:21:36Certainly, definitely so.
00:21:39Definitely so, and I would also add to that by saying there are other failure modes that are starting to emerge due
00:21:47to
00:21:49this change. I think a really easy one to think about and reason about
00:21:53is
00:21:54your CD pipelines backing up. The latency between having a build artifact built and then having it go out to prod is taking longer and longer
00:22:00because there are just more changes going through it. So yes, organizations are starting to
00:22:05feel these pain points already. I mean, just as a quick example, a case study: I have a
00:22:12client that I recently did an assessment for, because one of the things I do is take a look at
00:22:17the entire company's, like, operational posture, how they ship code, how they run it, and give them guidance on what to do next.
00:22:24So one of the teams was like, yeah, this Claude thing's real good. Let's start churning out some features.
00:22:31And they did, but
00:22:33they didn't change how they release. They didn't have a good, solid code review process.
00:22:38And so the number of incidents immediately shot up. So instead of them
00:22:43focusing on the code and just, like, shipping all the things that they thought they were going to do, they were getting buried by customer escalations
00:22:49and incidents.
00:22:51So, like, yeah, this isn't theoretical. It's happening right now.
00:22:55Um, this order-of-magnitude increase in changes to production is going to
00:23:01reveal
00:23:03all of the weak links in your operational posture.
00:23:06And when you are like an
00:23:09SRE doctor that has this client who has this problem, what is your suggestion for what to do in this case,
00:23:16like you just mentioned?
00:23:18Right. I alluded to it earlier: all of those fundamentals that we talked about
00:23:24over the past decade or two, we have to focus on them.
00:23:27Like, teams
00:23:29need to, for example, teams need to have a valid testing strategy.
00:23:35They need to know that the code that they're getting ready to ship into production is of high quality.
00:23:39All the things that they've learned about in the past or are thinking about. I'm a big proponent of service level objectives.
00:23:43I'm a big proponent of SLOs,
00:23:47because that allows us to understand the health of our production system from the customer's point of view.
00:23:52So that
00:23:54is a big piece of the puzzle, because then it allows you to, in a data-driven way, regulate the amount of changes
00:23:59you're making in prod in terms of, like, feature requests and things like that. So that way
00:24:03you are not risking your existing source of revenue
00:24:07while chasing new sources of it.
00:24:10That's the business reason that you want SLOs. One thing that I've noticed with SLOs: at the previous company I worked for,
00:24:16they were like,
00:24:19each team should, you know, maintain their SLO goals. And for some teams, since we were moving so fast,
00:24:25we were not meeting those goals.
00:24:28And then a lot of teams were behind,
00:24:30and the management decision was: whatever, it's more important to ship features than to maintain SLO goals.
00:24:36Yep.
00:24:37And that is probably the biggest reason why SLO hasn't been as effective as Google promised a decade ago.
00:24:44Because here's the thing, and I've talked about this in previous content: you can have an SLO program.
00:24:49You can go through the process of having engineers go in a corner and be like, oh, what's the user journey?
00:24:55And let's quantify it, and let's get some observability out there
00:24:57and set up Prometheus or OpenTelemetry and get really pretty dashboards. But
00:25:01if the product manager,
00:25:05if the product organization isn't willing to play ball,
00:25:07if the executive team isn't willing to play ball,
00:25:09then
00:25:12it's a waste.
00:25:14Then you just have SREs and maybe the engineers working in a corner saying, hey, things are not healthy,
00:25:19and, you know,
00:25:21you have the leadership wanting to continue to ship. Like, in order for SLO to be successful,
00:25:26it needs to be a game that everyone plays. And when I assess
00:25:32an organization's SLO posture, which is a big thing that I do, I tend to ask that question: if an error budget gets spent,
00:25:38what does the product manager do?
00:25:41If the product manager is like, okay, we're changing the scope of next sprint,
00:25:46good.
00:25:48You're at medium maturity. If they don't,
00:25:50you're at low maturity. I actually developed an SLO maturity model from one to five, one being immature, five being optimizing, to help
00:25:58companies reason about where they're at in the journey. But you're right, a lot of organizations, they don't,
00:26:03they don't do SLO. They have a dashboard. They have metrics on it. Engineers get paged for it.
00:26:08But
00:26:10business decisions are not being made using SLO.
00:26:12It's just alerting.
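To make "an error budget gets spent" concrete, here is a minimal sketch of the usual request-based calculation. The 99.9% target, the request counts, and the rule that a spent budget changes sprint scope are illustrative assumptions, not figures or policy from the conversation; `error_budget_remaining` and `should_rescope_sprint` are hypothetical names.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target and traffic leave no error budget")
    return 1.0 - failed_requests / allowed_failures


def should_rescope_sprint(remaining: float) -> bool:
    """Assumed policy: once the budget is spent, reliability work displaces features."""
    return remaining <= 0.0


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% SLO allows
    # roughly 1,000 failures.
    for failed in (250, 1500):
        remaining = error_budget_remaining(0.999, 1_000_000, failed)
        print(f"failed={failed} remaining={remaining:.2f} "
              f"rescope={should_rescope_sprint(remaining)}")
```

The calculation is the easy part. The point made above is that `should_rescope_sprint` has to be a decision the product organization actually honors, otherwise the number is just another alert.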
00:26:16And so the type of clients that you have that come to you, what
00:26:21situation are they in that they
00:26:24require your assistance? Are they in a good situation, a bad situation, in between?
00:26:28It depends. So there are two inflection points that I tend to work with clients during. The first inflection point is, like,
00:26:37the early-stage company that just got a round of funding. Their sales organization is crushing it. They've got product-market fit.
00:26:44But they're a babe in the woods, and they've never run enterprise-level infrastructure before.
00:26:49They've never sold to enterprise clients before, and they're starting to buckle
00:26:53under the strain of that increased scale, very much the experience that I had earlier in my career.
00:26:59Like, that's something I'm very, very knowledgeable about, because I lived it. That's the first inflection point.
00:27:04And that's a fun inflection point. That's a good problem to have, though it can be stressful for the people working there.
00:27:09The second inflection point
00:27:12is usually when you're a larger organization and now you need to start standardizing
00:27:17how you operate across multiple teams.
00:27:21Maybe you have an SRE program and you need to assess how effective the SRE program is. Are the processes working?
00:27:28Is the engagement model working? Are the engineering teams actually participating in the process? Is product participating in the process?
00:27:34So companies that, you know, do a lot of acquisitions,
00:27:37they will typically come to me because they're trying to tame this wild animal, because every time they do an acquisition
00:27:44there's an entirely new stack they need to bring in. So there's a lot of, like, top-level governance that you need to start thinking about
00:27:50in terms of making sure everyone's playing by the same rules.
00:27:55Sure
00:27:57And I think with the smaller companies like you mentioned, the ones who have raised some money,
00:28:02um, with AI being around, people using AI to build features,
00:28:06the kind of lack of structure will be increased, because people are just building features, shipping them with AI,
00:28:13not thinking about logs, metrics, traces, anything like that.
00:28:16Um, what are the biggest things you've noticed when people in the smaller kind of phase of their company come to you, problems that
00:28:23are glaringly obvious to you, but they didn't see as a problem?
00:28:27Yeah, I mean, I think I've alluded to it, uh, in a previous answer, but I'll frame it top down.
00:28:34So
00:28:36the glaring thing that I see is a broken feedback loop between what is happening in production
00:28:41and what is happening in the customer experience, and then what is happening in terms of product prioritization.
00:28:46And I've seen that for a long time across many, many teams. It is probably the most common problem.
00:28:55So as I work as a consultant,
00:28:58one of the things I'm thinking about behind the scenes is: how do I make sure that feedback loop is established?
00:29:03So it is amazing
00:29:05how many companies will experience an incident,
00:29:08customers will get upset,
00:29:10they'll file tickets into support, support gets flooded
00:29:13with tickets. Support has this amazing gift to give to the product engineering organization, which is called feedback. They know
00:29:23what is angering the customer and what is preventing them from enjoying the service and what is motivating them potentially to leave.
00:29:30And I've seen time and time again
00:29:33that feedback being completely ignored or being placed lower in priority than the feature work.
00:29:41And so the end result is
00:29:44full steam ahead on features even though the product is on fire. And that, to me,
00:29:51is something that I see immediately when I'm in there interviewing teams and assessing teams.
00:29:55But usually when you work inside the company, you don't see it, because it's like: oh, I'm a software engineer.
00:29:59I'm incentivized on shipping features. I shipped 50 widgets last quarter. Great, I'm getting that bonus.
00:30:05Product managers are thinking the same thing:
00:30:07yeah, we got these projects out. This customer is signing with us because we did it. Meanwhile,
00:30:11existing customers are completely apoplectic. They're not happy with what's going on.
00:30:16So: broken feedback loop.
00:30:19and is that because of like the culture in sf or just because
00:30:22People just want to look like they're being better than their competitors. I don't know. Why do you think that's the case?
00:30:28I don't think it's native to SF. I think it's just native to business in general, incentive structures, like
00:30:36business cultures because
00:30:39when
00:30:41you
00:30:42Have different departments
00:30:44Um as a company grows and that naturally happens because you have to organize effort
00:30:49You start to silo
00:30:51I have software engineers over here. I got product managers over here. They might be embedded
00:30:55With the engineering teams, but they got their own set of incentives. There's the support and the DevOps/SRE folks
00:31:01They got their own incentives, but they're all separate and they're conflicting
00:31:06Right and
00:31:09typically
00:31:11at this stage of the game, no one sat down with them and said
00:31:13How do we how do we change our incentive structure so that?
00:31:18We're all moving in the same direction. One of the things that Meta did
00:31:21That I I really liked and I think it works really well
00:31:25Is that when they are evaluating the performance of an engineer at least in teams that I was working with?
00:31:31They weren't just looking at features
00:31:34They were looking at operational excellence, production excellence, and they were looking at: okay, what incidents did they participate in?
00:31:41What reliability work did they work on? What scalability work did they work on?
00:31:45How did they make the system more reliable, more operable? And that applied to: are they going to get the bonus?
00:31:51Are they going to get promoted? It was it was an important part
00:31:54of the criteria
00:31:57so
00:31:58When they really started to lean into that that meant that the incentives for the software engineers changed and on the team that I was on
00:32:05where we were really
00:32:07Pushing that forward with the with the engineering leads that I was working with
00:32:12Those engineers magically started coming to me and saying, hey, Amin, I'd like to participate on some reliability work with you
00:32:18Can I take over this load testing process?
00:32:20Sure. Absolutely. Here's the runbooks. Here's some scripts, like, you know
00:32:25Go nuts and it's amazing what happens when incentives?
00:32:29shift so
00:32:31That is I think it's just a natural progression for any organization if you're like a tiny startup where everybody owns everything
00:32:38And you're trying to get funding and you're trying to you know, win those initial customers and keep them happy
00:32:43I think I think incentives are already aligned
00:32:45Otherwise the company wouldn't survive but as you start getting bigger and you have that separation of concerns
00:32:51It gets harder and harder to have an incentive structure that is aligned across the org
00:32:55So yeah, this is this is where the socio-technical stuff starts coming out when we're talking about SRE and DevOps
00:33:00Nice and and you also mentioned that with this full steam ahead with AI
00:33:06People will need more people like you working in this field and you know at the other side
00:33:12We're hearing that software engineering is dying. There's no more jobs, right?
00:33:17So do you think that like for junior developers listening to this podcast?
00:33:23Is SRE and DevOps a good field to get into at this specific moment? Yes, and yes, so
00:33:30I I will I will push back and I'm I will not be the person that says that software engineering is dead
00:33:36I actually I actually think that the fundamentals
00:33:38Uh is is even more
00:33:41important
00:33:43Understanding programming languages and how they work algorithms and how they work is even more important
00:33:50because
00:33:53How are we supposed to review the code that that an AI generates?
00:33:56And how are we supposed to design systems that work? Because if you just give, you know, an LLM: hey, architect this thing
00:34:04Are you sure it's going to work? Like, I'll put it this way: open up a Claude Code project
00:34:08And ask it to design you a bridge
00:34:11Or a skyscraper
00:34:15And then I want you to take that design and I want you to go and build it
00:34:18And then I want you to go and sit in it. Are you going to be comfortable doing that?
00:34:22Are you gonna be comfortable driving over that bridge that it designed?
00:34:25Right. Until you stop shaking your head, we need those engineers in place
00:34:31Right, and as for sre and devops people. Yeah, obviously
00:34:34There need to be more of us, because that's where the constraint is going. If we're making so many changes to the system
00:34:39We definitely have to make sure that it's healthy, that we understand the implications of the changes on production
00:34:43That we have a CI/CD pipeline, a value stream if I extrapolate that to the top level, that is
00:34:51Healthy, monitored, and treated as a production service, because in my view it is. All of those skills are going to become even more important
00:34:58The way that I tend to look at it, it's kind of like, you know, F1 racing. You have a pit crew
00:35:04And that pit crew knows everything about that vehicle, can swap out every part, and gives that driver the ability to dominate that race track
00:35:12You you need those people
00:35:15Um because that gives that gives the product managers and the architects, you know, the the environment to
00:35:22Iterate quickly and to be able to get those changes out without those constraints happening
00:35:27so in my view
00:35:29like
00:35:30It's a great time to be an sre. I feel like a couple years ago was different because everybody was getting laid off
00:35:35but I I think that
00:35:37If you really really focus on the fundamentals if you understand
00:35:39You know, Linux and systems, as well as the socio-technical stuff, basically what it means to actually be an SRE
00:35:46Um, I I think you can go
00:35:49very far
00:35:51In this career, but we do have to be informed by this advent of agentic development
00:35:57We can't keep our heads in the sand about it
00:35:59So you mentioned
00:36:02Was it The Phoenix Project? Yes, the book
00:36:05Yeah, so what other resources do you think is valuable for someone who wants to like dive very deep into this subject?
00:36:12Oh, man, I mean, there's tons. Let me get you the heavy-hitter books
00:36:18I mean, The DevOps Handbook is good. I felt that when that book came out
00:36:21It was a great summary of everything that I learned prior to the book coming out. It's a great
00:36:27one-stop shop. Leading Change by John Kotter
00:36:30is about transformational change leadership in organizations
00:36:34It gives you a framework, and when you are an SRE on a team or if you're trying to drive a reliability transformation
00:36:41It's a good thing to know. The book Toyota Kata
00:36:45Which is about how Toyota, you know, does continuous improvement in the manufacturing of vehicles
00:36:52Very useful very useful one of the first ones who came up with the system. Yes. I think i've heard of that. Yeah
00:36:59Kaizen
00:37:02Was the Toyota concept
00:37:04The Japanese economic miracle is what created Agile
00:37:07And DevOps, like, it was the precursor to that. Other books: The Practice of Cloud System Administration
00:37:13I mentioned. I think they might have, like, a new edition. Very, very useful
00:37:17Yeah, like, the SRE books that Google produces are good, but I will put a caveat
00:37:22Those are books that were written by people that work for companies that are absolutely gigantic
00:37:27And have infinite budget
00:37:30so the practices and the ways that they go about things are going to be different than the way you go about things with your
00:37:35four-person startup, but I think
00:37:37Putting all of those pieces together will give you an understanding from an engineer
00:37:44from
00:37:44a leader
00:37:46And from a strategist on how to drive DevOps and SRE initiatives
00:37:49So I think I have a unique perspective because I think about it holistically. I'm not just going deep on
00:37:55"All I do is Kubernetes and I make Kubernetes work." I'm trying to make teams work
00:37:59And then after that we figure out the technical problems and solve those
00:38:03And for all the listeners, we will link the books mentioned in the show notes
00:38:09Um, I was going to ask so going back to ai you I agree with you on people should learn the fundamentals and that
00:38:16Software engineering or development isn't going anywhere, but there is a kind of trend
00:38:21that I'm seeing more and more, of people pushing code to production without reviewing it
00:38:26and i'm sure that these are like people on twitter or x or just like
00:38:30People showing off it's not really true
00:38:32But there are kind of thought leaders, like people who run the AI companies, like Dario, uh, Musk, saying, hey
00:38:39Like, next year, 2027, whatever
00:38:41You can write code and you can ship it without looking at it, or you can write code
00:38:47Like, compiled code, without writing the programming language to compile that code, and so languages will be dead. What are your thoughts on that?
00:38:54I seriously doubt that these thought leaders are on call
00:38:58Because if they were they would be saying something completely different
00:39:02No point
00:39:04They're not in the infra. They're not in the information security group
00:39:08crying about
00:39:09vulnerabilities being introduced and they're not dealing with what happens when customers are upset like
00:39:13Sure. LLMs can produce
00:39:16Syntactically correct somewhat logical
00:39:21code
00:39:23Are they right like 20 30 something?
00:39:25Maybe
00:39:29But even then
00:39:31You're still going to need proper monitoring observability
00:39:37You know, test in production? Okay, if you want to test in production, as Charity espouses
00:39:42There is some
00:39:45prerequisite investment that you have to make, and in her view
00:39:48it's going to be observability understanding the behavior of your software, but even even still if I
00:39:54you know
00:39:56Were a a government and I needed to buy software
00:39:59It has to meet my security controls
00:40:03If I was a medical company and you're writing software that is literally responsible for keeping patients alive
00:40:09Do you really think you want an llm to generate that without review? That would be unconscionable
00:40:15It goes back to the example about the bridge or the skyscraper. Do you really want an llm to design a bridge or a skyscraper?
00:40:21And just ship it pour the mortar in the rebar. Let's go. No
00:40:27No, absolutely not
00:40:31Yeah, that bridge and skyscraper analogy is a very good way of looking at it and never never thought of that
00:40:37You know if you were like an engineer
00:40:39From zero to hero, like would you want to walk on your own bridge that you just built?
00:40:43Yes, and here's the thing that we have to think about. So I was mentioning Curtis,
00:40:49I love that guy, in the beginning of this episode
00:40:52He was a PE
00:40:56A professional engineer in electrical engineering. That meant that he went to school
00:41:02He took the PE exam. He interned at a company for several years, and then eventually, after taking a big test
00:41:09Then and only then was he able to do engineering projects. There is a high level of discipline
00:41:18in how
00:41:21You do things. I mean, it's the same thing with doctors
00:41:23Like you learn the academic but then you're like in the field for years under supervision before you can actually do medicine
00:41:30And and there's a reason for that. It's because
00:41:33the decisions that we make have massive ramifications on the lives of human beings, on the environment, and
00:41:40in the world, um, and software hasn't quite
00:41:44caught up to that
00:41:47yet, and i'm not necessarily saying that we need to emulate those models, but
00:41:52We do have to at least
00:41:54acknowledge the risks that we are introducing. We do need to have the discipline, when we're building things that affect the lives of human beings
00:42:01We need to have that discipline. We need to think about risk. We need to think about the adverse outcomes
00:42:05we need to take responsibility and I think
00:42:08There's a lot of organizations that don't want to hear that because it pushes back on on their ability to make progress
00:42:14but
00:42:16Why are we writing software? We're solving problems for people we shouldn't be introducing new ones
00:42:23Yeah, and it's a very interesting point that you said about your friend, sorry, forgot his name, the PE, professional engineer. Curtis. Because Curtis...
00:42:31Yeah, because here in Canada
00:42:33There was a discussion that here, in order to be called an engineer, you actually need a trade certification
00:42:40And then there was the debate that software engineers shouldn't have the title engineer because they haven't earned that certification
00:42:48Yes
00:42:51Yeah, that resonates that resonates
00:42:53Yeah, the level of discipline is just completely different between our industry and theirs
00:42:57Yeah, exactly. Um, I wanted to address something that you've written on your
00:43:03LinkedIn bio. You mentioned "from burnout to bottlenecks," no, "from burnout and bottlenecks to reliable, fast-moving teams and systems"
00:43:12So what's your what's your experience with burnout?
00:43:16Oh my goodness. I have a lot of experience in burnout
00:43:21You know as a as an operations person at fast growing organizations
00:43:26There
00:43:28are definitely people that have, like, hero complexes, people that want to prove themselves, people that have
00:43:35those little things inside of the psychology where they feel they need to
00:43:39To to jump in and just own things. It's a vulnerability right imposter syndrome
00:43:45All of those tendencies that we all have as people people pleasing all of those things can contribute to to burning out where you
00:43:52Are so motivated to prove to others that you are doing good work that you're not listening to your body
00:43:59You're not listening
00:44:02To your to to your emotions your emotional state. You're not being introspective. You're not giving yourself time and space to rest
00:44:08Your relationship with rest is is dysfunctional like there was a period of time where I thought that rest was bad
00:44:14That the idleness
00:44:16Was wasteful, when, you know,
00:44:20in this season of my life now, where I'm thinking a lot about this stuff: no, rest
00:44:25Gives you the ability to do the big lifts later
00:44:29You know, people that, like, you know, bodybuild, they're not in the gym every single
00:44:35day doing every single body part. They give themselves time
00:44:40to rebuild the fibers in their muscles
00:44:45So why why are we?
00:44:47You know
00:44:49Saying we can't do that when it comes to our minds
00:44:51So that's the the theme but I I remember
00:44:55After leaving Meta, I was, like, patient zero in the gigantic wave of layoffs
00:45:01So I was affected by the layoffs. That was what... Sorry about that
00:45:04Oh, no, I mean, Certomodo would not exist if it were not for that. Okay, cool
00:45:08Right, from the ashes, a phoenix, right? A hundred percent. That was my response to it
00:45:13But that period where I'm starting my own business
00:45:18And i'm taking a step back from the industry and thinking about how I want to work. It really made me
00:45:23Become very aware of the burnout that I was carrying with me and not, you know, acknowledging and um,
00:45:32You know, I remember earlier in my career being the grumpy sysadmin, and if there are any folks
00:45:38from Acquia, or former Acquia, that are listening to this, you know what I'm talking about
00:45:42I used to be real grumpy
00:45:44and like i'm i'm very jovial today because i've done a bit of work on myself, but
00:45:49That was like burnout and I didn't recognize it
00:45:51I was just thinking that too many people were coming up to me with requests
00:45:56When they should be doing it themselves and like read the man page, you know what i'm saying?
00:45:59I dealt with a lot of burnout, and I think my business and the way that I run it gives me
00:46:04the space
00:46:07to
00:46:08Tackle those challenges head on and to interact with work in a way that is more healthy
00:46:13and maps more to how my brain works what I want in my life and making sure that
00:46:20You know, i'm not
00:46:23Living to work. I'm i'm working
00:46:25So that I can have experiences that fill me up. You know what I mean?
00:46:28What do you think about that? I know I kind of bounced around
00:46:31I was gonna say that the grumpy
00:46:34sysadmin, I can relate to that. Not for myself, but, like, companies I've worked at, there's always a guy who
00:46:40knows how to spin up Kubernetes or knows everything about AWS, and if you want something from him
00:46:44He's like really? Okay, and then he goes and does it but doesn't want to do it so I can relate with that
00:46:49but yeah, I can see why that happens and where it comes from purely because of the fact that it happens a lot and
00:46:56So I can see the frustration
00:46:59And so yeah, I think it's good to to get a break from that and um, you have to work on yourself and try other things
00:47:05For sure. There's personal responsibility involved in that but it's also being very cognizant of the culture
00:47:11Of the environments in which you work the teams in which you work the people you interact with
00:47:15And making sure that you're choosing places that are best for you. I know I know the job market is still a little funky
00:47:21It might be hard to make
00:47:22You know those types of choices and being being able to say i'm just going to go to another place
00:47:26That might be really difficult to say I recognize that too, but at least being mindful
00:47:30Of where you are and listening to your emotions listening to your body and and doing things about them
00:47:36I think is a big piece of the puzzle. There is a
00:47:39doctor, a Dr. Maslach. She made something called a burnout inventory
00:47:44She originally designed it for medical personnel because she was in the medical field
00:47:49But the burnout inventory and her research are extremely applicable to tech
00:47:56and you can you can even like
00:47:58Check out some of her talks. I think she did
00:48:01I think she was doing talks for the annual DevOps event that the author of The Phoenix Project
00:48:06like ran
00:48:08so
00:48:09Yeah, I've read a little bit about the clinical aspect of burnout, as well as experiencing it personally
00:48:15Yeah, and i'm super happy that this layoff turned into something good for you it led the way
00:48:24To you establishing your own company and I find it so interesting that from what I can see from your posts
00:48:31Like you're running this company
00:48:33basically on the road, right you're
00:48:35Moving all the time. You're basically living kind of like a nomadic lifestyle. Is that correct? That's right
00:48:42I do live nomadically. I am literally on this call using Starlink in the middle of the high desert in Nevada right now
00:48:50Um, I converted my Tacoma pickup truck,
00:48:54I called her Molly. I converted my, uh, pickup truck into a
00:48:59Mobile battle station. I am standing
00:49:02in the truck bed
00:49:04um
00:49:05And I have comfortable sleeping quarters 300 watts of solar
00:49:09Um a refrigerator, uh, you know a power system I built myself. So i've made this
00:49:16Micro home out of this truck, and I've worked from everywhere, like, my property out here in Nevada to
00:49:23The parking lot of an AMC movie theater somewhere in Boston. Like, I work from anywhere, which is a lot of fun
00:49:29I've been doing this type of thing
00:49:31For two and a half years. I started Certomodo three years ago
00:49:34I've been living nomadically for the past two and a half, and it's been the best experience, adventure of my life. Like
00:49:41You know
00:49:43Doing work in the mountains. I had a bear climb in my truck once
00:49:46Oh, wow
00:49:48Yeah, what was that like?
00:49:51It was it was a little scary, uh back in back in those days. I was in a rooftop tent
00:49:54Not my cool camper thing now. Was it a black bear or a brown bear?
00:49:58Well, I am I am alive. Therefore. It was a black bear. Okay, cool
00:50:02It was it was the biggest black bear i've ever seen it was like it was gigantic
00:50:06It was it was looking for picnic baskets or something
00:50:08but
00:50:09It crawled into the truck bed where I was sleeping above on the bed rack with the rooftop tent and like it it shifted the weight
00:50:17And I had to I had to get the the keys out of my pocket and press the alarm button and go
00:50:21Do that and then the bear ran off when the alarm went off
00:50:24Yeah after after that I understood the necessity of building something that is enclosed like this thing is enclosed. At least the bear can't see me
00:50:30or get in there, so
00:50:33So you mentioned Nevada and Boston so you've done like a cross-country trip with your truck every year. Yeah every year
00:50:42I'm, actually planning to return to the northeast in a month right now. I have a six by ten cargo trailer that I am converting
00:50:48Like after this call i'm drilling holes and installing windows and vent fans
00:50:52And installing a thousand watts of solar, and we're doing the whole business, and it's going to be my
00:50:58mobile war room and office that I can work in during inclement weather
00:51:02That's so cool. And for some engineers or people who are also interested in this
00:51:08What's the price tag for building such a mobile home as you've done?
00:51:13I mean the the cargo trailer thing is actually rather economical you can pick it up for like four to five thousand and then
00:51:20Whatever you need to do, the price can go up, but you need to insulate it
00:51:24Obviously you need some ventilation some electrification
00:51:27There's a whole universe of people out there that do this kind of thing
00:51:31Um, i'm just one of the few that like combined nomadic living and tech and made it work
00:51:36But, you know, the truck, it's a Tacoma with a custom camper on top from a company called Go Fast
00:51:41I got this one used but the um
00:51:44Yeah, it really depends on what your needs are. I mean, I see people rocking rocking it out in a prius
00:51:51you know and then I also see people in more sophisticated rigs, but
00:51:55yeah cost of entry can be can be low if you know a bit about cars and you're willing to
00:52:01You know roll up your sleeves and do work
00:52:04So what inspired you to start this endeavor of?
00:52:07camping and traveling and
00:52:10pimping up your car
00:52:11Yeah, um, well i've only lived in two places prior to
00:52:14This journey and I was spending my whole life working and and basically doing what what people told me to do
00:52:21What is expected of a young man, you know, you go to college you get your degree you get a good job
00:52:26You work your way up. I did that
00:52:28and
00:52:29and at the end of that corporate road, I realized hey man, i'm not as happy as
00:52:34I I should be about this i've made some massive accomplishments, but i'm not feeling it and since I started my own business and realized hey
00:52:41Do I really want to pay this much rent in the boston metro area?
00:52:44If I didn't stay in boston, what would I do and I found a youtube video of this guy who took a u-haul truck
00:52:51You know like a moving truck
00:52:53And he turned it into an apartment on wheels and i'm like
00:52:55Huh?
00:52:58That's kind of cool
00:53:00And I started going down this youtube rabbit hole and learning that there's like a whole
00:53:05you know community of people that live like this and I just
00:53:08Researched and watched a bunch of youtube and then decided I I sold everything
00:53:13I did not renew the lease at that apartment
00:53:16And drove out west in my Civic, and a few months later I got Molly and set it up and
00:53:22And started but yeah, it was it was an understanding that I I had a lot of
00:53:28Life experience that I needed to catch up on
00:53:31Um, you know, I spent too much time ironically on a tech podcast. I spent I spent too much time
00:53:37On the computer and not enough time out in the world and I realized I needed to have that experience to balance me out because
00:53:44I'm more than just a DevOps/SRE dude. Like, I'm a human being with my own
00:53:49my own flavor
00:53:52I've got experience needs to go and touch some grass
00:53:55I gotta touch some grass and and I've touched more more than grass out here in the world rocks
00:53:59You know mountains water all kinds of stuff. There's all kinds of things to touch out here. Yeah 100
00:54:04So you mentioned your bear incident has there been any other?
00:54:08Interesting adventures while you've been on the road
00:54:12I mean, there are so many. I mean, I did Cinnamon Pass in Colorado with friends, which is an off-road, you know,
00:54:18A little technical, a little challenging. I've been to Wyoming, I've been to Yellowstone, Grand Teton
00:54:24I camped in some public land near there where you can see the bears. I sometimes, uh, hear
00:54:30Uh, wild burros out where I'm at. I saw wild horses a few days ago
00:54:34You can you can hear them walking and you look out the tent window and there it is. There's some horses
00:54:39um
00:54:41But yeah, i've i've crisscrossed the country a good number of times
00:54:45um
00:54:47And uh, yeah, it's been it's been nice. I I have a sweetheart that's willing to
00:54:51Indulge my craziness and and we go on adventures together
00:54:54um, and uh, yeah, it's it's it's been fun. I mean that could be a whole episode about all the crazy places i've been
00:55:01Oh, yeah
00:55:03That sounds so cool. Probably there is an episode on your own podcast, right?
00:55:08About this, you know, there should be because usually usually I I well because I bring on a guest and we're talking about you know
00:55:15The subject but maybe I should do a just one talking about this
00:55:20yeah, even the um the takes and your lessons from burnout, I think that would be a very
00:55:25useful episode for a lot of engineers especially nowadays because I feel like
00:55:29we've been promised that AI will
00:55:33Make us more productive and we'll have to work less hours when in reality. I feel like it's the opposite basically
00:55:41Sometimes you're just being burned out by the amount of work that you're that you're expected to do with these tools now available
00:55:50Yes, my views on that would probably be considered very subversive
00:55:53And i'll just i'll just leave it at that where where we need we need we need more workers. Let's be honest
00:55:58That's what we are. We are workers. We we need conditions that are more human
00:56:03Totally
00:56:06Yeah, agree
00:56:08We always like to ask our guests. Do you have any hot takes about sre dev ops?
00:56:14Ai anything related to tech. Yeah. Yeah. All right hot take and I say this I say this with the greatest amount of
00:56:22Respect. I don't mean to gatekeep the practice of SRE
00:56:26But I'll say this: if you have an SRE role, if you have a job title
00:56:29That's SRE
00:56:32And right now you're in the corner writing YAML
00:56:34And getting paged
00:56:37I want you to really question whether or not that is in fact an sre role
00:56:40SRE is a practice where you are taking a not-so-reliable system
00:56:45And transforming it into a reliable system
00:56:47And making customers happy through SLOs
00:56:50Toil management, capacity planning, right? All of those higher-order operational responsibilities, and using software engineering
00:56:59That's very, very important. That is SRE practice
00:57:01So YAML is not a programming language. If that's what you're doing most of the time
00:57:06I encourage you, you know get out into deeper waters. It's great out here
00:57:11there's a great community a lot of people that will be more than happy to teach you but
00:57:15Make sure you're finding uh roles that challenge you and help you grow
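[Editor's sketch, not part of the conversation: the hot take above describes SRE as making customers happy through SLOs while using software engineering rather than only YAML. One way that can look in practice is a small error-budget calculation. This is a minimal illustration under stated assumptions; the 99.9% target, the request counts, and the names ErrorBudget and should_prioritize_reliability are all hypothetical and do not come from the guest or from any particular tool.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""

    slo_target: float      # hypothetical target, e.g. 0.999 = 99.9% success
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = budget untouched, 0.0 = fully spent, negative = overspent.
        allowed = self.allowed_failures
        if allowed <= 0:
            # A 100% target (or an empty window) leaves no budget to spend.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed


def should_prioritize_reliability(budget: ErrorBudget, floor: float = 0.0) -> bool:
    """Hypothetical policy: put reliability work first once the budget hits the floor."""
    return budget.remaining_fraction <= floor


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    on_fire = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    for name, budget in (("healthy", healthy), ("on fire", on_fire)):
        print(
            f"{name}: {budget.remaining_fraction:.0%} of budget left, "
            f"reliability first: {should_prioritize_reliability(budget)}"
        )
```

[A check like this is one possible way to give product and engineering a shared number to prioritize against, which is the kind of feedback loop discussed earlier in the episode. The floor and the policy itself are choices each team would have to make for its own service.]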
00:57:18The Phoenix Project book, it made the word DevOps popular, and since then there are people who are DevOps engineers who do what
00:57:26you said, like, write YAML and, um,
00:57:29Spin up Kubernetes and all that stuff, and it's like, kind of like, the DevOps
00:57:33That the book was talking about is not somebody who who does that. It's more like what you explained bridging two different
00:57:39Was it what's the word like not systems but disciplines together?
00:57:43And I think it's quite common like I said in the world to see i'm i'm a devops engineer
00:57:48I'm learning devops and it's like it's it's not that it's
00:57:51What you explain it's difficult now to bridge that because it's now so common that people are devops engineers
00:57:56It's not seen as what you explained. But yes, it's difficult to go back now, isn't it?
00:58:02Yeah, I mean we're talking about semantics and names so i'll try to qualify what i'm saying when I when I mean devops
00:58:09Yeah, indeed
00:58:10We're not talking about a set of tools. We're not talking about a specific team. You know, title, tool, or team, people talk about this
00:58:15All the time. No, it's not that. It is the practice of joining
00:58:19Technology, people, leadership, process, so that we can deliver software to the customer in a way that is as fast as possible
00:58:27Corruptive and is as good as as good for the business as possible like that to me is what devops is and the means to get
00:58:34there is diverse
00:58:35It isn't it isn't just kubernetes kubernetes is a piece of the puzzle. It's not just you know, cicd
00:58:40It's a piece of the puzzle. Sometimes it's also sitting down and listening to people. Sometimes it's talking about building vision and strategy sometimes
00:58:47You know, it's about sending your teams on a day off after they got paged one too many times at 3 a.m
00:58:52Devops is about that whole
00:58:55Holistic view rather than just the tools the tools are the tools are sexy. I get it people want to sell the tools
00:59:01But that's only one facet of the whole experience. Yeah, speaking of tools. Do you have any of your favorite tools that you use?
00:59:08Oh gosh, okay. Uh, let me let me think about that. Yeah, i'll i'll i'll volunteer one. So
00:59:14there
00:59:15have been a lot of incidents at GitHub recently
00:59:20And we've kind of gotten into the habit of, hey, I would like my build processes and my testing pipelines to be
00:59:26in a vendor. There is a tool that you can run instead if you want to self-host your stuff
00:59:32called Concourse
00:59:35And I like Concourse. It allows you to build very sophisticated
00:59:38Pipelines for testing or build or delivery or what have you, using YAML. Each little section runs in a container
00:59:47But you self-host it and the reason why I really like it
00:59:50Is that the open source community,
00:59:53like, the governance has its own governance model
00:59:56It's not owned by a company that can, you know, swap out the license and turn it into a SaaS
01:00:01Which we've seen many times for other projects. So if you're getting frustrated with, uh,
01:00:06you know
01:00:09Using the cloud service du jour for your CI/CD, check out Concourse. Use it, run it on-prem
01:00:16You know, maybe you're going backwards five or ten years in terms of philosophy, but it might be more stable. Who knows?
01:00:20Cool. I've never heard of it. I'll have to look into it
01:00:23Yeah, me too
01:00:26Yeah, it's good stuff. Oh, there's some companies out there that definitely use it
01:00:29Um, and yeah, friends don't let friends run Jenkins. Like, that's done. Don't do that. Don't do that
01:00:35So you're saying Jenkins is dead
01:00:38Did I say that
01:00:41No, no, no, just, it wasn't a gotcha question
01:00:44But what I am saying is that there's some more options
01:00:47And I don't think people want to write Groovy scripts anymore. So yeah
01:00:51Other things out there have been a pain for me as well
01:00:54Yeah when i've used it
01:00:57Yeah
01:00:59Very good
01:01:01Is there anything you want to plug? Like, uh, you do a podcast. Is there anything else you want to talk about before we wrap up?
01:01:06Sure, let's do that. I'm always grateful for that. Yes
01:01:09I, uh, am a consultant at my company, Certomodo. That's c-e-r-t-o-m-o-d-o.io. I specialize in assessing
01:01:18Companies' reliability, DevOps, and SRE posture. If you're getting paged a lot,
01:01:22If you're not shipping often, if customers are angry,
01:01:26You should definitely put some time on my calendar. I also run a podcast called Reliability Rebels
01:01:31Where I interview people talking about
01:01:33How SRE isn't just the tools. It's also challenging the status quo,
01:01:38Talking about the socio-technical. And then on the 24th
01:01:41Of February, I am doing a, uh, webinar about
01:01:45the
01:01:47AI flood of code
01:01:48I think I call it the AI code tsunami. Um, and I do webinars monthly talking about all kinds of interesting subjects
01:01:54So if you're interested check out my website and you can learn all about that stuff. Thanks so much for the opportunity to plug
01:01:59No worries. Thank you, Amin,
01:02:03For speaking about all these SRE adventures and lessons learned. It's been very fun to have you
01:02:09So thank you everyone for listening to this episode of the Better Stack podcast
01:02:13Subscribe to our show wherever you get your podcasts: Apple, Spotify, YouTube, pick your poison. But for now
01:02:20It's a goodbye from me
01:02:22It's a goodbye from me
01:02:25And a goodbye from me
01:02:33(upbeat music)

Key Takeaway

As AI-driven development dramatically increases the volume of code, the role of SREs shifts from manual tasks to managing the resulting 'code tsunami' through robust fundamentals, aligned incentives, and socio-technical leadership.

Highlights

The 10x code tsunami: AI is generating an order of magnitude more code, which will stress test CI/CD pipelines and increase production incidents linearly.

SRE as a socio-technical practice: Reliability isn't just about tools like Kubernetes; it's about aligning human incentives and business goals with technical capabilities.

The bridge analogy: Engineers remain essential because we cannot yet trust AI to design critical infrastructure like bridges or skyscrapers without rigorous human review.

The 'Compensatory Principle' of automation: AI should be used to force-multiply human effort rather than replace management decisions or operate autonomously in high-risk environments.

Incentive alignment: High-performing organizations like Meta treat operational excellence and reliability as key metrics for engineer promotions and bonuses.

Nomadic engineering: Amin Astani runs his consultancy while living in a custom-built truck, demonstrating the viability of combining high-level tech work with an outdoor lifestyle.

SRE Career Longevity: Focusing on fundamentals like Linux, systems architecture, and SLOs is more critical than ever as AI shifts the bottleneck from writing code to operating it.

Timeline

Introduction and the Trust Factor in AI Design

The episode begins with a provocative analogy comparing AI-generated code to the design of a bridge or skyscraper. Guest Amin Astani asks if anyone would feel comfortable sitting in a building designed solely by AI without human engineering oversight. This sets the stage for a discussion on why human engineers remain vital in the age of generative technology. The hosts introduce Amin as an 'SRE wizard' and the founder of Certomodo. This section establishes the central theme: the necessity of human accountability in automated systems.

Amin's Journey: From Red Hat Linux to SRE Mastery

Amin recounts his entry into the tech world, which started in high school when his girlfriend's mother introduced him to Red Hat Linux and the KDE desktop. He describes his early fascination with operating systems and infrastructure, eventually leading to a career in classic operations at Acquia. As Acquia experienced 'rocket ship' growth, Amin encountered the manual toil and scaling challenges that necessitate Site Reliability Engineering. He credits early resources like 'The Phoenix Project' and 'The Practice of Cloud System Administration' for shaping his transition into the SRE discipline. This journey highlights the shift from traditional IT roles to the modern reliability-focused mindset.

Socio-Technical Practices and the Reality of Human Error

The conversation shifts to the philosophy of SRE as a socio-technical practice that bridges the gap between technology and business outcomes. Amin explains that human error is never the end of an investigation but rather the beginning of a deeper look into systemic factors. He references work on airline accidents and post-mortem culture to illustrate that tools must be ergonomic to prevent fat-fingering commands. One host shares a specific example of surgeons clicking buttons too fast for software to handle, causing glitches. This section emphasizes that reliability requires understanding the intersection of human behavior and automated workflows.

The Evolution of Monitoring: From Pagers to Planetary Scale

Amin shares nostalgic stories from the 'beeper era,' where he carried a physical pager for his first tech job in 2005. He describes early monitoring tools like 'What's Up Gold,' which relied on simple ICMP pings rather than sophisticated HTTP health checks. The transition to Nagios provided more context for alerts, such as specific services being down, though it was still 'pre-SLO' thinking. Amin reflects on working with supercomputers at the University of South Florida, where managing hundreds of servers required growing sophistication. This history provides context for how far the industry has come in terms of observability and incident response.
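
To make the ping-versus-health-check distinction concrete, here is a minimal Python sketch of an HTTP health check. It is an illustration only, not how What's Up Gold or Nagios work internally; the function names, the 2xx rule, and the example URL are all assumptions. The point it shows is that an ICMP ping only proves a host is reachable, while an HTTP check also proves the service is answering requests.

```python
import urllib.error
import urllib.request


def status_is_healthy(status_code: int) -> bool:
    """Treat any 2xx response as healthy (an assumption; some teams require exactly 200)."""
    return 200 <= status_code < 300


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the service answers the HTTP request with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return status_is_healthy(response.status)
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx replies
        # (urllib raises HTTPError, a URLError subclass, for those).
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real health URL.
    print(http_health_check("http://localhost:8080/health"))
```

A host can keep answering pings while its web server is hung or returning errors, which is the gap this kind of check closes.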

Lessons from Meta: Production Engineering at Scale

Amin details his experience at Meta (formerly Facebook) as a Production Engineer working on massive systems like Conveyor, a global CD system. He discusses the importance of internal tools like Scuba, which inspired the founding of Honeycomb and influenced his thoughts on agentic coding. A significant portion is dedicated to Meta’s incentive structures, where reliability and scalability work are directly tied to performance reviews and bonuses. This cultural alignment encouraged software engineers to actively participate in reliability tasks rather than siloing them. Amin argues that this holistic approach is what separates high-maturity organizations from those that simply 'ship and pray.'

The AI Code Tsunami and the Future of SRE

Amin presents his thesis on the 'AI Code Tsunami,' arguing that a 10x increase in code production will lead to a 10x increase in alerts and incidents. He contrasts the 'Leftover Principle' of automation with the 'Compensatory Principle,' advocating for AI as a force multiplier rather than a decision-maker. The discussion explores how smaller organizations are now facing 'big tech' problems because their code flow has increased so rapidly. Amin warns that AI-generated code will reveal every weak link in a company's operational posture. He stresses that SREs will become more popular as they move from the background to becoming the primary constraint in the software delivery value stream.
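
The arithmetic behind the thesis can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, assuming incidents scale linearly with code volume as the episode suggests; the baseline of 20 incidents per month and the 4-person on-call rotation are made-up numbers, not figures from the podcast.

```python
def projected_incidents(baseline_incidents: float, code_multiplier: float) -> float:
    """Assume incident volume scales linearly with the volume of shipped code."""
    return baseline_incidents * code_multiplier


def load_per_engineer(total_incidents: float, on_call_engineers: int) -> float:
    """Spread incidents evenly across the on-call rotation."""
    if on_call_engineers <= 0:
        raise ValueError("on_call_engineers must be positive")
    return total_incidents / on_call_engineers


# Hypothetical inputs, chosen only to show the scaling.
baseline = 20.0          # incidents per month before AI-assisted coding
multiplier = 10.0        # the "10x" code increase
rotation_size = 4        # engineers sharing on-call

after = projected_incidents(baseline, multiplier)
print(load_per_engineer(baseline, rotation_size))  # 5.0 incidents per engineer
print(load_per_engineer(after, rotation_size))     # 50.0 incidents per engineer
```

Under these assumptions the rotation goes from 5 to 50 incidents per engineer per month unless headcount or operational practices change, which is why the operating side becomes the constraint.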

Burnout, Nomadic Living, and Closing Hot Takes

In the final section, Amin opens up about his personal experience with burnout and how it led him to adopt a nomadic lifestyle. He describes his current setup: a custom Tacoma truck 'battle station' equipped with Starlink and solar panels, allowing him to consult from the Nevada desert. He offers a 'hot take' that many SRE roles are actually mislabeled operations jobs if they consist solely of writing YAML and getting paged. The episode concludes with recommendations for books like 'Toyota Kata' and tools like Concourse for self-hosted CI/CD. Amin's closing message encourages engineers to touch grass and find balance while staying grounded in technical fundamentals.
