00:00:00This episode of the stand up is going to be extra special because Casey is going to do the intro.
00:00:05Casey, what are we talking about today? Hello everyone and welcome to the stand up. The number
00:00:1245, 6 best tech podcast on Spotify according to the most recent something. True.
00:00:29Anyway, sorry. Today on the stand up, I wanted to cover something. I'm going to talk about the AWS
00:00:34outage that happened in October, but I'm doing so because I kind of wanted to talk about a bigger
00:00:41thing, which is the idea of actually understanding something versus saying you understand something.
00:00:49So like one of the things that happens a lot, especially I think to people who are earlier in
00:00:56their programming career, like if you're a junior programmer or something, you're coming in. And I
00:01:01know this was certainly true of me, is you want to seem like you know stuff, right? Like you don't
00:01:08want to seem like you don't understand what's going on. So there's a lot of like external pressure,
00:01:14whether it's really there or not, you feel like you should kind of say that you understood something
00:01:21or pretend to understand something, even if it's like a little bit hazy or you didn't quite get it.
00:01:26And even if it wasn't your fault, like even if the thing wasn't explained properly or didn't include
00:01:33like important information, you're still incentivized to basically act like you knew what
00:01:39it was, right? Because it just makes you seem smarter or something, or at least doesn't make
00:01:43you seem junior, right? And so one of the things that at least I've found as I got older and
00:01:50programmed, had more programming experience and things like that, is nowadays I like almost
00:01:58over ask for things to be explained. Like I'll like, I don't care about looking dumb at all.
00:02:03I'm like, wait a minute, go back. Like I didn't understand that part. Like, what do you mean by
00:02:06this? Or like, what's that term mean or whatever? Because now I just don't really care about that.
00:02:12Like I'm not as worried. And I want to actually know because I've had so much experience programming
00:02:18where I thought I knew something or I pretended I knew something and it came back to bite me.
00:02:22I'm like, I want to actually know. Like I want to be sure that when I have an explanation of a bug
00:02:28or I think I know the reason of a performance slowdown, I always in the back of my head,
00:02:33I'm like, if I haven't really gotten to the bottom of this, it could be something else. It could be,
00:02:38it could be that the real thing is still hiding in there. And I just don't know because I haven't
00:02:43really looked at it all the way. I'm just, I'm moving on because it's convenient or whatever.
00:02:48And so the reason that I wanted to talk about the DynamoDB outage is because recently there's
00:02:55been kind of a string of high profile outages. So there was like a big one that took down Google
00:03:01and it turned out it was a thing where like they didn't handle a field being empty, right? So that
00:03:07their programming, the way they were programming, they were like, okay, we have this thing. We load
00:03:12some JSON and if there's nothing in the JSON, it's just, we like, we deref a null pointer or something,
00:03:17right? It was like literally that, right? And then there was one with CrowdStrike where they
00:03:23were like, they took down the entire world with blue screens. And that was, they gave a very good,
00:03:28it was like a really good explanation of it. They were like, we do this certain array sizing thing
00:03:33and we had too many rules. So it like overflowed the array, right? And so these were like pretty good
00:03:39when they gave what they call RCAs, or root cause analyses, right? When they said like,
00:03:45here's why we went down. When I read them, I didn't feel like there were a lot of unanswered
00:03:50questions in my mind. Like maybe I didn't know like literally the line of code that, because they
00:03:55maybe didn't publish literally the piece of code, but they gave me enough that I was like, okay,
00:04:00I understand how someone wrote this code and I understand the stupid thing that they did,
00:04:05right? That like, okay, don't do that thing. I understand. And I'm totally like, okay.
00:04:10With the DynamoDB one, because it came up on this podcast, right? We talked about it when that,
00:04:17that dude at the Guitar Center, right, was like, I overheard someone talking at the pub, right?
00:04:24Yes. Incredible. Here we see the elusive programmer, a simple creature that spends most of its time
00:04:31working alone, often in darkness. But what's this? Someone being wrong on the internet. Our coder
00:04:36springs into action, reaching top speeds of 120 words per minute before... flash! A light mode website,
00:04:42the natural enemy of these code lovers, stuns our friend. The chase is called off. We'll have to
00:04:46get them next time. When not on their computers, they can spend hours drawing crude symbols,
00:04:53something they call whiteboards. Researchers have discovered thousands of dialects often with more
00:04:58than a dozen used in a single office. However, no linguist has yet deciphered what their purpose is.
00:05:03Vain creatures, their bodies have evolved over millennia to be able to sit in unusual postures
00:05:11while looking at themselves online. This will often last for many hours using the excuse they're
00:05:16waiting for code review when pressed as to why they're so inactive. And finally, after a long day of
00:05:22accomplishing very little, our keyboard warrior's ready for bed. Quick read and it's lights out.
00:05:29Good night, little coder.
00:05:30So how do I sleep so well at night? Well, I have Sentry to help me crush those bugs. And I'm not,
00:05:38I'm not talking about like little teeny-tiny South Dakota bugs that die in the winter. I'm talking
00:05:42about big, mean jungle bugs. And I'm not scared of any of them, by the way, just, but I can squash
00:05:50those bugs with Seer by Sentry. So I was kind of a little more motivated about that one to go like,
00:05:56okay, let me go see like what, how much information they've posted. And I had read,
00:06:02I had already kind of read afterward, they had a summary where they posted an RCA and it was very
00:06:07vague. Like the RCA just did not really explain very much. I then noticed that they posted a full
00:06:14presentation like at re:Invent in December, they, or I guess I don't know if re:Invent was in December,
00:06:20but the video went up in December of the re:Invent presentation where they covered this outage.
00:06:26So I went and watched all of that. And after having read the entire RCA and watched the entire
00:06:32presentation, I still was left going. I don't see an actual explanation of the bug here,
00:06:39right? Like I'm trying to figure out what the actual bug was and it just wasn't ever explained.
00:06:45And so what I kind of wanted to do was just talk about that, go through why I don't think they
00:06:51explained what the bug was and just use that as an example of like, I don't think people should just
00:06:56go, Oh, okay. I get what the bug was. Cause people have like replied to me and gone, Oh, here's,
00:07:01let me explain to you what the bug was. And then they just explained the same things. I'm like,
00:07:04that's not the bug. Right. So everyone, see, is like incentivized to go like, I understand it. Cause I
00:07:09read it's like, no, if you can't tell me what the actual bug was, then we're not done here. Right.
00:07:14Like we should have that fuller explanation. So does that all make reasonable sense? Like
00:07:18what I'm saying? Yeah. First off, I just want to say, I knew exactly what you were saying, Casey.
00:07:23Like right from the start, right. It's like right away. You, you were like, okay, I know,
00:07:32I know exactly what you're saying. No questions on my end. No blockers. Thanks everybody. I'm great.
00:07:39I'll see you guys tomorrow. You know, no problem. I just want to say, I really like listening to
00:07:43Casey talk on the podcast when I listen on Spotify, but also just right now, like I could listen to you
00:07:47talk for an hour. Great shout out too, for the Spotify. I was just going to say, I was going to
00:07:52say like, especially when you listen on Spotify, the quality is incredible. You also get the bonus
00:07:59extras, right? You get all the banter before and after the actual extra. We started posting longer,
00:08:08longer versions on Spotify that are like more of the extra. Yeah. Time less of on top is not on
00:08:15topic stuff, but a little more yappenings on Spotify because the live audience gets the yapping.
00:08:19They get to come in here. They get to hear about trash and his Pokemon addiction, which you probably
00:08:23don't even know about because you were not, you were listening to this on YouTube, right? You don't,
00:08:26you're going to, you don't get to hear all the fun stuff. That's kind of a hard sell for the first 10
00:08:31minutes of a YouTube video. It's a very hard sell for a YouTube video. Be like, I'm going to watch
00:08:36four guys talk about something I don't even understand. And it's called DynamoDB. Yep.
00:08:41Since we're starting the podcast, maybe we should introduce Adam. Oh yeah. That's a very good point.
00:08:45We haven't done any at all. Hello. Tell us a little bit about why you're on the podcast today.
00:08:50Cause I am at TJ's house number one, number one reason why TJ requires all people who visit his
00:09:00house to be on the podcast. It's been awkward at a couple of times. Yeah. Yup. Who are you really?
00:09:07Other than an AWS hero. I'm not even that. I was an AWS hero. All right.
00:09:13Kicked out of the superhero group. Like how does that work? You just don't, you don't get renewed.
00:09:18I was a one-term hero and they decided, Oh, you, is it like a paid up thing? You pay to be a hero?
00:09:24No, no, I just, I didn't really care about it anymore. Or talk about it ever. So they were
00:09:28like, maybe he's not a hero anymore. Now he's a villain. Casey looks like he's part of like some
00:09:34murder mystery. He's standing there. Oh dude. We're, we're about to get, uh, like the, uh,
00:09:39what is it? Nick Hill. What's the person that does all the like drawing on the board. And then
00:09:43it shows up. Casey Muratori. That's the one you're thinking of. Muratori. Is it Muratori or is it
00:09:49Muratori? Oh my God. You're about to do visuals. Aren't you? So I know this is the best podcast.
00:09:54It's literally, this is the best one to be a part of. Uh, it's pronounced Muratori by my family. Like
00:10:01almost like there was a Y there like Muratori, but that's correct. It doesn't really make any sense
00:10:06because it's an Italian, it's an Italian name. And in Italian, it'd be Muratori or Muratori.
00:10:13Doesn't make it. So why, how it got mur I have no idea. That was some Italian American like
00:10:23immigrant thing that happened. I guess. I don't know. Okay. So here's effectively what they said.
00:10:28They have these things called API endpoints, what they call them, right. And these are the domain
00:10:36address. Like if you look up in DNS, it's the name that you're going to look for to know who
00:10:41you're supposed to send like your DynamoDB requests to. And these things, I guess, look like this.
00:10:47And Adam can probably confirm this because he is, or was a hero.
00:10:53They look like, Oh, it's behind. Yeah. We're, we're a few seconds behind. Cause our video
00:11:00disappeared on river. Oh, there we go. So they look like dynamodb.us-east-1.api.aws
00:11:06or something like this. And I guess it depends whether you're using IPv6 or IPv4. Like they have
00:11:14different names depending on things or whether you're using like a specific, like they talked
00:11:19about governments use like a different one or whatever. So these names are like names that you
00:11:24effectively hard code, I guess, into your application where you're like, when I need
00:11:28to do something with DynamoDB, I'm going to like ask for this. Does this make sense? And does that
00:11:34sound right? Adam to like, cause I don't use AWS stuff. Yeah. Yeah. Yeah. That's all right.
00:11:38So, you know, you, you asked for something like this and you're going to send perfectly.
00:11:42I mean, I know what he's saying. Yeah. So that then is going to redirect you somewhere because
00:11:53obviously there isn't like one machine that's going to handle all the DynamoDB traffic in the
00:11:59entire universe. Even if you subdivide it by region, which you can see here, you're kind of
00:12:04supposed to pick a region. I guess you don't, you don't send it to some main address. You send it to
00:12:09a regional address or maybe there is a main address you can use that will figure it out. I don't know.
00:12:13But anyway, at some point you're talking to this and this needs to point to effectively like a load
00:12:19balancing scheme. So this thing is supposed to point to effectively what they called a DNS tree.
00:12:27Although they never really explained the tree nature of it at all. It sounded more just like a,
00:12:32like a weighted array, if you will, where you just said, here's a bunch of machines and you're going
00:12:38to pick those machines based on weights that we set so that we can load balance, right? So if a machine
00:12:44gets behind, maybe we set its weight to lower. And if a machine seems kind of empty, we set its weight
00:12:48to higher. And so they called it a tree. So I'm assuming it's a tree. They never explained what
00:12:53the tree part of it was, but this name is supposed to point. Can I interrupt for one quick second?
00:13:00By the way, someone did get their L6 promotion based on that tree. So I do think next time you
00:13:05should find out what that tree is. Cause that meant a lot to somebody. Okay. There was a packet
00:13:09and engineers happened. I do agree. The tree is probably important. It's just not important
00:13:14for the bug. And even that, so that I will say there was no need for them to explain the tree.
00:13:19So I'm okay that they skipped out on what the tree is doing. But I got a quick question as well.
00:13:25Yes. Is it called a tree because it's a root cause analysis or no?
00:13:29No more jokes. We're too off topic. I'm sorry. I'm sorry. So anyway, this is supposed to point to that.
00:13:37And that, that sort of this, this load balancing scheme basically of DNS entries and the way that
00:13:46they described this in like their presentation is they would use a thing like, I'll say,
00:13:52plan145 dot DynamoDB, like, plan145.ddb.aws, right? Now this is the root of that tree, I guess,
00:14:02not root cause analysis, but like this tree, this would contain like, this is the top level record
00:14:07of a bunch of records that allow it to do its load balancing. And I assume Route 53 kind of
00:14:13has this load balancing capability. I'm reading between the lines of the presentation. They didn't
00:14:17say that outright, but I'm assuming Route 53, which they're doing all this through, you know,
00:14:21which is their own DNS thing, allows that load balancing to happen by, you just set stuff up in
00:14:26here that says how the load balancing should sort of be working right now. And then it will pick the
00:14:31correct machine based on like some kind of randomization in the weights or whatever. Now,
00:14:35what they said was this name, which really does exist. And apparently there's a tree or something
00:14:41like this. This name is one that they just kind of used for the presentation. They never actually
00:14:48used a human readable name for this plan, like the 145 that I've written here or whatever. It was
00:14:53really a hash of something. So it would really be like, you know, 0AFE12, you know, 9A
00:15:00or something like that, right, is actually what would be there. So if you went and looked, you
00:15:05would not see a human readable name, or at least at that time, you wouldn't, I guess you wouldn't see
00:15:09like plan 145. You'd just see that. And so the idea was, okay, a user goes to use it. They query
00:15:15this name, Route 53 will direct them like to here. And this thing is some kind of a load balancing
00:15:22tree that Route 53 can use that will allow you to get where you need to go. Right. They will give
00:15:27you an actual machine you can send traffic to eventually. Again, they did not describe any of
00:15:32that. So I have no idea how any of that works. I've never touched or used Route 53. So I have no idea,
00:15:38but we'll just assume that that happens because it doesn't matter for this bug.
00:15:41We do have an AWS hero. So if you do, if you are confused, you can always
00:15:45ask Adam and he may have further insights. I mean, yeah, go for it.
00:15:50Well, Route 53 does have a lot of different ways you can like split the traffic. So yes,
00:15:54weighted is one of them. And that sounds like what they described.
00:15:57So somehow they've set up these records with that. And they just didn't say how, but something,
00:16:02something in a tree format did that. My guess is there's like a weighted, like the tree has like
00:16:07weighted like there's a couple of weights at the top that branch out to more weights or something
00:16:11like that, because that's easier for it to deal with because there's a lot of them or something.
00:16:14Who knows? Anyway, I have no idea. Point being, this is what's supposed to be happening normally.
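To make the "weighted array" reading concrete, here is a minimal Python sketch of picking a machine in proportion to a weight. It is only an illustration of the idea as described here: the addresses, the weights, and the `pick_machine` helper are made up, and nothing in it reflects how Route 53 or the DynamoDB tree is actually implemented.

```python
import random
from collections import Counter

# Hypothetical machines and weights; none of these values come from AWS.
# A weight of 0 stands in for a machine that has been taken out of rotation.
weights = {"10.0.0.1": 5, "10.0.0.2": 3, "10.0.0.3": 0}

def pick_machine(weights, rng):
    """Pick one machine at random, in proportion to its weight."""
    machines = list(weights)
    return rng.choices(machines, weights=[weights[m] for m in machines], k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
counts = Counter(pick_machine(weights, rng) for _ in range(10_000))
print(counts)
```

Lowering a machine's weight makes it less likely to be handed out, which is the "if a machine gets behind, set its weight lower" adjustment described above.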
00:16:20Now, the reason that this is called plan 145 here, even though it actually would have been
00:16:24some hash code, but they refer to it as like plan 145 is the load balancing, as you might imagine,
00:16:31has to be kind of continuous because the DynamoDB machines are like doing stuff all the time. They're
00:16:38becoming more overloaded. There's machines going down or crashing or who knows what, right?
00:16:42Could be happening, being taken offline. New capacity can be added. And so this stuff has
00:16:49to be updated constantly, like all the time. So this main API endpoint that you connect to,
00:16:56it constantly has to have that tree that it's pointing to be adjusted. And so the way that
00:17:02they do that is they create another tree, the tree that they're going to move to, right? They create
00:17:09like, you know, plan 146 or something. And they make the whole tree here. And then when they're
00:17:18ready, like when this tree is done, they take this, you know, this record here, and instead of
00:17:24it pointing to that one, they point to this one, right? So you make the new one, and they move over
00:17:28to it by just changing that name. Now, for some reason, and this reason is not really explained.
00:17:36The way that they've set up that process is they split it into two pieces. There's something called
00:17:44a planner, which figures out what the new tree should look like, basically. So you can imagine
00:17:50there's some machine called a planner. And I don't know if it's an actual machine or if it's just a
00:17:56process running on some machine that's running other things, who knows. But there's something
00:18:00called a planner. And as far as I could tell, there's only one, meaning there's just a planner
00:18:06that sits there and figures out what should the new plan look like that we're going to switch to.
00:18:13And it's constantly doing this. So it generates plan 145, then it generates plan 146, then it
00:18:18generates 147, 148, 9, 10, you know, blah, blah, blah, blah, blah, right? And it just keeps putting
00:18:25out plans for all of eternity, because that's its job. Now, it never actually creates them,
00:18:31apparently. Its job is not to ever make them in Route 53. It's just to figure out
00:18:40what they would be if someone were to put it into Route 53. Then they have three enactors.
00:18:50These enactors get the plan from the planner, and they put it into Route 53.
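As a rough sketch of that split, assuming a plain dictionary stands in for the DNS zone: the planner only describes a plan, and an enactor writes the plan's records and then switches the endpoint over by changing one name. The function names, plan names, and addresses below are hypothetical, and the real system's records are certainly more involved than this.

```python
# Toy stand-in for a DNS zone: name -> what that name resolves to.
# All names and addresses here are made up for illustration.
ENDPOINT = "dynamodb.us-east-1.api.aws"

def make_plan(number, machines):
    """Planner: describe the next plan without touching the zone."""
    return {"name": f"plan{number}.ddb.aws", "machines": machines}

def enact(zone, plan):
    """Enactor: write the plan's records, then repoint the endpoint at them."""
    zone[plan["name"]] = plan["machines"]  # build the new tree first
    zone[ENDPOINT] = plan["name"]          # then switch over by changing one name

def resolve(zone, name):
    """Follow the endpoint to its plan and return that plan's weighted machines."""
    return zone.get(zone.get(name), {})

zone = {}
enact(zone, make_plan(145, {"10.0.0.1": 5}))
enact(zone, make_plan(146, {"10.0.0.1": 2, "10.0.0.2": 6}))
print(zone[ENDPOINT], resolve(zone, ENDPOINT))
```

Note that the old plan's records are still in the zone after the switch, which is why some kind of cleanup is needed later.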
00:19:06Does this make sense? Now, one planner, as far as I understand the presentation,
00:19:11three enactors. There was no explanation for why this would be the case. They said the reason there
00:19:18are three enactors is because it's supposed to be fault tolerant, like if one of them goes down or
00:19:22something. But they never explained why you wouldn't then need three planners, because if the planner
00:19:28went down, then the enactors have nothing to enact. So it didn't really make any sense. So there wasn't
00:19:33an explanation in the thing about why this structure looks the way it does. It's not really
00:19:38that important to the bug that it looks this way, although it kind of is, as we'll see later. So I
00:19:43was a little weirded out by the fact that they didn't justify this, but that's fine. So hopefully
00:19:50that makes sense. We have a planner. We have three enactors. The enactors are all trying to enact this
00:19:55plan. Now, what happens here is that for, again, reasons that the only thing they said in the
00:20:04presentation was it makes it easier to reason about. This is the only information about. They
00:20:11said it makes it easier to reason about. Because it makes it easier to reason about, these enactors
00:20:18use serialization. So instead of them just trying to create records, and if the records are already
00:20:26there, just not creating them or something, in other words, I have three people running.
00:20:29We all want to create, you know, let's say this top level record, plan146.ddb.aws, right?
00:20:36We all are trying to do that. One of us does it first. The next person tries to do it, and it's
00:20:42already there or something, right? We're all trying to create the same record. So in theory, we could
00:20:48just have three people randomly hammering on whatever part of the plan they're trying to hammer
00:20:52on, and in theory it should kind of all work, right? And I sort of got the sense, although
00:20:57he didn't come out and say it, I sort of got the sense from the presenter that he would agree with
00:21:01what I just said, meaning that they could have just had them run arbitrarily and it would or should be
00:21:08okay. But, he said, they use serialization to make it easier to reason about. What that means is
00:21:15instead of these enactors just hammering on it like that, what they do instead is they attempt to
00:21:21acquire a lock for whatever the endpoint is that they're trying to update. So in other words, if
00:21:28this person is trying to update one of these things, and I got the sense that it was if you're trying to
00:21:35update this one, but it could have been if you're trying to update this one, or it could have been
00:21:41on both. They never really 100% said, if I remember correctly, exactly where the locking
00:21:46was occurring. But the locking occurs by them going, okay, I'm going to create a lock that is
00:21:56a DNS record. And by using the fact that Route 53 has the idea of an atomic, which is,
00:22:02you know, I can do two things and if they both wouldn't succeed, then it won't do either of them.
00:22:08They basically made a locking system that locks via Route 53. So Route 53's DNS records are actually
00:22:15the lock record, if that makes sense. Can I ask a quick question? Yes. You said it does this through
00:22:21serialization? I don't quite understand what that means. Because I thought serialization is just
00:22:25converting from one memory to a different memory representation of some. I'm sorry, different
00:22:31serialization. So yes, that is serialization. In this case, we mean literally temporal
00:22:40serialization, meaning they wanted these enactors to have some kind of a way in which they would
00:22:48organize their behavior into an order rather than just being arbitrary. And the way that they did
00:22:55that was locking. So what will happen is, instead of this person just doing whatever it is they're
00:23:03going to do, like, okay, I'm going to like, I finished this, I'm going to point this guy at
00:23:07plan 146 now. Instead of doing that, it attempts to acquire a lock on like this, right? And if it
00:23:14doesn't get the lock, it won't make the change. So only one of these enactors can be in the process
00:23:21of updating this at any given time. Does that make sense? Mm hmm. Now again, exactly what they were
00:23:28trying to do with that was never explained. They just said makes it easier to reason about and left
00:23:32it there. So I don't know why they thought this was an improvement. And amusingly, it's what ends
00:23:38up uncovering the bug. So it wasn't an improvement. If anything, it was probably bad. But so Casey,
00:23:42are you saying they don't have like, they don't have a good reason for they're saying we're going
00:23:47to make the enactors run almost like one at a time? Why do they have a, why do they have three
00:23:52enactors? I don't understand. Like, why do they not just have one? They just don't say that. We don't
00:23:56know why. And they didn't quite explain, like, I didn't really hear an explanation for how you
00:24:02have three concurrent enactors. You expect them to be able to go down, which is why you have three.
00:24:07Right. But they're taking a lock. So what happens if this guy takes the lock and then goes down?
00:24:13Like, I didn't hear an explanation for that either. So this was all very confusing to me. Like I,
00:24:18I, I'm not complaining about it as part of what we're talking about here, because it's not important
00:24:25for the cause to me. But as a presentation, I had so many questions. Like I was like, I don't
00:24:32understand why you did any of this to be completely honest. Right. And maybe that's, again, part of it
00:24:38could just be that I don't use AWS services. It might be that some of these things would be obvious
00:24:43if you are someone who regularly uses route 53 or something, you'd be like, oh, it's because
00:24:47locks can be set to a timeout or I mean, I don't know. Right. But anyway, so yeah,
00:24:53so they're doing that. And what ends up happening for, for this, the thing that uncovers the bug
00:25:02is that what ends up happening is these enactors, when they don't get the lock, they just do like
00:25:08a back off, right? They'll basically just be like, okay, let me wait and I'll try again. So an enactor,
00:25:14this enactor tries to get the lock, but somebody else already has the lock. So he just waits a
00:25:18little while. He tries to get the lock again. That's what will happen. Right. And what they
00:25:24said happened was they hit a pathological case, quote unquote, where one of the enactors is,
00:25:29you know, has enacted some plan. And that plan, let's say was pretty old. I think they used 110
00:25:35was an example that they used. So it enacted plan 110. And it wants to point, you know, it's like,
00:25:43I got to set the API to point to my 110. It tries to get the lock to update dynamodb.us-east-1 or whatever,
00:25:51and fails because someone else is enacting plan 111 or something like that. Right. Or plan 109 could
00:25:57have been a previous plan. So the other enactors are doing it. It can't do it. It backs off. Right.
00:26:02And remember this enactor here, we're on 110. It's trying, it really wants to enact it.
00:26:07It tries again. Someone else has the lock. Now it tries again, still locked. This person is sitting
00:26:13on 110, desperately trying to enact. It can't do it. Apparently this just happened so many times
00:26:19that the other enactors and the planner is just churning out new plans this whole time. Right.
00:26:23The other enactors, they get up to like 145 or something and 146 they're enacting plans that are
00:26:28like way ahead of 110. Right. And this guy's still stalled because he just unluckily never gets the
00:26:35lock. Right. Finally, at some point after like plan 145 has already been enacted and pointed to by some
00:26:44other enactor and all that stuff, plan 110, this enactor, still trying to do it, finally gets the
00:26:49lock. I mean, it's like, yeah. And so then he says, okay, we're pointing to 110 now. Yes. Right.
00:26:58So now it's on a super old stale plan, but this really shouldn't be a problem. Right.
00:27:03Because eventually the next time some enactor has something, it's going to be a much later plan.
00:27:07They'll just enact plan, you know, 146 or seven or eight or whatever. And we'll re-point it back
00:27:12to this and we're back to a fresh plan. So everyone will just have bad load balancing for like a few
00:27:17minutes, but then it'll be fine. Right. They did have bad load balancing for at least a few minutes.
00:27:22Right. Yes. True. Well, it's a lot worse than that. That's what was supposed to happen. Right.
00:27:30Meaning that's how they would expect this to work too. Okay. The problem is these,
00:27:36they also didn't want Route 53 to become clogged with all of these records. Because if they just
00:27:42left them around, eventually after, you know, three months, you have like 8 billion records
00:27:49that you stuffed into Route 53 for every, you know, couple minutes you're putting in this big tree of
00:27:54weights and stuff. They were like, okay, at some point we should just clean up these plans.
00:28:00So enactors also look for plans that are older than a certain amount. And if they are older than
00:28:08a certain amount, they'll delete them. So what happened was they pointed to plan 110. This
00:28:13enactor finally gets the lock. It points to 110. Another enactor is like, oh, wow, 110, man, that
00:28:19is old. We should get rid of that and deletes it. So now dynamodb.us-east-1.api.aws is pointing
00:28:29at a record that can't be resolved. Right. It's just something, it would actually, again, it
00:28:34wouldn't look like plan 110. It would look like 0AFE129A, some hash, dot, right, ddb.aws. But
00:28:44it's pointing at that name. And if you ask that name, you get nothing.
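The sequence just described can be replayed in a small Python model. This is a sketch under stated assumptions, not AWS's code: a dictionary stands in for Route 53, `MAX_AGE` is an invented cleanup threshold, the lock is not modeled (each call is simply assumed to happen while the caller holds it), and the cleanup is assumed not to check what the endpoint currently points at, since that detail isn't given.

```python
ENDPOINT = "dynamodb.us-east-1.api.aws"
MAX_AGE = 20  # assumption: plans this many versions behind the newest get deleted

def plan_name(n):
    return f"plan{n}.ddb.aws"

def plan_number(name):
    return int(name[len("plan"):].split(".")[0])

def write_plan(zone, n):
    """Create the records for plan n (one fake address stands in for the tree)."""
    zone[plan_name(n)] = ["10.0.0.1"]

def point_endpoint(zone, n):
    """Repoint the public endpoint at plan n; assumes the caller holds the lock."""
    zone[ENDPOINT] = plan_name(n)

def cleanup(zone, newest):
    """Delete plans that are MAX_AGE or more versions behind the newest one."""
    stale = [name for name in zone
             if name != ENDPOINT and plan_number(name) <= newest - MAX_AGE]
    for name in stale:
        del zone[name]

def resolve(zone, name):
    """Follow endpoint -> plan -> addresses; None means no records were found."""
    return zone.get(zone.get(name))

zone = {}
write_plan(zone, 110)        # the slow enactor builds plan 110, then waits on the lock
for n in range(111, 146):    # meanwhile the other enactors enact plans 111 through 145
    write_plan(zone, n)
    point_endpoint(zone, n)
point_endpoint(zone, 110)    # the slow enactor finally gets the lock
cleanup(zone, newest=145)    # another enactor sees plan 110 as old and deletes it
result = resolve(zone, ENDPOINT)
print(zone[ENDPOINT], result)
```

At the end of the run the endpoint still names plan 110, but plan 110's records are gone, so the lookup comes back empty.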
00:28:46So what would happen at that point is everyone who was trying to get
00:28:51an endpoint to send stuff to would get back an unresolvable name, basically. Right. And I don't
00:28:56really know what happens in Route 53 when that occurs, but you would basically be getting back
00:29:01something that you either couldn't use or just gobbledygook for an IP, who knows. But whatever
00:29:07it was, if you attempted to actually use it, you weren't going to get a response. Right.
00:29:10Interesting. Is this because AWS doesn't use enough Rust because that's obviously a use-after-free
00:29:15bug? And so I think Rust would have solved that, right? If you rewrote Route 53 entirely in Rust,
00:29:21obviously, all of these problems are not there. No, to be specific, I do think in the presentation,
00:29:30they did say, not about Rust, but they did say what would happen specifically, which is I think
00:29:35when you asked for this thing or either this thing or this thing, I don't know which one they were
00:29:40referring to, because I can't quite remember, you would just get back a thing that says no records
00:29:44found. So that's the end game of what would happen, whether it was from asking for this or asking for
00:29:50that, I'm not sure, but just get back no records found. That's what you would have received when
00:29:55you were trying to call that API. So whatever library you were using to use DynamoDB, it would
00:30:01just be like, hey, no records found, bro. Sorry. Right. So this, if you ask anyone on the internet,
00:30:11right, they're all like, yes, they explained the bug. That's the bug. The bug is that there
00:30:16was this race condition, right? Everyone, because everyone, as soon as you say race condition,
00:30:20everyone's brain shuts off. They're like, oh, okay, well, it was a race condition. Done. Nothing to see
00:30:24here, right? So they're like, it's a race condition. They explain it. It's like, no, they didn't explain
00:30:30it. Because if you think about what would happen here, immediately after this, everyone's getting
00:30:36this, it's a new enactor. A new enactor will just enact a new one, right? And so the bug, right,
00:30:44is why didn't that occur? That's the actual RCA that I wanted to see is why didn't the next
00:30:52enactor come and fix it? Can I throw out something else? Wouldn't it also be a bug? Like why write
00:30:57a record so old that it should be deleted immediately? Well, it wasn't. It was because
00:31:02this guy had written it quite a long time ago. And it was the wait. Well, I mean,
00:31:08if you're asking, why didn't they write the enactors with better code? Yeah, that's a pretty good question.
00:31:11Okay, fair. It seems like if you're updating to something that should be deleted immediately,
00:31:17isn't like that's like that feels like the problem right there. You've done something wrong
00:31:21long before. Yeah, even though it doesn't really fix the theoretical structure of this thing,
00:31:26a simple check in this guy when after he finished backing off on the lock, he should maybe check to
00:31:30see whether he's about to set this to something that he would delete if he was running his deletion
00:31:36code is probably a good safety measure. But yeah, so 100% agree with him. Okay, but an enactor worked
00:31:41really, really hard to get that record. Waiting a long time. Oh, it's gonna have its Pokemon cards.
00:31:49Anyone ever waited. So just let him write the record. Okay. So, so I want to hear about that.
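The safety check suggested a moment ago (after waiting on the lock, check whether the plan you are about to apply is one your own cleanup would delete) could be sketched roughly as below. The generation numbers, the `keep_last` window, and both function names are assumptions made for illustration; the actual cleanup rule is not described in the presentation.

```python
# Sketch of a "would I delete this myself?" check, under assumed rules.
from typing import Callable


def would_be_deleted(candidate: int, newest_applied: int, keep_last: int = 10) -> bool:
    """Assumed cleanup rule: plans more than keep_last generations old get deleted."""
    return candidate < newest_applied - keep_last


def should_apply_after_lock(
    candidate: int,
    read_newest_applied: Callable[[], int],
    keep_last: int = 10,
) -> bool:
    """Call this once the lock is finally held, not before the wait started."""
    newest = read_newest_applied()  # re-read current state after the delay
    # Skip the write if the plan is already stale enough to be cleaned up.
    return not would_be_deleted(candidate, newest, keep_last)


# A delayed enactor holding plan 110 while plan 145 is already live:
print(should_apply_after_lock(110, lambda: 145))  # False: skip the stale plan
print(should_apply_after_lock(146, lambda: 145))  # True: newer plan, apply it
```

As noted in the discussion, this only narrows the window; it does not change the underlying structure that allowed the stale write.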
00:31:56Unfortunately, if you look at the presentation, and you look at the RCA, it's nowhere to be found.
00:32:03The presentation at least has one 12 second little tiny chunk where it does say where the bug roughly
00:32:13would be. And so let me explain what that is. So what apparently occurs alongside this, so
00:32:22when you do dynamodb.us-east-1, when you point that at your plan, you also do
00:32:30another operation at the same time. And that operation is to set rollback.
00:32:40I think it's DD. Is it DDB dot rollback dot AWS? I don't remember exactly what it is here.
00:32:49There is a rollback record. It sets that record to whatever the old plan was. So if we were here
00:32:57pointing at 145, and we're now going to point at 110, right, this old enactor is like, I'm moving to
00:33:03the 110. It attempts to take whatever this name was, right, currently, and move that
00:33:13name, which would have been plan 145, so that the rollback address points at the old plan.
00:33:18Right. And this is just for debugging. Or, you know, it's basically just for operator ease, right?
00:33:24If they want to roll back to the previous plan or something like that, or if you just want to know
00:33:29what the previous plan was, you can see it here, right? That's part one of what they
00:33:35said about the failure. I would want to point out one thing here, which is that this also didn't make any sense to
00:33:40me. Because I was like, okay, you're telling me that these things update every like minute or
00:33:45something. What good is it to have one of those? Like, by the time you even logged in, it's been
00:33:53updated from the one that you wanted to roll back to to some new thing. That's actually the plan you
00:33:59don't want because everything went down, right? You don't want this,
00:34:04you just want these names in a list. So you can be like, what was it at at 1230? Like that one,
00:34:10right? So this made no sense to me. I have literally no idea why this would ever be good,
00:34:16right? It did not sound like it would do the thing you actually want, which is to be able to mark a
00:34:21point in time and go, we need to go back to 1pm because everything went to crap after that, right?
00:34:26Anyway, so that didn't make any sense to me. But again, not exactly there to the bug. So I didn't
00:34:31ask why. I'm just saying, okay, that's the thing it had to do. And it can only roll back one version
00:34:36is what you're saying. Yeah, even though the other trees do exist. So you easily could by just knowing
00:34:42what the name was. So all this is, is putting a human-readable name on something you almost
00:34:48certainly don't care about. Right. But they don't really they can't really store that much stuff.
00:34:54Casey, I don't think they can really put like, I don't know, Adam, like this, they don't have a lot
00:34:57of scale there, right? Like, that's a lot of lines. If it were me, I would have just made this a
00:35:04timestamp. If that's what you wanted, right? I would have said, when did the planner or when did this
00:35:09person point to this thing? Like when you got the lock, you change this name to the timestamp,
00:35:15and update this in one atomic. So then you just know if I want to roll back to 1pm, I just look
00:35:20for like, whichever had the timestamp, just, you know, the latest timestamp, not after that time.
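The timestamp idea can be sketched as a sorted history plus a lookup for the most recent entry at or before a given time. The history format, the plan names, and `plan_running_at` are assumptions for illustration, not anything described in the presentation.

```python
# Sketch: record when each plan became live, then look up what was live at a time.
import bisect
from typing import Optional

# (unix_time_applied, plan_name), kept sorted by time. Values are made up.
history = [(1000, "plan-143"), (1060, "plan-144"), (1120, "plan-145")]


def plan_running_at(entries: list[tuple[int, str]], when: int) -> Optional[str]:
    """Return the plan with the latest timestamp that is not after `when`."""
    times = [t for t, _ in entries]
    i = bisect.bisect_right(times, when)  # number of entries with time <= when
    return entries[i - 1][1] if i else None


print(plan_running_at(history, 1100))  # plan-144
print(plan_running_at(history, 999))   # None: nothing had been applied yet
```

With a list like this, "go back to what we were running at 1pm" becomes a lookup rather than a single rollback slot that has already been overwritten.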
00:35:28And that's what we were running at that time. That's what I would have done. Right. But I don't
00:35:32know. So I have no idea why they did this. They did what they did. You know, it might make
00:35:36perfect sense. Again, I have no knowledge of their system. All these things, they might make perfect sense.
00:35:40So I'm just saying I don't understand them. They might not be bad ideas, right?
00:35:45They might be good ideas, if you understood the rest of the system. So anyway, so what they say,
00:35:50and this is all we get is this operation, meaning setting the rollback to point to the old plan that
00:35:59was being, you know, which in this case would have actually been newer in some cases, right? So it's
00:36:03not really the old plan, it's the previously pointed-to plan, which may be older, may be newer. Doing that activity.
00:36:11If that plan no longer existed, meaning like it had been deleted like this,
00:36:18then the enactor stops permanently. So every time, like once you get into a state where dynamodb.us-east-1,
00:36:26right? So we do the whole sequence of steps that we said here. This plan gets deleted.
00:36:31So now this is pointing at an invalid, like unresolvable name. We cannot resolve
00:36:36plan-110, which is actually some hex code. But whatever that was, we can't resolve that anymore.
00:36:41Once that state is true, then the next time an enactor comes and tries to make it point to a new
00:36:50plan, whatever that new plan is, it cannot like when it actually gets this far and tries to set
00:36:58the rollback that will crash it permanently. Therefore, all three enactors will now stop
00:37:06because eventually all three will try to enact a new plan. They will try to set the rollback
00:37:11first to point to whatever the old plan was, find that there's no plan there. And that
00:37:16apparently is just a hard crash. Oh, that's crazy. I thought the three enactors were supposed to make
00:37:24it so that it had redundancy. Now, again, this is why I get grumpy with people online who are
00:37:32like replying. They're like, it was a race condition. It wasn't a race condition. The race
00:37:36condition is not necessary for this. The race condition is just why you ended up with this name
00:37:44being unresolvable. But if you didn't have whatever code did this badly, it would have just worked.
00:37:52You never would have known. You would have had a momentary minute outage of DynamoDB or something,
00:37:57but I'm guessing there are minute outages of DynamoDB from time to time. That's not global news.
00:38:04What's global news is taking it down permanently, which is what happened here. And until an actual
00:38:09human goes and figures this out, resets it, gets these enactors going again, it's just gone. It's
00:38:15just out permanently. So hours potentially. And it was long enough, I guess, in this case to then have
00:38:21cascading failures. You would never have had that. It's just a momentary out. If some people
00:38:26momentarily got an unresolvable name or no records, then they would just try again. That's usually
00:38:32like with DNS, that's like your phone, you went through a tunnel. That's all that would have been.
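To make the "all three stop" sequence concrete, here is a small simulation. It is a toy model built only from what the presentation is said to describe: an enactor sets the rollback record from the current endpoint, and stops for good if that plan no longer exists. The class names and the `running` flag are invented for this sketch, and how the real enactor fails is unknown.

```python
# Toy simulation: once the endpoint dangles, every enactor halts on its next run.


class Dns:
    """Toy record store: which plans exist and what the endpoint points at."""

    def __init__(self) -> None:
        self.plans = {"plan-145"}
        self.endpoint = "plan-145"
        self.rollback = None


class Enactor:
    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True

    def enact(self, dns: Dns, new_plan: str) -> bool:
        if not self.running:
            return False
        # Step 1: point the rollback record at the currently live plan.
        if dns.endpoint not in dns.plans:
            self.running = False  # models the permanent stop described above
            return False
        dns.rollback = dns.endpoint
        # Step 2: publish the new plan and repoint the endpoint at it.
        dns.plans.add(new_plan)
        dns.endpoint = new_plan
        return True


dns = Dns()
enactors = [Enactor(n) for n in ("a", "b", "c")]
enactors[0].enact(dns, "plan-146")  # a normal update succeeds

# The live plan disappears. Here it is simply removed; no race is simulated.
dns.plans.discard(dns.endpoint)

results = [e.enact(dns, f"plan-{147 + i}") for i, e in enumerate(enactors)]
print(results)                        # [False, False, False]
print([e.running for e in enactors])  # [False, False, False]
print(dns.endpoint)                   # plan-146, still dangling
```

Nothing in the sketch depends on how the plan went missing, which is the point being argued: in this model, the redundancy of three enactors does not help once the endpoint is dangling.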
00:38:37So I want to know what did the code look like here? How did you write something
00:38:45that if this wasn't a valid name, which it wouldn't even be on startup, meaning if you were
00:38:50starting this system and the operator hadn't pre-configured it, it wouldn't be pointing to
00:38:55anything. That's the default case that you would think you'd start with. So if you're going to do
00:39:01this, you would think you would just handle that case because the rollback address could just not
00:39:07point to anything. Just take whatever this is. If it's nothing, set the rollback address to nothing.
00:39:12Done. So there's something really weird about the way they wrote this code. And that is what should
00:39:18have been in the RCA. That's the whole bug to me. This is just set dressing for how we ended up
00:39:25having this thing point to nothing. The same bug would have occurred if someone had accidentally
00:39:31deleted this record. Like some operator was just like, oops, crap, I set it to nothing.
00:39:35This same bug would have happened according to the presentation. So the root cause is not the
00:39:40race condition. The race condition is an aside. Does that make sense? Quick question. So I'm
00:39:46legitimately thinking through this. And so that means the thing that sets the rollback probably
00:39:51assumes some sort of struct with a bunch of memory or something has been passed in, does some sort of
00:39:56like some sort of access. It explodes. Or do you think this is the same style of bug,
00:40:03which is the one line that took down Cloudflare, which is they just assume it's there and unwrap it.
00:40:07It's in Rust. It is memory safe Rust. Unwraps it, explodes it.
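The two styles being contrasted here, "assume it's there and blow up" versus "if it's nothing, set the rollback to nothing", might look something like this. Both functions are hypothetical; the actual AWS code is not public, so this only illustrates the shape of the two approaches.

```python
# Two ways to set the rollback record from the current endpoint.
from typing import Optional


def set_rollback_strict(records: dict, endpoint: str, rollback: str) -> None:
    """Assumes the endpoint and its target both exist; raises KeyError otherwise."""
    previous = records[endpoint]
    records[previous]  # lookup of the previous plan; raises if it is missing
    records[rollback] = previous


def set_rollback_tolerant(records: dict, endpoint: str, rollback: str) -> Optional[str]:
    """Treats a missing or dangling endpoint as 'nothing' and lets that flow through."""
    previous = records.get(endpoint)
    if previous not in records:
        previous = None
    records[rollback] = previous
    return previous


# The endpoint points at plan-110, but plan-110 has no record of its own.
records = {"endpoint": "plan-110"}

try:
    set_rollback_strict(dict(records), "endpoint", "rollback")
except KeyError as missing:
    print("strict version stopped on missing record:", missing)

print(set_rollback_tolerant(records, "endpoint", "rollback"))  # None
```

The tolerant version also covers the first-boot case mentioned above, where the endpoint has never been set and there is nothing to roll back to.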
00:40:12I really don't know. My guess, like in my head, I was like, what is the thing that I see people
00:40:19do a lot of times where I'm always like, why would you ever do this? But it's just because that's the
00:40:24way they learn to program. And I was thinking like, if you were writing in one of these languages that
00:40:28likes to throw exceptions for error conditions, this would be a great example of that. So if you
00:40:34had a thing where you were like, oh, I went to go get the DNS record that this thing points to.
00:40:40And normally in a sane programming environment, no one is throwing an exception there. If they
00:40:45get back nothing, they just return nothing. And then when the person goes to set ddb.rollback.aws,
00:40:51they just set it to nothing, which is the correct behavior. Like nothing flows, literally the value
00:40:56nothing flows correctly through this flow. So if you were writing it to be, since it is a core
00:41:03foundation service, assuming you were trying to write something that was fault tolerant,
00:41:08you would never do something like throw an exception. So in my brain, I'm thinking,
00:41:11I bet what happens in here is when you ask for this record, they just use some library
00:41:16call or something that throws an exception when the record doesn't exist. And it just
00:41:20threw an exception and the enactor was done. That's my guess. And I could be very wrong about that
00:41:25because it's just a wild guess. But this is why I want to see the RCA. What was it? It could be
00:41:31exactly the stuff that Trash was talking about. I mean, it could be stuff that Prime was talking
00:41:34about. It could be the stuff that I just said. It could be anything. And I want to know because
00:41:38that's where the actual education would be here. Avoiding this race condition is completely
00:41:43unimportant. This race condition could have lived there. And while it was important eventually to
00:41:48fix it, to avoid those once a year weird outages for five seconds or something, it is not actually
00:41:56the thing that we most want to learn. What we most want to learn is don't write this thing. And we
00:42:00don't know what this thing even was. So how do we not write it? This is why I think it was a
00:42:04bad RCA. Does that make sense? Yes. Yes. All right. What is most of AWS written in, Adam?
00:42:11It was Java. I was about to say someone from the chat said Scala. They said they worked at
00:42:16AWS for seven years and they said most of it's written in Scala. Well, that's technically Java
00:42:21with extra steps. And that will anger all of them endlessly. So that's really it for me.
00:42:34This was a thing where I was like, I don't feel like I saw the explanation. And I actually feel
00:42:38like it's important to hear because there was a bad programming practice at the bottom of this
00:42:42somewhere. And I want to know what it was, especially because it helps people like me when I, you know,
00:42:46I don't really do a lot of architecture education right now, but at some point I probably would like
00:42:51to do some of that because I think there's a lot of bad architecture out there. And so I kind of
00:42:56try to pay attention to these things. Like what are the kinds of architectural mistakes that people are
00:42:59making? And I bet this was one of them. Right. And so I'd like to know. I'd like to know.
00:43:04Yeah. I think like what I would expect is like at least like one simple reproducible example of like
00:43:10why it blew up, like a whole little code snippet. So like, and this is something you
00:43:16brought up earlier is like kind of like how we approach these type of things. Like if I'm like
00:43:18reviewing someone's code and I see something that looks weird, I will always do my best to make my
00:43:23own little sandbox and like prove my theory out. And then like actually show them the code of like,
00:43:29this is why this is probably wrong. Here's like a small, simple reproducible step. So I would expect
00:43:33something like that. And that also helps me like truly understand. Cause a lot of people, like you
00:43:37said, they'll see something like that looks funny, but I don't know why it looks funny, but I can't
00:43:43stop there. I gotta like actually like build it out and then like understand. So that's what I would
00:43:48expect. And you know, like, like I said, the CrowdStrike and the Google outages, I thought were better
00:43:55like just telling you that they were like, look, it was a null pointer deref in here, or it was an
00:43:59out of bounds array because we thought there was only going to be 20 and we put 21 in the
00:44:03config file. Right. And like, okay, I know exactly what kind of code that, you know,
00:44:08is causing that kind of problem. Right. And furthermore, furthermore, to like an earlier
00:44:14comment, literally, as far as I know, everyone who programs in Rust only does it so that occasionally
00:44:21when they see something like this, they can say, well, if they'd written it in Rust,
00:44:24it wouldn't have happened. They were not given enough information to even make that comment.
00:44:29They probably made it anyway, to be fair, but they were not given it. So
00:44:34one rule that should be followed in RCAs is you have to give Rustaceans enough information to,
00:44:41if they so chose, correctly say that it would have been prevented in Rust.
00:44:46And this, we do not have that. We do not know whether this would have been prevented in Rust.
00:44:51We have no idea. It probably wouldn't have, but we don't know. Well, Casey, we do have a pretty
00:44:58good chance because it's like, probably would have never shipped. So it would have prevented it.
00:45:03True. We would have zero enactors because we would be designing said enactors. Yeah.
00:45:09Cloudflare does a really good job at this as well. They like go in and show like a lot of lines of
00:45:17code and say like, this is exactly what's going on. This is, you know, even though the problems up here,
00:45:21this is the line that exploded due to all these previous conditions. That was me making fun of
00:45:24Rust with the unwrap, which actually wasn't truly the problem. Uh, but you know, it's just like all
00:45:28these things kind of happen. So they, they do a really good job. I'm surprised at how poor of a job
00:45:33AWS has done for this one. Well, and the other thing too, is it, it was one of those things
00:45:39where it now it makes me, so it makes me unnecessarily suspicious of you, right?
00:45:44When I read this, I'm like, are you hiding something? Did you not really figure out what
00:45:48the bug was? Like you talked all about this race condition, but even from your own presentation,
00:45:52I can tell the race condition really wasn't important. That was just, that was just what
00:45:56led to the record having been set to nothing, but who cares, right? Like that's, that's like
00:46:00something that's nice to put in the RCA as like an explanation of why this bug occurred now,
00:46:05as opposed to some other time, but it's not the bug. So it's weird to me. Like when I see an RCA
00:46:10that doesn't talk about the bug now I'm suspicious. Right. And unnecessarily so, because if you actually
00:46:15did find it, then just tell me, and now I know you found it. Right. So it's like, I think it also is a
00:46:19confidence boost for the people who are looking from the outside who want to know, can they trust
00:46:24this DynamoDB thing? If it looks like you actually found the bug, I have a little more confidence in
00:46:28you. If it looks like you have no idea what the bug was, or don't seem to understand what the bug was,
00:46:33then I'm more concerned. And so I think that's also another reason to do this in your RCA.
00:46:37It, it provides confidence to your customers. Maybe that's where they fired Adam as an AWS hero too.
00:46:43Maybe it's all connected. Could be. They didn't want him exposing these dirty secrets.
00:46:48Yeah. He knew too much. He knew too much. Could you give a, could you give a quick,
00:46:53like three minutes summary of the guitar shop? Like what that, what that was revealing? Because I'm
00:47:00trying to remember what it was because it involved like a single point of failure guy who was out here
00:47:05for this failure as well. So I don't know how to reconcile the two things. And of course we have no
00:47:10idea. We have no idea if either are telling us the truth now, right? Because this was such a bad RCA,
00:47:16I have no idea if it's correct or not, but yes, the password was wishbone 12, I think.
00:47:22There you go. Always try to kill me. That's my recollection anyway.
00:47:26So yeah, that story was that, that there was the, there was a thing that was designed to
00:47:34copy configurations. And that thing had kind of gone rogue and could not be stopped. Like it was
00:47:42just like, it was just copying configurations totally incorrectly and it needed to be like fixed
00:47:47or repaired or something. And we don't have any more information because it was an overheard
00:47:53conversation. Right. And so does that comport with this? Well, a little bit, cause those enactors do
00:48:00sound like the kind of thing that would be running a configuration copy, but on the other hand,
00:48:05it's not really a configuration for machines. Like, a DNS entry is a DNS entry. It's not,
00:48:10it's not really a configuration. So I would say the two stories don't line up that well.
00:48:14And so that's another reason why I was kind of hoping that this RCA was a little bit more
00:48:19believable because I wanted to know for sure that the story was false. And I still don't really know
00:48:24based on how bad this RCA is. What if, what if the tool that the guy wrote to copy the configs is
00:48:31just literally the enactor? Like they just productionized it and he, and like they haven't
00:48:35changed it in seven years. That was kind of my connecting the dots. There was, he's like, guys,
00:48:42I wrote that as a way for me to test stuff in my local environment. And you just decided to make
00:48:48three enactors and put them next to each other in prod. I don't, how did this happen? I do.
00:48:53I have alternative questions. Yeah. Alternatively, is it the rollback? Because that's the one that
00:48:57did the copying of like, Hey, here's the previous one. Right. And so I'm going to copy the previous
00:49:02one. Then it gets like this null issue going on. And it's just like the script never encountered
00:49:07a null and just goes rogue and starts writing over and over and over and over again to where you
00:49:11can't, you can't do anything. I don't know. All I know is that like, as far as I can tell from
00:49:19their explanation, going only on what they were providing, I still just don't think the race
00:49:24condition's even relevant because again, literally an accidental update to the Route 53 endpoint
00:49:31would have taken down all three enactors immediately. Cause according to them,
00:49:35all that's required to stop them is if the, if the endpoint points at an unresolvable name,
00:49:41that's all you need. And so if that's really true, literally an operator typo could have taken all
00:49:47this down, no race condition necessary. Right. And so again, the RCA just does not do a good job
00:49:52convincing me that you've talked about what the real bug was, because I can think of so many ways
00:49:57that you could have triggered this exact same thing that don't involve this race condition that you
00:50:00spent the entire RCA telling me was the bug, but I don't think it is. So thank you, Casey, for giving
00:50:06us that amazing presentation. I am actually genuinely green with jealous rage for whatever
00:50:10that writing instrument is. I got to figure out how to set up what you have. That thing is fantastic.
00:50:15Thank you everybody for watching. I, uh, for those that caught it live, I hope you enjoyed
00:50:18the pre banter and probably a little bit of the post banter. If you wish to hear the extended and
00:50:22all the kind of fun interactions, that's not a part of the main story, head on over to Spotify
00:50:27for the full podcast, which is just us yapping about, I don't know what trash is eating and
00:50:31snacks and such the name more yapping, more yapping again, and also Casey TJ and trash.
00:50:42Errors on my screen, terminal coffee and living the dream.