Casey Explains the AWS Outage | The Standup

The PrimeTime
Computing/Software · Management · Internet Technology

Transcript

00:00:00Today's Standup is a special edition. Casey is actually doing the intro for us.
00:00:05Casey, what's today's topic? "Hello everyone, and welcome to The Standup."
00:00:12"According to the latest rankings, we're the 45th, no, 46th best tech podcast on Spotify." Fair enough.
00:00:29Anyway, sorry. There's a topic I want to cover on today's Standup. I want to talk about the AWS
00:00:34outage that happened in October, but I actually want to touch on a bigger theme as well.
00:00:41That theme is the difference between actually understanding something and merely thinking you understand it.
00:00:49It's especially common with early-career programmers,
00:00:56like people who have just started out in the field as junior programmers.
00:01:01I was like this myself: you want to look like you know things. You don't want people to think
00:01:08you don't get it. So there's this unspoken pressure from the people around you, whether it really exists or not,
00:01:14and even when something is a little fuzzy, or you haven't fully grasped it,
00:01:21you feel like you have to say "I understand," or at least pretend that you do.
00:01:26Even when it's not your fault. Even when the explanation was poor or key information was missing,
00:01:33there's an incentive to pretend you know. Saying "I know" makes you
00:01:39look smart, or at least keeps people from thinking you're inexperienced. But
00:01:43as I gained programming experience and got older, I learned something.
00:01:50These days I ask for explanations almost obnoxiously. My stance is "I don't mind looking stupid."
00:01:58"Wait, go back. I don't follow that part." "What does that term mean?" That kind of thing.
00:02:03I no longer care how I come across. I just genuinely want to know.
00:02:06I've been burned badly, many times, by pretending to understand or pretending to know.
00:02:12So when I explain the cause of a bug or the reason for a performance regression, I want to be convinced to my core.
00:02:18I always try to think, "If I haven't reached the truth yet, something else may be lurking."
00:02:22Until you've dug all the way down, the real cause may still be hidden. I don't want to
00:02:28compromise and move on just because it's convenient.
00:02:33The reason I want to talk about the DynamoDB outage is that there have been a string of big outages recently.
00:02:38For example, the major outage at Google was caused by not handling an empty field.
00:02:43When the JSON was loaded and the contents were empty, it dereferenced a null pointer. A truly elementary mistake.
00:02:48There was also the CrowdStrike incident, the one that caused blue screens all over the world.
00:02:55That one came with an excellent explanation: "There was a problem in how the array size was designed,
00:03:01and when the number of rules grew too large, the array overflowed."
00:03:07As RCAs (root cause analyses), these were very easy to follow. When I read the explanation of why
00:03:12things went down, almost no questions remained. Even without the
00:03:17specific line of code being published, there was enough information to understand how the programmers wrote the code
00:03:23and what mistake they made. I could say, "Ah, I see. Yeah, you shouldn't do that."
00:03:28Now, the DynamoDB incident. We've talked about it on this podcast before, right?
00:03:33That story about the guy at Guitar Center who overheard someone talking in a pub.
00:03:39"There it is, the rarely seen 'programmer.' Normally a simple creature that toils alone in the dark,
00:03:45but the moment it spots a mistake on the internet, it begins typing furiously. Upon reaching 120 words per minute,
00:03:50it flinches at its natural predator, the light-mode website, and abandons the chase. Until next time."
00:03:55"When not at its PC, it spends hours drawing strange symbols in a place called a 'whiteboard.'
00:04:00There are thousands of dialects, and a dozen may be used in a single office, but no linguist has ever deciphered their purpose."
00:04:05"Vain creatures, they have evolved to gaze upon their own online image from peculiar postures.
00:04:10They sit motionless for hours, offering the excuse that they are waiting on code review. And then,
00:04:17after a long day with little to show for it, the keyboard warrior goes to sleep. A little reading, then lights out. Good night, programmer."
00:04:24Now, the reason I sleep so well is that Sentry helps me exterminate bugs.
00:04:31Not the frail kind of bug that dies off in winter. I'm talking about the huge, vicious bugs you find in the jungle.
00:04:36But I'm not scared, because with Sentry's Seer, even those bugs get rounded up in one sweep.
00:04:42So with DynamoDB, too, I got curious about how much information had been made public.
00:04:46An RCA was published as a post-incident summary, but it was very vague and didn't explain much.
00:04:53Later, a full presentation video covering the outage went up from re:Invent in December, so I watched all of it.
00:04:58I read the RCA closely and watched the presentation to the end, and I still wasn't satisfied.
00:05:03I felt like there was no concrete explanation of the bug anywhere.
00:05:11So I want to use why their explanation falls short as an example of the danger of merely thinking you understand.
00:05:16Some people came to me saying "here's the bug," but they all just repeated the same thing,
00:05:22and that isn't an explanation of the real bug. I understand the urge to read it and feel like you've understood, but
00:05:29if you can't say what the real bug was, you're not done. You need a more detailed explanation. Does that make sense?
00:05:30"Casey, I understood 100% of what you wanted to say from the very start. No questions, no outages, perfect. See you tomorrow!"
00:05:38Listening on Spotify is nice, but I could listen to Casey talk for a whole hour.
00:05:42Thanks for the Spotify plug. The quality is higher over there, and you get the chatter before and after as bonus tracks.
00:05:50Lately we've been posting longer versions for Spotify, the ones with a lot of off-topic "chatting."
00:05:56Live viewers hear about Trash's Pokémon addiction and things like that, but on YouTube that part gets cut.
00:06:02Four guys spending the first ten minutes of a YouTube video talking about some obscure thing called "DynamoDB" is a tough sell.
00:06:07We should have introduced Adam before starting the podcast. Why did you come today?
00:06:14"Because I'm at TJ's house" is the biggest reason. TJ forces everyone who visits his house to appear on the podcast.
00:06:20That gets awkward sometimes, right? So who exactly are you, beyond the AWS Hero title?
00:06:26"I'm not a Hero anymore. I wasn't renewed. I served one term and got kicked out of the superhero group."
00:06:32"Is it something you pay to become?" "No, I just lost interest and stopped talking about it, so they figured I'd retired."
00:06:39Casey right now looks like a character in a detective novel. He's about to start drawing on a board. Casey Muratori!
00:06:45Muratori? Muratori? So you're going to explain it with visual aids.
00:06:51"Muratori" is the correct pronunciation. It's an Italian name; it probably changed somewhere along the way after the family moved to America.
00:06:56Now, here's what they explained. DynamoDB has an API endpoint, that is, a domain address.
00:07:01It's a name you look up in DNS to find out where to send your requests. Adam knows this, I'm sure, but
00:07:04it's something like "dynamodb.us-east-1.api.aws". The name differs for IPv4 versus IPv6, or for government use,
00:07:09but it's basically a name that's hardcoded into your application. OK so far, Adam?
00:07:14"Yes, exactly." Now, when you send a request, you get redirected somewhere. Naturally,
00:07:18you can't serve the whole world's traffic from a single server. It gets split by region
00:07:23and load-balanced appropriately. They called this the "DNS tree."
00:07:32They didn't explain the structure of the "tree" in detail, but it seems to be something like a weighted array,
00:07:39where lightly loaded machines get high weights and heavily loaded ones get low weights to spread the traffic.
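The weighted scheme described here can be sketched in a few lines. This is a toy illustration only: the machine names and weights are invented, and AWS never disclosed the actual structure of the tree.

```python
import random

# Hypothetical sketch of weighted load balancing: lightly loaded machines
# get higher weights, so they are picked proportionally more often.
machines = {
    "ddb-host-a": 10,  # lightly loaded -> high weight
    "ddb-host-b": 5,
    "ddb-host-c": 1,   # heavily loaded -> low weight
}

def pick_machine(weights: dict) -> str:
    """Pick a machine at random, proportionally to its weight."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

counts = {name: 0 for name in machines}
for _ in range(10_000):
    counts[pick_machine(machines)] += 1

# ddb-host-a should receive roughly ten times the traffic of ddb-host-c.
print(counts)
```

A real deployment would recompute these weights continuously, which is exactly why the "plans" in the next part have to keep being replaced.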
00:07:43Can I jump in? Apparently an engineer got promoted to L6 for designing that "tree." So the tree matters!
00:07:47It may well matter, but it's not relevant to this bug, so it makes sense that they skipped explaining it.
00:07:52It's not called a "tree" because it's a Root Cause Analysis, right?
00:07:59Jokes aside. This load balancing is done through Route 53,
00:08:08managed as a group of records rooted at a name like "plan-145.ddb.aws". That's the example from the actual presentation,
00:08:15but in reality something like the hash "0afe129a" is used, not a human-readable name.
00:08:19When a user issues a query, Route 53 directs them into this load-balancing tree, and eventually the machine to connect to is decided.
00:08:23I don't know Route 53's internals well either, but for understanding this bug, we'll take this "normal flow" as a given.
00:08:26"We have AWS Hero Adam here, so if anything is unclear, ask him." "Yes, Route 53 has many ways to split traffic,
00:08:31weighted routing among them. It sounds like exactly the mechanism that was just described."
00:08:36Now, why does this have to be called a "plan" and be continuously updated? Because DynamoDB's operating state is always changing.
00:08:41Overload, crashes, added capacity. To handle all of that, the tree has to be kept constantly up to date.
00:08:45They build a new tree (say, plan-146), and once it's ready, they switch the main record's target over to the new one.
00:08:50This process is split into two roles, a "planner" and "enactors." The "planner" is
00:09:00the process that decides what the new tree should look like, and apparently there is only one of them.
00:09:07It keeps generating new plans, one after another, but it doesn't register them in Route 53 itself. That job belongs to the three "enactors."
00:09:13Why there are three enactors but only one planner was never explained. If the answer is fault tolerance,
00:09:18then if the planner goes down, the enactors have nothing to do, but let's set that aside. In any case,
00:09:24the three enactors try to push the plans the planner creates into Route 53. OK so far?
00:09:28Now, here's where serialization was used, for the stated reason that it "makes it easier to reason about."
00:09:34Instead of the three enactors writing records concurrently and independently, they acquire a lock and execute in order.
00:09:39They seem to use Route 53's atomic operations to use a DNS record itself as the lock, so that only one
00:09:43enactor can update at a time. But in the end, this is what flushed out the bug.
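The DNS-record-as-lock idea can be sketched with an atomic create-if-absent operation. Everything here is invented for illustration (`FakeRoute53`, the lock record name); the real Route 53 calls and record names were not disclosed in the presentation.

```python
import threading

# Invented stand-in for Route 53: an atomic "create this record unless it
# already exists" acts as lock acquisition, and deleting it releases the lock.
class FakeRoute53:
    def __init__(self):
        self._records = {}
        self._mutex = threading.Lock()  # stands in for Route 53's atomicity

    def create_if_absent(self, name: str, value: str) -> bool:
        """Atomically create a record; fail if it already exists."""
        with self._mutex:
            if name in self._records:
                return False
            self._records[name] = value
            return True

    def delete(self, name: str) -> None:
        with self._mutex:
            self._records.pop(name, None)

dns = FakeRoute53()
LOCK = "enactor-lock.ddb.aws"  # hypothetical lock record name

assert dns.create_if_absent(LOCK, "enactor-1") is True   # enactor 1 wins
assert dns.create_if_absent(LOCK, "enactor-2") is False  # enactor 2 must back off
dns.delete(LOCK)                                         # enactor 1 releases
assert dns.create_if_absent(LOCK, "enactor-2") is True   # now enactor 2 proceeds
```

Note that this sketch shares the open questions raised in the episode: nothing here handles a lock holder that crashes without releasing.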
00:09:49"Casey, if the enactors run one at a time, why have three of them at all?" "That's not explained either. And what
00:09:54happens if the enactor holding the lock just dies? The presentation was full of questions like that."
00:10:01Maybe this is all obvious to people who use AWS heavily, but a lot of it was baffling to me.
00:10:06So, what happened? When an enactor can't get the lock, it waits a while and retries (backs off).
00:10:13Suppose one enactor was trying to enact a very old "plan 110." While the other enactors
00:10:23were processing newer plans, this enactor had the bad luck of failing to acquire the lock over and over.
00:10:28Meanwhile the planner kept churning out new plans, and the other enactors had gotten up to "plan 145."
00:10:36Then, at last, the enactor still holding "plan 110" acquired the lock and enacted the stale plan.
00:10:41You'd temporarily revert to an old plan, but the next update should bring things back to the latest, so normally it'd be a few minutes of trouble.
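The guard discussed later in the episode ("check before you enact") fits naturally here. This is a hedged sketch, not AWS's code: the deletion threshold and plan numbering are invented.

```python
# Hypothetical staleness guard: before enacting, an enactor checks whether
# its plan is already old enough that the cleanup pass would delete it.
DELETE_IF_OLDER_THAN = 20  # invented: plans this far behind get cleaned up

def should_enact(my_plan: int, newest_plan: int) -> bool:
    """Refuse to enact a plan the deletion logic would immediately remove."""
    return newest_plan - my_plan < DELETE_IF_OLDER_THAN

# The failure scenario: an enactor stuck on plan 110 while others reached 145.
assert should_enact(110, 145) is False  # 35 behind: would be deleted, skip it
assert should_enact(144, 145) is True   # fresh enough: safe to enact
```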
00:10:47"It was far worse than that. To keep old records from piling up in Route 53, there was a mechanism that deletes old plans."
00:10:53Right after that enactor enacted "plan 110," another enactor decided "110 is old, delete it," and removed it.
00:11:00As a result, DynamoDB's endpoint was left pointing at a record that didn't exist, and nobody could connect.
00:11:06"Did a memory-safety problem occur because they don't use Rust?" "No, it's nothing like that.
00:11:14When you try to connect, you just get back 'no records found.' A simple case of absence."
00:11:19Online, a lot of people are satisfied that "a race condition was the cause," but I don't think so.
00:11:24The race condition only explains why the record disappeared. The real question is
00:11:28why the next enactor couldn't fix it.
00:11:34"Isn't writing a record so old that it's due for deletion a bug in itself?" "Exactly. One check
00:11:38before enacting would have prevented it. But the story doesn't end there. Behind this update operation,"
00:11:42there was also an operation to "set a rollback record," meant to preserve the old plan. But
00:11:53if the record it referenced had already been deleted, the enactor was designed to crash fatally right there.
00:11:59As a result, all three enactors crashed and stopped, one after another. Until a human manually reset things,
00:12:04DynamoDB stayed completely down. That's the real story of the hours-long outage.
00:12:09The "race condition" was only the trigger. The real bug is the fragile code design in which merely referencing a
00:12:13nonexistent record kills the whole system. They failed to handle a case that should be handled by default.
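The defensive alternative Casey is arguing for can be sketched as follows. This is speculative: `RecordNotFound`, the record names, and the function shapes are all invented, since AWS published no code.

```python
# Hedged sketch: treat "record not found" as an expected case during the
# rollback-pointer update, and skip it rather than crashing the enactor.
class RecordNotFound(Exception):
    pass

def set_rollback_record(dns: dict, previous_plan: str) -> None:
    if previous_plan not in dns:
        raise RecordNotFound(previous_plan)
    dns["ddb-rollback"] = previous_plan  # hypothetical record name

def enact_safely(dns: dict, previous_plan: str) -> str:
    """The rollback pointer is a convenience, not a correctness requirement,
    so a missing old plan should not stop the enactor."""
    try:
        set_rollback_record(dns, previous_plan)
        return "rollback updated"
    except RecordNotFound:
        # Log and move on; enacting the NEW plan is what restores service.
        return "old plan already deleted; skipped rollback update"

dns = {"plan-145": "weights..."}
assert enact_safely(dns, "plan-145") == "rollback updated"
assert enact_safely(dns, "plan-110") == "old plan already deleted; skipped rollback update"
```

With handling like this, the stale-plan race would have caused a brief blip instead of stopping all three enactors.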
00:12:19My guess is that they used a library that throws exceptions, didn't catch them, and the enactor just terminated,
00:12:27and that is exactly the part of the RCA with "educational value." Rather than how to avoid race conditions,
00:12:32we should be learning how to write robust code, yet that crucial part is withheld. That's why I'm not satisfied.
00:12:38I hear much of AWS is written in Java and Scala, but this feels like an architectural failure that comes before any language issue.
00:12:44An incident report should include at least a reproducible code fragment: "this was set up like this, so it blew up."
00:12:48With that, the rest of us could draw lessons for reviewing our own code. For a company as big as AWS to put out a report that hides this
00:12:53only invites distrust. You start wondering, "Did they really find the bug?" "Are they hiding something?"
00:13:00A solid report should build customer trust. Anyway, we're running short on time.
00:13:05We never did learn the truth behind the guitar-shop gossip, but given how inadequate this RCA was,
00:13:09it's hard to say which side is telling the truth. Anyway, thanks for a great presentation today, Casey.
00:13:14Thanks to everyone who watched live, and to everyone listening to the full version on Spotify. See you next time!
00:13:19So I'm okay that they skipped out on what the tree is doing. But I got a quick question as well.
00:13:25Yes. Is it called a tree because it's a root cause analysis or no?
00:13:29No more jokes. We're too off topic. I'm sorry. I'm sorry. So anyway, this is supposed to point to that.
00:13:37And that, that sort of this, this load balancing scheme basically of DNS entries and the way that
00:13:46they described this in like their presentation is they would use a thing like, I'll say,
00:13:52plan-145.ddb.aws, right? Now this is the root of that tree, I guess,
00:14:02not root cause analysis, but like this tree, this would contain like, this is the top level record
00:14:07of a bunch of records that allow it to do its load balancing. And I assume route 53 kind of
00:14:13has this load balancing capability. I'm reading between the lines of the presentation. They didn't
00:14:17say that outright, but I'm assuming Route 53, which they're doing all this through, you know,
00:14:21which is their own DNS thing, allows that load balancing to happen: you just set stuff up in
00:14:26here that says how the load balancing should sort of be working right now. And then it will pick the
00:14:31correct machine based on like some kind of randomization in the weights or whatever. Now,
00:14:35what they said was this name, which really does exist. And apparently there's a tree or something
00:14:41like this. This name is one that they just kind of used for the presentation. They never actually
00:14:48used a human readable name for this plan, like one 45 that I've written here or whatever. It was
00:14:53really a hash of something. So it would really be like, you know, 0afe129a
00:15:00or something like that, right, is actually what would be there. So if you went and looked, you
00:15:05would not see a human readable name, or at least at that time, you wouldn't, I guess you wouldn't see
00:15:09like plan one 45. You just see that. And so the idea was, okay, a user goes to use it. They query
00:15:15this name route 53 will direct them like to here. And this thing is some kind of a load balancing
00:16:22tree that Route 53 can use that will allow you to get where you need to go. Right. They will give
00:15:27you an actual machine you can send traffic to eventually. Again, they did not describe any of
00:15:32that. So I have no idea how any of that works. I've never touched or used route 53. So I have no idea,
00:15:38but we'll just assume that that happens because it doesn't matter for this bug.
00:15:41We do have an AWS hero. So if you do, if you are confused, you can always
00:15:45ask Adam and he may have further insights. I mean, yeah, go for it.
00:15:50Well, route 53 does have a lot of different ways you can like split the traffic. So yes,
00:15:54weighted is one of them. And that sounds like what they described.
00:15:57So somehow they've set up these records with that. And they just didn't say how, but something,
00:16:02something in a tree format did that. My guess is there's like a weighted, like the tree has like
00:16:07weighted like there's a couple of weights at the top that branch out to more weights or something
00:16:11like that, because that's easier for it to deal with because there's a lot of them or something.
00:16:14Who knows? Anyway, I have no idea. Point being, this is what's supposed to be happening normally.
00:16:20Now, the reason that this is called plan 145 here, even though it actually would have been
00:16:24some hash code, but they refer to it as like plan 145 is the load balancing, as you might imagine,
00:16:31has to be kind of continuous because the DynamoDB machines are like doing stuff all the time. They're
00:16:38becoming more overloaded. There's machines are going down or crashing or who knows what, right?
00:16:42Could be happening, being taken offline. New capacity can be added. And so this stuff has
00:16:49to be updated constantly, like all the time. So this main API endpoint that you connect to,
00:16:56it constantly has to have that tree that it's pointing to be adjusted. And so the way that
00:17:02they do that is they create another tree, the tree that they're going to move to, right? They create
00:17:09like, you know, plan 146 or something. And they make the whole tree here. And then when they're
00:17:18ready, like when this tree is done, they take this, you know, this record here, and instead of
00:17:24it pointing to that one, they point to this one, right? So you make the new one, and they move over
00:17:28to it by just changing that name. Now, for some reason, and this reason is not really explained.
00:17:36The way that they've set up that process is they split it into two pieces. There's something called
00:17:44a planner, which figures out what the new tree should look like, basically. So you can imagine
00:17:50there's some machine called a planner. And I don't know if it's an actual machine or if it's just a
00:17:56process running on some machine that's running other things, who knows. But there's something
00:18:00called a planner. And as far as I could tell, there's only one, meaning there's just a planner
00:18:06that sits there and figures out what should the new plan look like that we're going to switch to.
00:18:13And it's constantly doing this. So it generates plan 145, then it generates plan 146, then it
00:18:18generates 147, 148, 9, 10, you know, blah, blah, blah, blah, blah, right? And it just keeps putting
00:18:25out plans for all of eternity, because that's its job. Now, it never actually creates them,
00:18:31apparently. Its job is not to ever make them in Route 53. It's just to figure out
00:18:40what they would be if someone were to put it into Route 53. Then they have three enactors.
00:18:50These enactors get the plan from the planner, and they put it into Route 53.
00:19:06Does this make sense? Now, one planner, as far as I understand it,
00:19:11three enactors. There was no explanation for why this would be the case. They said the reason there
00:19:18are three enactors is because it's supposed to be fault tolerant, like if one of them goes down or
00:19:22something. But they never explained why you wouldn't then need three planners, because if the planner
00:19:28went down, then the enactors have nothing to enact. So it didn't really make any sense. So there wasn't
00:19:33an explanation in the thing about why this structure looks the way it does. It's not really
00:19:38that important to the bug that it looks this way, although it kind of is, as we'll see later. So I
00:19:43was a little weirded out by the fact that they didn't justify this, but that's fine. So hopefully
00:19:50that makes sense. We have a planner. We have three enactors. The enactors are all trying to enact this
00:19:55plan. Now, what happens here is that for, again, reasons that the only thing they said in the
00:20:04presentation was it makes it easier to reason about. This is the only information about. They
00:20:11said it makes it easier to reason about. Because it makes it easier to reason about, these enactors
00:20:18use serialization. So instead of them just trying to create records, and if the records are already
00:20:26there, just not creating them or something, in other words, I have three people running.
00:20:29We all want to create, you know, let's say this top level record, plan146.ddb.aws, right?
00:20:36We all are trying to do that. One of us does it first. The next person tries to do it, and it's
00:20:42already there or something, right? We're all trying to create the same record. So in theory, we could
00:20:48just have three people randomly hammering on whatever part of the plan they're trying to hammer
00:20:52on, and in theory it should kind of all work, right? And I sort of got the sense, although
00:20:57you didn't come out and say it, I sort of got the sense from the presenter that he would agree with
00:21:01what I just said, meaning that they could have just had them run arbitrarily and it would or should be
00:21:08okay. But, he said, they use serialization to make it easier to reason about. What that means is
00:21:15instead of these enactors just hammering on it like that, what they do instead is they attempt to
00:21:21acquire a lock for whatever the endpoint is that they're trying to update. So in other words, if
00:21:28this person is trying to update one of these things, and I got the sense that it was if you're trying to
00:21:35update this one, but it could have been if you're trying to update this one, or it could have been
00:21:41on both. They never really 100% said, if I remember correctly, exactly where the locking
00:21:46was occurring. But the locking occurs by them going, okay, I'm going to create a lock that is
00:21:56a DNS record. And by using the fact that Route 53 has the idea of an atomic, which is,
00:22:02you know, I can do two things and if they both wouldn't succeed, then it won't do either of them.
00:22:08They basically made a locking system that locks via Route 53. So Route 53's DNS records are actually
00:22:15the lock record, if that makes sense. Can I ask a quick question? Yes. You said it does this through
00:22:21serialization? I don't quite understand what that means. Because I thought serialization is just
00:22:25converting from one memory representation to a different one or something. I'm sorry, that's a different
00:22:31serialization. So yes, that is serialization. In this case, we mean literally temporal
00:22:40serialization, meaning they wanted these enactors to have some kind of a way in which they would
00:22:48organize their behavior into an order rather than just being arbitrary. And the way that they did
00:22:55that was locking. So what will happen is, instead of this person just doing whatever it is they're
00:23:03going to do, like, okay, I'm going to like, I finished this, I'm going to point this guy at
00:23:07plan 146 now. Instead of doing that, it attempts to acquire a lock on like this, right? And if it
00:23:14doesn't get the lock, it won't make the change. So only one of these enactors can be in the process
00:23:21of updating this at any given time. Does that make sense? Mm hmm. Now again, exactly what they were
00:23:28trying to do with that was never explained. They just said makes it easier to reason about and left
00:23:32it there. So I don't know why they thought this was an improvement. And amusingly, it's what ends
00:23:38up uncovering the bug. So it wasn't an improvement. If anything, it was probably bad. But so Casey,
00:23:42are you saying they don't have like, they don't have a good reason for they're saying we're going
00:23:47to make the enactors run almost like one at a time? Why do they have a, why do they have three
00:23:52enactors? I don't understand. Like, why do they not just have one? They just don't say that. We don't
00:23:56know why. And they didn't quite explain, like, I didn't really hear an explanation for how you
00:24:02have three concurrent enactors. You expect them to be able to go down, which is why you have three.
00:24:07Right. But they're taking a lock. So what happens if this guy takes the lock and then goes down?
00:24:13Like, I didn't hear an explanation for that either. So this was all very confusing to me. Like I,
00:24:18I, I'm not complaining about it as part of what we're talking about here, because it's not important
00:24:25for the cause to me. But as a presentation, I had so many questions. Like I was like, I don't
00:24:32understand why you did any of this to be completely honest. Right. And maybe that's, again, part of it
00:24:38could just be that I don't use AWS services. It might be that some of these things would be obvious
00:24:43if you are someone who regularly uses route 53 or something, you'd be like, oh, it's because
00:24:47locks can be set to a timeout or I mean, I don't know. Right. But anyway, so yeah,
00:24:53so they're doing that. And what ends up happening for, for this, the thing that uncovers the bug
00:25:02is that what ends up happening is these enactors, when they don't get the lock, they just do like
00:25:08a back off, right? They'll basically just be like, okay, let me wait and I'll try again. So an actor,
00:25:14this an actor tries to get the lock, but somebody else already has the lock. So he just waits a
00:25:18little while. He tries to get the lock again. That's what will happen. Right. And what they
00:25:24said happened was they hit a pathological case, quote unquote, where one of the enactors is,
00:25:29you know, has enacted some plan. And that plan, let's say was pretty old. I think they used 110
00:25:35was an example that they used. So it enacted plan 110. And it wants to point, you know, it's like,
00:25:43I got to set the API to point to my 110. It tries to get the lock to update dynamodb.us-east-1 or whatever,
00:25:51and fails because someone else is enacting plan 111 or something like that. Right. Or plan 109 could
00:25:57have been a previous plan. So the other enactors are doing it. It can't do it. It backs off. Right.
00:26:02And remember this an actor here, we're on 110. It's trying, it's it really wants to enact it.
00:26:07It tries again. Someone else has the lock. Now it tries again, still locked. This person is sitting
00:26:13on 110, desperately trying to enact. It can't do it. Apparently this just happened so many times
00:26:19that the other enactors and the planner is just churning out new plans this whole time. Right.
00:26:23The other enactors, they get up to like 145 or something and 146 they're enacting plans that are
00:26:28like way ahead of 110. Right. And this guy's still stalled because he just unluckily never gets the
00:26:35lock. Right. Finally, at some point after like plan 145 has already been enacted and pointed to by some
00:26:44other enactor and all that stuff, plan 110, this enactor still trying to do it finally gets the
00:26:49lock. I mean, it's like, yeah. And so then he says, okay, we're pointing to 110 now. Yes. Right.
00:26:58So now it's on a super old stale plan, but this really shouldn't be a problem. Right.
00:27:03Because eventually the next time some enactor has something, it's going to be a much later plan.
00:27:07They'll just enact plan, you know, 146 or seven or eight or whatever. And we'll re-point it back
00:27:12to this and we're back to a fresh plan. So everyone will just have bad load balancing for like a few
00:27:17minutes, but then it'll be fine. Right. They did have bad load balancing for at least a few minutes.
00:27:22Right. Yes. True. Well, it's a lot worse than that. That's what was supposed to happen. Right.
00:27:30Meaning that's how they would expect this to work too. Okay. The problem is these,
00:27:36they also didn't want Route 53 to become clogged with all of these records. Because if they just
00:27:42left them around, eventually after, you know, three months, you have like 8 billion records
00:27:49that you stuffed into Route 53 for every, you know, couple minutes you're putting in this big tree of
00:27:54weights and stuff. They were like, okay, at some point we should just clean up these plans.
00:28:00So enactors also look for plans that are older than a certain amount. And if they are older than
00:28:08a certain amount, they'll delete them. So what happened was they pointed to plan 110. This
00:28:13enactor finally gets the lock. It points to 110. Another enactor is like, oh, wow, 110, man, that
00:28:19is old. We should get rid of that and deletes it. So now the DynamoDB us-east-1.api.aws is pointing
00:28:29at a record that can't be resolved. Right. It's just something, it would actually, again, it
00:28:34wouldn't look like plan 110. It would look like 0afe129a, some hash, .ddb.aws. But
00:28:44it's pointing at that name. And if you ask that name, you get nothing.
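The sequence just described can be played out as a toy timeline, with invented plan numbers and record values (AWS's real records are hashes, and Route 53 is not a Python dict):

```python
# Toy model of the failure: a stale enactor finally wins the lock, points the
# endpoint at plan 110, and the cleanup pass then deletes plan 110, leaving
# the endpoint record dangling.
dns = {
    "plan-145": "tree-145",                  # current plan's tree
    "plan-110": "tree-110",                  # stale plan, still lying around
    "dynamodb.us-east-1.api.aws": "plan-145",
}

# 1. The stuck enactor finally acquires the lock and enacts stale plan 110.
dns["dynamodb.us-east-1.api.aws"] = "plan-110"

# 2. Another enactor's cleanup pass sees plan 110 is ancient and deletes it.
del dns["plan-110"]

# 3. Customers resolving the endpoint now chase a record that doesn't exist.
target = dns["dynamodb.us-east-1.api.aws"]
print(dns.get(target))  # -> None: "no records found"
```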
00:28:46So what would happen at that point is everyone who was trying to get
00:28:51a endpoint to send stuff to would get back an unresolvable name, basically. Right. And I don't
00:28:56really know what happens in Route 53 when that occurs, but you would basically be getting back
00:29:01something that you either couldn't use or just gobbledygook for an IP, who knows. But whatever
00:29:07it was, if you attempted to actually use it, you weren't going to get a response. Right.
00:29:10Interesting. Is this because AWS doesn't use enough Rust because that's obviously a use-after-free
00:29:15bug? And so I think Rust would have solved that, right? If you rewrote Route 53 entirely in Rust,
00:29:21obviously, all of these problems are not there. No, to be specific, I do think in the presentation,
00:29:30they did say, not about Rust, but they did say what would happen specifically, which is I think
00:29:35when you asked for this thing or either this thing or this thing, I don't know which one they were
00:29:40referring to, because I can't quite remember, you would just get back a thing that says no records
00:29:44found. So that's the end game of what would happen, whether it was from asking for this or asking for
00:29:50that, I'm not sure, but just get back no records found. That's what you would have received when
00:29:55you were trying to call that API. So whatever library you were using to use DynamoDB, it would
00:30:01just be like, hey, no records found, bro. Sorry. Right. So this, if you ask anyone on the internet,
00:30:11right, they're all like, yes, they explained the bug. That's the bug. The bug is that there
00:30:16was this race condition, right? Everyone, because everyone, as soon as you say race condition,
00:30:20everyone's brain shuts off. They're like, oh, okay, well, it was a race condition. Done. Nothing to see
00:30:24here, right? So they're like, it's a race condition. They explain it. It's like, no, they didn't explain
00:30:30it. Because if you think about what would happen here, immediately after this, everyone's getting
00:30:36this, then a new enactor will just enact a new plan, right? And so the bug, right,
00:30:44is why didn't that occur? That's the actual RCA that I wanted to see is why didn't the next
00:30:52enactor come and fix it? Can I throw out something else? Wouldn't it also be a bug? Like why write
00:30:57a record so old that it should be deleted immediately? Well, it was because
00:31:02this guy had written it quite a long time ago, and it was the wait. Well, I mean,
00:31:08if you're asking, why didn't they write the enactor with better code? Yeah, that's a pretty good question.
00:31:11Okay, fair. It seems like if you're updating to something that should be deleted immediately,
00:31:17isn't like that's like that feels like the problem right there. You've done something wrong
00:31:21long before. Yeah, even though it doesn't really fix the theoretical structure of this thing,
00:31:26a simple check in this guy when after he finished backing off on the lock, he should maybe check to
00:31:30see whether he's about to set this to something that he would delete if he was running his deletion
00:31:36code is probably a good safety measure. But yeah, so 100% agree with him. Okay, but an actor worked
00:31:41really, really hard to get that record. Waiting a long time. Oh, it's gonna have its Pokemon cards.
00:31:49Anyone ever waited. So just let him write the record. Okay. So, so I want to hear about that.
00:31:56Unfortunately, if you look at the presentation, and you look at the RCA, it's nowhere to be found.
00:32:03The presentation at least has one 12 second little tiny chunk where it does say where the bug roughly
00:32:13would be. And so let me explain what that is. So what apparently occurs alongside this, so when,
00:32:22when you do DynamoDB us east one, but when you point that at your plan, you also do
00:32:30another operation at the same time. And that operation is to set rollback.
00:32:40I think it's DD. Is it DDB dot rollback dot AWS? I don't remember exactly what it is here.
00:32:49There is a rollback record. It sets that record to whatever the old plan was. So if we were here
00:32:57pointing at 145, and we're now going to point at 110, right, this old enactors, like I'm moving to
00:33:03the 110, it attempts to set it, take whatever this name was, right currently, and move that new that
00:33:13name, which would have been playing 145 move that so that the rollback address points at the old plan.
00:33:18Right. And this is just for debugging. Or, you know, it's basically just for operator ease, right?
00:33:24If they want to roll back to the previous plan or something like that, or if you just want to know
00:33:29what the previous plan was, you can see it here, right? That's part one of how the how what they
00:33:35said about failure, I would want to point out one thing here was this also didn't make any sense to
00:33:40me. Because I was like, okay, you're telling me that these things update every like minute or
00:33:45something. What good is it to have one of those? Like, by the time you even logged in, it's been
00:33:53updated from the one that you wanted to roll back to to some new thing. That's actually the plan you
00:33:59don't want because everything went down, right? Like, it's it, right? If you you don't want this,
00:34:04you just want these names in a list. So you can be like, what was it at at 1230? Like that one,
00:34:10right? So this made no sense to me. I have literally no idea why why this would ever be good,
00:34:16right? It did not sound like it would do the thing you actually want, which is to be able to mark a
00:34:21point in time and go, we need to go back to 1pm because everything went to crap after that, right?
00:34:26Anyway, so that didn't make any sense to me. But again, not exactly there to the bug. So I didn't
00:34:31ask why I'm just saying, okay, that's what thing it had to do. And it can only roll back one version
00:34:36is what you're saying. Yeah, even though the other trees do exist. So you easily could by just knowing
00:34:42what the name was. So all this is, is an is putting a human readable name on something you almost
00:34:48certainly don't care about. Right. But they don't really they can't really store that much stuff.
00:34:54Casey, I don't think they can really put like, I don't know, Adam, like this, they don't have a lot
00:34:57of scale there, right? Like, that's a lot of lines. If it were me, I would have just made this a time
00:35:04stamp. If that's what you wanted, right? I would have said, when did the planner or when did this
00:35:09person point to this thing? Like when you got the lock, you change this name to the timestamp,
00:35:15and update this in one atomic. So then you just know if I want to roll back to 1pm, I just look
00:35:20for like, whichever had the timestamp, just, you know, the earliest timestamp, not after that time.
00:35:28And that's what we were running at that time. That's what I would have done. Right. But I don't
00:35:32know. So I have no idea why they did this. They did what they did. I you know, maybe it might make
00:35:36perfect sense. Again, I have no knowledge of their system. All these things, they make perfect sense.
00:35:40So I'm not really I'm just saying I don't understand them. I don't they might not be bad ideas, right?
00:35:45There might be good ideas, if you understood the rest of the system. So anyway, so what they say,
00:35:50and this is all we get is this operation, meaning setting the rollback to point to the old plan that
00:35:59was being you know, which in this case would have actually been newer in some cases, right? So it's
00:36:03not really the the previously pointed to plan, which may be older, maybe newer. Doing that activity.
00:36:11If that plan no longer existed, meaning like it had been deleted like this,
00:36:18then the enactor stops permanently. So every time, like once you get into a state where dynamodb.us-east-1
00:36:26is pointing at that one, right? So we do the whole sequence of steps that we said here. This plan gets deleted.
00:36:31So now this is pointing at an invalid, like unresolvable name. We cannot resolve
00:36:36plan-110, which is actually some hex code. But whatever that was, we can't resolve that anymore.
00:36:41Once that state is true, then the next time an enactor comes and tries to make it point to a new
00:36:50plan, whatever that new plan is, when it actually gets this far and tries to set
00:36:58the rollback, that will crash it permanently. Therefore, all three enactors will now stop
00:37:06because eventually all three will try to enact a new plan. They will try to set the rollback
00:37:11first to point to whatever the old plan was, find that there's no plan there. And that
00:37:16apparently is just a hard crash. Oh, that's crazy. I thought the three enactors were supposed to make
00:37:24it so that it had redundancy. Now, again, this is why I get grumpy with people online who are
00:37:32like replying. They're like, it was a race condition. It wasn't a race condition. The race
00:37:36condition is not necessary for this. The race condition is just why you ended up with this name
00:37:44being unresolvable. But if you didn't have whatever code did this badly, it would have just worked.
00:37:52You never would have known. You would have had a momentary minute outage of DynamoDB or something,
00:37:57but I'm guessing there are minute outages of DynamoDB from time to time. That's not global news.
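The failure mode being described, one dangling name turning a redundant trio into zero enactors, is easy to reproduce in miniature. Everything below is a hypothetical sketch of the behavior as described on the show, not AWS's actual code:

```python
# Shared state: the endpoint points at a plan that a cleanup pass
# already deleted, so the name is unresolvable.
plans = {"plan-145": "weights-v145"}   # plan-110 was deleted as "too old"
endpoint = "plan-110"                  # dangling, unresolvable name

class EnactorCrashed(Exception):
    pass

def enact_new_plan(enactor_id):
    # First step per the talk: set the rollback to whatever the endpoint
    # currently points to.  The fragile version treats a missing plan
    # as a fatal error rather than as an ordinary empty value.
    if endpoint not in plans:
        raise EnactorCrashed(f"enactor {enactor_id}: cannot resolve {endpoint!r}")
    # (applying the new plan would go here)

alive = {1, 2, 3}
for enactor_id in (1, 2, 3):           # each one eventually tries to enact
    try:
        enact_new_plan(enactor_id)
    except EnactorCrashed:
        alive.discard(enactor_id)      # permanent stop, no retry

# alive is now empty: one dangling name defeated all three "redundant" enactors.
```

The redundancy buys nothing because all three replicas hit the same poisoned shared state, which is exactly why the outage persisted until a human intervened.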
00:38:04What's global news is taking it down permanently, which is what happened here. And until an actual
00:38:09human goes and figures this out, resets it, gets these enactors going again, it's just gone. It's
00:38:15just out permanently. So hours potentially. And it was long enough, I guess, in this case to then have
00:38:21cascading failures. You would never have had that, it would just be a momentary outage. If some people
00:38:26momentarily got an unresolvable name or no records, then they would just try again. That's usually
00:38:32like with DNS, that's like your phone, you went through a tunnel. That's all that would have been.
00:38:37So I want to know what did the code look like here? How did you write something
00:38:45that if this wasn't a valid name, which it wouldn't even be on standup, meaning if you were
00:38:50starting this system and the operator hadn't pre-configured it, it wouldn't be pointing to
00:38:55anything. That's the default case that you would think you'd start with. So if you're going to do
00:39:01this, you would think you would just handle that case because the rollback address could just not
00:39:07point to anything. Just take whatever this is. If it's nothing, set the rollback address to nothing.
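The "nothing just flows through" handling Casey describes amounts to a one-line difference: treat a missing record as an ordinary None value rather than an error. A minimal sketch, with hypothetical names:

```python
def lookup(records, name):
    # A missing record is an ordinary, expected outcome, so return None
    # instead of raising.  At initial standup nothing points anywhere,
    # which makes "nothing" the default case, not an error.
    return records.get(name)

def set_rollback(records, state, endpoint):
    # Whatever the endpoint resolves to, a live plan or nothing at all,
    # becomes the rollback value.  The value None flows through unchanged.
    state["rollback"] = lookup(records, endpoint)

records = {"plan-145": "weights-v145"}
state = {}
set_rollback(records, state, "plan-110")   # dangling name: no crash
# state["rollback"] is now None and the enactor simply carries on
```

With this shape, a deleted plan, an operator typo, or a fresh un-configured system all degrade to "rollback is nothing" instead of a permanent stop.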
00:39:12Done. So there's something really weird about the way they wrote this code. And that is what should
00:39:18have been in the RCA. That's the whole bug to me. This is just set dressing for how we ended up
00:39:25having this thing point to nothing. The same bug would have occurred if someone had accidentally
00:39:31deleted this record. Like some operator was just like, oops, crap, I set it to nothing.
00:39:35This same bug would have happened according to the presentation. So the root cause is not the
00:39:40race condition. The race condition is an aside. Does that make sense? Quick question. So I'm
00:39:46legitimately thinking through this. And so that means the thing that sets the rollback probably
00:39:51assumes some sort of struct with a bunch of memory or something has been passed in, does some sort of
00:39:56access. It explodes. Or do you think this is the same style of bug,
00:40:03which is the one line that took down Cloudflare, which is they just assume it's there and unwrap it.
00:40:07It's in Rust. It is memory safe Rust. Unwraps it, explodes it.
00:40:12I really don't know. My guess, like in my head, I was like, what is the thing that I see people
00:40:19do a lot of times where I'm always like, why would you ever do this? But it's just because that's the
00:40:24way they learn to program. And I was thinking like, if you were writing in one of these languages that
00:40:28likes to throw exceptions for error conditions, this would be a great example of that. So if you
00:40:34had a thing where you were like, oh, I went to go get the DNS record that this thing points to.
00:40:40And normally in a sane programming environment, no one is throwing an exception there. If they
00:40:45get back nothing, they just return nothing. And then when the person goes to set the rollback record,
00:40:51they just set it to nothing, which is the correct behavior. Like nothing flows, literally the value
00:40:56nothing flows correctly through this flow. So if you were writing it, since it is a core
00:41:03foundation service, assuming you were trying to write something that was fault tolerant,
00:41:08you would never do something like throw an exception. So in my brain, I'm thinking,
00:41:11I bet what happens in here is when you ask for this record, they just use some library
00:41:16call or something that throws an exception when the record doesn't exist. And it just
00:41:20threw an exception and the enactor was done. That's my guess. And I could be very wrong about that
00:41:25because I'm just wild guessing. But this is why I want to see the RCA. What was it? It could be
00:41:31exactly the stuff that Trash was talking about. I mean, it could be stuff that Prime was talking
00:41:34about. It could be the stuff that I just said. It could be anything. And I want to know because
00:41:38that's where the actual education would be here. Avoiding this race condition is completely
00:41:43unimportant. This race condition could have lived there. And while it was important eventually to
00:41:48fix it, to avoid those once a year weird outages for five seconds or something, it is not actually
00:41:56the thing that we most want to learn. What we most want to learn is don't write this thing. And we
00:42:00don't know what this thing even was. So how do we not write it? This is why I think it was the
00:42:04bad RCA. Does that make sense? Yes. Yes. All right. What is most of AWS written in, Adam?
00:42:11It was Java. I was about to say someone from the chat said Scala. They said they worked at
00:42:16AWS for seven years and they said most of it's written in Scala. Well, that's technically Java
00:42:21with extra steps. And that will anger all of them endlessly. So that's really it for me.
00:42:34This was a thing where I was like, I don't feel like I saw the explanation. And I actually feel
00:42:38like it's important to hear because there was a bad programming practice at the bottom of this
00:42:42somewhere. And I want to know what it was, especially because it helps people like me when I, you know,
00:42:46I don't really do a lot of architecture education right now, but at some point I probably would like
00:42:51to do some of that because I think there's a lot of bad architecture out there. And so I kind of
00:42:56try to pay attention to these things. Like what are the kinds of architectural mistakes that people are
00:42:59making? And I bet this was one of them. Right. And so I'd like to know. I'd like to know.
00:43:04Yeah. I think like what I would expect is like at least like one simple reproducible example of like
00:43:10why I blew up like a whole like little code snippet. So like, and this is something you
00:43:16brought up earlier is like kind of like how we approach these type of things. Like if I'm like
00:43:18reviewing someone's code and I see something that looks weird, I will always do my best to make my
00:43:23own little sandbox and like prove my theory out. And then like actually show them the code of like,
00:43:29this is why this is probably wrong. Here's like a small, simple reproducible step. So I would expect
00:43:33something like that. And that also helps me like truly understand. Cause a lot of people, like you
00:43:37said, they'll see something that looks funny, but they don't know why it looks funny. But I can't
00:43:43stop there. I gotta like actually like build it out and then like understand. So that's what I would
00:43:48expect. And you know, like, like I said, the CrowdStrike and the Google outages, I thought were better,
00:43:55like just telling you what they were, like, look, it was a null pointer deref in here, or it was an
00:43:59out of bounds array because we thought there was only going to be 20 and we put 21 in the
00:44:03config file. Right. And like, okay, I know exactly what kind of code that, you know,
00:44:08is causing that kind of problem. Right. And furthermore, to like an earlier
00:44:14comment, literally, as far as I know, everyone who programs in Rust only does it so that occasionally
00:44:21when they see something like this, they can say, well, if they'd written it in Rust,
00:44:24it wouldn't have happened. They were not given enough information to even make that comment.
00:44:29They probably made it anyway, to be fair, but they were not given it. So
00:44:34one rule that should be followed in RCAs is you have to give Rustaceans enough information to,
00:44:41if they so chose, correctly say that it would have been prevented in Rust.
00:44:46And this, we do not have that. We do not know whether this would have been prevented in Rust.
00:44:51We have no idea. It probably wouldn't have, but we don't know. Well, Casey, we do have a pretty
00:44:58good chance because it's like, probably would have never shipped. So it would have prevented it.
00:45:03True. We would have zero enactors because we would still be designing the enactors. Yeah.
00:45:09Cloudflare does a really good job at this as well. They like go in and show like a lot of lines of
00:45:17code and say like, this is exactly what's going on. This is, you know, even though the problems up here,
00:45:21this is the line that exploded due to all these previous conditions. That was me making fun of
00:45:24Rust with the unwrap, which actually wasn't truly the problem. Uh, but you know, it's just like all
00:45:28these things kind of happen. So they, they do a really good job. I'm surprised at how poor of a job
00:45:33AWS has done for this one. Well, and the other thing too, is it was one of those things
00:45:39where now it makes me unnecessarily suspicious of you, right?
00:45:44When I read this, I'm like, are you hiding something? Did you not really figure out what
00:45:48the bug was? Like you talked all about this race condition, but even from your own presentation,
00:45:52I can tell the race condition really wasn't important. That was just, that was just what
00:45:56led to the record having been set to nothing, but who cares, right? Like that's, that's like
00:46:00something that's nice to put in the RCA as like an explanation of why this bug occurred now,
00:46:05as opposed to some other time, but it's not the bug. So it's weird to me. Like when I see an RCA
00:46:10that doesn't talk about the bug, now I'm suspicious. Right. And unnecessarily so, because if you actually
00:46:15did find it, then just tell me, and now I know you found it. Right. So it's like, I think it also is a
00:46:19confidence boost for the people who are looking from the outside who want to know, can they trust
00:46:24this DynamoDB thing? If it looks like you actually found the bug, I have a little more confidence in
00:46:28you. If it looks like you have no idea what the bug was, or don't seem to understand what the bug was,
00:46:33then I'm, that I'm more concerned. And so I think that's also another reason to do this in your RCA.
00:46:37It provides confidence to your customers. Maybe that's why they fired Adam as an AWS Hero too.
00:46:43Maybe it's all connected. Could be. They didn't want him exposing these dirty secrets.
00:46:48Yeah. He knew too much. He knew too much. Could you give a quick,
00:46:53like three-minute summary of the guitar shop? Like what that was revealing? Because I'm
00:47:00trying to remember what it was because it involved like a single point of failure guy who was out here
00:47:05for this failure as well. So I don't know how to reconcile the two things. And of course we have no
00:47:10idea. We have no idea if either are telling us the truth now, right? Because this was such a bad RCA,
00:47:16I have no idea if it's correct or not, but yes, the password was wishbone 12, I think.
00:47:22There you go. Always try to kill me. That's my recollection anyway.
00:47:26So yeah, that story was that, that there was the, there was a thing that was designed to
00:47:34copy configurations. And that thing had kind of gone rogue and could not be stopped. Like it was
00:47:42just like, it was just copying configurations totally incorrectly and it needed to be like fixed
00:47:47or repaired or something. And we don't have any more information because it was an overheard
00:47:53conversation. Right. And so does that comport with this? Well, a little bit, cause those enactors do
00:48:00sound like the kind of thing that would be running a configuration copy, but on the other hand,
00:48:05it's not really a configuration for machines. Like a DNS entry is a DNS entry. It's not,
00:48:10it's not really a configuration. So I would say the two stories don't line up that well.
00:48:14And so that's another reason why I was kind of hoping that this RCA was a little bit more
00:48:19believable because I wanted to know for sure that the story was false. And I still don't really know
00:48:24based on how bad this RCA was. What if the tool that the guy wrote to copy the configs is
00:48:31just literally the enactor? Like they just productionized it and he, and like they haven't
00:48:35changed it in seven years. That was kind of my connecting the dots. There was, he's like, guys,
00:48:42I wrote that as a way for me to test stuff in my local environment. And you just decided to make
00:48:48three enactors and put them next to each other in prod. I don't, how did this happen? I do.
00:48:53I have alternative questions. Yeah. Alternatively, is it the rollback? Because that's the one that
00:48:57did the copying of like, Hey, here's the previous one. Right. And so I'm going to copy the previous
00:49:02one. Then it gets like this null issue going on. And it's just like the script never encountered
00:49:07or acknowledged it, just goes rogue and starts writing over and over and over again to where you
00:49:11can't, you can't do anything. I don't know. All I know is that like, as far as I can tell from
00:49:19their explanation, going only on what they were providing, I still just don't think the race
00:49:24condition is even relevant, because again, literally an accidental update to the Route 53 endpoint
00:49:31would have taken down all three enactors immediately. Cause according to them,
00:49:35all that's required to stop them is if the, if the endpoint points at an unresolvable name,
00:49:41that's all you need. And so if that's really true, literally an operator typo could have taken all
00:49:47this down, no race condition necessary. Right. And so again, the RCA just does not do a good job
00:49:52convincing me that you've talked about what the real bug was, because I can think of so many ways
00:49:57that you could have triggered this exact same thing that don't involve this race condition that you
00:50:00spent the entire RCA telling me was the bug, but I don't think it is. So thank you, Casey, for giving
00:50:06us that amazing presentation. I am actually genuinely green with jealous rage for whatever
00:50:10that writing instrument is. I got to figure out how to set up what you have. That thing is fantastic.
00:50:15Thank you everybody for watching. I, uh, for those that caught it live, I hope you enjoyed
00:50:18the pre-banter and probably a little bit of the post-banter. If you wish to hear the extended cut and
00:50:22all the kind of fun interactions that are not a part of the main story, head on over to Spotify
00:50:27for the full podcast, which is just us yapping about, I don't know, what Trash is eating and
00:50:31snacks and such. The name: more yapping, more yapping again. And also Casey, TJ, and Trash.
00:50:42Errors on my screen, Terminal coffee, and living the dream.

Key Takeaway

The true root cause of the AWS DynamoDB outage was not the publicized race condition, but a non-robust code design in which the entire system crashes when it encounters an undefined input.

Highlights

A critical analysis arguing that AWS's outage report (RCA) is inadequate and that the true root cause is being obscured.

The danger of junior engineers pretending to understand, and the importance of the senior-engineer stance of asking questions without fear of looking ignorant.

An explanation of the DNS record management and load-balancing "tree" structure behind the DynamoDB outage.

Design contradictions in the distributed system of one planner and three enactors (such as the presence of a single point of failure).

The argument that the race condition was merely a trigger, and that the real bug was fragile exception handling that halts the whole system when it references a nonexistent record.

Distrust of AWS's lack of transparency, in contrast with the CrowdStrike and Google outage reports.

Timeline

Defining "understanding" as a professional

As the show's introduction, this section touches on the podcast's Spotify tech-podcast ranking before presenting today's main subjects: the AWS outage and the quality of understanding. Casey warns about the psychological trap in which programmers early in their careers, feeling pressure from those around them, claim "I understand" even when they have not fully grasped something. Pretending to know carries a short-term incentive of looking smart, but it becomes a factor that stunts technical growth. The section opens not with pure technical explanation but with a discussion of an engineer's ethics and learning posture. Drawing on his own experience, he stresses the danger of moving on while leaving ambiguity unresolved.

The courage to look stupid that senior engineers should have

Casey describes his current stance as an experienced programmer: keep asking questions relentlessly until things make sense. He holds to an "I don't mind looking stupid" attitude and argues that you should dig into bugs and performance problems until you are genuinely convinced. Painful experiences of being burned after merely thinking he understood are the source of this relentless pursuit. Rather than settling for a convenient explanation and moving on, it is important to keep suspecting that a hidden true cause may still be lurking. This section lays out a mental model for sharpening an engineer's insight.

Examples of good RCAs, and frustration with DynamoDB's

Casey cites examples of clear, highly transparent root cause analyses (RCAs) from other companies, such as Google's mishandling of an empty field and CrowdStrike's array overflow. Those reports, while not publishing the specific line of code, contained enough information to make clear what kind of mistake the programmers had made. In contrast, he sharply criticizes the DynamoDB RCA that AWS published, and the re:Invent presentation, as vague and lacking concrete technical detail. He points out that people online claiming to "understand" it are really just repeating the official surface-level explanation without reaching the essence of the bug. He believes the technical community should push back when educationally valuable details are withheld.

How DynamoDB's DNS load balancing works

Casey begins a whiteboard walkthrough of how DynamoDB's API endpoint works through DNS. When a user sends a request to a domain like dynamodb.us-east-1.api.aws, the system internally routes it to the right machines using a structure referred to as a "DNS tree". The tree is built from weighted records and exists to perform load balancing automatically, though its details are not public. Guest Adam (a former AWS Hero) agrees that this is a standard technique using Route 53's weighting features. This background is an essential prerequisite for understanding where the fatal bug discussed later actually occurred.

The planner and enactor architecture

This section explains the two roles that continuously update the DNS tree to reflect DynamoDB's health: the "planner" and the "enactors". A single planner process generates new plans (configuration values) one after another, and three enactors take them and apply them to Route 53. Casey questions the single point of failure created by having only one planner, and the unexplained design choice of exactly three enactors. AWS frames the three enactors as fault tolerance, yet if the planner stops the enactors have nothing to do, which feels logically inconsistent. The section highlights some puzzling design decisions in a large distributed system.

The serialization and locking trap

It emerges that when an enactor updates Route 53, a serialization and locking mechanism had been introduced in the name of safety. A DNS record itself was used as the lock flag so that only one enactor could update at a time, but this became the breeding ground for the bug. A pathological case arose in which one enactor, still holding an old plan (say, plan 110), kept failing to acquire the lock and backed off repeatedly for a long time. By the time it finally acquired the lock, another enactor had already applied something like plan 145, so the stale enactor applied the old plan over it. Up to this point the story reads as a race condition, but Casey says this was only the beginning of the real tragedy.
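The pathological back-off sequence described above, a slow enactor finally winning the lock and applying its stale plan, compresses to a few lines. This is a toy replay of the episode's description, with hypothetical plan names, not AWS's code:

```python
# Record of plans applied to the shared endpoint, in order.
applied = []

def try_apply(enactor_plan, lock_free):
    """One attempt by an enactor to apply the plan it is holding."""
    if not lock_free:
        return False          # back off and retry later with the SAME old plan
    applied.append(enactor_plan)
    return True

# The slow enactor, holding plan-110, keeps losing the lock...
try_apply("plan-110", lock_free=False)
# ...while a faster enactor applies the newer plan...
try_apply("plan-145", lock_free=True)
# ...and when the slow one finally wins, it applies its stale plan on top.
try_apply("plan-110", lock_free=True)

# applied[-1] is "plan-110": the endpoint now points at an old plan,
# which the cleanup pass then deletes, leaving the pointer dangling.
```

The lock serializes the writes but never invalidates the plan a waiting enactor is holding, which is why serialization alone did not make the updates safe.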

Cascading crashes and the true root cause

Right after the old plan was applied, a cleanup process deleted that record as "too old", leaving the endpoint pointing at nothing. The truly fatal problem, however, was a design in which an enactor crashes while setting the rollback record if the plan it references no longer exists. Because of this fragile code, all three enactors terminated abnormally one after another, and DynamoDB stayed completely down for hours until humans manually recovered it. Casey asserts that code which lets the whole system die rather than handling the failure is the "real bug" that should have been fixed. He concludes that the official report's glossing over this lack of robustness with the words "race condition" is a lost educational opportunity.
