The Internet Is Full of Zombies and Nobody Is Talking About It

Better Stack
Internet Technology, Computing/Software

Transcript

00:00:00Did you know that the Internet is infested with zombies? No, not those kind of zombies.
00:00:05The ones I'm talking about are called BGP zombies. BGP stands for Border Gateway Protocol
00:00:12and it is the way big networks announce which IP addresses they can deliver traffic to.
00:00:18And the Internet is connected by this huge global map that tells networks how to reach each other
00:00:24and all of this is maintained through the BGP system. And for the most part,
00:00:29this system runs smoothly, but sometimes unexpected zombies appear in the system.
00:00:35But why does that happen? Well, that's what we're going to find out in today's video.
00:00:39So in the BGP system, when a network wants traffic to reach a new location,
00:00:49it advertises a route. When it wants traffic to stop coming through an old location,
00:00:54it withdraws that route. This happens all the time. Networks shift traffic between data centers,
00:01:00move customers to new edges, or take servers offline for maintenance.
00:01:04Withdrawing a route is simply how they tell the rest of the world that the path is no longer valid.
00:01:09But sometimes something strange happens. A route gets withdrawn,
00:01:13yet some networks keep believing it still exists. They continue sending traffic down a path that
00:01:20should be gone. And this is called a BGP zombie. It's an outdated route that refuses to disappear
00:01:26from the global routing table, even though the network that created it has already removed it.
00:01:32So what happens to the traffic when this zombie is alive? It does not reach its destination. It
00:01:38might loop between routers for a moment before getting dropped. It might also take a long detour
00:01:43across several networks until it reaches a dead end. Or it might land on a network that tries to
00:01:49forward it but still can't deliver it anywhere useful. From a user's perspective, this could
00:01:55translate to a page hanging or timing out, or an app failing to connect for a short period.
00:02:01Sometimes it's barely noticeable. But other times the slowdown is very visible.
00:02:06The next logical question is why routers fail to update the global map right away.
00:02:11The answer comes down to how BGP processes changes. When a more specific route disappears,
00:02:17routers search for a less specific fallback. That search takes time. During that window,
00:02:23some routers fail to clear the old entry. They get stuck with stale information. Cloudflare
00:02:29observed that these zombies lasted somewhere between 6 and 11 minutes in large networks.
00:02:34IP version 4 zombies tended to survive even longer than IP version 6 ones. But eventually,
00:02:40the system corrects itself because every route in BGP has a timer on it. If a router does not
00:02:46receive fresh updates for a while, it deletes the route automatically. Even if a router misses
00:02:52the withdrawal the first time, ongoing BGP chatter from its neighbors will eventually
00:02:57inform it that the route is gone. Once enough surrounding routers agree on the new state,
00:03:03the zombie disappears. Cloudflare discovered this behavior while working with BYOIP or
00:03:09Bring Your Own IP customers. In these situations, Cloudflare temporarily advertises a customer's
00:03:15IP space and then withdraws it after the handoff. The withdrawal itself is supposed to be clean,
00:03:21but instead they saw that sometimes some providers continue using the old route long after it was
00:03:27gone. That mismatch caused the traffic to take unexpected and inefficient paths into Cloudflare's
00:03:33network. To fix the problem, Cloudflare introduced a safer method. Instead of withdrawing the old
00:03:38route outright, they first announced the same route from a stable location. That forces routers
00:03:45around the world to switch cleanly to the new version. Only then do they withdraw the old
00:03:50announcement. This prevents the fallback surge that causes zombies in the first place. Cloudflare also
00:03:56tuned their internal systems so the transitions happen more smoothly in the future. If you want
00:04:02to read more about this topic, Cloudflare published a very detailed blog post explaining this issue.
00:04:07So in conclusion, it is a reminder that even the most fundamental parts of the internet can behave
00:04:13unexpectedly under certain conditions. And yet, most of the time, these issues are resolved before
00:04:19users notice anything. But sometimes zombies might appear. The internet is held together by millions
00:04:25of routing decisions happening every second. And occasional surprises like BGP zombies show just how
00:04:31much coordination is needed to keep everything running smoothly. So that's basically it. Now
00:04:37you know what BGP zombies are. The next time something hangs or fails to load, you might just
00:04:43have encountered a zombie on the internet. If you like technical breakdowns like these, be sure to
00:04:48smash that like button underneath the video. And don't forget to subscribe to our channel. This has
00:04:53been Andris from Better Stack and I will see you in the next videos.

Key Takeaway

BGP zombies are stale routing entries that cause internet traffic to fail or loop, but they can be mitigated by announcing stable paths before withdrawing old ones.

Highlights

Introduction to BGP zombies

Timeline

Introduction to BGP and the Zombie Concept

The speaker introduces the concept of "BGP zombies" as a hidden phenomenon affecting the internet's infrastructure. BGP, or Border Gateway Protocol, is defined as the system that manages the global map of IP addresses and traffic delivery. The speaker explains that while this system usually runs smoothly, these "zombies" represent unexpected glitches in how networks communicate. This section establishes the fundamental role of BGP in keeping the global internet connected and functional. It sets the stage for a deeper technical dive into why these routing errors occur.

The Mechanism of Route Withdrawal

In the BGP system, networks advertise new routes to attract traffic and withdraw old ones when a path is no longer valid. This process is standard for maintenance, moving customers to new edges, or shifting traffic between data centers. However, a BGP zombie occurs when a route is officially withdrawn but some networks continue to believe it exists. These outdated routes refuse to disappear from the global routing table despite the original network removing them. This mismatch between the actual network state and the routing table is the core definition of the zombie problem.
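The announce/withdraw cycle described above can be sketched as a toy model. This is not real BGP: the `Router` class, the `drops_withdrawals` flag, and the `find_zombies` helper are hypothetical names invented for illustration, and the prefix comes from a documentation-only address range. The sketch only shows how a single router that never processes a withdrawal ends up holding a zombie route.

```python
class Router:
    """Toy router that stores one next hop per prefix (hypothetical model, not real BGP)."""

    def __init__(self, name, drops_withdrawals=False):
        self.name = name
        # Assumption for illustration: a faulty router silently ignores withdrawals.
        self.drops_withdrawals = drops_withdrawals
        self.routes = {}  # prefix -> next hop

    def receive_announce(self, prefix, next_hop):
        self.routes[prefix] = next_hop

    def receive_withdraw(self, prefix):
        if self.drops_withdrawals:
            return  # the stale entry stays in the table: a zombie
        self.routes.pop(prefix, None)


def find_zombies(routers, withdrawn_prefix):
    """Return the names of routers that still hold a route for a withdrawn prefix."""
    return [r.name for r in routers if withdrawn_prefix in r.routes]


PREFIX = "203.0.113.0/24"  # documentation-only prefix
routers = [Router("A"), Router("B", drops_withdrawals=True), Router("C")]

for r in routers:
    r.receive_announce(PREFIX, "origin")
for r in routers:
    r.receive_withdraw(PREFIX)

zombies = find_zombies(routers, PREFIX)
print(zombies)  # ['B']
```

Routers A and C clear the entry, while B keeps forwarding toward a path the origin has already removed.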

Impact on Traffic and User Experience

When traffic follows a BGP zombie, it fails to reach its intended destination because the advertised path is no longer valid. The traffic might loop between routers for a moment before being dropped, take a long detour across several networks until it hits a dead end, or land on a network that tries to forward it but cannot deliver it anywhere useful. For the end user, this shows up as hanging web pages, connection timeouts, or apps failing to connect for brief periods. Some instances are barely noticeable, while others cause a clearly visible slowdown.
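A short sketch of why a looping packet is eventually dropped rather than circulating forever. It assumes a simplified hop limit in the spirit of the IP TTL field: every forwarding step uses up one unit, and the packet is discarded once none are left. The `forward` function and the router names are hypothetical.

```python
def forward(next_hop, start, destination, ttl=64):
    """Follow next-hop entries until delivery, a missing route, or TTL exhaustion."""
    path = [start]
    current = start
    while current != destination:
        if ttl == 0:
            return path, "dropped: TTL expired"
        nxt = next_hop.get(current)
        if nxt is None:
            return path, "dropped: no route"
        ttl -= 1  # each hop consumes one unit of the hop limit
        current = nxt
        path.append(current)
    return path, "delivered"


# Healthy state: R1 -> R2 -> origin
healthy = {"R1": "R2", "R2": "origin"}
ok_path, ok_status = forward(healthy, "R1", "origin")

# Zombie state (assumed scenario): R1 still holds the stale route via R2,
# while R2 has already fallen back to a path that points back at R1.
zombie = {"R1": "R2", "R2": "R1"}
loop_path, loop_status = forward(zombie, "R1", "origin", ttl=6)

print(ok_status)    # delivered
print(loop_status)  # dropped: TTL expired
```

With a hop limit of 6, the packet bounces between R1 and R2 six times and is then discarded, which matches the "loop for a moment, then drop" behavior described above.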

Why Routers Get Stuck with Stale Data

The delay in updating the global map stems from how BGP processes changes when a more specific route disappears. Routers must search for a less specific fallback route, and during that window some of them fail to clear the old, stale entry. Cloudflare observed that these zombies lasted between 6 and 11 minutes in large networks, and that IPv4 zombies tended to survive longer than IPv6 ones. Eventually the system corrects itself: routes that stop receiving fresh updates are timed out and deleted, and ongoing BGP updates from neighboring routers inform any router that missed the withdrawal that the route is gone.
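The fallback from a more specific route to a less specific one is a consequence of longest-prefix matching. The following minimal sketch uses Python's standard `ipaddress` module; the table contents, the next-hop labels, and the `lookup` helper are assumptions made up for illustration, not Cloudflare's configuration.

```python
import ipaddress


def lookup(table, address):
    """Return the next hop of the most specific prefix that contains the address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return table[best]


specific = ipaddress.ip_network("198.51.100.0/24")
covering = ipaddress.ip_network("198.51.0.0/16")
table = {specific: "specific-path", covering: "fallback-path"}

before = lookup(table, "198.51.100.7")

# Withdrawing the more specific route leaves only the covering prefix.
del table[specific]
after = lookup(table, "198.51.100.7")

print(before)  # specific-path
print(after)   # fallback-path
```

A router that has not yet removed the `/24` entry keeps answering `specific-path` for the same address, which is the stale state the section describes.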

Cloudflare's Solution and Best Practices

Cloudflare discovered this issue while working with Bring Your Own IP (BYOIP) customers, where it temporarily advertises a customer's IP space and withdraws it after the handoff. Some providers kept using the old route long after the withdrawal, sending traffic into Cloudflare's network over unexpected and inefficient paths. To fix this, Cloudflare stopped withdrawing the old route outright. It first announces the same route from a stable location, which makes routers switch cleanly to the new announcement, and only then withdraws the old one. This avoids the fallback surge that creates zombies in the first place, and Cloudflare also tuned its internal systems for smoother transitions. The video concludes that even fundamental parts of the internet can behave unexpectedly, although most such issues resolve before users notice anything.
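The ordering difference between a plain withdrawal and the announce-first approach can be illustrated with a small event replay. This is only a sketch of the ordering idea, not Cloudflare's tooling: the `has_coverage_gap` helper and the location names `edge-old` and `stable-site` are hypothetical.

```python
def has_coverage_gap(events):
    """Replay announce/withdraw events for one prefix.

    Returns True if, after any event, no location is announcing the prefix,
    which is the moment routers would have to hunt for a fallback route.
    """
    origins = set()
    for action, origin in events:
        if action == "announce":
            origins.add(origin)
        elif action == "withdraw":
            origins.discard(origin)
        else:
            raise ValueError(f"unknown action: {action}")
        if not origins:
            return True
    return False


# Plain withdrawal: the only announcement disappears at once.
naive = [("announce", "edge-old"), ("withdraw", "edge-old")]

# Announce-first: a stable location takes over before the old one is withdrawn.
safe = [
    ("announce", "edge-old"),
    ("announce", "stable-site"),
    ("withdraw", "edge-old"),
]

print(has_coverage_gap(naive))  # True
print(has_coverage_gap(safe))   # False
```

In the second sequence the prefix is always announced from at least one location, so routers always have a current route to converge on.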
