Skype may have built a scalable network, but they still have their work cut out in adding fault tolerance, resilience and security to their infrastructure, as their most recent outage over Christmas Eve shows. This was their first outage in the last three years and, unluckily for them, it came on the heels of a planned IPO. I am sure their brand identity took a hit.
The event is a classic example of all that can go wrong when you rely on user machines to build a critical piece of your infrastructure (in their case, supernodes). Hopefully, Skype and other services will realize that they are increasingly becoming similar to ISPs and hence must incorporate network monitoring and security solutions into their infrastructure.
The following article from Skype provides a detailed analysis of what went wrong. It reads very much like dominoes falling one after another. First, a few Skype servers became overloaded, which caused some Skype client versions to hang while waiting for the server to reply and then crash. The crashed clients restarted and attempted to reconnect to supernodes all at once, so the supernodes themselves became overloaded and crashed. The clients then fell back to alternate supernodes, which crashed in turn.
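The domino effect can be illustrated with a toy model. The sketch below is not Skype's actual architecture: it assumes a fixed client population spread evenly over the live supernodes, with each supernode crashing once its share exceeds its capacity. The function name, the capacities and the client count are all made up for illustration.

```python
def simulate_cascade(capacities, total_clients):
    """Return the number of live supernodes after each round of failures.

    Hypothetical model: clients are spread evenly over the live supernodes,
    and any supernode whose capacity is below its share crashes.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        load_per_node = total_clients / len(alive)
        survivors = [c for c in alive if c >= load_per_node]
        if len(survivors) == len(alive):
            break  # every remaining supernode can absorb its share
        alive = survivors  # crashed nodes shed their clients onto the rest
        history.append(len(alive))
    return history


# Ten supernodes comfortably serve 900 clients (90 each).
healthy = simulate_cascade([100] * 6 + [150] * 4, 900)

# Lose just two of the smaller supernodes and the rest topple in two rounds.
degraded = simulate_cascade([100] * 4 + [150] * 4, 900)
```

In this model the loss of two nodes out of ten is enough to push the per-node load past the weaker nodes' capacity, and their failure then overloads the stronger ones. The real system is far more complex, but the tipping-point behaviour is the same in spirit.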
Can attackers bring down Skype via DDoS?
This incident exposes an Achilles heel in Skype’s infrastructure that a clever adversary could exploit. While the current outage was a classic case of the “flash crowd” effect, any flash crowd can be reproduced deliberately as a DDoS attack. For instance, an adversary only has to figure out which users’ machines are supernodes and overload a set of them with a traffic deluge. Given that Skype does not currently have systems in place to prevent this kind of flapping (or hysteresis effect), users will easily be switched to other supernodes, which will in turn be brought down by the traffic deluge.
What could Skype have done to prevent this outage and future ones?
Clearly, Skype needed an analytics and measurement system in place yesterday. This can be as simple as a monitoring daemon that gathers “anonymized” statistics about CPU, network, and memory utilization on each Skype client and, more importantly, on each client that has been elevated to supernode status. These statistics should then be aggregated hierarchically, using the same overlay network that Skype uses to route calls (called Global Index). Given that the statistics are only used for health monitoring, it is possible to save network bandwidth by aggregating them in a lossy manner, e.g. by using compact probabilistic summaries such as Bloom filters and related sketches. Finally, the statistics should be collected at a central server or database, where time-series forecasting techniques can be used to determine whether the aggregate CPU, network or memory utilization of Skype’s infrastructure is normal or above normal. Had such a system been in place, it could have alerted Skype in advance to the problems about to befall their network.
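A minimal sketch of such a pipeline follows, under several assumptions. Instead of Bloom filters, it uses an even simpler lossy summary (count, sum and peak) that can be merged at every level of the hierarchy, and it uses an exponentially weighted moving average as a stand-in for the forecasting step. The class names, the smoothing factor and the alarm tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Lossy, mergeable summary of one metric (e.g. CPU utilization in %)."""
    count: int = 0
    total: float = 0.0
    peak: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.peak = max(self.peak, value)

    def merge(self, other: "Summary") -> "Summary":
        # Merging only needs three numbers, however many clients sit below.
        return Summary(self.count + other.count,
                       self.total + other.total,
                       max(self.peak, other.peak))

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def aggregate(children):
    """Combine the summaries reported by child nodes into one summary."""
    result = Summary()
    for child in children:
        result = result.merge(child)
    return result


class EwmaDetector:
    """Raises an alarm when a reading exceeds the smoothed forecast."""

    def __init__(self, alpha: float = 0.3, tolerance: float = 1.5):
        self.alpha = alpha          # assumed smoothing factor
        self.tolerance = tolerance  # assumed alarm ratio over the forecast
        self.forecast = None

    def observe(self, value: float) -> bool:
        if self.forecast is None:
            self.forecast = value  # first reading seeds the forecast
            return False
        alarm = value > self.tolerance * self.forecast
        self.forecast = self.alpha * value + (1 - self.alpha) * self.forecast
        return alarm


# Two supernodes summarize their clients; a parent merges the summaries.
node_a, node_b = Summary(), Summary()
node_a.add(20.0)
node_a.add(40.0)
node_b.add(90.0)
root = aggregate([node_a, node_b])

# The central server tracks the aggregate mean over time.
detector = EwmaDetector()
alarms = [detector.observe(v) for v in [30.0, 32.0, 31.0, 70.0]]
```

Because each summary is a fixed size, the bandwidth cost per overlay link stays constant no matter how many clients are below it. A production system would likely use a richer forecasting model with seasonality, since call volume varies strongly by time of day.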
Learning from BGP route-dampening:
Skype’s supernode oscillations are evocative of another set of oscillation issues that the networking industry has dealt with in the past: BGP route flapping, route oscillations and route convergence. BGP, or Border Gateway Protocol, is the Internet’s premier inter-domain routing protocol, and when a router decides to prefer one route over another, it should not do so without considering the global implications of its decision. For instance, in Figure 1 below, suppose router R3 advertises to the rest of the Internet that the best way to reach it is via router R2. Now imagine that during an increased traffic onslaught, the link between R2 and R3 goes down due to heartbeat (keepalive) failures on the TCP session established between the two routers. At that point, R3 may be tempted to advertise R1 as the best router through which to reach it; however, that would simply shift the traffic deluge from the link R2-R3 to the link R1-R3. After some time, it is highly likely that the link R1-R3 also goes down, and R3 then switches back to router R2. As you can imagine, this see-saw can keep going ad infinitum, and that is why techniques like “route flap dampening” were invented. Luckily for Skype, there is a vast literature on route flap dampening and oscillation prevention that they can learn from.
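To make the idea concrete, here is a minimal sketch of penalty-based flap dampening in the style used for BGP, applied to a single route (or, by analogy, a single supernode). Each flap adds a fixed penalty, the penalty decays exponentially with a half-life, and the route is suppressed above one threshold and only reused once the penalty has decayed below a lower one. The class name is hypothetical, and the default values are illustrative numbers modelled on commonly cited router defaults, not anything Skype uses.

```python
class FlapDamper:
    """Tracks a decaying penalty for one route or supernode."""

    def __init__(self, penalty_per_flap=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every `half_life_s` seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def record_flap(self, now):
        """Call whenever the route goes down or changes."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def is_usable(self, now):
        """True if the route may be advertised or selected at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False
        return not self.suppressed


# Three flaps within two minutes push the penalty over the suppress limit.
damper = FlapDamper()
for t in (0.0, 60.0, 120.0):
    damper.record_flap(t)
usable_now = damper.is_usable(120.0)      # suppressed right after the flaps
usable_later = damper.is_usable(3000.0)   # penalty has decayed below reuse
```

The gap between the suppress and reuse thresholds is what provides hysteresis: a node that has just been flapping is not trusted again the moment it comes back, which breaks the see-saw described above.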