YouTube and the "series of tubes"
Looking at this inspiring achievement, I fell to wondering exactly how YouTube is serving up its videos. Now, so far, I can remember seeing YouTube content from hostnames of the form lax-vXX.lax.youtube.com or ash-vXX.ash.youtube.com, where XX stands for an arbitrary number. Clearly those point to either Los Angeles or Ashburn, Virginia, where the big Equinix East Coast IX is located.
What interests me, though, is whether or not YouTube makes any effort to serve files topologically close to the user. They don’t use any multicast or CDN-ing, so do they just dump the stuff out there, or do they traffic-engineer at all? Now, being in the UK, I’d expect to get ash.youtube.com all the time, but I don’t – it seems to be about equally likely to come from Ashburn or LA. Now, if you were trying to shorten the haul, a simple way of doing it would be to replicate content between the two data centres and do some form of traffic engineering. That would also allow fail-over between them, which is nice.
YouTube might be doing this, and something else as well. For example, they might be working on the principle of serving from the nearest data centre, but also load-balancing across them, so you would get the nearest unless it’s congested. Alternatively, they might just be accepting uploads at whichever data centre and then pouring them forth from there.
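To make the "nearest unless it's congested" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the DataCentre record, the round-trip times, the load figures and the 0.8 congestion threshold are made up for illustration, and nothing here is known to be how YouTube actually does it.

```python
from dataclasses import dataclass


@dataclass
class DataCentre:
    name: str
    rtt_ms: float  # hypothetical round-trip time from the user
    load: float    # hypothetical utilisation, 0.0 (idle) to 1.0 (full)


def pick_data_centre(centres, max_load=0.8):
    """Return the nearest data centre whose load is below max_load.

    If every centre is congested, fall back to the least-loaded one.
    """
    usable = [c for c in centres if c.load < max_load]
    if usable:
        return min(usable, key=lambda c: c.rtt_ms)
    return min(centres, key=lambda c: c.load)


# A UK user: Ashburn is nearer, but in this made-up case it is congested.
centres = [
    DataCentre("ash", rtt_ms=90.0, load=0.95),
    DataCentre("lax", rtt_ms=150.0, load=0.40),
]
print(pick_data_centre(centres).name)
```

With these invented numbers the user is sent to LA even though Ashburn is closer, which is one possible explanation for seeing both hostnames from the UK.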
Now, I seem to recall seeing somewhere that they account for 20 Gbit/s of outbound traffic, rising fast. That was back in June last year. Their own blog claims 45 terabytes a day, i.e. about 520 MBytes a second, or a little over 4 Gbit/s. What would be interesting to know would be the average number of times one of their items is viewed, which would give an idea of the net imbalance between their upstream and downstream traffic. In so far as the two match, YouTube could cover this through peering – after all, it might as well be a fair-sized ISP. But the excess outbound traffic is what they have to pay for.
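The unit conversion is easy to get wrong, so here is a quick check in Python. It assumes decimal terabytes (10**12 bytes) and traffic spread evenly over the day, which real traffic certainly isn't; the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume_to_rate(terabytes_per_day):
    """Convert TB/day (decimal, 10**12 bytes) to (bytes/s, bits/s)."""
    bytes_per_second = terabytes_per_day * 10**12 / SECONDS_PER_DAY
    return bytes_per_second, bytes_per_second * 8


bytes_per_s, bits_per_s = daily_volume_to_rate(45)
print(f"{bytes_per_s / 1e6:.0f} MBytes/s, {bits_per_s / 1e9:.2f} Gbit/s")
```

On that basis, 45 terabytes a day averages out at roughly 520 MBytes a second, or a bit over 4 Gbit/s.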
Now, I did a quick check on the selection of “most viewed” videos. With the top one being viewed 553,934 times and the bottom 6,837, I used the Malatesta estimator to arrive at an estimate of 196,270 views for a top-100 video. Supposedly, 100 million clips a day are accessed, but it’s not clear whether those are unique – does anyone know how many are on there? But if the top 100 accounts for, say, 60 per cent of the viewing, we’d be looking at a figure of, say, 270,000 views per vid, with a highly skewed distribution.
To put it another way, YouTube is a giant copying machine that kicks out 270,000 bytes for every byte it takes in. Call it the content replication factor. Because the replication takes place at source, and the replicated traffic has to be carried over the backbone network, this implies that essentially all YouTube’s traffic requirement must be covered by paid-for transit, which costs about $20 per Mbit/s per month at this scale. At the 20 Gbit/s mentioned above, that would be, ouch, about $400,000 a month…
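For anyone who wants to play with the numbers, the transit arithmetic looks like this in Python. It is a rough sketch that assumes a flat price per Mbit/s per month on a steady sustained rate; real transit contracts are usually billed on something like the 95th percentile, which this ignores. The function names are made up for the example.

```python
PRICE_PER_MBIT_MONTH = 20.0  # dollars; the transit price quoted in this post


def monthly_transit_cost(gbit_per_s, price=PRICE_PER_MBIT_MONTH):
    """Monthly bill in dollars for a sustained rate given in Gbit/s."""
    return gbit_per_s * 1000 * price


def rate_for_budget(dollars_per_month, price=PRICE_PER_MBIT_MONTH):
    """Sustained rate in Gbit/s that a monthly budget would buy."""
    return dollars_per_month / price / 1000


print(monthly_transit_cost(20))    # 20 Gbit/s of paid transit
print(rate_for_budget(1_000_000))  # what $1 million a month would buy
```

At $20 per Mbit/s, 20 Gbit/s comes to $400,000 a month, and a million dollars a month buys 50 Gbit/s.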
And at that price, you certainly don’t want things like this happening:
11 t3-1.mpd01.dca01.atlas.cogentco.com (18.104.22.168) 165.207 ms 108.370 ms 112.946 ms
12 t9-3.mpd01.iah01.atlas.cogentco.com (22.214.171.124) 186.815 ms 186.259 ms 191.197 ms
13 t7-1.mpd01.lax01.atlas.cogentco.com (126.96.36.199) 231.125 ms 187.475 ms 188.842 ms
14 t4-2.mpd01.lax05.atlas.cogentco.com (188.8.131.52) 187.148 ms 187.109 ms 191.818 ms
15 g0-3.na21.b015619-0.iah01.atlas.cogentco.com (184.108.40.206) 174.083 ms 175.773 ms 174.523 ms
16 10.254.254.233 (10.254.254.233) 350.527 ms 278.645 ms 202.520 ms
17 ash-v83.ash.youtube.com (220.127.116.11) 184.436 ms 183.616 ms 182.910 ms
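To see the detour at a glance, here is a small Python sketch that pulls the site codes out of the Cogent hostnames in the trace above. It assumes the three letters before the two digits are airport-style location codes (dca for the Washington DC area, iah for Houston, lax for Los Angeles), which is my reading of the naming scheme rather than anything documented here. Only hop numbers and hostnames are copied in; the function names are invented for the example.

```python
import re

# Hop numbers and hostnames copied from the traceroute above.
TRACE = """\
11 t3-1.mpd01.dca01.atlas.cogentco.com
12 t9-3.mpd01.iah01.atlas.cogentco.com
13 t7-1.mpd01.lax01.atlas.cogentco.com
14 t4-2.mpd01.lax05.atlas.cogentco.com
15 g0-3.na21.b015619-0.iah01.atlas.cogentco.com
17 ash-v83.ash.youtube.com
"""

# Assumed pattern: ".<3-letter site><2 digits>.atlas.cogentco.com"
SITE_RE = re.compile(r"\.([a-z]{3})\d{2}\.atlas\.cogentco\.com")


def site_path(trace_text):
    """Return the sequence of site codes, collapsing consecutive repeats."""
    sites = []
    for line in trace_text.splitlines():
        match = SITE_RE.search(line)
        if match and (not sites or sites[-1] != match.group(1)):
            sites.append(match.group(1))
    return sites


def revisits_a_site(sites):
    """True if the path returns to a site it has already left."""
    return len(set(sites)) < len(sites)


path = site_path(TRACE)
print(" -> ".join(path))
print("backtracks:", revisits_a_site(path))
```

On this reading, the packets go from the DC area to Houston, on to Los Angeles, and back to Houston before reaching a host named for Ashburn.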