scaling and scoping the NYT paywall

April 1, 2011 in engineering, Internet, networks, press

Amusingly for a comment on scalability, I couldn’t post this on D^2’s thread because Blogger was in a state. Anyway, it’s well into the category of “comments that really ought to be posts” so here goes. So various people are wondering how the New York Times managed to spend $50m on setting up their paywall. D^2 reckons that they’re overstating, for basically cynical reasons. I think it’s more fundamental than that.

The complexity of the rules makes it sound like a telco billing system more than anything else – all about rating and charging lots and lots of events in close to real-time based on a hugely complicated rate-card. You’d be amazed how many software companies are sustained by this issue. It’s expensive. The NYT is counting pages served to members (easy) and nonmembers (hard), differentiating between referral sources, and counting different pages differently. Further, it’s got to do it quickly. Latency from the US West Coast (their worst-case scenario) to nytimes.com is currently about 80 milliseconds. User-interface research suggests that people perceive a response as instant at 100ms. Web surfing is a fairly latency-tolerant application, but once you allow for the time the server itself takes to fetch the page, and for the data rate in the last mile restricting how quickly it can be served, there’s a very limited budget of time for the paywall to do its stuff without annoying the hell out of everyone.
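To put rough numbers on that budget, here is a minimal sketch. The 100ms and 80ms figures are the ones quoted above; the server fetch times are pure assumptions for illustration, and the function name is made up.

```python
# Figures quoted in the post.
PERCEIVED_INSTANT_MS = 100.0   # threshold at which a response feels instant
WEST_COAST_LATENCY_MS = 80.0   # worst-case network latency to nytimes.com


def paywall_budget_ms(network_ms, server_ms, target_ms=PERCEIVED_INSTANT_MS):
    """Time left for the paywall check before the response stops feeling instant.

    A negative result means the budget is already blown before the paywall runs.
    """
    return target_ms - network_ms - server_ms


# Hypothetical server fetch times; the real ones are not known.
for server_ms in (5.0, 10.0, 20.0, 40.0):
    left = paywall_budget_ms(WEST_COAST_LATENCY_MS, server_ms)
    print(f"server fetch {server_ms:4.0f} ms -> paywall budget {left:6.1f} ms")
```

Whatever the real server time is, the paywall gets what is left of about 20ms, and last-mile transfer time comes out of the same pot.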

Although the numbers of transactions won’t be as savage, doing real-time rating for the whole NYT website is going to be a significant scalability challenge. Alexa reckons 1.45% of global Web users hit nytimes.com, for example. By comparison, Salesforce.com is 0.4% and that’s already a huge engineering challenge (because it’s much more complicated behind the scenes). There are apparently 1.6bn “Internet users” – I don’t know how that’s defined – so that’s about 23m visitors a day. Even at only one rated event per visitor, spread evenly over the day, that implies that the system must scale to 268 transactions/second (or about 86,400 times the daily reach of my blog!)
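The back-of-envelope sum, as a small sketch. It assumes the Alexa figure is a daily reach and that each visitor generates exactly one rated event, which is what the 268 figure implies; real visitors load more than one page, so treat it as a floor.

```python
INTERNET_USERS = 1.6e9        # "Internet users", however that is defined
NYT_DAILY_REACH = 0.0145      # Alexa: 1.45% of global Web users
SECONDS_PER_DAY = 24 * 60 * 60


def mean_tps(users, reach, events_per_visitor=1.0):
    """Average rated events per second, spread evenly over a day."""
    return users * reach * events_per_visitor / SECONDS_PER_DAY


daily_visitors = INTERNET_USERS * NYT_DAILY_REACH
print(f"daily visitors: about {daily_visitors / 1e6:.1f}m")
print(f"mean load: about {mean_tps(INTERNET_USERS, NYT_DAILY_REACH):.0f} transactions/second")

# events_per_visitor is an assumption; five page views each is a guess, not data.
print(f"at 5 events each: about {mean_tps(INTERNET_USERS, NYT_DAILY_REACH, 5.0):.0f}/second")
```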

A lot of those will be search engines, Internet wildlife, etc, but you still have to tell them to fuck off and therefore it’s part of your scale & scope calculations. Those 268 transactions/second are about a tenth of HSBC’s online payments processing in 2007, IIRC, or a twentieth of a typical GSM Home Location Register. (The usual rule of thumb for those is 5 kilotransactions/second.) But – and it’s the original big but – you need to provision for the peak. Peak usage, not average usage, determines scale and cost. Even if your traffic distribution was weirdly well-behaved and followed a normal distribution, you’d encounter an over-95th-percentile event one day in every 20. And network traffic doesn’t, it’s usually more, ahem, leptokurtic. So we’ve got to multiply that by their peak/mean ratio.
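To see how much the shape of the distribution matters, here is a rough simulation comparing the peak/mean ratio of a well-behaved normal load with a fat-tailed one. Everything in it is an assumption for illustration: the standard deviation, the choice of a lognormal as the heavy-tailed stand-in, and the 99.9th percentile as "the peak". None of it is NYT data.

```python
import random
import statistics


def peak_to_mean(samples, percentile=0.999):
    """Ratio of a high percentile of the load to its mean."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] / statistics.fmean(ordered)


rng = random.Random(42)  # fixed seed so runs are repeatable
n = 50_000

# Well-behaved case: normal around the 268/second mean, sigma of 50 is a guess.
normal_load = [max(0.0, rng.gauss(268.0, 50.0)) for _ in range(n)]
# Fat-tailed case: lognormal, which has positive excess kurtosis.
heavy_load = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]

for name, load in (("normal", normal_load), ("fat-tailed", heavy_load)):
    ratio = peak_to_mean(load)
    print(f"{name:>10}: peak/mean {ratio:5.1f} -> provision for about {268 * ratio:,.0f}/second")
```

The ratio is scale-free, so the fat-tailed samples don't need rescaling to the same mean before comparing.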

And it’s a single point of failure, so it has to be robust (or at least fail to a default-open state but not too often). I for one can’t wait for the High Scalability article on it.
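Failing to a default-open state could look something like the sketch below: if the entitlement check errors or runs out of time, the reader gets the page anyway. The function name, the thread pool and the 50ms default timeout are all hypothetical; nothing here is known about how the NYT actually does it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def allow_request(check, timeout_s=0.05):
    """Return check()'s verdict, or True (fail open) if it errors or is too slow.

    `check` is any zero-argument callable returning True to serve the page
    and False to show the paywall.
    """
    future = _pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_s))
    except FutureTimeout:
        return True   # too slow: serve the page rather than stall the reader
    except Exception:
        return True   # paywall backend broken: default open


def slow_check():
    time.sleep(0.2)   # simulates an overloaded rating backend
    return False


print(allow_request(lambda: False, timeout_s=1.0))  # backend answers: paywall shown
print(allow_request(slow_check, timeout_s=0.01))    # backend too slow: page served
```

A real version would also count how often it fails open, since every one of those is a page given away for free.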

So it’s basically similar in scalability, complexity, and availability to a decent-sized MVNO’s billing infrastructure, and you’d be delighted to get away with change from £20m for that.

4 Comments

Dennis

April 1, 2011 at 4:00 pm

Pretty sure you’re overstating the case here, for a few reasons:

(1) The NYTimes front page takes 3.1 seconds to load from my wifi connection in Manhattan. YSlow gives it a C — among other things, they’re not even using a CDN. Given the way the paywall works (applying extra CSS and Javascript to a loaded page), there’s no reason you need to do all the backend stuff in small numbers of milliseconds.

(2) HSBC payments and Salesforce are not great examples. Salesforce competes against desktop software — web *applications* need to be crazy fast to not feel slow and lame since you’re interacting with them constantly. Again, note from your link that Salesforce uses CDNs like crazy and the NYT doesn’t, suggesting they’re not super-concerned about every millisecond of load time.

Payments, on the other hand, are a whole other thing. They need to be (a) agreed by several parties, (b) securely, (c) correctly. The NYT can miss out all three.

The key point here is that loading a static webpage (and the NYT is fundamentally static; comments on articles and social crap can be loaded more slowly, later, and sometimes on demand) is an easy problem with lots of existing things that make it go faster. I tend to think that the big problem would just be designing a database that can handle hundreds of thousands of transactions per second (for a 1000x peak/average ratio), which just isn’t that many.

duaneg

April 3, 2011 at 7:38 pm

It doesn’t seem all that complicated to me — certainly nothing like as complicated as a telco billing system — but then I haven’t read the spec and these things always *seem* easy enough.

I agree with most of what Dennis said, but I doubt the DB needs to be anywhere near that fast *per-node*. The workload should be very parallelizable and you would hope that they could scale by simply filling racks.

To make sense of the figures I think we need to know what they include. Presumably hardware and software licensing. What about rackspace in datacentres? DR facilities? Does it include a support and maintenance contract, or the cost of doing the same in-house? For how long?

cianoconnor

April 4, 2011 at 12:47 pm

Salesforce has all kinds of transactional/ACID requirements that aren’t there for the NYT.

On the other hand, the NYT IT team are apparently pretty crap, and crapness can seriously raise the cost of these kinds of things.

Barry Freed

April 4, 2011 at 7:24 pm

There’s a Slashdot story and someone there has mentioned your post: http://news.slashdot.org/comments.pl?sid=2067458&cid=35706624


Alternate Seat of TYR