Building Scalable Web Sites

Cal Henderson


A guide to developing Web sites using scalable applications.


Mentioned in questions and answers.

Before you answer this I have never developed anything popular enough to attain high server loads. Treat me as (sigh) an alien that has just landed on the planet, albeit one that knows PHP and a few optimisation techniques.


I'm developing a tool in PHP that could attain quite a lot of users, if it works out right. However while I'm fully capable of developing the program I'm pretty much clueless when it comes to making something that can deal with huge traffic. So here's a few questions on it (feel free to turn this question into a resource thread as well).

Databases

At the moment I plan to use the MySQLi features in PHP5. However, how should I set up the databases in relation to users and content? Do I actually need multiple databases? At the moment everything's jumbled into one database, although I've been considering moving user data to one, actual content to another and finally core site content (template masters etc.) to a third. My reasoning is that sending queries to different databases will ease the load on each, since right now one database carries all three load sources. Also, would this still be effective if they were all on the same server?

Caching

I have a template system that is used to build the pages and swap out variables. Master templates are stored in the database, and each time a template is called its cached copy (an HTML document) is used. At the moment I have two types of variable in these templates: a static var and a dynamic var. Static vars are usually things like page names or the name of the site, things that don't change often; dynamic vars are things that change on each page load.

My question on this:

Say I have comments on different articles. Which is the better solution: store the simple comment template and render comments (from a DB call) each time the page is loaded, or store a cached copy of the comments page as an HTML page and recache it each time a comment is added/edited/deleted?

Finally

Does anyone have any tips/pointers for running a high load site on PHP? I'm pretty sure it's a workable language to use, since Facebook and Yahoo! rely on it heavily, but are there any pitfalls I should watch out for?

I've worked on a few sites that get millions of hits per month backed by PHP & MySQL. Here are some basics:

  1. Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your webserver and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
  2. Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
  3. Reduce the overall number of requests to your webservers. This entails caching JS, CSS and images using expires headers. You can also move your static content to a CDN, which will speed up your users' experience.
  4. Measure & benchmark. Run Nagios on your production machines and load test on your dev/qa server. You need to know when your server will catch on fire so you can prevent it.
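
As a rough illustration of point 1, here is a minimal cache-aside sketch in Python. It keeps values in a plain dictionary with a time-to-live; in production the store would be something like Memcache, whose client API is not shown here. The class name `TTLCache` and its methods are invented for this example.

```python
import time


class TTLCache:
    """Tiny in-process cache: compute on a miss, reuse until the entry expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}        # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # fresh hit: skip the expensive call
        value = compute()       # miss or expired: do the real work once
        self._store[key] = (now + self.ttl, value)
        return value
```

A call such as `cache.get_or_compute("front_page", load_front_page)` would then hit the database at most once per TTL window, where `load_front_page` is whatever expensive function you already have.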

I'd recommend reading Building Scalable Websites, it was written by one of the Flickr engineers and is a great reference.

Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms: http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/

How do you design/architect a scalable application? Any suggestion of books or websites that could help to understand how to scale out applications?

Thanks

Over the past year I've had to come up to speed on this question for a project my company's working on, and I've found these resources extremely helpful: Todd Hoff's highscalability.com; Scalable Internet Architectures, by Theo Schlossnagle; and Building Scalable Web Sites, by Cal Henderson. Highscalability.com in particular will point you to many good presentations, tutorials, books, and papers, and is a great place to start. All of the advice is practical, and based on experience at sites like Flickr, Twitter, and Google.

BTW, scalability is not performance. A perfectly scalable system is one that has a fixed marginal cost to add additional users or capacity.

Most web frameworks and "best practices" are not suitable for very high performance sites and the whitepapers from vendors out there ain't worth the paper they are printed on.

So where should someone look to find books, tutorials or other resources on this subject?

Have a look at Cal Henderson's 'Building Scalable Websites', published by O'Reilly:

http://www.amazon.com/Building-Scalable-Web-Sites-Applications/dp/0596102356

He's the guy behind Flickr.

Also have a look at highscalability.com. They have write-ups of the architectures of some of the most heavily loaded sites out there.

I have done quite a bit of searching on this. It finally boiled down to these three books.

  1. High Performance Websites

  2. Even faster websites

  3. The art of scalability

A number of applications have the handy feature of allowing users to respond to notification emails from the application. The responses are slurped back into the application.

For example, if you were building a customer support system the email would likely contain some token to link the response back to the correct service ticket.

What are some guidelines, hints and tips for implementing this type of system? What are some potential pitfalls to be aware of? Hopefully those who have implemented systems like this can share their wisdom.

Building Scalable Web sites has a nice section on handling email. It's written by a Flickr developer.
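
One way to carry the token mentioned in the question is to put the ticket ID and a short signature into the reply address itself, so an incoming reply can be matched to a ticket and forged addresses can be rejected. The sketch below is an assumption-laden illustration, not the approach from the book: the `reply+<id>.<sig>@` address layout, the `SECRET_KEY` value and the domain are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical secret; keep the real one out of source control


def _signature(ticket_id):
    """Short HMAC-SHA256 signature over the ticket ID."""
    digest = hmac.new(SECRET_KEY, str(ticket_id).encode(), hashlib.sha256)
    return digest.hexdigest()[:16]


def make_reply_address(ticket_id, domain="support.example.com"):
    """Build a per-ticket Reply-To address, e.g. reply+42.<sig>@support.example.com."""
    return f"reply+{ticket_id}.{_signature(ticket_id)}@{domain}"


def ticket_from_address(address):
    """Return the ticket ID (as a string) if the address is validly signed, else None."""
    local = address.split("@", 1)[0]
    if not local.startswith("reply+"):
        return None
    try:
        ticket_id, sig = local[len("reply+"):].split(".", 1)
    except ValueError:
        return None
    expected = _signature(ticket_id)
    # Compare as bytes in constant time; encoding avoids errors on non-ASCII input.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return ticket_id
```

A pitfall worth keeping in mind with any scheme like this: replies can also arrive from auto-responders and bounces, so a valid token says which ticket the mail belongs to, not that a human wrote it.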


I am willing to learn about different architectures of highly scalable web applications like gmail, google, youtube, amazon, orbitz, linkedin, ebay etc. and would certainly appreciate if someone can point me to some online resource/book from where I can learn about details of their architecture and trade offs in selecting a particular design over other.

I've done a bit of web programming (using PHP and MySQL), but nothing too large in scale. I've been thinking about how someone would create a social networking type of site and I've run into some problems.

  • How would you safely and securely store passwords in MySQL? What kinds of encryption would you use?
  • If users were allowed to upload pictures, would it be better to store them in the database or have them uploaded directly to the server?
  • What open source web applications (such as WordPress) would you recommend I read and study (preferably something simple but well written)?

Anything taught in class or written in books just doesn't seem to translate well into real production code. It all seems like very basic examples.

Thanks!

  1. Store a salted hash. I would personally move away from md5 and use something like sha instead. sha1 + salt will hold out for a while =]

  2. If you store the images as blobs in the db, you'll probably have an easier time in the future backing them up (along w/the db, fetching them, etc). But really, they'll be damn fast on the file system too, but I'd prefer them in the database as I have lots of code that interfaces w/the db and I'm comfortable working in that area. That's up to you.

  3. I'm not sure that WordPress will help you to build a social networking site... but it's still good to read others' code. I'd take a look at some books on Amazon on architecture just to get your mind thinking large scale. Also, take a look at some design pattern books.
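
For point 1, here is a minimal sketch of storing a salted hash. Note that it deliberately swaps in PBKDF2-HMAC-SHA256 (a slow key-derivation function available in Python's standard `hashlib`) rather than the plain sha1 + salt mentioned above, since fast hashes are generally considered easier to brute-force. The iteration count and function names are assumptions for the example, not a recommendation from the answer.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune it for your own hardware


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, iterations, digest); store all three alongside the user row."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected_digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(digest, expected_digest)
```

The point is that passwords are hashed, not encrypted: nothing stored in MySQL can be turned back into the original password, only checked against a login attempt.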

I'd also look into something like the Zend Framework or CakePHP. Cake will probably get you up and running rather fast, but I prefer Zend, as it's very powerful and doesn't force you to code in a certain style. CakePHP is kind of like Rails for PHP.

You'll also want to get decent at security, both server and client side, watching for stuff like session hijacking, sql injection, xss, brute force attempts, remote includes, uploaded file exploits, etc.

Social sites offer many attack vectors to crackers.

Resources:

What various methods and technologies have you used to successfully address scalability and performance concerns of a website? I am an ASP.NET web developer exploring .NET remoting with WCF with SQL clustering and am curious as to what other approaches exist (such as the ‘cloud’). In which cases would you apply various approaches (for example method a for roughly x many ‘active’ users).

An example of what I mean, a myspace case study: http://highscalability.com/myspace-architecture

I've worked on a few sites that get millions of hits per month. Here are some basics:

  1. Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your webserver and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
  2. Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
  3. Reduce the overall number of requests to your webservers. This entails caching JS, CSS and images using expires headers. You can also move your static content to a CDN, which will speed up your users' experience.
  4. Measure & benchmark. Run Nagios on your production machines and load test on your dev/qa server. You need to know when your server will catch on fire so you can prevent it.
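
To make point 3 a bit more concrete, this is roughly what far-future caching headers for static files look like when computed by hand. It is a Python sketch rather than ASP.NET configuration, the one-year lifetime is an assumption, and `static_cache_headers` is a made-up helper name; in practice the web server or CDN usually sets these for you.

```python
from email.utils import formatdate

ONE_YEAR = 365 * 24 * 60 * 60  # assumed lifetime for versioned static assets


def static_cache_headers(now, max_age=ONE_YEAR):
    """Headers telling browsers and proxies they may reuse a static file.

    `now` is a Unix timestamp; Expires is rendered as an HTTP date in GMT.
    """
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "Expires": formatdate(now + max_age, usegmt=True),
    }
```

Long lifetimes like this only work if the file name changes when the content does (for example by putting a version number in the path), otherwise users keep the stale copy.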

I'd recommend reading Building Scalable Websites, it was written by one of the Flickr engineers and is a great reference.

Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms: http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/

I'm developing a web site that (like many other sites) uses a bunch of different tools such as php, xml, xsl, json, jquery, css etc.

I'm looking for resources/books that can provide tips on how to use these tools more efficiently. Best practices, tips and tricks and that sort of stuff. For example how to structure files, when to use json instead of xml/xsl, ajax or no ajax and that kind of stuff. Luckily, I don't have to worry about UI design.

Does anyone know any good books/resources that deal with this?

If you are a beginner, consider starting with PHP MySQL Web Development. If you are at an intermediate level, the following are very good references on patterns, scalability and performance:

What does it mean to say - Engineering scalability into applications. Are there design patterns that would make an application more scalable? This question is mainly in the context of web applications or SOA middleware based applications.

Here are some great resources on web application scalability to get you started: Todd Hoff's highscalability.com, Scalable Internet Architectures by Theo Schlossnagle, and Building Scalable Web Sites by Cal Henderson. Highscalability.com will point you to a lot of presentations and articles well worth reading, including this one from Danga about how they scaled LiveJournal as it grew.

When I think about "large scale applications" I think of three very different things:

  1. Applications that will run across a large scale-out cluster (much larger than 1024 cores).

  2. Applications that will deal with data sets that are much larger than physical memory.

  3. Applications that have a very large source base for the code.

Each kind of "scalability" introduces a different kind of complexity, and requires a different set of compromises.

Scale-out applications typically rely on libraries that use MPI to coordinate the various processes. Some applications are "embarrassingly parallel" and require very little (or even no) communication between the different processes in order to complete the task (e.g. rendering different frames of an animated movie). This style of application tends to be performance bound based on CPU clock rates or memory bandwidth. In most cases, adding more cores will increase the "scalability" of the application. Other applications require a great deal of message traffic between the different processes in order to ensure progress toward a solution. This style of application will tend to be bound on the overall performance of the interconnect between nodes. These message-intensive applications may benefit from a very high bandwidth, low latency interconnect (e.g. InfiniBand). Engineering scalability into this style of application begins with minimizing the use of shared files or resources by all the processes.

The second style of scalability covers applications that run on a small number of servers (including a single SMP style server), but that deal with a very large dataset, or a very large number of transactions. Adding physical memory to the system can often increase the scalability of the application. However, at some point physical memory will be exhausted. In most cases, the performance bottleneck will be related to the performance of the disk I/O of the system. In these cases, adding high performance persistent storage (e.g. striped hard drive arrays), or even adding a high performance interconnect to some kind of SAN can help to increase the scalability of the application. Engineering scalability into this style of application begins with algorithmic decisions that will minimize the need to repeatedly touch the same data (or set up the same infrastructure) more than is necessary to complete the task (e.g. open a persistent connection to a database, instead of opening a new connection for each transaction).
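
The closing example in the paragraph above (reuse one database connection instead of opening one per transaction) might look like this in miniature. The sketch uses Python's built-in `sqlite3` with an in-memory database purely so it is runnable; `ConnectionHolder`, `record_transactions` and the `tx` table are invented names, and a real server would more likely use a connection pool.

```python
import sqlite3


class ConnectionHolder:
    """Opens the connection lazily, once, and hands out the same one afterwards."""

    def __init__(self, path=":memory:"):
        self.path = path
        self._conn = None
        self.opened = 0  # counts real connection setups, for illustration

    def get(self):
        if self._conn is None:
            self._conn = sqlite3.connect(self.path)
            self.opened += 1
        return self._conn


def record_transactions(holder, amounts):
    """Insert every amount over the shared connection and return (count, total)."""
    conn = holder.get()
    conn.execute("CREATE TABLE IF NOT EXISTS tx (amount INTEGER)")
    for amount in amounts:
        # Each insert asks the holder again but reuses the existing connection.
        holder.get().execute("INSERT INTO tx (amount) VALUES (?)", (amount,))
    conn.commit()
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM tx").fetchone()
```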

Finally, there is the case of scalability related to the overall size of the source code base. In these instances, good software engineering practices can help to minimize conflicts, and to keep the code base clean. The book Large Scale C++ Software Design was the first one that I encountered that really took on the challenge of providing best practices for large source base software development. The book focuses on C++ as the implementation language, but the guidelines and practices can be applied to any project or language. Engineering scalability into this style of application involves making high level decisions about the structure of the code to minimize dependencies within the code base (e.g. do not have a single .h that when changed will force a rebuild of the entire code base, use a build system that will reuse .o's whenever possible).

I'm in the position where I may be creating a new web service from scratch - without much pre-existing infrastructure to have to contend with. What resources are there that talk about the architectural aspects of deploying a web service? [Clarification: I'm not talking about an Enterprise SOA orientation here - rather setting up one family of services for the public.]

A first list of topics that I'd like to see covered are:

  • SOAP vs. REST
  • JSON vs. XML
  • Relational Database Backed vs. SimpleDB backed vs. ?
  • Scaling
  • Availability
  • Models for restricting access
  • Models for throttling access

What would you recommend?

If you decide to use Microsoft technology (WCF) then you could check out the Microsoft Patterns and Practices group's online library of guidance.

They have a library located here as part of MSDN which deals with Web Service security, Enterprise Buses (obviously not applicable to your scenario) and PAG's own Web Service Software Factory.

Their main page is located here.

Otherwise, assuming you choose WCF it might be worth checking out further reading such as Juval Lowy's book on WCF, although I fear it may cover the implementation more than the theory and design facets.

Do you know roughly what technology platform you'll be working from?

I would recommend Restful Web Services. It's well written, very complete and vendor agnostic. It also has fairly good coverage of REST (with comparison to SOAP/WS-*), HTTP scaling, resource formats (JSON, XHTML, Atom, XML), security and service modeling.

If you have any specific scaling needs, then you might also want to read Building Scalable Web Sites. It will teach you everything worth knowing about etags, proxies, caching, edge computing and so forth. However, if you are just starting out, then the REST book I mentioned earlier will probably cover most people's needs.

We're planning a new API server that will mainly serve JSON responses, with a RESTful interface. We're concerned about scale and availability. Are we on the right track using Restlet with Jetty (or another connector)?

One question we're asking is, is there anything in the Java world like Phusion Passenger? For example, a pre-built solution for keeping server instances up and healthy?

Your question actually is not as much about Restlet as it is about designing a high-scalability, high-availability site. We find that Restlet does scale very well with the right system architecture.

Generally speaking you want to:

  • Run a cluster of web server machines, not just one.
  • Make sure your application is shared nothing, ie, no application state stored in your web servers, if at all possible.
  • Use a load balancer to spread requests to the least loaded web servers.
  • Make sure your JSON responses are cacheable.
  • Add an HTTP reverse proxy cache (eg, Squid) at the border of your site. As the caches between your site and your clients warm up, most of the inbound traffic will be handled by them, and not your web servers.
  • Write your client code to retry requests that fail. This way if a web server dies the next request will be load balanced to a surviving machine.
  • And of course you want to automate your site to bring up crashed web servers, etc. (This is the part that is perhaps better asked on ServerFault.com.)
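
The retry bullet above can be sketched as follows. This is an illustrative Python helper rather than Restlet or Java code: `fetch_with_retry` and the `send` callable are hypothetical, and rotating through a server list is only a stand-in for what the load balancer would do on a retried request.

```python
def fetch_with_retry(servers, send, attempts=3):
    """Try the request up to `attempts` times, moving to the next server on failure.

    `send(server)` is any callable that performs the request and raises
    ConnectionError when the server cannot be reached.
    """
    last_error = None
    for attempt in range(attempts):
        server = servers[attempt % len(servers)]
        try:
            return send(server)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next machine
    raise ConnectionError(f"all {attempts} attempts failed") from last_error
```

Retrying blindly is only safe for idempotent requests (GET, PUT, DELETE in a RESTful design), which is one reason the shared-nothing and cacheable-response bullets fit together with this one.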

REST is an architectural style that is ideal for this type of setup.

As @matt mentions you do need to watch out for raw performance, but generally your first concern should be to get the scalable, high availability architecture in place.

Some good sources on this are:

and especially:

Overstock.com runs a highly scaled web site and makes heavy use of Restlet to do it.

Does anyone know where I can find a system architecture for a site that streams music for thousands of concurrent users and can also scale? I would also prefer to use open source system components.

I found the book Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications by Cal Henderson, the architect of Flickr, to be a good overview of the issues involved with scaling a site.

I need to programmatically capture emails as well as any files that are attached to them using php. Also is running a cron job the only way to continue checking if there are any new emails or is there a way to automatically fire some code as a new email arrives? Thanks any help is appreciated!

To the second part of your question: If you run your own mail server and want to avoid polling to fetch new messages, then you can add an entry to /etc/aliases that lets your MTA know to forward to your PHP script, like so:

uploads: "|/usr/bin/php -q /var/flickr/uploads.php"

This entry will tell your MTA to pipe any emails for uploads@example.com to uploads.php. From there, you can read STDIN, parse the MIME message, and process it as you please.
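
The "read STDIN, parse the MIME message" step is sketched below in Python rather than PHP, using only the standard `email` package. The function name `parse_message` and the shape of the returned dictionary are assumptions for the example; a script invoked from the alias would pass it the raw bytes read from standard input.

```python
from email import message_from_bytes
from email.policy import default


def parse_message(raw):
    """Parse one raw email (bytes) into sender, subject, text body and attachments."""
    msg = message_from_bytes(raw, policy=default)

    attachments = []
    for part in msg.iter_attachments():
        # get_payload(decode=True) undoes base64/quoted-printable transfer encoding.
        attachments.append((part.get_filename(), part.get_payload(decode=True)))

    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "text": body.get_content() if body is not None else "",
        "attachments": attachments,
    }
```

In the piped script this would be called as `parse_message(sys.stdin.buffer.read())`. Attachment file names come straight from the sender, so treat them as untrusted before writing anything to disk.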

(stolen from Cal Henderson's Book Building Scalable Websites. I highly recommend Chapter 6)

Do you have any experience of designing a Real Shared-Nothing Architecture? Would you have some readings to recommend me?

Building Scalable Web Sites by Flickr architect Cal Henderson is pretty much the holy book for scalable web architectures.

The presentations by Brad Fitzpatrick of Danga Interactive, creators of LiveJournal, are also excellent case studies. Check out this one first.

We're creating a web system using Java and Servlet technology (actually Wicket for the presentation layer) and we need our system to be available nearly always as our customers will be quite dependent on it.

This has led us to look for a good book focusing on the subject or another resource which explains how to set up a more redundant and fail-safe architecture for our system.

A non exclusive list of questions we have at the moment:

  • How do you have one domain name (like http://www.google.com) that is actually served by several servers with load balancing to distribute the users? Isn't there always a weaker point in such a solution (the two [as there can't be more] DNS servers for google.com in their case)?
  • It seems like a good idea to have several database servers for redundancy and load balancing. How is that set up?
  • If one of our web servers goes down we would like to have some kind of fail over and let users use one that is still up. Amongst other things the sessions have to be synchronized in some way. How is that set up?
  • Do we need some kind of synchronized transactions too?
  • Is Amazon Computer Cloud a good option for us? How do we set it up there? Are there any alternatives which are cost effective?
  • Do we need to run in a Java EE container like JBoss or Glassfish?

I just finished reading Architecting Enterprise Solutions: Patterns for High-Capability Internet-based Systems. Excellent introduction for me on scalability, availability, performance, security, and a whole lot of other aspects for Enterprise Systems

A friend of mine says that Building Scalable Web Sites is the definitive book on the subject:

I was wondering (because though I am a programmer I am not good with networking): if I have a site with multiple databases for user accounts, what unifies those servers/multiple databases so it doesn't check the wrong database or server? So when I go from having 1 server to multiple, will I be able to keep the same application and have the databases expand onto those servers? If someone suggested a book that would be great!

Understanding how horizontal scaling works will give you a clear picture of how that's done.
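
As a very small illustration of the idea, the application itself usually decides which database holds a given user, for example by hashing the user ID. The sketch below is one simple scheme under stated assumptions: the shard host names are invented, and `shard_for_user` is a hypothetical helper, not part of any library.

```python
import hashlib

# Hypothetical database hosts; each one holds a slice of the user accounts.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for_user(user_id, shards=SHARDS):
    """Map a user ID to the same shard every time by hashing it."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Because the mapping depends on the number of shards, adding a server moves many users to a different database; sites that expect to grow often keep an explicit user-to-shard lookup table instead.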

I suggest reading articles and books on that topic.

There are a lot of good books and articles on it; a few of them are listed below:

I would like to read about how to use caching effectively, optimize my database schema and queries, and apply partitioning and load balancing. There are plenty of resources on optimizing code and low-level stuff, but not on these topics.

I've read Building Scalable Web Sites by Cal Henderson and besides a single chapter actually on scaling, which barely scratches the surface, there is nothing interesting inside.

Is there any decent book or another resource on web application performance optimization?

Take a look at the Patterns and Practices Guide - http://www.codeplex.com/PerfTestingGuide

If you are using an Oracle database, this guide may also help.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/toc.htm

When we talk about improving the performance of a web application, two must-read books are High Performance Websites and Even Faster Websites. Both focus on improving the load time of a website and on other client-side optimization techniques.

For the database side of things you can look at SQL Performance tuning.