The Art of Capacity Planning

John Allspaw


Success on the web is measured by usage and growth. Web-based companies live or die by the ability to scale their infrastructure to accommodate increasing demand. This book is a hands-on and practical guide to planning for such growth, with many techniques and considerations to help you plan, deploy, and manage web application infrastructure.

The Art of Capacity Planning is written by the manager of data operations for the world-famous photo-sharing site Flickr.com, now owned by Yahoo! John Allspaw combines personal anecdotes from many phases of Flickr's growth with insights from his colleagues in many other industries to give you solid guidelines for measuring your growth, predicting trends, and making cost-effective preparations. Topics include:

  • Evaluating tools for measurement and deployment
  • Capacity analysis and prediction for storage, database, and application servers
  • Designing architectures to easily add and measure capacity
  • Handling sudden spikes
  • Predicting exponential and explosive growth
  • How cloud services such as EC2 can fit into a capacity strategy

In this book, Allspaw draws on years of valuable experience, starting from the days when Flickr was relatively small and had to deal with the typical growth pains and cost/performance trade-offs of a typical company with a Web presence. The advice he offers in The Art of Capacity Planning will not only help you prepare for explosive growth, it will save you tons of grief.


Mentioned in questions and answers.

I just read the book The Art of Capacity Planning (BTW, I liked it), and in it the author explains how important it is to measure your services, find your ceilings, forecast your needs, ensure smooth deployments, etc. But throughout the book he describes his experience at Flickr, where he dealt with the same product the whole time.

Many of us work at companies that take on small-to-medium projects for other companies. We have to understand their business and their needs, plan an architecture, a data model, and so on.

Then the customer says "I need to support 1000 users". Well, how many requests per second does a user generate? How long are their sessions? How much data do they transfer? Which operations do they execute, and how long do those take?

Sometimes it is possible to know those figures (by monitoring their existing applications, or because they have already taken those measurements), and sometimes it is not (because they do not have a current web site, or it is simply not possible to know).

How do you make a guess about the number of servers, bandwidth, storage, etc.? Which reference figures do you use?

Regards.

Some points you need to know to make this plan:

  1. How many users per day.
  2. How much data you are going to manage.
  3. How much data you are going to show each user.
  4. The average bandwidth each user may need.
  5. The average time a user spends on your site.

The average numbers can give you some idea of what you need monthly. Of course you also need to think about the peak numbers, but when hosts rent out web servers and sites they sell bandwidth by the month and a few gigabytes of disk, so peaks are not an issue at the starting point. What you do have to think about is whether you run an SQL query that needs too much RAM, or whether you share the machine with many other sites.
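
To make that concrete, here is a minimal back-of-envelope sketch in Python. All the figures in it are made-up placeholders, not numbers from this answer; you would substitute your own measurements.

    # Rough monthly capacity estimate from the average figures above.
    # Every number here is a placeholder; replace with your own measurements.

    users_per_day = 1000          # point 1: expected users per day
    pages_per_visit = 5           # average page views per visit
    page_size_kb = 250            # average page weight (HTML + images + scripts)
    new_data_per_user_kb = 20     # points 2/3: data each user adds or is shown
    days_per_month = 30

    monthly_page_views = users_per_day * pages_per_visit * days_per_month
    monthly_bandwidth_gb = monthly_page_views * page_size_kb / 1024 / 1024
    monthly_storage_growth_gb = users_per_day * new_data_per_user_kb * days_per_month / 1024 / 1024

    print(f"Page views per month:      {monthly_page_views:,}")
    print(f"Bandwidth per month:       {monthly_bandwidth_gb:.1f} GB")
    print(f"New data stored per month: {monthly_storage_growth_gb:.2f} GB")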

Measure

Without a site and without experience you do not actually have measurements. Without measurements you cannot really be sure of anything, but you can follow some guides:

  • Whatever you do, try to make the growth of your data/features/workload linear, not exponential.
  • The speed of your site does not depend (only) on the capacity and speed of your machine. It depends on them only when the machine is at its limits; if the machine reaches its limit, you add more resources. But speed also has to be taken care of when you design the software, and fast software costs money too.
  • Do you have millions of records going into the database every day? You need more RAM and disk.
  • Do you have video and many big files to send? You need more bandwidth.
  • Do you have people who use the site for their work? You need more speed and stability.
  • Are you building yet another e-commerce site? You need more security along with stability.

The goal is to have all of them; what actually changes is the priority of what you focus on first.

Planning for speed.

Performance and capacity: two different animals.* Performance is based more on human work, while capacity is based more on computer resources. To make things fast, you first need to know how to make the machine run smoothly and fast, then learn the general tricks that make programs run fast, especially web programs, and then you actually need to spend more time on the program after it is running, improving its performance in the critical areas.

Planning for expansion.

Make a good software design and take care of the possibility of expansion in case you need more later, so you can give your client the opportunity to start small and pay more only when it is needed. When you design your software, think as if it is going to run in a web pool: take care of synchronization, take care of shared resources, make it possible to get data from different servers, and so on.
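
As a rough illustration of that idea, here is a sketch that keeps shared state (a user's cart, in this hypothetical example) in an external memcached store instead of in process memory, so a second server can join the pool later. It assumes the pymemcache client and a memcached server on localhost; the key names and functions are made up for illustration.

    # Sketch: keep shared state out of the process so a second server can be added later.
    # Assumes memcached on localhost; keys and function names are illustrative only.
    from pymemcache.client.base import Client

    shared = Client(("localhost", 11211))

    def get_cart(session_id: str) -> str:
        """Any server in the pool can read the cart, because it lives in the shared store."""
        value = shared.get(f"cart:{session_id}")
        return value.decode() if value else ""

    def save_cart(session_id: str, cart_json: str) -> None:
        shared.set(f"cart:{session_id}", cart_json, expire=3600)

    # The single-server shortcut would be a module-level dict instead of `shared`;
    # that works fine until you add a second machine, which is exactly the point above.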

Planning with limits

OK, let's say the customer says he has only 1000 users and is not interested in expansion or speed, and just needs a cost-effective site that does its job. In that case you also design it within those limits. What are those limits? You do not add dozens of synchronization checks, and you make it work like a single-threaded, single-pool program. You do not use any mutexes, double checks, or anything that only matters when you have two pools or two machines running the same application. You only note those points in the code so you can change them if an upgrade is ever needed.

You also do not write any code that uses multi-machine resources, and when you run it you make sure it runs under only one pool so it works correctly.

This single-pool design is easier to develop, easier to debug, easier to control, easier to patch when code is buggy, and costs less, but it suffers in speed (one user waits for another on one thread pool) and it cannot be expanded in resources, which also affects speed.

Finding Statistics

If you do not know how many users you may have, you can use Alexa to find sites similar to yours and see the average users and average page views they get per month. From that you can estimate the likely bandwidth.

Don't buy before you need it

Start with your hardware prediction, but do not go and rent two machines from day one. Start with the first, take your measurements, see how the data grows, and expand only when you need to.

Car or Formula One?

When the program is running, if you follow it you can find many, many things that need correction. I will give you just two examples from my own experience.

After we put the program online our customer started to add data. After some months we noticed the database had grown far too much, something we did not expect from the data being entered. We spent almost a week finding out why and fixing it; it was a design error that made some statistics data grow exponentially. We corrected it and moved on.

After two years of running we noticed we were making too many unnecessary calls to the SQL server. We traced it down and again found a design error; we corrected it and moved on.

Actually we find and fix many small performance issues every month. For me it is like Formula One. You decide what car you have: a Formula One car that needs constant tuning to get the maximum out of it, or a simple car that only needs a yearly service?

Customer Point of View

"Then, the customer says 'I need to support 1000 users'." Well, the customer does not know programming and is trying to find a measure from his point of view to compare proposals. Actually there are many more factors here, and "1000 users" is not a precise parameter. Is it 1000 users per day, per minute, or per month? Do they need to be supported with live chat, do they need to see large amounts of data, do they need it to be fast? So maybe it is up to you to sell your program correctly to the customer, by explaining that a good program is equally good for one user or for one million users, and that the initial cost comes from the development, not from the number of users.

Now, if this is a question about actually planning a site, then the simple answer is to start doing it, and the rest will reveal itself. If it is a question because you are looking for answers for your client, then you must ask yourself: why does a Formula One car seat only one while your car can fit five? How much does a movie cost? We all know how to write, but why haven't we all written and published a book? My point is that the cost really comes from the time you spend making the project, and the number of users by itself cannot determine that.

Guess, Knowledge or Prediction?

"How do you make a guess about the number of servers, bandwidth, storage, etc.?" We actually do not guess. We have many sites, we automatically collect many statistics every day, we have many years of experience, and from the content of a site we know roughly how many users it can get per day and how much bandwidth it can eat. We also have many databases running on our servers and we can see how much data they use. For 99% of our sites all of those numbers are low. So this is knowledge and experience, backed by real live statistics. The prediction comes from monitoring the traffic and the use of the sites; we try to improve them, to get more traffic and more users, and from what we achieve we try to predict whether they will need more resources in the future. Also, 99% of the sites are single-pool, very simple presentations.
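
The kind of automatic statistics collection described here could be as simple as the following sketch, which counts unique visitors and bytes served per day from a web server access log. The log path and the combined log format are assumptions; the answer does not describe the actual tooling used.

    # Sketch: daily unique visitors and bandwidth from an access log (combined format assumed).
    import re
    from collections import defaultdict

    LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-)')

    visitors_per_day = defaultdict(set)
    bytes_per_day = defaultdict(int)

    with open("/var/log/nginx/access.log") as log:       # assumed path
        for line in log:
            m = LOG_LINE.match(line)
            if not m:
                continue
            day = m.group("day")                          # e.g. "10/Oct/2023"
            visitors_per_day[day].add(m.group("ip"))
            if m.group("bytes") != "-":
                bytes_per_day[day] += int(m.group("bytes"))

    for day in sorted(visitors_per_day):
        gb = bytes_per_day[day] / 1024 ** 3
        print(f"{day}: {len(visitors_per_day[day])} unique visitors, {gb:.2f} GB served")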

* From the book

As a web developer I've been asked (a couple of times in my career) about the performance of sites that we've built. Sometimes you'll get semi-vague questions like "will the site continue to perform well, even during product launch week?", "can the site handle a million users?", and even "how is the site doing?"

Of course, these questions are very legitimate, and I have always tried to answer these questions to the best of my ability, using a combination of

  • historic data (google analytics / IIS logs)
  • web load test tools
  • server performance counters
  • experience
  • gut feeling
  • common sense
  • a little help from our sysadmins
  • my personal understanding of the software architecture in question

I have usually been able to come up with reasonable answers to these questions. However, web app performance can be influenced by many things (database dependencies, caching strategies, concurrency issues, user behaviour, etcetera).

I'm a programmer and not a statistician, and my approach to this problem has always felt deeply unscientific. So I did a little more research... and all of my Google results seem to focus on tools and features and metrics (and MORE metrics), when I am really looking for a way to make sense of these things.

The question: what are some good resources (books?) on best practices in web load testing for a developer to read, that will help me answer these types of questions?

First, your question shows that you already understand the problem. It can sometimes be tricky enough to create the tools, scripts, etc. to generate the load, but the real challenge lies in evaluating the results and deciding what to monitor.

A very simple answer to your question could be: generate load on a production-like environment, similar to your current or expected usage. If it runs fine without crashes or slow responses, that is usually good enough. After that, increase the load to see where your limits are.
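
As a sketch of that "increase the load until you find your limits" approach, here is a small Python script that steps up concurrency against a test URL and reports response times. The target URL and the load steps are placeholders; a real test would normally use a dedicated tool (JMeter, Gatling, ab, and so on) against a production-like environment.

    # Sketch: step up concurrent requests and watch response times grow.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    TARGET = "https://staging.example.com/"   # assumption: a production-like test target

    def one_request() -> float:
        start = time.perf_counter()
        with urlopen(TARGET, timeout=10) as resp:
            resp.read()
        return time.perf_counter() - start

    for concurrency in (5, 10, 20, 50):       # step the load up
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            timings = list(pool.map(lambda _: one_request(), range(concurrency * 10)))
        timings.sort()
        p95 = timings[int(len(timings) * 0.95)]
        print(f"{concurrency:3d} concurrent users: avg {sum(timings)/len(timings):.3f}s, 95th {p95:.3f}s")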

When you reach your limit, in my experience it becomes purely a project budget question: will we invest more time/money/resources to evaluate the cause?

I work as a test professional and I recommend treating load testing as a vital part of the development process, but unfortunately that is not always in line with what management decides.

So the answer to your question is that almost everyone needs to be involved in this process: developers to monitor their code; system admins to monitor CPU, memory usage, etc.; DBAs; networking people; and so on. They all probably need their own sources of knowledge to get all this information recorded and analysed.

A few book tips:

The Art of Application Performance Testing: Help for Programmers and Quality Assurance http://www.amazon.com/exec/obidos/ASIN/0596520662/

The Art of Capacity Planning: Scaling Web Resources http://www.amazon.com/exec/obidos/ASIN/0596518579/

Performance Testing Guidance for Web Applications http://www.amazon.com/exec/obidos/ASIN/0735625700/

I can recommend two books published in 2010:

The first is "ASP.NET Site Performance Secrets" by Matt Perdeck, published in late fall 2010. It is written more from a performance-optimization standpoint, but it also has detailed material on load testing. It is available as a free PDF ebook.

The second is ".NET Performance Testing and Optimization - The Complete Guide" by Paul Glavich and Chris Farrell. It is a pretty complete source on performance and load testing.

What are the best practices for database design and normalization for high traffic websites like stackoverflow?

Should one use a normalized database for record keeping, a denormalized one, or a combination of both?

Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?

or

Should the main database be denormalized but with normalized views at the application level for fast database operations?

or some other approach?

Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.

That said, for ridiculous amounts of traffic, good relational DB performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
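
As a small illustration of that pattern, here is a cache-aside sketch that keeps the result of an expensive query in memcached so the database is hit only when the cached copy expires. The table, query, and key names are made up, and it assumes pymemcache with a memcached server on localhost.

    # Sketch: cache-aside in front of the relational database.
    import json
    import sqlite3
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = sqlite3.connect("site.db")           # illustrative database and schema

    def top_questions(limit: int = 20) -> list:
        key = f"top_questions:{limit}"
        cached = cache.get(key)
        if cached is not None:                # cache hit: skip the heavy query entirely
            return json.loads(cached)

        rows = db.execute(
            "SELECT id, title, score FROM questions ORDER BY score DESC LIMIT ?",
            (limit,),
        ).fetchall()
        cache.set(key, json.dumps(rows), expire=60)   # keep the result for 60 seconds
        return rows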

The Art of Capacity Planning is pretty good.

I've been designing a site over the past couple days, and been doing some research into different aspects of scaling a site horizontally. If things go as planned, in a few months (years?) I know I'd need to worry about scaling the site up and out, since the resources it would end up consuming would be huge.

So, this got me thinking: when is the best time to start thinking about, and designing for, scalability? If you start too early, you could easily overcomplicate your design and make it impossible to actually build. You could also get too caught up in the details, the architecture, whatever, and wind up getting nothing done. Also, if you do get it working but the site never takes off, you may have wasted a good chunk of extra effort.

On the other hand, you could be saving yourself a ton of effort down the road. Designing it from the ground up to be big would make it much easier later on to let it grow big, with very little rewriting going on.

I know for what I'm working on, I've decided to make at least a few choices now on the side of scaling, but I'm not going to do a complete change of thinking to get it to scale completely. Notably, I've redesigned my database from a conventional relational design to one similar to what was suggested on the Reddit site linked below, and I'm going to give memcache a try.

So, the basic question, when is a good time to start thinking or worrying about scaling, and what are some good designs, tips, etc. for when doing so?

A couple of things I've been reading, for those who are interested:

http://www.codinghorror.com/blog/2009/06/scaling-up-vs-scaling-out-hidden-costs.html

http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html

http://developer.yahoo.com/performance/rules.html

From a certain point of view, scaling techniques are quite well accepted and consolidated. So instead of relying on web links/articles, I'd read books on the topic before starting the project.

I suggest:

Please, can you suggest a good book about writing scalable web applications/web services (possibly using the Spring framework, though that is not mandatory)?

Thank you very much


I want to understand when my system is under load (memory and CPU) and when I should plan to scale.

Memory

I am using an EC2 instance. I have multiple processes running, and they consume 80-90% of memory all the time. Should I worry, or should I be happy that I am making full use of what is available?

What should memory consumption look like, and under what circumstances should I worry about scaling?

CPU

I have another EC2 instance that runs some other processes. Most of the time the system CPU utilization is only 18-20%, but at times, for some of the processes, it jumps to 90-100%.

Can anything go wrong, or is it just that the processes may get slow due to the unavailability of CPU cycles and will eventually complete? Also, any new process will have to wait for CPU cycles to become available.

Can anything go wrong?

Basically, I want to understand the scenario and the values at which one should consider scaling up (vertically or horizontally).

Inline answers or pointers to further reading, anything is appreciated.

First of all: you have to define the thresholds for when to scale yourself. This mainly has to do with factors in your quality or stability guidelines and in your application; there is hardly any general rule for it. Here are some points to consider:

  • Some applications can run fine at 100% CPU usage (as long as there are no other jobs on the machine), while others might need to scale at an 80% threshold, for example. The same goes for memory.
  • Think about whether you have critical tasks that must finish within a specific time. If so, make sure they get enough CPU and/or memory to do their job.
  • Observe and measure your system data over time. I suggest having a system like Munin to show your performance data (and its changes) over time. Interesting things to measure are system load, CPU usage, memory consumption, I/O service time, etc.
  • Try to get an idea of what limits your application. For example, if you have a lot of CPU-intensive tasks, CPU is your limit. If you have a lot of I/O to do, keep an eye on the I/O stats, wait times, etc.

To sum up: the need for scaling depends on your application. Get to know it better in terms of system resource usage. If you have a monitoring system set up, you can watch your system performance over time.
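
As a small sketch of such a threshold check, something like the following could run from cron on the instance. It assumes the psutil package, and the 80%/90% limits are examples only, not recommendations from this answer.

    # Sketch: compare current CPU, memory and load against your own thresholds.
    import os
    import psutil

    CPU_LIMIT = 80.0      # percent, sustained (example threshold)
    MEM_LIMIT = 90.0      # percent (example threshold)

    cpu = psutil.cpu_percent(interval=5)      # average CPU usage over 5 seconds
    mem = psutil.virtual_memory().percent
    load1, load5, load15 = os.getloadavg()
    cores = psutil.cpu_count()

    print(f"CPU {cpu:.0f}%  memory {mem:.0f}%  load(1m) {load1:.2f} on {cores} cores")

    if cpu > CPU_LIMIT or mem > MEM_LIMIT or load5 > cores:
        print("Over threshold: time to investigate, and possibly to scale up or out.")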

A good read is "The Art of Capacity Planning". Also, if you Google "capacity planning" a bit, you will find some more pointers.