September 9th, 2015 - Drupal sites outage

What was the reason?

The database (DB) servers hosted by Acquia began experiencing issues on September 8th around 4 PM because of a large number of INSERT statements generated by Drupal 6 sites. This caused a lag in replication from the DB master to the secondary, resulting in slowness and, in some cases, outages of sites. We disabled DB logging on the Drupal 6 sites to address this issue, and the DB server began to recover.

However, at that time the server started receiving a high volume of requests for one particular site. The requests had been building up since the evening and eventually resulted in outages on the morning of September 9th. This site followed a content feeder model in which a Services View acted as the central repository of content. Multiple other sites read this content and displayed it on the fly. Our analysis follows.

  • The Services Views that provide the data feed from the feeder site were not cached
    • Each request therefore resulted in a call to the DB
    • We suspect the DB server was slow to respond because of the replication issues, which left the request processes waiting
  • The Views XML Backend module, which was used on the requesting websites, does not seem to have a request timeout setting, and its views were not cached
    • This meant that all earlier requests kept waiting indefinitely
    • Meanwhile, new requests kept being added to the PHP process queue
    • The DB server could not keep up with the requests, resulting in exponential growth of the PHP processes in the queue
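The pile-up described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not a reconstruction of the actual traffic: the function name simulate_backlog, the arrival and service rates, and the tick-based clock are all hypothetical. It assumes a constant arrival rate, so it shows only steady growth rather than the faster growth seen during the incident. The point is that without a timeout the queue of waiting requests never stops growing, while a timeout caps it.

```python
def simulate_backlog(arrivals_per_tick, served_per_tick, ticks, timeout_ticks=None):
    """Return the number of waiting requests at the end of each tick.

    All parameters are hypothetical illustration values, not measurements.
    A timeout_ticks of None models a client with no request timeout.
    """
    queue = []    # ages (in ticks) of waiting requests, oldest first
    history = []
    for _ in range(ticks):
        # Every request still waiting gets one tick older.
        queue = [age + 1 for age in queue]
        # With a timeout, requests that have waited too long give up.
        if timeout_ticks is not None:
            queue = [age for age in queue if age < timeout_ticks]
        # New requests join the back of the queue.
        queue.extend([0] * arrivals_per_tick)
        # The slow backend answers only a few requests, oldest first.
        queue = queue[served_per_tick:]
        history.append(len(queue))
    return history


if __name__ == "__main__":
    print(simulate_backlog(10, 4, 5))                   # no timeout
    print(simulate_backlog(10, 4, 5, timeout_ticks=2))  # with a timeout
```

With these example numbers, the backlog without a timeout is 6, 12, 18, 24, 30 and keeps climbing, while with a two-tick timeout it levels off at 16. A timeout does not make the backend faster; it only stops abandoned requests from holding PHP processes.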

Why did all websites get affected?

Drupal sites at SF State are set up in such a way that all DBs are on the same DB server. As discussed in community of practice meetings, there are pros and cons to this approach. One of the cons is that, with 175+ DBs on one server, an issue with one DB can degrade the performance of the DB server and cause problems for every website that depends on it.

How was it resolved?

We did several things to stabilize the environment. A few items were specific to certain websites, and we have sent details to the owners of those sites. In general, the following changes were made.

On all sites:

We enabled caching on all Drupal 7 sites so that not every request hits the DB server. The cache expiration is set to 1 hour by default. This means that some content additions and updates might not be visible to the public for up to 1 hour. This change, understandably, will inconvenience content authors. We will use this configuration for a couple of weeks while we work out the cache expiration time that gives the best balance between keeping the websites running and offering a better user experience with fresher content.
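The trade-off between DB load and content freshness can be seen in a minimal time-based cache. This is a sketch for illustration only, not Drupal's caching implementation: the class TTLCache, its method get_or_compute, and the injectable clock are hypothetical names. Only the 1-hour expiration comes from the configuration described above.

```python
import time


class TTLCache:
    """Hypothetical time-to-live cache: reuse a value until it expires."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds   # 3600 seconds = the 1 hour default
        self.clock = clock       # injectable so the cache can be tested
        self._store = {}         # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            # Still fresh: serve the cached copy and skip the expensive call.
            return entry[1]
        # Missing or expired: do the expensive work once and remember it.
        value = compute()
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    db_calls = []

    def render_page():
        # Stands in for a page build that queries the DB.
        db_calls.append(1)
        return "page version %d" % len(db_calls)

    cache = TTLCache(ttl_seconds=3600)
    cache.get_or_compute("home", render_page)
    cache.get_or_compute("home", render_page)
    print(len(db_calls))
```

Within one expiration window, any number of visits to the same page costs a single DB-backed build. The price is that an edit made just after the page was cached stays invisible until the entry expires, which is why a shorter expiration gives fresher content but more DB load.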

We also disabled a few cron jobs that are either no longer needed or can be avoided.

On the content feeder site:

We added 1-hour caching on all the Services Views. We also added 1-hour caching on the Views XML Backend views on the requesting sites.

What steps are being taken to prevent similar issues in the future?

We are discussing a long-term resolution internally. We are considering several ways to compartmentalize resource-intensive sites so that their issues do not propagate to other sites on campus. Once we have an official proposal, we will bring it to groups on campus.