Avoiding the Performance Panic Spiral of Doom

The following warning applies any time you try to fix a misbehaving system without understanding the cause of the problem, and it is especially relevant when the problem is performance:

The trouble started when the site began slowing to a crawl at random times. The tech team met to discuss the issue. Having failed to extract the cause by stuffing enough smart people in a room, we shifted the topic to solutions.

“Let’s switch our caching from memcached to redis,” I said. The testing went well, and the change was made. Follow-up testing, accompanied by a dose of optimistic thinking, led to the conclusion that the issue had improved.

Everyone was happy until we discovered that the registration system was broken: in one specific function, PHP failed to set the redis cache, causing a redirect loop. We fixed that problem, but the performance issue returned.

Following this, another dozen configuration and code changes were tried. Since we could not consistently reproduce the performance issue, it was questionable whether any of these changes helped. The only thing that became clear was that our site was becoming increasingly unstable, and we had little experience dealing with all the new components. In desperation, we decided to start over with a new server build.

The testing of the new server went OK until I decided to throw in another wrench at the last second – switching from MySQL to AuroraDB. “AuroraDB is 5X faster and 100% wire compatible with MySQL,” according to Amazon. It turned out, however, that the PerconaDB client library on the server was not, that the AuroraDB default parameter group was not configured properly for high query rates, and that WordPress, the mysqli PHP library, and AuroraDB don’t play well together.

So now we had all our existing problems, plus the issue of configuring a new server with a new set of management tools, plus the issues of switching to a new database server. Eventually, we solved all the problems we had created, either by learning to use the new components or by reverting to the old ones. We never did figure out the cause of the performance issue, though, and simply patched it over with more hardware.

What’s the moral of this story?

If a website is suddenly slow, unreliable, or generally misbehaving for performance-related reasons,
DO NOT TRY TO MAKE PERFORMANCE IMPROVEMENTS WITHOUT UNDERSTANDING THE ROOT CAUSE
  •  Any performance-related change should be tested to see if it makes things better.  This is impossible without a stable site.
  • Performance-related configuration and code changes should be based on evidence – quantitative proof that the specified change will help.
  • Making changes based on hunches and Internet guides is a potentially endless process, as software like MySQL, Varnish, Nginx, etc. offers hundreds of parameters, with millions of opinions online about what is best.
  • Making optimizations in the dark is a huge time drain when a quick, short-term solution is needed. You will make many changes of unknown effectiveness, possibly falling into the dreaded Performance Panic Spiral of Doom:
    1. Try to fix a problem with guesswork without understanding the cause
    2. Break an unrelated component in the process
    3. Try a more drastic fix to fix both issues
    4. Repeat, until the site is a disaster zone
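
To make the "tested to see if it makes things better" and "quantitative proof" bullets above concrete, here is a minimal Python sketch that compares response-time samples collected before and after a change. The function names, the 95th-percentile choice, and the 10% improvement threshold are all illustrative assumptions, not something we actually ran at the time.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile (0-100) of samples, nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare(before, after, pct=95, min_improvement=0.10):
    """Compare one latency percentile before and after a change.

    min_improvement is an assumed cutoff: smaller relative gains are
    treated as noise rather than as evidence that the change helped.
    """
    b = percentile(before, pct)
    a = percentile(after, pct)
    if b <= 0:
        raise ValueError("baseline percentile must be positive")
    improvement = (b - a) / b
    return {
        "before": b,
        "after": a,
        "relative_improvement": improvement,
        "helped": improvement >= min_improvement,
    }


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    before_ms = [100, 110, 120, 130, 900]
    after_ms = [100, 105, 110, 115, 300]
    print(compare(before_ms, after_ms))
```

A comparison like this only means something if both samples come from comparable traffic on a stable site, which is exactly why the first bullet matters.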
HOW TO ACTUALLY RESOLVE PERFORMANCE OUTAGES: EVIDENCE-BASED ROOT CAUSE ANALYSIS
  1. Enhance your environmental awareness (by improving your monitoring & diagnostic tools*) until you can visualize, isolate, and identify the problem.
  2. Fix the problem.

* For example, using New Relic, error logs, htop, ntop, xdebug, etc.
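
As a rough illustration of step 1, the sketch below groups request timings into one-minute windows and reports the windows in which most requests were slow. This is one way to start isolating a slowdown that happens "at random times". The input format (pairs of epoch seconds and duration in milliseconds), the function name, and the thresholds are hypothetical; in practice the numbers would come from whatever your monitoring tool or access log records.

```python
from collections import defaultdict


def slow_windows(records, threshold_ms=1000.0, window_s=60, min_fraction=0.5):
    """Return start times of windows where the slow-request fraction is high.

    records: iterable of (epoch_seconds, duration_ms) pairs (assumed format).
    A request counts as slow when its duration is at least threshold_ms.
    """
    totals = defaultdict(int)
    slow = defaultdict(int)
    for ts, duration_ms in records:
        # Align each request to the start of its window.
        bucket = int(ts // window_s) * window_s
        totals[bucket] += 1
        if duration_ms >= threshold_ms:
            slow[bucket] += 1
    return sorted(b for b in totals if slow[b] / totals[b] >= min_fraction)


if __name__ == "__main__":
    # Hypothetical timings: the window starting at t=60 is mostly slow.
    sample = [(0, 100), (10, 120), (65, 2000), (70, 3000), (100, 150), (130, 90)]
    print(slow_windows(sample))
```

Once the slow windows are known, they can be lined up against error logs, cron schedules, and database activity to look for whatever they have in common.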
