Performance | Auto-Magical

The following warning applies to anytime you try to fix a misbehaving system without understanding the cause of the problem, but especially relevant when trying to fix performance issues without knowing the cause:

The trouble started when the site started randomly slowing to a crawl at random times. The tech team met to discuss the issue. Having failed to extract the cause by the act of stuffing enough smart people in a room, the topic shifted to solutions.

“Let’s switch our caching from memcached to redis” I said. The testing went well, and the change was made. The following testing, accompanied by a dose optimistic thinking, let to the conclusion that the issue was improved.
Everyone was happy, until it was discovered that the registration system was broken, because in one specific function, PHP failed to set the redis cache, causing a redirect loop. We fixed the problem, but the performance issue returned.Following this, another dozen configuration and code changes were tried. Since we could not consistently reproduce the performance issue, it was questionable whether any of these changes helped. The only thing which became clear was that our site was becoming increasingly unstable, and we had little experience dealing with all the new components. In desperation, we decided to start over with a new server build.

The testing of the new server went OK, until I decided to throw another new wrench in at the last second – switching from MySQL to AuroraDB. “AuroraDB is 5X faster and 100% wire compatible with MySQL” according to Amazon, but it turns out that the PerconaDB client library on the server was not, the AuroraDB default parameter group is not configured properly for high query rates, and WordPress+mysqli PHP library+AuroraDB don’t play well together.

So now, we had all our existing problems, plus the issue of configuring a new server a new set of management tools, plus the issues of switching to a new database server. Eventually, we solved all the problems we created by either learning to use the new components or reverting to old ones, but we never did figure out the cause of the performance issue, and simply patched it over with more hardware.

What’s the moral of this story?

If a website is suddenly slow, unreliable generally misbehaving for performance-related reasons,

DO NOT TRY TO MAKE PERFORMANCE IMPROVEMENTS WITHOUT UNDERSTANDING THE ROOT CAUSE

Any performance-related change should be tested to see if it makes things better. This is impossible without a stable site.
Performance-related configuration and code changes should be based on evidence – quantitative proof that the specified change will help.
Making changes based on hunches and Internet guides is a potentially endless process as software like MySQL, Varnish, Nginx, etc offer hundreds of parameters with millions of opinions online about what is best.
The approach of making optimizations in the dark is a huge time drain when a quick and short-term solution is needed. You will make many changes with unknown effectiveness, possible falling into the dreaded Performance Panic Spiral of Doom:
1. Try to fix a problem with guesswork without understanding the cause
2. Break an unrelated component in the process
3. Try a more drastic fix to fix both issues
4. Repeat, until the site is a disaster zone

HOW TO ACTUALLY RESOLVE PERFORMANCE OUTAGES: EVIDENCE-BASED ROOT CAUSE ANALYSIS

Enhance your environmental awareness (by improving your monitoring & diagnostic tools*) until you can visualize, isolate, and identify the problem.
Fix the problem.

* For example, using New Relic, error logs, htop, ntop, xdebug, etc.

I want to give a brief tutorial on understanding CPU processor utilization since it is commonly an area of confusion during performance analysis.

CPU time

CPU time is the percentage of clock ticks than a processor spends waiting for instructions. This is opposed to wall-clock time, which is the total time the whole computer takes to perform an operation. The “load” on a system is the ratio of clock ticks which are performing operations versus the click ticks spent waiting for instructions over a given time period. Thus, load only makes sense as an average over a particular time period. During any single clock tick, the CPU is either processing an instruction or a HLT.

Idle time

The CPU processes a set of instructions fed from memory. The memory contains a set of opcodes (commands) which tell the CPU which operation to perform and the memory where it is located. The rate of clock ticks is constant – several gigahertz in a modern CPU – and during each clock tick, the CPU processes either an instruction to do some work, or a HLT – an opcode which tells the CPU to turn keep components idle until the next cycle.

Monitoring CPU time

So if we were to “observe” a CPU at the clock-tick level of detail, we would see that it’s always either working or waiting. But we can’t actually do that, since the more closely we monitor clock cycles, the more clock cycles are needed to do the monitoring. So to get an accurate picture of activity, we have to step back to a level where the monitoring tool does not interfere too much with the process being observed.

Input/output overhead

Any given computer task has several components: CPU instructions, memory IO, disk IO, network IO, and many others. Only extremely simple tasks (like calculating π) can fit in the small (but very fast) memory buffers on the CPU itself. Real-world tasks will almost always require the CPU to wait for other components to finish shuffling data back and forth in a state where it is ready for the CPU to work on it. But the ideal scenario is for the CPU to be kept as busy as possible (constant 100% load) until the task is completed. This is the minimum possible time in which a task can be completed. If we were to monitor CPU activity during such a task, we’d see the load jump to 100% during the task than back to the baseline when it’s done. But if the task is shorter than the measurement frequency of the CPU monitoring tool, it would be an average over two periods, or it might not be detected at all.

Multitasking

The situation is more complicated when there are multiple tasks to run, each off which requires some fraction of CPU time. The operating system will run the tasks concurrently. Both the processing time and the IO of tasks will overlap, but because the operating system takes turns running each task, the total CPU load will be some combination which may be higher or lower than the total of the individual tasks – depending on the other competing resources the tasks use. For example, two disk-intensive tasks will use less CPU than the sum of both running individually because the disk IO will be the bottleneck. But two CPU-intensive tasks would more than double total load because the CPU will have to run both tasks and have to handle the context switching between concurrent tasks.

Further reading

http://en.wikipedia.org/wiki/CPU_time
http://en.wikipedia.org/wiki/Load_(computing)
http://en.wikipedia.org/wiki/No_op
http://en.wikipedia.org/wiki/Idle_task
http://en.wikipedia.org/wiki/Computer_performance

n. 1: automatic, but with an element of magic. 2: too complex to understand and/or explain

Auto-Magical

Tag Archives: performance

Avoiding the Performance Panic Spiral of Doom

Understanding processing time

n. 1: automatic, but with an element of magic. 2: too complex to understand and/or explain