This week started with a failed release.
A release we had held back for about three weeks finally went out, and it immediately caused de-serialization errors on objects coming back from memcached. We rolled back to the previous release and regrouped the next day to examine the causes.
I spent the next morning learning about memcached, but I couldn't put my finger on the root cause. We knew the configuration was wrong. We knew that corrupt objects were blowing up during de-serialization, and that the application couldn't remove them from the cache.
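Our stack isn't the point here, but a minimal sketch of the failure mode, assuming a pymemcache client and pickle serialization (both stand-ins for whatever an application actually uses), looks something like this. The corrupt blob throws during deserialization, and unless the application deletes the key on that error path, the bad entry stays in the cache and fails again on every read.

```python
import pickle

from pymemcache.client.base import Client

# Hypothetical cache host; the real topology had three of these.
cache = Client(("cache1.example.com", 11211))

def get_cached(key):
    """Fetch and deserialize a cached value, evicting it if it is corrupt."""
    raw = cache.get(key)
    if raw is None:
        return None
    try:
        return pickle.loads(raw)
    except Exception:
        # A corrupt blob blows up here. Without this delete, the bad
        # entry stays in memcached and fails on every subsequent read.
        cache.delete(key)
        return None
```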
I had a hunch that the servers weren't communicating with all three caches. An lsof run showed that each host made five connections, but not to all three caches. By the post-mortem we had a partial plan.
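For the curious, the check boils down to something like the following, written here as a small Python wrapper around lsof rather than the one-liner we actually ran; the port is just the memcached default, and the parsing follows lsof's standard output format.

```python
import subprocess
from collections import Counter

MEMCACHED_PORT = "11211"  # memcached's default port

# List TCP connections involving the memcached port, with numeric
# addresses (-n) and numeric ports (-P).
out = subprocess.run(
    ["lsof", "-i", f"TCP:{MEMCACHED_PORT}", "-n", "-P"],
    capture_output=True, text=True, check=False,
).stdout

remote_hosts = Counter()
for line in out.splitlines():
    if "ESTABLISHED" not in line:
        continue
    # The NAME column looks like "local:port->remote:port (ESTABLISHED)";
    # keep only the remote host part and count connections per cache host.
    name = line.split()[-2]
    remote = name.split("->")[-1]
    remote_hosts[remote.rsplit(":", 1)[0]] += 1

for host, count in sorted(remote_hosts.items()):
    print(f"{host}: {count} connection(s)")
```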
We were going to add more connections from each host and dump the cache during the release.
All three cache servers sat behind a load balancer configured to fail over if one of them went down. This led to hosts reading from and writing to the wrong caches. In an extreme case, a host could even write the correct key to a cache that had been silently load-balanced out!
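The underlying hazard is that a memcached client maps keys to servers as a pure function of the server list it currently believes in, so two app hosts with different views of the pool will put the same key in different places. A toy illustration, using a naive modulo scheme rather than whatever distribution our client library actually uses:

```python
# Toy key-to-server mapping: the point is only that the result depends
# entirely on the server list, not on the particular hashing scheme.
import hashlib

def pick_server(key, servers):
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

full_pool = ["cache1:11211", "cache2:11211", "cache3:11211"]  # illustrative names
degraded = ["cache1:11211", "cache3:11211"]  # a host that has failed cache2 out

key = "video:1234:metadata"
# The two hosts can now disagree about where this key lives, so one of
# them reads (or writes) a copy that nobody else ever sees.
print(pick_server(key, full_pool))
print(pick_server(key, degraded))
```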
One of the ops guys suggested dropping all of the configuration changes bar one: using a single memcached server, since one box could easily handle the load and the memory. We decided at the last minute to give it a try.
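In client terms the final setup is almost embarrassingly small; here's a sketch with pymemcache (the client library and hostname are stand-ins, not our actual stack):

```python
# Every app server points at the same, single memcached host.
from pymemcache.client.base import Client

cache = Client(("cache1.example.com", 11211), connect_timeout=2, timeout=2)

# Quick sanity check that the one box is reachable and serving.
cache.set("healthcheck", b"ok", expire=60)
assert cache.get("healthcheck") == b"ok"
```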
The result: the strange errors stopped and the memcached log went eerily silent. Success. Some people didn't believe the lack of errors at first. The change cleared up a number of the transient errors, including the biggest one: video playback now works consistently.
I am glad that this problem is behind us. I just wish we hadn't stumbled into the solution. It's humbling to think that chance can play such a huge role in maintaining a system.
It's also a bit unnerving, and my teammates are concerned that we actually understand the system less after this incident. We'll keep seeking, though, because that's how we do.