You might want to go and grab yourself a big cup of tea - this is going to be a long post with lots of technical detail and might take a while to digest. I’ll start with some background information, and then I’ll dive into the specifics of the two-hour outage we had on Monday, 2nd September.
On Tuesday 27th August we migrated from our old community site of nearly 10 years to a brand new community site. The migration itself went well, despite taking longer than initially expected; however, we experienced some issues in the 2-3 hours immediately after we threw the switch to make the new community available to everyone. Thankfully, we were able to resolve these quickly and the community has been very stable since then, with a couple of notable exceptions.
Our expectation had been that code deployments would be effectively invisible to members, rolled out seamlessly behind the scenes. We can accomplish this because the site is hosted in the cloud on multiple ‘nodes’ (servers, pretty much) running multiple Kubernetes Pods (which run Docker containers, if you’ve heard of those) that can be removed and created as necessary. During a code deployment, each pod should be destroyed and recreated in turn. So while one pod is unavailable and being updated, the other pods keep the community running, until within a few minutes all of the pods have been updated.
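For those curious what that looks like in practice, here is a minimal sketch of a Kubernetes Deployment configured to roll pods over one at a time. To be clear, this is a generic illustration rather than our actual configuration - the names, replica count, image and health-check path are all placeholders.

```yaml
# Illustrative only - not our real manifest; all names and numbers are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: community-web            # hypothetical name for the site's web pods
spec:
  replicas: 3                    # several pods share the traffic
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # only one pod is taken down at a time
      maxSurge: 1                # one extra pod may be started during the rollout
  selector:
    matchLabels:
      app: community-web
  template:
    metadata:
      labels:
        app: community-web
    spec:
      containers:
        - name: web
          image: example/community-web:latest   # placeholder image
          readinessProbe:        # traffic only reaches pods that report themselves ready
            httpGet:
              path: /healthz
              port: 80
```

With settings along these lines, Kubernetes replaces pods one by one and only sends visitors to pods that pass their readiness check, which is what should make a deployment invisible to members.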
Having seen that the seamless deployments weren’t working correctly, over the following days we investigated the root cause of the problems and put in place changes designed to prevent a repeat of the issue. In the meantime, we paused further code releases to the live site but retained the ability to deploy code to our staging environment. This was important, as it allowed us to continue working on both a fix for the issue we’d seen and other updates for the community in general.
Monday 2nd September – 2 hours of problems
On the morning of Monday, 2nd September, we became aware of a critical defect on the community that needed to be addressed urgently. The fix for the defect was simple, and we were also confident that we had code ready to address the deployment problem from the previous week. So, we were keen to get the code released and see how it performed.
Unfortunately, instead of a smooth and seamless release, we suffered 2 hours of the community site being largely unusable. So what happened?
At its simplest, it was the same problem we had on the 28th – except our first attempt to fix it made the underlying problem (replication latency in the Network File System) worse. This time the race condition did not resolve itself, but that failure had the positive side effect of neatly highlighting exactly what we needed to do to fix it for good.
Tuesday 10th September - Permanent fix implemented
As a result of the fixes put in place, we have now been able to make a number of code releases with no discernible interruption to the functionality of the community platform. Since the fix was introduced on Tuesday morning we have made over 15 deployments, and no-one ran into any trouble. This means that we will now be able to continue with our plans to iteratively develop the platform based on your feedback. We’ll share more on those plans next week.
If you have any questions about any of the above, please do raise them here. Either I, or one of the team, will do our best to answer them.