As Mike mentioned this morning in his blog, we want to share with you some further technical details of the outage that occurred yesterday.
Just after 4pm the volume of SIGTRAN signaling traffic between some of the core nodes on our network (the HLR/MSCs and charging nodes) spiked to unexpected levels, causing an outage for all calls and the majority of texts. Alarms were raised across the network nodes and routers; we investigated these, and also checked the processors that had failed on Friday.
By diverting some of the signaling traffic and taking some services offline, we brought the volume down to levels that allowed us to restore service shortly after 8pm. Sadly, given the layers of complexity involved, this took much longer than we would have wanted.
Since implementing the fix, our suppliers and technical teams have continued to work through the night investigating the root cause of this issue. As part of a planned support and operational change on Tuesday, we introduced a new MSC/HLR pair. The post-implementation testing and checks all passed, and it wasn't until we hit the peak traffic time around 4pm that the signaling volumes caused the routes between the MSC/HLRs to fail.
We backed out this change in the early hours of Wednesday morning and are now evaluating the best way to implement it in future. It is likely that we will have to redesign and retest our approach, and we will do so before re-attempting this necessary work.
We are confident that the issues on Friday and yesterday, whilst in the same area of the network, are not connected. We cannot, however, be complacent, and so have already put in place extra monitoring and contingency plans in case we start to see similar issues.
In addition to the problem that happened yesterday, some of you may have been impacted by an issue with the wider HLR network today, which has now been resolved for the majority of giffgaff members.
We are committed to providing a great service and understand that this has not been the case this week.