I wanted to provide a bit more technical information for those who asked about monitoring at our Open Day, as well as some details around how we monitor and capacity manage both our internally hosted and externally provided systems.
giffgaff's systems span multiple service providers and, as such, we use a multitude of different monitoring systems, both internal to giffgaff and external (run by the suppliers themselves).
For our portal environment, which consists of a typical 3-tier system, we monitor both physical and virtual servers for the metrics that indicate load and utilisation.
For anyone who visited our Open Day, we gave a brief overview of our portal topology - for those who did not attend, it spans front end, back end and database servers.
Other segments of our infrastructure include Quartet (our billing system), which is managed by one of our 3rd party suppliers. They use a system called MARRS, which is an internally developed monitoring system. Some elements of the Quartet system are outsourced to parties that support our current partner, ensuring that this system remains stable and is monitored 24/7.
O2 provides our radio network and a few core elements which are shared, such as the O2 provisioning system. These systems are monitored by O2 24/7 with their internal systems - any faults are reacted to based on a fault / priority classification and then restored accordingly. There are elements within O2 that serve giffgaff uniquely; however, generally if a system is impacted, it will impact both O2 and giffgaff.
We monitor our portal infrastructure with Nagios, which is commonly used by sysadmins, and use Centreon on top of it to manage the way the mass of data / information is presented.
Typical measurements are CPU load, idle time, disk I/O counts, session counts, RAM utilisation, swap utilisation and more - all very standard measures.
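To give a flavour of how this works, Nagios (and Centreon on top of it) simply runs a check script and looks at its exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL. The sketch below is purely illustrative and not our production tooling - the psutil library and the thresholds are my own assumptions for the example.

```python
#!/usr/bin/env python3
"""Illustrative Nagios-style check for RAM and swap utilisation.

Nagios interprets the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL.
The thresholds and the use of psutil are assumptions for this example only.
"""
import sys
import psutil  # third-party host-metrics library (illustrative choice)

WARN_PCT = 80.0
CRIT_PCT = 90.0

def main() -> int:
    ram = psutil.virtual_memory().percent
    swap = psutil.swap_memory().percent
    worst = max(ram, swap)

    status, label = 0, "OK"
    if worst >= CRIT_PCT:
        status, label = 2, "CRITICAL"
    elif worst >= WARN_PCT:
        status, label = 1, "WARNING"

    # The first line of output is shown next to the service in the UI;
    # everything after the pipe is performance data that can be graphed.
    print(f"{label} - ram={ram}% swap={swap}% | ram={ram}%;{WARN_PCT};{CRIT_PCT} swap={swap}%")
    return status

if __name__ == "__main__":
    sys.exit(main())
```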
With our databases we have more metrics available to us, analysing not only load but also performance, using AWR reports and live reporting on the 'top 5' queries as an example, which lets us see how quickly and efficiently the databases are performing.
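For anyone curious what a live 'top 5' report looks like in practice, the sketch below shows the general idea against an Oracle database. The connection details are placeholders and the python-oracledb driver is an assumption on my part - AWR reports themselves are produced by Oracle's own tooling, this just illustrates the kind of live query behind a 'top N' view.

```python
"""Illustrative 'top 5 SQL by elapsed time' query against Oracle's v$sql view.

Connection details are placeholders; this is a sketch, not our actual reporting code.
"""
import oracledb

TOP_SQL = """
    SELECT *
      FROM (SELECT sql_id,
                   executions,
                   ROUND(elapsed_time / 1e6, 1) AS elapsed_s,
                   SUBSTR(sql_text, 1, 60)      AS sql_text
              FROM v$sql
             ORDER BY elapsed_time DESC)
     WHERE ROWNUM <= 5
"""

with oracledb.connect(user="monitor", password="secret", dsn="dbhost/orcl") as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_SQL)
        for sql_id, execs, elapsed_s, text in cur:
            print(f"{sql_id}  execs={execs}  elapsed={elapsed_s}s  {text}")
```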
All of this information is then used for the day to day management of servers as well as the longer term requirements of capacity management, where we typically try to work 12-18 months in advance to allow enough time for any major upgrades. Smaller upgrades are a lot easier, as 80% of the infrastructure is virtual, which allows for easy upgrade paths, assuming there is enough capacity in the core supporting components.
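As a rough illustration of the capacity side (the figures and the straight-line growth assumption here are made up for the example), the historical utilisation we collect can be extrapolated to estimate when a component would need upgrading:

```python
"""Toy capacity projection: fit a straight line through monthly utilisation
samples and estimate when an upgrade threshold would be crossed.
All numbers are illustrative."""
import numpy as np

months = np.arange(12)  # last 12 months of samples
cpu_util = np.array([41, 43, 44, 47, 48, 50, 52, 53, 56, 58, 60, 62])  # % busy

slope, intercept = np.polyfit(months, cpu_util, 1)  # roughly 2% growth per month
LIMIT = 80.0  # the point at which we would want an upgrade already in place

months_until_limit = (LIMIT - cpu_util[-1]) / slope
print(f"Growth ~{slope:.1f}%/month, ~{months_until_limit:.0f} months until {LIMIT:.0f}% utilisation")
```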
Our hosting provider will also monitor many of the same areas we are monitoring, which gives us the 24/7 support required. Any value that goes outside its expected range will raise a ticket, and that ticket will then be followed up. Based on the support metric, this may or may not require a call to one of my on-call engineers.
Additionally, our hosting provider has the ability to monitor components in their network which are required for giffgaff's service but which giffgaff does not have the ability (or need) to monitor. Typically these will be components such as switches, routers, firewalls and load balancers, as well as the all-important SAN (Storage Area Network) on which our data is stored. Some of these devices are unique to giffgaff and are not contended; some are contended, although these are typically resilient and need no input from giffgaff to be managed.
In terms of reporting, Technology Operations report to the giffgaff team, on a weekly basis, on the following statistics. These are certainly not exhaustive but give a very good overview of the general health of our systems:-