Tech Update - Operational Monitoring Processes

former giff-staffer
I wanted to provide a bit more technical information for those who asked about monitoring at our Open Day, as well as some details on how we monitor and capacity-manage both our internal and externally provided systems.
 
giffgaff's systems are spread across multiple service providers, and as such we use a number of different monitoring systems, some internal to giffgaff and some run externally by the suppliers themselves.
 
Our portal environment is a typical 3-tier system, and we monitor both its physical and virtual servers for the metrics that determine load and utilisation.
 
For anyone who visited our open day, we gave a brief overview of our portal topology - for those who did not attend, it is split across front end, back end and database servers.
 
The other segments of our infrastructure include Quartet (our billing system), which is managed by one of our 3rd party suppliers. They use an internally developed monitoring system called MARRS. Some elements of the Quartet system are outsourced to parties that support our current partner, ensuring that the system remains stable and is monitored 24/7.
 
O2 provides our radio network and a few shared core elements, such as the O2 provisioning system. These systems are monitored 24/7 by O2 with their internal tools - any faults are classified by priority and then reacted to and restored accordingly. A few O2 elements are unique to giffgaff, but generally if a system is impacted it will affect both O2 and giffgaff.
 
We monitor our portal infrastructure with Nagios, which is commonly used by sysadmins, and use Centreon on top of it to manage the way the mass of data and information is presented.
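For the technically curious, a Nagios check is just a small program that prints a status line and exits with a code Nagios understands. Here's a minimal sketch in Python of what such a check looks like - the thresholds and the choice of metric are invented for illustration; real deployments would normally use the standard Nagios plugin set:

```python
#!/usr/bin/env python3
"""Minimal Nagios-style check: load average against warn/crit thresholds.

Illustrative only - thresholds and metric choice are assumptions,
not giffgaff's actual configuration.
"""
import os
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(warn: float = 4.0, crit: float = 8.0) -> int:
    try:
        load1, load5, load15 = os.getloadavg()
    except OSError:
        print("UNKNOWN - could not read load average")
        return UNKNOWN

    # Perfdata after the pipe is what lets a frontend like Centreon
    # graph the values over time.
    perfdata = f"load1={load1:.2f};{warn};{crit} load5={load5:.2f} load15={load15:.2f}"

    if load1 >= crit:
        print(f"CRITICAL - load average {load1:.2f} | {perfdata}")
        return CRITICAL
    if load1 >= warn:
        print(f"WARNING - load average {load1:.2f} | {perfdata}")
        return WARNING
    print(f"OK - load average {load1:.2f} | {perfdata}")
    return OK

if __name__ == "__main__":
    sys.exit(check_load())
```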
 
Typical measurements are CPU load, idle time, disk I/O counts, session counts, RAM utilisation, swap utilisation and more - all very standard measures.
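As a rough illustration of what collecting these looks like, here's a short Python sketch using the third-party psutil library - the exact counters and agents we use aren't listed above, so treat the specifics as assumptions:

```python
import psutil  # third-party: pip install psutil

def snapshot() -> dict:
    """Collect the standard load/utilisation measures mentioned above."""
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_io_counters()
    cpu = psutil.cpu_times_percent(interval=1)  # sample over 1 second
    return {
        "cpu_percent": 100.0 - cpu.idle,   # overall CPU load
        "cpu_idle_percent": cpu.idle,      # idle time
        "disk_reads": disk.read_count,     # disk I/O counts since boot
        "disk_writes": disk.write_count,
        "sessions": len(psutil.users()),   # logged-in session count
        "ram_percent": vm.percent,         # RAM utilisation
        "swap_percent": swap.percent,      # swap utilisation
    }

if __name__ == "__main__":
    for name, value in snapshot().items():
        print(f"{name}: {value}")
```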
 
With our databases we have more metrics available to us: we analyse not only load but also performance, using AWR reports and live reporting on the 'top 5' queries, for example, which lets us see how quickly and efficiently the databases are performing.
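AWR is an Oracle feature, and the post doesn't spell out exactly how the live 'top 5' report is built, but conceptually it is close to querying v$sql, as in this hedged sketch (the credentials and DSN below are placeholders, not real values):

```python
import oracledb  # third-party: pip install oracledb

# 'Top 5' SQL statements by total elapsed time, similar in spirit to the
# live reporting described above. Querying v$sql requires DBA-level
# privileges; elapsed_time is reported in microseconds.
TOP5_SQL = """
SELECT * FROM (
    SELECT sql_id,
           executions,
           ROUND(elapsed_time / 1e6, 2) AS elapsed_seconds
    FROM   v$sql
    ORDER  BY elapsed_time DESC
) WHERE ROWNUM <= 5
"""

# Placeholder connection details - substitute real values.
with oracledb.connect(user="monitor", password="secret", dsn="dbhost/portal") as conn:
    with conn.cursor() as cur:
        cur.execute(TOP5_SQL)
        for sql_id, execs, elapsed in cur:
            print(f"{sql_id}: {execs} executions, {elapsed}s total elapsed")
```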
 
All of this information is then used for day-to-day management of servers as well as for the longer-term requirements of capacity management, where we typically try to work 12-18 months in advance to allow enough time for any major upgrades. Smaller upgrades are a lot easier, as 80% of the infrastructure is virtual, which allows for easy upgrade paths assuming there is enough capacity in the core supporting components.
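To make that 12-18 month horizon concrete, capacity planning at its simplest is fitting a trend to historical utilisation and seeing when it would cross a headroom limit. The figures and the 80% limit in this sketch are invented for illustration:

```python
from statistics import linear_regression  # Python 3.10+

# Hypothetical monthly peak CPU utilisation (%) for the last year.
history = [42, 44, 43, 47, 49, 52, 51, 55, 58, 57, 61, 63]
months = list(range(len(history)))

slope, intercept = linear_regression(months, history)

THRESHOLD = 80.0  # assumed headroom limit before an upgrade is needed

if slope <= 0:
    print("No growth trend - no upgrade forecast needed")
else:
    # Project the fitted line forward until it crosses the threshold.
    month = len(history)
    while intercept + slope * month < THRESHOLD:
        month += 1
    lead_time = month - len(history)
    print(f"Trend: +{slope:.1f}%/month; ~{lead_time} months until {THRESHOLD}%")
```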
 
Our hosting provider also monitors many of the same areas we are monitoring, which gives us the required 24/7 support. Any value that goes outside its expected range will raise a ticket, which is then followed up. Depending on the support classification, this may or may not require a call to one of my on-call engineers.
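The exact classification rules aren't described here, but the flow is roughly 'value out of range → ticket → possible call-out'. A hedged sketch of that logic (the priority names and the paging rule are assumptions, not our actual support matrix):

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    host: str
    metric: str
    value: float
    priority: str  # assumed scheme: "P1" pages on-call, "P2" does not

def raise_ticket(host: str, metric: str, value: float,
                 warn: float, crit: float) -> Ticket | None:
    """Open a ticket when a monitored value leaves its expected range."""
    if value >= crit:
        priority = "P1"
    elif value >= warn:
        priority = "P2"
    else:
        return None  # within expected range, nothing to do

    ticket = Ticket(host, metric, value, priority)
    if priority == "P1":
        print(f"Paging on-call engineer for {ticket}")
    else:
        print(f"Queued for follow-up: {ticket}")
    return ticket

# Example: a front-end server (hypothetical hostname) breaching RAM limits.
raise_ticket("portal-fe-01", "ram_percent", 97.0, warn=85.0, crit=95.0)
```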
 
Additionally, our hosting provider monitors components in their network that are required for giffgaff's service but which giffgaff does not have the ability (or need) to monitor. Typically these are components such as switches, routers, firewalls and load balancers, as well as the all-important SAN (Storage Area Network) on which our data is stored. Some of these devices are dedicated to giffgaff and are not contended; some are shared, but these are typically resilient and need no input from giffgaff to be managed.
 
In terms of reporting, Technology Operations report to the giffgaff team on a weekly basis on the following statistics. These are certainly not exhaustive, but they give a very good overview of the general health of our systems (a note on what the percentages mean in practice follows the list):
 
Service & Target
 
Overall website availability (giffgaff.com) - 99.9%
Topup availability - 99.9%
Activation availability - 99.9%
Ability to purchase and/or recur Goodybag - 99.9%
Port In availability - 99.9%
Voice, 2G, 3G, SMS availability - 99.97%
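To put those targets in perspective, an availability percentage translates directly into a downtime budget: 99.9% allows roughly 43 minutes of downtime in a 30-day month, while 99.97% allows roughly 13. A quick worked sketch (the monthly/yearly framing is illustrative; the measurement window isn't stated above):

```python
def downtime_budget(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in minutes for a given availability target."""
    return period_hours * 60 * (1 - availability_pct / 100)

for target in (99.9, 99.97):
    monthly = downtime_budget(target, 30 * 24)   # 30-day month
    yearly = downtime_budget(target, 365 * 24)
    print(f"{target}%: ~{monthly:.0f} min/month, ~{yearly / 60:.1f} h/year")
```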
17 Comments
head honcho
I feel this could have been made easier to understand with a few diagrams of what you're describing.

Hmm...sounds like there's a lot more going on behind the scenes keeping the giffgaff network running than anyone would have guessed! Glad to see that there are plenty of techies keeping it all going...

pathfinder

Who monitors the monitors, though?

 

Great post

 

phenomenon

Thanks for this update, interesting to see the inner workings of giffgaff, and that you are very transparent about this to the members. :)

newcomer

I don't know about all of you people out there, but I don't really care how it works, as long as it runs. Lately, giffgaff has disappointed me with its service. My friends and I have been experiencing a lot of issues: calls and texts don't come through and the internet is constantly failing. You can talk all you want about what keeps giffgaff going, but I suggest you stop talking and actually do something to improve the service.

oracle
Thanks! I really hope voucher topup is well within this scope, as improvements in that field and in payment systems in general on giffgaff are desperately needed by members.
Thanks for the tech update.
oracle

@clintinger

Wow!! :O

 

I am quite technically minded and like to think I have a decent grasp of the way things work, but after attempting to absorb the first couple of paragraphs I've given up :-/

 

I can appreciate your aim is to make people understand how things work, but this blog is a very long way from that :O

It becomes "text-book" and "gobble-de-**bleep**" very quickly, even for those technically minded among us -

 

result .......... yaaawwwnnn :( Sorry for the attitude, but I do mean well :$

 

ps - the swear filter kinda shocked me there :O ... maybe I should have said "gobbledegook" instead :D

pupil

Thanks for the update, but it's a little over my head too.