We aim for everything to be working all of the time, we are pretty confident about living up to that, and we have an SLA in place guaranteeing it. The reality is that unexpected problems do sometimes happen, and occasionally we run scheduled maintenance to improve or upgrade our infrastructure.
Whatever the reason something is down, we and our clients need to know about it instantly (or in advance, for scheduled work) and be able to track it. The obvious way to do that is with system monitors that connect to each of our servers and sites every minute, alert us immediately if anything is down, and publish the results to a live status page.
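The core of such a monitor is very small. Here is a minimal sketch of a single check, assuming a plain HTTP GET is enough to decide "up" vs "down" (real monitoring services also offer ping, TCP, keyword, and response-time checks):

```python
# A minimal "is this site up?" check. A real monitor would run this
# once a minute (e.g. from cron or a scheduler loop) for every server
# and site, and fire alerts on a state change.
from urllib.request import urlopen
from urllib.error import URLError

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx/3xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, or HTTP error status
        return False
```

The interesting parts of a production system are everything around this loop: scheduling, retries to avoid false alarms, alert routing, and recording the history that feeds the status page.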
Our first solution
There are a lot of third-party services that will do this, like Uptime Robot or Hetrix. They are good at what they do: they will host and generate status pages for you, or let you pull the info into your own page via an API. They are also cheap (or free) and easy to use.
So, we chose and used one, Uptime Robot. We created monitors for everything, set up alerts, created a page on this site and pulled all of the monitor info in via the API, and all was good(ish).
We don’t want anything to be just good(ish), though, and we had a few issues…
- Hosting everything sustainably is important to us, and we have no control over how Uptime Robot hosts its monitoring system.
- We want our key staff to receive notifications over a lot of different mediums, to be sure they are instantly aware wherever they are.
- Uptime Robot has had a lot of slowdowns and issues due to DDoS attacks over the last few months, which has been hurting our status page speed and monitor accuracy.
- Because our status page was on the same server as everything else we host, if our homepage was down, so was the status page.
So, we decided to put together a new solution…
From today, we are using our new self-hosted monitoring system and status page and we are pretty happy about it.
We looked at a lot of different open source software solutions and narrowed things down to either Statping or Cachet. We set up test implementations of both, and had a very hard time choosing between them before finally deciding on Statping.
We now have an implementation on a new VPS in a sustainable data center in Amsterdam, which remotely monitors all of our services hosted in Helsinki and bombards us with multiple alerts the moment anything goes down. It also hosts our shiny new status page, which will stay up even if our homepage is down.
Since we are now using an open source solution, we also have far more control and can adapt the software to fit our needs as they change and grow. At the moment we are implementing a system that lets clients opt in to automatic alerts for any services that affect their sites.
But, what happens if the monitoring server goes down?
We’re glad you asked. In that case, DNS failover will automatically switch to a duplicate install back in Helsinki. If both of these unconnected data centres in different countries are down, then the European internet has bigger problems!
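The failover idea itself is simple: a low-TTL DNS record for the status page points at the primary monitoring server, and when the primary stops answering, the record is switched to the standby. An illustrative sketch of the decision, where `update_dns_record` is a hypothetical stand-in for whatever your DNS provider's API actually offers:

```python
# Illustrative DNS-failover logic (not our actual implementation).
# The IPs are documentation addresses and the hostname is an example;
# the real switch is done by a DNS provider, triggered by health checks.

PRIMARY_IP = "203.0.113.10"   # Amsterdam monitoring VPS (example IP)
STANDBY_IP = "198.51.100.20"  # Helsinki duplicate install (example IP)

def choose_target(primary_up: bool) -> str:
    """Pick which IP the status-page record should point at."""
    return PRIMARY_IP if primary_up else STANDBY_IP

def failover(primary_up: bool) -> str:
    target = choose_target(primary_up)
    # update_dns_record("status.example.com", target)  # hypothetical provider API call
    return target
```

The record's TTL matters as much as the logic: it has to be short enough (minutes, not hours) that resolvers pick up the switch quickly.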