UK Server Hypervisor Upgrade
The UK server will be down for a short period of time in the small hours of tomorrow (Monday) morning so that the underlying ESXi hypervisor can be upgraded to the latest version.
All the other servers on the network will operate as normal.
Update @ 03:00: A very smooth, successful update.
UK Server Downtime – RapidSwitch Network Issue
A network connectivity problem occurred earlier, which affected all (or at least most) of the datacenter in which the UK server resides.
RapidSwitch have not yet provided an RFO (Reason For Outage) report, but they did say this:
At approximately 00:20 a network issue was detected affecting connectivity to the North cluster in RapidSwitch Spectrum House.
Our on call engineers were contacted and the problem has been resolved. Service was restored at approximately 01:35. Unfortunately we do not currently have any further details available at this time but will issue them as and when they become available.
An update from RapidSwitch [emphasis mine]:
At approximately 00:20 a network issue was detected affecting connectivity to the North cluster in RapidSwitch Spectrum House. The cause of this appears to have been a very large amount of malicious traffic directed at the RapidSwitch network. This traffic has unfortunately caused a routing issue within the RSH.North cluster.
Our on call engineers were contacted and the traffic has been removed from the network. Normal service was restored at approximately 01:35. This problem will have affected different clients in different ways. It will have ranged from no effect, to an increase in latency and some packet loss, to a more substantial loss of connectivity.
UK Server Crash
The VMware ESXi bare-metal OS that runs the UK server VM crashed at around 13:00 GMT today.
After establishing that it was the server that had crashed, and not RapidSwitch's network, the server was power cycled and came back up as expected.
UK Server Downtime – RapidSwitch Router Problems
Due to RapidSwitch encountering unforeseen problems with their router software upgrade last night, there have been large bouts of downtime over the last 24 hours when the UK server has been unreachable.
We have set up a secondary/backup DNS service which will store a copy of the main DNS server's records, so that you are still able to resolve the US and DE servers' IP addresses during any future downtime.
In addition, we have removed the UK server from the DNS round-robin for irc.J3di.org, and are also redirecting the records for the UK server to point at the US server until this crisis has abated.
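For anyone curious what the DNS changes above amount to, here is a rough sketch in Python. It is only a model of the idea, not our actual DNS configuration: the addresses are documentation-range placeholders, and the helper names round_robin_pool and redirect are made up for illustration.

```python
from itertools import cycle, islice

# Placeholder addresses from the RFC 5737 documentation ranges,
# NOT the real server IPs.
records = {
    "uk": "192.0.2.10",
    "us": "198.51.100.20",
    "de": "203.0.113.30",
}


def round_robin_pool(records, excluded=()):
    """Addresses handed out for the round-robin name, minus excluded servers."""
    return [ip for name, ip in records.items() if name not in excluded]


def redirect(records, source, target):
    """Copy of the records with `source` pointing at `target`'s address."""
    updated = dict(records)
    updated[source] = records[target]
    return updated


# The UK server is dropped from the round-robin pool...
pool = round_robin_pool(records, excluded={"uk"})

# ...and its own record temporarily resolves to the US server.
redirected = redirect(records, "uk", "us")

# Successive lookups of the round-robin name rotate through what is left.
first_four = list(islice(cycle(pool), 4))
```

Once the UK server is stable again, both changes are simply reversed: the original record is restored and the server rejoins the pool.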
For more frequent (and always available) updates, you might like to follow @J3di_IRC, @Fr3d_org or even @RapidSwitch on Twitter.
We'd also like to pass on RapidSwitch's profuse apologies for the disruption caused by this downtime.
Emergency Maintenance @ RS – Will Affect UK Server
We have just been notified by RapidSwitch, whose datacenter Fr3d uses to host the UK server, that there is likely to be a loss of connectivity on their network this evening at around 23:00 UK time.
This is due to RapidSwitch having to perform emergency network maintenance:
The maintenance is to perform an emergency upgrade of Cisco software. [...] The Cisco TAC team have diagnosed a fault with the software on the router in the form of a memory leak. Cisco has supplied us with a new version of the software for the router which will fix the memory leak and slow performance.
- RapidSwitch Support
They say that this maintenance should take no longer than 45 minutes, and that they will do all they can to speed everything up and reduce the maintenance time.
To maintain network cohesion (as much as possible), the DE server will automatically fail over its link to the US server in the event that it loses its connection to the UK server and cannot reconnect.
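The failover rule for the DE server can be sketched roughly as follows. This is an illustrative model only: in practice the IRC daemon's own link configuration handles it, and the function name choose_uplink, the try_connect callback and the retry count of 3 are all assumptions made for the example.

```python
from typing import Callable, Optional


def choose_uplink(
    try_connect: Callable[[str], bool],
    primary: str = "uk",
    fallback: str = "us",
    max_retries: int = 3,  # assumed value; the real retry policy is not stated
) -> Optional[str]:
    """Pick the hub to link to.

    Tries the primary up to `max_retries` times, then the fallback once.
    Returns None if neither can be reached.
    """
    for _ in range(max_retries):
        if try_connect(primary):
            return primary
    if try_connect(fallback):
        return fallback
    return None


# Example: the UK server is unreachable, so the DE server links to the US.
uplink = choose_uplink(lambda hub: hub == "us")
```

The link only moves to the US server after the reconnection attempts to the UK server have failed, which matches the "loses its connection, and cannot reconnect" condition above.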
Update @ 01:50 25/09: RapidSwitch seem to be back online now. No word from them yet as to why the maintenance took so long; we will update this post when we know more.
