Melodeon.net Forums

Topic: Unplanned downtime: melnet offline 24th October


Author: Theo (Administrator)
Posted: October 25, 2014, 10:07:06 AM

Apologies to everyone for melnet being unavailable for much of the day yesterday. This was the result of a hardware failure in the datacentre where our server is located. Here is a report from our web hosts with a summary of what happened:

Quote
Hi All,

Today has not been a fun day for most of us, so I’ll quickly explain what happened and what went wrong.

At 11:30 this morning our new cloud setup lost all power. It was not accessible to the network and would not respond to any commands from the control servers. We contacted the datacentre and asked them, specifically, to check whether our cloud setup was getting enough power and that it hadn’t been capped or tripped anything that would have knocked it offline. They said that the power to our cloud setup was 16 amps and fine, and that, as all our other servers in the rack were running perfectly (which they were), the fault must be with the main cloud server.

We contacted our engineer and asked him to go and physically look at the servers. While he was in transit we contacted Dell and explained the issue. They suggested that, based on our description, the only feasible explanation was a backplane failure on the main chassis. The backplane controls power to the blades in the chassis and, according to Dell, has a failure rate of one in ten million, which is why it is not replicated in the chassis. Our engineer took this advice and spent the next 4 hours dismantling the main cloud server and changing every part he could replace, including the power supplies, the backplane and even the chassis itself, attempting to reboot the server after each new part was installed.

After a while our engineer started to notice a cracking sound from the cabinet next to ours every time he attempted to power up our cloud servers. After hearing this he moved our cloud server to another of our racks, installed it there, and it worked first time. The problem was not with the server; it was with the rack. Our engineers contacted a technician at the datacentre and pointed out what was happening, and they quickly diagnosed a faulty PDU (power distribution unit) powering both our cabinet and the cabinet next to ours. Replacing the PDU immediately fixed the issue.

We shall be meeting with the datacentre next Friday to discuss why our initial query about the power to our cloud was not fully investigated. I can only assume they checked the supply of power to the PDU, which was fine, and not the supply from the PDU to the servers, which was broken. Had this been physically checked at the time, the issue would have been fixed around 1pm; instead, an additional 5 hours were wasted.

Although this was out of our control, I’d like to apologise for the inconvenience caused today; it’s not good for any of us when things like this happen. We do our best to prevent problems and constantly keep our equipment as up to date as possible, so it’s extra frustrating when the problem is caused by failed hardware at the DC. We ultimately chose them as our DC, though, so we do have to take a bit of the blame. On a positive note, our disaster recovery plan went well, so we at least know that it’s reliable if it’s ever needed again (hopefully not!).

A copy of this email is available on our forums at the link below (password ‘caratumba’), where you can comment on and discuss it:

https://ariotek.co.uk/forums/f9/fridays-downtime-8141/#post54996

Theo Gibb - Gateshead UK

Proprietor of The Box Place for melodeon and concertina sales and service.
Follow me on Twitter and Facebook for stock updates.


Melodeon.net - (c) Theo Gibb; Clive Williams 2010. The access and use of this website and forum featuring these terms and conditions constitutes your acceptance of these terms and conditions.