Last Friday, 15th Feb, two of the services I most depend on failed. As it turned out, neither really concerned me at the time, as that same day my brother was taken seriously ill (he’s now doing fine and on the way to recovery). It’s only now that I’ve had the time to think about the implications of these failures.
The first was my fixed wireless broadband provider, Omnitel (aka Callidus, aka Torque, aka IFA Telecom Wireless Broadband). Its signal was down yet again (4th time since Xmas, two of those for close on 7 days each!). Large areas of rural Ireland depend on providers such as Omnitel to supply them with what is now a basic service, and I think many would agree that the end-user experience is, to put it as charitably as possible, sub-optimal. I’ll leave it to others to explain why we’re in such a mess, but a mess it is.
I’m not overly concerned about this, for several reasons: the area I live in now has at least one alternative wireless provider (Irish Broadband – and I see two, three, make that four of my neighbours have changed over to them since last week!) and I’m also within 3 km of an Eircom exchange, which means I have my trusty ISDN backup and will eventually (we’re on the “list”) have access to ADSL. Now, ISDN is not a suitable alternative if your house is full of iTunes/YouTube-obsessed young adults or if you need to constantly download large amounts of data (e.g. 10 MB plus), but for “normal business stuff” it’s fine; I could live with it.
But isn’t a datasmith’s stock-in-trade large datasets? Well, yes and no. Many micro tasks such as data analysis tend to be carried out using Excel, which by its nature means you’re dealing with relatively small datasets or subsets of large databases, neither of which requires significant bandwidth to download or upload. For larger datasets and more powerful ETL/analysis tasks I don’t depend on my local machines; I use Amazon EC2/S3. In fact, most of my business and personal computing infrastructure is now “cloud” based, with my laptop reduced to the role of local cache/processor/communications device, similar to my mobile phone, just with a bigger keyboard and screen!
Which brings me neatly to the other failure of Friday the 15th: Amazon’s cloud services, EC2, S3, SQS and SimpleDB. As it turned out, it wasn’t the services themselves that failed; rather, the AWS authentication infrastructure was subjected to what could be described as a “friendly/unintentional” DoS attack. Existing publicly accessible S3/SimpleDB resources were still accessible and EC2 instances continued to operate, but anything requiring authentication failed. It reminds me a bit of the early days of RAID storage systems: the “miracle” of striping and mirroring worked, but failures still happened due to faulty power supplies or controller sub-systems.
The major complaint first-timers have when coming to terms with EC2 is the lack of post-shutdown/failure persistence on the virtual machine’s disks: data must be backed up to S3, otherwise it’s gone in the event of an instance failure. I’m guessing that the “oddness” of this architecture is to do with its suitability for the purposes that Amazon originally designed it for, and having proved it in their day-to-day business over the last decade or so, they’re sticking with it. Which is good; those of us who are now becoming dependent on this architecture want a robust and proven service. I suspect the authentication service is a new layer on the existing internal Amazon stack and is only now being stress-tested.
So it failed, and was fixed relatively quickly, but what’s more important, Amazon acknowledged the problems (not just the reason for the failure itself, but the less than perfect way their users were kept informed during the outage) and I’m reasonably confident they’ve learned from their mistakes. (To return to my rant on my broadband provider: I think the most annoying thing when the service goes down is that the whole of Omnitel, help-line, accounts, even sales, refuse to answer the phone (no forum, no status page), leaving their customers to wonder have they gone out of business or are they all hiding under their desks with their fingers in their ears shouting “Go away, go away”.)
As a side note, two other services I use had hiccups this week: WordPress.com was down for several hours on Wednesday (as a result of a DoS attack, I believe) and on Friday my Hamachi VPN service was down for an hour or so due to server resource problems.
So am I less confident in the viability of the “cloud” after this week of outages? No. I’m a believer in “risk management” rather than “risk avoidance”: as long as I’ve a “good enough” alternative (ISDN for broadband, standard Linux hosts for EC2) or a high degree of confidence in the supplier (Amazon S3 for backup and secure storage), I’m sticking with it. Not only that, I’m betting my career on it.
Update: Monday 25th
A bit windy today. You guessed it, broadband down again! So make that 5 times since Xmas. I see in their terms and conditions Callidus (Omnitel’s legal entity) promise 99% uptime within any month; that’s a little over 7 hours of acceptable outages per month. If only! On the plus side, I was talking to one of my neighbours who’d recently changed over to Irish Broadband, and her experience with her new supplier was very positive. “Professionals, know what they’re doing, excellent customer service” is how she described them.
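For anyone who wants to check that “little over 7 hours” figure, here’s a minimal sketch of the arithmetic in Python. The function name and the choice of month lengths are mine, purely for illustration; nothing here comes from the Callidus terms and conditions beyond the 99% figure itself.

```python
def allowed_downtime_hours(uptime_percent: float, days_in_month: int) -> float:
    """Hours of outage permitted in one month under an uptime guarantee."""
    total_hours = days_in_month * 24
    # The permitted downtime is whatever fraction of the month is NOT guaranteed.
    return total_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Print the allowance for short, typical and long months.
    for days in (28, 30, 31):
        hours = allowed_downtime_hours(99, days)
        print(f"{days}-day month: {hours:.2f} hours of permitted downtime")
```

For a 30-day month that works out at 7.2 hours. By comparison, a single outage of close on 7 days is roughly 168 hours, more than twenty times the monthly allowance.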
Update: Wednesday, March 12th
Windy again last night; yep, gone again! Well, I’m assuming it’s the wind; no reply at any of Omnitel’s numbers. Maybe they’ve gone out of business!
Update: Weekend 28th-30th March
Omnitel down again Friday night (28th). My son says it was back at some stage during the weekend, but when I went to use it tonight (Sunday 30th) it was still not working. Left a text message (twice) on 087 2826671, their out-of-hours number, but to no avail.
And this crowd were …
… recently shortlisted in the Government’s National Broadband Scheme to provide broadband to the remaining areas currently unserved by broadband in the Republic of Ireland.
… and if they win, those areas will continue to be “unserved”!
Update: Monday 31st March
Service back up and running at 12 noon! More amazing, when I rang the help line this morning, there was a message acknowledging the problem (could it be true, Omnitel have started to invest in customer relations!). Mind you, should I let them in on that other secret of modern customer service, the “status blog”?
Simply set up a blog, e.g. http://omnitel.wordpress.com, and post network problem and resolution details, alongside “good news” stories (e.g. network upgrades) and maybe even allow customer comments!
I know, too much to hope for.
Update: Sat 12th April
Down again since 6PMish. Actually, this is the 3rd weekend in a row, but the last two were “just” Sunday night/Monday morning outages (or extreme slowness, as per last Sunday PM/Monday AM) so I didn’t report them.
Update: Monday 28th April 19:00
In keeping with the now well-established tradition of a weekend failure, the Omnitel network has been down since Saturday 15:00ish. It seems to be a major outage, with still no sign of a return to “normal service”. Time to phone Irish Broadband, I think: Lo-Call 1890 56 44 56.
Update: September 2008
No major outages in the last 5 months, and when they do happen, they’re fixed quickly and Omnitel are also now much better at keeping customers informed. So praise where praise is due, well done; a huge improvement.