Dropbox outage shows the fragility of the cloud

[Screenshot: Dropbox ‘Error (500)’ page]

Cloud storage and file-hosting service Dropbox suffered a severe service outage this week that began on Friday night and, as I write these words on Saturday evening UK time, is still continuing.

If you go to the Dropbox website, you’ll probably get only a screen with the headline ‘Error (500).’ That signifies some kind of server error.
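For the technically curious, here is a minimal Python sketch of how HTTP status codes fall into error classes. The function name `describe_status` is my own, purely for illustration; it relies only on Python's standard `http` module and has nothing to do with how Dropbox itself reports errors.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the standard phrase for an HTTP status code and its broad class."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    if 500 <= code <= 599:
        kind = "server error"   # the fault lies with the service
    elif 400 <= code <= 499:
        kind = "client error"   # the fault lies with the request
    else:
        kind = "not an error"
    return f"{code} {status.phrase}: {kind}"


print(describe_status(500))  # prints "500 Internal Server Error: server error"
```

Any code in the 500 range means the fault is on the service's side, which is why nothing you do in your own browser will fix it.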

I say “probably” because on Twitter, Dropbox said the problem is resolved and the site is back up. I saw people tweeting that this is so. Yet I see others saying it’s not and, in my case, it definitely is not back. And note that Dropbox’s tweet was over 13 hours ago.

I look at the ‘Connecting…’ dialog box in the status area on my Windows desktop PC. It’s been saying that since I turned on this PC earlier this morning, some ten hours ago.

[Screenshot: Dropbox desktop client showing ‘Connecting…’]

In that dialog box, I should see reference to recent activity in my Dropbox account – files and images uploaded or downloaded, deleted, etc. But as you can see, nothing.

As a long-time Dropbox user with a paid account, I find this concerning. That said, it’s the weekend, so at the moment I’m not overly alarmed at being unable to synchronize the files I store on Dropbox’s servers with the various devices I’ve set up to sync with them.

Equally, I’m not too worried by comments I’ve seen around the social web, on Twitter and elsewhere, that Dropbox has been hacked. In a note on the last status update on its tech blog, Dropbox says it’s not so.

Yet this has the ingredients to surge suddenly into a major crisis of confidence if that kind of talk escalates, for instance into mainstream media, as in this report published less than an hour ago in USA Today, one of the big-circulation US papers:

Popular backup service Dropbox went down late Friday – and a hacker group is claiming responsibility.

A group called The 1775 Sec says via Twitter it compromised Dropbox’s site in honor of Internet activist and computer programmer Aaron Swartz, who committed suicide a year ago.

Dropbox is saying the outage, which appears to be resolved, arose from routine maintenance.

If you can get to the user-supported discussion forums – just about everything connected to the Dropbox domain is extremely slow – you’ll see many comments by very angry users.

[Screenshot: Dropbox Forums]

In particular, check out the seven pages of comments in relation to this specific outage.

I have no doubt that Dropbox is working feverishly to fix whatever the problem is that they said, 13 hours ago on Twitter, was “caused during internal maintenance.”

They do have a status update post on a tech blog that announced the problem and then said it’s solved. The post is dated today, January 11, but there’s no indication of when it was originally posted or when it was updated.

What I see now is, well, little more from Dropbox and a lot more from anyone else with an opinion – including repeating rumours about a hack attack – that stokes up good old FUD: fear, uncertainty and doubt.

A classic point from which a crisis can erupt: no trusted words from the organization, lots of untrusted words from everyone else. I say “untrusted” because most of it is uninformed opinion passed along via everyone’s social networks.

Correcting incorrect information just can’t happen as fast as a tweet, a retweet or a like.

I hope Dropbox can fix the problem more quickly than it looks like they’re doing. Yes, there is feverish work going on in the background, I have no doubt, as I mentioned earlier. But in reality, I have no idea exactly what that means.

Some people are holding this outage up as an example of why you should never rely solely on the cloud for your content. Always have a secondary option: if one fails or you can’t access the content, you have a backup.

It’s common sense, and certainly a practice I follow.

Yet I think what this illustrates mostly is how a lack of timely, continuous and trustworthy communication is the spark that creates FUD, and that can lead to a collapse in confidence and the potential subsequent exit of many customers to something else (Google Drive, maybe).

It does highlight the fragility of the cloud when things go wrong, to be sure. But that fragility is seriously – and unnecessarily – compounded by a lack of communication.

And once things are fixed – surely by Sunday at the latest – most people will likely breathe a sigh of relief and carry on as usual.

At what cost to Dropbox, I wonder, when the next outage happens.

[Later:] And the jokes will flow… Thanks, Sean Trainor, for this timely cartoon:

[Update Jan 12, 08:00 UK time:] Dropbox is up. At least, I can now access my account on the website even if the desktop app still cannot connect to the web and synchronise files.

An update last evening US Pacific time on the previous status update post on the Dropbox tech blog adds this discomforting note:

Dropbox is still experiencing lingering issues from last night’s outage. We’re working hard to get everything back up, and want to give you an update.

No files were lost in the outage, but some users continue to run into problems using various parts of dropbox.com and our mobile apps. We’re rapidly reducing the number of users experiencing these problems, and are making good progress.

We’re also working through some issues specific to photos. In the meantime, we’ve temporarily disabled photo sharing and turned off the Photos tab on dropbox.com for all users. Your photos are safely backed up and accessible from the desktop client and the Files tab on dropbox.com.

Clearly, whatever happened is very serious – that’s how it looks in the absence of any detailed explanation from Dropbox. So far, the outage has gone on for nearly two days.

Good luck with the complete fix today, Sunday, Dropbox. I’m looking forward to hearing what you have to say about it all.

[Update Jan 13, 08:25 UK time:] And Dropbox is finally working as expected, not only on the website but, more importantly, on all devices that synchronize content via a Dropbox account.

In a concise post to the Dropbox blog at about 7.20pm US Pacific time last evening, Dropbox VP of Engineering Aditya Agarwal said the service “should be up and running for all of you.”

He went into more technical detail on the Dropbox tech blog, explaining what went wrong during the maintenance update that Dropbox has said was at the heart of the outage, and what they did to fix it:

[...] A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down.

Your files were never at risk during the outage. These databases do not contain file data. We use them to provide some of our features (for example, photo album sharing, camera uploads, and some API features).

To restore service as fast as possible, we performed the recovery from our backups. We were able to restore most functionality within 3 hours, but the large size of some of our databases slowed recovery, and it took until 4:40 PM PT [on Sunday] for core service to fully return.

That, and the rest of the post, should provide some comfort that a) no data has been lost (a thorough check of your content on Dropbox would confirm that); and b) they’ve fixed the problem.

Agarwal’s post on the tech blog also covers the lessons Dropbox has learned from this experience, citing “distributed state verification” and “faster disaster recovery.”

These are core elements of the overall technology and its management, and part of reassuring users and others that Dropbox is a service you can trust.
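To illustrate the general idea of verifying a machine's state before running a destructive command on it, here is a minimal sketch in Python. It is not Dropbox's tooling: the `Machine` class, the `safe_to_reinstall` function and the machine names are all hypothetical, and a real system would query live state rather than trust a flag held in memory.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    active: bool  # hypothetical flag: True if the machine is serving traffic


def safe_to_reinstall(machines: list[Machine], targets: list[str]) -> list[str]:
    """Return only the targets that are known to the fleet and not active."""
    by_name = {m.name: m for m in machines}
    return [t for t in targets if t in by_name and not by_name[t].active]


# Hypothetical fleet: two active database machines and one idle spare.
fleet = [
    Machine("db-a-1", active=True),
    Machine("db-a-2", active=True),
    Machine("spare-7", active=False),
]

# Unknown and active machines are filtered out before anything is reinstalled.
print(safe_to_reinstall(fleet, ["db-a-1", "spare-7", "unknown-9"]))  # prints ['spare-7']
```

The point of the check is that the command refuses by default: a machine has to be positively confirmed as idle before it qualifies.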

Yet I see no mention of “communication.” That’s the missing element in the learning experience for Dropbox, essential for completing the customer reassurance circle.

During the weekend, one alarming and confidence-draining element in much of Dropbox’s communication, especially from its support Twitter account, was a clear signal that they had no idea when service would resume normally.

It’s a tricky line to walk. What level of detail do you communicate that isn’t going to alarm people even more? Or do you keep mum other than saying “we’re working on it” and risk seriously alarming people and fanning the flames of FUD? It seems they chose the latter path.

What would have served Dropbox well in this situation would have been a community of advocates – brand champions, if you will – who could have taken the detail, and communicated it within their own communities of influence.

I saw no evidence of any such activity, just the occasional update to the original status post, and lots of auto-like tweets such as the one above.

Dropbox may have fixed the serious problem technically. Where work remains is in building strong and trusted connections with their users and others who can support them online.

When I first wrote this post on Saturday evening, I included a screenshot of the user forums, showing a thread containing 193 posts across seven pages about this outage.

Now, that forum thread contains 1,290 posts spread across 43 pages, filled with anger, frustration and negativity (hostility in some cases).

There’s your feedback, Dropbox, a rich resource to learn from on what your customers think.

[Update January 13, 21:00 UK time:] Shel and I discussed the Dropbox outage kerfuffle at some length in this week’s episode 738 of our weekly business podcast, published a short while ago. The segment starts at 16:39 into the show.

About Neville Hobson

Entrepreneurial business communicator with a curiosity for tech and how people use it. Early adopter (and leaver) and experimenter with social media. Co-host of the weekly business podcast For Immediate Release: The Hobson and Holtz Report. Also an occasional test pilot of shiny new objects. Follow me on Twitter and Google+.