SIW Roundtable: The reliability of cloud security services

Last month, an outage suffered by Amazon Web Services at one of its data centers near Washington, D.C., brought down several large websites, including Reddit and Foursquare, for an extended period of time.

Though the outage was only a blip in the world of cloud computing, it has left many people concerned about the reliability of hosted services. As more and more security end-users switch to hosted video and access control for their numerous benefits, such as reduced maintenance and infrastructure costs, outages like this have the potential to negatively impact people's perception of the cloud.

To address some of the concerns raised by this outage, SIW spoke with some of the leading providers of hosted security services to get their take on the outage and how similar incidents could impact end-users.

How would a cloud outage like this impact hosted security services offered by your company?

Steve Van Till, president and CEO of Brivo Systems: (Amazon) divides up their system into what are called availability zones, and they have a bunch of different availability zones throughout the United States. The way it works is that if you are hosting with them, you can pay to be in one availability zone for one price or, if you want redundancy and protection against failure, you can pay to be in two zones, or three zones or five zones. All the people that were affected (by the outage) paid to be in one zone, and that just violates the principle of system design forever in engineering, which is redundancy. The kind of outage that Amazon had would not affect Brivo or companies like us because we have multiple data centers, any one of which can go down and the service keeps running.

Brian Lohse, director of business development for Secure-i: Generally speaking, we are always planning based on failure. That happens at multiple levels. It happens at the user level, where all of our systems are going to use some form of a NAS (network attached storage) drive or SD card locally, so that whether it's an Internet connection that is failing or the entire cloud that's failing or anything in between, they still have recording locally. Level two is hardware redundancy. Even within a data center, within a single rack of equipment, you have servers and hard drives that are prone to failure. Switches, cables, all of those things can go bad, so there is a hardware redundancy. In our case, it's N-plus-two, which means there are essentially three of everything from the hardware level. On top of that, there is the data center itself, and you can build redundancy at that level too.
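The redundancy argument in the two answers above (multiple availability zones, or "three of everything" in hardware) can be put into rough numbers. The following is a minimal sketch, not any vendor's actual model: the function name and the 1 percent downtime figure are illustrative assumptions, and the arithmetic only holds if the copies fail independently of one another, which shared power, software or network faults can violate.

```python
def prob_all_down(p_single: float, copies: int) -> float:
    """Chance that every one of `copies` independent replicas is down at once.

    Assumes failures are independent; correlated failures (shared power,
    a common software bug) make the real figure worse than this estimate.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


# Illustrative assumption: each zone or server is down 1% of the time.
P_DOWN = 0.01

for copies in (1, 2, 3):
    outage = prob_all_down(P_DOWN, copies)
    print(f"{copies} copy/copies: service unavailable {outage:.6%} of the time")
```

Under these assumptions, each added copy multiplies the chance of a total outage by the single-copy failure rate, which is why a customer hosted in one zone is so much more exposed than one hosted in two or three.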

Matt Krebs, business development manager for hosted services at Axis Communications: I think you could take a couple of different perspectives. In today's cloud world, a lot of folks just assume there are redundancies and backup plans in place with most cloud data centers, and it's not always the case. With the folks we've selected for our hosted video program, we make sure that they do have redundancies, backups and contingency plans in place in case any of their primary centers do fail. We rely on our partners to have that kind of redundancy and backup. In the case where we don't have those kinds of redundancies and we have some partners that don't necessarily have those capabilities, we've made provision for local onsite backup storage in the form of a network attached storage device. If anything does fail in the cloud, the customer or end-user that's consuming these hosted services will still have the ability to retrieve that video from a local drive. As a third stopgap to make sure our customers don't experience data loss, we also have a number of cameras that carry onboard SD card slots.

Jon Herlocker, chief technology officer for EMC's Mozy cloud services division: We don't leverage Amazon in any way. EMC has its own physical assets, we have a separate data center and we have our own hardware, networks and Internet connectivity. One of the things that we at EMC have done that we haven't really seen other folks do to the same extent is we really invested enormously in ensuring that our service is robust and protected against all of these potential failures. We have incredible amounts of redundancies across the board so there is no single point of failure. We believe that the probability of us having a failure like Amazon saw is much smaller.

Are there local appliance backups that can kick in when an outage occurs?

Van Till: In access control, the technical term for this is "hybrid architecture," meaning that there is a component of the product or service that runs locally and there is another component that runs in the cloud. For example, most access control systems are that way. The local component is an edge device or a control panel. It has its own cache of data that it needs to do its job. In our architecture, if the Internet went away or the website crashed or the communications link was cut off to that facility, the control panel still does what it's supposed to do, which is to authorize users to come through doors. The fact that this happened in the cloud and that people have made this a cloud story really obscures the fact that it's really just a system design story. Good engineering is good engineering.
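The hybrid pattern described above, a local panel that keeps its own cache and continues to make door decisions when the cloud is unreachable, can be sketched in a few lines. This is an illustration only and not Brivo's implementation: the class names, the `FakeCloud` stand-in and the badge identifiers are all hypothetical.

```python
from typing import Callable, Set


class CloudUnavailable(Exception):
    """Raised by the (hypothetical) cloud client when the link is down."""


class ControlPanel:
    """Edge device that always decides from its local cache."""

    def __init__(self, fetch_authorized: Callable[[], Set[str]]) -> None:
        self._fetch = fetch_authorized
        self._cache: Set[str] = set()

    def sync(self) -> bool:
        """Refresh the cache from the cloud; keep the old cache on failure."""
        try:
            self._cache = set(self._fetch())
            return True
        except CloudUnavailable:
            return False  # last known good data stays in place

    def authorize(self, credential_id: str) -> bool:
        """Door decision made locally, with no network call."""
        return credential_id in self._cache


class FakeCloud:
    """Stand-in for a hosted service that can be switched offline."""

    def __init__(self, authorized: Set[str]) -> None:
        self.authorized = set(authorized)
        self.online = True

    def fetch(self) -> Set[str]:
        if not self.online:
            raise CloudUnavailable("link down")
        return set(self.authorized)


cloud = FakeCloud({"badge-1001", "badge-1002"})
panel = ControlPanel(cloud.fetch)
panel.sync()            # normal operation: cache filled from the cloud
cloud.online = False    # simulate the outage
panel.sync()            # fails quietly, cache is untouched
print(panel.authorize("badge-1001"))  # decision still made locally
```

The design choice worth noting is that `authorize` never touches the network, so an outage affects only how fresh the cached data is, not whether the door opens.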

Lohse: Most of our systems, even outside of outages, are hybrid in their nature, meaning that the cameras are recording both locally and to the hosted (service) all the time. And that's for various reasons. One of those reasons is for failure, most commonly of the user's Internet connection. It's much more likely that they would lose (their local Internet connection) than the data center falling off the map. Even one step beyond that is you lose power. That happens all the time.

What impact does an incident like this have on the adoption of hosted security services?

Van Till: There are a lot of careless journalists out there who are writing it as a cloud story. People love to put alarmist headlines out there. CNN called it "the Amazon Titanic event" and there are other people who said "this is a real wake-up call for the cloud industry." A lot of business people who aren't really interested in the underlying technology or what really went on are going to see these headlines and that does impact their thinking about adoption. Anyone in engineering is going to understand that the buyers basically didn't do their own homework and should've had redundancy as part of their overall service plan.

Lohse: It's not good, of course. I think it's mostly bad because of the hype more so than anything else. When you're looking at the reliability of a solution, looking at that is only relevant when you're comparing it to the alternative. I think to look at this accurately, you have to look at it next to a study of how often DVRs fail. The reason that this is such a big news item is that, whether it's DVRs or people's local servers or storage getting corrupted in local networks, these things happen all the time, but they happen one at a time, so they don't typically make news. The difference with the cloud is that with something like this, you get millions of people affected at once. I think it's unfortunate. On some level I think it can be a good thing, in the sense that there is some good to come out of it: if you read and take the proper precautions, these things can be avoided. People have to realize that all technology fails. No matter whether it's in your own office or Amazon, having the proper plans in place is the only thing that's going to reduce your downtime.

Krebs: I think it certainly gives people pause for concern. They will look at this, take a step back and those that were completely against it to start with will continue to fuel the fire from news like this. Those that were on the fence may take a step back and say, "gosh, we need to reconsider this." And I think those folks already onboard with cloud services can maybe take a page out of this book of what happened, why it happened and how they can protect themselves to make sure that something like this doesn't happen again in the future. I think, all in all, it's about education, and people have a perception and perception certainly is reality. If they just see on the surface that there was a cloud failure at Amazon, I think that could have a bit of a detrimental effect, but I think overall adoption won't be significantly affected by this. In fact, I think if anything, lessons can be taken from it to make the space even stronger in time.

Herlocker: People are looking at the cloud who don't necessarily understand the whole picture, and it's definitely going to give them some concern. The reality is that companies that are thinking about doing video surveillance only have to look internally at their own IT organization and look at what kind of availability they have within their own IT system to really understand that the cloud is often a more available solution than something that they would host themselves. It's a common blind spot that people have: they see outages happening at other sites and they forget that they have their own outages on a regular basis with their own IT services. The reality is that, well, robust, reliable, reputable operators of clouds are basically going to have higher availability services than your own IT offering in almost every case. They can afford more redundancy, they can hire network engineers that have personal relationships with all the Internet providers to manage those Internet connections and they can afford more hardware.

As more and more end-users migrate to cloud services, do you see outages becoming a bigger problem?

Van Till: How you manage redundancy and how you manage capacity is going to depend a little bit on the nature of your service. What most people start by doing is having two data centers, and they are mirror images of one another. When you get up to four or five or six data centers, what you'll start doing in order to keep costs in line is data center "A" will be paired with data center "B" and part of "B" will be mirrored over to "C," and the strategy becomes a little bit more complex. But as long as every piece of data and every application has a counterpart somewhere else in your network, then you can vastly reduce the likelihood that you'll find yourself in this situation.
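The rule stated above, that every piece of data and every application should have a counterpart in another data center, is easy to check mechanically once placements are written down. The sketch below is a hypothetical illustration: the workload names, the site labels and both helper functions are assumptions made for the example, not a description of any provider's tooling.

```python
from typing import Dict, List, Set


def unprotected(placement: Dict[str, Set[str]], min_copies: int = 2) -> List[str]:
    """Workloads stored in fewer than `min_copies` data centers."""
    return sorted(
        name for name, sites in placement.items() if len(sites) < min_copies
    )


def survives_loss_of(placement: Dict[str, Set[str]], failed_site: str) -> bool:
    """True if every workload still has a copy after `failed_site` goes down."""
    return all(sites - {failed_site} for sites in placement.values())


# Hypothetical layout: "A" is paired with "B", part of "B" is mirrored to "C",
# and one workload has been left with a single copy by mistake.
placement = {
    "video-archive": {"A", "B"},
    "web-app": {"A", "B"},
    "access-db": {"B", "C"},
    "reports": {"C"},
}

print("Single-copy workloads:", unprotected(placement))
for site in ("A", "B", "C"):
    print(f"Survives loss of {site}:", survives_loss_of(placement, site))
```

In this made-up layout the "reports" workload lives only in data center "C," so losing "C" takes it offline even though the other workloads ride through any single failure.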

Lohse: I think that responsible decision makers will be able to look at this and place themselves on what I would call the risk curve. There's always going to be a tradeoff if you're looking at a scale here. This is the same whether you're running local servers or using the cloud. Financial cost and risk are on the two sides of it. You could say, "well, we're going to buy two of everything and it's going to cost us twice as much, but our risk is reduced." There's going to be a balance, and every company has their own curve of acceptance of risk versus financial costs. The lesson to be learned is, like anything, whether it's a local or cloud solution, there is a right and wrong way to do it, and companies who are responsible and are willing to pay for high quality service will get it.

Krebs: If you don't learn from history, you're doomed to repeat it. I think everybody will learn from a substantial outage like this. Why did it happen? How do we avoid it? In the future, what contingency plans and redundancies and backup strategies can we put in place to make our offering even more robust, even more reliable? I think they will also try to strengthen their service level agreement (SLA) with their end-users. When you do sign up for cloud services, I think people need to be very aware of what your SLA says that your provider has to provide for you. And I think what you're going to see more of is people standing behind stronger SLAs, which means they are going to provide higher availability, more uptime.

Herlocker: There are probably going to be fewer outages in quantity, because as the scale increases, you have the ability to invest in more redundancy. You also have better (knowledge) on how to manage that.

The security of the cloud itself has always been a sticking point with many end-users with regard to cloud services. Do you think this outage and others like it will hurt people's perception of the security of the cloud?

Van Till: People read things superficially and they react. On one hand, I will say yes, it will have repercussions in some people's minds. Should it? I don't think so. There isn't a company out there that hasn't had internal system failures. Before any of us were using the cloud, didn't your IT department sometimes say, "well, the mail server is down or the server is down or you can't file backups today"? This is a fact of life with computer systems, and the ways you deal with it are redundancies, multiple systems, all of those kinds of techniques, which are the same whether it's in your own data center or in the cloud.

Krebs: I think a failure and security measures are two separate issues. When it comes to security measures, I think some of the most advanced security encryptions and security measures are in place to protect that data from the source to the cloud. If you separate the two ideas, a failover versus security, I think we've addressed the security concern.