High Availability

Jan. 27, 2009
Keeping your access system up and running can mitigate your risk

Today, you cannot open a magazine or walk a tradeshow floor without being inundated with the word “convergence.” But “risk management” and “high availability” are quickly joining this important buzzword, and for good reason. When you tie everything together and run it all on the same network, you are putting all the proverbial eggs into one basket. That can be risky if not handled properly. But many of today’s IT systems and IP networks are designed with a number of fault-tolerant features and capabilities that let them rival the reliability (and availability) once offered only by our plain old telephone system (note that this is not the same as our cell phone networks).
The general measure of a given system’s availability (that is, its freedom from planned and unplanned downtime) can be expressed as the percentage of time it is operational over an entire year. The frequently used benchmark for mission-critical systems, such as a telephone system, is to allow for only about 5 minutes of downtime over an entire year, or 99.999-percent availability. Let’s examine several aspects and approaches that the IT world has used in combination to reach this level of availability (and reliability) and how they can be used to support your mission-critical security systems and applications.
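
The relationship between an availability percentage and a yearly downtime budget is simple arithmetic. The short Python sketch below shows it, assuming a 365.25-day year; the function name is illustrative only and does not come from any particular product.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Compare a few common availability targets.
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available -> {minutes:.1f} minutes of downtime per year")
```

For 99.999 percent, the result is roughly 5.3 minutes per year, which is where the “5 minutes” benchmark comes from.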

Network Infrastructure Redundancy and Failover
Unlike many traditional analog security systems which have several single points of failure, IP network design best practices feature redundant switches and routers with multiple interconnected paths to get from point A to point Z. This approach is used for the Internet as well as many organizations’ intranets. In these networks, IP network designers will commonly split the redundant links between two other devices such that if one link or the connected device fails, there is another path for data to transit (see Figure 1). When a given connection fails, the intelligent features (including network protocols) in these devices will automatically switch over to the alternate link, eliminating the need for manual intervention. For greater efficiency, these redundant links and devices can be made operational all of the time, not just for failover. As a result, the redundant links and devices provide traffic load-balancing in addition to delivery of higher system and application availability.
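
To make the idea of an alternate path concrete, here is a minimal Python sketch. It is an illustration only: real switches and routers rely on distributed network protocols rather than a central search like this one, and the four-device topology (A, S1, S2, Z) is a made-up example, not the network in Figure 1.

```python
from collections import deque


def find_path(links, source, destination, failed=frozenset()):
    """Breadth-first search for a path that avoids failed links and devices.

    `failed` may contain device names and/or (device, device) link pairs.
    Returns the list of devices on the path, or None if no path survives.
    """
    # Build an adjacency map, skipping links that are down or touch a down device.
    neighbors = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed or a in failed or b in failed:
            continue
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for nxt in neighbors.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical topology: host A is linked to two switches, each linked to Z.
LINKS = [("A", "S1"), ("A", "S2"), ("S1", "Z"), ("S2", "Z")]

if __name__ == "__main__":
    print(find_path(LINKS, "A", "Z"))                        # normal operation
    print(find_path(LINKS, "A", "Z", failed={("S1", "Z")}))  # one link down
    print(find_path(LINKS, "A", "Z", failed={"S1", "S2"}))   # both switches down
```

With one link or one switch out of service, traffic still has a route through the other switch; only when both redundant devices fail does the search come back empty.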
In some cases, even network edge devices or hosts (such as servers or PCs) will feature dual network connections. Assuming each network connection goes to a different network device (usually a switch), this provides an additional measure of fault tolerance and resiliency. This may be referred to as “dual homing.” To be fair, it should also be recognized that intrusion and access control systems have long featured dual connections (such as a plain old telephone system modem connection and, perhaps, a cellular connection) to achieve a similar level of availability. Nonetheless, dual homing is prevalent in many application servers found in data centers, and is becoming more popular in various networked physical security edge devices.
While network protocols will automatically route traffic to a viable path, network devices will also send real-time messages to the network operations management consoles so that failed devices and/or links can be flagged for remediation. Messages can even be sent as e-mails or to pagers, reaching network administrators or key application users (such as security system operators) wherever they may be, which facilitates even faster resolution.

Redundancy through Clusters
The concept of redundancy is frequently extended to networked application servers, such as network video recorders, access control and/or security servers. In many cases, servers/recorders may use fault-tolerant RAID (redundant arrays of inexpensive disks) techniques, allowing for the redundant storage of data. RAID, with various levels of availability, is offered by many vendors. Several of the RAID levels ensure data is not lost even if a hard drive fails. For the sake of brevity, I am assuming you are familiar with RAID storage, so this article will not go into more detail on that topic. In any case, fault-tolerant redundancy may also be extended across multiple servers through “clustering.”
Clustering enables two separate servers to appear as one, generally with one acting as the primary application server and the second as a back-up. Software, or a portion of the server software known as a service (such as Microsoft Cluster Services), monitors the health of the server and its back-up server. If a problem is detected, the service can signal the back-up server to take over primary operation. The service also controls the IP addresses of the server cluster so that other networked devices or clients are unaffected when the back-up server takes over after a server failure.
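
The monitoring-and-takeover behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Microsoft implementation: the class name, the server names, the shared address and the “three missed health checks” threshold are all hypothetical.

```python
class ClusterMonitor:
    """Toy model of a cluster service that reassigns a shared IP on failure."""

    def __init__(self, primary, backup, virtual_ip, max_missed=3):
        self.active = primary          # server currently answering on the shared IP
        self.standby = backup          # server waiting to take over
        self.virtual_ip = virtual_ip   # address clients use; it never changes
        self.max_missed = max_missed   # assumed threshold before failover
        self.missed = 0

    def record_heartbeat(self, received: bool) -> str:
        """Record one health check of the active server; return the shared IP's owner."""
        if received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Fail over: the standby takes the shared address,
                # so clients keep using the same IP.
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active


if __name__ == "__main__":
    monitor = ClusterMonitor("server-a", "server-b", "10.0.0.50")
    for ok in (True, False, False, False):
        owner = monitor.record_heartbeat(ok)
        print(f"heartbeat ok={ok}: {monitor.virtual_ip} is served by {owner}")
```

Because clients only ever connect to the shared address, they do not need to know which physical server is answering at any given moment.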
There are several ways to configure redundant servers to deliver a “highly available” system. In the most basic configuration, sometimes referred to as “cold-standby,” only the primary server is working, or active, and the back-up server is “offline.” In this configuration, the standby server is not supporting any transactions or operations. In the event of a primary server failure, the standby server may require additional configuration prior to taking over. This may include restoration of the data in the primary server’s database. Any computation or transaction that the primary server was handling during failover may have to be re-initiated when the standby server becomes operational, depending on the frequency of the database replication function.
In “warm-standby” configurations, the back-up server may have been partially configured, but some parameters may require updating before the application can resume normal operation. This scenario assumes that any relevant existing database was uncorrupted or that database replication minimized any data loss. The benefit of a warm-standby configuration over cold-standby is generally the time savings for the system to return to an operational state.
In some scenarios, both servers can be configured to operate concurrently (sometimes called an active-active configuration, or “hot-standby”) with each server acting as the back-up for the other. In this case, the servers may be running the same computation or function, so the failover is completely invisible. Each server has a current version of the database (data in each of the servers is completely synchronized), so normal operation continues even if one of the servers fails.
For an additional measure of high availability, the servers in any of the configurations could be operating in different locations. As such, a catastrophic event or failure at one server location would not affect the operation at the backup server’s location.
For an access control application using a Microsoft-based server, cluster services are supported in Microsoft Windows Server 2003. From a database perspective, the servers would also require a version of Microsoft SQL Server that supports “cluster awareness.” Finally, the access control system vendor’s application software may also need to be configured to be aware of a back-up server failover capability. In some cases, only a single vendor license for the application may be required, as only one access control system is running at any given time. But in other cases, it may be necessary to have a second license for the back-up server.

Resiliency Through Message Queues
Many network protocols or application communications assume that a direct connection exists between hosts (i.e., servers and/or edge devices) at all times. Unfortunately, if the link fails, a message or alert may never be received by the intended host. This is where other resiliency features can complement some of the high availability features discussed thus far. By using a messaging technology known as Microsoft Message Queuing (MSMQ), applications on disparate servers keep a list (or queue) of recent events, alarms or other alerts so that they can be sent to another application or device once a communication link is restored. As a result, MSMQ provides reliable and resilient (but not necessarily timely) delivery of messages between hosts and applications.
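
The store-and-forward idea behind message queuing can be illustrated with a short Python sketch. To be clear, this does not use the MSMQ API; the class names and the simulated link are hypothetical stand-ins meant only to show how queued alarms are held during an outage and delivered, in order, once the link returns.

```python
from collections import deque


class FlakyLink:
    """Hypothetical link that can be switched up or down for the demonstration."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("link is down")
        self.received.append(message)


class StoreAndForwardQueue:
    """Hold messages locally and deliver them in order when the link allows."""

    def __init__(self, send):
        self._send = send          # callable that raises ConnectionError on failure
        self._pending = deque()

    def publish(self, message):
        """Queue a message, then attempt delivery of everything pending."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Deliver queued messages oldest-first; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break              # keep the message for a later retry
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)


if __name__ == "__main__":
    link = FlakyLink()
    queue = StoreAndForwardQueue(link.send)
    link.up = False
    queue.publish("door forced open")
    queue.publish("badge denied")
    print(f"{len(queue)} alarms waiting while the link is down")
    link.up = True
    print(f"{queue.flush()} alarms delivered after the link is restored")
```

Note that the alarms arrive late but intact, which is exactly the trade-off described above: delivery is reliable, though not necessarily timely.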
Today’s IT networks are ideally suited to support various applications, from efficiently supporting financial markets transacting equity trades in the billions of dollars, to IP telephony, to physical security systems. They have proven their availability in times of crisis such as the Sept. 11 attacks or Hurricane Katrina — being the only systems to remain up, or the last system to go down and the first to come back up.
Implementing a high-availability solution should take into consideration the criticality of security to a given organization. A portion of system down-time risk can be mitigated simply by selecting vendors whose systems are more reliable and by following installation and maintenance best practices. At a minimum, the application and its database should be backed up regularly and religiously. 
You may find that your IT group has already implemented some of these features and capabilities. If not, some physical security vendors and systems integrators offer technical/professional services and support to handle the system configuration for you. As a result, you may be able to enjoy the benefits of a highly available security system for a relatively modest incremental investment.

Bob Beliles is vice president of enterprise business development for Hirsch Electronics (www.HirschElectronics.com), a manufacturer of IP-based access control and identity management systems. Prior to joining Hirsch, Mr. Beliles co-founded Cisco Systems’ physical security initiative and led a number of product development efforts. He can be reached at [email protected].