Guest Column | June 9, 2014

A Holistic Approach To Backup And Data Availability

By Sheldon D’Paiva, Product and Solutions Marketing, Nimble Storage

Storage silos have long ruled the traditional datacenter. Businesses have been purchasing separate primary and secondary storage systems, and for good reason: the end users of business-critical applications such as email and databases demand the performance that primary storage systems deliver. When backing up those applications, however, the backup data does not need to be accessed in real time, and the decision point shifts back to cost, where secondary storage systems excel. But an unrelenting tide is eroding the walls between those silos. Nimble Storage recently conducted a data protection survey with 1,600 participants, which found that the majority of enterprises believe they cannot afford to lose more than six hours' worth of data (a recovery point objective, or RPO, of six hours) and that they must be able to recover protected data in less than six hours (a recovery time objective, or RTO, of six hours). Using separate primary and secondary storage systems mandates that data be read from the primary system, moved across the network, and then written to the secondary system, making six-hour RPOs and RTOs all but impossible at scale.
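To see why the read-move-write approach struggles at scale, consider some back-of-the-envelope arithmetic. The dataset size, link speed, and utilization below are illustrative assumptions, not figures from the survey:

```python
# Illustrative arithmetic (numbers assumed): the time just to move a
# full backup across the network puts a floor under the RPO and RTO
# achievable with a read-move-write approach.

DATASET_TB = 20      # hypothetical working set
LINK_GBPS = 10       # dedicated 10 Gb/s backup network
EFFICIENCY = 0.8     # realistic sustained link utilization

bytes_total = DATASET_TB * 1e12
bytes_per_sec = LINK_GBPS / 8 * 1e9 * EFFICIENCY
hours = bytes_total / bytes_per_sec / 3600

print(f"transfer alone: {hours:.1f} hours")  # ~5.6 hours, before any
                                             # read or write overhead
```

At roughly 5.6 hours for the network transfer alone, a 20 TB dataset leaves essentially no headroom inside a six-hour RPO or RTO window.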

Meeting strict data protection requirements at scale requires a different approach, one that avoids the traditional read-move-write methodology and its impact on production infrastructure. To meet these requirements, businesses are increasingly turning to integrated storage systems: systems that serve as primary storage but also provide integrated data protection through highly efficient storage snapshots. In fact, Gartner estimates that by 2016, 20 percent of large enterprises will transition from traditional backup and recovery solutions to snapshot and replication techniques alone, up from 7 percent in 2013 ("Magic Quadrant for Enterprise Backup/Recovery Software," published June 5, 2013). And a survey from the Enterprise Strategy Group shows that 55 percent of customers plan to augment their traditional backup with snapshots, replication, or both ("Trends in Data Protection Modernization," published August 16, 2012). Three key underpinnings, namely a fault-tolerant architecture, modern storage snapshots, and data analytics, work together to deliver a foundation for both high data availability and effective data protection.

To deliver high data availability, a storage system must be built on a fault-tolerant architecture: one designed to tolerate failures at multiple levels. This is essential to delivering "five nines" availability, or system uptime of 99.999 percent. Failures at the component level must be detected and corrected; RAID technologies, for example, use redundancy to recover from individual drive failures. A system should also be fault tolerant at the sub-system level to eliminate single points of failure. Enterprise-class storage systems are typically designed with redundant controllers that can take over in the event of a failure. Beyond mitigating the risk of sub-system failures, the second controller also helps maintain uptime and data availability during upgrades: it can be upgraded while the first controller services workloads, then take over those workloads so the first controller can be upgraded in turn. In a primary storage system with integrated data protection, fault tolerance at the component and sub-system levels also ensures the integrity of the data in storage snapshots and the continued operation of data management services, such as snapshot orchestration, even if a component or sub-system fails.
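To make "five nines" concrete, here is a minimal sketch (with illustrative names, not any vendor's tooling) of how little downtime each availability target actually permits per year:

```python
# Back-of-the-envelope check of what an availability target allows.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def max_downtime_minutes(availability_pct: float) -> float:
    """Maximum allowable downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% allows {max_downtime_minutes(pct):.1f} minutes of downtime per year")

# Output:
# 99.9% allows 525.6 minutes of downtime per year
# 99.99% allows 52.6 minutes of downtime per year
# 99.999% allows 5.3 minutes of downtime per year
```

At five nines, a single unassisted controller reboot or drive rebuild could consume the entire yearly downtime budget, which is why redundancy at both the component and sub-system levels is non-negotiable.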

When it comes to data protection, storage snapshots provide RPOs and RTOs unmatched by traditional methods with their read-move-write impact on production infrastructure. A storage snapshot is essentially a point-in-time version of the data on disk. For effective use in aggressive data protection scenarios, snapshots need to be highly efficient, with a well-thought-out implementation and a metadata layer of pointers to the data, so that the data itself is not moved or copied every time a snapshot is taken and only the changed blocks are stored. For very aggressive data protection needs, the storage system must be able to take snapshots every 15 minutes (15-minute RPOs) and retain those snapshots cost-effectively. Leveraging SSDs (solid state drives) for snapshot metadata and high-density disk for snapshot data enables a storage system to take frequent snapshots without impacting the performance of critical workloads, while storing the snapshot data cost-effectively enough to address the majority of data recovery cases. Lastly, storage snapshots can be used to recover data even after a complete array failure or site outage by replicating the snapshots to another storage system, often at a remote site for disaster recovery purposes. Because only the changed blocks in each snapshot are replicated, the replication process consumes little network bandwidth.
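The pointer-metadata idea is easiest to see in miniature. The sketch below is illustrative, not any vendor's implementation: a volume is a map from logical block numbers to physical blocks, a snapshot copies only that pointer map, and overwrites land in new physical blocks, so snapshots cost nothing at creation time and only changed blocks consume space or replication bandwidth:

```python
class Volume:
    def __init__(self):
        self.blocks = {}     # physical block store: id -> data
        self.ptr = {}        # logical block number -> physical id
        self.snapshots = {}  # snapshot name -> frozen pointer map
        self._next_id = 0

    def write(self, lbn: int, data: bytes) -> None:
        """Redirect-on-write: new data always lands in a new physical block."""
        self.blocks[self._next_id] = data
        self.ptr[lbn] = self._next_id
        self._next_id += 1

    def snapshot(self, name: str) -> None:
        """Taking a snapshot copies pointers only; no data is moved."""
        self.snapshots[name] = dict(self.ptr)

    def changed_blocks(self, old: str, new: str) -> dict:
        """Blocks that differ between two snapshots: the replication delta."""
        a, b = self.snapshots[old], self.snapshots[new]
        return {lbn: self.blocks[pid] for lbn, pid in b.items()
                if a.get(lbn) != pid}

vol = Volume()
vol.write(0, b"v1")
vol.snapshot("t0")
vol.write(0, b"v2")                         # overwrite goes to a new block
vol.snapshot("t1")
print(vol.changed_blocks("t0", "t1"))       # {0: b'v2'}: only this replicates
print(vol.blocks[vol.snapshots["t0"][0]])   # b'v1': snapshot data intact
```

A production implementation adds reference counting, garbage collection, and application-consistent quiescing, but the essential property is the same: snapshot creation and incremental replication scale with the change rate, not the dataset size.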

A fault-tolerant architecture is the baseline for high data availability, and an efficient storage snapshot implementation can deliver on the service level agreements that aggressive data protection requires, but data analytics can proactively ensure the overall wellness of the storage environment. In fact, data analytics can improve data availability beyond "five nines." To do so, however, the analytics need to be well integrated with the storage system. The storage environment must be monitored across millions of metric values per system, from physical metrics such as fan speeds and power supply status to data services metrics such as volume sizes and performance levels. Although the storage system needs to collect and track this vast number of metrics, processing them all on the storage system itself would add load that could impact performance. A cloud-deployed service can instead collect the telemetry data from the storage system and process it in the cloud, offloading the heavy lifting. This also removes the need to deploy additional infrastructure onsite, while enabling multiple systems to be monitored from anywhere for data availability and data protection compliance. Another benefit of a cloud-deployed service is that the analytics engine is "always on" and not coupled to site-wide outages, as an on-premises solution might be. Not only can a cloud-based analytics solution notify the user in the event of a complete storage system failure, it can also improve system uptime by proactively notifying both the user and customer support of events that may interrupt data availability, such as drive or controller failures, so that corrective action can be taken before a problem occurs. From a storage vendor's perspective, a cloud-deployed analytics engine can help investigate the root cause of unplanned downtime for specific customers, and fixes can be pushed out not only to affected customers but to all customers at risk of the same issue. Proactive notifications can also verify that the data protection strategy is working as expected; for example, the analytics platform should notify the user whenever there is a problem with storage snapshots, application integration, or replication.
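The division of labor described above, with cheap collection on the array and heavy analysis in the cloud, can be sketched in a few lines. All metric names, thresholds, and rules here are hypothetical, chosen only to illustrate the pattern of flagging conditions before they interrupt availability:

```python
import time

def collect_samples() -> dict:
    """On-array agent: inexpensive reads only; analysis happens off-box."""
    return {
        "timestamp": time.time(),
        "fan_rpm": 7200,
        "drive_reallocated_sectors": 14,
        "controller_failover_ready": True,
        "last_snapshot_age_sec": 920,
    }

# Hypothetical cloud-side rules: flag conditions that precede failures
# or indicate the data protection schedule is slipping.
RULES = [
    ("fan_rpm", lambda v: v < 1000, "fan degraded"),
    ("drive_reallocated_sectors", lambda v: v > 100, "drive pre-failure"),
    ("controller_failover_ready", lambda v: not v, "no failover path"),
    ("last_snapshot_age_sec", lambda v: v > 900, "RPO at risk: snapshot overdue"),
]

def evaluate(sample: dict) -> list:
    """Cloud-side check: return the alert messages for any rule that trips."""
    return [msg for key, is_bad, msg in RULES if is_bad(sample[key])]

for alert in evaluate(collect_samples()):
    print("ALERT:", alert)  # here: "ALERT: RPO at risk: snapshot overdue"
```

Note that the snapshot-age rule fires even though every hardware metric is healthy; that is the data-protection-compliance monitoring the paragraph above describes, surfacing a slipping RPO before any data is actually at risk.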

A fault-tolerant architecture, modern storage snapshots, and powerful data analytics form the basis for a holistic approach to backup and data availability, resulting in storage systems that operate in peak condition at all times. Together, these three underpinnings provide both high data availability and an effective data protection platform for aggressive requirements. By deploying primary storage systems with integrated data protection, businesses can realize cost savings not only by eliminating secondary storage silos for data protection, but also through simplified management and operation.