White Paper | February 27, 2013

Using Active Archives For Long-Term, Highly-Accessible Data Retention

Source: Active Archive Alliance

By Dave Thomson, Senior Vice President of Sales and Marketing for QStar Technologies and Active Archive Alliance Board Member

Traditional storage solutions can no longer meet the need to store critical digital content at the petabyte level and beyond. Big Data, cloud, backup and protection, data preservation, compliance, and the explosion of unstructured data are all driving the need for more advanced storage capabilities. Many organizations are architecting active archives to solve this problem. An active archive is an online file and storage system that gives users real-time access to data stored for long-term retention. Active archives combine hardware and software systems that are available from a number of vendors. Multi-vendor active archive solutions can be designed and tailored to meet each organization’s unique needs.

Active archives drive significant efficiencies and have been implemented in all industries, all over the world. Organizations that benefit most from using active archive solutions are ones that have been creating data over long periods of time and have accumulated large amounts of data. Organizations in the high performance computing (HPC) market were some of the first to discover the benefits of active archives, and while that sector remains at the forefront of adoption today, organizations across all types of industries are accelerating adoption of these technologies. So, what is driving this movement and what benefits can an organization achieve with an active archive?

The greatest benefit to an organization from using an active archive is reduced cost of acquiring additional disk storage.

IT directors are well aware of the ever-increasing requirement for storage capacity within their organizations. Both the average size of files and the number of files created, downloaded or shared are increasing exponentially. This growth drives a never-ending requirement for more capacity to store all the content.

Disk storage manufacturers, which now also include SSD manufacturers, are realizing that data cannot stay in one location for its entire useful life. As a result, many vendors are creating hybrid primary storage systems.

A hybrid primary storage system consists of at least two forms of storage with a mechanism to move files or blocks between one form and another based on the frequency of access. Such a system combines a fast storage technology, using SSD or 15k RPM Fibre Channel disk drives, with lower-performance 7.2k or 5.4k RPM SATA disk. SSD is often referred to as “Tier 0”, Fibre Channel disk as “Tier 1” and SATA disk as “Tier 2”. Data is automatically moved from one tier to the next, both down the hierarchy and back up, which reduces the amount of expensive Tier 0 storage required. The solutions typically include a larger percentage of the cheapest Tier 2 disk to keep the total cost as low as possible. These hybrid solutions are SAN- or NAS-based storage systems; however, like all forms of disk storage, they are not completely secure against data loss. As a result, all data must either be backed up or replicated.
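The tiering mechanism described above can be sketched as follows. This is a minimal illustration only: the access-rate thresholds, the `Block` type and the function names are assumptions made for the example, and real hybrid arrays use their own, usually proprietary, heuristics.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical thresholds in accesses per day; not taken from any product.
TIER0_MIN_ACCESSES = 100  # hot data stays on SSD (Tier 0)
TIER1_MIN_ACCESSES = 10   # warm data goes to 15k RPM disk (Tier 1)


@dataclass
class Block:
    block_id: str
    accesses_per_day: float


def choose_tier(block: Block) -> int:
    """Pick a tier (0, 1 or 2) from the observed access frequency."""
    if block.accesses_per_day >= TIER0_MIN_ACCESSES:
        return 0
    if block.accesses_per_day >= TIER1_MIN_ACCESSES:
        return 1
    return 2  # everything else lands on SATA disk (Tier 2)


def plan_moves(current: Dict[str, int], blocks: List[Block]) -> Dict[str, Tuple[int, int]]:
    """Return {block_id: (from_tier, to_tier)} for blocks on the wrong tier.

    Blocks with no recorded placement are assumed to sit on Tier 2.
    """
    moves = {}
    for block in blocks:
        source = current.get(block.block_id, 2)
        target = choose_tier(block)
        if source != target:
            moves[block.block_id] = (source, target)
    return moves
```

Because `plan_moves` compares the current placement with the target tier, the same pass covers both directions: cooling data is demoted and reheated data is promoted.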

Organizations back up their data to secure it from disk or user errors, and for disaster protection. Backups are scheduled on a periodic basis and typically consist of full backups plus incremental backups. The intention is that, no matter when the data was lost or corrupted, a copy of the relevant data from before the loss or corruption will be secured on media that is not online. In the past, tape was the method most often used. More and more, organizations are using de-duplicated backup appliances to manage backups within available backup windows. De-duplication technology typically speeds the operation and reduces the total capacity required.

Alternatively, and sometimes in addition, replication of primary data to a second store or site can be implemented.  This can be very expensive as a second storage system must be purchased along with a high performance Wide Area Network (WAN) connection between the two locations.

Active archives work on the same principles as hybrid storage. By identifying files that do not need to be accessed as quickly, data can be moved from primary storage to lower-cost technology such as tape, allowing organizations to reuse their primary storage capacity and achieve considerable savings. The art of using an active archive successfully lies in identifying the files and data that users access infrequently.
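One way to gauge how much data is accessed infrequently is to scan a file tree and total the bytes that have not been read for some time. The sketch below is an illustration under stated assumptions: the function name and the one-year default are hypothetical, and last-access times are only meaningful if the file system records them (volumes mounted with options such as `noatime` do not).

```python
import os
import time
from typing import Optional


def archivable_fraction(root: str, min_idle_days: int = 365,
                        now: Optional[float] = None) -> float:
    """Return the share of bytes under `root` not accessed for `min_idle_days`.

    Assumption: st_atime is trustworthy on this file system.
    """
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * 86400  # seconds per day
    total_bytes = 0
    idle_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.stat(os.path.join(dirpath, name))
            total_bytes += info.st_size
            if info.st_atime < cutoff:
                idle_bytes += info.st_size
    return idle_bytes / total_bytes if total_bytes else 0.0
```

A result of 0.5 would suggest that about half of the scanned capacity is a candidate for the archive, under whatever idle period the organization chooses.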

Active archives can reduce the cost of acquiring new primary storage and the cost of its associated backup.

Example 1:

In the example below, an organization requires 200TB of storage. The first option (option A) is to purchase 200TB of primary storage protected with a de-duplicated backup appliance. The second option (option B) is to use 100TB of primary storage and 100TB of active archive using a tape library.

If the organization were able to store 50% of its data in an active archive, it would save $749,000 in acquisition costs.

Chart 1

Example 2:

Below is a case where an organization requires 1PB of storage. In option A, the organization purchases 1PB of primary storage with a de-duplicated backup appliance. In this example, a larger percentage (60%) of its data could be stored in an active archive. This would result in 400TB of primary storage plus 600TB of active archive using a tape library (option B).

In this second example, by reducing their 1PB requirement of primary storage to 400TB and adding 600TB of active archive the organization would save over $4.6 million in storage acquisition costs.

Chart 2

The benefits do not end there. One of the key considerations in using de-duplicated backup systems is fitting the backup job within the available window. Moving data to an active archive significantly reduces the time required for each backup and the volume of data in it or, more accurately, keeps both from growing at the same rate as the total data.

By using an active archive, organizations begin to create a data hierarchy that can provide significant benefits in the event of a disaster. The recovery time in the event of a catastrophic primary storage system failure can be reduced considerably, because only primary data needs to be restored. In the above examples, primary storage holds 50% and 40% of the total data respectively, so, assuming restore time scales with the volume of data, the organizations would recover their data in roughly 50% or 40% of the time previously taken.
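The recovery arithmetic for the two examples can be checked with a short sketch. The function names are hypothetical, 1PB is taken as 1000TB to match the 400TB plus 600TB split, and restore time is assumed to be proportional to the volume of primary data.

```python
from typing import Tuple


def split_capacity(total_tb: float, archive_percent: int) -> Tuple[float, float]:
    """Return (primary_tb, archive_tb) for a given archive percentage."""
    archive_tb = total_tb * archive_percent / 100
    return total_tb - archive_tb, archive_tb


def restore_time_fraction(total_tb: float, archive_percent: int) -> float:
    """Fraction of the original restore time, assuming time scales with volume."""
    primary_tb, _archive_tb = split_capacity(total_tb, archive_percent)
    return primary_tb / total_tb


if __name__ == "__main__":
    print(split_capacity(200, 50), restore_time_fraction(200, 50))    # (100.0, 100.0) 0.5
    print(split_capacity(1000, 60), restore_time_fraction(1000, 60))  # (400.0, 600.0) 0.4
```

In practice, restore throughput also depends on the backup system and network, so these fractions are an approximation rather than a guarantee.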

Another advantage is that active archive hardware and software have a typical usable life at least twice as long as that of disk-based primary and de-duplicated backup systems. This makes the cost of ownership for an active archive even more advantageous, since it will only need to be replaced every 8 to 10 years, whereas disk-based systems need replacing every 4 to 5 years. And, when it is time to update their active archive system, users would also have the option to upgrade the tape library with new, higher-capacity drives that would increase capacity without increasing storage footprint, power or management costs.
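The effect of the longer service life on purchasing can be illustrated with a small calculation. This is a sketch only: the function name and the 10-year planning horizon are assumptions, and it counts system purchases rather than cost, since no prices are given here.

```python
import math


def purchases_over(horizon_years: float, service_life_years: float) -> int:
    """Number of systems bought (initial plus replacements) over the horizon."""
    return math.ceil(horizon_years / service_life_years)


if __name__ == "__main__":
    horizon = 10  # hypothetical planning horizon in years
    print(purchases_over(horizon, 5))   # disk-based system, 5-year life: 2
    print(purchases_over(horizon, 10))  # active archive, 10-year life: 1
```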

A final advantage of an active archive is that it can store compliance data securely. Some organizations must adhere to specific compliance regulations; others impose their own retention requirements to meet their internal needs. An active archive can optionally be configured to store data to archive media that cannot be over-written or that is written in a protected way that prevents deletion or virus-based contamination.

The Anatomy of an Active Archive

An active archive should look exactly like a NAS share to the user or application. Subdirectory structures can be created in the same way as on disk storage. Files can be moved to or from the archive using standard network protocols (CIFS/SMB or NFS). Files can be moved manually, using standard “cut and paste”, or automatically, using policy-based software that finds files based on standard file metadata and moves them to the archive. Policy-based software can either move data or “stub and migrate” it (which leaves an automatic re-director in the place of the file). Files stored in the archive can be found using standard operating system searches.
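A policy-based move of this kind can be sketched as below. The names `older_than` and `migrate` are hypothetical, the policy is any test on standard file metadata, and a symbolic link stands in for the “stub”; commercial products typically use their own stub mechanisms, so this is an illustration rather than how any particular product works. It assumes the archive is mounted as an ordinary directory outside the source tree.

```python
import os
import shutil
import time
from typing import Callable, List, Optional

Policy = Callable[[os.stat_result], bool]


def older_than(days: int, now: Optional[float] = None) -> Policy:
    """Build a policy that selects files not modified for `days` days."""
    cutoff = (time.time() if now is None else now) - days * 86400
    return lambda info: info.st_mtime < cutoff


def migrate(source_root: str, archive_root: str, policy: Policy,
            leave_stub: bool = False) -> List[str]:
    """Move files matching `policy` into the archive, keeping relative paths.

    With leave_stub=True a symbolic link is left in place of each moved file.
    Returns the relative paths of the files that were moved.
    """
    moved = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Skip stubs left by an earlier run, then apply the metadata policy.
            if os.path.islink(src) or not policy(os.stat(src)):
                continue
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(archive_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            if leave_stub:
                os.symlink(os.path.abspath(dst), src)
            moved.append(rel)
    return moved
```

A call such as `migrate("/mnt/primary", "/mnt/archive", older_than(365), leave_stub=True)` would move files untouched for a year and leave links behind; both mount paths are placeholders.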

Data stored in an active archive is secured by copying to additional media. One of these copies should be stored offsite and away from the primary facility, providing an ultimate disaster recovery copy. Consequently, data stored in an active archive does not need to be backed up.

The archive technology chosen for storing active archive data can be a single technology or a combination of technologies. Users must choose the technology that best fits their own environment and meets their particular capacity and performance requirements. Smaller-capacity users may choose optical or RDX, while larger-capacity users, such as high-performance computing organizations, will choose tape or object-based disk storage. Users of any size can benefit from cloud storage, especially if they do not have a remote second location themselves.

More information about designing an active archive and the benefits it provides can be found on the website of the Active Archive Alliance (www.activearchive.com), a group of hardware and software companies that believe in creating an archive storage platform that complements primary storage and its associated backup.

About The Author

Dave Thomson has over 20 years’ experience in the archive storage industry and currently serves as Senior Vice President of Sales and Marketing for QStar Technologies, a leading global provider of data management and active archive software solutions. Dave leads a global team of dedicated archive professionals and fosters industry partnerships at all levels. He is a tireless educator on methodologies that encourage the adoption of archiving best practices, such as the 3-2-1 Archiving and Data Protection Best Practice (http://www.qstar.com/company/3-2-1-best-practice/). Dave is a board member of the Active Archive Alliance, of which QStar is a founding partner.

Active Archive Alliance

The Active Archive Alliance is a collaborative industry alliance dedicated to promoting active archives for simplified, online access to all archived data. Launched in early 2010 by founding technology partners Dell, FileTek, QStar Technologies, SGI and Spectra Logic Corporation, the Active Archive Alliance is a vendor-neutral organization open to leading providers of active archive technologies including file systems, active archive applications, cloud storage, and high-density tape and disk storage, as well as individuals and end-users.