One fact that will always remain true is that data keeps growing in both size and variety as time progresses. We live in a world where we need digital information at our fingertips: some of it we need immediately, and some we save for later.
Applying this to a typical IT setup, the data we save for later can be termed archival. Interestingly, Dictionary.com defines an archive as "any extensive record or collection of data". Managing this data can become very complicated when it is not taken seriously.
Here, I'm going to share my experiences with data management that make it important to think about archiving old data.
Large volumes of unstructured data reside on file servers, belonging to several current and past users. Users keep storing data on the file server, sometimes without even considering whether it is really business critical. In most such setups, there is no data-handling process when a user leaves the organization: no one decides which of their data is truly business critical and needs to be retained.
For users running a project in collaboration, data is regularly updated from multiple sources into a common shared folder. Many times I have seen project owners take a personal copy when the project is over, and leave another copy on the file server as well.
Over a period of time and employee turnover, the new management has no way to decide what action can be taken on this data. Compliance does not even allow deleting it for a minimum of 7 to 10 years. The data keeps piling up, forcing you to regularly upgrade the file servers in terms of both performance and capacity.
Large volumes of email in user mailboxes create the same threat for the mailing application as well.
Does it sound familiar?
In organizations running quota-based mailboxes, users keep moving their data to archive mailboxes manually to free up space for more email. An overloaded mailbox can no longer send or receive email, and until the user gets back within his quota, he is stuck moving mail around.
This manual process of moving mails out of the mailbox is cumbersome. It also needs a regular watch on the mailbox size to ensure the move is performed before the size exceeds the limit.
This process has all the ingredients of potential data loss if not handled very carefully.
To store such large volumes of data, you need to regularly upgrade the infrastructure in both capacity and capability, unless of course the original investment was over-sized. Even in that case, technological advances and warranty renewals prompt you to look for newer, equally over-sized storage in every warranty renewal cycle.
Archival helps move data to a secondary storage.
For files, you can set an archival policy based on data ageing, access patterns, and data types. A scheduled archival job moves this data to archival storage, reducing the footprint on the primary storage. It leaves behind a stub on the primary storage that is only a few KB in size. The movement is transparent to the user; the only difference is that when he accesses an old file, it is retrieved from the secondary storage.
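The age-based policy described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function name, the `.stub` convention, and the stub contents are all assumptions made for the example.

```python
# Minimal sketch of an age-based file archival job: any file not modified
# in max_age_days is moved to archive storage and replaced by a small stub
# recording its new location. Real archiving products do this transparently
# at the filesystem layer; everything here is illustrative only.
import os
import shutil
import time

def archive_old_files(primary: str, archive: str, max_age_days: int) -> list[str]:
    """Move aged files from primary to archive, leaving a KB-sized stub."""
    cutoff = time.time() - max_age_days * 86400
    archived = []
    for root, _dirs, files in os.walk(primary):
        for name in files:
            src = os.path.join(root, name)
            if name.endswith(".stub"):           # already archived
                continue
            if os.path.getmtime(src) < cutoff:   # ageing policy
                rel = os.path.relpath(src, primary)
                dst = os.path.join(archive, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                # Leave a tiny stub so the file still appears on primary.
                with open(src + ".stub", "w") as f:
                    f.write(dst)
                archived.append(rel)
    return archived
```

A production tool would additionally intercept reads of the stub and recall the file from secondary storage on demand, which is what makes the movement invisible to the user.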
For mailboxes, email archival can be configured to push emails and attachments to the archival storage, bringing a considerable reduction in mailbox size by deleting them from the mailbox.
Archival jobs don't need to run online, especially for file servers, so an offline job can ensure zero production impact without disturbing the users.
Mailing applications can be configured for online journalling, which copies each incoming/outgoing mail to the archive while delivering it to the user, again with no performance impact. Removing archived mails from the mailbox can also be done offline, synchronizing the mailbox when the user comes online.
Overall it helps reduce the storage footprint and maintain application performance.
Combining the above two and introducing an automated tool configured with the desired policies, you can enhance employee productivity by letting the tool take care of archives, especially when scheduled during off-peak hours. Users are no longer engaged in manual copying, and their systems keep working seamlessly.
Journalling emails and then automatically stubbing old emails out of mailboxes reduces the users' burden of manual archiving and deletion, and reduces the risk of data loss.
The SLA for reading old data does not need to be online-grade, so you can use low-performance storage for archival data. In fact, you can use the low-cost cloud storage tiers offered by providers like AWS, Azure, etc. as the destination for the archived data.
This is often the better idea, as cloud service providers offer better redundancy and multi-region storage options at very low cost. You can use their lifecycle transition mechanisms to move your data from one storage tier to another based on your access patterns.
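As an illustration of such tier transitions, here is what an S3-style lifecycle rule might look like. The bucket prefix, rule ID, and retention figures are placeholder assumptions for the example; on AWS, a configuration of this shape would be applied via the S3 lifecycle API (not called here, to keep the sketch self-contained).

```python
# Sketch of a storage-lifecycle rule for archived data: transition to a
# cheaper tier after 30 days, to cold storage after a year, and (compliance
# permitting) expire after a 10-year retention period. All names and
# numbers are illustrative assumptions, not a recommendation.
lifecycle_rule = {
    "Rules": [
        {
            "ID": "archive-tiering",                # hypothetical rule name
            "Filter": {"Prefix": "archive/"},       # applies to archived data only
            "Status": "Enabled",
            "Transitions": [
                # After 30 days, move to an infrequent-access tier.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # After a year, move to cold/deep-archive storage.
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            # Expire only after the compliance retention period (~10 years).
            "Expiration": {"Days": 3650},
        }
    ]
}
```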
With high levels of redundancy promised, there is no need to take additional backups of the archived volumes. Instead, they can be replicated to another region if additional safety is required.
Especially for file servers, backing up millions of small files is a big challenge, and the need for a weekly or monthly full backup makes it worse.
If you are still using tape technologies, I am sure you find it difficult to cope with these full backups. You could relate to several data centers where a large file server's weekly backup spans a couple of weekdays as well. In a couple of environments we managed, we had to move our policies to a monthly full, eliminating weekly full backups.
Small files take longer to back up and recover, and tapes have no mechanism to eliminate common files. Upgrading tape technology alone does not resolve the problem.
Then there is the cost of infrastructure maintenance and tape management, which includes new tapes every month, plus the additional tasks of storing tapes in a local facility and moving cloned tapes off-site. Tying up the backup infrastructure for file server backups delays all other backup jobs, unless it is over-sized to accommodate everyone.
We saw one data center handle backups differently: file servers are backed up only from Friday evening, with the tape library dedicated to them, and MS Exchange backups happen only on weekdays. Therefore, file servers are left unprotected for five days and MS Exchange for the weekend, in a 24x7 media environment.
With the primary data reduced, there is a considerable improvement in backup windows, as only the current and critical data now needs to be backed up.
Conducting assessments over the last 20 years, we have observed that enterprise file archiving brings the production data size down to 15-20% and shrinks the backup window accordingly. Archiving tools also help find duplicate files; if they are found irrelevant, you can delete them, giving you multiple benefits.
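The duplicate detection mentioned above usually works by content hashing: files with the same digest are the same file, wherever they live. A minimal sketch of the idea, with illustrative names only:

```python
# Minimal sketch of how an archiving tool might detect duplicate files:
# hash each file's contents and group paths sharing a digest. Real tools
# add size pre-filtering and chunked reads; this is a simplified model.
import hashlib
import os
from collections import defaultdict

def find_duplicates(root: str) -> dict[str, list[str]]:
    """Return {digest: [paths]} for digests seen more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)
    # Keep only groups with more than one path: true duplicates.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```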
If the archives are stored on-premises, you can back up from the archive storage at any time of day without worrying about application performance.
In the unfortunate event of data loss, you need to recover a smaller amount of data to get back to production. Users actually need only the online data; the rest is old and rarely accessed, which is why it was archived in the first place. Return to production is much faster when recovering less data instead of everything.
The archived data is generally kept away from the primary data and can remain there, since it follows the archival recovery mechanism (recovered only when accessed) and is not required immediately.
If your mailboxes are backed up to tape, you need to recover an entire mailbox or storage group just to locate a single mail. You would need to maintain a parallel mechanism to handle this: backing up individual emails takes much longer, needs large indexing capabilities, and makes searching for a lost email cumbersome.
An archival tool can quickly search the desired mails in the archive and retrieve them without waiting for entire mailboxes to be recovered from tape.
In the event of data loss from the on-premises archival infrastructure itself, it can be recovered from backups. Since this is not online data, a longer recovery SLA is acceptable, and the recovery times do not impact production performance.
While manually copying emails to archive mailboxes, an email may be accidentally deleted. It could be a critical email required by the user, or by management for litigation purposes.
Sometimes users do not realize the importance of certain emails and may end up deleting them.
Backups run once a day and cannot guarantee capturing all the data: anything created and then deleted within the same day, or between two backup cycles, never gets backed up. This is one of the most critical shortcomings of backup approaches, and one that archive jobs handle better, especially for mailing applications.
For such sensitive information, online archival/journalling is recommended, so that a copy of the email is preserved even before it is delivered to the user's mailbox. Journalling works on the mail server: it copies every incoming and outgoing mail and stores it in a separate journal mailbox. Even if the user deletes the email, a separate copy remains in the journal and can be easily retrieved.
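The journal-before-delivery behaviour described above can be modelled in a few lines. This is a toy model of the idea, not any real mail server's API; the class and method names are invented for illustration.

```python
# Toy model of online journalling: every message is copied to a separate
# journal store BEFORE it reaches the user's mailbox, so a later deletion
# from the mailbox never touches the journal copy.
class MailServer:
    def __init__(self) -> None:
        self.mailboxes: dict[str, list[str]] = {}  # user -> messages
        self.journal: list[tuple[str, str]] = []   # permanent record of all traffic

    def deliver(self, user: str, message: str) -> None:
        self.journal.append((user, message))       # journal first, then deliver
        self.mailboxes.setdefault(user, []).append(message)

    def user_delete(self, user: str, message: str) -> None:
        self.mailboxes[user].remove(message)       # journal copy is untouched
```

Because the journal is written on the delivery path rather than on a nightly schedule, even a message created and deleted within the same day is captured, which is exactly the gap a once-a-day backup leaves open.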
You can enable journalling on the mailboxes of all users and eliminate the need for backups of the mailing application; it is actually safer than backups.
Online archiving/journalling ensures a copy of every email record moving across the mail server. It ensures that you can use your emails from the archive if the primary mailbox is not accessible. Unless otherwise mandated, journalling acts as a better backup than other backup approaches: it runs 24x7 without impacting the user experience and makes a copy while delivering the mail to the user, protecting it from user deletion.
Archiving tools allow recovering the entire mailbox for one or all users into PST files, so if a user loses his entire mailbox, it can be rebuilt from the archives in the desired form.
By archiving old data, you can understand information risk with comprehensive data mapping, manage legal holds, and conduct investigative search across billions of documents in seconds. Archiving captures file and message content, attachment information, and all metadata for faster, more accurate search results. Many archiving products offer advanced e-discovery features, such as searching by full-text keyword and key phrase, word proximity, file size, format, date, sender, recipients, and more.
Archiving tools allow you to search files or emails in the archive. With the wide variety of search parameters available, they help management search related content across the whole organization without depending on a user's availability and willingness to share the information. Especially with journalling enabled, they can search emails that might otherwise have been deleted by the user.
Compliance requirements for most organizations mandate retaining data for long periods, typically 7 to 10 years depending on the industry and data types. Retaining these huge volumes on online storage and production databases has the adverse effects discussed above. Archiving gives an automated way to answer this, and archiving tools these days are certified against various global records-management standards.
Compliance works in conjunction with security. Archiving tools are secure in how they are managed, with role-based access for the team running the archiving infrastructure and complete logging and alerting for all activities performed. They comply with all major global data-management compliance standards, including GDPR.
From an infrastructure point of view, archiving to a cloud service provider helps with compliance to international standards, as the cloud service providers ensure that their infrastructure is secure and meets global compliance standards.