A universal fact that will always remain true is that "data will keep growing in size and type" as time progresses. We live in a world where we need digital information at our fingertips. Some of this information we need immediately, and some we save for later.
Applying this to a typical IT setup, the data we save for later can be termed archival. Interestingly, Dictionary.com defines an archive as "any extensive record or collection of data". This data can become very difficult to manage when not taken seriously.
Here, I'm going to share my experiences with data management that make it important to think about archiving old data.
Large volumes of unstructured data reside on file servers, belonging to several current and past users. Users keep storing data on the file server, sometimes without even considering whether it is really business-critical. Worse, there is often no data-handling process for when a user leaves the organization: no one decides which of that user's data is truly business-critical and needs to be retained.
For users collaborating on a project, data is regularly updated from multiple sources into a common shared folder. Many times I have seen project owners keep a personal copy when the project is over while leaving another copy on the file server as well.
Over time, and with employee turnover, new management has no way to decide what to do with this data. Compliance often does not allow deleting it for a minimum of 7 to 10 years. It keeps piling up on the file server, forcing you to regularly upgrade the file servers in terms of both performance and capacity.
Large volumes of email in user mailboxes pose the same threat to the mail application as well.
Does it sound familiar?
In organizations running quota-based mailboxes, users keep manually moving their data to archival mailboxes to free up space for more emails. An overloaded mailbox stops sending and receiving emails, and until the user gets back within the quota, he is stuck moving email around.
This manual process of moving mail out of the mailbox is cumbersome. It also requires a regular watch on the mailbox size to ensure the move happens before the size exceeds the limit.
This process has all the ingredients to result in potential data loss if not handled very carefully.
To store such large volumes of data, you need to regularly upgrade the infrastructure in both capacity and capability to handle more data, unless of course the original investment was oversized. Even then, technological advancement and warranty renewals prompt you to look for newer, equally oversized storage in every renewal cycle.
Archival of old data moves it to secondary storage.
For files, you can set an archival policy based on data aging, access patterns, and data types. A scheduled archival job run by the archiving tool moves this online data to archival storage, reducing the footprint on primary storage. It leaves a stub on the primary storage that is only a few KB in size. The movement is transparent to the user; the only difference is that when the user accesses an old file, it gets retrieved from the secondary storage.
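As a minimal sketch of how such a policy might be evaluated, the snippet below archives files by age or type and leaves a tiny stub behind. The thresholds, extensions, and stub format are illustrative assumptions, not any vendor's actual mechanism:

```python
import shutil
import time
from pathlib import Path

# Hypothetical policy values -- real archiving tools expose these as configuration.
MAX_AGE_DAYS = 365                              # archive files untouched for a year
ARCHIVE_EXTENSIONS = {".log", ".bak", ".old"}   # illustrative data types

def should_archive(path, now=None):
    """Decide whether a file qualifies for archival under the policy."""
    now = now or time.time()
    age_days = (now - path.stat().st_mtime) / 86400
    return age_days >= MAX_AGE_DAYS or path.suffix.lower() in ARCHIVE_EXTENSIONS

def archive_file(path, archive_root):
    """Move a file to archive storage and leave a small stub behind."""
    archive_root.mkdir(parents=True, exist_ok=True)
    target = archive_root / path.name
    shutil.move(str(path), str(target))
    # The stub is only a few bytes: a pointer to the archived copy.
    path.write_text(f"ARCHIVED -> {target}\n")
    return target
```

Real tools replace the stub transparently on access; here it is just a text pointer to keep the idea visible.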
For mailboxes, email archival can be set up to push emails and attachments to the archival storage, considerably reducing mailbox size by removing them from the mailbox.
Archiving tools allow data archival, especially for file servers, to run as offline jobs, ensuring zero impact on production performance while archiving old data.
Mail archiving can be performed online through journalling, which copies incoming/outgoing mail to the archival infrastructure while delivering it to the user, again with no performance impact. Removing archived items from the mailbox can also be done offline, with the mailbox synchronized when the user comes back online.
Overall it helps reduce the storage footprint and maintain application performance.
By combining the above two and introducing automated archiving tools configured with the desired policies, you can enhance employee productivity by letting the tools take care of the archives, especially when scheduled during off-peak hours. Users are no longer engaged in manual copying, and their systems keep working seamlessly.
Journalling emails and then automatically shortcutting old emails out of mailboxes relieves users of manual archiving and deletion and reduces the risk of data loss.
The SLA to read old data is never instant or on-the-fly, so you can use low-performance storage for archiving it. In fact, you can use low-cost cloud storage from providers such as AWS or Azure as the destination for data archival.
This would actually be a better idea as the cloud service providers provide better redundancy and options for multi-region storage at a very low cost. You can utilize their transition mechanism to move your data from one storage tier to another based on the access patterns you have.
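As an illustration of such a transition mechanism, here is a minimal sketch of an AWS S3 lifecycle configuration that moves archived objects to colder, cheaper tiers as they age. The prefix and day thresholds are illustrative assumptions, not recommendations:

```python
# Sketch of an S3 lifecycle configuration: objects under the "archive/"
# prefix transition to cheaper storage classes as they age.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-tiering",
            "Filter": {"Prefix": "archive/"},
            "Status": "Enabled",
            "Transitions": [
                # After 30 days, move to infrequent-access storage.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # After a year, move to deep archive for long-term retention.
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

# Applied with boto3 (requires AWS credentials; shown for illustration only):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-archive-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Azure offers an equivalent mechanism through blob lifecycle management policies.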
With high levels of redundancy promised, there is little need to take additional backups of the archived volumes. Instead, they can be replicated to another region if additional safety is required. Building a cloud-based archival infrastructure can be more efficient and cost-effective.
For file servers especially, backing up millions of small files is a big challenge, and the need for a weekly/monthly full backup makes it a truly daunting one.
If you are still using tape technologies, I am sure you find it difficult to cope with these full backups. In several data centers, the weekly backup of larger file servers spills over into weekdays as well. In a couple of environments we managed, we had to move our policies to a monthly full, eliminating weekly full backups.
Small files take longer to back up and recover, and tapes have no mechanism to deduplicate common files. Simply upgrading the tape technology does not completely resolve the problem.
Add to this the cost of infrastructure maintenance and tape management, which includes new tapes every month, plus the additional tasks of storing tapes locally and moving cloned tapes off-site. Dedicating the backup infrastructure to file server backups delays all other backup jobs, unless it is oversized to accommodate everything.
We saw one data center handle backups differently: file servers were backed up only from Friday evening, with the tape library dedicated to them, and MS Exchange backups ran only on weekdays. As a result, file servers were left unprotected for 5 days and MS Exchange was unprotected over the weekend, in a 24x7 media environment.
With primary data getting reduced, there is a considerable improvement in backup windows as it now needs to back up only the current & critical data.
In assessments conducted over the last 20 years, we have observed that enterprise file archiving tools bring production data down to 15-20% of its original size by archiving old data, with a corresponding reduction in the backup window. Archiving tools also help find duplicate files; if these turn out to be irrelevant, you can delete them for further gains.
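The duplicate-finding idea can be sketched simply: group files by a hash of their content, and any group with more than one member is a set of duplicates. This is a generic illustration, not any particular product's algorithm:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by content hash; groups of 2+ are duplicates."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Production tools typically hash in chunks and compare file sizes first to avoid reading every file in full.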
If the archives are stored on-premise, you can back up from the archive storage any time of the day without bothering about the application performance.
In an unfortunate incident of data loss, you need to recover a smaller amount of data to get back to production. Users actually need only the online data; the remaining data is old and rarely accessed, which is why it was archived. The return to production is much faster when you recover only current data instead of the entire data set.
The archived data generally sits away from the primary data and can continue to stay there, as it follows the archival recovery mechanism (recovered only when accessed) and is not required immediately.
If your mailboxes are backed up on tape, you need to recover the entire mailbox or storage group just to locate a single mail, or maintain a parallel mechanism to handle this. Backing up individual emails takes much longer, needs large indexing capabilities, and makes for a cumbersome search for the lost email.
Data archival tools let you quickly search for lost emails in the mail archive and retrieve them without waiting for the entire mailbox to be recovered.
In the event of data loss from the on-premise archival infrastructure itself, it can be recovered from backups. Since this is not online data, a longer recovery SLA is acceptable, and the recovery does not impact production performance.
While copying emails to archive mailboxes manually, there could be an accidental deletion of an email. This could be a critical email required by the user or the management for litigation purposes.
Sometimes, users do not realize the importance of some emails and may end up deleting such critical emails.
Backups run once a day and cannot guarantee capturing all the data: anything created and then deleted within the same day, or between two backup cycles, is never backed up. This is one of the most critical shortcomings of backup approaches, which archiving tools can take better care of, especially for mail applications.
For such sensitive information, online data archival/journalling is recommended to ensure a copy of the email is preserved even before it is delivered to the user's mailbox. Journalling works on the mail server and copies every incoming & outgoing mail and stores it in a separate journal mailbox. Even if the user now deletes the email, there is a separate copy available in the journal that can be easily retrieved.
Online data archival/journalling ensures a copy of all email records moving across the mail server. It also means you can work from the archive if the primary mailbox is not accessible. Unless otherwise mandated, journalling acts as a better backup than the other approaches: it runs 24x7 without impacting the user experience, and because the copy is made while the mail is being delivered, it eliminates the risk of loss through user deletion.
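The essential property of journalling described above is that the copy is taken before delivery, so a later deletion by the user cannot touch it. A minimal in-memory sketch (the journal and mailbox stores are illustrative stand-ins, not a mail server API):

```python
import copy

journal = []      # write-once journal store (illustrative in-memory stand-in)
mailboxes = {}    # user mailboxes, keyed by recipient

def deliver(message):
    """Journal a message, then deliver it. The journal copy survives
    even if the user later deletes the mail from the mailbox."""
    journal.append(copy.deepcopy(message))  # copy taken before delivery
    mailboxes.setdefault(message["to"], []).append(message)

def user_deletes(user, subject):
    """Simulate a user deleting a mail from their own mailbox."""
    mailboxes[user] = [m for m in mailboxes[user] if m["subject"] != subject]
```

In a real deployment the journal lives on separate archival storage, and access to it is restricted to the archiving/compliance team.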
Archiving tools allow recovering the entire mailbox of a given user, or of all users, into PST files, so if a user loses his entire mailbox, it can be rebuilt from the archives in the desired form. Users can also search the mail archive themselves to retrieve emails.
By archiving old data, you can understand information risk with comprehensive data mapping, manage legal holds, and conduct investigative searches across billions of documents in seconds. Archiving captures file and message content, attachment information, and all metadata for faster, more accurate search results. Many archiving products offer advanced e-discovery features, such as searching by full-text keywords and key phrases, word proximity, file size, format, date, sender, recipients, and more.
Data archival tools allow you to search files or emails across the archive. With a wide variety of search parameters available, they help management search related content across the organization without depending on a user's availability or willingness to share the information. Especially with journalling enabled, they let you search emails that might otherwise have been deleted by the user.
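The shape of such a search can be sketched as a filter over indexed message metadata and content. The field names and the in-memory archive are illustrative assumptions; real products run these queries against a full-text index:

```python
from datetime import date

def search_archive(archive, sender=None, keyword=None, after=None):
    """Filter archived messages by sender, full-text keyword, and date.
    Each parameter is optional; omitted parameters match everything."""
    results = []
    for msg in archive:
        if sender and msg["sender"] != sender:
            continue
        if keyword and keyword.lower() not in msg["body"].lower():
            continue
        if after and msg["date"] < after:
            continue
        results.append(msg)
    return results
```

E-discovery tools extend the same idea with proximity operators, attachment search, and relevance ranking over very large corpora.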
Compliance requirements for most organizations mandate retaining data for long periods, typically 7 to 10 years depending on the industry and data types. Retaining huge volumes of data on online storage and production databases has the adverse effects discussed above. Archiving gives an automated way to meet these requirements, and data archival tools these days are certified against various global records-management standards.
Compliance works in conjunction with security. Data archival tools are built to be managed securely, with role-based access for the team managing the archiving infrastructure and complete logging and alerting for all activities performed. They support the major global compliance standards related to data management, including GDPR.
From an infrastructure point of view, archiving to a cloud service provider simplifies compliance with international standards, as the cloud service providers keep their infrastructure secure and certified against the major global compliance standards.