Bigger is better. It’s a lesson that’s drilled into us from childhood -- whoever ends up with the most toys wins. Everywhere you look there are people, groups or even entire countries trying to be the biggest at one thing or another. Biggest building, biggest car, biggest ball of twine -- they're all out there. You ate the biggest steak on the menu? You're awesome -- here’s your big tee shirt.
Bigger is Just Bigger
The modern data center is not immune to the Bigger is Better Syndrome (BIBS). Big servers, big networks and, most of all, big data. Sorry, that should be Big Data with a capital “B” and “D” because Big Data is its own Big Thing and clearly of Big Importance. Everything is about Big Data these days -- how to store it, how to make it accessible, how to manage it, how to secure it and make more of it. And don't forget how to get it back in case you lose it.
Companies talk about the petabytes they've accumulated as if it were a badge of honor -- we have more data than them, so clearly we must be doing something better. A true symptom of BIBS is when a firm rolls out some poor IT Director to explain the yearlong process he or she went through to test the latest storage technology and assemble an infrastructure capable of supporting the latest generation of applications that spew data at rates that were unimaginable only a couple of years ago.
Finding the Needle in the Haystack
The problem with all of this is that while the systems can handle the load from a basic storage perspective, finding data within the pile is becoming more difficult every month. Basic business practices such as e-Discovery and compliance, which were hard to begin with, are now becoming next to impossible because of the breadth and depth of data that must be analyzed. The haystack has gotten bigger and the needles that much harder to find.
Let’s not confuse Big Data analytics with just plain having a lot of storage. Breakthroughs on the analytics side have been some of the more interesting developments during the last couple of years. It’s encouraging to see startups like DataGravity taking a fresh look at storage and understanding that there is indeed a difference between data and information and that the latter often gets lost within the former. Information is stored because it has value.
That value is lost when the information in question cannot be found or can only be dug up after spending lots of hours and lots of dollars looking for it. A common complaint in the not-so-distant past was “there’s not enough data to make a decision.” That has quickly changed to “there’s too much data to make a decision.” Analytics has huge potential value. Sitting on huge amounts of data for lack of another option is a huge potential waste.
Understanding the profile of unstructured data within an enterprise is one of those tasks that should be easy -- there are lots of reporting tools and management systems thrown at the pile to help increase transparency, plus a seemingly endless stream of metrics feeding into more dashboards than the space shuttle.
All that said, why can’t a data center manager easily look at a 100TB user share and see, with two mouse clicks, all the data that belongs to former employees? And what about the director who implemented a no-PST policy two years ago but still has 28,000 of them floating around her shop?
Basic, metadata-level information on files is, for some reason, not easily attainable. The majority of current SRM tools are focused on block-level reporting, de-duping and trend reports -- how much data do I have now, how much did I have last month and, based on that info, how much more storage will I have to provision for next month?
These are all valid measurements, but turning the model around and looking at the details of what is being stored may provide better value than how much is being stored and where it's coming from.
Using Context to Clean House
Stemming the flow of new data generated within an organization is not a fight that most IT managers want to take on. A better approach may be to analyze the files that are there now to get details on whose data it is, how old it is and how many copies of it are currently being stored.
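To make that kind of analysis concrete, here is a minimal Python sketch of a share scan that records whose data each file is, how old it is and whether identical copies exist. It is an illustration under stated assumptions, not a description of any particular SRM product: the names `FileRecord`, `scan_share` and `duplicate_groups` are hypothetical, ownership is taken from the numeric owner ID the operating system reports, age is measured from the last-modified time, and "copies" means files with byte-identical contents.

```python
import hashlib
import os
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    owner_uid: int      # numeric owner ID as reported by the OS (assumption)
    size_bytes: int
    age_days: float     # days since last modification
    digest: str         # SHA-256 of the contents, used to spot copies


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan_share(root, now=None):
    """Walk a directory tree and return one FileRecord per readable file."""
    now = time.time() if now is None else now
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                digest = sha256_of(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            age_days = (now - st.st_mtime) / 86400
            records.append(
                FileRecord(path, st.st_uid, st.st_size, age_days, digest)
            )
    return records


def duplicate_groups(records):
    """Group paths by content hash; keep only hashes seen more than once."""
    by_digest = defaultdict(list)
    for rec in records:
        by_digest[rec.digest].append(rec.path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Hashing every file is expensive on a large share, so a real scan would likely compare file sizes first and hash only the candidates that match. The sketch hashes everything to keep the logic short.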
Going to a group of users with a specific list showing that 80% of their files are more than five years old and can be either archived or disposed of has much more impact -- the conversation has moved from the theoretical (“I think you have a bunch of stuff that we don’t need to keep”) to the contextual (“You have 250GB of PST files and you haven’t accessed 50% of the data in more than three years”).
Metrics like this should be easy to obtain. With them, storage administrators can create, execute and enforce a data disposition strategy that both keeps them compliant with any relevant storage policies and potentially frees up large amounts of wasted disk space. Rather than budgeting yet more dollars for physical storage, most organizations will find that they already have a year’s worth of storage in their shop right now -- it’s just being used to house files that have lost their value.
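The arithmetic behind a list like that is simple once the metadata is in hand. The following Python sketch assumes each file has already been reduced to an `(owner, size_bytes, age_days)` tuple; the function name `stale_report` and the five-year default threshold are illustrative choices, not part of any existing tool.

```python
from collections import defaultdict


def stale_report(files, max_age_days=5 * 365):
    """Summarize stale data per owner.

    files: iterable of (owner, size_bytes, age_days) tuples.
    Returns {owner: (percent_of_files_stale, stale_bytes)}.
    """
    # Per owner: [total files, stale files, stale bytes]
    totals = defaultdict(lambda: [0, 0, 0])
    for owner, size_bytes, age_days in files:
        entry = totals[owner]
        entry[0] += 1
        if age_days > max_age_days:
            entry[1] += 1
            entry[2] += size_bytes
    return {
        owner: (100.0 * stale / count, stale_bytes)
        for owner, (count, stale, stale_bytes) in totals.items()
    }
```

Called with, say, two files for one user where only one is older than the threshold, the report shows 50% of that user's files as stale along with the bytes that could be archived or disposed of.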
This somewhat revolutionary thinking seems to be catching on. There is a noticeable trend of IT managers who realize that simply fulfilling storage capacity requests by adding spindles is not a strategy that works. The explosion of digitized content in the form of unstructured data, combined with growing regulatory and legal requirements, has created a nearly impenetrable cache where nothing can be found easily and a once-valuable asset is quickly becoming a liability. And, contrary to popular opinion, storage is not cheap.
Creating a baseline profile of its unstructured data should be a top priority for any IT organization. A simple, metadata-level index of user shares can allow administrators to make more educated decisions about storage disposition and significantly reduce risk, non-compliance and, above all else, storage-related budgets.
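As a rough illustration of how small such an index can be, the Python sketch below loads file metadata into an in-memory SQLite table and answers two of the questions raised earlier: how many PST files exist, and how much space belongs to a given set of owners, such as former employees. The table layout and the function names `build_index`, `pst_count` and `bytes_owned_by` are assumptions made for this example, and the list of former employees would have to come from an HR or directory system that is outside the sketch.

```python
import sqlite3


def build_index(rows):
    """Load (path, owner, size_bytes, age_days) tuples into an in-memory index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files ("
        "path TEXT, owner TEXT, size_bytes INTEGER, age_days REAL)"
    )
    db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    return db


def pst_count(db):
    """Count files whose name ends in .pst, ignoring case."""
    query = "SELECT COUNT(*) FROM files WHERE lower(path) LIKE '%.pst'"
    return db.execute(query).fetchone()[0]


def bytes_owned_by(db, owners):
    """Total bytes held by the given owners (for example, former employees)."""
    owners = list(owners)
    if not owners:
        return 0
    placeholders = ",".join("?" * len(owners))
    query = (
        "SELECT COALESCE(SUM(size_bytes), 0) FROM files "
        f"WHERE owner IN ({placeholders})"
    )
    return db.execute(query, owners).fetchone()[0]
```

An in-memory database keeps the example self-contained. An index meant to be queried repeatedly would more plausibly live in a file or a shared database and be refreshed on a schedule.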
Editor's Note: Want more from David? Read his The Perfect Storm - Dealing with an Email Blizzard in the Face of Risk