One thing that came through loud and clear in the lively debate over my last post was that people are puzzled about how to clean up their shared drive mess without massive amounts of human effort.

We’ve been hearing for years that so-called auto classification software was going to be the silver bullet for our information management woes -- and it has been far from it. But rather than focusing on the complexities of auto classification tools and techniques, let's discuss some of the most basic tools and techniques to begin addressing your shared drive mess. In most cases this will reduce your shared drive content by 20 to 30 percent, but it can get you as much as 70 to 80 percent, depending on your organization’s information management practices.

Machine Assisted Classification

Before we get into the details of cleaning up shared drive content using software tools, let’s clear the air about the “auto” in auto classification. It is a misnomer that implies no human effort -- we just have to point the tool at our content and it will figure out what it is while we kick back and have a bagel and coffee with our feet up.

In reality, these tools require varying levels of human involvement to be successful, because while they can accurately and rapidly tell you all sorts of things about your content, acting on this information requires knowledge of the wider context (business, legal, compliance, etc.) of the content.

In this way, auto classification tools are less like a Roomba (i.e., turn them on and they get to work no matter what space you put them in) and more like a robot that makes car parts (i.e., although they come out of the box with a set of capabilities, they need to be extensively trained in order to apply those capabilities to successfully making a given part).

With the narrower (and more accurate) meaning of “auto” in mind, let’s take a look at one approach to how you can use these tools to begin cleaning up your shared drive mess.

The First 100 Pounds

The approach I’m going to present here is like losing the first 100 pounds for someone grossly overweight: some very basic changes (e.g., not drinking soda, walking 20 minutes a day) can have a dramatic result -- much more so than the kinds of activity required to lose the last five pounds, which can refuse to come off even if you’re training for a marathon.

In terms of shared drive content, “losing the first 100 pounds” can typically be accomplished through use of the most basic of classification tools: file analytics software.

File analytics software looks at the wrapper of the content -- the file system metadata -- to provide insight into that content, e.g., which files are PDFs or how many haven’t been viewed in the last five years. Auto classification software typically does this too, and also cracks open the files to perform a variety of actions on the content itself, from simple full text indexing to more complex techniques like semantic or vector analysis.
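
To make “looking at the wrapper” concrete, here is a minimal sketch of a metadata-only scan in Python. It is not how any particular file analytics product works; the function names (scan_metadata, summarize) and the five-year default are my own assumptions for illustration. The point is that nothing here opens a file: extension, size and timestamps all come from the file system.

```python
import os
import time
from collections import Counter
from pathlib import Path


def scan_metadata(root):
    """Yield one metadata record per file under root, without opening any file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # skip files we cannot stat (permissions, broken links)
            yield {
                "path": str(path),
                "extension": path.suffix.lower(),
                "size_bytes": info.st_size,
                "modified": info.st_mtime,
                "accessed": info.st_atime,
            }


def summarize(records, stale_years=5, now=None):
    """Count files per extension and files not accessed in stale_years."""
    now = time.time() if now is None else now
    cutoff = now - stale_years * 365 * 24 * 3600
    by_extension = Counter()
    stale = 0
    for rec in records:
        by_extension[rec["extension"]] += 1
        if rec["accessed"] < cutoff:
            stale += 1
    return by_extension, stale
```

One caveat if you try something like this yourself: last-access times are only as good as the file system's settings, and some systems are configured to update them lazily or not at all, so treat "not viewed in five years" as a lead to verify rather than a fact.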

For losing the first 100 pounds, however, the added complexity and overhead of auto classification tools are not worth it. It’s far better to use file analytics tools to find content eligible for clean up using the following criteria:

Obvious Junk -- There are 100 or so commonly accepted junk file types that file analytics tools can find. Pick a list and use it to drive the first wave of purging. 

Content Aging -- File analytics tools can also tell you about content aging: not just how old a file is, but when it was last updated and viewed. Depending on your organizational context, you may be able to simply pick a cutoff date and purge, e.g., all content older than X years or that hasn’t been accessed in Y years. Or, if you’re heavily litigated or subject to stringent record keeping requirements, you’ll need to apply these kinds of purges on a department by department (or even workgroup by workgroup) basis.

Duplication -- Duplicate files are easy to find, but it is difficult to know what to do with them. You can’t simply delete the dupes without risking operational impact when users can’t find their documents (after all, they don’t know they’re duplicates… they're just their documents). So in order to reap the benefits of de-duping, you’ll need to invest in a tool that leaves a stub pointing to the original file when deleting dupes (if your file analytics tool doesn’t offer that functionality).
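
The three criteria above can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions, not a replacement for a real tool: the JUNK_EXTENSIONS set is a made-up placeholder rather than an accepted junk list, "older than X years" is approximated with the last-modified time, and all function names are hypothetical. Note also that finding true duplicates means hashing file contents, which goes a step beyond pure file system metadata. The sketch only reports candidates; it deletes nothing and leaves no stubs.

```python
import hashlib
import os
import time
from collections import defaultdict
from pathlib import Path

# Illustrative placeholder only, not an authoritative junk list.
JUNK_EXTENSIONS = {".tmp", ".bak", ".dmp", ".log"}
SECONDS_PER_YEAR = 365 * 24 * 3600


def iter_files(root):
    """Yield every file path under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            yield Path(dirpath) / name


def is_junk(path):
    """Obvious junk: the extension is on the chosen junk list."""
    return path.suffix.lower() in JUNK_EXTENSIONS


def is_aged(path, max_age_years, max_idle_years, now=None):
    """Content aging: last modified over X years ago, or not accessed in Y years."""
    now = time.time() if now is None else now
    info = path.stat()
    too_old = info.st_mtime < now - max_age_years * SECONDS_PER_YEAR
    idle = info.st_atime < now - max_idle_years * SECONDS_PER_YEAR
    return too_old or idle


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file contents, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Duplication: group by size first, then by content hash; return groups of 2+."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

If you need the department by department treatment, the natural adjustment is to run the aging check per folder tree with different X and Y values rather than once across the whole drive.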

I’ve seen very high purge rates (roughly 80 percent) at organizations simply by addressing obvious junk files on shared drives. But a more realistic estimate for the average organization is somewhere closer to 20 to 30 percent, which is still a huge improvement.

Content aging and duplication will offer additional purge opportunities, but they require more finesse because of retention requirements and some technical challenges. However, even if you applied them very narrowly, it would be hard to imagine that you couldn’t get another 10 percent for each of these categories. Combined with the low end of what you can expect for junk files (20 to 30 percent), you’re at 40 to 50 percent of your shared drives -- which is nothing to sneeze at.
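
For anyone who wants to plug in their own numbers, the arithmetic behind that estimate looks like this. It is a back-of-the-envelope sketch that assumes the three categories don't overlap (a file counted as junk isn't counted again as aged or duplicated); the function name and the 10 TB drive are hypothetical.

```python
def combined_purge_range(junk_range, aging_pct, duplicate_pct):
    """Add the aging and duplicate estimates to the low and high junk estimates."""
    low, high = junk_range
    extra = aging_pct + duplicate_pct
    return low + extra, high + extra


# Figures from the text: 20 to 30 percent junk, plus 10 percent each
# for content aging and duplication.
low, high = combined_purge_range((20, 30), 10, 10)

# Hypothetical 10 TB shared drive, to turn percentages into volume.
total_tb = 10
print(f"{low} to {high} percent, or {total_tb * low / 100} to {total_tb * high / 100} TB")
```

In practice the categories do overlap somewhat, so read the result as a ceiling for these inputs rather than a forecast.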

The Final Word

Hopefully this gives you an idea of the kinds of things you can do to address your shared drive problem without having to spend thousands of hours sifting through files or hundreds of thousands of dollars on complex auto classification software. For a reasonable amount of time and money, you can lose the first 100 pounds, which is not to say you should stop there. For many organizations, getting serious about the next 100 pounds makes a great deal of sense. And for a few organizations, losing the last five pounds will as well. But no matter what kind of organization you are, losing the first 100 makes sense, and the method I’ve outlined here is one way to do that.

Title image by Jenn Durfey, used under a Creative Commons Attribution 2.0 Generic License