Servicing the formidable force of US corporate lawyers, American Document Management ("AmDoc"), specialists in the electronic management of business documents, have announced an agreement with Equivio to provide technology for detection of near-duplicate documents.
Now since they said it first, one has to chuckle or perhaps wince a little here. After all, are not most of those US$ 400 per hour cut-and-paste contracts, uh, "near-duplicates"? The answer is more than partially yes. And of course this makes perfect sense.
The new service groups similar documents together into sets of near-duplicates, which according to Equivio typically represent between 20 and 50 percent of the email and files in document repositories. The near-dupe groupings are then presented to users, supporting a more coherent, systematic review process -- reducing the time and cost required to wade through a large document collection.
And, don't think AmDoc rests with a definition of "document" as just typed communications. Not so. There's a wider reach, at least when companies deploy their solutions. AmDoc includes video, voice-mail, web pages, and presentations in its definition list of "documents." Working at a binary level, the Equivio technology is largely agnostic when it comes to file types.
Equivio was founded in 2004, and its document inspection and matching technology was released in January 2005. As one would expect, detecting near-duplicates is a much trickier science than discovering exact matches.
Exact matching can rely on CRC or MD5 signatures to decide quickly whether two files are identical. Assessing approximate similarity, by contrast, tends to be much more computationally intensive and generally more algorithmically challenging. This is where the company has set itself apart from the competition.
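To make the contrast concrete, here is a minimal Python sketch. The exact-match half uses the MD5 and CRC signatures mentioned above. The approximate half is only an illustration: Equivio has not been described here at the algorithm level, so the word-shingle Jaccard measure below is a swapped-in textbook technique, and the function names, the shingle size, and the sample sentences are all assumptions made for the example.

```python
import hashlib
import zlib


def md5_signature(data: bytes) -> str:
    """Exact-match signature: identical bytes give an identical digest."""
    return hashlib.md5(data).hexdigest()


def crc32_signature(data: bytes) -> int:
    """Cheaper exact-match checksum, also sensitive to any byte change."""
    return zlib.crc32(data)


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0).

    Illustrative stand-in only, not Equivio's method.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical contract clauses that differ by a single word.
original = "The supplier shall deliver the goods within thirty days of the order date"
revised = "The supplier shall deliver the goods within sixty days of the order date"

same_hash = md5_signature(original.encode()) == md5_signature(revised.encode())
score = jaccard(original, revised)
print("identical by MD5:", same_hash)
print("shingle similarity:", round(score, 3))
```

A one-word change is enough to make the signatures disagree, so hashing can only say "different". The similarity score still registers that most of the text is shared, but getting it means comparing the content itself, which is where the extra computation goes.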
Near-duplicate documents can differ in a range of ways. Examples include successive versions of a document, emails sent to different destinations, or similar proposals sent to several clients. Additionally, Equivio can detect similar documents with different file types, for example a Microsoft Word source version and a PDF delivery copy.
The company claims that near-duplicates are especially common in email; in business templates such as proposals, customer letters, and contracts; and in forms such as purchase or travel requests.
Once near-duplicate sets are built, the reviewer can open a single document in the set, called the "pivot document", and then, if desired, view difference reports in the context of the grouping.
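The review workflow described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Equivio's implementation: the greedy grouping rule, the choice of the first document seen as the pivot, the 0.8 threshold, and the sample file names are all hypothetical, and similarity is measured with the standard library's difflib rather than anything the vendor has published.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two texts (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def group_near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Greedy grouping: a document joins the first set whose pivot it
    resembles closely enough, otherwise it starts a new set as pivot."""
    groups = []
    for name, text in docs.items():
        for group in groups:
            if similarity(docs[group["pivot"]], text) >= threshold:
                group["members"].append(name)
                break
        else:
            groups.append({"pivot": name, "members": [name]})
    return groups


def difference_report(pivot_text: str, other_text: str) -> list:
    """Words removed from ('- ') or added to ('+ ') the pivot text."""
    diff = difflib.ndiff(pivot_text.split(), other_text.split())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical repository: two near-identical proposals and one unrelated form.
docs = {
    "proposal_client_a.txt": "We propose a twelve month support contract for Acme Corp at the standard rate",
    "proposal_client_b.txt": "We propose a twelve month support contract for Globex Inc at the standard rate",
    "travel_request.txt": "Please approve travel to the Chicago office for the quarterly review",
}

groups = group_near_duplicates(docs)
for group in groups:
    print(group["pivot"], "->", group["members"])

report = difference_report(docs["proposal_client_a.txt"], docs["proposal_client_b.txt"])
print(report)
```

Read the pivot once and then look only at the differences, and the second proposal takes seconds rather than minutes to review. A real system would need a smarter way to pick pivots and to avoid comparing every document against every set.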
It's cut and paste on steroids, and one more round of golf for the guy who can keep the interns busiest. Sounds like a fit.