Ever heard of BeyondRecognition? If not, the time to learn is now. The Chantilly, Va.-based "document textnology" software provider offers document managers an alternative to optical character recognition (OCR), while delivering results with accuracy and speed.
How BeyondRecognition Works
BeyondRecognition (“BR”) may be a young innovation, but it is a viable alternative to OCR. It works with glyphs. A glyph is a letter or character formed by pixels whose color differs enough from the document’s background to be identifiable. BR groups like glyphs into clusters at the character and word level, then converts one glyph per cluster to text as appropriate.
While OCR must decide what each glyph is every time it appears, BeyondRecognition’s single instance technology need only recognize one glyph per cluster to form a catalog of letters or characters. The advantage: because recognition runs on a much smaller data set, processing is faster and accuracy is better, so the return on investment is much higher and the Records Management program’s work breakdown structure is significantly shorter.
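To make the single instance idea concrete, here is a minimal sketch in Python. It is an illustration only, not BeyondRecognition's implementation: glyphs are modelled as tiny bitmaps, "like" glyphs are simplified to exactly identical ones, and the recognizer is a hypothetical lookup table. The point is simply that the recognizer runs once per cluster rather than once per glyph.

```python
from collections import defaultdict


def cluster_glyphs(glyphs):
    """Group the positions of identical glyph bitmaps together."""
    clusters = defaultdict(list)
    for position, glyph in enumerate(glyphs):
        clusters[glyph].append(position)
    return clusters


def recognize_once_per_cluster(glyphs, recognize):
    """Call `recognize` once per cluster and reuse the answer for every member."""
    text = [None] * len(glyphs)
    calls = 0
    for glyph, positions in cluster_glyphs(glyphs).items():
        character = recognize(glyph)  # the only recognition step for this cluster
        calls += 1
        for position in positions:
            text[position] = character
    return "".join(text), calls


# Hypothetical 3x3 bitmaps ('#' = ink) and a lookup table standing in
# for a real recognizer. Both are assumptions made for this example.
RING = ("###", "# #", "###")
CORNER = ("#  ", "#  ", "###")
LOOKUP = {RING: "o", CORNER: "l"}

page = [CORNER, RING, CORNER, CORNER, RING]
decoded, recognizer_calls = recognize_once_per_cluster(page, LOOKUP.get)
```

Five glyphs on the page fall into two clusters, so the recognizer is consulted twice instead of five times. On a real collection, where the same characters recur millions of times, the gap would be far larger.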
Because BeyondRecognition software depends on glyphs rather than text, it is more versatile:
- BR is language agnostic. It currently recognizes over forty languages.
- BR is symbology agnostic. It can recognize and relate non-text elements.
- BR clusters visual similarities. It works on all kinds of documents.
- BR is over ninety-nine percent accurate.
- BR scales. It can analyze millions of pages per day per BeyondRecognition server.
BeyondRecognition’s zonal attribute extraction permits subject matter experts to extract attributes from document classifications by clicking and dragging zones on one document per document type cluster.
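Zonal extraction can be pictured as follows. This is a rough sketch under stated assumptions, not the product's API: the `Word` and `Zone` types, the coordinates, and the sample invoices are all hypothetical. Zones are drawn once on a single sample document, then reused for every document in the same cluster because those documents share a layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word):
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_attributes(words, zones):
    """Return {zone name: text of the words inside that zone, in reading order}."""
    result = {}
    for zone in zones:
        inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
        result[zone.name] = " ".join(w.text for w in inside)
    return result


# Zones drawn once, on one sample invoice (hypothetical coordinates).
invoice_zones = [
    Zone("invoice_number", 400, 40, 600, 60),
    Zone("total", 400, 700, 600, 720),
]

# Documents in the same cluster share the layout, so the same zones apply.
cluster = {
    "doc-001": [Word("INV-1001", 420, 50), Word("$250.00", 430, 710), Word("Acme", 50, 50)],
    "doc-002": [Word("INV-1002", 420, 50), Word("$75.10", 430, 710), Word("Acme", 50, 50)],
}
extracted = {doc_id: extract_attributes(words, invoice_zones) for doc_id, words in cluster.items()}
```

The subject matter expert does the clicking and dragging once. The loop at the end does the rest.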
As a rough rule of thumb, the number of document clusters is about one percent of the number of documents. Examining one percent of the documents provides an informed way to make retention decisions about a population -- even documents with no searchable text. Reviewers can tag a cluster to be retained or disposed of, and the ones to be retained can have a document type name label applied to them using a document classification tree that the user controls.
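In code, tagging at the cluster level amounts to a lookup. The sketch below uses made-up document and cluster identifiers, and the function name is an assumption for illustration. A reviewer records one decision per cluster, and every member document inherits it.

```python
def apply_cluster_decisions(cluster_of, decisions):
    """Map each document to the (action, label) decided for its cluster.

    Documents whose cluster has not been reviewed yet are mapped to None.
    """
    return {doc_id: decisions.get(cluster_id) for doc_id, cluster_id in cluster_of.items()}


# Hypothetical cluster assignments for five documents.
cluster_of = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c1", "d5": "c3"}

# A reviewer examines one representative per cluster and records a decision.
# Retained clusters also get a document type label; disposed ones do not.
decisions = {
    "c1": ("retain", "Invoice"),
    "c2": ("dispose", None),
}

outcome = apply_cluster_decisions(cluster_of, decisions)
```

Here two review decisions settle four of the five documents, and the fifth stays flagged as unreviewed until someone looks at cluster `c3`.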
How Image Engine Puts BeyondRecognition to Work
Enter Image Engine. Its engineers use custom tools for business users who respond to requests for document production, reductions in content collections, and scanning large collections of hard copy. BR sorts the documents into their current clusters, then the business users assign document types, use zonal attribute extraction to extract metadata (or rubber-band the classification for metadata entry), and arrange the content into exportable document-type taxonomies.
Think of the possibilities for deduplication. Your organization may have petabytes of information on SharePoint. BeyondRecognition can deduplicate content on two levels: the bit level and the visual duplicate level. For example, the same content may exist both as a Word file and as a PDF made from it. Visual similarity classification can locate the pattern and can red-list or green-list the content based on the decisions made by the client for that classification or cluster. BR records the decisions made by reviewers (record versus non-record, the document type label assigned, and zonal attribute extraction indicators) and applies these decisions to other members of the same visual document classification.
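The two levels can be sketched like this. Bit-level duplicates are found with a standard SHA-256 hash of the file bytes. The visual level is only stubbed: each document carries a `visual_signature` string that is assumed to come from some visual clustering step, and how BR actually computes visual similarity is not described here.

```python
import hashlib


def dedupe(documents):
    """Split documents into unique items, bit-level duplicates and visual duplicates.

    `documents` maps a name to (raw_bytes, visual_signature). Each duplicate
    is mapped to the name of the earlier document it repeats.
    """
    seen_hashes = {}
    seen_visuals = {}
    unique, bit_dupes, visual_dupes = [], {}, {}
    for name, (raw, visual) in documents.items():
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen_hashes:
            bit_dupes[name] = seen_hashes[digest]      # byte-for-byte copy
        elif visual in seen_visuals:
            visual_dupes[name] = seen_visuals[visual]  # looks the same, different bytes
        else:
            unique.append(name)
        # Remember the first document seen with each hash and each signature.
        seen_hashes.setdefault(digest, name)
        seen_visuals.setdefault(visual, name)
    return unique, bit_dupes, visual_dupes


# Hypothetical collection: a Word file, an exact copy, a PDF of the same
# content, and an unrelated memo.
documents = {
    "report.docx": (b"word bytes", "sig-report"),
    "report-copy.docx": (b"word bytes", "sig-report"),
    "report.pdf": (b"pdf bytes", "sig-report"),
    "memo.docx": (b"memo bytes", "sig-memo"),
}
unique, bit_dupes, visual_dupes = dedupe(documents)
```

A hash alone would miss `report.pdf`, because its bytes differ from the Word file. The visual pass is what catches it.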
Think big: what if the typical bit-level deduplication is forty percent? What if the visual deduplication is fifty to sixty percent? Holistic deduplication -- in other words, across storage platforms and scalable to millions of pages -- could be accomplished in weeks.
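For a feel of what those what-if numbers imply, here is a back-of-the-envelope calculation. It rests on one assumption the figures above do not spell out: that the visual rate applies to whatever survives the bit-level pass. Under a different reading the totals would change.

```python
from fractions import Fraction


def remaining_after_dedup(bit_rate, visual_rate):
    """Fraction of the original volume left after both passes.

    Assumes the visual rate applies to what survives the bit-level pass.
    """
    return (1 - bit_rate) * (1 - visual_rate)


# The hypothetical rates from the text: 40 percent bit-level,
# then 50 to 60 percent visual.
low = remaining_after_dedup(Fraction(40, 100), Fraction(50, 100))
high = remaining_after_dedup(Fraction(40, 100), Fraction(60, 100))

print(f"Remaining volume: {float(high):.0%} to {float(low):.0%} of the original")
```

On that reading, somewhere between 24 and 30 percent of the original volume would be left to manage.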
Steve Elston, president of Image Engine, likes to say, “BeyondRecognition puts pressure on data -- the data is not allowed to put pressure on the business.” That’s a good quote any records and information manager may borrow.