Document classification is a key part of any data capture strategy. However, it can also be used in advance of rolling out your entire capture strategy. Here are a few thoughts on the importance of document classification.

Most advanced technologies have components that could stand alone for other purposes. Data mining has basic search. Content management has basic tagging. Data capture is no different. While most people consider “data capture” a single thing, a trend is emerging, as the market demands more education and explanation, to look at the sub-components of data capture. This trend allows organizations to deploy only those pieces that make the most sense and have a clear path to success. Once they have achieved that success, they can move on to the entirety of data capture.

One component of data capture that has been overlooked and extremely underestimated is document classification.

What Class Is Your Document In?
Before data capture technologies can do the magic of field location and extraction using optical character recognition (OCR) or intelligent character recognition (ICR), they must first decide the page type (sometimes, the type of an entire document). Types might be obvious to the world, or only specific to an organization. Types can be determined by layout (lines, barcodes, graphics), or by context (words, codes, dictionaries). All data capture solutions have this built in as a part of the template matching or document identification process. When companies deployed data capture packages, classification was geared towards feeding the data capture process, not necessarily to stand alone as a function. Interestingly enough, however, many organizations have bought data capture applications just for the purpose of classification. They have done so with a success rate that seems to dwarf the overall data capture process. Let’s look at why.
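
To make the idea of determining a type "by context" concrete, here is a minimal sketch in Python. It assumes the page text has already been produced by OCR, and it scores each candidate type by counting keyword hits. The keyword lists, the type names, and the function name `classify_page` are all hypothetical and chosen for illustration; this is not how any particular data capture product works.

```python
# Hypothetical keyword lists; a real deployment would derive these
# from a study of its own documents.
KEYWORDS = {
    "purchase_order": ["purchase order", "po number", "ship to"],
    "vendor_invoice": ["invoice", "amount due", "remit to"],
}


def classify_page(text, keywords=KEYWORDS):
    """Return the page type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    # Score each type by the total number of keyword occurrences on the page.
    scores = {
        page_type: sum(lowered.count(word) for word in words)
        for page_type, words in keywords.items()
    }
    # On a tie, max() keeps the first type listed in the dictionary.
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score > 0 else "unknown"


invoice_text = "INVOICE 1001\nAmount due: $50\nRemit to: Acme"
print(classify_page(invoice_text))
```

A page with no keyword hits comes back as "unknown", which is the kind of exception a person would then review. Layout-based classification (lines, barcodes, graphics) would need image analysis and is not covered by this sketch.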

One major challenge with data capture is the human labor associated with putting documents into groups. With documents automatically classed, this expense and time suck goes away. Because of this, I think using document classification on its own is only going to become more popular as companies see that they can first tackle that one major problem. Once successful, a company can then embark on the next, laborious steps towards data capture, but with a better chance of success. This approach also allows a company to better frame the process step-by-step for the technology vendors: tightly nailing down a well-defined problem and then moving outward from there with the technology. Vendors are often inclined to be helpful because they want the license value (for their bottom line) of the company’s entire data capture process.

Politics? What Politics!?
Classification can be a dream or a true nightmare to set up. It all depends on the documents. (I’m using the term “document” to mean a record that could be a single page or multiple pages, with each page somehow relating to all the others.) If you are a little confused, you should be. Understanding your documents is the greatest stumbling block to classification. Sometimes, documents are very clear. Take accounts payable processing as an example. A document could be a purchase order that connects to a received invoice: this is the entire document. Within this document are two types, purchase order and vendor invoice. That was not so bad. Now what happens if you scan in duplex and the back of the invoice has payment instructions or disclaimers? What do you do with this page? That’s still probably not too complicated, as you may just decide to omit the page from the document if it does not have pertinent payable data. The point is that even this small example shows how quickly the definition of a document gets complicated for an organization.

The best approach is to start with a study of what your objective types (page-level understanding) are. This could go as deep as disclaimers, waivers, and descriptor pages. Once this is done, determine the rules that combine the pages together. In most environments the rules are flexible. For example, an invoice from a vendor can be 1 to 10 pages: the first page will have a header, the last page will have a total, and everything in between is a detail page. Doing this lets you use all the cool tools automated document classification has to offer. The only problem with this approach is the possibility of never-ending objective page-level types.
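
The invoice rule above can be sketched as a small grouping routine. This is only an illustration under stated assumptions: each page has already been classified into two hypothetical flags, `has_header` and `has_total`, and the function name `group_invoice_pages` and the `max_pages` parameter are made up for this example. A one-page invoice simply carries both flags.

```python
def group_invoice_pages(pages, max_pages=10):
    """Group consecutive pages into invoices.

    pages: list of dicts with boolean keys 'has_header' and 'has_total'.
    Returns (documents, unassigned); documents is a list of lists of
    page indexes, unassigned holds pages that fit no invoice.
    """
    documents, unassigned, current = [], [], []
    for index, page in enumerate(pages):
        if page["has_header"]:
            # A new header while an invoice is still open means the open
            # invoice never reached a total page; set those pages aside.
            unassigned.extend(current)
            current = [index]
        elif current:
            # Detail or total page belonging to the open invoice.
            current.append(index)
        else:
            # Page with no open invoice to attach to.
            unassigned.append(index)
            continue
        if page["has_total"]:
            documents.append(current)
            current = []
        elif len(current) >= max_pages:
            # Too long without a total: treat as an exception.
            unassigned.extend(current)
            current = []
    # Anything still open at the end never reached a total page.
    unassigned.extend(current)
    return documents, unassigned


header = {"has_header": True, "has_total": False}
detail = {"has_header": False, "has_total": False}
total = {"has_header": False, "has_total": True}
one_page = {"has_header": True, "has_total": True}

documents, unassigned = group_invoice_pages([header, detail, total, one_page, detail])
print(documents, unassigned)
```

Pages that land in the unassigned list are the exceptions a person would need to look at, which is where the never-ending list of page-level types tends to show up.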

Why Is Class Important?
What is so cool about classification on its own is that the quality of the automatic result is much easier to control, because it’s much easier to tell whether a given classification is right or wrong. Once an organization has a clear understanding of its documents, and of their complexity relative to automated classification, it can determine an actual ROI (or at least get close). Also, because it’s just one component of the whole data capture process, classification allows the organization to deploy exception handling faster and perform initial setup faster with less expertise. Document classification, whether acknowledged or not, is a mandatory step in any data capture process and cannot be avoided. Why not excel at it?

As I mentioned before, the trend of tackling data capture piece by piece rather than as a whole is becoming increasingly popular as market education on this type of technology increases. Companies are seeking a path to success in document automation. The step-by-step path is much less overwhelming than taking on an entire data capture process. When an organization decides to do this and to truly understand its documents, it is taking the accuracy of an automated system into its own hands and really giving technology the best chance to work for it.

Chris Riley is the founder of http://www.livinganalytics.com, where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools. Chris recently was the featured speaker for our webinar on March 5: Tips and Tricks to Help You Automate your Office Documents (for Effective Data Capture).

