Software (include version): All Versions
Use Case: Running samples through the Classifier Tool so it learns invoice and attachment splitting
Prerequisites: Classifier Tool
Creating a Top-Notch Classification Model
To get the best results when creating a classification model, apply the following best practices.
1. Number of Sample Documents – The first question most customers ask is how many samples are needed to train the classification model. The most complete answer is that the number of samples for each document type must reflect the variation in the documents received in production:
· Documents with little variation, such as structured forms, can often be trained with 10-15 samples.
· Documents with a lot of variation, such as invoices and purchase orders, where the physical layout and content vary by supplier or customer, require a much larger sample set. The ideal approach in this case is to gather a month's worth of documents so that the entire population is represented.
2. Properly Separating Documents – When using documents as training samples, make sure each PDF or multi-page TIFF file is its own document. It should not contain multiple documents.
3. Make sure each document starts with the appropriate first page. The classification engine determines document breaks by focusing on what distinguishes the first page of a document.
4. Do not include blank pages in the PDF or TIFF documents.
5. Make sure all pages are rotated correctly. The classification engine does not auto-orient pages the way ancoraDocs does at run time.
6. The training set must consist entirely of PDF files or entirely of TIFF files. It cannot contain both.
7. Above all, do not train the same document under two different document types.
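
Some of the practices above can be checked before the samples are loaded into the Classifier Tool. The following is a minimal Python sketch, not part of the Classifier Tool or ancoraDocs. It assumes a hypothetical folder layout with one subfolder per document type, and the function name check_training_set and the min_samples threshold of 10 (taken from the low end of the 10-15 guideline for low-variation documents) are illustrative choices. It flags document types with too few samples, a training set that mixes PDF and TIFF files, and byte-identical files filed under more than one document type. It does not detect blank pages, rotation, wrong first pages, or files that contain several documents; those still need a manual review.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PDF_EXTENSIONS = {".pdf"}
TIFF_EXTENSIONS = {".tif", ".tiff"}


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def check_training_set(root, min_samples=10):
    """Return a list of problem descriptions for a sample folder.

    Assumed (hypothetical) layout: root/<document type>/<sample files>.
    """
    problems = []
    formats_seen = set()
    types_by_digest = defaultdict(set)

    type_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for type_dir in type_dirs:
        samples = [
            f for f in sorted(type_dir.iterdir())
            if f.is_file()
            and f.suffix.lower() in PDF_EXTENSIONS | TIFF_EXTENSIONS
        ]
        # Practice 1: enough samples for each document type.
        if len(samples) < min_samples:
            problems.append(
                f"{type_dir.name}: only {len(samples)} sample(s), "
                f"expected at least {min_samples}"
            )
        for sample in samples:
            is_pdf = sample.suffix.lower() in PDF_EXTENSIONS
            formats_seen.add("pdf" if is_pdf else "tiff")
            types_by_digest[file_digest(sample)].add(type_dir.name)

    # Practice 6: the set must be all PDF or all TIFF.
    if len(formats_seen) > 1:
        problems.append("training set mixes PDF and TIFF files")

    # Practice 7: the same file must not sit under two document types.
    # Only byte-identical copies are caught; rescans of the same paper
    # document produce different bytes and are not detected.
    for digest, doc_types in sorted(types_by_digest.items()):
        if len(doc_types) > 1:
            names = ", ".join(sorted(doc_types))
            problems.append(
                f"identical file found under more than one document type: {names}"
            )

    return problems
```

Calling check_training_set("samples") on a folder laid out this way returns an empty list when none of the three checks fails. For high-variation types such as invoices, a much larger min_samples value is appropriate, in line with practice 1.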