“My life is very monotonous,” the fox said. “I hunt chickens; men hunt me. All the chickens are just alike, and all the men are just alike. And, in consequence, I am a little bored.”
– Antoine de Saint-Exupéry
My recent visit to a KPO reminded me of the famous French quote above. There was palpable boredom and monotony among the educated and well-paid workforce there. I could not help feeling sorry for them as I imagined the amount of human effort being spent on manually classifying documents day in and day out. Being a software geek who had solved several such problems before, I knew there was a better way for those folks to get the job done: by automating the process using Machine Learning.
Document Classification automation aims to ease the life of a domain expert by eliminating this painstaking, repetitive and time-consuming process.
What do Classifiers do?
Classifiers make ‘predictions’ on the basis of past experience. When a classifier is fed a new document, it predicts which class or category the document belongs to and assigns a “label” to the document.
Source Data for Building the Classification Process
The Source dataset is a collection of documents that have been classified in the past. The Source dataset must be split into two parts – the Training and Testing datasets.
i) Training dataset – This is required for building the classification model. It needs to be large enough to have an adequate number of documents in each class. The Training dataset needs to be of good quality, with a clear demarcation of the differences between documents belonging to different categories.
ii) Testing dataset – This is used for evaluating the effectiveness of the classification model.
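The split described above can be sketched with scikit-learn's train_test_split. The documents, labels and split ratio below are toy assumptions for illustration, not a prescription:

```python
# A minimal sketch of splitting a labelled document corpus into
# Training and Testing datasets. Documents and labels are toy data.
from sklearn.model_selection import train_test_split

documents = [
    "invoice total amount due payment",
    "contract agreement party terms",
    "invoice payment received thank you",
    "contract signed by both parties",
] * 5  # repeated to get a reasonably sized toy corpus (20 docs)
labels = ["invoice", "contract", "invoice", "contract"] * 5

# Hold out 25% of the corpus for testing; stratify keeps the
# class proportions the same in both splits.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(train_docs), len(test_docs))  # 15 5
```

Stratifying the split matters when classes are imbalanced; otherwise a rare category may end up missing from the test set entirely.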
How the Classifier is built
i) Pre-processing of dataset –
Pre-processing the data is necessary since the source data may contain unnecessary information, such as noise and unreliable data. The objective is to structure the data to facilitate the Classification process. Data pre-processing includes Data Cleansing, Data Normalization, Feature Extraction and Feature Selection.
We need to remember that Data Preparation is a complex subject that can involve many iterations of exploration and analysis. Readying the data in the pre-processing steps is essential to get good results from the classifier; these steps play a vital role in improving its accuracy.
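A minimal sketch of these pre-processing steps, assuming a simple regex-based cleansing function and TF-IDF features via scikit-learn (the cleaning rules and feature limit below are illustrative choices, not the only reasonable ones):

```python
# Data Cleansing + Normalization + Feature Extraction, sketched.
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def clean(text: str) -> str:
    """Normalize raw text: lowercase, drop non-letter characters,
    collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

raw_docs = ["Invoice #123: TOTAL due $500!!", "Contract (v2) -- signed."]
cleaned = [clean(d) for d in raw_docs]
print(cleaned[0])  # "invoice total due"

# Feature Extraction/Selection: TF-IDF vectors with English stop
# words removed, capped at 1000 features.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # one row of features per document
```

In a real pipeline the cleansing step often also handles tokenization, stemming or lemmatization, and domain-specific noise such as boilerplate headers.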
ii) Classification Algorithm –
Documents are classified by comparing the terms in a document's vector against each class to see which class the document most closely resembles. The classifier places the document into one of the categories and assigns the corresponding label to it.
In my experience, classification algorithms such as Support Vector Machines, Naive Bayes and Rocchio are best suited for Document Classification.
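Two of these algorithms can be trained in a few lines with scikit-learn. This is a sketch on a toy corpus, not a production setup; the documents and labels are invented placeholders:

```python
# Train a Naive Bayes and a linear SVM classifier on TF-IDF features,
# then predict the category of an unseen document with each.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "invoice payment amount due",
    "invoice total paid receipt",
    "contract agreement terms parties",
    "contract signed legal clause",
]
train_labels = ["invoice", "invoice", "contract", "contract"]

preds = {}
for model in (MultinomialNB(), LinearSVC()):
    clf = make_pipeline(TfidfVectorizer(), model)
    clf.fit(train_docs, train_labels)
    # Assign a label to a new, unseen document.
    preds[type(model).__name__] = clf.predict(["payment receipt for invoice"])[0]

print(preds)  # both should label the document "invoice"
```

Rocchio classification is available in scikit-learn as NearestCentroid and drops into the same pipeline shape.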
The Accuracy Measure
Once the Classification model is built, it needs to be evaluated by feeding it the testing dataset. If the accuracy of the current Classification model is not up to expectations, then you must take a few steps to improve it. I took the following measures to improve accuracy:
a) Revisit pre-processing of the dataset and filter out unwanted data
b) Improve the quality of the Training corpus
c) Try other Classification Algorithms, or try an Ensemble approach
Document Classification is a supervised method that involves creating a model from a pre-processed set of data. The classifier is trained on this Training dataset and is then used to predict the category of any given document. The quality of the training dataset affects the quality of prediction, so maintaining adequate variation within each category of documents is essential to keeping the training dataset's quality high.
If you wish to have a look at our Document Classification product, please feel free to contact us at firstname.lastname@example.org.
I hope this blog was helpful. My next blog will deal with real-life Use Cases of Document Classification that we implemented. Stay tuned, and Happy Machine Learning!