java - Naive Bayes Text Classifier - determining when a document should be labelled 'unclassified'
I have designed and implemented a Naive Bayes text classifier (in Java). I am using it to classify tweets into 20 classes. To determine the probability that a document belongs to a class, I use:
foreach(class) { probability = (p(bag of words occurring for class) * p(class)) / p(bag of words occurring globally) }
What is the best way to determine that a bag of words shouldn't belong to any class? I'm aware I could set a minimum threshold on p(bag of words occurring for class) and, if every class falls under that threshold, label the document as unclassified. However, I'm realising that this prevents the classifier from being sensitive.
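A minimal sketch of the calculation above with a rejection threshold, assuming the per-class values log p(bag of words | class) + log p(class) have already been computed elsewhere. The class name `RejectingClassifier`, the method names and the `minPosterior` parameter are hypothetical. The global term is obtained here by summing the numerators over all classes (law of total probability), and everything is done in log space because products of many small word probabilities underflow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RejectingClassifier {
    static final String UNCLASSIFIED = "unclassified";

    // logJoint maps each class to log p(words | class) + log p(class).
    // Returns p(class | words) for every class, normalised to sum to 1.
    static Map<String, Double> posteriors(Map<String, Double> logJoint) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logJoint.values()) {
            max = Math.max(max, v);
        }
        // log-sum-exp: log p(words) = log sum_c exp(logJoint[c])
        double sum = 0.0;
        for (double v : logJoint.values()) {
            sum += Math.exp(v - max);
        }
        double logEvidence = max + Math.log(sum);

        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logJoint.entrySet()) {
            out.put(e.getKey(), Math.exp(e.getValue() - logEvidence));
        }
        return out;
    }

    // Returns the most probable class, or UNCLASSIFIED when no class
    // reaches minPosterior (an assumed, hand-picked cut-off).
    static String classify(Map<String, Double> logJoint, double minPosterior) {
        String best = UNCLASSIFIED;
        double bestP = minPosterior;
        for (Map.Entry<String, Double> e : posteriors(logJoint).entrySet()) {
            if (e.getValue() >= bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return best;
    }
}
```

Calling `RejectingClassifier.classify(logJoint, 0.5)` would label a tweet unclassified whenever no single class holds at least half of the posterior mass. A single global cut-off like this is exactly the part that costs sensitivity.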
Would another option be to create an 'unclassified' class and train it with documents I deem unclassifiable?
Thanks,
Mark
--edit---
I had a thought: I could set a maximum threshold on p(bag of words occurring globally) * (number of words in document). This would mean that documents consisting only of common words (typically the tweets I want to filter out), e.g. "yes I agree with you", would be filtered out. Thoughts on this would be appreciated also.
Or perhaps I should find the standard deviation and, if it is low, decide that the document should be unclassified?
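The common-word filter from the edit could be sketched roughly as follows. This is an assumption about the intent rather than the formula as written: instead of multiplying the global probability by the word count, it uses the mean log-probability per word under a global unigram model, which is one way to make the score independent of document length. `CommonWordFilter`, `unseenProb` and `maxMeanLogProb` are hypothetical names.

```java
import java.util.List;
import java.util.Map;

final class CommonWordFilter {

    // Mean log-probability per word under a global unigram model.
    // Words missing from globalProb fall back to unseenProb (assumed > 0).
    static double meanLogProb(List<String> words,
                              Map<String, Double> globalProb,
                              double unseenProb) {
        if (words.isEmpty()) {
            return Double.NEGATIVE_INFINITY; // empty documents are never "too common"
        }
        double sum = 0.0;
        for (String w : words) {
            sum += Math.log(globalProb.getOrDefault(w, unseenProb));
        }
        return sum / words.size();
    }

    // True when the document is made up of words that are, on average,
    // more frequent globally than the chosen cut-off allows.
    static boolean isTooCommon(List<String> words,
                               Map<String, Double> globalProb,
                               double unseenProb,
                               double maxMeanLogProb) {
        return meanLogProb(words, globalProb, unseenProb) > maxMeanLogProb;
    }
}
```

The cut-off `maxMeanLogProb` would still have to be chosen from data, for example by looking at the scores of tweets already judged unclassifiable.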
I see two different options, both of which treat the problem as a set of 20 binary classification problems.
- You can compute the likelihood ratio p(doc being in class) / p(doc not being in class). Some Naive Bayes implementations use this kind of method.
- Assuming you have an evaluation measure, you can compute a threshold per class and optimise it with a cross-validation process. This is the standard way of applying text classification: you use thresholds (one per class) that are derived from the data. In your case, SCut or SCutFBR would be the best option, as explained in this paper.
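The two options above can be combined in a rough sketch: score each document per class with a log-odds value, then tune one threshold per class on held-out data. This assumes F1 as the evaluation measure and only covers the basic per-class score threshold idea, not the SCutFBR variants. `PerClassThreshold` and its method names are hypothetical.

```java
final class PerClassThreshold {

    // Log-odds of "in class" versus "not in class" for a posterior in (0, 1).
    static double logOdds(double posterior) {
        return Math.log(posterior) - Math.log1p(-posterior);
    }

    // F1 for one class when every score >= threshold is predicted "in class".
    static double f1(double[] scores, boolean[] inClass, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predicted = scores[i] >= threshold;
            if (predicted && inClass[i]) {
                tp++;
            } else if (predicted) {
                fp++;
            } else if (inClass[i]) {
                fn++;
            }
        }
        return tp == 0 ? 0.0 : 2.0 * tp / (2.0 * tp + fp + fn);
    }

    // Tries every observed validation score as a candidate threshold and
    // keeps the first one with the highest F1. Returns +infinity (class is
    // never assigned) when no candidate yields a true positive.
    static double tune(double[] scores, boolean[] inClass) {
        double bestThreshold = Double.POSITIVE_INFINITY;
        double bestF1 = 0.0;
        for (double candidate : scores) {
            double f = f1(scores, inClass, candidate);
            if (f > bestF1) {
                bestF1 = f;
                bestThreshold = candidate;
            }
        }
        return bestThreshold;
    }
}
```

With one tuned threshold per class, a document would be labelled unclassified when its score falls below the threshold of every one of the 20 classes. In a cross-validation setup, `tune` would be run on the validation fold of each split and the resulting thresholds averaged or re-fitted on all the data.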
Regards,