Java - Naive Bayes Text Classifier - determining when a document should be labelled 'unclassified'


I have designed and implemented a Naive Bayes text classifier (in Java). I am using it to classify tweets into 20 classes. To determine the probability that a document belongs to a class, I use:

foreach (class) {
    probability = (P(bag of words occurring in class) * P(class)) / P(bag of words occurring globally)
}
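For reference, the formula above could be sketched in Java roughly as follows. This is a minimal sketch, not the original implementation: the class name `PosteriorSketch`, the method `posteriors`, and the input array `logJoint` are hypothetical. It assumes the per-class values log P(words | class) + log P(class) are already computed, and it takes P(bag of words occurring globally) to be the sum of the per-class joint probabilities, which may differ from how the original code estimates it. Working in log space avoids underflow when many small word probabilities are multiplied.

```java
import java.util.Arrays;

class PosteriorSketch {

    /**
     * Turns per-class log joint scores into posterior probabilities.
     * logJoint[c] is assumed to hold log P(words | class c) + log P(class c).
     */
    static double[] posteriors(double[] logJoint) {
        // Subtract the maximum before exponentiating (log-sum-exp trick).
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logJoint) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        for (double v : logJoint) {
            sum += Math.exp(v - max);
        }
        // log of the evidence: log of the sum of the joint probabilities.
        double logEvidence = max + Math.log(sum);

        double[] result = new double[logJoint.length];
        for (int c = 0; c < logJoint.length; c++) {
            result[c] = Math.exp(logJoint[c] - logEvidence);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up log joint scores for three classes.
        double[] logJoint = {-120.0, -118.5, -125.0};
        System.out.println(Arrays.toString(posteriors(logJoint)));
    }
}
```

The returned values sum to 1, so they can be compared across documents of different lengths.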

What is the best way to determine whether a bag of words shouldn't belong to any class? I'm aware I could set a minimum threshold for P(bag of words occurring in class), and if all classes fall under that threshold, label the document as unclassified. However, I'm realising this prevents the classifier from being sensitive.
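The minimum-threshold idea described above might look something like this. It is only a sketch: `RejectOptionSketch`, `classify`, and the `UNCLASSIFIED` sentinel are hypothetical names, and `scores` stands for whatever per-class value is being thresholded.

```java
class RejectOptionSketch {

    /** Sentinel returned when no class is confident enough. */
    static final int UNCLASSIFIED = -1;

    /**
     * Returns the index of the highest-scoring class, or UNCLASSIFIED
     * if even the best score is below minScore.
     */
    static int classify(double[] scores, double minScore) {
        int best = UNCLASSIFIED;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < scores.length; c++) {
            if (scores[c] > bestScore) {
                bestScore = scores[c];
                best = c;
            }
        }
        // Reject the document when the winning class is still too weak.
        return bestScore >= minScore ? best : UNCLASSIFIED;
    }

    public static void main(String[] args) {
        double[] confident = {0.1, 0.7, 0.2};
        double[] unsure = {0.3, 0.4, 0.3};
        System.out.println(classify(confident, 0.5));
        System.out.println(classify(unsure, 0.5));
    }
}
```

A single global `minScore` is what makes this approach blunt: raising it rejects more junk but also rejects borderline documents that do belong to a class.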

Would another option be to create an 'unclassified' class and train it with documents I deem unclassifiable?

Thanks,

Mark

--- Edit ---

I had a thought: I could set a maximum threshold for P(bag of words occurring globally) * (number of words in document). This would mean that documents consisting of common words (typically the tweets I want to filter out), e.g. "yes I agree with you", would be filtered out. Any thoughts on this would be appreciated also.
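One way to sketch the common-word filter is shown below. Note that it is a variation, not the exact formula above: instead of multiplying the global probability by the word count, it averages the log of each word's global probability, which is another way of normalising for document length. Everything here is an assumption for illustration: the class `CommonWordFilterSketch`, the `globalProb` map of word to corpus-wide probability, the `floor` probability used for unseen words, and the threshold value.

```java
import java.util.HashMap;
import java.util.Map;

class CommonWordFilterSketch {

    /**
     * Average log global probability per token. Values closer to zero
     * mean the document is made of more common words.
     */
    static double meanLogGlobalProb(String[] tokens, Map<String, Double> globalProb, double floor) {
        if (tokens.length == 0) {
            throw new IllegalArgumentException("document has no tokens");
        }
        double sum = 0.0;
        for (String token : tokens) {
            // Words never seen in the corpus get the floor probability.
            sum += Math.log(globalProb.getOrDefault(token, floor));
        }
        return sum / tokens.length;
    }

    /** True when the document consists mostly of very common words. */
    static boolean isTooGeneric(String[] tokens, Map<String, Double> globalProb,
                                double floor, double maxMeanLogProb) {
        return meanLogGlobalProb(tokens, globalProb, floor) > maxMeanLogProb;
    }

    public static void main(String[] args) {
        // Made-up global word probabilities.
        Map<String, Double> globalProb = new HashMap<>();
        globalProb.put("yes", 0.02);
        globalProb.put("agree", 0.01);
        globalProb.put("you", 0.03);

        String[] tweet = {"yes", "agree", "you"};
        System.out.println(isTooGeneric(tweet, globalProb, 1e-6, -6.0));
    }
}
```

The threshold (`-6.0` in the example) would need to be chosen from real data, for instance by looking at the scores of tweets already known to be uninformative.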

Or perhaps I should find the standard deviation of the per-class probabilities and, if it is low, determine that the document should be unclassified?
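The standard-deviation idea could be sketched as below, assuming it is applied to the per-class probabilities of a single document. The names `SpreadSketch`, `standardDeviation`, and `isAmbiguous` are hypothetical, and the cut-off would have to be tuned on real data. The intuition is that a flat distribution (no class stands out) has a standard deviation near zero.

```java
class SpreadSketch {

    /** Population standard deviation of the given values. */
    static double standardDeviation(double[] values) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;

        double variance = 0.0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        variance /= values.length;
        return Math.sqrt(variance);
    }

    /** True when the class probabilities are too flat to pick a winner. */
    static boolean isAmbiguous(double[] classProbabilities, double minStdDev) {
        return standardDeviation(classProbabilities) < minStdDev;
    }

    public static void main(String[] args) {
        double[] flat = {0.25, 0.25, 0.25, 0.25};
        double[] peaked = {0.85, 0.05, 0.05, 0.05};
        System.out.println(isAmbiguous(flat, 0.05));
        System.out.println(isAmbiguous(peaked, 0.05));
    }
}
```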

I see two different options, both treating the problem as a set of 20 binary classification problems.

  1. You can compute the ratio P(doc being in class) / P(doc not being in class). Some Naive Bayes implementations use this kind of method.
  2. Assuming you have an evaluation measure, you can compute a threshold per class and optimise it through a cross-validation process. This is the standard way of applying text classification: you use thresholds (one per class) based on your data. In your case, SCut or SCutFBR would be the best option, as explained in this paper.
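A rough sketch of how the two options above could fit together is shown below. It assumes F1 as the evaluation measure and a held-out validation set, and it covers only the basic per-class threshold search (the SCut idea), not the SCutFBR variant. All names (`PerClassThresholdSketch`, `logOdds`, `f1`, `bestThreshold`, `assign`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PerClassThresholdSketch {

    /** Option 1: log of P(in class) / P(not in class) for one class. */
    static double logOdds(double probabilityInClass) {
        return Math.log(probabilityInClass / (1.0 - probabilityInClass));
    }

    /** F1 for one class when every score >= threshold counts as a positive. */
    static double f1(double[] scores, boolean[] positive, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predicted = scores[i] >= threshold;
            if (predicted && positive[i]) tp++;
            else if (predicted) fp++;
            else if (positive[i]) fn++;
        }
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : (2.0 * tp) / denominator;
    }

    /** Option 2: try each validation score as a threshold, keep the best F1. */
    static double bestThreshold(double[] scores, boolean[] positive) {
        double best = Double.POSITIVE_INFINITY; // never assigns if there is no data
        double bestF1 = -1.0;
        for (double candidate : scores) {
            double value = f1(scores, positive, candidate);
            if (value > bestF1) {
                bestF1 = value;
                best = candidate;
            }
        }
        return best;
    }

    /** Classes whose score reaches their own threshold; empty means unclassified. */
    static List<Integer> assign(double[] scores, double[] thresholds) {
        List<Integer> assigned = new ArrayList<>();
        for (int c = 0; c < scores.length; c++) {
            if (scores[c] >= thresholds[c]) {
                assigned.add(c);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Made-up validation scores and labels for a single class.
        double[] scores = {0.9, 0.8, 0.3, 0.2};
        boolean[] positive = {true, true, false, false};
        System.out.println(bestThreshold(scores, positive));
    }
}
```

With one tuned threshold per class, a tweet is simply left unclassified when `assign` returns an empty list, so no separate 'unclassified' class is needed.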

Regards,

