Hello again, faithful GeeryDev readers. If you frequent GeeryDev, then it's likely that you also frequent Throwing It Back Weekly (TIBW for the remainder of this post). Some might even say that there is a little friendly community rivalry here. But I will leave that for another time.
Today, I would like to set the record straight about who said what on these blogs. I have been working on a simple Naive Bayes text classifier to determine whether a text query is more likely to be of GeeryDev or TIBW origin. Before I get into the specifics, or if you'd rather ignore the specifics entirely, have some fun with it. You may come to some realizations that we will explain later. What good is this classification? Why, when searching random words, does it seem to just default to GeeryDev? This isn't good enough, so how can we make it better? Don't worry, I'll give you my thoughts at the end of this post.
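For the curious, the core of a Bernoulli-style Naive Bayes text classifier fits in a few dozen lines. The sketch below is a simplified stand-in, not the actual classifier behind this post: the toy `posts` data is hypothetical, and the real thing trains on the full post histories of both blogs.

```python
import math
from collections import Counter

# Toy training data -- stand-ins for the real blog post histories.
posts = {
    "GeeryDev": ["naive bayes text classifier", "bayesian inference prior"],
    "TIBW": ["throwback hits of the week", "weekly music throwback"],
}

def train(posts):
    """Count word occurrences and documents per class."""
    word_counts = {c: Counter(w for doc in docs for w in doc.split())
                   for c, docs in posts.items()}
    doc_counts = {c: len(docs) for c, docs in posts.items()}
    return word_counts, doc_counts

def classify(query, word_counts, doc_counts):
    """Pick the class with the highest log posterior:
    log P(class) + sum of log P(word | class)."""
    total_docs = sum(doc_counts.values())
    vocab = len(set(w for ct in word_counts.values() for w in ct))
    best_class, best_score = None, float("-inf")
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Prior from document counts, Laplace-smoothed likelihoods
        score = math.log(doc_counts[c] / total_docs)
        for w in query.split():
            score += math.log((counts[w] + 1) / (total + vocab))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With this toy data, a query like `"naive bayes"` lands on GeeryDev and `"weekly throwback"` lands on TIBW, purely because of which blog's vocabulary the words came from.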
Although I strongly disagree, some might consider this a useless application of Naive Bayes, and argue that a more useful classification problem to apply it to would be email spam filtering. That is widely considered one of the great examples of where Naive Bayes text classifiers have worked very well.
I am glad you asked. As we discussed earlier, Bayesian inference uses a prior probability as the starting point for calculating a posterior probability. Considering the GeeryDev and TIBW blog histories, GeeryDev has written just slightly more than TIBW. This means that, given any text query, GeeryDev is more likely to have written the text before a single word is even considered. The likelihood function will of course have no problem overcoming this history in cases where TIBW is clearly more likely, but there is a hill to climb.
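To put a number on that hill, here is a small sketch. The post counts below are hypothetical (the source only says GeeryDev has written "slightly more"), but the arithmetic is the point: with no word evidence at all, the log-odds already sit in GeeryDev's favor.

```python
import math

# Hypothetical post counts -- the actual history just has GeeryDev
# slightly ahead of TIBW.
n_geerydev, n_tibw = 55, 45

total = n_geerydev + n_tibw
prior_geerydev = n_geerydev / total   # 0.55
prior_tibw = n_tibw / total           # 0.45

# The "hill" the likelihood must climb: before any words are seen,
# the classifier starts this far in GeeryDev's favor (in log-odds).
head_start = math.log(prior_geerydev) - math.log(prior_tibw)
```

Under these made-up counts the head start is about 0.2 in log-odds, which is small; a single word that is a few times more common on TIBW is enough to flip the result. But for a query of random words that appear on neither blog, the prior decides, which is why the classifier seems to default to GeeryDev.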
You mean, how can we make this worse!?? Of course, the biggest opportunity to improve this classifier is simply getting more data. GeeryDev and TIBW would need to write considerably more to get this thing to be impressive. I also think a multinomial approach (as opposed to Bernoulli), where word counts matter rather than just word presence, may work a little better. And my wildest dreams would include a world with a word association system such as word2vec to get more comprehension out of such a little vocabulary. That would further violate our naive independence assumption, but I would be excited to see the results. Got any ideas? Help me out. You can see the source here, and your knowledge is always greatly appreciated.