Using the Naive Bayes algorithm to label data
Today I am going off-topic to write about using the Naive Bayes algorithm for record classification. It is a supervised learning algorithm, which means we train it on input records with known labels to build a model, and then apply that model to unknown records to assign each of them to one of the given categories.
Examples are the best way to get familiar with any new algorithm, so let's begin with an example dataset. I have used the dataset from this source, since I could not come up with an ideal dataset of my own that would explain the algorithm in a logical and clear way.
Outlook | Temperature | Humidity | Wind | To play |
---|---|---|---|---|
sunny | hot | high | false | no |
sunny | hot | high | true | no |
overcast | hot | high | false | yes |
rainy | mild | high | false | yes |
rainy | cool | normal | false | yes |
rainy | cool | normal | true | no |
overcast | cool | normal | true | yes |
sunny | mild | high | false | no |
sunny | cool | normal | false | yes |
rainy | mild | normal | false | yes |
sunny | mild | normal | true | yes |
overcast | mild | high | true | yes |
overcast | hot | normal | false | yes |
rainy | mild | high | true | no |
Listed above is the training dataset. Given the parameters
- Outlook
- Temperature
- Humidity
- Wind
every record is labeled with one of two values: to play outside ("yes") or not ("no"). So far so good. Now suppose you are given the following record, whose label is unknown. Your task is to use the training dataset and the Naive Bayes algorithm to compute its label.
Outlook | Temperature | Humidity | Wind | To play |
---|---|---|---|---|
sunny | cool | high | true | ? |
To do the actual classification, let's first look at Bayes' theorem, which the Naive Bayes algorithm is built on:
P(event1/event2) = (P(event2/event1) * P(event1))/(P(event2))
Let's look at the individual terms first. Here "/" reads as "given".
P(event1) - Probability of event1 on its own, also called the prior probability
P(event2) - Probability of event2 on its own
P(event1/event2) - Conditional probability that event1 happens given event2. This is the posterior probability, the quantity we actually want to compute from the training dataset
P(event2/event1) - Conditional probability that event2 happens given event1. This is also called the likelihood of event2 given event1
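As a quick sanity check of the formula, here is a minimal Python sketch that applies it to a single attribute. The function name `bayes_posterior` is my own, and the three inputs are counted by hand from the training table above: 2 of the 9 "yes" records are sunny, 9 of the 14 records are "yes", and 5 of the 14 records are sunny.

```python
from fractions import Fraction

def bayes_posterior(likelihood, prior, evidence):
    """Return P(event1/event2) = P(event2/event1) * P(event1) / P(event2)."""
    return likelihood * prior / evidence

# event1 = "To play is yes", event2 = "Outlook is sunny"
p_sunny_given_yes = Fraction(2, 9)   # likelihood
p_yes = Fraction(9, 14)              # prior
p_sunny = Fraction(5, 14)            # evidence

p_yes_given_sunny = bayes_posterior(p_sunny_given_yes, p_yes, p_sunny)
print(p_yes_given_sunny)         # 2/5
print(float(p_yes_given_sunny))  # 0.4
```

The result agrees with a direct count: of the 5 sunny records in the table, 2 are labeled "yes".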
Let's begin. For the given record there are two possible outcomes: To play = "yes" or To play = "no". To choose between them, we will compute the following two probabilities and compare them. Whichever is higher gives the verdict.
P(yes/(sunny/cool/high/true)) and P(no/(sunny/cool/high/true))
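Naive Bayes makes a simplifying ("naive") assumption: given the label, the 4 parameters are independent of each other, so the likelihood of the whole record is the product of 4 per-parameter likelihoods. The denominator P(sunny/cool/high/true) is the same for both verdicts, so we can leave it out and compare only the numerators. Each verdict therefore needs its prior and 4 intermediate probabilities. Strictly speaking, the products below are proportional to the two probabilities above rather than equal to them, which is all we need for a comparison.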
P(yes/(sunny/cool/high/true)) = P(yes) * P(sunny/yes) * P(cool/yes) * P(high/yes) * P(true/yes)
// Similarly
P(no/(sunny/cool/high/true)) = P(no) * P(sunny/no) * P(cool/no) * P(high/no) * P(true/no)
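In code, each of these products is just a prior multiplied by a list of likelihoods. A minimal sketch, with a helper name (`naive_bayes_score`) of my own choosing and made-up numbers that only show the shape of the computation:

```python
import math

def naive_bayes_score(prior, likelihoods):
    """Multiply the prior by every per-parameter likelihood."""
    return prior * math.prod(likelihoods)

# Made-up numbers, only to show the call; the real values are derived below.
example = naive_bayes_score(0.5, [0.5, 0.5, 0.5, 0.5])
print(example)  # 0.03125
```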
Now how do we calculate these intermediate probabilities? Ride along!
- First we will calculate P(yes) and P(no). Notice that there are 14 records in total, of which 5 say "no" and the remaining 9 say "yes". Given this information,
P(no) = 5 / 14 = 0.36
P(yes) = 9 / 14 = 0.64
That gives us two of the unknown probabilities above. Let's move on.
- To get the value of P(sunny/yes), count the number of records which have label saying "yes". Out of these records, again count the number of records for which outlook says "sunny".
Number of records with label "yes" - 9
Number of records with label "yes" which say outlook is sunny - 2
Which concludes to,
P(sunny/yes) = 2 / 9 = 0.22
- Similarly, to get the value of P(cool/yes), count the number of records which have label saying "yes". Out of these records, again count the number of records for which temperature says "cool".
Number of records with label "yes" - 9
Number of records with label "yes" which say temperature is cool - 3
Which concludes to,
P(cool/yes) = 3 / 9 = 0.33
- Using the same logic as in the previous two points, we calculate the remaining two intermediate probabilities for the verdict P(yes/(sunny/cool/high/true))
// For Humidity
Number of records with label "yes" - 9
Number of records with label "yes" which say humidity is high - 3
Which concludes to,
P(high/yes) = 3 / 9 = 0.33
// For Windy condition
Number of records with label "yes" - 9
Number of records with label "yes" which say wind is true - 3
Which concludes to,
P(true/yes) = 3 / 9 = 0.33
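The counting steps above can be sketched in Python as follows. This is only an illustration: the names `ROWS`, `COLUMNS`, `prior` and `likelihood` are my own, and the rows are copied from the training table. Fractions are used so the results are not affected by rounding.

```python
from fractions import Fraction

# Training table: (outlook, temperature, humidity, wind, to_play)
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
COLUMNS = {"outlook": 0, "temperature": 1, "humidity": 2, "wind": 3}

def prior(label):
    """P(label): share of all records carrying this label."""
    matching = [r for r in ROWS if r[4] == label]
    return Fraction(len(matching), len(ROWS))

def likelihood(column, value, label):
    """P(value/label): among records with this label, the share having this value."""
    with_label = [r for r in ROWS if r[4] == label]
    with_value = [r for r in with_label if r[COLUMNS[column]] == value]
    return Fraction(len(with_value), len(with_label))

print(prior("yes"), prior("no"))                 # 9/14 5/14
print(likelihood("outlook", "sunny", "yes"))     # 2/9
print(likelihood("temperature", "cool", "yes"))  # 1/3
print(likelihood("humidity", "high", "yes"))     # 1/3
print(likelihood("wind", "true", "yes"))         # 1/3
```

`Fraction` reduces 3/9 to 1/3, which is the 0.33 used in the text.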
With all the values at hand, let's compute P(yes/(sunny/cool/high/true))
P(yes/(sunny/cool/high/true)) = P(yes) * P(sunny/yes) * P(cool/yes) * P(high/yes) * P(true/yes)
P(yes/(sunny/cool/high/true)) = 0.64 * 0.22 * 0.33 * 0.33 * 0.33
P(yes/(sunny/cool/high/true)) = 0.00505
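The factors above are rounded to two decimals, so the product carries a small rounding error. A short sketch (variable names are mine) that multiplies the exact fractions next to the rounded values; the rounded inputs give about 0.00506, or 0.00505 when truncated as in the text:

```python
import math
from fractions import Fraction

# P(yes), P(sunny/yes), P(cool/yes), P(high/yes), P(true/yes) from the counts above
exact_factors = [Fraction(9, 14), Fraction(2, 9), Fraction(3, 9), Fraction(3, 9), Fraction(3, 9)]
rounded_factors = [0.64, 0.22, 0.33, 0.33, 0.33]

exact_score = math.prod(exact_factors)
rounded_score = math.prod(rounded_factors)

print(exact_score)                   # 1/189
print(round(float(exact_score), 5))  # 0.00529
print(round(rounded_score, 5))       # 0.00506
```

The difference does not change the verdict here, but keeping exact counts until the last step avoids accumulating rounding error.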
Now, let's go to the other side and compute the value of P(no/(sunny/cool/high/true)). Let's start from the beginning, slowly and cautiously.
- As calculated above, we already have the value of P(no), which is 0.36
- To get the value of P(sunny/no), count the number of records which have label saying "no". Out of these records, again count the number of records for which outlook says "sunny".
Number of records with label "no" - 5
Number of records with label "no" which say outlook is sunny - 3
Which concludes to,
P(sunny/no) = 3 / 5 = 0.6
- Using the same logic as in the previous points, we calculate the remaining three intermediate probabilities for the verdict P(no/(sunny/cool/high/true))
// For Temperature
Number of records with label "no" - 5
Number of records with label "no" which say temperature is cool - 1
Which concludes to,
P(cool/no) = 1 / 5 = 0.2
// For Humidity
Number of records with label "no" - 5
Number of records with label "no" which say humidity is high - 4
Which concludes to,
P(high/no) = 4 / 5 = 0.8
// For Windy condition
Number of records with label "no" - 5
Number of records with label "no" which say wind is true - 3
Which concludes to,
P(true/no) = 3 / 5 = 0.6
With all the values at hand, let's compute P(no/(sunny/cool/high/true))
P(no/(sunny/cool/high/true)) = P(no) * P(sunny/no) * P(cool/no) * P(high/no) * P(true/no)
P(no/(sunny/cool/high/true)) = 0.36 * 0.6 * 0.2 * 0.8 * 0.6
P(no/(sunny/cool/high/true)) = 0.02
Recall the previously computed value,
P(yes/(sunny/cool/high/true)) = 0.00505
// Thus finally since 0.02 > 0.00505 we can safely conclude that,
P(no/(sunny/cool/high/true)) > P(yes/(sunny/cool/high/true))
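The two products are scores rather than true probabilities, because the shared denominator P(sunny/cool/high/true) was left out. If actual probabilities are wanted, one option is to divide each score by the sum of both. A small sketch, using the exact fractions behind the rounded numbers above (variable names are mine):

```python
from fractions import Fraction

# Unnormalized scores, written as exact fractions of the products computed above.
scores = {
    "yes": Fraction(9, 14) * Fraction(2, 9) * Fraction(3, 9) * Fraction(3, 9) * Fraction(3, 9),
    "no": Fraction(5, 14) * Fraction(3, 5) * Fraction(1, 5) * Fraction(4, 5) * Fraction(3, 5),
}

total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
verdict = max(scores, key=scores.get)

for label, p in posteriors.items():
    print(label, round(float(p), 3))  # yes 0.205, then no 0.795
print(verdict)  # no
```

Normalizing does not change which label wins; it only rescales the two scores so that they add up to 1.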
Thus given the following test data,
Outlook | Temperature | Humidity | Wind | To play |
---|---|---|---|---|
sunny | cool | high | true | no |
the training data, and the Naive Bayes algorithm, we can conclude that we are not going to play today.
Let's wait until the weather settles and it's sunny, warm, and less windy and humid before venturing outside.
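For readers who prefer code, the whole walkthrough can be condensed into a small counting-based classifier. This is a minimal sketch under my own naming (`TRAINING`, `train`, `predict`), not a reference implementation. It applies no smoothing, so a value never seen together with a label would drive that label's score to zero.

```python
from collections import Counter, defaultdict

# Training table: (outlook, temperature, humidity, wind, to_play)
TRAINING = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def train(rows):
    """Count labels and (column index, value, label) combinations."""
    label_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(int)
    for *features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, value, label)] += 1
    return label_counts, value_counts

def predict(model, features):
    """Return the label with the highest prior * likelihoods product, plus all scores."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(features):
            score *= value_counts[(i, value, label)] / count  # P(value/label)
        scores[label] = score
    return max(scores, key=scores.get), scores

model = train(TRAINING)
label, scores = predict(model, ("sunny", "cool", "high", "true"))
print(label)  # no
```

The scores it returns are the unrounded versions of the two products computed by hand above.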
To make sure you understood the Naive Bayes explanation correctly, here's another test record with an unknown label. Can you use the training dataset to correctly identify its label?
Outlook | Temperature | Humidity | Wind | To play |
---|---|---|---|---|
overcast | mild | normal | false | ? |
Give it a try and message me on Twitter if you need any help with the exercise!
Reference:
Naive Bayes Classifiers