
Meet Naive Bayes — the OG of machine learning. He’s old, wise, and a bit “naive”… but surprisingly effective.

“I assume all your features are independent,” he says confidently, ignoring the obvious chemistry between ‘age’ and ‘income’. 💸😂


🎯 Why We Need It

You want to predict whether:

  • A customer will click on an ad 🖱️

  • An email is spam 💌

  • A review is positive or negative ⭐

…and you need a fast, interpretable algorithm that can handle text, counts, or categorical data. Enter our Bayesian magician 🧙.


🤔 The “Naive” Assumption

Naive Bayes assumes that features are independent given the class. Formally:

$$P(y|x_1, x_2, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i | y)$$

In plain English:

“To predict if an email is spam, I’ll pretend that the number of emojis and the presence of ‘FREE’ are unrelated — even though they clearly are.” 🤷‍♂️

Surprisingly, this “naive” move often works really well in practice.


📦 Types of Naive Bayes

| Type | Use Case | Example |
| --- | --- | --- |
| Gaussian NB | Continuous data | Customer spend, product ratings |
| Multinomial NB | Count-based data | Word frequencies in text |
| Bernoulli NB | Binary features | Presence/absence of words |
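
Here is a minimal sketch of how these three variants map onto scikit-learn (assuming `sklearn` is installed; the toy arrays below are purely illustrative):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data, purely illustrative: 4 samples, binary labels.
X_counts = [[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 2]]                # word counts
X_binary = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]                # word presence/absence
X_continuous = [[120.5, 4.2], [80.0, 3.1], [200.3, 4.8], [50.7, 2.2]]  # e.g. spend, rating
y = [1, 0, 1, 0]

MultinomialNB().fit(X_counts, y)   # count-based features
BernoulliNB().fit(X_binary, y)     # binary features
GaussianNB().fit(X_continuous, y)  # continuous features
```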

🧠 The Bayes Formula Refresher

At its heart lies Bayes’ Theorem, the king of conditional probabilities:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

We use it to flip probabilities around — turning what we can easily measure into what we actually want to know.

Example:

“What’s the probability of churn given high complaints?” instead of “What’s the probability of high complaints given churn?”
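
As a quick illustration of the flip, here is a tiny calculation; the 0.2, 0.7, and 0.3 below are hypothetical numbers chosen only to show the arithmetic:

```python
# A tiny worked example of flipping a conditional with Bayes' theorem.
# All numbers are made up purely for illustration.
p_churn = 0.2                    # P(churn)
p_complaints_given_churn = 0.7   # P(high complaints | churn) -- easy to measure
p_complaints = 0.3               # P(high complaints) overall

# Bayes' theorem: P(churn | complaints) = P(complaints | churn) * P(churn) / P(complaints)
p_churn_given_complaints = p_complaints_given_churn * p_churn / p_complaints
print(round(p_churn_given_complaints, 3))  # 0.467
```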


💼 Business Example: Spam or Not Spam

Let’s classify marketing emails as spam or not spam using Multinomial Naive Bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a FREE iPhone today",
    "Meeting scheduled at 3pm",
    "Earn money working from home",
    "Project update attached"
]

labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn each email into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Fit a Multinomial Naive Bayes classifier on the counts
model = MultinomialNB()
model.fit(X, labels)

new_email = ["Congratulations! You won a free trip"]
prediction = model.predict(vectorizer.transform(new_email))

print("Predicted class:", "Spam" if prediction[0] else "Not Spam")
```

Output:

Predicted class: Spam 😅

Naive Bayes saw the word “free” and couldn’t resist — classic move.


📊 Probabilistic Thinking in Action

Each prediction combines the class prior P(spam) with per-word likelihoods:

  • P(word='free' | spam)

  • P(word='win' | spam)

  • P(word='project' | spam)

Even if it assumes independence, combining them often gives surprisingly accurate results.
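
If you want to peek behind the prediction, one way is to inspect `predict_proba` and the per-word log-likelihoods on the `model` and `vectorizer` fitted above (a quick sketch, assuming the same objects are still in scope):

```python
# Posterior class probabilities for the new email
probs = model.predict_proba(vectorizer.transform(new_email))
print(dict(zip(model.classes_, probs[0])))   # e.g. {0: ..., 1: ...}

# Per-word log-likelihoods log P(word | class), one row per class
print(model.feature_log_prob_.shape)         # (2, vocabulary size)
```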


🔥 Pros & Cons

| 😎 Pros | 😬 Cons |
| --- | --- |
| Super fast (trains in milliseconds) | Assumes independence (often false) |
| Works great for text & categorical data | Struggles with correlated features |
| Interpretable probabilities | Less flexible for continuous data |

🧩 Practice Exercise

Try this:

  1. Create a small dataset of customer reviews (positive/negative).

  2. Use CountVectorizer and MultinomialNB to classify them.

  3. Check model.predict_proba() — see how confident your model is.

  4. Add sarcasm in a review (“Oh great, another broken product 🙃”) and see if your model gets confused 😆.
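
If you want a starting point, here is one possible sketch (the reviews and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "Great product, works perfectly",
    "Terrible quality, broke after a day",
    "Absolutely love it",
    "Waste of money, very disappointed",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(reviews), labels)

test = ["Oh great, another broken product"]
print(clf.predict_proba(vec.transform(test)))  # how confident is the model?
```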


🎓 TL;DR

| Concept | Meaning |
| --- | --- |
| Naive Bayes | Probabilistic model using Bayes’ theorem |
| Naive | Assumes feature independence |
| Output | Class probabilities |
| Best for | Text classification, spam detection, sentiment analysis |

💬 “Naive Bayes is like that overconfident intern — makes bold assumptions but still nails most predictions.” 🧑‍💻💪


🔗 Next Up: Calibration & Class Imbalance – where we learn how to handle the real world, where your model loves the majority class a little too much. ⚖️

Turning Text into Numbers for Machine Learning with Naive Bayes

Turning Text into Usable Data

Each piece of data, called $x$, is a string of text (like an email or message) that can be any length. To use this text with machine learning, we need to convert it into a list of numbers, called a $d$-dimensional feature vector $\phi(x)$, where $d$ is the number of features (like characteristics we measure).

Ways to Convert Text to Numbers

  1. Custom Features:

    • Look at the text and create features based on what you know about it.

    • Examples:

      • Does the text have the word “church”? ($x_j = 1$ if yes, $0$ if no)

      • Is the email sent from outside the U.S.? ($x_j = 1$ if yes, $0$ if no)

      • Is the sender a university? ($x_j = 1$ if yes, $0$ if no)

  2. Word Presence Features:

    • Make a list of words (called a vocabulary) and check if each word is in the text.

    • For words like “Aardvark”, “Apple”, ..., “Zebra”:

      • Set $x_j = 1$ if the word is in the text, $0$ if it’s not.

  3. Deep Learning Methods:

    • Advanced machine learning techniques can work directly with raw text, using tools like neural networks that understand sequences of letters or words.

Bag of Words Model

A common way to represent text is the bag of words model, which treats text like a bag of words, ignoring their order.

  • Vocabulary:

    • Create a list of words to track, called the vocabulary $V$:

      $$V = \{\text{church}, \text{doctor}, \text{fervently}, \text{purple}, \text{slow}, \dots\}$$

  • Feature Vector:

    • Turn a text $x$ into a list of 1s and 0s, with one number for each word in $V$. This list is $\phi(x)$, and its length is the number of words in $V$, written $|V|$:

      $$\phi(x) = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 1 \\ \vdots \end{pmatrix} \begin{array}{l} \text{church} \\ \text{doctor} \\ \text{fervently} \\ \vdots \\ \text{purple} \\ \vdots \end{array}$$

    • Each number $\phi(x)_j$ is $1$ if the $j$-th word in $V$ is in the text, or $0$ if it’s not. For example, if “doctor” is in the text, the second number is $1$.
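
As a rough sketch of this encoding in Python (the `phi` helper and the example sentence are illustrative, not part of any library):

```python
# Minimal binary bag-of-words encoding over the vocabulary from the text above.
vocabulary = ["church", "doctor", "fervently", "purple", "slow"]

def phi(text):
    """Return the binary feature vector: 1 if the j-th vocabulary word appears in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(phi("The doctor walked slowly toward the purple door"))  # [0, 1, 0, 1, 0]
```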

Building a Model to Classify Text

For binary classification (e.g., spam vs. not spam), we create two models using a dataset with labeled examples (texts marked as spam or not):

$$P_\theta(x|y=0) \quad \text{and} \quad P_\theta(x|y=1)$$

  • $P_\theta(x|y=0)$ tells us how likely a text $x$ is to be not spam ($y=0$).

  • $P_\theta(x|y=1)$ tells us how likely it is to be spam ($y=1$).

  • The $\theta$ means these probabilities depend on parameters we’ll learn, and we use the bag-of-words representation for $x$.

What is a Categorical Distribution?

A Categorical distribution is like rolling a die with $K$ sides, where each side has a probability:

$$P_\theta(x = j) = \theta_j$$

  • $x$ can be one of $K$ outcomes (like $1, 2, \dots, K$).

  • $\theta_j$ is the probability of outcome $j$.

  • When $K=2$ (e.g., yes/no), it’s called a Bernoulli distribution, like flipping a coin.
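
Here is a small sketch of sampling from a Categorical distribution with NumPy (assuming `numpy` is available; the probabilities are arbitrary examples):

```python
import numpy as np

# A Categorical distribution over K = 3 outcomes: P(x = j) = theta_j.
theta = [0.5, 0.3, 0.2]  # must sum to 1
samples = np.random.choice([1, 2, 3], size=10, p=theta)
print(samples)

# K = 2 is the special case of a Bernoulli distribution (a coin flip).
coin = np.random.choice([0, 1], size=10, p=[0.4, 0.6])
print(coin)
```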

Naive Bayes Simplification

Text data can have many words (e.g., 10,000), making it hard to calculate probabilities for every possible combination of words. The Naive Bayes assumption simplifies this by assuming each word’s presence is independent of the others:

$$P_\theta(x = x' | y) = \prod_{j=1}^d P_\theta(x_j = x_j' | y)$$

  • $x$ is the text’s feature vector, and $x'$ is a specific vector (e.g., $[0, 1, 0, \dots]$).

  • $P_\theta(x = x' | y)$ is the probability that a text has the exact word pattern $x'$ given its class $y$.

  • The product $\prod$ means we multiply the probabilities for each word’s presence or absence.

For example, if a text has:

$$x = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \begin{array}{l} \text{church} \\ \text{doctor} \\ \text{fervently} \\ \vdots \\ \text{purple} \end{array}$$

The probability it belongs to class $y$ (e.g., spam) is:

$$P_\theta \left( x = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \,\middle|\, y \right) = P_\theta(x_1=0|y) \cdot P_\theta(x_2=1|y) \cdot \dots \cdot P_\theta(x_d=0|y)$$

  • $P_\theta(x_1=0|y)$ is the chance “church” is absent given class $y$.

  • $P_\theta(x_2=1|y)$ is the chance “doctor” is present, and so on.

Parameters in Naive Bayes
  • Each word’s probability $P_\theta(x_j | y=k)$ is a Bernoulli distribution with a parameter $\psi_{jk}$:

    $$P_\theta(x_j = 1 | y=k) = \psi_{jk}, \quad P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk}$$

    • $\psi_{jk}$ is the probability that word $j$ appears in class $k$.

  • If we have $K$ classes (e.g., spam and not spam, so $K=2$) and $d$ words, we need $Kd$ parameters, which is much simpler than calculating every possible text combination.

Naive Bayes for Bag of Words
  • The chance a text $x$ has a specific word pattern $x'$ in class $k$:

    $$P_\theta(x = x' | y=k) = \prod_{j=1}^d P_\theta(x_j = x_j' | y=k)$$

  • Each word’s presence is a Bernoulli:

    $$P_\theta(x_j = 1 | y=k) = \psi_{jk}, \quad P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk}$$

  • We need $Kd$ parameters, where $\psi_{jk}$ is the chance word $j$ is in a text of class $k$.
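
A rough sketch of this likelihood computation, with a made-up $\psi$ matrix (3 words, 2 classes), might look like this:

```python
import numpy as np

# psi[j, k] = P(word j present | class k); the numbers are invented for illustration.
psi = np.array([[0.8, 0.1],    # word 1
                [0.2, 0.7],    # word 2
                [0.5, 0.5]])   # word 3
x = np.array([1, 0, 1])        # a specific word pattern x'

def likelihood(x, k):
    """P(x = x' | y = k) = prod_j psi_jk^x_j * (1 - psi_jk)^(1 - x_j)."""
    return np.prod(psi[:, k] ** x * (1 - psi[:, k]) ** (1 - x))

print(likelihood(x, 0), likelihood(x, 1))  # 0.8*0.8*0.5 = 0.32 and 0.1*0.3*0.5 = 0.015
```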

Does Naive Bayes Work Well?
  • Problem: It assumes words don’t affect each other, but in real life, words like “bank” and “account” often appear together in spam.

  • Effect: This can make the model’s probabilities too extreme (over- or under-confident).

  • Benefit: Even with this flaw, Naive Bayes often classifies texts accurately, making it a practical choice.

Setting Up Class Probabilities

We need a starting guess for how likely each class is, called the prior $P_\theta(y=k)$:

  • Use a Categorical distribution with parameters $\vec{\phi} = (\phi_1, \dots, \phi_K)$:

    $$P_\theta(y=k) = \phi_k$$

  • We can learn $\phi_k$ from the data, like counting how many texts are spam vs. not spam.

Bernoulli Naive Bayes Model

The Bernoulli Naive Bayes model works with binary data $x \in \{0,1\}^d$ (e.g., bag-of-words where each word is present or absent):

  • Parameters: $\theta = (\phi_1, \dots, \phi_K, \psi_{11}, \dots, \psi_{dK})$, totaling $K(d+1)$ parameters.

  • Class Prior:

    $$P_\theta(y) = \text{Categorical}(\phi_1, \phi_2, \dots, \phi_K)$$

  • Word Probabilities:

    $$P_\theta(x_j = 1 | y=k) = \text{Bernoulli}(\psi_{jk})$$

    $$P_\theta(x | y=k) = \prod_{j=1}^d P_\theta(x_j | y=k)$$
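
For reference, scikit-learn ships this model as `BernoulliNB`. A minimal sketch follows; note that sklearn applies additive smoothing by default (its `alpha` parameter), so its estimates will not exactly match the raw counting formulas in the next section:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny illustrative binary bag-of-words data: 4 documents, 3 vocabulary words.
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array([1, 0, 1, 0])

clf = BernoulliNB()   # default alpha=1.0 adds Laplace smoothing
clf.fit(X, y)
print(np.exp(clf.class_log_prior_))    # estimated class priors phi_k
print(np.exp(clf.feature_log_prob_))   # estimated word probabilities psi_jk (smoothed)
```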

Learning the Model’s Parameters

Learning Class Probabilities $\phi_k$

  • Suppose we have $n$ texts, and $n_k$ of them belong to class $k$ (e.g., $n_k$ spam emails).

  • The best estimate for $\phi_k$ is:

    $$\phi_k = \frac{n_k}{n}$$

  • This is just the fraction of texts in class $k$. For example, if 30 out of 100 emails are spam, $\phi_{\text{spam}} = 0.3$.

Learning Word Probabilities $\psi_{jk}$

  • For each class $k$ and word $j$, $\psi_{jk}$ is the chance that word $j$ appears in texts of class $k$:

    $$\psi_{jk} = \frac{n_{jk}}{n_k}$$

  • $n_{jk}$ is the number of texts in class $k$ that contain word $j$.

  • Example: If 20 out of 50 spam emails contain “doctor”, then $\psi_{\text{doctor,spam}} = \frac{20}{50} = 0.4$.

Making Predictions

  • To classify a new text, use Bayes’ rule to find the most likely class:

    $$\arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y)$$

  • Calculate $P_\theta(x|y=k)P_\theta(y=k)$ for each class $k$ (e.g., spam and not spam), and pick the class with the highest value.

  • Think of it like scoring how “spam-like” or “not-spam-like” the text is, then choosing the best match.

Example Scenario

Suppose we have a small dataset of emails, and we want to classify them into three categories: Spam ($y = 1$), Work ($y = 2$), or Personal ($y = 3$). Each email is represented as a bag-of-words feature vector based on a small vocabulary of four words: “deal,” “meeting,” “friend,” and “family” ($d = 4$). We’ll use the Bernoulli Naive Bayes model to:

  1. Represent the emails as feature vectors.

  2. Estimate the model parameters ($\phi_k$ and $\psi_{jk}$).

  3. Predict the class of a new email.

Example: Classifying Emails with Naive Bayes (Three Categories)

This example applies the Naive Bayes model described above to classify emails into three categories: Spam, Work, or Personal, using a small vocabulary. We’ll use the same math and concepts, showing how they work with $K = 3$ classes.

Step 1: Representing Emails as Feature Vectors

Each email $x$ is a sequence of words. We convert it into a $d$-dimensional feature vector $\phi(x)$ using the bag of words model.

Vocabulary

Define a vocabulary $V$ with $d = 4$ words:

$$V = \{\text{deal}, \text{meeting}, \text{friend}, \text{family}\}$$

Feature Vector

For an email $x$, we create a binary vector $\phi(x) \in \{0,1\}^4$:

$$\phi(x) = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} \begin{array}{l} \text{deal} \\ \text{meeting} \\ \text{friend} \\ \text{family} \end{array}$$

  • $x_j = 1$ if word $j$ (e.g., “deal”) is in the email, else $x_j = 0$.

  • Example: If an email contains “meeting” and “friend” but not “deal” or “family,” its feature vector is:

    $$\phi(x) = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}$$

Step 2: Dataset

Suppose we have a small training dataset with $n = 10$ emails, labeled as Spam ($y = 1$), Work ($y = 2$), or Personal ($y = 3$):

| Email ID | Words Present | Class ($y$) |
| --- | --- | --- |
| 1 | deal, friend | Spam (1) |
| 2 | deal | Spam (1) |
| 3 | meeting, friend | Work (2) |
| 4 | meeting | Work (2) |
| 5 | meeting, family | Work (2) |
| 6 | friend, family | Personal (3) |
| 7 | family | Personal (3) |
| 8 | friend | Personal (3) |
| 9 | deal, meeting | Spam (1) |
| 10 | friend, family | Personal (3) |

Feature Vectors

Each email is converted to a 4D binary vector:

  • Email 1: $x^{(1)} = [1, 0, 1, 0]$ (deal, friend)

  • Email 2: $x^{(2)} = [1, 0, 0, 0]$ (deal)

  • Email 3: $x^{(3)} = [0, 1, 1, 0]$ (meeting, friend)

  • ...

  • Email 10: $x^{(10)} = [0, 0, 1, 1]$ (friend, family)

Step 3: Bernoulli Naive Bayes Model

We use the Bernoulli Naive Bayes model for binary data $x \in \{0,1\}^4$, with three classes ($K = 3$).

Model Components

  • Parameters: $\theta = (\phi_1, \phi_2, \phi_3, \psi_{11}, \psi_{21}, \dots, \psi_{43})$, with $K(d+1) = 3(4+1) = 15$ parameters.

  • Class Prior:

    $$P_\theta(y) = \text{Categorical}(\phi_1, \phi_2, \phi_3)$$

    • $\phi_k$ is the probability of class $k$.

  • Feature Likelihood:

    $$P_\theta(x_j = 1 | y=k) = \text{Bernoulli}(\psi_{jk})$$

    $$P_\theta(x | y=k) = \prod_{j=1}^4 P_\theta(x_j | y=k)$$

    • $\psi_{jk}$ is the probability that word $j$ is present in class $k$.

    • $P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk}$.

Naive Bayes Assumption

We assume words are independent given the class:

$$P_\theta(x = x' | y=k) = \prod_{j=1}^4 P_\theta(x_j = x_j' | y=k)$$

For a vector $x' = [0, 1, 1, 0]$:

$$P_\theta \left( x = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix} \,\middle|\, y=k \right) = P_\theta(x_1=0|y=k) \cdot P_\theta(x_2=1|y=k) \cdot P_\theta(x_3=1|y=k) \cdot P_\theta(x_4=0|y=k)$$

Step 4: Learning Parameters

Learning Class Priors $\phi_k$

Count the number of emails in each class:

  • Spam ($y = 1$): $n_1 = 3$ (Emails 1, 2, 9)

  • Work ($y = 2$): $n_2 = 3$ (Emails 3, 4, 5)

  • Personal ($y = 3$): $n_3 = 4$ (Emails 6, 7, 8, 10)

  • Total emails: $n = 10$

The prior probabilities are:

$$\phi_k = \frac{n_k}{n}$$

  • $\phi_1 = \frac{3}{10} = 0.3$ (Spam)

  • $\phi_2 = \frac{3}{10} = 0.3$ (Work)

  • $\phi_3 = \frac{4}{10} = 0.4$ (Personal)

Learning Feature Parameters $\psi_{jk}$

For each class $k$ and word $j$, compute:

$$\psi_{jk} = \frac{n_{jk}}{n_k}$$

  • $n_{jk}$ is the number of emails in class $k$ where word $j$ is present.

  • $n_k$ is the number of emails in class $k$.

Spam ($k = 1$, $n_1 = 3$)

  • Word 1 (deal): Present in Emails 1, 2, 9 ($n_{11} = 3$)

    $$\psi_{11} = \frac{3}{3} = 1.0$$

  • Word 2 (meeting): Present in Email 9 ($n_{21} = 1$)

    $$\psi_{21} = \frac{1}{3} \approx 0.333$$

  • Word 3 (friend): Present in Email 1 ($n_{31} = 1$)

    $$\psi_{31} = \frac{1}{3} \approx 0.333$$

  • Word 4 (family): Absent ($n_{41} = 0$)

    $$\psi_{41} = \frac{0}{3} = 0.0$$
Work ($k = 2$, $n_2 = 3$)

  • Word 1 (deal): Absent ($n_{12} = 0$)

    $$\psi_{12} = \frac{0}{3} = 0.0$$

  • Word 2 (meeting): Present in Emails 3, 4, 5 ($n_{22} = 3$)

    $$\psi_{22} = \frac{3}{3} = 1.0$$

  • Word 3 (friend): Present in Email 3 ($n_{32} = 1$)

    $$\psi_{32} = \frac{1}{3} \approx 0.333$$

  • Word 4 (family): Present in Email 5 ($n_{42} = 1$)

    $$\psi_{42} = \frac{1}{3} \approx 0.333$$
Personal ($k = 3$, $n_3 = 4$)

  • Word 1 (deal): Absent ($n_{13} = 0$)

    $$\psi_{13} = \frac{0}{4} = 0.0$$

  • Word 2 (meeting): Absent ($n_{23} = 0$)

    $$\psi_{23} = \frac{0}{4} = 0.0$$

  • Word 3 (friend): Present in Emails 6, 8, 10 ($n_{33} = 3$)

    $$\psi_{33} = \frac{3}{4} = 0.75$$

  • Word 4 (family): Present in Emails 6, 7, 10 ($n_{43} = 3$)

    $$\psi_{43} = \frac{3}{4} = 0.75$$

Parameter Summary

  • Priors: $\phi_1 = 0.3$, $\phi_2 = 0.3$, $\phi_3 = 0.4$

  • Feature probabilities:

    | Word $j$ | Spam ($\psi_{j1}$) | Work ($\psi_{j2}$) | Personal ($\psi_{j3}$) |
    | --- | --- | --- | --- |
    | deal | 1.0 | 0.0 | 0.0 |
    | meeting | 0.333 | 1.0 | 0.0 |
    | friend | 0.333 | 0.333 | 0.75 |
    | family | 0.0 | 0.333 | 0.75 |
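
As a sanity check, here is a short script (hand-rolled with NumPy rather than scikit-learn) that re-derives these priors and $\psi_{jk}$ values from the Step 2 dataset using the counting formulas above:

```python
import numpy as np

# Columns: deal, meeting, friend, family; labels: 1 = Spam, 2 = Work, 3 = Personal.
X = np.array([
    [1, 0, 1, 0],  # 1: deal, friend      -> Spam
    [1, 0, 0, 0],  # 2: deal              -> Spam
    [0, 1, 1, 0],  # 3: meeting, friend   -> Work
    [0, 1, 0, 0],  # 4: meeting           -> Work
    [0, 1, 0, 1],  # 5: meeting, family   -> Work
    [0, 0, 1, 1],  # 6: friend, family    -> Personal
    [0, 0, 0, 1],  # 7: family            -> Personal
    [0, 0, 1, 0],  # 8: friend            -> Personal
    [1, 1, 0, 0],  # 9: deal, meeting     -> Spam
    [0, 0, 1, 1],  # 10: friend, family   -> Personal
])
y = np.array([1, 1, 2, 2, 2, 3, 3, 3, 1, 3])

for k in (1, 2, 3):
    n_k = (y == k).sum()
    phi_k = n_k / len(y)             # class prior n_k / n
    psi_k = X[y == k].mean(axis=0)   # per-word presence rate n_jk / n_k
    print(k, phi_k, np.round(psi_k, 3))
# Expected: priors 0.3, 0.3, 0.4 and the psi values from the table above.
```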

Step 5: Predicting a New Email

Suppose a new email contains “friend” and “family” ($x' = [0, 0, 1, 1]$). We predict its class using Bayes’ rule:

$$\arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y)$$

Compute $P_\theta(x = x' | y=k)P_\theta(y=k)$ for each class $k$.

Spam ($k = 1$)

$$P_\theta(x = [0, 0, 1, 1] | y=1) = P_\theta(x_1=0|y=1) \cdot P_\theta(x_2=0|y=1) \cdot P_\theta(x_3=1|y=1) \cdot P_\theta(x_4=1|y=1)$$

  • $P_\theta(x_1=0|y=1) = 1 - \psi_{11} = 1 - 1.0 = 0.0$

  • $P_\theta(x_2=0|y=1) = 1 - \psi_{21} = 1 - 0.333 = 0.667$

  • $P_\theta(x_3=1|y=1) = \psi_{31} = 0.333$

  • $P_\theta(x_4=1|y=1) = \psi_{41} = 0.0$

Since $P_\theta(x_1=0|y=1) = 0.0$, the product is:

$$P_\theta(x | y=1) = 0.0 \cdot 0.667 \cdot 0.333 \cdot 0.0 = 0.0$$

$$P_\theta(x | y=1)P_\theta(y=1) = 0.0 \cdot 0.3 = 0.0$$

Work ($k = 2$)

$$P_\theta(x = [0, 0, 1, 1] | y=2) = P_\theta(x_1=0|y=2) \cdot P_\theta(x_2=0|y=2) \cdot P_\theta(x_3=1|y=2) \cdot P_\theta(x_4=1|y=2)$$

  • $P_\theta(x_1=0|y=2) = 1 - \psi_{12} = 1 - 0.0 = 1.0$

  • $P_\theta(x_2=0|y=2) = 1 - \psi_{22} = 1 - 1.0 = 0.0$

  • $P_\theta(x_3=1|y=2) = \psi_{32} = 0.333$

  • $P_\theta(x_4=1|y=2) = \psi_{42} = 0.333$

Since $P_\theta(x_2=0|y=2) = 0.0$:

$$P_\theta(x | y=2) = 1.0 \cdot 0.0 \cdot 0.333 \cdot 0.333 = 0.0$$

$$P_\theta(x | y=2)P_\theta(y=2) = 0.0 \cdot 0.3 = 0.0$$

Personal ($k = 3$)

$$P_\theta(x = [0, 0, 1, 1] | y=3) = P_\theta(x_1=0|y=3) \cdot P_\theta(x_2=0|y=3) \cdot P_\theta(x_3=1|y=3) \cdot P_\theta(x_4=1|y=3)$$

  • $P_\theta(x_1=0|y=3) = 1 - \psi_{13} = 1 - 0.0 = 1.0$

  • $P_\theta(x_2=0|y=3) = 1 - \psi_{23} = 1 - 0.0 = 1.0$

  • $P_\theta(x_3=1|y=3) = \psi_{33} = 0.75$

  • $P_\theta(x_4=1|y=3) = \psi_{43} = 0.75$

$$P_\theta(x | y=3) = 1.0 \cdot 1.0 \cdot 0.75 \cdot 0.75 = 0.5625$$

$$P_\theta(x | y=3)P_\theta(y=3) = 0.5625 \cdot 0.4 = 0.225$$

Prediction

Compare the scores:

  • Spam: $0.0$

  • Work: $0.0$

  • Personal: $0.225$

The highest score is for Personal ($y=3$), so we predict the email is Personal.
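
The same scoring can be reproduced in a few lines (hand-coded with NumPy, plugging in the parameter table above):

```python
import numpy as np

# Score the new email x' = [0, 0, 1, 1] against each class.
phi = {1: 0.3, 2: 0.3, 3: 0.4}                    # class priors
psi = {1: np.array([1.0, 1/3, 1/3, 0.0]),         # Spam
       2: np.array([0.0, 1.0, 1/3, 1/3]),         # Work
       3: np.array([0.0, 0.0, 0.75, 0.75])}       # Personal
x_new = np.array([0, 0, 1, 1])                    # friend, family

scores = {}
for k in (1, 2, 3):
    likelihood = np.prod(psi[k] ** x_new * (1 - psi[k]) ** (1 - x_new))
    scores[k] = likelihood * phi[k]

print(scores)                       # Spam and Work score 0.0, Personal about 0.225
print(max(scores, key=scores.get))  # 3 -> Personal
```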

Why This Works for More Than Two Categories

  • The Naive Bayes model scales to $K > 2$ classes by:

    • Estimating a prior $\phi_k$ for each class $k = 1, 2, \dots, K$.

    • Learning $\psi_{jk}$ for each word $j$ and class $k$, resulting in $Kd$ feature parameters.

    • Computing $P_\theta(x | y=k)P_\theta(y=k)$ for all $K$ classes during prediction.

  • In this example, $K = 3$, but the same process applies for any $K$. For $K = 4$ (e.g., adding a “Promotional” class), you’d count emails in the new class and estimate $\phi_4$ and $\psi_{j4}$ similarly.

```python
# Your code here
```