Naive Bayes#
Meet Naive Bayes — the OG of machine learning. He’s old, wise, and a bit “naive”… but surprisingly effective.
“I assume all your features are independent,” he says confidently, ignoring the obvious chemistry between ‘age’ and ‘income’. 💸😂
🎯 Why We Need It#
You want to predict whether:
A customer will click on an ad 🖱️
An email is spam 💌
A review is positive or negative ⭐
…and you need a fast, interpretable algorithm that can handle text, counts, or categorical data. Enter our Bayesian magician 🧙.
🤔 The “Naive” Assumption#
Naive Bayes assumes that features are independent given the class. Formally:
$$ P(y|x_1, x_2, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i | y) $$
In plain English:
“To predict if an email is spam, I’ll pretend that the number of emojis and the presence of ‘FREE’ are unrelated — even though they clearly are.” 🤷♂️
Surprisingly, this “naive” move often works really well in practice.
📦 Types of Naive Bayes#
| Type | Use Case | Example |
|---|---|---|
| Gaussian NB | Continuous data | Customer spend, product ratings |
| Multinomial NB | Count-based data | Word frequencies in text |
| Bernoulli NB | Binary features | Presence/absence of words |
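As a quick, hedged sketch of how these variants map onto scikit-learn (the toy arrays below are made up purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy arrays, invented only to show which estimator fits which kind of feature
X_continuous = np.array([[25, 40000.0], [47, 90000.0], [35, 62000.0], [52, 120000.0]])  # e.g. age, spend
X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 1, 4], [0, 0, 2]])                        # e.g. word counts
X_binary = (X_counts > 0).astype(int)                                                     # presence/absence
y = [1, 0, 1, 0]

print(GaussianNB().fit(X_continuous, y).predict(X_continuous))   # continuous features
print(MultinomialNB().fit(X_counts, y).predict(X_counts))        # count features
print(BernoulliNB().fit(X_binary, y).predict(X_binary))          # binary features
```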
🧠 The Bayes Formula Refresher#
At its heart lies Bayes’ Theorem, the king of conditional probabilities:
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
We use it to flip probabilities around — turning what we can easily measure into what we actually want to know.
Example:
“What’s the probability of churn given high complaints?” instead of “What’s the probability of high complaints given churn?”
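To make the flip concrete, here is a tiny sketch with made-up numbers (the 0.70 / 0.10 / 0.25 values are purely illustrative):

```python
# Hypothetical numbers, purely for illustration:
p_complaints_given_churn = 0.70   # P(high complaints | churn) -- easy to measure from churned customers
p_churn = 0.10                    # P(churn) -- base rate
p_complaints = 0.25               # P(high complaints) -- overall rate

# Bayes' theorem flips the conditional around:
p_churn_given_complaints = p_complaints_given_churn * p_churn / p_complaints
print(round(p_churn_given_complaints, 3))  # 0.28
```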
💼 Business Example: Spam or Not Spam#
Let’s classify marketing emails as spam or not spam using Multinomial Naive Bayes.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny labeled dataset of marketing emails
emails = [
    "Win a FREE iPhone today",
    "Meeting scheduled at 3pm",
    "Earn money working from home",
    "Project update attached"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn raw text into word-count features, then fit the classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen email
new_email = ["Congratulations! You won a free trip"]
prediction = model.predict(vectorizer.transform(new_email))
print("Predicted class:", "Spam" if prediction[0] else "Not Spam")
```
Output:
Predicted class: Spam 😅
Naive Bayes saw the word “free” and couldn’t resist — classic move.
📊 Probabilistic Thinking in Action#
Each prediction is based on conditional probabilities:
P(spam | word='free'), P(spam | word='win'), P(spam | word='project')
Even if it assumes independence, combining them often gives surprisingly accurate results.
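If you want the underlying probabilities rather than just the label, `predict_proba` exposes them. A small follow-up to the example above (assuming `model`, `vectorizer`, and `new_email` from that code cell are still in scope):

```python
# Continuing the spam example above: inspect the class probabilities
proba = model.predict_proba(vectorizer.transform(new_email))
print(dict(zip(model.classes_.tolist(), proba[0].round(3).tolist())))
# The spam class (1) should get the larger probability on this tiny dataset.
```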
🔥 Pros & Cons#
| 😎 Pros | 😬 Cons |
|---|---|
| Super fast (train in milliseconds) | Assumes independence (often false) |
| Works great for text & categorical data | Struggles with correlated features |
| Interpretable probabilities | Less flexible for continuous data |
🧩 Practice Exercise#
Try this:
Create a small dataset of customer reviews (positive/negative).
Use `CountVectorizer` and `MultinomialNB` to classify them.
Check `model.predict_proba()` — see how confident your model is.
Add sarcasm in a review (“Oh great, another broken product 🙃”) and see if your model gets confused 😆.
🎓 TL;DR#
| Concept | Meaning |
|---|---|
| Naive Bayes | Probabilistic model using Bayes’ theorem |
| Naive | Assumes feature independence |
| Output | Class probabilities |
| Best for | Text classification, spam detection, sentiment analysis |
💬 “Naive Bayes is like that overconfident intern — makes bold assumptions but still nails most predictions.” 🧑💻💪
🔗 Next Up: Calibration & Class Imbalance – where we learn how to handle the real world, where your model loves the majority class a little too much. ⚖️
Turning Text into Numbers for Machine Learning with Naive Bayes#
Turning Text into Usable Data#
Each piece of data, called \( x \), is a string of text (like an email or message) that can be any length. To use this text with machine learning, we need to convert it into a list of numbers, called a \( d \)-dimensional feature vector \( \phi(x) \), where \( d \) is the number of features (like characteristics we measure).
Ways to Convert Text to Numbers#
Custom Features:
Look at the text and create features based on what you know about it.
Examples:
Does the text have the word “church”? (\( x_j = 1 \) if yes, \( 0 \) if no)
Is the email sent from outside the U.S.? (\( x_j = 1 \) if yes, \( 0 \) if no)
Is the sender a university? (\( x_j = 1 \) if yes, \( 0 \) if no)
Word Presence Features:
Make a list of words (called a vocabulary) and check if each word is in the text.
For words like “Aardvark”, “Apple”, …, “Zebra”:
Set \( x_j = 1 \) if the word is in the text, \( 0 \) if it’s not.
Deep Learning Methods:
Advanced machine learning techniques can work directly with raw text, using tools like neural networks that understand sequences of letters or words.
Bag of Words Model#
A common way to represent text is the bag of words model, which treats text like a bag of words, ignoring their order.
Vocabulary:
Create a list of words to track, called the vocabulary \( V \):
$$ V = \{\text{church}, \text{doctor}, \text{fervently}, \text{purple}, \text{slow}, \dots\} $$
Feature Vector:
Turn a text \( x \) into a list of 1s and 0s, with one number for each word in \( V \). This list is \( \phi(x) \), and its length is the number of words in \( V \), written \( |V| \):
$$ \phi(x) = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 1 \\ \vdots \end{pmatrix} \begin{array}{l} \text{church} \\ \text{doctor} \\ \text{fervently} \\ \vdots \\ \text{purple} \\ \vdots \end{array} $$
Each number \( \phi(x)_j \) is \( 1 \) if the \( j \)-th word in \( V \) is in the text, or \( 0 \) if it’s not. For example, if “doctor” is in the text, the second number is \( 1 \).
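A minimal sketch of building such binary presence/absence vectors with scikit-learn’s `CountVectorizer(binary=True)`, using a tiny made-up vocabulary and two toy sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy sentences and a fixed, made-up vocabulary
docs = ["the doctor went to church", "a slow purple doctor"]
vocab = ["church", "doctor", "fervently", "purple", "slow"]

vectorizer = CountVectorizer(binary=True, vocabulary=vocab)
X = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())   # ['church' 'doctor' 'fervently' 'purple' 'slow']
print(X)
# [[1 1 0 0 0]
#  [0 1 0 1 1]]
```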
Building a Model to Classify Text#
For binary classification (e.g., spam vs. not spam), we create two models using a dataset with labeled examples (texts marked as spam or not):
$$ P_\theta(x|y=0) \quad \text{and} \quad P_\theta(x|y=1) $$
\( P_\theta(x|y=0) \) tells us how likely a text \( x \) is to be not spam (\( y=0 \)).
\( P_\theta(x|y=1) \) tells us how likely it is to be spam (\( y=1 \)).
The \( \theta \) means these probabilities depend on parameters we’ll learn, and we use the bag-of-words representation for \( x \).
What is a Categorical Distribution?#
A Categorical distribution is like rolling a die with \( K \) sides, where each side has a probability:
$$ P_\theta(x = j) = \theta_j $$
\( x \) can be one of \( K \) outcomes (like 1, 2, …, \( K \)).
\( \theta_j \) is the probability of outcome \( j \).
When \( K=2 \) (e.g., yes/no), it’s called a Bernoulli distribution, like flipping a coin.
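As a small illustrative sketch (using NumPy with toy probabilities), sampling from a Categorical and a Bernoulli distribution looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Categorical: a K-sided die with probabilities theta_1..theta_K (toy values)
theta = [0.2, 0.5, 0.3]                      # K = 3 outcomes
rolls = rng.choice(len(theta), size=10, p=theta)
print(rolls)                                 # outcomes in {0, 1, 2}

# Bernoulli: the K = 2 special case, like flipping a biased coin
flips = rng.binomial(n=1, p=0.7, size=10)
print(flips)                                 # 1s with probability 0.7
```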
Naive Bayes Simplification#
Text data can have many words (e.g., 10,000), making it hard to calculate probabilities for every possible combination of words. The Naive Bayes assumption simplifies this by assuming each word’s presence is independent of the others:
$$ P_\theta(x = x' | y) = \prod_{j=1}^d P_\theta(x_j = x_j' | y) $$
\( x \) is the text’s feature vector, and \( x' \) is a specific vector (e.g., \( [0, 1, 0, \dots] \)).
\( P_\theta(x = x' | y) \) is the probability that a text has the exact word pattern \( x' \) given its class \( y \).
The product \( \prod \) means we multiply the probabilities for each word’s presence or absence.
For example, if a text has:
$$ x = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \begin{array}{l} \text{church} \\ \text{doctor} \\ \text{fervently} \\ \vdots \\ \text{purple} \end{array} $$
the probability it belongs to class \( y \) (e.g., spam) is:
$$ P_\theta \left( x = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \middle| y \right) = P_\theta(x_1=0|y) \cdot P_\theta(x_2=1|y) \cdot \dots \cdot P_\theta(x_d=0|y) $$
\( P_\theta(x_1=0|y) \) is the chance “church” is absent given class \( y \).
\( P_\theta(x_2=1|y) \) is the chance “doctor” is present, and so on.
Parameters in Naive Bayes#
Each word’s probability \( P_\theta(x_j | y=k) \) is a Bernoulli distribution with a parameter \( \psi_{jk} \):
$$ P_\theta(x_j = 1 | y=k) = \psi_{jk}, \quad P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk} $$
\( \psi_{jk} \) is the probability that word \( j \) appears in class \( k \).
If we have \( K \) classes (e.g., spam and not spam, so \( K=2 \)) and \( d \) words, we need \( Kd \) parameters, which is much simpler than calculating every possible text combination.
Naive Bayes for Bag of Words#
The chance a text \( x \) has a specific word pattern \( x' \) in class \( k \):
$$ P_\theta(x = x' | y=k) = \prod_{j=1}^d P_\theta(x_j = x_j' | y=k) $$
Each word’s presence is a Bernoulli:
$$ P_\theta(x_j = 1 | y=k) = \psi_{jk}, \quad P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk} $$
We need \( Kd \) parameters, where \( \psi_{jk} \) is the chance word \( j \) is in a text of class \( k \).
Does Naive Bayes Work Well?#
Problem: It assumes words don’t affect each other, but in real life, words like “bank” and “account” often appear together in spam.
Effect: This can make the model’s probabilities too extreme (over- or under-confident).
Benefit: Even with this flaw, Naive Bayes often classifies texts accurately, making it a practical choice.
Setting Up Class Probabilities#
We need a starting guess for how likely each class is, called the prior \( P_\theta(y=k) \):
Use a Categorical distribution with parameters \( \vec{\phi} = (\phi_1, \dots, \phi_K) \):
$$ P_\theta(y=k) = \phi_k $$
We can learn \( \phi_k \) from the data, like counting how many texts are spam vs. not spam.
Bernoulli Naive Bayes Model#
The Bernoulli Naive Bayes model works with binary data \( x \in \{0,1\}^d \) (e.g., bag-of-words where each word is present or absent):
Parameters: \( \theta = (\phi_1, \dots, \phi_K, \psi_{11}, \dots, \psi_{dK}) \), totaling \( K(d+1) \) parameters.
Class Prior:
$$ P_\theta(y) = \text{Categorical}(\phi_1, \phi_2, \dots, \phi_K) $$
Word Probabilities:
$$ P_\theta(x_j = 1 | y=k) = \text{Bernoulli}(\psi_{jk}), \qquad P_\theta(x | y=k) = \prod_{j=1}^d P_\theta(x_j | y=k) $$
Learning the Model’s Parameters#
Learning Class Probabilities \( \phi_k \)#
Suppose we have \( n \) texts, and \( n_k \) of them belong to class \( k \) (e.g., \( n_k \) spam emails).
The best estimate for \( \phi_k \) is:
$$ \phi_k = \frac{n_k}{n} $$
This is just the fraction of texts in class \( k \). For example, if 30 out of 100 emails are spam, \( \phi_{\text{spam}} = 0.3 \).
Learning Word Probabilities \( \psi_{jk} \)#
For each class \( k \) and word \( j \), \( \psi_{jk} \) is the chance that word \( j \) appears in texts of class \( k \):
$$ \psi_{jk} = \frac{n_{jk}}{n_k} $$
\( n_{jk} \) is the number of texts in class \( k \) that contain word \( j \).
Example: If 20 out of 50 spam emails contain “doctor”, then \( \psi_{\text{doctor,spam}} = \frac{20}{50} = 0.4 \).
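A minimal sketch of both counting estimates, on a hypothetical binary matrix `X` and label vector `y` (names and values are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: each row is a text as a binary bag-of-words vector
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])                         # class label of each text
n = len(y)

phi = {k: (y == k).sum() / n for k in (0, 1)}      # phi_k = n_k / n
psi = {k: X[y == k].mean(axis=0) for k in (0, 1)}  # psi_jk = n_jk / n_k, one entry per word

print(phi[1])   # 0.5   (two of the four texts are in class 1)
print(psi[1])   # [1.  0.5 0.5]
```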
Making Predictions#
To classify a new text, use Bayes’ rule to find the most likely class:
$$ \arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y) $$
Calculate \( P_\theta(x|y=k)P_\theta(y=k) \) for each class \( k \) (e.g., spam and not spam), and pick the class with the highest value.
Think of it like scoring how “spam-like” or “not-spam-like” the text is, then choosing the best match.
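A hedged sketch of this scoring rule, reusing the `phi` and `psi` dictionaries from the counting sketch above (the `nb_predict` helper is a hypothetical name, not something from the text):

```python
import numpy as np

def nb_predict(x_new, phi, psi):
    """Score each class with P(x | y=k) * P(y=k) under the Bernoulli
    Naive Bayes model and return the highest-scoring class plus all scores."""
    scores = {}
    for k in phi:
        # Bernoulli likelihood: psi_jk where x_j = 1, (1 - psi_jk) where x_j = 0
        likelihood = np.prod(np.where(x_new == 1, psi[k], 1 - psi[k]))
        scores[k] = float(likelihood * phi[k])
    return max(scores, key=scores.get), scores

# Reusing phi and psi estimated in the previous sketch:
label, scores = nb_predict(np.array([1, 0, 1]), phi, psi)
print(label, scores)   # 1 {0: 0.0, 1: 0.125}
```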
Example Scenario#
Suppose we have a small dataset of emails, and we want to classify them into three categories: Spam (\( y = 1 \)), Work (\( y = 2 \)), or Personal (\( y = 3 \)). Each email is represented as a bag-of-words feature vector based on a small vocabulary of four words: “deal,” “meeting,” “friend,” and “family” (\( d = 4 \)). We’ll use the Bernoulli Naive Bayes model to:
Represent the emails as feature vectors.
Estimate the model parameters (\( \phi_k \) and \( \psi_{jk} \)).
Predict the class of a new email.
Example: Classifying Emails with Naive Bayes (Three Categories)#
This example applies the Naive Bayes model from the provided text to classify emails into three categories: Spam, Work, or Personal, using a small vocabulary. We’ll use the same math and concepts, showing how they work with \( K = 3 \) classes.
Step 1: Representing Emails as Feature Vectors#
Each email \( x \) is a sequence of words. We convert it into a \( d \)-dimensional feature vector \( \phi(x) \) using the bag of words model.
Vocabulary#
Define a vocabulary \( V \) with \( d = 4 \) words:
$$ V = \{\text{deal}, \text{meeting}, \text{friend}, \text{family}\} $$
Feature Vector#
For an email \( x \), we create a binary vector \( \phi(x) \in \{0,1\}^4 \):
$$ \phi(x) = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} \begin{array}{l} \text{deal} \\ \text{meeting} \\ \text{friend} \\ \text{family} \end{array} $$
\( x_j = 1 \) if word \( j \) (e.g., “deal”) is in the email, else \( x_j = 0 \).
Example: If an email contains “meeting” and “friend” but not “deal” or “family,” its feature vector is:
$$ \phi(x) = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix} $$
Step 2: Dataset#
Suppose we have a small training dataset with \( n = 10 \) emails, labeled as Spam (\( y = 1 \)), Work (\( y = 2 \)), or Personal (\( y = 3 \)):
| Email ID | Words Present | Class (\( y \)) |
|---|---|---|
| 1 | deal, friend | Spam (1) |
| 2 | deal | Spam (1) |
| 3 | meeting, friend | Work (2) |
| 4 | meeting | Work (2) |
| 5 | meeting, family | Work (2) |
| 6 | friend, family | Personal (3) |
| 7 | family | Personal (3) |
| 8 | friend | Personal (3) |
| 9 | deal, meeting | Spam (1) |
| 10 | friend, family | Personal (3) |
Feature Vectors#
Each email is converted to a 4D binary vector:
Email 1: \( x^{(1)} = [1, 0, 1, 0] \) (deal, friend)
Email 2: \( x^{(2)} = [1, 0, 0, 0] \) (deal)
Email 3: \( x^{(3)} = [0, 1, 1, 0] \) (meeting, friend)
…
Email 10: \( x^{(10)} = [0, 0, 1, 1] \) (friend, family)
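For convenience, here is a small sketch that builds all ten feature vectors with NumPy (the `emails_words` structure is just a convenient encoding of the table above):

```python
import numpy as np

vocab = ["deal", "meeting", "friend", "family"]

emails_words = [            # words present in each training email, from the table above
    {"deal", "friend"}, {"deal"}, {"meeting", "friend"}, {"meeting"},
    {"meeting", "family"}, {"friend", "family"}, {"family"}, {"friend"},
    {"deal", "meeting"}, {"friend", "family"},
]
y = np.array([1, 1, 2, 2, 2, 3, 3, 3, 1, 3])   # 1 = Spam, 2 = Work, 3 = Personal

X = np.array([[1 if w in words else 0 for w in vocab] for words in emails_words])
print(X[0])   # [1 0 1 0] -> Email 1: deal, friend
```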
Step 3: Bernoulli Naive Bayes Model#
We use the Bernoulli Naive Bayes model for binary data \( x \in \{0,1\}^4 \), with three classes (\( K = 3 \)).
Model Components#
Parameters: \( \theta = (\phi_1, \phi_2, \phi_3, \psi_{11}, \psi_{21}, \dots, \psi_{43}) \), with \( K(d+1) = 3(4+1) = 15 \) parameters.
Class Prior:
$$ P_\theta(y) = \text{Categorical}(\phi_1, \phi_2, \phi_3) $$
\( \phi_k \) is the probability of class \( k \).
Feature Likelihood:
$$ P_\theta(x_j = 1 | y=k) = \text{Bernoulli}(\psi_{jk}), \qquad P_\theta(x | y=k) = \prod_{j=1}^4 P_\theta(x_j | y=k) $$
\( \psi_{jk} \) is the probability that word \( j \) is present in class \( k \).
\( P_\theta(x_j = 0 | y=k) = 1 - \psi_{jk} \).
Naive Bayes Assumption#
We assume words are independent given the class:
$$ P_\theta(x = x' | y=k) = \prod_{j=1}^4 P_\theta(x_j = x_j' | y=k) $$
For a vector \( x' = [0, 1, 1, 0] \):
$$ P_\theta \left( x = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix} \middle| y=k \right) = P_\theta(x_1=0|y=k) \cdot P_\theta(x_2=1|y=k) \cdot P_\theta(x_3=1|y=k) \cdot P_\theta(x_4=0|y=k) $$
Step 4: Learning Parameters#
Learning Class Priors \( \phi_k \)#
Count the number of emails in each class:
Spam (\( y = 1 \)): \( n_1 = 3 \) (Emails 1, 2, 9)
Work (\( y = 2 \)): \( n_2 = 3 \) (Emails 3, 4, 5)
Personal (\( y = 3 \)): \( n_3 = 4 \) (Emails 6, 7, 8, 10)
Total emails: \( n = 10 \)
The prior probabilities are:
$$ \phi_k = \frac{n_k}{n} $$
\( \phi_1 = \frac{3}{10} = 0.3 \) (Spam)
\( \phi_2 = \frac{3}{10} = 0.3 \) (Work)
\( \phi_3 = \frac{4}{10} = 0.4 \) (Personal)
Learning Feature Parameters \( \psi_{jk} \)#
For each class \( k \) and word \( j \), compute:
$$ \psi_{jk} = \frac{n_{jk}}{n_k} $$
\( n_{jk} \) is the number of emails in class \( k \) where word \( j \) is present.
\( n_k \) is the number of emails in class \( k \).
Spam (\( k = 1 \), \( n_1 = 3 \))#
Word 1 (deal): Present in Emails 1, 2, 9 (\( n_{11} = 3 \)), so \( \psi_{11} = \frac{3}{3} = 1.0 \)
Word 2 (meeting): Present in Email 9 (\( n_{21} = 1 \)), so \( \psi_{21} = \frac{1}{3} \approx 0.333 \)
Word 3 (friend): Present in Email 1 (\( n_{31} = 1 \)), so \( \psi_{31} = \frac{1}{3} \approx 0.333 \)
Word 4 (family): Absent (\( n_{41} = 0 \)), so \( \psi_{41} = \frac{0}{3} = 0.0 \)
Work (\( k = 2 \), \( n_2 = 3 \))#
Word 1 (deal): Absent (\( n_{12} = 0 \)), so \( \psi_{12} = \frac{0}{3} = 0.0 \)
Word 2 (meeting): Present in Emails 3, 4, 5 (\( n_{22} = 3 \)), so \( \psi_{22} = \frac{3}{3} = 1.0 \)
Word 3 (friend): Present in Email 3 (\( n_{32} = 1 \)), so \( \psi_{32} = \frac{1}{3} \approx 0.333 \)
Word 4 (family): Present in Email 5 (\( n_{42} = 1 \)), so \( \psi_{42} = \frac{1}{3} \approx 0.333 \)
Personal (\( k = 3 \), \( n_3 = 4 \))#
Word 1 (deal): Absent (\( n_{13} = 0 \)), so \( \psi_{13} = \frac{0}{4} = 0.0 \)
Word 2 (meeting): Absent (\( n_{23} = 0 \)), so \( \psi_{23} = \frac{0}{4} = 0.0 \)
Word 3 (friend): Present in Emails 6, 8, 10 (\( n_{33} = 3 \)), so \( \psi_{33} = \frac{3}{4} = 0.75 \)
Word 4 (family): Present in Emails 6, 7, 10 (\( n_{43} = 3 \)), so \( \psi_{43} = \frac{3}{4} = 0.75 \)
Parameter Summary#
Priors: \( \phi_1 = 0.3 \), \( \phi_2 = 0.3 \), \( \phi_3 = 0.4 \)
Feature probabilities:
| Word \( j \) | Spam (\( \psi_{j1} \)) | Work (\( \psi_{j2} \)) | Personal (\( \psi_{j3} \)) |
|---|---|---|---|
| deal | 1.0 | 0.0 | 0.0 |
| meeting | 0.333 | 1.0 | 0.0 |
| friend | 0.333 | 0.333 | 0.75 |
| family | 0.0 | 0.333 | 0.75 |
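As a sanity check, a short sketch that recomputes these estimates from the `X`, `y`, and `vocab` objects defined in the earlier dataset sketch (assuming that cell was run):

```python
# Reusing X, y, and vocab from the dataset sketch above
classes = [1, 2, 3]

phi = {k: float((y == k).mean()) for k in classes}   # class priors n_k / n
psi = {k: X[y == k].mean(axis=0) for k in classes}   # word frequencies n_jk / n_k

print(phi)                                   # {1: 0.3, 2: 0.3, 3: 0.4}
for k in classes:
    print(k, dict(zip(vocab, psi[k].round(3).tolist())))
# 1 {'deal': 1.0, 'meeting': 0.333, 'friend': 0.333, 'family': 0.0}
# 2 {'deal': 0.0, 'meeting': 1.0, 'friend': 0.333, 'family': 0.333}
# 3 {'deal': 0.0, 'meeting': 0.0, 'friend': 0.75, 'family': 0.75}
```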
Step 5: Predicting a New Email#
Suppose a new email contains “friend” and “family” (\( x' = [0, 0, 1, 1] \)). We predict its class using Bayes’ rule:
$$ \arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y) $$
Compute \( P_\theta(x = x' | y=k)P_\theta(y=k) \) for each class \( k \).
Spam (\( k = 1 \))#
\( P_\theta(x_1=0|y=1) = 1 - \psi_{11} = 1 - 1.0 = 0.0 \)
\( P_\theta(x_2=0|y=1) = 1 - \psi_{21} = 1 - 0.333 = 0.667 \)
\( P_\theta(x_3=1|y=1) = \psi_{31} = 0.333 \)
\( P_\theta(x_4=1|y=1) = \psi_{41} = 0.0 \)
Since \( P_\theta(x_1=0|y=1) = 0.0 \), the product is:
$$ P_\theta(x | y=1) = 0.0 \cdot 0.667 \cdot 0.333 \cdot 0.0 = 0.0 $$
$$ P_\theta(x | y=1)P_\theta(y=1) = 0.0 \cdot 0.3 = 0.0 $$
Work (\( k = 2 \))#
\( P_\theta(x_1=0|y=2) = 1 - \psi_{12} = 1 - 0.0 = 1.0 \)
\( P_\theta(x_2=0|y=2) = 1 - \psi_{22} = 1 - 1.0 = 0.0 \)
\( P_\theta(x_3=1|y=2) = \psi_{32} = 0.333 \)
\( P_\theta(x_4=1|y=2) = \psi_{42} = 0.333 \)
Since \( P_\theta(x_2=0|y=2) = 0.0 \):
$$ P_\theta(x | y=2) = 1.0 \cdot 0.0 \cdot 0.333 \cdot 0.333 = 0.0 $$
$$ P_\theta(x | y=2)P_\theta(y=2) = 0.0 \cdot 0.3 = 0.0 $$
Personal (\( k = 3 \))#
\( P_\theta(x_1=0|y=3) = 1 - \psi_{13} = 1 - 0.0 = 1.0 \)
\( P_\theta(x_2=0|y=3) = 1 - \psi_{23} = 1 - 0.0 = 1.0 \)
\( P_\theta(x_3=1|y=3) = \psi_{33} = 0.75 \)
\( P_\theta(x_4=1|y=3) = \psi_{43} = 0.75 \)
The product is:
$$ P_\theta(x | y=3) = 1.0 \cdot 1.0 \cdot 0.75 \cdot 0.75 = 0.5625 $$
$$ P_\theta(x | y=3)P_\theta(y=3) = 0.5625 \cdot 0.4 = 0.225 $$
Prediction#
Compare the scores:
Spam: \( 0.0 \)
Work: \( 0.0 \)
Personal: \( 0.225 \)
The highest score is for Personal (\( y = 3 \)), so we predict the email is Personal.
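The same scores can be reproduced numerically, reusing `phi` and `psi` from the sketches above:

```python
import numpy as np

# Continuing with phi and psi estimated from the example dataset
x_new = np.array([0, 0, 1, 1])   # "friend" and "family" present

scores = {}
for k in (1, 2, 3):
    likelihood = np.prod(np.where(x_new == 1, psi[k], 1 - psi[k]))
    scores[k] = float(likelihood * phi[k])

print(scores)                       # {1: 0.0, 2: 0.0, 3: 0.225}
print(max(scores, key=scores.get))  # 3 -> Personal
```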
Why This Works for More Than Two Categories#
The Naive Bayes model scales to \( K > 2 \) classes by:
Estimating a prior \( \phi_k \) for each class \( k = 1, 2, \dots, K \).
Learning \( \psi_{jk} \) for each word \( j \) and class \( k \), resulting in \( Kd \) feature parameters.
Computing \( P_\theta(x | y=k)P_\theta(y=k) \) for all \( K \) classes during prediction.
In this example, \( K = 3 \), but the same process applies for any \( K \). For \( K = 4 \) (e.g., adding a “Promotional” class), you’d count emails in the new class and estimate \( \phi_4 \) and \( \psi_{j4} \) similarly.
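If you prefer a library implementation, scikit-learn’s `BernoulliNB` follows the same model. One caveat: it applies Laplace smoothing by default (`alpha=1.0`), so its fitted probabilities only match the hand-computed, unsmoothed estimates when `alpha` is made very small. A hedged sketch, reusing `X` and `y` from the dataset sketch above:

```python
from sklearn.naive_bayes import BernoulliNB

# Continuing with X, y from the dataset sketch above.
# A tiny alpha approximates the unsmoothed maximum-likelihood estimates
# worked out by hand; the default alpha=1.0 would give slightly different numbers.
clf = BernoulliNB(alpha=1e-9)
clf.fit(X, y)

x_new = [[0, 0, 1, 1]]              # "friend" and "family" present
print(clf.predict(x_new))            # [3] -> Personal
print(clf.predict_proba(x_new).round(3))   # approximately [[0. 0. 1.]]
```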
# Your code here