WDM - Week - I
1. Apply Naive Bayes Classification algorithm for a given textual dataset (1 week).
// It looks like you're setting up a text classification pipeline using Python, and you're importing several
useful libraries. Let's walk through what you're likely trying to accomplish and how you can complete the
necessary steps for text classification.
1. pandas: A powerful library for data manipulation and analysis. You'll likely use it to work with
structured data, such as loading a CSV or dataset.
2. seaborn: A data visualization library that builds on matplotlib. You can use it to plot the results
of your analysis or metrics (e.g., confusion matrix, accuracy).
3. re: The regular expression library in Python. It will likely be useful for preprocessing and
cleaning the text data.
4. ENGLISH_STOP_WORDS: From sklearn, this set contains common stop words in English that
you can remove during text preprocessing.
5. CountVectorizer: This is a text vectorizer that converts a collection of text documents into a
matrix of token counts. It's a key component for feature extraction in text classification.
6. train_test_split: From sklearn, used to split your dataset into training and test sets.
7. MultinomialNB: This is a Naive Bayes classifier for multinomially distributed data, often used
for text classification tasks, particularly when the features are word counts.
8. accuracy_score, classification_report: Evaluation metrics to assess the performance of your
classification model.
9. load_files: A utility function to load text files into a dataset for classification.
10. make_pipeline: This allows you to create a pipeline where preprocessing and modeling steps are
encapsulated, making it easier to apply to different datasets.
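Collected in one place, the import block these notes assume would look something like this (a minimal sketch; all names come from the list above):

import re
import pandas as pd
import seaborn as sns

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline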
Here's a complete example workflow using your imports, showing how to load a dataset, preprocess text,
build a model, and evaluate its performance.
Let’s assume you're working with a text dataset, which you might load using load_files (for example, a
folder with text files categorized into subdirectories).
1. Load the Dataset
# Load the dataset (assuming you're working with text files in different folders)
dataset = load_files('path_to_your_dataset', encoding='utf-8')
2. Text Preprocessing
You can clean the text by removing unwanted characters, stop words, etc. Here's how to do that:
# Preprocess text by removing non-alphanumeric characters and stop words
def preprocess_text(text):
    # Remove non-alphanumeric characters
    text = re.sub(r'\W+', ' ', text)
    # Remove stop words
    text = ' '.join([word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS])
    return text
3. Split the Data
Now, split the dataset into training and test sets using train_test_split.
# Clean each document and take the labels from the loaded dataset
X = [preprocess_text(doc) for doc in dataset.data]
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Create a Pipeline
You can use a pipeline to chain together text vectorization and the classification model (MultinomialNB).
# Create a pipeline with CountVectorizer and MultinomialNB
pipeline = make_pipeline(
    CountVectorizer(),   # Convert text to a bag of words
    MultinomialNB()      # Naive Bayes classifier
)
Train the model on the training set, then use it to make predictions on the test set and evaluate it.
# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)
You can visualize the results, such as the confusion matrix, using seaborn for a better understanding:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix from the test-set predictions
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dataset.target_names,
            yticklabels=dataset.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
Summary:
This framework is flexible enough for you to experiment with other vectorizers (like TfidfVectorizer) or
classifiers (e.g., LogisticRegression, SVM) as needed. Let me know if you need further guidance or if
there’s a specific part of the pipeline you want to expand upon!
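For example, swapping in TfidfVectorizer is a one-line change to the same pipeline (a sketch, reusing the X_train/X_test split from above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same pipeline, but terms are weighted by TF-IDF instead of raw counts
tfidf_pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)
tfidf_pipeline.fit(X_train, y_train)
print(accuracy_score(y_test, tfidf_pipeline.predict(X_test)))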
// It looks like you are trying to read a CSV file using pandas and display the contents of the DataFrame.
If you've run this code, you should be able to view the contents of the file as a DataFrame.
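A minimal sketch of that step (the filename here is hypothetical; replace it with the actual path to your CSV):

import pandas as pd

# Hypothetical filename for the students-performance dataset
df = pd.read_csv('StudentsPerformance.csv')
df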
This will load the CSV file into the df DataFrame and display it. If you want to explore the dataset further
or need help with the next steps, here are a few things you can do:
If you want to check if there are any missing values in the dataset:
# Check for missing values
df.isnull().sum()
To get a summary of the DataFrame, including the column names, non-null counts, and data types:
# Summary of the DataFrame
df.info()
If you want basic descriptive statistics (for numerical columns), you can use:
# Summary statistics
df.describe()
If you encounter any issues or want to perform a specific analysis on this dataset (e.g., visualizing student
performance), let me know how you'd like to proceed!
# Explore data
df.head()
df.tail()
df.info()
df.describe()
// Let's go over the methods you're calling on the DataFrame (df) in your code:
1. df.head(): Displays the first 5 rows of the DataFrame by default. It gives a quick
overview of the first few records in your dataset.
2. df.tail(): Similarly, df.tail() displays the last 5 rows of the DataFrame, giving you a sense of
the last few records.
3. df.info(): This method provides a concise summary of the DataFrame, including:
o The number of entries (rows).
o The column names.
o Non-null counts (which help to identify if there are missing values in the columns).
o The data types of each column.
4. df.describe(): Generates descriptive statistics such as count, mean, standard deviation,
minimum, quartiles, and maximum. It only applies to numerical columns; if there are
categorical columns, they will be excluded from the summary.
Example output might look like this for a DataFrame with columns age and student_id:
student_id age
count 100.000000 100.000000
mean 1050.500000 18.50
std 29.011493 1.50
min 1001.000000 17.00
25% 1025.500000 17.75
50% 1050.500000 18.50
75% 1075.500000 19.25
max 1100.000000 20.00
def unique_characters(text):
    return set(text)
1. unique_characters(text) function: This function takes a string (text), and set(text) creates a set
of unique characters from the string. Since a set does not allow duplicate values, it will
automatically remove any repeated characters.
2. apply(unique_characters): The apply() method is used to apply the unique_characters() function
to each element (each row) in the race_ethnicity column.
3. Storing the Result: The result of applying this function is assigned to a new column called
unique_characters.
Example:
When you apply the function, the new column unique_characters will show a set of unique characters for
each race_ethnicity value.
Expected Output:
unique_characters: For each row, it contains a set of the unique characters in the corresponding
race_ethnicity value.
o For "Asian", the unique characters are {'A', 'i', 'n', 's'}.
o For "Hispanic", the unique characters are {'a', 'p', 'n', 's', 'i', 'c', 'h'}.
Complete Code:
import pandas as pd

# Sample DataFrame
data = {
    'student_id': [1001, 1002, 1003],
    'gender': ['Male', 'Female', 'Male'],
    'age': [18, 19, 17],
    'race_ethnicity': ['Asian', 'Hispanic', 'Caucasian']
}
df = pd.DataFrame(data)

# Apply the function to each value in the race_ethnicity column
df['unique_characters'] = df['race_ethnicity'].apply(unique_characters)
print(df)
Output:
student_id gender age race_ethnicity unique_characters
0 1001 Male 18 Asian {'A', 's', 'i', 'a', 'n'}
1 1002 Female 19 Hispanic {'H', 'i', 's', 'p', 'a', 'n', 'c'}
2 1003 Male 17 Caucasian {'C', 'a', 'u', 'c', 's', 'i', 'n'}
Notes:
The output for each race_ethnicity value is a set, which contains only the unique characters from
the string.
If you want the characters in a specific order or formatted differently, you could modify the
unique_characters() function to return a sorted list or a string, for example (see the sketch after these notes).
If your column contains NaN values, you'll need to handle those, either by filling them or filtering
them out before applying the function.
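For instance, a variant that returns the unique characters as a sorted string and handles NaN values (a sketch; the name sorted_unique_characters is hypothetical):

# Return unique characters as a sorted string instead of a set
def sorted_unique_characters(text):
    if pd.isna(text):              # handle missing values before processing
        return ''
    return ''.join(sorted(set(text)))

df['unique_characters_sorted'] = df['race_ethnicity'].apply(sorted_unique_characters)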
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags, then strip non-alphabetic characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Collapse repeated whitespace
    return ' '.join(text.split())
The function preprocess_text() that you have defined is aimed at cleaning and preprocessing text data.
Let's break down the steps in the function:
Steps in preprocess_text():
1. Convert to lowercase:
text = text.lower()
This ensures that the text is in lowercase, which helps make comparisons case-insensitive
and reduces the dimensionality of the text data when working with text analysis or machine
learning models.
2. Remove HTML tags and non-alphabetic characters:
text = re.sub(r'[^a-zA-Z\s]', ' ', text)
This regex pattern removes anything that is not an alphabetical character (a-zA-Z) or whitespace
(\s). It effectively removes special characters like punctuation, numbers, and other non-alphabetic
symbols.
Example of Usage:
Let's say we have a string with HTML tags, special characters, and mixed case:
text = "<p>This is a <b>Test</b>! @123 Check this text...</p>"
cleaned_text = preprocess_text(text)
print(cleaned_text)
Output:
this is a test check this text
If you have a column in a pandas DataFrame (for example, text_column) that contains text, you can apply
the preprocess_text() function to that column like this:
import pandas as pd

# Sample DataFrame
data = {'text_column': ["<p>This is a <b>Test</b>! @123 Check this text...</p>",
                        "<div>Another <i>Example</i> with #special characters!!</div>"]}
df = pd.DataFrame(data)

# Apply the preprocessing function to the text column
df['cleaned_text'] = df['text_column'].apply(preprocess_text)
print(df)
Expected Output:
text_column cleaned_text
0 <p>This is a <b>Test</b>! @123 Check this text...</p> this is a test check this text
1 <div>Another <i>Example</i> with #special characters!!</div> another example with special characters
Additional Notes:
If your text contains other unwanted elements (e.g., stopwords, extra spaces), you could extend
the preprocess_text() function to handle those as well.
You could also add functionality to handle emojis, URLs, or unicode characters if needed.
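As an illustration, an extended version might also strip URLs and stop words (a sketch; the extra regex and stop-word filter are assumptions, not part of the original function):

def preprocess_text_extended(text):
    text = text.lower()
    # Remove URLs before stripping punctuation
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Remove HTML tags and non-alphabetic characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Drop English stop words and collapse whitespace
    return ' '.join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)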
Your code uses train_test_split from sklearn.model_selection to split the dataset into training and testing
sets. Let's break down each part of the code:
Code Explanation:
from sklearn.model_selection import train_test_split

x = df['race_ethnicity']
y = df['parental_level_of_education']

x, x_test, y, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)
Breakdown:
1. Importing train_test_split:
o train_test_split is used to split data into training and testing sets. It's one of the most
common functions for splitting data in machine learning workflows.
2. Defining Features and Target:
o x = df['race_ethnicity']: This is the feature (or independent variable), which represents the
race/ethnicity of the students.
o y = df['parental_level_of_education']: This is the target (or dependent variable), which
represents the parental level of education.
3. Splitting the Data:
o stratify=y: This ensures that the class distribution in the target variable (y) is preserved
in both the training and test sets. In other words, the distribution of
parental_level_of_education will be similar in both the training and testing datasets.
o test_size=0.25: This indicates that 25% of the data will be used for testing, and the
remaining 75% will be used for training.
o random_state=42: This sets a seed for the random number generator, ensuring
reproducibility of the split. Every time you run the code with this seed, the split will be
the same.
Result:
x will contain the race_ethnicity values for the training set (75% of the data).
x_test will contain the race_ethnicity values for the testing set (25% of the data).
y will contain the parental_level_of_education values for the training set.
y_test will contain the parental_level_of_education values for the testing set.
For example, if your DataFrame (df) contains 100 rows, the code will place 75 rows in the training set
and 25 rows in the test set. The stratify=y ensures that the proportion of parental_level_of_education in
both the training and testing sets will be the same as in the original dataset.
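To see the effect of stratify=y, you can compare the class proportions in the two target splits (a quick check; value_counts(normalize=True) returns proportions instead of counts):

# Class proportions should be roughly equal in the training and test targets
print(y.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))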
To check that the split was done correctly, you can print the shapes of the resulting datasets:
print(f"Training features (x): {x.shape}")
print(f"Testing features (x_test): {x_test.shape}")
print(f"Training target (y): {y.shape}")
print(f"Testing target (y_test): {y_test.shape}")
The numbers will vary depending on the size of your original dataset.
Next Steps: Now that you've split the data into training and testing sets, you can proceed with building
and training a machine learning model (e.g., using a classifier like RandomForestClassifier,
LogisticRegression, etc.) to predict the parental level of education based on race/ethnicity.
Code Explanation:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()
Breakdown:
1. CountVectorizer(stop_words='english'):
o CountVectorizer is a method used to convert a collection of text documents into a matrix
of token counts (i.e., word counts).
o stop_words='english': This parameter tells the vectorizer to ignore common English
words (e.g., "the", "and", "is") which are generally not useful for text analysis. This helps
reduce the dimensionality of the resulting feature matrix and improve model
performance.
2. vec.fit_transform(x):
o fit_transform() is used on the training data (x). It first fits the vectorizer to the text data
by learning the vocabulary and then transforms the text data into a numerical matrix of
word counts.
o toarray(): This converts the sparse matrix produced by fit_transform() into a dense array.
This is often necessary when working with some machine learning algorithms, which
require a dense format.
3. vec.transform(x_test):
o transform() is used on the test data (x_test). The key point here is that we only transform
the test data using the vocabulary learned from the training data. This ensures that the test
data is represented using the same features (vocabulary) that the model will learn from
the training data.
o Again, .toarray() converts the sparse matrix to a dense array.
Example:
Let's assume that x contains the text data (e.g., the race_ethnicity column from your DataFrame), and here
are a couple of sample rows:
x = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic']
The CountVectorizer will convert these text entries into a matrix of word counts.
Each row corresponds to a sample in the training data, and each column corresponds to a word in
the vocabulary. The values represent the count of each word in the corresponding text.
For the test data (x_test), we transform the data using the same vocabulary learned from the
training data. If there are any new words in the test set that were not seen in the training data, they
will be ignored (since they are not in the vocabulary).
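A small sketch of what this looks like for the sample above (variable names here are illustrative; get_feature_names_out requires scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

sample = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic']
cv = CountVectorizer()
counts = cv.fit_transform(sample)

print(cv.get_feature_names_out())   # vocabulary learned from the sample, e.g. ['asian' 'caucasian' 'hispanic']
print(counts.toarray())             # one row per sample, one column per vocabulary word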
Output:
After running your code, x and x_test will be matrices of word counts from the text data in race_ethnicity
(the training and test data).
Additional Notes:
1. Dimensionality: The resulting matrices (x and x_test) will have as many columns as the number
of unique words (tokens) in the vocabulary learned from the training data. Depending on the size
of your dataset and the variety of words, this could result in a very high-dimensional feature
space.
2. Sparse Matrices: Since CountVectorizer typically creates a sparse matrix (many zeros due to the
large vocabulary), you might want to use a sparse matrix format instead of converting it to a
dense array, especially if your dataset is large. This is often more memory-efficient.
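For example, simply skipping .toarray() keeps the scipy sparse representation (a sketch; MultinomialNB accepts sparse input directly):

# Keep the sparse matrices returned by the vectorizer
x = vec.fit_transform(x)
x_test = vec.transform(x_test)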
You can check the shapes of the transformed data to verify everything is correct:
print(f"Training data shape (x): {x.shape}")
print(f"Testing data shape (x_test): {x_test.shape}")
This will output the number of samples (rows) and features (columns) in both your training and testing
datasets.
Next Steps:
Now that you’ve transformed your text data into numerical features, you can proceed with training a
machine learning model (e.g., Naive Bayes, Logistic Regression) on the training data (x, y), and evaluate
it using the test data (x_test, y_test).
# Train the model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x, y)
The code you've written applies the Multinomial Naive Bayes classifier to the data you've prepared,
which includes the text feature data (x) and the target variable (y). Let's break down what happens in this
code:
Code Explanation:
from sklearn.naive_bayes import MultinomialNB
Steps:
Multinomial Naive Bayes is a probabilistic classifier that works well when the features are
counts (like word frequencies). It calculates the probability of each class (the possible values of y,
such as different parental education levels) given the input features (x), and it assigns the class
with the highest probability.
Training: The fit() function calculates probabilities based on the frequency of each word in each
class and the frequency of each class. This is used to estimate the likelihood of a particular label
(e.g., parental education level) given the features (e.g., race/ethnicity).
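If you want to inspect those per-class probabilities directly, predict_proba exposes them (a sketch; it assumes x_test has already been vectorized as above):

# Probability of each parental-education class for the first test sample
probs = model.predict_proba(x_test[:1])
for label, p in zip(model.classes_, probs[0]):
    print(f"{label}: {p:.3f}")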
Once the model is trained, you should evaluate its performance on the test data (x_test and y_test). Here's
how you can do it:
1. Make Predictions:
# Predict labels for the test data
y_pred = model.predict(x_test)
This uses the predict() method of the trained model to predict the labels for the test data (x_test).
2. Accuracy and Classification Report:
You can use metrics such as accuracy, confusion matrix, and classification report to evaluate the
performance of your model.
from sklearn.metrics import accuracy_score, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print(classification_report(y_test, y_pred))
3. Confusion Matrix:
A confusion matrix can give you a more detailed understanding of how well the model is performing with
respect to different classes.
from sklearn.metrics import confusion_matrix
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
model = make_pipeline(
    CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), preprocessor=preprocess_text),
    MultinomialNB()
)
The code you've written creates a pipeline using make_pipeline() from sklearn, which simplifies the
process of applying multiple steps in a machine learning workflow. Let's break down the code:
Code Explanation:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
make_pipeline: A helper function that chains together multiple steps of data transformation and
modeling into a single object.
CountVectorizer: A vectorizer that converts text data into a matrix of token counts (word
frequencies).
MultinomialNB: The Naive Bayes classifier for multinomially distributed data, commonly used
for text classification.
ENGLISH_STOP_WORDS: A predefined list of common English stopwords.
1. CountVectorizer:
o stop_words=list(ENGLISH_STOP_WORDS): You pass the predefined list of English
stopwords from sklearn.feature_extraction.text.ENGLISH_STOP_WORDS to
CountVectorizer, which will ignore common words like "the", "is", and "and" during text
processing.
o preprocessor=preprocess_text: This argument specifies that the preprocess_text
function (which you defined earlier) will be applied to clean the text before tokenizing it.
This means that for each text sample, the function will convert it to lowercase, remove
HTML tags, special characters, and digits.
What this does: The CountVectorizer will process each text in the dataset by cleaning it with
preprocess_text, then transforming it into a bag-of-words representation (i.e., a matrix where each
row corresponds to a document and each column represents a word count).
2. MultinomialNB:
o This is the Naive Bayes classifier that will be trained on the transformed text data to
predict the target variable (y, in your case, parental_level_of_education).
How It Works:
The pipeline combines the two steps (text preprocessing and classification) into a single object,
allowing you to fit and evaluate your model with minimal code. The beauty of using a pipeline is
that it abstracts away the intermediate steps and ensures that all steps are applied in a sequence,
especially when fitting and transforming the training and testing data.
Example Workflow:
The code below splits the data, fits the pipeline on the training set, and evaluates it on the test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split data (df['race_ethnicity'] is the feature and df['parental_level_of_education'] is the target)
x = df['race_ethnicity']
y = df['parental_level_of_education']
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)

# Fit the pipeline (preprocessing, vectorization, and Naive Bayes) on the training data
model.fit(x_train, y_train)

# Predict on the test set and evaluate
y_pred = model.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Classification report
print(classification_report(y_test, y_pred))
Output Example:
After running this code, you will get an accuracy score and a classification report that provides precision,
recall, and F1-score for each class.
Accuracy: 0.82
Benefits of Using a Pipeline:
Streamlined Workflow: You don't have to manually call each step (preprocessing and model
fitting), making the code cleaner and more maintainable.
Consistency: Ensures that the same preprocessing steps are applied both during training and
testing (or cross-validation).
Easier Experimentation: You can swap out the CountVectorizer or MultinomialNB for other
models or preprocessing methods without changing much of the code structure.
Additional Notes:
Hyperparameter Tuning: If you need to tune hyperparameters, you can easily do this by using
GridSearchCV or RandomizedSearchCV with the pipeline. This ensures that all steps (including
the vectorizer) are part of the hyperparameter search.
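A minimal sketch of that, assuming the pipeline object is the model defined above (parameter names follow make_pipeline's lowercase step names):

from sklearn.model_selection import GridSearchCV

# Tune the Naive Bayes smoothing parameter and the n-gram range together
param_grid = {
    'multinomialnb__alpha': [0.1, 0.5, 1.0],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)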
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Evaluation of the model
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
Your code looks mostly correct, but there are a couple of things we need to adjust to ensure it works
properly, especially when using the make_pipeline you've set up earlier. Here's the breakdown:
Issue:
In the line:
model.predict(vec.transform(['group C']))
You're trying to use vec.transform directly. However, since model is a pipeline that includes the
CountVectorizer, you don't need to manually call the CountVectorizer (vec.transform()) when making
predictions. The pipeline automatically handles the transformation step when you call model.predict().
Fix:
You should directly use model.predict() on the input text, and it will automatically go through the
CountVectorizer and MultinomialNB steps.
Updated Code:
# Predict directly with the pipeline; it applies CountVectorizer internally
print(model.predict(['group C']))

from sklearn.metrics import accuracy_score, classification_report
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification report (zero_division=0 avoids warnings for unpredicted classes)
print(classification_report(y_test, y_pred, zero_division=0))
Key Notes:
Predictions for New Data: model.predict(['group C']) works well with the pipeline as it
processes the new input in the same way as the training and test data.
Zero Division Handling: By setting zero_division=0 in the classification report, you're avoiding
issues that can arise if any class has zero predicted samples, which can happen when a certain
class is underrepresented in the test data.