WDM - Week - I


1. Apply the Naive Bayes classification algorithm to a given textual dataset (1 week).

2. Apply the SVM algorithm to a given dataset (2 weeks).

3. Apply hierarchical clustering to a given textual dataset; experiment with different distance
metrics (2 weeks).

4. Implement distance functions for assessing similarity between documents while taking
care to standardize the attributes (1 week).

5. Apply text preprocessing methods to extract relevant text: stop word removal, stemming, and
frequency analysis for unigrams, bigrams and trigrams (2 weeks).

1. Apply the Naive Bayes classification algorithm to a given textual dataset (1 week).

#Import Required Libraries


import pandas as pd
import seaborn as sns
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_files
from sklearn.pipeline import make_pipeline

It looks like you're setting up a text classification pipeline in Python and have imported several
useful libraries. Let's walk through what each import does and how to complete the necessary steps
for text classification.

Overview of the Imports

1. pandas: A powerful library for data manipulation and analysis. You'll likely use it to work with
structured data, such as loading a CSV or dataset.
2. seaborn: A data visualization library that builds on matplotlib. You can use it to plot the results
of your analysis or metrics (e.g., confusion matrix, accuracy).
3. re: The regular expression library in Python. It will likely be useful for preprocessing and
cleaning the text data.
4. ENGLISH_STOP_WORDS: From sklearn, this set contains common stop words in English that
you can remove during text preprocessing.
5. CountVectorizer: This is a text vectorizer that converts a collection of text documents into a
matrix of token counts. It's a key component for feature extraction in text classification.
6. train_test_split: From sklearn, used to split your dataset into training and test sets.
7. MultinomialNB: This is a Naive Bayes classifier for multinomially distributed data, often used
for text classification tasks, particularly when the features are word counts.
8. accuracy_score, classification_report: Evaluation metrics to assess the performance of your
classification model.
9. load_files: A utility function to load text files into a dataset for classification.
10. make_pipeline: This allows you to create a pipeline where preprocessing and modeling steps are
encapsulated, making it easier to apply to different datasets.

General Workflow for Text Classification with These Libraries

Here's a complete example workflow using your imports, showing how to load a dataset, preprocess text,
build a model, and evaluate its performance.

1. Load and Preprocess the Dataset

Let’s assume you're working with a text dataset, which you might load using load_files (for example, a
folder with text files categorized into subdirectories).
# Load the dataset (assuming you're working with text files in different folders)
dataset = load_files('path_to_your_dataset', encoding='utf-8')

# View the dataset structure


print(f"Categories: {dataset.target_names}")
print(f"Sample data: {dataset.data[0]}")

2. Text Preprocessing

You can clean the text by removing unwanted characters, stop words, etc. Here's how to do that:
# Preprocess text by removing non-alphanumeric characters and stop words
def preprocess_text(text):
    # Remove non-alphanumeric characters
    text = re.sub(r'\W+', ' ', text)

    # Remove stop words
    text = ' '.join([word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS])

    return text

# Apply the preprocessing function to the dataset


X = [preprocess_text(text) for text in dataset.data]
y = dataset.target

3. Split Data into Training and Testing Sets

Now, split the dataset into training and test sets using train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Create a Pipeline

You can use a pipeline to chain together text vectorization and the classification model (MultinomialNB).
# Create a pipeline with CountVectorizer and MultinomialNB
pipeline = make_pipeline(
    CountVectorizer(),  # Convert text to a bag of words
    MultinomialNB()     # Use the Naive Bayes classifier
)

# Train the model


pipeline.fit(X_train, y_train)

5. Make Predictions and Evaluate the Model

After training the model, you can use it to make predictions on the test set and evaluate it.
# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Display classification report


print(classification_report(y_test, y_pred, target_names=dataset.target_names))

6. Visualize Results (Optional)

You can visualize the results, such as the distribution of predicted categories, using seaborn for better
understanding:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a confusion matrix plot


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dataset.target_names, yticklabels=dataset.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Summary:

This is a basic text classification pipeline:

1. Load and preprocess text data.


2. Split the data into training and test sets.
3. Use a CountVectorizer and a MultinomialNB classifier in a pipeline.
4. Train the model, make predictions, and evaluate the performance using accuracy and
classification report.
5. Optionally, visualize the results using confusion matrix heatmaps.

This framework is flexible enough for you to experiment with other vectorizers (such as TfidfVectorizer) or
other classifiers (e.g., LogisticRegression or an SVM) as needed; a hedged example of such a swap is sketched
below.
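For illustration, here is a minimal, hedged sketch of such a swap (not part of the original exercise): it reuses the X_train/X_test/y_train/y_test split created above and simply changes the two pipeline steps.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Same pipeline shape as before, but with TF-IDF features and a linear classifier
alt_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),  # TF-IDF weights instead of raw counts
    LogisticRegression(max_iter=1000)       # any other sklearn classifier works here too
)

alt_pipeline.fit(X_train, y_train)
alt_pred = alt_pipeline.predict(X_test)
print(f"TF-IDF + LogisticRegression accuracy: {accuracy_score(y_test, alt_pred) * 100:.2f}%")

Because the vectorizer and classifier live inside one pipeline, nothing else in the workflow has to change.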

# Load the data


df = pd.read_csv(r"C:\Users\User\Downloads\Cleaned_Students_Performance.csv")
df

It looks like you are reading a CSV file using pandas and displaying the contents of the DataFrame.
If you've run this code, you should be able to view the contents of the file as a DataFrame.


This will load the CSV file into the df DataFrame and display it. If you want to explore the dataset further
or need help with the next steps, here are a few things you can do:

1. Check the First Few Rows of the DataFrame

To quickly inspect the top few rows of your dataset:


# Show the first 5 rows
df.head()

2. Check for Missing Values

If you want to check if there are any missing values in the dataset:
# Check for missing values
df.isnull().sum()

3. View Dataframe Information

To get a summary of the DataFrame, including the column names, non-null counts, and data types:
# Summary of the DataFrame
df.info()

4. Basic Statistics of the Data

If you want basic descriptive statistics (for numerical columns), you can use:
# Summary statistics
df.describe()

5. Check Column Names

To view the column names in the dataset:


# Column names
df.columns

6. Check the Shape of the DataFrame

To see how many rows and columns the DataFrame has:


# Shape of the DataFrame
df.shape

If you want to perform a more specific analysis on this dataset (e.g., visualizing student
performance), these checks are a good starting point before moving on.
# Explore data
df.head()
df.tail()
df.info()
df.describe()

Let's go over the methods you're calling on the DataFrame (df) in your code:
1. df.head(): This will display the first 5 rows of the DataFrame by default. It gives a quick
overview of the first few records in your dataset.
2. df.tail(): This will display the last 5 rows of the DataFrame, giving you a sense of
the last few records.
3. df.info(): This method provides a concise summary of the DataFrame, including:
o The number of entries (rows).
o The column names.
o Non-null counts (which help to identify if there are missing values in the columns).
o The data types of each column.

Example output might look like:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 student_id 100 non-null int64
1 gender 100 non-null object
2 age 100 non-null int64
3 subject 100 non-null object
4 performance 100 non-null object
dtypes: int64(2), object(3)
memory usage: 4.0 KB

o RangeIndex: Number of rows in the DataFrame (100).


o Non-Null Count: Shows that all columns have no missing data (100 non-null).
o Data Types: Lists the types of data in each column (e.g., int64 for student_id and age,
object for categorical data like gender, subject, and performance).
4. df.describe(): This method provides a statistical summary of the numerical columns in the
DataFrame. It shows:
o Count: Total number of entries.
o Mean: The average value of each numerical column.
o Standard deviation (std).
o Min, 25%, 50% (median), 75%, Max: The distribution of values.

It only applies to numerical columns; any categorical columns are excluded from the
summary (a small example of including them follows the sample output below).

Example output might look like this for a DataFrame with columns age and student_id:
student_id age
count 100.000000 100.000000
mean 1050.500000 18.50
std 29.011493 1.50
min 1001.000000 17.00
25% 1025.500000 17.75
50% 1050.500000 18.50
75% 1075.500000 19.25
max 1100.000000 20.00

This provides useful insights, such as:

o Mean age is 18.5 years.


o The age column ranges from 17 to 20.
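As noted above, df.describe() skips categorical columns by default. A small, hedged example (assuming the same df) of including them:

# Describe only the categorical (object-dtype) columns: count, unique, top and freq
print(df.describe(include='object'))

# Or describe every column, numerical and categorical, in a single table
print(df.describe(include='all'))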

Example of Full Output Flow


# Display first 5 rows
print(df.head())

# Display last 5 rows


print(df.tail())

# Display summary information about the DataFrame


df.info()

# Display statistical summary of numerical columns


print(df.describe())

def unique_characters(text):
    return set(text)

# Apply the function to the DataFrame column


df['unique_characters'] = df['race_ethnicity'].apply(unique_characters)

# Display the DataFrame


print(df)
The code you've written defines a function unique_characters() that takes a string and returns a set of
unique characters in that string. You then apply this function to the race_ethnicity column of the
DataFrame df and store the results in a new column called unique_characters.

Here’s how it works:

1. unique_characters(text) function: This function takes a string (text), and set(text) creates a set
of unique characters from the string. Since a set does not allow duplicate values, it will
automatically remove any repeated characters.
2. apply(unique_characters): The apply() method is used to apply the unique_characters() function
to each element (each row) in the race_ethnicity column.
3. Storing the Result: The result of applying this function is assigned to a new column called
unique_characters.
Example:

Assume your DataFrame (df) looks like this:


student_id gender age race_ethnicity
0 1001 Male 18 Asian
1 1002 Female 19 Hispanic
2 1003 Male 17 Caucasian

When you apply the function, the new column unique_characters will show a set of unique characters for
each race_ethnicity value.

Expected Output:

For the example data above:


student_id gender age race_ethnicity unique_characters
0 1001 Male 18 Asian {'A', 'i', 'n', 's'}
1 1002 Female 19 Hispanic {'a', 'p', 'n', 's', 'i', 'c', 'h'}
2 1003 Male 17 Caucasian {'C', 'a', 'c', 'n', 's', 'i', 'u'}

 unique_characters: For each row, it contains a set of the unique characters in the corresponding
race_ethnicity value.
o For "Asian", the unique characters are {'A', 'i', 'n', 's'}.
o For "Hispanic", the unique characters are {'a', 'p', 'n', 's', 'i', 'c', 'h'}.

Complete Code:
import pandas as pd

# Sample DataFrame
data = {
    'student_id': [1001, 1002, 1003],
    'gender': ['Male', 'Female', 'Male'],
    'age': [18, 19, 17],
    'race_ethnicity': ['Asian', 'Hispanic', 'Caucasian']
}

df = pd.DataFrame(data)

# Define the function to get unique characters


def unique_characters(text):
    return set(text)

# Apply the function to the 'race_ethnicity' column


df['unique_characters'] = df['race_ethnicity'].apply(unique_characters)

# Display the DataFrame


print(df)

Output:
student_id gender age race_ethnicity unique_characters
0 1001 Male 18 Asian {'A', 'i', 'n', 's'}
1 1002 Female 19 Hispanic {'a', 'p', 'n', 's', 'i', 'c', 'h'}
2 1003 Male 17 Caucasian {'C', 'a', 'c', 'n', 's', 'i', 'u'}

Notes:

 The output for each race_ethnicity value is a set, which contains only the unique characters from
the string.
 If you want the characters in a specific order or formatted differently, you could modify the
unique_characters() function to return a sorted list or a string, for example.
 If your column contains NaN values, you'll need to handle those, either by filling them or filtering
them out before applying the function. Both points are illustrated in the short sketch below.
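A minimal, hedged sketch (assuming the same df and column as above; the helper name is illustrative) that returns the characters as a sorted string and leaves missing values untouched:

import pandas as pd

def unique_characters_sorted(text):
    # Leave NaN/missing values unchanged instead of passing them to set()
    if pd.isna(text):
        return text
    # Sort the unique characters and join them into a readable string
    return ''.join(sorted(set(text)))

df['unique_characters_sorted'] = df['race_ethnicity'].apply(unique_characters_sorted)
print(df[['race_ethnicity', 'unique_characters_sorted']])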

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove special characters and digits using regex
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    return text

The function preprocess_text() that you have defined is aimed at cleaning and preprocessing text data.
Let's break down the steps in the function:

Steps in preprocess_text():

1. Convert to lowercase: text = text.lower()

   This ensures that the text is in lowercase, which makes comparisons case-insensitive and
   reduces the dimensionality of the text data when working with text analysis or machine
   learning models.

2. Remove HTML tags: text = re.sub(r'<.*?>', '', text)

   This removes any HTML tags from the text using a regular expression (regex). It matches any
   content that is enclosed in angle brackets (< >) and removes it. This is useful when your data
   includes HTML content (e.g., <p>text</p>).

3. Remove special characters and digits: text = re.sub(r'[^a-zA-Z\s]', '', text)

   This regex pattern removes anything that is not an alphabetical character (a-zA-Z) or whitespace
   (\s). It effectively removes special characters like punctuation, numbers, and other non-alphabetic
   symbols.

Example of Usage:

Let's say we have a string with HTML tags, special characters, and mixed case:
text = "<p>This is a <b>Test</b>! @123 Check this text...</p>"
cleaned_text = preprocess_text(text)
print(cleaned_text)

Output:
this is a test check this text

 The text is converted to lowercase (this is a test check this text).


 HTML tags (<p>, </p>, <b>, and </b>) are removed.
 Special characters (!, @, 123, and ...) and numbers are removed.

Applying the Function to a DataFrame:

If you have a column in a pandas DataFrame (for example, text_column) that contains text, you can apply
the preprocess_text() function to that column like this:
import pandas as pd

# Sample DataFrame
data = {'text_column': ["<p>This is a <b>Test</b>! @123 Check this text...</p>",
                        "<div>Another <i>Example</i> with #special characters!!</div>"]}

df = pd.DataFrame(data)

# Apply the preprocess_text function to the 'text_column'


df['cleaned_text'] = df['text_column'].apply(preprocess_text)

# Display the DataFrame with the cleaned text


print(df)

Expected Output:
                                                    text_column                    cleaned_text
0  <p>This is a <b>Test</b>! @123 Check this text...</p>          this is a test check this text
1  <div>Another <i>Example</i> with #special characters!!</div>   another example with special characters
Additional Notes:

 If your text contains other unwanted elements (e.g., stop words, extra spaces), you could extend
the preprocess_text() function to handle those as well.
 You could also add functionality to handle emojis, URLs, or Unicode characters if needed. A hedged
sketch of such an extension follows.
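A minimal, hedged sketch of an extended cleaner (building on the re and ENGLISH_STOP_WORDS imports used earlier; the function name is illustrative):

def preprocess_text_extended(text):
    # Lowercase first so later patterns can assume lowercase letters
    text = text.lower()
    # Remove URLs before stripping other characters
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Keep only letters and whitespace
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Drop stop words and collapse repeated whitespace
    words = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

print(preprocess_text_extended("Visit <b>https://example.com</b> for the FULL details!!"))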

# Split the data


from sklearn.model_selection import train_test_split
x = df['race_ethnicity']
y = df['parental_level_of_education']
x, x_test, y, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)

Your code uses train_test_split from sklearn.model_selection to split the dataset into training and testing
sets. Let's break down each part of the code:

Code Explanation:
from sklearn.model_selection import train_test_split

# Define your features and target


x = df['race_ethnicity'] # Feature (independent variable)
y = df['parental_level_of_education'] # Target (dependent variable)

# Split the data into training and testing sets


x, x_test, y, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)

Breakdown:

1. Importing train_test_split:
o train_test_split is used to split data into training and testing sets. It's one of the most
common functions for splitting data in machine learning workflows.
2. Defining Features and Target:
o x = df['race_ethnicity']: This is the feature (or independent variable), which represents the
race/ethnicity of the students.
o y = df['parental_level_of_education']: This is the target (or dependent variable), which
represents the parental level of education.
3. Splitting the Data:
o stratify=y: This ensures that the class distribution in the target variable (y) is preserved
in both the training and test sets. In other words, the distribution of
parental_level_of_education will be similar in both the training and testing datasets (a
quick check of this is sketched after this list).
o test_size=0.25: This indicates that 25% of the data will be used for testing, and the
remaining 75% will be used for training.
o random_state=42: This sets a seed for the random number generator, ensuring
reproducibility of the split. Every time you run the code with this seed, the split will be
the same.
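To see the effect of stratify=y, a short check (a sketch, assuming the y and y_test series produced by the split above) compares the class proportions in the two sets:

# Class proportions should be nearly identical in the training and test splits
print(y.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))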
Result:

After executing this code:

 x will contain the race_ethnicity values for the training set (75% of the data).
 x_test will contain the race_ethnicity values for the testing set (25% of the data).
 y will contain the parental_level_of_education values for the training set.
 y_test will contain the parental_level_of_education values for the testing set.

Example of Data Splitting:

For example, if your DataFrame (df) contains 100 rows, the code will split it as follows:

 75 rows for training (in x and y).


 25 rows for testing (in x_test and y_test).

The stratify=y ensures that the proportion of parental_level_of_education in both the training and testing
sets will be the same as in the original dataset.

Check the Shapes of the Split Data:

To check that the split was done correctly, you can print the shapes of the resulting datasets:
print(f"Training features (x): {x.shape}")
print(f"Testing features (x_test): {x_test.shape}")
print(f"Training target (y): {y.shape}")
print(f"Testing target (y_test): {y_test.shape}")

This will output something like:


Training features (x): (75,)
Testing features (x_test): (25,)
Training target (y): (75,)
Testing target (y_test): (25,)

The numbers will vary depending on the size of your original dataset.

Next Steps: Now that you've split the data into training and testing sets, you can proceed with building
and training a machine learning model (e.g., using a classifier like RandomForestClassifier or
LogisticRegression) to predict the parental level of education based on race/ethnicity.

# Vectorize the text data


from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words='english')
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()
The code you've written uses CountVectorizer from sklearn.feature_extraction.text to convert the text
data (in your x and x_test variables) into numerical feature vectors. Let's break down each part:

Code Explanation:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with stop words removed


vec = CountVectorizer(stop_words='english')

# Fit the vectorizer to the training data and transform it


x = vec.fit_transform(x).toarray()

# Transform the test data using the fitted vectorizer


x_test = vec.transform(x_test).toarray()

Breakdown:

1. CountVectorizer(stop_words='english'):
o CountVectorizer is a method used to convert a collection of text documents into a matrix
of token counts (i.e., word counts).
o stop_words='english': This parameter tells the vectorizer to ignore common English
words (e.g., "the", "and", "is") which are generally not useful for text analysis. This helps
reduce the dimensionality of the resulting feature matrix and improve model
performance.
2. vec.fit_transform(x):
o fit_transform() is used on the training data (x). It first fits the vectorizer to the text data
by learning the vocabulary and then transforms the text data into a numerical matrix of
word counts.
o toarray(): This converts the sparse matrix produced by fit_transform() into a dense array.
This is often necessary when working with some machine learning algorithms, which
require a dense format.
3. vec.transform(x_test):
o transform() is used on the test data (x_test). The key point here is that we only transform
the test data using the vocabulary learned from the training data. This ensures that the test
data is represented using the same features (vocabulary) that the model will learn from
the training data.
o Again, .toarray() converts the sparse matrix to a dense array.

Example:

Let's assume that x contains the text data (e.g., the race_ethnicity column from your DataFrame), and here
are a couple of sample rows:
x = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic']

The CountVectorizer will convert these text entries into a matrix of word counts.

The fit_transform() Step:


 Training set (x = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic'])
o After fitting, the vectorizer learns the unique words in the training data (vocabulary). In
this case, the vocabulary might include ['asian', 'hispanic', 'caucasian'].
o After transforming, the resulting count matrix would look like this:
Asian Hispanic Caucasian
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 0 1 0

 Each row corresponds to a sample in the training data, and each column corresponds to a word in
the vocabulary. The values represent the count of each word in the corresponding text.

The transform() Step:

 For the test data (x_test), we transform the data using the same vocabulary learned from the
training data. If there are any new words in the test set that were not seen in the training data, they
will be ignored (since they are not in the vocabulary).

Output:

After running your code, x and x_test will be matrices of word counts from the text data in race_ethnicity
(the training and test data).

Additional Notes:

1. Dimensionality: The resulting matrices (x and x_test) will have as many columns as the number
of unique words (tokens) in the vocabulary learned from the training data. Depending on the size
of your dataset and the variety of words, this could result in a very high-dimensional feature
space.
2. Sparse Matrices: Since CountVectorizer typically creates a sparse matrix (many zeros due to the
large vocabulary), you might want to keep the sparse matrix format instead of converting it to a
dense array, especially if your dataset is large. This is often more memory-efficient; a short
sketch of this follows.
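A minimal, hedged sketch of keeping the sparse representation (it assumes the raw text series x and x_test from before they were converted with .toarray()):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')

# Keep the SciPy sparse matrices instead of calling .toarray()
x_sparse = vec.fit_transform(x)
x_test_sparse = vec.transform(x_test)

print(type(x_sparse))   # a SciPy sparse matrix, mostly zeros
print(x_sparse.shape)   # (number of samples, vocabulary size)
print(x_sparse.nnz)     # number of stored non-zero counts

# MultinomialNB accepts sparse input directly, so no conversion to a dense array is needed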

Check the Shapes of the Transformed Data:

You can check the shapes of the transformed data to verify everything is correct:
print(f"Training data shape (x): {x.shape}")
print(f"Testing data shape (x_test): {x_test.shape}")

This will output the number of samples (rows) and features (columns) in both your training and testing
datasets.

Next Steps:

Now that you’ve transformed your text data into numerical features, you can proceed with training a
machine learning model (e.g., Naive Bayes, Logistic Regression) on the training data (x, y), and evaluate
it using the test data (x_test, y_test).
# Train the model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x, y)

The code you've written applies the Multinomial Naive Bayes classifier to the data you've prepared,
which includes the text feature data (x) and the target variable (y). Let's break down what happens in this
code:

Code Explanation:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes model


model = MultinomialNB()

# Fit the model to the training data


model.fit(x, y)

Steps:

1. from sklearn.naive_bayes import MultinomialNB:


o This imports the MultinomialNB class, which is a Naive Bayes classifier commonly used
for text classification tasks where the features (words) are assumed to follow a
multinomial distribution. This is typical for problems like spam classification, sentiment
analysis, or any classification problem where the features are counts or frequencies of
words.
2. model = MultinomialNB():
o This initializes the Multinomial Naive Bayes model. By default, it uses Laplace
smoothing (i.e., alpha=1.0), which helps in cases where some words might not appear in
the training data but may appear in the test data (a small sketch of adjusting alpha
follows this list).
3. model.fit(x, y):
o The fit() method is used to train the model on the feature data x (which, in your case, are
the transformed text data after using CountVectorizer) and the target data y (which are
the corresponding labels from the parental_level_of_education column).
o The model will "learn" the relationship between the input features (the word counts) and
the target labels (the educational level of the students).
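As an illustration of the smoothing note in step 2, a hedged sketch (assuming the vectorized x, y, x_test and y_test from above) of trying a few values of alpha:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# alpha=1.0 is the default (Laplace smoothing); smaller values smooth less
for alpha in (0.1, 0.5, 1.0):
    nb = MultinomialNB(alpha=alpha)
    nb.fit(x, y)
    preds = nb.predict(x_test)
    print(f"alpha={alpha}: accuracy={accuracy_score(y_test, preds):.2f}")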
Model Behavior:

 Multinomial Naive Bayes is a probabilistic classifier that works well when the features are
counts (like word frequencies). It calculates the probability of each class (the possible values of y,
such as different parental education levels) given the input features (x), and it assigns the class
with the highest probability.
 Training: The fit() function calculates probabilities based on the frequency of each word in each
class and the frequency of each class. This is used to estimate the likelihood of a particular label
(e.g., parental education level) given the features (e.g., race/ethnicity).

Next Steps (Model Evaluation):

Once the model is trained, you should evaluate its performance on the test data (x_test and y_test). Here's
how you can do it:

1. Predict on the Test Data:


y_pred = model.predict(x_test)

This uses the predict() method of the trained model to predict the labels for the test data (x_test).

2. Evaluate the Model:

You can use metrics such as accuracy, confusion matrix, and classification report to evaluate the
performance of your model.
from sklearn.metrics import accuracy_score, classification_report

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Classification Report
print(classification_report(y_test, y_pred))

3. Confusion Matrix:

A confusion matrix can give you a more detailed understanding of how well the model is performing with
respect to different classes.
from sklearn.metrics import confusion_matrix

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Example Output:

For the accuracy, you might get something like:


Accuracy: 0.80
For the classification report, you’ll see precision, recall, and F1 score for each class (i.e., each level of
parental education):
precision recall f1-score support

High School 0.85 0.78 0.81 50


Some College 0.75 0.83 0.79 50
Associate's 0.82 0.80 0.81 50

accuracy 0.80 150


macro avg 0.81 0.80 0.80 150
weighted avg 0.81 0.80 0.80 150

Summary:

 The model is trained using the MultinomialNB() classifier.


 After training, we can evaluate its performance on the test data using various metrics such as
accuracy, classification report, and confusion matrix.
 The Multinomial Naive Bayes classifier is a great choice for text classification problems when the
features are counts or frequencies of words, as in your case with CountVectorizer.

model = make_pipeline(
    CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), preprocessor=preprocess_text),
    MultinomialNB(),
)

The code you've written creates a pipeline using make_pipeline() from sklearn, which simplifies the
process of applying multiple steps in a machine learning workflow. Let's break down the code:

Code Explanation:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

You import the necessary modules:

 make_pipeline: A helper function that chains together multiple steps of data transformation and
modeling into a single object.
 CountVectorizer: A vectorizer that converts text data into a matrix of token counts (word
frequencies).
 MultinomialNB: The Naive Bayes classifier for multinomially distributed data, commonly used
for text classification.
 ENGLISH_STOP_WORDS: A predefined list of common English stopwords.

Building the Pipeline:


model = make_pipeline(
    CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), preprocessor=preprocess_text),
    MultinomialNB(),
)

Here’s what each part of this pipeline does:

1. CountVectorizer:
o stop_words=list(ENGLISH_STOP_WORDS): You pass the predefined list of English
stopwords from sklearn.feature_extraction.text.ENGLISH_STOP_WORDS to
CountVectorizer, which will ignore common words like "the", "is", and "and" during text
processing.
o preprocessor=preprocess_text: This argument specifies that the preprocess_text
function (which you defined earlier) will be applied to clean the text before tokenizing it.
This means that for each text sample, the function will convert it to lowercase, remove
HTML tags, special characters, and digits.

What this does: The CountVectorizer will process each text in the dataset by cleaning it with
preprocess_text, then transforming it into a bag-of-words representation (i.e., a matrix where each
row corresponds to a document and each column represents a word count).

2. MultinomialNB:
o This is the Naive Bayes classifier that will be trained on the transformed text data to
predict the target variable (y, in your case, parental_level_of_education).

How It Works:

 The pipeline combines the two steps (text preprocessing and classification) into a single object,
allowing you to fit and evaluate your model with minimal code. The beauty of using a pipeline is
that it abstracts away the intermediate steps and ensures that all steps are applied in a sequence,
especially when fitting and transforming the training and testing data.

Example Workflow:

When you use this pipeline, you can do the following:

1. Fit the Model:
o You can train the model on your training data using the fit() method:
   model.fit(x, y)

This will:
o Apply the preprocess_text() function to the text in x.
o Convert the processed text into a matrix of word counts.
o Train the Naive Bayes classifier (MultinomialNB) on the resulting word count matrix and
the corresponding labels in y.
2. Predict:
o Once the model is trained, you can use the predict() method to make predictions on the
test data:
   y_pred = model.predict(x_test)
3. Evaluate the Model: You can evaluate the model using various metrics such as accuracy and the
classification report, as shown previously.

Example Code:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split data (assuming df['race_ethnicity'] is the feature and df['parental_level_of_education'] is the target)
x = df['race_ethnicity']
y = df['parental_level_of_education']
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)

# Fit the model using the pipeline


model.fit(x_train, y_train)

# Predict on the test data


y_pred = model.predict(x_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print(classification_report(y_test, y_pred))

Output Example:
After running this code, you will get an accuracy score and a classification report that provides precision,
recall, and F1-score for each class.
Accuracy: 0.82

precision recall f1-score support


High School 0.83 0.80 0.81 50
Some College 0.80 0.84 0.82 50
Associate's 0.83 0.79 0.81 50
accuracy 0.82 150
macro avg 0.82 0.81 0.81 150
weighted avg 0.82 0.82 0.81 150

Advantages of Using a Pipeline:

 Streamlined Workflow: You don't have to manually call each step (preprocessing and model
fitting), making the code cleaner and more maintainable.
 Consistency: Ensures that the same preprocessing steps are applied both during training and
testing (or cross-validation).
 Easier Experimentation: You can swap out the CountVectorizer or MultinomialNB for other
models or preprocessing methods without changing much of the code structure.

Additional Notes:

 Hyperparameter Tuning: If you need to tune hyperparameters, you can easily do this by using
GridSearchCV or RandomizedSearchCV with the pipeline. This ensures that all steps (including
the vectorizer) are part of the hyperparameter search; a hedged sketch follows.
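A minimal sketch of such a search (assuming the pipeline stored in model above and the x_train/y_train split; note that make_pipeline names each step after its class in lowercase, which is how the parameter-grid keys are formed):

from sklearn.model_selection import GridSearchCV

# Step names come from make_pipeline: 'countvectorizer' and 'multinomialnb'
param_grid = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'multinomialnb__alpha': [0.1, 0.5, 1.0],            # smoothing strength
}

search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")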

#Predict the Model


model.predict(vec.transform(['group C']))
model.score(x_test, y_test)
y_pred = model.predict(x_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")

# Evaluation of model
print(classification_report(y_test, y_pred,zero_division=0))

Your code looks mostly correct, but there are a couple of things we need to adjust to ensure it works
properly, especially when using the make_pipeline you've set up earlier. Here's the breakdown:
Issue:

In the line:
model.predict(vec.transform(['group C']))

You're trying to use vec.transform directly. However, since model is a pipeline that includes the
CountVectorizer, you don't need to manually call the CountVectorizer (vec.transform()) when making
predictions. The pipeline automatically handles the transformation step when you call model.predict().

Fix:

You should directly use model.predict() on the input text, and it will automatically go through the
CountVectorizer and MultinomialNB steps.

Updated Code:
from sklearn.metrics import accuracy_score, classification_report

# Predict on new data (e.g., a single text "group C")


y_pred_single = model.predict(['group C'])
print(f"Prediction for 'group C': {y_pred_single}")

# Evaluate the model on the test data


y_pred = model.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report


print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

Explanation:

1. Prediction for New Data:


o model.predict(['group C']): This will send the string 'group C' through the preprocessing
(preprocessor=preprocess_text), vectorization (CountVectorizer), and classification steps
in the pipeline. The output will be the predicted class for 'group C'.
2. Model Evaluation:
o model.predict(x_test): This predicts the labels for the test data (x_test). The pipeline
automatically applies preprocessing and vectorization on the test set before making
predictions.
3. Accuracy:
o accuracy_score(y_test, y_pred): This calculates the accuracy of your model by comparing
the predicted values (y_pred) to the actual labels (y_test).
4. Classification Report:
o classification_report(y_test, y_pred, zero_division=0): This generates a detailed
classification report that includes precision, recall, and F1-score for each class, and it
handles zero division errors (which can occur if a class is predicted zero times in the test
set) by setting zero_division=0.
Example Output:
Prediction for 'group C': ['Some College']
Accuracy: 0.82
Classification Report:
precision recall f1-score support

High School 0.84 0.80 0.82 50


Some College 0.80 0.84 0.82 50
Associate's 0.83 0.79 0.81 50

accuracy 0.82 150


macro avg 0.82 0.81 0.81 150
weighted avg 0.82 0.82 0.81 150

Key Notes:

 Predictions for New Data: model.predict(['group C']) works well with the pipeline as it
processes the new input in the same way as the training and test data.
 Zero Division Handling: By setting zero_division=0 in the classification report, you're avoiding
issues that can arise if any class has zero predicted samples, which can happen when a certain
class is underrepresented in the test data.
