WDM - Week - I
1. Apply Naive Bayes Classification algorithm for a given textual dataset (1 week).
// It looks like you're setting up a text classification pipeline using Python, and you're importing several
useful libraries. Let's walk through what you're likely trying to accomplish and how you can complete the
necessary steps for text classification.
1. pandas: A powerful library for data manipulation and analysis. You'll likely use it to work with
structured data, such as loading a CSV or dataset.
2. seaborn: A data visualization library that builds on matplotlib. You can use it to plot the results
of your analysis or metrics (e.g., confusion matrix, accuracy).
3. re: The regular expression library in Python. It will likely be useful for preprocessing and
cleaning the text data.
4. ENGLISH_STOP_WORDS: From sklearn, this set contains common stop words in English that
you can remove during text preprocessing.
5. CountVectorizer: This is a text vectorizer that converts a collection of text documents into a
matrix of token counts. It's a key component for feature extraction in text classification.
6. train_test_split: From sklearn, used to split your dataset into training and test sets.
7. MultinomialNB: This is a Naive Bayes classifier for multinomially distributed data, often used
for text classification tasks, particularly when the features are word counts.
8. accuracy_score, classification_report: Evaluation metrics to assess the performance of your
classification model.
9. load_files: A utility function to load text files into a dataset for classification.
10. make_pipeline: This allows you to create a pipeline where preprocessing and modeling steps are
encapsulated, making it easier to apply to different datasets.
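Collected in one place, the import block these notes assume would look something like this (a minimal sketch; all names come from the list above):

import re
import pandas as pd
import seaborn as sns

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline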
Here's a complete example workflow using your imports, showing how to load a dataset, preprocess text,
build a model, and evaluate its performance.
Let’s assume you're working with a text dataset, which you might load using load_files (for example, a
folder with text files categorized into subdirectories).
1. Load the Dataset
# Load the dataset (assuming you're working with text files in different folders)
dataset = load_files('path_to_your_dataset', encoding='utf-8')
2. Text Preprocessing
You can clean the text by removing unwanted characters, stop words, etc. Here's how to do that:
# Preprocess text by removing non-alphanumeric characters and stop words
def preprocess_text(text):
    # Remove non-alphanumeric characters
    text = re.sub(r'\W+', ' ', text)
    # Remove stop words
    text = ' '.join([word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS])
    return text
3. Split the Data
Now, split the dataset into training and test sets using train_test_split.
# Clean each document and take the labels from the loaded dataset
X = [preprocess_text(doc) for doc in dataset.data]
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Create a Pipeline
You can use a pipeline to chain together text vectorization and the classification model (MultinomialNB).
# Create a pipeline with CountVectorizer and MultinomialNB
pipeline = make_pipeline(
    CountVectorizer(),   # Convert text to a bag of words
    MultinomialNB()      # Naive Bayes classifier
)
Train the model on the training set, then use it to make predictions on the test set and evaluate it.
# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)
You can visualize the results, such as the confusion matrix, using seaborn for a better understanding:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix from the test-set predictions
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=dataset.target_names,
            yticklabels=dataset.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
Summary:
This framework is flexible enough for you to experiment with other vectorizers (like TfidfVectorizer) or
classifiers (e.g., LogisticRegression, SVM) as needed. Let me know if you need further guidance or if
there’s a specific part of the pipeline you want to expand upon!
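For example, swapping in TfidfVectorizer is a one-line change to the same pipeline (a sketch, reusing the X_train/X_test split from above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same pipeline, but terms are weighted by TF-IDF instead of raw counts
tfidf_pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)
tfidf_pipeline.fit(X_train, y_train)
print(accuracy_score(y_test, tfidf_pipeline.predict(X_test)))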
// It looks like you are trying to read a CSV file using pandas and display the contents of the DataFrame.
If you've run this code, you should be able to view the contents of the file as a DataFrame.
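A minimal sketch of that step (the filename here is hypothetical; replace it with the actual path to your CSV):

import pandas as pd

# Hypothetical filename for the students-performance dataset
df = pd.read_csv('StudentsPerformance.csv')
df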
This will load the CSV file into the df DataFrame and display it. If you want to explore the dataset further
or need help with the next steps, here are a few things you can do:
If you want to check if there are any missing values in the dataset:
# Check for missing values
df.isnull().sum()
To get a summary of the DataFrame, including the column names, non-null counts, and data types:
# Summary of the DataFrame
df.info()
If you want basic descriptive statistics (for numerical columns), you can use:
# Summary statistics
df.describe()
If you encounter any issues or want to perform a specific analysis on this dataset (e.g., visualizing student
performance), let me know how you'd like to proceed!
# Explore data
df.head()
df.tail()
df.info()
df.describe()
// Let's go over the methods you're calling on the DataFrame (df) in your code:
1. df.head(): Displays the first 5 rows of the DataFrame by default. It gives a quick
overview of the first few records in your dataset.
2. df.tail(): Similarly, df.tail() displays the last 5 rows of the DataFrame, giving you a sense of
the last few records.
3. df.info(): This method provides a concise summary of the DataFrame, including:
o The number of entries (rows).
o The column names.
o Non-null counts (which help to identify if there are missing values in the columns).
o The data types of each column.
4. df.describe(): Generates descriptive statistics such as count, mean, standard deviation,
minimum, quartiles, and maximum. It only applies to numerical columns; if there are
categorical columns, they will be excluded from the summary.
Example output might look like this for a DataFrame with columns age and student_id:
student_id age
count 100.000000 100.000000
mean 1050.500000 18.50
std 29.011493 1.50
min 1001.000000 17.00
25% 1025.500000 17.75
50% 1050.500000 18.50
75% 1075.500000 19.25
max 1100.000000 20.00
def unique_characters(text):
    return set(text)
1. unique_characters(text) function: This function takes a string (text), and set(text) creates a set
of unique characters from the string. Since a set does not allow duplicate values, it will
automatically remove any repeated characters.
2. apply(unique_characters): The apply() method is used to apply the unique_characters() function
to each element (each row) in the race_ethnicity column.
3. Storing the Result: The result of applying this function is assigned to a new column called
unique_characters.
Example:
When you apply the function, the new column unique_characters will show a set of unique characters for
each race_ethnicity value.
Expected Output:
unique_characters: For each row, it contains a set of the unique characters in the corresponding
race_ethnicity value.
o For "Asian", the unique characters are {'A', 'i', 'n', 's'}.
o For "Hispanic", the unique characters are {'a', 'p', 'n', 's', 'i', 'c', 'h'}.
Complete Code:
import pandas as pd

# Sample DataFrame
data = {
    'student_id': [1001, 1002, 1003],
    'gender': ['Male', 'Female', 'Male'],
    'age': [18, 19, 17],
    'race_ethnicity': ['Asian', 'Hispanic', 'Caucasian']
}
df = pd.DataFrame(data)

# Apply the function to each value in the race_ethnicity column
df['unique_characters'] = df['race_ethnicity'].apply(unique_characters)
print(df)
Output:
student_id gender age race_ethnicity unique_characters
0 1001 Male 18 Asian {'A', 's', 'i', 'a', 'n'}
1 1002 Female 19 Hispanic {'H', 'i', 's', 'p', 'a', 'n', 'c'}
2 1003 Male 17 Caucasian {'C', 'a', 'u', 'c', 's', 'i', 'n'}
Notes:
The output for each race_ethnicity value is a set, which contains only the unique characters from
the string.
If you want the characters in a specific order or formatted differently, you could modify the
unique_characters() function to return a sorted list or a string, for example (see the sketch after these notes).
If your column contains NaN values, you'll need to handle those, either by filling them or filtering
them out before applying the function.
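For instance, a variant that returns the unique characters as a sorted string and handles NaN values (a sketch; the name sorted_unique_characters is hypothetical):

# Return unique characters as a sorted string instead of a set
def sorted_unique_characters(text):
    if pd.isna(text):              # handle missing values before processing
        return ''
    return ''.join(sorted(set(text)))

df['unique_characters_sorted'] = df['race_ethnicity'].apply(sorted_unique_characters)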
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags, then strip non-alphabetic characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Collapse repeated whitespace
    return ' '.join(text.split())
The function preprocess_text() that you have defined is aimed at cleaning and preprocessing text data.
Let's break down the steps in the function:
Steps in preprocess_text():
1. Convert to lowercase:
text = text.lower()
This ensures that the text is in lowercase, which helps make comparisons case-insensitive
and reduces the dimensionality of the text data when working with text analysis or machine
learning models.
2. Remove HTML tags and non-alphabetic characters:
text = re.sub(r'[^a-zA-Z\s]', ' ', text)
This regex pattern removes anything that is not an alphabetical character (a-zA-Z) or whitespace
(\s). It effectively removes special characters like punctuation, numbers, and other non-alphabetic
symbols.
Example of Usage:
Let's say we have a string with HTML tags, special characters, and mixed case:
text = "<p>This is a <b>Test</b>! @123 Check this text...</p>"
cleaned_text = preprocess_text(text)
print(cleaned_text)
Output:
this is a test check this text
If you have a column in a pandas DataFrame (for example, text_column) that contains text, you can apply
the preprocess_text() function to that column like this:
import pandas as pd

# Sample DataFrame
data = {'text_column': ["<p>This is a <b>Test</b>! @123 Check this text...</p>",
                        "<div>Another <i>Example</i> with #special characters!!</div>"]}
df = pd.DataFrame(data)

# Apply the preprocessing function to the text column
df['cleaned_text'] = df['text_column'].apply(preprocess_text)
print(df)
Expected Output:
text_column cleaned_text
0 <p>This is a <b>Test</b>! @123 Check this text...</p> this is a test check this text
1 <div>Another <i>Example</i> with #special characters!!</div> another example with special characters
Additional Notes:
If your text contains other unwanted elements (e.g., stopwords, extra spaces), you could extend
the preprocess_text() function to handle those as well.
You could also add functionality to handle emojis, URLs, or unicode characters if needed.
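As an illustration, an extended version might also strip URLs and stop words (a sketch; the extra regex and stop-word filter are assumptions, not part of the original function):

def preprocess_text_extended(text):
    text = text.lower()
    # Remove URLs before stripping punctuation
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Remove HTML tags and non-alphabetic characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Drop English stop words and collapse whitespace
    return ' '.join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)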
Your code uses train_test_split from sklearn.model_selection to split the dataset into training and testing
sets. Let's break down each part of the code:
Code Explanation:
from sklearn.model_selection import train_test_split

x = df['race_ethnicity']
y = df['parental_level_of_education']

x, x_test, y, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)
Breakdown:
1. Importing train_test_split:
o train_test_split is used to split data into training and testing sets. It's one of the most
common functions for splitting data in machine learning workflows.
2. Defining Features and Target:
o x = df['race_ethnicity']: This is the feature (or independent variable), which represents the
race/ethnicity of the students.
o y = df['parental_level_of_education']: This is the target (or dependent variable), which
represents the parental level of education.
3. Splitting the Data:
o stratify=y: This ensures that the class distribution in the target variable (y) is preserved
in both the training and test sets. In other words, the distribution of
parental_level_of_education will be similar in both the training and testing datasets.
o test_size=0.25: This indicates that 25% of the data will be used for testing, and the
remaining 75% will be used for training.
o random_state=42: This sets a seed for the random number generator, ensuring
reproducibility of the split. Every time you run the code with this seed, the split will be
the same.
Result:
x will contain the race_ethnicity values for the training set (75% of the data).
x_test will contain the race_ethnicity values for the testing set (25% of the data).
y will contain the parental_level_of_education values for the training set.
y_test will contain the parental_level_of_education values for the testing set.
For example, if your DataFrame (df) contains 100 rows, the code will place 75 rows in the training set
and 25 rows in the test set. The stratify=y ensures that the proportion of parental_level_of_education in
both the training and testing sets will be the same as in the original dataset.
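To see the effect of stratify=y, you can compare the class proportions in the two target splits (a quick check; value_counts(normalize=True) returns proportions instead of counts):

# Class proportions should be roughly equal in the training and test targets
print(y.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))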
To check that the split was done correctly, you can print the shapes of the resulting datasets:
print(f"Training features (x): {x.shape}")
print(f"Testing features (x_test): {x_test.shape}")
print(f"Training target (y): {y.shape}")
print(f"Testing target (y_test): {y_test.shape}")
The numbers will vary depending on the size of your original dataset.
Next Steps: Now that you've split the data into training and testing sets, you can proceed with building
and training a machine learning model (e.g., using a classifier like RandomForestClassifier,
LogisticRegression, etc.) to predict the parental level of education based on race/ethnicity.
Code Explanation:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()
Breakdown:
1. CountVectorizer(stop_words='english'):
o CountVectorizer is a method used to convert a collection of text documents into a matrix
of token counts (i.e., word counts).
o stop_words='english': This parameter tells the vectorizer to ignore common English
words (e.g., "the", "and", "is") which are generally not useful for text analysis. This helps
reduce the dimensionality of the resulting feature matrix and improve model
performance.
2. vec.fit_transform(x):
o fit_transform() is used on the training data (x). It first fits the vectorizer to the text data
by learning the vocabulary and then transforms the text data into a numerical matrix of
word counts.
o toarray(): This converts the sparse matrix produced by fit_transform() into a dense array.
This is often necessary when working with some machine learning algorithms, which
require a dense format.
3. vec.transform(x_test):
o transform() is used on the test data (x_test). The key point here is that we only transform
the test data using the vocabulary learned from the training data. This ensures that the test
data is represented using the same features (vocabulary) that the model will learn from
the training data.
o Again, .toarray() converts the sparse matrix to a dense array.
Example:
Let's assume that x contains the text data (e.g., the race_ethnicity column from your DataFrame), and here
are a couple of sample rows:
x = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic']
The CountVectorizer will convert these text entries into a matrix of word counts.
Each row corresponds to a sample in the training data, and each column corresponds to a word in
the vocabulary. The values represent the count of each word in the corresponding text.
For the test data (x_test), we transform the data using the same vocabulary learned from the
training data. If there are any new words in the test set that were not seen in the training data, they
will be ignored (since they are not in the vocabulary).
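A small sketch of what this looks like for the sample above (variable names here are illustrative; get_feature_names_out requires scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

sample = ['Asian', 'Hispanic', 'Caucasian', 'Asian', 'Hispanic']
cv = CountVectorizer()
counts = cv.fit_transform(sample)

print(cv.get_feature_names_out())   # vocabulary learned from the sample, e.g. ['asian' 'caucasian' 'hispanic']
print(counts.toarray())             # one row per sample, one column per vocabulary word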
Output:
After running your code, x and x_test will be matrices of word counts from the text data in race_ethnicity
(the training and test data).
Additional Notes:
1. Dimensionality: The resulting matrices (x and x_test) will have as many columns as the number
of unique words (tokens) in the vocabulary learned from the training data. Depending on the size
of your dataset and the variety of words, this could result in a very high-dimensional feature
space.
2. Sparse Matrices: Since CountVectorizer typically creates a sparse matrix (many zeros due to the
large vocabulary), you might want to use a sparse matrix format instead of converting it to a
dense array, especially if your dataset is large. This is often more memory-efficient.
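For example, simply skipping .toarray() keeps the scipy sparse representation (a sketch; MultinomialNB accepts sparse input directly):

# Keep the sparse matrices returned by the vectorizer
x = vec.fit_transform(x)
x_test = vec.transform(x_test)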
You can check the shapes of the transformed data to verify everything is correct:
print(f"Training data shape (x): {x.shape}")
print(f"Testing data shape (x_test): {x_test.shape}")
This will output the number of samples (rows) and features (columns) in both your training and testing
datasets.
Next Steps:
Now that you’ve transformed your text data into numerical features, you can proceed with training a
machine learning model (e.g., Naive Bayes, Logistic Regression) on the training data (x, y), and evaluate
it using the test data (x_test, y_test).
# Train the model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x, y)
The code you've written applies the Multinomial Naive Bayes classifier to the data you've prepared,
which includes the text feature data (x) and the target variable (y). Let's break down what happens in this
code:
Code Explanation:
from sklearn.naive_bayes import MultinomialNB
Steps:
Multinomial Naive Bayes is a probabilistic classifier that works well when the features are
counts (like word frequencies). It calculates the probability of each class (the possible values of y,
such as different parental education levels) given the input features (x), and it assigns the class
with the highest probability.
Training: The fit() function calculates probabilities based on the frequency of each word in each
class and the frequency of each class. This is used to estimate the likelihood of a particular label
(e.g., parental education level) given the features (e.g., race/ethnicity).
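If you want to inspect those per-class probabilities directly, predict_proba exposes them (a sketch; it assumes x_test has already been vectorized as above):

# Probability of each parental-education class for the first test sample
probs = model.predict_proba(x_test[:1])
for label, p in zip(model.classes_, probs[0]):
    print(f"{label}: {p:.3f}")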
Once the model is trained, you should evaluate its performance on the test data (x_test and y_test). Here's
how you can do it:
1. Make Predictions:
# Predict labels for the test data
y_pred = model.predict(x_test)
This uses the predict() method of the trained model to predict the labels for the test data (x_test).
2. Accuracy and Classification Report:
You can use metrics such as accuracy, confusion matrix, and classification report to evaluate the
performance of your model.
from sklearn.metrics import accuracy_score, classification_report
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print(classification_report(y_test, y_pred))
3. Confusion Matrix:
A confusion matrix can give you a more detailed understanding of how well the model is performing with
respect to different classes.
from sklearn.metrics import confusion_matrix
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
model = make_pipeline(
    CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), preprocessor=preprocess_text),
    MultinomialNB()
)
The code you've written creates a pipeline using make_pipeline() from sklearn, which simplifies the
process of applying multiple steps in a machine learning workflow. Let's break down the code:
Code Explanation:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
make_pipeline: A helper function that chains together multiple steps of data transformation and
modeling into a single object.
CountVectorizer: A vectorizer that converts text data into a matrix of token counts (word
frequencies).
MultinomialNB: The Naive Bayes classifier for multinomially distributed data, commonly used
for text classification.
ENGLISH_STOP_WORDS: A predefined list of common English stopwords.
1. CountVectorizer:
o stop_words=list(ENGLISH_STOP_WORDS): You pass the predefined list of English
stopwords from sklearn.feature_extraction.text.ENGLISH_STOP_WORDS to
CountVectorizer, which will ignore common words like "the", "is", and "and" during text
processing.
o preprocessor=preprocess_text: This argument specifies that the preprocess_text
function (which you defined earlier) will be applied to clean the text before tokenizing it.
This means that for each text sample, the function will convert it to lowercase, remove
HTML tags, special characters, and digits.
What this does: The CountVectorizer will process each text in the dataset by cleaning it with
preprocess_text, then transforming it into a bag-of-words representation (i.e., a matrix where each
row corresponds to a document and each column represents a word count).
2. MultinomialNB:
o This is the Naive Bayes classifier that will be trained on the transformed text data to
predict the target variable (y, in your case, parental_level_of_education).
How It Works:
The pipeline combines the two steps (text preprocessing and classification) into a single object,
allowing you to fit and evaluate your model with minimal code. The beauty of using a pipeline is
that it abstracts away the intermediate steps and ensures that all steps are applied in a sequence,
especially when fitting and transforming the training and testing data.
Example Workflow:
The code below splits the data, fits the pipeline on the training set, and evaluates it on the test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split data (df['race_ethnicity'] is the feature and df['parental_level_of_education'] is the target)
x = df['race_ethnicity']
y = df['parental_level_of_education']
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)

# Fit the pipeline (preprocessing, vectorization, and Naive Bayes) on the training data
model.fit(x_train, y_train)

# Predict on the test set and evaluate
y_pred = model.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Classification report
print(classification_report(y_test, y_pred))
Output Example:
After running this code, you will get an accuracy score and a classification report that provides precision,
recall, and F1-score for each class.
Accuracy: 0.82
Benefits of Using a Pipeline:
Streamlined Workflow: You don't have to manually call each step (preprocessing and model
fitting), making the code cleaner and more maintainable.
Consistency: Ensures that the same preprocessing steps are applied both during training and
testing (or cross-validation).
Easier Experimentation: You can swap out the CountVectorizer or MultinomialNB for other
models or preprocessing methods without changing much of the code structure.
Additional Notes:
Hyperparameter Tuning: If you need to tune hyperparameters, you can easily do this by using
GridSearchCV or RandomizedSearchCV with the pipeline. This ensures that all steps (including
the vectorizer) are part of the hyperparameter search.
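A minimal sketch of that, assuming the pipeline object is the model defined above (parameter names follow make_pipeline's lowercase step names):

from sklearn.model_selection import GridSearchCV

# Tune the Naive Bayes smoothing parameter and the n-gram range together
param_grid = {
    'multinomialnb__alpha': [0.1, 0.5, 1.0],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)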
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Evaluation of the model
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
Your code looks mostly correct, but there are a couple of things we need to adjust to ensure it works
properly, especially when using the make_pipeline you've set up earlier. Here's the breakdown:
Issue:
In the line:
model.predict(vec.transform(['group C']))
You're trying to use vec.transform directly. However, since model is a pipeline that includes the
CountVectorizer, you don't need to manually call the CountVectorizer (vec.transform()) when making
predictions. The pipeline automatically handles the transformation step when you call model.predict().
Fix:
You should directly use model.predict() on the input text, and it will automatically go through the
CountVectorizer and MultinomialNB steps.
Updated Code:
# Predict directly with the pipeline; it applies CountVectorizer internally
print(model.predict(['group C']))

from sklearn.metrics import accuracy_score, classification_report
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification report (zero_division=0 avoids warnings for unpredicted classes)
print(classification_report(y_test, y_pred, zero_division=0))
Key Notes:
Predictions for New Data: model.predict(['group C']) works well with the pipeline as it
processes the new input in the same way as the training and test data.
Zero Division Handling: By setting zero_division=0 in the classification report, you're avoiding
issues that can arise if any class has zero predicted samples, which can happen when a certain
class is underrepresented in the test data.