
Mini Project (BDA) Output

The document details a BigMart Sales Analysis Project aimed at predicting product sales using machine learning techniques. It includes data loading from Kaggle, preprocessing steps such as handling missing values and feature engineering, and the implementation of a Random Forest Regressor for model training and evaluation. The project concludes with the creation of a submission file and visualizations of actual versus predicted sales and feature importance.

# BigMart Sales Analysis Project


# Predicting sales of products in BigMart stores

# Import necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')


# Import data using kagglehub

try:
    # Install dependencies if needed:
    # pip install kagglehub[pandas-datasets]
    import kagglehub
    from kagglehub import KaggleDatasetAdapter

    # Load train data
    print("Loading train data from Kaggle...")
    train = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "brijbhushannanda1979/bigmart-sales-data",
        "Train.csv"
    )

    # Load test data
    print("Loading test data from Kaggle...")
    test = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "brijbhushannanda1979/bigmart-sales-data",
        "Test.csv"
    )

    print("Data loaded successfully from Kaggle!")

except Exception as e:
    print(f"Error loading data from Kaggle: {e}")
    print("Falling back to URL loading...")

    try:
        # Fallback to GitHub URLs
        train_url = "https://raw.githubusercontent.com/suvikramsain/Bigmart-Sales/master/Train.csv"
        test_url = "https://raw.githubusercontent.com/suvikramsain/Bigmart-Sales/master/Test.csv"

        train = pd.read_csv(train_url)
        test = pd.read_csv(test_url)
        print("Data loaded successfully from GitHub URLs")
    except Exception:
        print("Error loading data from URLs. Using local files if available.")
        try:
            train = pd.read_csv('Train.csv')
            test = pd.read_csv('Test.csv')
            print("Data loaded successfully from local files")
        except Exception:
            print("Error: Unable to load data. Please check data availability.")
            import sys
            sys.exit(1)

Loading train data from Kaggle...

Downloading from https://www.kaggle.com/api/v1/datasets/download/brijbhushannanda1979/bigmart-sales-data?dataset_version_number=1&fi
100%|██████████| 849k/849k [00:00<00:00, 70.8MB/s]
Loading test data from Kaggle...

Data loaded successfully from Kaggle!

# Take a look at the data


print("Train data shape:", train.shape)
print("Test data shape:", test.shape)
print("\nTrain data columns:", train.columns.tolist())
print("\nFirst few rows of train data:")
print(train.head())

Train data shape: (8523, 12)


Test data shape: (5681, 11)

Train data columns: ['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

First few rows of train data:


Item_Identifier Item_Weight Item_Fat_Content Item_Visibility \
0 FDA15 9.30 Low Fat 0.016047
1 DRC01 5.92 Regular 0.019278
2 FDN15 17.50 Low Fat 0.016760
3 FDX07 19.20 Regular 0.000000
4 NCD19 8.93 Low Fat 0.000000

Item_Type Item_MRP Outlet_Identifier \


0 Dairy 249.8092 OUT049
1 Soft Drinks 48.2692 OUT018
2 Meat 141.6180 OUT049
3 Fruits and Vegetables 182.0950 OUT010
4 Household 53.8614 OUT013

Outlet_Establishment_Year Outlet_Size Outlet_Location_Type \


0 1999 Medium Tier 1
1 2009 Medium Tier 3
2 1999 Medium Tier 1
3 1998 NaN Tier 3
4 1987 High Tier 3

Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 3735.1380
1 Supermarket Type2 443.4228
2 Supermarket Type1 2097.2700
3 Grocery Store 732.3800
4 Supermarket Type1 994.7052

# Check for missing values


print("\nMissing values in train data:")
print(train.isnull().sum())
print("\nMissing values in test data:")
print(test.isnull().sum())

Missing values in train data:


Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Missing values in test data:


Item_Identifier 0
Item_Weight 976
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0

Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 1606
Outlet_Location_Type 0
Outlet_Type 0
dtype: int64

# Combine train and test datasets for preprocessing


test['Item_Outlet_Sales'] = np.nan
combined = pd.concat([train, test], ignore_index=True)
print("\nCombined data shape:", combined.shape)

# Data preprocessing
# Fill missing values
combined['Item_Weight'] = combined['Item_Weight'].fillna(combined['Item_Weight'].mean())
combined['Outlet_Size'] = combined['Outlet_Size'].fillna('Unknown')

# Fix inconsistent categories


combined['Item_Fat_Content'] = combined['Item_Fat_Content'].replace({
    'LF': 'Low Fat',
    'low fat': 'Low Fat',
    'reg': 'Regular'
})

Combined data shape: (14204, 12)
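
A quick sanity check can confirm the cleanup took effect (a minimal sketch, assuming the combined DataFrame from the cells above):

# Verify the imputed columns have no remaining missing values
assert combined['Item_Weight'].isnull().sum() == 0
assert combined['Outlet_Size'].isnull().sum() == 0

# The fat-content labels should now collapse to exactly two categories
print(combined['Item_Fat_Content'].value_counts())  # expect only 'Low Fat' and 'Regular'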

# Feature Engineering
# Extract year feature from establishment year
current_year = 2025
combined['Outlet_Years'] = current_year - combined['Outlet_Establishment_Year']
combined.drop('Outlet_Establishment_Year', axis=1, inplace=True)

# Item visibility should not be 0, replace with mean


zero_visibility_mask = combined['Item_Visibility'] == 0
combined.loc[zero_visibility_mask, 'Item_Visibility'] = combined['Item_Visibility'].mean()

# Normalize Item_Visibility
combined['Item_Visibility'] = combined['Item_Visibility'] / combined['Item_Visibility'].max()
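
As a follow-up check (a sketch assuming the cell above has run), no zero values should remain and the normalized column should lie in (0, 1]:

# Visibility should now be strictly positive, with a maximum of exactly 1.0
print((combined['Item_Visibility'] == 0).sum())  # expect 0
print(combined['Item_Visibility'].max())         # expect 1.0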

# Encode categorical variables


le = LabelEncoder()
for column in ['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']:
    # fit_transform refits the encoder on each column's own categories
    combined[column] = le.fit_transform(combined[column])

# Create dummy variables for categorical features


categorical_columns = ['Item_Identifier', 'Outlet_Identifier']
combined = pd.get_dummies(combined, columns=categorical_columns, drop_first=True)
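
Note that Item_Identifier has roughly 1,559 distinct values in this dataset, so the dummy expansion pushes the feature matrix to well over 1,500 columns; that is why the feature-importance indices later run into the 1500s. A quick shape check (a sketch assuming combined from above):

# One dummy column per identifier level (minus the dropped first of each)
print(combined.shape)  # expect roughly (14204, ~1577)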

# Split back into train and test


train_processed = combined[~combined['Item_Outlet_Sales'].isnull()]
test_processed = combined[combined['Item_Outlet_Sales'].isnull()]

# Prepare data for modeling


X = train_processed.drop('Item_Outlet_Sales', axis=1)
y = train_processed['Item_Outlet_Sales']
X_test = test_processed.drop('Item_Outlet_Sales', axis=1)

# Split into training and validation sets


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numerical features


scaler = StandardScaler()
numerical_columns = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Years']
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_val[numerical_columns] = scaler.transform(X_val[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])
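
The scaler is fit only on the training split, then applied unchanged to the validation and test splits, so no validation or test statistics leak into training. A quick check (a sketch assuming the cell above has run):

# After scaling, the training columns should be approximately zero-mean, unit-variance
print(X_train[numerical_columns].mean().round(3))
print(X_train[numerical_columns].std().round(3))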

# Train a Random Forest Regressor


print("\nTraining Random Forest Regressor...")
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)


Training Random Forest Regressor...


RandomForestRegressor(max_depth=10, random_state=42)
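
For a more robust error estimate than a single hold-out split, k-fold cross-validation could be layered on top (an optional sketch, not part of the original notebook; assumes X and y from the cells above):

from sklearn.model_selection import cross_val_score

# 5-fold CV on negative MSE; flip the sign and take the square root to get per-fold RMSE
scores = cross_val_score(rf, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_folds = np.sqrt(-scores)
print(f"CV RMSE: {rmse_folds.mean():.1f} (+/- {rmse_folds.std():.1f})")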

# Evaluate the model


y_pred_val = rf.predict(X_val)
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
print(f"Validation RMSE: {rmse_val:.4f}")

# Feature importance
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 10 most important features:")
print(feature_importances.head(10))
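
To put the validation RMSE in context, a naive baseline that always predicts the training-set mean is a useful yardstick (a sketch, not in the original notebook; assumes y_train and y_val from above):

# Naive baseline: predict the mean of the training target for every validation row
baseline_pred = np.full(len(y_val), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_val, baseline_pred))
print(f"Baseline RMSE: {baseline_rmse:.4f}")  # the trained model should land well below this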

# Make predictions on test data


y_pred_test = rf.predict(X_test)

# Create submission file


submission = pd.DataFrame({
    'Item_Identifier': test['Item_Identifier'],
    'Outlet_Identifier': test['Outlet_Identifier'],
    'Item_Outlet_Sales': y_pred_test
})

# Save the submission file


submission.to_csv('bigmart_sales_prediction.csv', index=False)
print("\nSubmission file created successfully!")

# Visualize actual vs predicted values on validation set


plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_pred_val, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Actual vs Predicted Sales')
plt.tight_layout()
plt.savefig('actual_vs_predicted.png')

# Visualize feature importance


plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(15))
plt.title('Top 15 Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
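
When running outside Colab's inline display (e.g. as a plain script), an explicit call is needed to render the figures on screen (a one-line sketch):

plt.show()  # display both figures in addition to the saved PNG files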

print("\nAnalysis completed!")


Validation RMSE: 1027.9054

Top 10 most important features:


Feature Importance
4 Item_MRP 0.498320
7 Outlet_Type 0.314966
8 Outlet_Years 0.042362
1571 Outlet_Identifier_OUT027 0.033873
2 Item_Visibility 0.015169
0 Item_Weight 0.005936
3 Item_Type 0.005011
1318 Item_Identifier_NCE42 0.003391
796 Item_Identifier_FDQ19 0.002214
1218 Item_Identifier_FDY55 0.002083

Submission file created successfully!

Analysis completed!

