Dsa Final Project Report

The project report outlines a comprehensive analysis of air quality data sourced from monitoring stations in India, focusing on pollutant levels in Kerala and Delhi. It details the processes of data transformation, normalization, and statistical analysis, including t-tests and ANOVA, to evaluate differences in pollution levels. The findings indicate significant differences in average pollutant levels between the two states, supported by various statistical tests and visualizations.

FOUNDATIONS OF DATA SCIENCE

DURING THE YEAR 2024

A COURSE PROJECT REPORT

Submitted by
BHUKYA INDU (B230880EC)
M.VIDYADHARI (B231051EC)

DEPARTMENT OF
ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY CALICUT
NOVEMBER 2024
TABLE OF CONTENTS
1. Description of Data
1.1 Data Source
1.2 Data Attributes
2. Data Transformation
3. Data Wrangling and Descriptive Statistics
3.1 Data Loading and Initial Inspection
3.2 Filtering Data
3.3 Handling Missing Values
3.4 Feature Engineering
3.5 Descriptive Statistics
3.6 Visualization
4. Data Normalisation
4.1 First Normal Form (1NF)
4.2 Second Normal Form (2NF)
4.3 Third Normal Form (3NF)
5. Statistical Analysis
5.1 T-Test for Pollutant Levels between Kerala and Delhi
5.2 One-Way ANOVA for State Effect on Pollutant Levels
5.3 Two-Way ANOVA for State and Pollutant Type
5.4 Proportion Z-Test for High Pollution Levels
5.5 Chi-Square Test for Independence between Pollutant Type and State
6. Data Visualisation
6.1 Boxplot for Pollutant Levels Across Kerala and Delhi
6.2 Histogram of Pollutant Levels for Kerala and Delhi
6.3 Distribution Plot of Pollutant Levels in Kerala and Delhi
6.4 Heatmap of the Correlation Matrix for Numeric Variables
6.5 Bar Plot of High Pollution Proportion by State

1) Description of Data
1.1 Data Source
The air quality data used in this report is sourced from the Real-Time Air
Quality Index catalog available at data.gov.in. This dataset provides real-
time pollutant concentration levels from monitoring stations across various
cities in India. It consists of 3,229 records with attributes such as station
location, pollutant type, and time-based updates.
1.2 Data Attributes
Sr No.  Attribute      Data Type        Example
1       Country        Nominal Data     India
2       State          Nominal Data     Andhra Pradesh
3       City           Nominal Data     Amaravati
4       Station        Nominal Data     Secretariat, Amaravati - APPCB
5       last_update    Ordinal Data     16-09-2024 02:00:00
6       Latitude       Continuous Data  16.515083
7       longitude      Continuous Data  80.518167
8       pollutant_id   Nominal Data     PM10, SO2, NH3
9       pollutant_min  Continuous Data  40.0
10      pollutant_max  Continuous Data  75.0
11      pollutant_avg  Continuous Data  55.0
2) Data Transformation
Sr No.  Command               Purpose
1       .dropna()             Deletes rows with missing pollutant readings.
2       .fillna(value)        Fills missing values with a default or interpolated value.
3       .drop_duplicates()    Removes duplicate rows to maintain data consistency.
4       .rename(columns=...)  Renames columns for consistency (e.g., pollutant_id to Pollutant).
5       .astype()             Converts column data types (e.g., last_update to datetime).
6       .groupby()            Groups data by city or state to compute average pollution levels.
7       .reset_index()        Resets the index after grouping for a clean DataFrame.
8       .str.strip()          Removes leading and trailing whitespace (or specified characters) from string values.
9       .apply()              Applies custom functions to columns where required.


After transformation, the dataset was uniform and ready for analysis.
Specifically, the operations ensured that all timestamps were in a
consistent format, redundant records were removed, and missing pollutant
values were either filled or excluded. This cleaned and structured data
allowed for effective grouping and visualization, enabling meaningful
insights into air quality trends across various locations.
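
The commands in the table above can be chained as in the following minimal sketch. The sample rows are made up for illustration, and `pd.to_datetime` is used for the timestamp conversion as an assumed alternative to `.astype()`; the order of the steps is likewise an assumption, not necessarily the order used in the project.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column names.
raw = pd.DataFrame({
    "state": [" Delhi", "Delhi ", "Kerala", "Kerala", "Kerala"],
    "city": ["Delhi", "Delhi", "Kochi", "Kochi", "Kochi"],
    "pollutant_id": ["PM10", "PM10", "SO2", "SO2", "NH3"],
    "last_update": ["16-09-2024 02:00:00"] * 5,
    "pollutant_avg": [120.0, 120.0, 11.0, 11.0, None],
})

clean = raw.copy()
# Strip stray whitespace so " Delhi" and "Delhi " become the same label.
clean["state"] = clean["state"].str.strip()
# Parse the day-first timestamp strings into datetime values.
clean["last_update"] = pd.to_datetime(clean["last_update"], format="%d-%m-%Y %H:%M:%S")
# Remove exact duplicate rows, then rows without a pollutant reading.
clean = clean.drop_duplicates()
clean = clean.dropna(subset=["pollutant_avg"])
# Rename for consistency, as in the table above.
clean = clean.rename(columns={"pollutant_id": "Pollutant"})

# Average pollution level per state, with a clean integer index.
state_avg = clean.groupby("state")["pollutant_avg"].mean().reset_index()
print(state_avg)
```

Stripping whitespace before `drop_duplicates()` matters here: without it, the two Delhi rows would not be recognised as duplicates.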

3) Data Wrangling and Descriptive Statistics

3.1 Data Loading and Initial Inspection: The dataset was loaded
from air_quality.csv using pandas. We conducted an initial
inspection to check column names and data types and to confirm
there were no major issues with data completeness.
3.2 Filtering Data: We focused on air quality data for two states,
Kerala and Delhi, to narrow the analysis to these locations. The
filtered dataset allows a more detailed comparison between the
two states.
3.3 Handling Missing Values: For numeric columns, missing values
were replaced using mean imputation. This keeps the column mean
unchanged, although it reduces the variance of the imputed
column, and it ensures complete data for statistical analysis.
3.4 Feature Engineering: We added a new feature, high_pollution,
which flags observations where the pollutant_avg exceeds a
threshold (e.g., 50). This binary indicator supports further analysis
on pollution severity.
3.5 Descriptive Statistics: Summary statistics, including mean, standard
deviation, and range, were computed for pollutant_avg in each
state. These statistics provide insight into the general level and
variability of pollutants, which is essential for understanding
pollution patterns.
3.6 Visualization: Boxplots, line plots, and distribution plots were
created to visually explore pollutant levels across the two states.
These plots help identify trends, outliers, and differences in
pollutant distributions between Kerala and Delhi.
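
Steps 3.2 to 3.5 can be sketched as follows. The rows are invented for illustration, the threshold of 50 follows the example given in 3.4, and imputing with the mean of the filtered column (rather than the mean of the full dataset) is an assumption.

```python
import pandas as pd

# Made-up readings; None marks a missing pollutant_avg value.
df = pd.DataFrame({
    "state": ["Kerala", "Kerala", "Kerala", "Delhi", "Delhi", "Delhi", "Goa"],
    "pollutant_avg": [20.0, None, 40.0, 60.0, 180.0, None, 33.0],
})

# 3.2 Filtering: keep only the two states being compared.
subset = df[df["state"].isin(["Kerala", "Delhi"])].copy()

# 3.3 Mean imputation (assumption: mean of the filtered column).
subset["pollutant_avg"] = subset["pollutant_avg"].fillna(subset["pollutant_avg"].mean())

# 3.4 Feature engineering: binary flag for readings above the threshold.
THRESHOLD = 50
subset["high_pollution"] = (subset["pollutant_avg"] > THRESHOLD).astype(int)

# 3.5 Descriptive statistics per state.
summary = subset.groupby("state")["pollutant_avg"].agg(["mean", "std", "min", "max"])
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

In this toy sample the imputed Kerala reading takes the pooled mean of 75 and is therefore flagged as high pollution, which shows how the choice of imputation value can influence the derived feature.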

Dataset Information:
Column Non-Null Count Dtype
country 3229 object
state 3229 object
city 3229 object
station 3229 object
last_update 3229 object
latitude 3229 float64
longitude 3229 float64
pollutant_id 3229 object
pollutant_min 2998 float64
pollutant_max 2998 float64
pollutant_avg 2998 float64
Descriptive Statistics:
Statistic  Latitude  Longitude  Pollutant Min  Pollutant Max  Pollutant Avg
Count      3229.000  3229.000   3229.000       3229.000       3229.000
Mean       22.7039   78.5719    16.494         45.770         28.109
Std Dev    5.4758    4.7900     16.305         51.671         26.885
Min        8.5149    70.9092    1.000          1.000          1.000
25%        19.0605   75.6381    5.000          14.000         9.000
50%        23.2648   77.3025    13.000         35.000         22.000
75%        27.1941   80.3230    22.000         57.000         37.000
Max        34.0662   94.6366    114.000        500.000        392.000

4) Data Normalisation
4.1 First Normal Form (1NF)
To ensure the dataset adheres to First Normal Form (1NF), the following
transformations were applied:
1. Atomic Values: Each field contains only a single value. For example,
pollutant_id holds one pollutant type per entry (e.g., PM10 or SO2).
2. No Repeating Groups: There are no multiple values within a single
cell.
3. Unique Records: Duplicate entries were removed using
.drop_duplicates() to ensure every row is unique.
4. Composite Primary Key: Since no single attribute uniquely identifies a
record, a composite primary key was created using the combination
of station, pollutant_id, and last_update. This ensures the uniqueness
of each entry.
Station                                         Pollutant  Last Update          Latitude   Longitude  Pollutant Min  Pollutant Max  Pollutant Avg
Secretariat, Amaravati - APPCB                  PM10       2024-09-16 02:00:00  16.515083  80.518167  40.0           75.0           55.0
Gulzarpet, Anantapur - APPCB                    SO2        2024-09-16 02:00:00  14.675886  77.593027  NaN            NaN            NaN
Gangineni Cheruvu, Chittoor - APPCB             PM10       2024-09-16 02:00:00  13.204880  79.097889  NaN            NaN            NaN
Anand Kala Kshetram, Rajamahendravaram - APPCB  SO2        2024-09-16 02:00:00  16.987287  81.736318  4.0            18.0           11.0
Tirumala, Tirupati - APPCB                      NH3        2024-09-16 02:00:00  13.670000  79.350000  1.0            2.0            1.0

This table ensures compliance with 1NF by ensuring:

• Each field holds a single, atomic value.
• There are no repeating groups or multivalued fields.
• The combination of station, pollutant_id, and last_update serves as a
composite primary key, ensuring the uniqueness of each row.

4.2 Second Normal Form (2NF)

To achieve Second Normal Form (2NF), partial dependencies were
removed. In 1NF, some non-key attributes depended only on a part of the
composite primary key. For example:
• Latitude and Longitude depend only on the station.
• Pollutant Min, Max, and Avg depend only on the combination of
pollutant_id and last_update.
Table A: Main Data
Station                                         Pollutant  Last Update
Secretariat, Amaravati - APPCB                  PM10       2024-09-16 02:00:00
Gulzarpet, Anantapur - APPCB                    SO2        2024-09-16 02:00:00
Gangineni Cheruvu, Chittoor - APPCB             PM10       2024-09-16 02:00:00
Anand Kala Kshetram, Rajamahendravaram - APPCB  SO2        2024-09-16 02:00:00
Tirumala, Tirupati - APPCB                      NH3        2024-09-16 02:00:00

Table B: Pollutant Data
Pollutant  Last Update          Pollutant Min  Pollutant Max  Pollutant Avg
PM10       2024-09-16 02:00:00  40.0           75.0           55.0
SO2        2024-09-16 02:00:00  NaN            NaN            NaN
NH3        2024-09-16 02:00:00  1.0            2.0            1.0

Table C: Station Data
Station                                         City               Latitude   Longitude
Secretariat, Amaravati - APPCB                  Amaravati          16.515083  80.518167
Gulzarpet, Anantapur - APPCB                    Anantapur          14.675886  77.593027
Gangineni Cheruvu, Chittoor - APPCB             Chittoor           13.204880  79.097889
Anand Kala Kshetram, Rajamahendravaram - APPCB  Rajamahendravaram  16.987287  81.736318
Tirumala, Tirupati - APPCB                      Tirupati           13.670000  79.350000


4.3 Third Normal Form (3NF)

To achieve Third Normal Form (3NF), we need to eliminate any transitive
dependencies. In the 2NF structure, some attributes were still indirectly
dependent on non-key attributes. For example:
• The city determines the state (i.e., city -> state).
• The state determines the country (i.e., state -> country).
Table A: Main Data
Station                                         Pollutant  Last Update
Secretariat, Amaravati - APPCB                  PM10       2024-09-16 02:00:00
Gulzarpet, Anantapur - APPCB                    SO2        2024-09-16 02:00:00
Gangineni Cheruvu, Chittoor - APPCB             PM10       2024-09-16 02:00:00
Anand Kala Kshetram, Rajamahendravaram - APPCB  SO2        2024-09-16 02:00:00
Tirumala, Tirupati - APPCB                      NH3        2024-09-16 02:00:00

Table B: Pollutant Data
Pollutant  Last Update          Pollutant Min  Pollutant Max  Pollutant Avg
PM10       2024-09-16 02:00:00  40.0           75.0           55.0
SO2        2024-09-16 02:00:00  NaN            NaN            NaN
NH3        2024-09-16 02:00:00  1.0            2.0            1.0

Table C: Station Data
Station                                         City
Secretariat, Amaravati - APPCB                  Amaravati
Gulzarpet, Anantapur - APPCB                    Anantapur
Gangineni Cheruvu, Chittoor - APPCB             Chittoor
Anand Kala Kshetram, Rajamahendravaram - APPCB  Rajamahendravaram
Tirumala, Tirupati - APPCB                      Tirupati

Table D: City Data
City               State
Amaravati          Andhra Pradesh
Anantapur          Andhra Pradesh
Chittoor           Andhra Pradesh
Rajamahendravaram  Andhra Pradesh
Tirupati           Andhra Pradesh

Table E: State Data
State           Country
Andhra Pradesh  India
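
The decomposition into station, city, and state tables can be sketched in pandas as follows. The three rows are taken from the sample rows shown in 4.1. As an assumption of this sketch, the pollutant readings stay in a measurements table keyed by the full composite key, and latitude and longitude stay with the station; the table and variable names are illustrative only.

```python
import pandas as pd

# Three sample rows from section 4.1, in flat (unnormalised) form.
flat = pd.DataFrame({
    "country": ["India"] * 3,
    "state": ["Andhra Pradesh"] * 3,
    "city": ["Amaravati", "Rajamahendravaram", "Tirupati"],
    "station": [
        "Secretariat, Amaravati - APPCB",
        "Anand Kala Kshetram, Rajamahendravaram - APPCB",
        "Tirumala, Tirupati - APPCB",
    ],
    "last_update": ["2024-09-16 02:00:00"] * 3,
    "latitude": [16.515083, 16.987287, 13.670000],
    "longitude": [80.518167, 81.736318, 79.350000],
    "pollutant_id": ["PM10", "SO2", "NH3"],
    "pollutant_min": [40.0, 4.0, 1.0],
    "pollutant_max": [75.0, 18.0, 2.0],
    "pollutant_avg": [55.0, 11.0, 1.0],
})

# Composite primary key: it must identify every row uniquely.
key = ["station", "pollutant_id", "last_update"]
assert not flat.duplicated(subset=key).any()

# Split into one table per dependency.
measurements = flat[key + ["pollutant_min", "pollutant_max", "pollutant_avg"]]
stations = flat[["station", "city", "latitude", "longitude"]].drop_duplicates()
cities = flat[["city", "state"]].drop_duplicates()      # city -> state
states = flat[["state", "country"]].drop_duplicates()   # state -> country

# Joining the tables back together recovers the flat data (lossless split).
rebuilt = (
    measurements.merge(stations, on="station")
    .merge(cities, on="city")
    .merge(states, on="state")
)
print(len(states), len(rebuilt))
```

Checking that the joins reproduce the original rows is a quick way to confirm that no information was lost in the split.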
5) Statistical Analysis
5.1 T-Test for Pollutant Levels between Kerala and Delhi
• Objective: This test aims to determine if there is a significant
difference in average pollutant levels between Kerala and Delhi.
• Methodology: A two-sample t-test was performed on pollutant_avg
values from each state. The null hypothesis assumed no difference in
average pollution levels.
• Results: The t-test yielded a p-value of approximately 0.0042. Since
this is below the conventional 0.05 significance level, the difference
in average pollution levels between the two states is statistically
significant.
Output :
T-Test between Kerala and Delhi pollutant_avg p-value: 0.004206118877480918
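
A two-sample t-test of this kind can be run with `scipy.stats.ttest_ind`, as in the minimal sketch below. The sample values are hypothetical, and the pooled-variance setting (`equal_var=True`) is an assumption. The sketch also shows that with only two groups a one-way ANOVA gives the same p-value, because F equals t squared.

```python
from scipy import stats

# Hypothetical pollutant_avg samples; not values from the report's dataset.
kerala = [20.0, 25.0, 30.0, 22.0, 28.0]
delhi = [90.0, 120.0, 150.0, 110.0, 95.0]

# Two-sample t-test with pooled variance (equal_var=True is an assumption).
t_stat, p_value = stats.ttest_ind(kerala, delhi, equal_var=True)

# With two groups, one-way ANOVA is equivalent: F equals t squared.
f_stat, p_anova = stats.f_oneway(kerala, delhi)

print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```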

5.2 One-Way ANOVA for State Effect on Pollutant Levels

• Objective: To assess whether the state (Kerala or Delhi) has a
significant effect on pollutant levels.
• Methodology: A one-way ANOVA was conducted using the formula
pollutant_avg ~ C(state).
• Results: The ANOVA results indicated an F-statistic of 8.3338 and a
p-value of 0.0042, suggesting that state has a significant effect on
pollutant levels.
One-Way ANOVA Results
Source    DF     Sum of Squares  Mean Square  F-value  P-value
C(state)  1.0    14,764.4354     14,764.4354  8.3338   0.0042
Residual  270.0  478,341.3249    1,771.6345   NaN      NaN
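
The entries of the one-way ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short check below recomputes them from the reported sums of squares; the use of `scipy.stats.f.sf` for the p-value is this sketch's choice, not necessarily how the table was produced.

```python
from scipy import stats

# Values copied from the one-way ANOVA table above.
ss_state, df_state = 14764.4354, 1
ss_resid, df_resid = 478341.3249, 270

ms_state = ss_state / df_state    # mean square for the state effect
ms_resid = ss_resid / df_resid    # residual mean square
f_value = ms_state / ms_resid     # F = MS_state / MS_residual
p_value = stats.f.sf(f_value, df_state, df_resid)  # upper-tail probability

print(round(ms_resid, 4), round(f_value, 4), round(p_value, 4))
```

The p-value matches the t-test in 5.1, as expected when a one-way ANOVA compares only two groups.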

5.3 Two-Way ANOVA for State and Pollutant Type

• Objective: To examine the combined effect of state and pollutant_id
on pollutant_avg.
• Methodology: A two-way ANOVA was conducted to test for both
main effects and interaction effects.
• Results: Both main effects are significant, and the interaction
between state and pollutant type is also significant (F = 4.0462,
p = 6.787e-04), so the difference between the two states depends on
the pollutant being measured.
Two-Way ANOVA Results
Source                    DF     Sum of Squares  Mean Square  F-value  P-value
C(state)                  1.0    14,764.4354     14,764.4354  23.7079  1.956e-06
C(pollutant_id)           6.0    302,549.2657    50,424.8776  80.9695  1.752e-56
C(state):C(pollutant_id)  6.0    15,119.0543     2,519.8424   4.0462   6.787e-04
Residual                  258.0  160,673.0050    622.7636     NaN      NaN

5.4 Proportion Z-Test for High Pollution Levels

• Objective: This test was conducted to assess the proportion of
high-pollution observations (pollutant levels > 50) in the dataset.
• Methodology: A z-test for proportions was applied, comparing the
observed high pollution proportion against an expected value (e.g.,
0.5).
• Results: The z-test returned a p-value of approximately 2.06e-05,
showing that the observed high pollution rate differs significantly
from the assumed rate.
Output :
Proportion Z-Test for high pollution level (> 50) p-value: 2.059641982631519e-05
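
The one-sample z-test for a proportion can be written out directly, as in the sketch below. The counts are hypothetical, and the standard error is computed under the null proportion `p0`, which is an assumption; some library implementations use the sample proportion instead and give slightly different values.

```python
import math

def one_sample_proportion_ztest(successes, n, p0):
    """Return (z, two-sided p-value) for H0: true proportion == p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 40 high-pollution records out of 100, tested against 0.5.
z, p = one_sample_proportion_ztest(successes=40, n=100, p0=0.5)
print(f"z = {z:.3f}, p = {p:.4f}")
```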
5.5 Chi-Square Test for Independence between Pollutant Type and State
• Objective: This test assesses whether pollutant types are
independent of states, examining if certain pollutants are more
prevalent in Kerala or Delhi.
• Methodology: A chi-square test was performed on a contingency
table of pollutant_id and state.
• Results: The chi-square test yielded a p-value of approximately 0.996,
indicating no significant association between pollutant type and
state: the mix of pollutant types recorded is essentially the same in
Kerala and Delhi.
Output :
Chi-Square Test for independence between pollutant_id and state p-value: 0.9959679989232744
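
A chi-square test of independence starts from a contingency table of counts, which `pd.crosstab` can build. In the hypothetical sample below, every pollutant is recorded equally often within each state, so the observed counts equal the expected counts and the test finds no association, which mirrors the very high p-value reported above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical records: each state reports every pollutant equally often.
rows = []
for state, n_records in [("Delhi", 4), ("Kerala", 2)]:
    for pollutant in ["PM10", "SO2", "NH3"]:
        rows += [{"state": state, "pollutant_id": pollutant}] * n_records
df = pd.DataFrame(rows)

# Contingency table: pollutant types as rows, states as columns.
table = pd.crosstab(df["pollutant_id"], df["state"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```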

6) Data Visualisation
6.1) Boxplot for Pollutant Levels Across Kerala and Delhi
Objective: Visualize the distribution of average pollutant levels across
Kerala and Delhi
• Number of Variables: 2 (state, pollutant_avg)
• Type of Relation: Categorical vs Numerical
• Type of Plot Selected: Box Plot
This box plot compares the average pollutant levels in Delhi and Kerala.
Here’s a breakdown of the information:
1. Delhi:
o The median pollutant level in Delhi is higher than in Kerala.
o There is a larger spread of data, with more variability in
pollutant levels.
o The box has a wider interquartile range (IQR), indicating more
variation in the middle 50% of pollutant levels.
o There are several outliers, with values significantly above the
rest of the data, reaching over 250.
2. Kerala:
o Kerala has a lower median pollutant level compared to Delhi.
o The data is less variable, as seen from a narrower IQR.
o There are no extreme outliers like in Delhi.
Inference: Pollution levels in Delhi are generally higher and more
variable than in Kerala, with some exceptionally high readings in Delhi.
This suggests that Delhi may experience more severe pollution events
or fluctuations in pollution levels compared to Kerala.
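
The quantities read off a box plot (median, interquartile range, and outliers) can also be computed directly. The sketch below uses made-up readings and assumes the conventional rule that points more than 1.5 × IQR beyond the quartiles count as outliers.

```python
import pandas as pd

# Made-up readings for illustration only.
df = pd.DataFrame({
    "state": ["Delhi"] * 6 + ["Kerala"] * 5,
    "pollutant_avg": [30, 50, 70, 90, 110, 300, 10, 15, 20, 25, 30],
})

def box_stats(values):
    """Median, IQR, and outlier count under the 1.5 * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = int(((values < lower) | (values > upper)).sum())
    return pd.Series({"median": values.median(), "iqr": iqr, "outliers": n_out})

# One row of box-plot statistics per state.
summary = pd.DataFrame(
    {state: box_stats(group) for state, group in df.groupby("state")["pollutant_avg"]}
).T
print(summary)
```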

6.2) Histogram of Pollutant Levels for Kerala and Delhi

Objective: Compare the distribution of average pollutant levels between
Kerala and Delhi
• Number of Variables: 2 (state, pollutant_avg)
• Type of Relation: Categorical vs Numerical
• Type of Plot Selected: Histogram (stacked) with color distinction by
state
This histogram displays the frequency distribution of pollutant levels for
Kerala and Delhi, providing insights into the differences in pollution
levels between the two regions.
Observations:
1. Delhi:
o The green bars, representing Delhi, show a wide distribution of
pollutant levels, with a significant number of higher pollutant
readings compared to Kerala.
o Most of the Delhi data is concentrated below 100, with a
notable peak around the lower pollutant levels.
o There are also a few instances where the pollutant levels
exceed 150, even going above 250, which is consistent with
the outliers observed in the previous box plot.
2. Kerala:
o The orange bars, representing Kerala, are concentrated at the
lower end of the pollutant levels, primarily under 50.
o There are fewer high-pollutant readings in Kerala, and the
distribution is more compact, suggesting that extreme
pollution events are less frequent or severe compared to Delhi.
Inference:
This histogram confirms that pollutant levels in Delhi tend to be both
higher and more variable than in Kerala, with Delhi experiencing more
frequent and intense pollution episodes. In contrast, Kerala generally
maintains lower pollutant levels, with most readings concentrated below
50, indicating comparatively better air quality.

6.3) Distribution Plot of Pollutant Levels in Kerala and Delhi

Objective: Visualize the density of average pollutant levels for each state
• Number of Variables: 2 (state, pollutant_avg)
• Type of Relation: Categorical vs Numerical
• Type of Plot Selected: KDE plot (Kernel Density Estimation)
This graph compares the distribution of pollutant levels between Kerala
and Delhi. Here are the key insights:
1. Pollution Levels in Kerala vs. Delhi:
o Kerala (blue distribution) has lower pollutant levels compared
to Delhi (orange distribution).
o The peak (mode) for Kerala occurs at a much lower pollutant
level, around 20-30, whereas Delhi has a wider and flatter
distribution.
2. Spread and Skewness:
o The distribution of pollution in Delhi is more spread out,
indicating higher variability in pollutant levels. It has a long tail
extending towards higher pollution values, showing that Delhi
experiences extreme pollution levels more frequently than
Kerala.
o Kerala's pollutant levels are more concentrated, showing less
variation and generally lower pollution levels.
3. Overlap:
o There is some overlap between the two regions, suggesting
that Delhi and Kerala experience similar pollutant levels in a
certain range (roughly between 0 and 50), but beyond that,
Delhi's pollution levels extend significantly higher.
In summary, the graph suggests that Delhi experiences much higher and
more variable pollution compared to Kerala, with Kerala having more
moderate and less dispersed pollutant levels.
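
The density curves in a KDE plot can be reproduced numerically with `scipy.stats.gaussian_kde`. The sketch below uses hypothetical readings and the default bandwidth, both of which are assumptions, and locates the peak (mode) of each estimated density on a grid.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical pollutant_avg readings for each state.
kerala = np.array([18, 20, 22, 25, 27, 30, 24, 21], dtype=float)
delhi = np.array([40, 60, 80, 120, 150, 200, 95, 70], dtype=float)

grid = np.linspace(0, 300, 601)    # pollutant levels at which to evaluate
kde_kerala = gaussian_kde(kerala)  # default bandwidth (Scott's rule)
kde_delhi = gaussian_kde(delhi)

# The grid point with the highest estimated density approximates the mode.
peak_kerala = grid[np.argmax(kde_kerala(grid))]
peak_delhi = grid[np.argmax(kde_delhi(grid))]
print(f"Kerala peak near {peak_kerala:.1f}, Delhi peak near {peak_delhi:.1f}")
```

A narrow, tall curve corresponds to concentrated readings, while a wide, flat curve corresponds to high variability.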

6.4) Heatmap of the Correlation Matrix for Numeric Variables

Objective: Understand the correlation between numeric variables in the
dataset
• Number of Variables: Multiple (all numeric columns)
• Type of Relation: Numerical vs Numerical
• Type of Plot Selected: Heatmap
This image shows a correlation matrix of numeric variables, with correlation
values color-coded. Here's an interpretation of the matrix:
1. Latitude and Longitude:
o Latitude and longitude have a positive correlation (0.76),
indicating that the two variables have some linear relationship,
though it is not perfect.
2. Pollutant Variables (min, max, avg):
o The pollutant-related variables (minimum, maximum, and
average pollutant levels) are highly correlated with each other:
▪ Pollutant min vs. pollutant max has a correlation of 0.82.
▪ Pollutant min vs. pollutant avg has a correlation of 0.90.
▪ Pollutant max vs. pollutant avg has a very high
correlation of 0.95.
o This indicates that areas experiencing higher minimum
pollutant levels also tend to have higher maximum and
average pollutant levels, showing strong internal consistency
among pollutant measures.
3. Latitude and Longitude vs. Pollutant Levels:
o Latitude and longitude show low correlations with pollutant
levels (correlation values range from 0.06 to 0.20), meaning
there is very little linear relationship between geographic
coordinates and pollution metrics in this dataset.
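
The matrix behind such a heatmap is a table of pairwise Pearson correlation coefficients, which pandas computes with `DataFrame.corr`. The values below are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns named after the dataset's attributes.
df = pd.DataFrame({
    "latitude": [16.5, 14.7, 13.2, 17.0, 13.7],
    "pollutant_min": [5.0, 13.0, 22.0, 40.0, 1.0],
    "pollutant_max": [14.0, 35.0, 57.0, 75.0, 2.0],
    "pollutant_avg": [9.0, 22.0, 37.0, 55.0, 1.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to any heatmap plotting function for display.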

6.5) Bar Plot of High Pollution Proportion by State

Objective: Compare the proportion of high pollution levels across states
• Number of Variables: 2 (state, high_pollution)
• Type of Relation: Categorical vs Numerical
• Type of Plot Selected: Bar plot
This bar chart shows the proportion of high pollution levels (where the
pollutant average is greater than 50) for two states: Delhi and Kerala.
Here’s an analysis:
1. Delhi:
o A large proportion (over 40%) of the data points for Delhi
have high pollution levels (pollutant average > 50). This
highlights that Delhi frequently experiences poor air quality.
2. Kerala:
o In contrast, only a small proportion (less than 10%) of the data
points for Kerala have high pollution levels. This suggests that
Kerala experiences relatively better air quality, with significantly
fewer instances of high pollution.
6.6) Andhra Pradesh State Data Analysis and Visualization
Objective: Distribution of Average Pollutant Levels
• Number of Variables: 1 (pollutant_avg)
• Type of Relation: Numerical
• Type of Plot Selected: Histogram with KDE (Kernel Density Estimation)

This histogram shows the distribution of average pollutant levels, with

frequency on the y-axis and average pollutant level on the x-axis. The
distribution appears to be right-skewed, meaning that most pollutant
levels are clustered toward lower values, with a long tail extending to
higher pollutant levels.
Key inferences include:
1. Concentration of Lower Levels: A significant number of observations
fall below 50 on the pollutant scale, indicating that lower pollutant
levels are much more common in the dataset.
2. Fewer High Pollutant Levels: As pollutant levels increase, their
frequency drops sharply, indicating that extremely high pollutant
levels are rare.

Objective: Average Pollutant Levels by Station

• Number of Variables: 2 (station, pollutant_avg)
• Type of Relation: Categorical–Numerical
• Type of Plot Selected: Box Plot
This plot displays the average pollutant levels recorded at various
stations, with each station represented on the x-axis and the pollutant
levels on the y-axis. Here are some key insights:
1. Wide Variation Across Stations: There is significant variability in
pollutant levels across different stations, with most values clustering
below 100.
2. High Outliers: A few stations have extreme pollutant levels reaching
as high as around 400, which stand out from the general
distribution.
3. Frequent Low Levels: Most pollutant levels are relatively low,
consistent with the histogram above. This reinforces the trend that
while high pollutant levels are observed, they are rare.
4. Dense Cluster at the Bottom: The majority of data points are
concentrated below 50, indicating that low pollutant levels are
typical across most stations.
Objective: Comparison of Average Pollutant Levels by Pollutant Type
• Number of Variables: 2 (pollutant_id, pollutant_avg)
• Type of Relation: Categorical–Numerical
• Type of Plot Selected: Bar Plot

This bar chart shows the average pollutant levels by pollutant type, with
pollutant types on the x-axis and their corresponding average levels on the
y-axis. Here are some observations:
1. PM10 and PM2.5 Have the Highest Levels: Particulate matter (PM10
and PM2.5) shows the highest average levels among the pollutants,
with PM10 being the highest. This suggests that particulate pollution
is a significant concern in this dataset.
2. Low Levels for SO2 and NH3: Sulfur dioxide (SO2) and ammonia (NH3)
have the lowest average levels, indicating that these pollutants are
less prevalent compared to others.
3. Moderate Levels for CO and NO2: Carbon monoxide (CO) and
nitrogen dioxide (NO2) show moderate average levels, falling
between the extremes observed for particulate matter and gases like
SO2 and NH3.
4. Variation in Pollutant Types: The variation in average levels
highlights the differing impacts and sources of each pollutant type,
with particulate matter likely originating from sources like
construction, traffic, and industrial activities, while gases may come
from combustion processes and other emissions.

Objective: Comparison of Min, Max, and Avg Pollutant Levels by Type

• Number of Variables: 4 (pollutant_id, pollutant_min, pollutant_max,
pollutant_avg)
• Type of Relation: Categorical–Numerical with Multiple Metrics
• Type of Plot Selected: Grouped Bar Plot

This bar chart displays the minimum, maximum, and average levels of
various pollutants. Here’s a breakdown of the key observations:
1. PM10 and PM2.5: These particulate matters have the highest
maximum levels among all pollutants, with PM10 reaching the
highest maximum level close to 90, while PM2.5 is also notably high.
The average and minimum values for these pollutants are also
relatively high, indicating frequent high concentrations.
2. CO: Carbon monoxide has a moderately high maximum level, though
its average and minimum values are lower compared to PM10 and
PM2.5, suggesting occasional spikes in concentration.
3. Ozone: Ozone shows a lower range overall compared to PM10,
PM2.5, and CO but still has a distinguishable peak at the maximum
level.
4. NO2 and SO2: These pollutants have lower levels across all metrics
(min, max, and average) compared to the others, implying generally
lower presence in the monitored data.
5. NH3: Ammonia has consistently low values across minimum,
maximum, and average levels, indicating it is the least present
pollutant in this data set.
Overall, PM10 and PM2.5 are the most prominent pollutants with
significant variability, while NH3 shows consistently low levels.

Objective: Average Pollutant Levels by Pollutant Type

• Number of Variables: 1 (pollutant_id)
• Type of Relation: Categorical
• Type of Plot Selected: Pie Chart
This pie chart illustrates the average pollutant levels by type, as a
percentage of the total pollutant levels. Here’s a summary of the key
points:
1. PM10: This pollutant contributes the most to the total average
levels, accounting for 31.0% of the overall pollutant presence. This
reinforces that PM10 is a major pollutant in this data set.
2. PM2.5: The second-highest contributor at 20.9%, PM2.5 is another
significant pollutant, indicating a substantial presence of particulate
matter in the atmosphere.
3. CO (Carbon Monoxide): With 18.0%, CO also contributes notably to
the overall pollutant levels, though it is well below PM10 and
slightly below PM2.5.
4. Ozone: Ozone accounts for 11.6%, showing a moderate contribution
compared to other pollutants.
5. NO2: Nitrogen dioxide contributes 9.7%, which is moderate but less
than CO and particulate matter pollutants.
6. SO2 (Sulfur Dioxide): At 6.6%, SO2 has a relatively minor
contribution compared to the primary pollutants.
7. NH3 (Ammonia): NH3 has the lowest average presence, at just 2.2%,
indicating it is the least prevalent pollutant in this data set.

2 pages
STAT5002 Midterm Review Solutions N
No ratings yet
STAT5002 Midterm Review Solutions N
8 pages
KNIME Guide - Week 2
No ratings yet
KNIME Guide - Week 2
2 pages
Pds Question Bank
No ratings yet
Pds Question Bank
5 pages
6.2 Measures of Spread Updated
No ratings yet
6.2 Measures of Spread Updated
2 pages
6 Box and Whisker Notes Handout
No ratings yet
6 Box and Whisker Notes Handout
2 pages
Probability
No ratings yet
Probability
4 pages
QBM101 Tutorial Module 1 Appendix
No ratings yet
QBM101 Tutorial Module 1 Appendix
3 pages
Cody's Data Cleaning Techniques Using SAS, Third Edition
From Everand
Cody's Data Cleaning Techniques Using SAS, Third Edition
Ron Cody
4.5/5 (3)
A General Introduction to Data Analytics
From Everand
A General Introduction to Data Analytics
João Moreira
No ratings yet

6.1) Boxplot for Pollutant Levels Across Kerala and Delhi 13

6.2) Histogram of Pollutant Levels for Kerala and Delhi 14

6.3) Distribution Plot of Pollutant Levels in Kerala and Delhi 15

6.4) Heatmap of the Correlation Matrix for Numeric Variables 16

6.5) Bar Plot of High Pollution Proportion by State 17

1 Country Nominal Data India

2 State Nominal Data Andhra Pradesh

7 longitude Continuous Data 80.518167

8 pollutant_id Nominal Data PM10, SO2, NH3

4 .rename(columns) Renames columns for consistency (e.g., pollutant_id to

8 .str.strip() Removes extra spaces or special characters from string
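The two methods listed above can be sketched as follows. This is a minimal illustration that assumes pandas is installed. The sample frame and the new column name `pollutant` are hypothetical, because the report's own rename target is not shown here.

```python
import pandas as pd

# Hypothetical sample rows; the values deliberately carry stray whitespace.
raw = pd.DataFrame({
    "state": [" Kerala", "Delhi "],
    "pollutant_id": ["PM10 ", " SO2"],
})

# .rename(columns=...) takes a mapping of old name -> new name.
# The new name "pollutant" is an assumption made for illustration only.
cleaned = raw.rename(columns={"pollutant_id": "pollutant"})

# .str.strip() with no argument removes leading and trailing whitespace.
# Passing a string, e.g. .str.strip("*"), strips those characters instead.
for col in ["state", "pollutant"]:
    cleaned[col] = cleaned[col].str.strip()

print(cleaned)
```

Note that `.str.strip()` only touches the ends of each string; characters in the middle of a value are left alone.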

3) Data Wrangling and Descriptive Statistics

Count 3229.000 3229.000 3229.000 3229.000 3229.000

This table complies with 1NF by ensuring:

4.2 Second Normal Form (2NF)

Gulzarpet, Anantapur - APPCB SO2 2024-09-16 02:00:00

Gangineni Cheruvu, Chittoor - APPCB PM10 2024-09-16 02:00:00

Anand Kala Kshetram, Rajamahendravaram - APPCB SO2 2024-09-16 02:00:00

Tirumala, Tirupati - APPCB NH3 2024-09-16 02:00:00

Table B: Pollutant Data

PM10 2024-09-16 40.0 75.0 55.0

Tirumala, Tirupati - APPCB Tirupati 13.670000 79.350000

4.3 Third Normal Form (3NF)

Gulzarpet, Anantapur - APPCB SO2 2024-09-16 02:00:00

Gangineni Cheruvu, Chittoor - APPCB PM10 2024-09-16 02:00:00

Anand Kala Kshetram, Rajamahendravaram - APPCB SO2 2024-09-16 02:00:00

Tirumala, Tirupati - APPCB NH3 2024-09-16 02:00:00

PM10 2024-09-16 02:00:00 40.0 75.0 55.0

SO2 2024-09-16 02:00:00 NaN NaN NaN

NH3 2024-09-16 02:00:00 3.0 2.0 1.0

Table C: Station Data

Tirumala, Tirupati - APPCB Tirupati

Table D: City Data

Table E: State Data
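The split into station, city, and state tables can be sketched with pandas as below. This is an illustration under assumptions, not the report's actual code: the column names are hypothetical, the cities for the first two stations are inferred from the station names, and only a few sample rows are used.

```python
import pandas as pd

# Illustrative flat rows. Station names and pollutants follow the sample
# rows shown earlier; the city, state, and country columns are assumed.
flat = pd.DataFrame({
    "station": [
        "Gulzarpet, Anantapur - APPCB",
        "Gangineni Cheruvu, Chittoor - APPCB",
        "Tirumala, Tirupati - APPCB",
    ],
    "city": ["Anantapur", "Chittoor", "Tirupati"],
    "state": ["Andhra Pradesh"] * 3,
    "country": ["India"] * 3,
    "pollutant_id": ["SO2", "PM10", "NH3"],
})

# Each table keeps one fact per row: station -> city, city -> state,
# state -> country. drop_duplicates() removes the repeated pairs.
station_table = flat[["station", "city"]].drop_duplicates().reset_index(drop=True)
city_table = flat[["city", "state"]].drop_duplicates().reset_index(drop=True)
state_table = flat[["state", "country"]].drop_duplicates().reset_index(drop=True)

# Joining the three tables back together recovers the location columns,
# which shows the decomposition loses no information.
rebuilt = station_table.merge(city_table, on="city").merge(state_table, on="state")
print(rebuilt)
```

Because the state and country now live in a single row of the state table, a correction to either value is made in one place rather than on every measurement row.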

5.2 One-Way ANOVA for State Effect on Pollutant Levels

5.3 Two-Way ANOVA for State and Pollutant Type

C(state) 1.0 14,764.4354 14,764.4354 23.7079 1.956e-06

C(pollutant_id) 6.0 302,549.2657 50,424.8776 80.9695 1.752e-56

C(state):C(pollutant_id) 6.0 15,119.0543 2,519.8424 4.0462 6.787e-04
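The rows above can be checked with a little arithmetic. The sketch below assumes the columns are, in order, degrees of freedom, sum of squares, mean square, F statistic, and p-value; that ordering is an assumption, although the printed numbers are consistent with it.

```python
# (df, sum_sq, F) for the three rows printed above, in the same order.
rows = [
    (1.0, 14764.4354, 23.7079),
    (6.0, 302549.2657, 80.9695),
    (6.0, 15119.0543, 4.0462),
]

implied_residual_ms = []
for df, sum_sq, f_stat in rows:
    # Mean square is the sum of squares divided by its degrees of freedom.
    mean_sq = sum_sq / df
    # F is the mean square divided by the residual mean square, so
    # dividing by F recovers the residual mean square each row implies.
    residual_ms = mean_sq / f_stat
    implied_residual_ms.append(residual_ms)
    print(f"df={df:.0f}  mean_sq={mean_sq:.4f}  implied residual MS={residual_ms:.2f}")
```

All three rows imply nearly the same residual mean square, which is what one would expect if they come from the same fitted model.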

5.4 Proportion Z-Test for High Pollution Levels

6.2) Histogram of Pollutant Levels for Kerala and Delhi

6.3) Distribution Plot of Pollutant Levels in Kerala and Delhi

6.4) Heatmap of the Correlation Matrix for Numeric Variables

6.5) Bar Plot of High Pollution Proportion by State

This histogram shows the distribution of average pollutant levels, with

Objective: Average Pollutant Levels by Station

Objective: Comparison of Min, Max, and Avg Pollutant Levels by Type

Objective: Average Pollutant Levels by Pollutant Type
