DSA Final Project Report
Submitted by
BHUKYA INDU (B230880EC)
M.VIDYADHARI (B231051EC)
DEPARTMENT OF
ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY CALICUT
NOVEMBER 2024
TABLE OF CONTENTS
1. Description of Data
1.1 Data Source
1.2 Data Attributes
2. Data Transformation
3. Data Wrangling and Descriptive Statistics
3.1 Data Loading and Initial Inspection
3.2 Filtering Data
3.3 Handling Missing Values
3.4 Feature Engineering
3.5 Descriptive Statistics
3.6 Visualization
4. Data Normalisation
4.1 First Normal Form (1NF)
4.2 Second Normal Form (2NF)
4.3 Third Normal Form (3NF)
5. Statistical Analysis
5.1 T-Test for Pollutant Levels between Kerala and Delhi
5.2 One-Way ANOVA for State Effect on Pollutant Levels
5.3 Two-Way ANOVA for State and Pollutant Type
5.4 Proportion Z-Test for High Pollution Levels
5.5 Chi-Square for Independence between Pollutant and State
6. Data Visualisation
After transformation, the dataset was uniform and ready for analysis.
Specifically, the operations ensured that all timestamps were in a
consistent format, redundant records were removed, and missing pollutant
values were either filled or excluded. This cleaned and structured data
allowed for effective grouping and visualization, enabling meaningful
insights into air quality trends across various locations.
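The three operations described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual script: the sample rows, the day-first raw timestamp format, and the helper name clean are assumptions, and missing readings are excluded here although filling them is the alternative the text mentions.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the dataset description.
raw = pd.DataFrame({
    "station": ["A", "A", "B", "C"],
    "pollutant_id": ["PM10", "PM10", "SO2", "NH3"],
    "last_update": ["16-09-2024 02:00:00"] * 4,
    "pollutant_min": [40.0, 40.0, None, 1.0],
    "pollutant_max": [75.0, 75.0, None, 2.0],
    "pollutant_avg": [55.0, 55.0, None, 1.0],
})

POLLUTANT_COLS = ["pollutant_min", "pollutant_max", "pollutant_avg"]


def clean(df):
    """Hypothetical helper: unify timestamps, drop duplicates and empty readings."""
    out = df.copy()
    # Parse timestamps into a single datetime dtype
    # (the day-first raw format is an assumption).
    out["last_update"] = pd.to_datetime(
        out["last_update"], format="%d-%m-%Y %H:%M:%S"
    )
    # Remove exact duplicate records.
    out = out.drop_duplicates()
    # Exclude rows that carry no pollutant reading at all.
    out = out.dropna(subset=POLLUTANT_COLS)
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

With the sample above, the duplicated PM10 row and the SO2 row without readings are dropped, leaving two records.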
Dataset Information:
Column          Non-Null Count  Dtype
country         3229            object
state           3229            object
city            3229            object
station         3229            object
last_update     3229            object
latitude        3229            float64
longitude       3229            float64
pollutant_id    3229            object
pollutant_min   2998            float64
pollutant_max   2998            float64
pollutant_avg   2998            float64
Descriptive Statistics In a Table
Statistic Latitude Longitude Pollutant Min Pollutant Max Pollutant Avg
4) Data Normalisation
4.1 First Normal Form (1NF)
To ensure the dataset adheres to First Normal Form (1NF), the following
transformations were applied:
1. Atomic Values: Each field contains only a single value. For example,
pollutant_id holds one pollutant type per entry (e.g., PM10 or SO2).
2. No Repeating Groups: No cell holds more than one value, and no
attribute is repeated across multiple columns.
3. Unique Records: Duplicate entries were removed using
.drop_duplicates() to ensure every row is unique.
4. Composite Primary Key: Since no single attribute uniquely identifies a
record, a composite primary key was created using the combination
of station, pollutant_id, and last_update. This ensures the uniqueness
of each entry.
Station                                         Pollutant  Last Update          Latitude   Longitude  Pollutant Min  Pollutant Max  Pollutant Avg
Secretariat, Amaravati - APPCB                  PM10       2024-09-16 02:00:00  16.515083  80.518167  40.0           75.0           55.0
Gulzarpet, Anantapur - APPCB                    SO2        2024-09-16 02:00:00  14.675886  77.593027  NaN            NaN            NaN
Gangineni Cheruvu, Chittoor - APPCB             PM10       2024-09-16 02:00:00  13.204880  79.097889  NaN            NaN            NaN
Anand Kala Kshetram, Rajamahendravaram - APPCB  SO2        2024-09-16 02:00:00  16.987287  81.736318  4.0            18.0           11.0
Tirumala, Tirupati - APPCB                      NH3        2024-09-16 02:00:00  13.670000  79.350000  1.0            2.0            1.0
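The uniqueness requirements of 1NF can be checked with a short pandas sketch. It assumes the column names from the dataset description, reuses two rows from the table, and adds one deliberately duplicated row (hypothetical) to show the effect of .drop_duplicates(). It illustrates the check and is not the project's own code.

```python
import pandas as pd

# Composite primary key named in the 1NF discussion.
KEY = ["station", "pollutant_id", "last_update"]

# Two rows from the table plus one hypothetical exact duplicate.
df = pd.DataFrame({
    "station": [
        "Secretariat, Amaravati - APPCB",
        "Secretariat, Amaravati - APPCB",
        "Tirumala, Tirupati - APPCB",
    ],
    "pollutant_id": ["PM10", "PM10", "NH3"],
    "last_update": ["2024-09-16 02:00:00"] * 3,
    "pollutant_avg": [55.0, 55.0, 1.0],
})

# Remove exact duplicate rows.
df = df.drop_duplicates()

# The composite key is valid only if no two remaining rows share it.
key_is_unique = not df.duplicated(subset=KEY).any()
print(len(df), key_is_unique)
```

If key_is_unique were False after deduplication, two rows would share a station, pollutant, and timestamp while differing in their readings, and the composite key would need a further attribute.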
6) Data Visualisation
6.1) Boxplot for Pollutant Levels Across Kerala and Delhi
Objective: Visualize the distribution of average pollutant levels in
Kerala and Delhi.
• Number of Variables: 2 (state, pollutant_avg)
• Type of Relation: Categorical vs Numerical
• Type of Plot Selected: Box Plot
This box plot compares the average pollutant levels in Delhi and Kerala.
Here’s a breakdown of the information:
1. Delhi:
o The median pollutant level in Delhi is higher than in Kerala.
o There is a larger spread of data, with more variability in
pollutant levels.
o The box has a wider interquartile range (IQR), indicating more
variation in the middle 50% of pollutant levels.
o There are several outliers, with values significantly above the
rest of the data, reaching over 250.
2. Kerala:
o Kerala has a lower median pollutant level compared to Delhi.
o The data is less variable, as seen from a narrower IQR.
o There are no extreme outliers like in Delhi.
Inference: Pollution levels in Delhi are generally higher and more
variable than in Kerala, with some exceptionally high readings in Delhi.
This suggests that Delhi may experience more severe pollution events
or fluctuations in pollution levels compared to Kerala.
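The quantities read off the box plot (median, IQR, and outliers) can be computed directly. The sketch below uses only the Python standard library and the conventional rule that flags points lying more than 1.5 × IQR beyond the quartiles. The readings are hypothetical, chosen only to mirror the pattern described above, and box_stats is an illustrative helper name.

```python
from statistics import quantiles


def box_stats(values):
    """Return the median, IQR, and outliers under the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between data points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"median": q2, "iqr": iqr, "outliers": outliers}


# Hypothetical pollutant_avg readings, not taken from the dataset.
delhi = [40, 60, 80, 100, 120, 140, 260]
kerala = [10, 15, 20, 25, 30]

delhi_stats = box_stats(delhi)
kerala_stats = box_stats(kerala)
print(delhi_stats)
print(kerala_stats)
```

For these sample values Delhi has the higher median and the wider IQR, and only the reading of 260 falls outside the upper fence, which matches the qualitative reading of the plot.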
This bar chart shows the average pollutant levels by pollutant type, with
pollutant types on the x-axis and their corresponding average levels on the
y-axis. Here are some observations:
1. PM10 and PM2.5 Have the Highest Levels: Particulate matter (PM10
and PM2.5) shows the highest average levels among the pollutants,
with PM10 being the highest. This suggests that particulate pollution
is a significant concern in this dataset.
2. Low Levels for Ozone and NH3: Ozone and NH3 (ammonia) have the
lowest average levels, indicating that these pollutants are less
prevalent compared to others.
3. Moderate Levels for CO and NO2: Carbon monoxide (CO) and
nitrogen dioxide (NO2) show moderate average levels, falling
between the extremes observed for particulate matter and gases like
ozone and NH3.
4. Variation in Pollutant Types: The variation in average levels
highlights the differing impacts and sources of each pollutant type,
with particulate matter likely originating from sources like
construction, traffic, and industrial activities, while gases may come
from combustion processes and other emissions.
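The values behind a bar chart of this kind are a per-pollutant mean of pollutant_avg. A minimal pandas sketch follows, assuming the column names from the dataset description and using hypothetical readings. Missing values are skipped by mean() by default.

```python
import pandas as pd

# Hypothetical readings; None stands for a missing measurement.
df = pd.DataFrame({
    "pollutant_id": ["PM10", "PM10", "PM2.5", "NH3", "NH3", "CO"],
    "pollutant_avg": [90.0, 70.0, 60.0, 4.0, None, 30.0],
})

# Mean of pollutant_avg per pollutant type, highest first.
avg_by_pollutant = (
    df.groupby("pollutant_id")["pollutant_avg"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_pollutant)
```

Sorting in descending order puts the dominant pollutants first, which makes the ranking discussed in the observations easy to read off.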
This bar chart displays the minimum, maximum, and average levels of
various pollutants. Here’s a breakdown of the key observations:
1. PM10 and PM2.5: These particulate pollutants have the highest
maximum levels among all pollutants, with PM10 reaching the
highest maximum level close to 90, while PM2.5 is also notably high.
The average and minimum values for these pollutants are also
relatively high, indicating frequent high concentrations.
2. CO: Carbon monoxide has a moderately high maximum level, though
its average and minimum values are lower compared to PM10 and
PM2.5, suggesting occasional spikes in concentration.
3. Ozone: Ozone shows a lower range overall compared to PM10,
PM2.5, and CO but still has a distinguishable peak at the maximum
level.
4. NO2 and SO2: These pollutants have lower levels across all metrics
(min, max, and average) compared to the others, implying generally
lower presence in the monitored data.
5. NH3: Ammonia has consistently low values across minimum,
maximum, and average levels, indicating it is the least present
pollutant in this dataset.
Overall, PM10 and PM2.5 are the most prominent pollutants with
significant variability, while NH3 shows consistently low levels.
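A grouped summary like the one in this chart can be produced by averaging the three pollutant columns per pollutant type. The sketch below is an assumption about how the chart was built, since the report does not state the aggregation used. The readings are hypothetical, and the range column is an added illustration of the variability mentioned above.

```python
import pandas as pd

COLS = ["pollutant_min", "pollutant_max", "pollutant_avg"]

# Hypothetical readings for two pollutant types.
df = pd.DataFrame({
    "pollutant_id": ["PM10", "PM10", "NH3", "NH3"],
    "pollutant_min": [40.0, 20.0, 1.0, 3.0],
    "pollutant_max": [100.0, 80.0, 2.0, 6.0],
    "pollutant_avg": [70.0, 50.0, 1.0, 5.0],
})

# Mean of each metric per pollutant type (assumed aggregation).
summary = df.groupby("pollutant_id")[COLS].mean()

# Gap between mean maximum and mean minimum as a rough variability measure.
summary["range"] = summary["pollutant_max"] - summary["pollutant_min"]
print(summary)
```

With matplotlib installed, summary[COLS].plot(kind="bar") would draw the three metrics side by side for each pollutant.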