
1. Exploring Data Science Tools: NumPy, SciPy, Jupyter, Statsmodels, and Pandas

AIM:
Download, install, and explore the features of NumPy, SciPy, Jupyter, Statsmodels, and Pandas
packages. These libraries are essential tools for scientific computing and data analysis in
Python, offering functionalities such as numerical operations, statistical analysis, interactive
computing environments, advanced statistical modeling, and powerful data manipulation and
analysis capabilities.
PROCEDURE:
Step 1: Start
Step 2: Open Google Colab
Step 3: To download and install the packages, use the following commands
1. pip install numpy
2. pip install scipy
3. pip install jupyter
4. pip install statsmodels
5. pip install pandas
Step 4: Stop
FEATURES:
1. NumPy:
NumPy is a powerful numerical computing library in Python. It provides support for large,
multi-dimensional arrays and matrices, along with a collection of mathematical functions to
operate on these elements. Here are some key aspects of NumPy:
Arrays:
The fundamental data structure in NumPy is the array. Arrays in NumPy are similar to Python
lists, but they allow for more efficient manipulation of numerical data. Arrays can be one-
dimensional, like a list, or multi-dimensional, like a matrix.
Vectorized Operations:
NumPy enables vectorized operations, meaning you can perform operations on entire arrays at
once without using explicit loops. This makes it significantly faster and more efficient than
traditional Python lists when dealing with numerical data.

Out[1]:
Original array: [1 2 3 4 5]
Sum of array elements: 15
Mean of array elements: 3.0
Maximum element in the array: 5
Minimum element in the array: 1
Reshaped array: [[1 2 3 4 5]]
2D array:
[[1 2 3]
 [4 5 6]]
Sum of elements along rows: [ 6 15]
Sum of elements along columns: [5 7 9]
Mathematical Functions:
NumPy provides a wide range of mathematical functions that operate element-wise on arrays.
These include basic arithmetic operations, trigonometric functions, logarithmic functions, and
more.
Broadcasting:
Broadcasting is a powerful feature in NumPy that allows for arithmetic operations between
arrays of different shapes and sizes. NumPy automatically broadcasts the smaller array to the
shape of the larger one, making it easier to perform operations on arrays of different
dimensions.
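For instance (a small illustrative snippet; the values are arbitrary), a one-dimensional array is broadcast across each row of a two-dimensional array:
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
print(matrix + row)   # row is stretched across both rows of matrix
# [[11 22 33]
#  [14 25 36]]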
Random Module:
NumPy includes a random module that generates random numbers with various distributions.
This is useful for tasks like simulations, random sampling, and statistical analysis.
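A short illustrative snippet (the seed is arbitrary; default_rng requires NumPy 1.17 or later):
import numpy as np
rng = np.random.default_rng(seed=42)            # reproducible random generator
print(rng.integers(low=0, high=10, size=5))     # 5 random integers in [0, 10)
print(rng.normal(loc=0.0, scale=1.0, size=3))   # 3 samples from a standard normal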
Linear Algebra Operations:
NumPy provides a set of functions for linear algebra operations, such as matrix multiplication,
eigenvalue decomposition, and solving linear systems of equations.
Indexing and Slicing:
NumPy supports advanced indexing and slicing operations, allowing you to access and
manipulate specific elements or subarrays within an array.
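A brief illustration (the array contents are arbitrary):
import numpy as np
a = np.arange(10)                # [0 1 2 ... 9]
print(a[2:7:2])                  # slicing with a step: [2 4 6]
print(a[a % 2 == 0])             # boolean indexing: [0 2 4 6 8]
m = np.arange(12).reshape(3, 4)
print(m[1, :])                   # second row
print(m[:, 2])                   # third column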
This Python program demonstrates basic usage of the NumPy library for numerical computing,
including array creation, manipulation, and basic operations:
In [1]: import numpy as np
# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])
print("Original array:", arr)
# Perform basic operations on the array
print("Sum of array elements:", np.sum(arr))
print("Mean of array elements:", np.mean(arr))
print("Maximum element in the array:", np.max(arr))
print("Minimum element in the array:", np.min(arr))
# Reshape the array
reshaped_arr = arr.reshape(1, 5)
print ("Reshaped array:", reshaped_arr)
# Create a 2D array

3
4
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print ("2D array:")
print (arr_2d)
# Perform operations on 2D array
print ("Sum of elements along rows:", np.sum(arr_2d, axis=1))
print ("Sum of elements along columns:", np.sum(arr_2d, axis=0))
Overall, it is a cornerstone library for numerical computing in Python, and it serves as the
foundation for many other scientific computing libraries and tools. Its efficiency and ease of
use make it an essential tool for tasks involving numerical data and scientific computing.
2. SciPy:
Scipy is an open-source library in Python that builds on the capabilities of NumPy and provides
additional functionality for scientific computing. It is designed to work seamlessly with NumPy
arrays and provides tools for a wide range of scientific and technical computing tasks. Let's
delve into some key aspects of Scipy:
Integration and Differentiation:
SciPy includes a module for numerical integration (scipy.integrate) and, in older releases, a numerical differentiation helper (scipy.misc.derivative). These provide functions for integrating and differentiating mathematical expressions numerically.
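As a small example, integrating sin(x) from 0 to π (the exact value is 2):
import numpy as np
from scipy import integrate
value, abs_error = integrate.quad(np.sin, 0, np.pi)   # returns the integral and an error estimate
print(value)       # ~2.0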
Optimization:
The optimization module (scipy.optimize) in Scipy offers a variety of optimization algorithms
for minimizing or maximizing mathematical functions. It includes both local and global
optimization algorithms.
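A minimal sketch minimizing a simple quadratic (the function and starting point are chosen arbitrarily):
from scipy import optimize
result = optimize.minimize(lambda x: (x - 3)**2 + 1, x0=0.0)
print(result.x)     # ~[3.], the minimizer
print(result.fun)   # ~1.0, the minimum value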
Interpolation:
The interpolation module (scipy.interpolate) provides functions for interpolating between data
points. This is useful for estimating values between known data points or for resampling data.
Signal and Image Processing:
Scipy includes modules for signal processing (scipy.signal) and image processing
(scipy.ndimage). These modules offer functions for filtering, convolution, Fourier analysis, and
various operations on signals and images.
Linear Algebra:
While NumPy provides basic linear algebra operations, Scipy extends this functionality with
additional features and optimizations. The linear algebra module (scipy.linalg) includes more
advanced operations, such as matrix factorization and solving linear systems.

Statistical Functions:
Scipy's stats module (scipy.stats) offers a wide range of statistical functions for probability
distributions, hypothesis testing, and descriptive statistics. It's a valuable tool for statistical
analysis.
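For instance, a one-sample t-test on made-up data:
import numpy as np
from scipy import stats
sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)   # test whether the sample mean differs from 5.0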
Sparse Matrix Operations:
Scipy provides a module for working with sparse matrices (scipy.sparse), which are memory-
efficient representations of large matrices with many zero elements. This is particularly useful
in tasks involving large datasets.
Clustering:
The clustering module (scipy.cluster) includes algorithms for clustering, such as hierarchical
clustering and k-means clustering, which are essential in machine learning and data analysis.
File I/O:
Scipy facilitates reading and writing data in various file formats, including MATLAB files,
NetCDF files, and more.
Integration with other Libraries:
Scipy integrates well with other scientific computing libraries, such as NumPy, Matplotlib for
plotting, and Sympy for symbolic mathematics.
This code snippet utilizes SciPy's `solve` function to find the solution of a system of linear
equations defined by the coefficient matrix `A` and the constant vector `b`.
In [2]: import numpy as np
from scipy.linalg import solve
# Define the coefficients of the linear equations
A = np.array([[2, 1], [1, -3]])
# Define the constants
b = np.array([5, -4])
# Solve the system of linear equations
x = solve(A, b)
# Print the solution
print("Solution for x and y:")
print(x)
Out [2]: Solution for x and y:
[1.57142857 1.85714286]

Overall, SciPy complements NumPy and provides a rich set of tools for scientists, engineers, and researchers working on diverse scientific computing tasks. Its modular structure makes it easy to use specific functionalities without loading the entire library, making it a versatile tool in the Python scientific computing ecosystem.

3. Jupyter:
Jupyter is an open-source project that allows you to create and share live code, equations,
visualizations, and narrative text. The name "Jupyter" is a combination of the three core
programming languages it supports: Julia, Python, and R. Here are the key features of Jupyter:
Interactive Computing:
Jupyter provides an interactive computing environment where you can write and execute code
in a flexible and dynamic manner. It supports various programming languages, not just the
original three.
Notebooks:
The primary interface in Jupyter is the notebook. A Jupyter notebook is a document that
contains live code, equations, visualizations, and narrative text. It allows you to create and
share documents that combine code, text, and results.
Support for Multiple Languages:
While originally designed for Julia, Python, and R, Jupyter has grown to support many other
languages through "kernels." Kernels are programming language-specific backends that allow
Jupyter to interact with different languages.
Rich Output:
Jupyter notebooks support the display of rich media outputs, including HTML, images, videos,
and custom MIME types. This makes it easy to integrate code, visualizations, and explanations
in a single document.
Live Code Execution:
You can execute code in a Jupyter notebook cell by cell, which allows for an incremental and
interactive development process. This is particularly useful for data analysis, exploration, and
learning.
Markdown Support:
Jupyter notebooks support Markdown cells, allowing you to include formatted text, headings,
lists, and even LaTeX equations. This makes it easy to create well-documented and readable
notebooks.
Integration with Libraries:

Jupyter integrates well with scientific computing libraries like NumPy, Pandas, Matplotlib, and
more. This makes it a popular choice for data analysis, machine learning, and scientific
research.

Widgets for Interactivity:
Jupyter supports interactive widgets that can be embedded in notebooks to create interactive
user interfaces for data exploration and analysis.
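A minimal sketch that can be run in a notebook cell (assuming the ipywidgets package is installed):
import ipywidgets as widgets
from IPython.display import display
slider = widgets.IntSlider(value=5, min=0, max=10, description='Value:')
display(slider)   # renders an interactive slider below the cell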
Version Control:
Jupyter notebooks can be version-controlled using systems like Git. This allows for
collaboration and tracking changes over time.
Export Options:
Jupyter notebooks can be exported to various formats, including HTML, PDF, and slideshows.
This makes it easy to share your work with others who may not have Jupyter installed.
JupyterHub:
JupyterHub is a multi-user version of Jupyter that allows multiple users to share a Jupyter
notebook server. It is often used in educational settings or collaborative research environments.
Jupyter has become a widely used tool in various domains, including data science, research,
education, and beyond, thanks to its versatility and ease of use. It provides an interactive and
collaborative platform for working with code and data.

4. Pandas:
Pandas is an open-source data manipulation and analysis library for Python. It provides data
structures for efficiently storing and manipulating large datasets and tools for working with
structured data. Here are the key features of Pandas:
DataFrame:
The central data structure in Pandas is the DataFrame, a two-dimensional table with labeled
axes (rows and columns). It is similar to a spreadsheet or SQL table and is used to store and
manipulate structured data.
Series:
Pandas also introduces the Series data structure, which is a one-dimensional labeled array. A
DataFrame is essentially a collection of Series with a shared index.
Data Cleaning and Preprocessing:
Pandas provides a wide range of functions for cleaning and preprocessing data, including
handling missing values, removing duplicates, filling gaps, and transforming data types.
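A small sketch of these cleaning steps on a made-up DataFrame:
import pandas as pd
import numpy as np
raw = pd.DataFrame({'a': [1, 2, np.nan, 2], 'b': ['x', 'y', 'y', 'y']})
clean = raw.drop_duplicates()                      # remove duplicate rows
clean['a'] = clean['a'].fillna(clean['a'].mean())  # fill the missing value with the mean
clean['a'] = clean['a'].astype(int)                # transform the data type
print(clean)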
Data Selection and Indexing:
Pandas allows for easy selection, indexing, and slicing of data. You can use labels, boolean
indexing, and positional indexing to extract specific subsets of data.

Data Alignment and Merging:
Pandas aligns data automatically based on labels, making it easy to perform operations on
datasets with different structures. The library also provides powerful merging and joining
capabilities for combining datasets.
Grouping and Aggregation:
Pandas supports the grouping of data based on one or more keys and allows for the application
of various aggregation functions, such as sum, mean, count, etc., on the grouped data.
Time Series Functionality:
Pandas includes robust support for time series data, with specialized data structures and
functions for handling time-based indexing, resampling, and time zone conversion.
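A brief sketch of time-based indexing and resampling (the dates and values are illustrative):
import pandas as pd
import numpy as np
idx = pd.date_range('2024-01-01', periods=6, freq='D')
ts = pd.Series(np.arange(6), index=idx)
print(ts['2024-01-02':'2024-01-04'])   # label-based time slicing
print(ts.resample('2D').sum())         # downsample into 2-day bins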
Input/Output Tools:
Pandas provides functions to read and write data in various file formats, including CSV, Excel,
SQL databases, and more. This makes it easy to work with data from different sources.
Statistical and Mathematical Operations:
Pandas includes a variety of statistical and mathematical functions for analysing data. This
includes descriptive statistics, correlation, covariance, and more.
Plotting and Visualization:
Pandas integrates with Matplotlib, a popular plotting library, to provide built-in plotting
functions. This allows for quick and easy visualization of data directly from Pandas
DataFrames.
Integration with NumPy:
Pandas is built on top of NumPy, which means it seamlessly integrates with NumPy arrays.
This allows for efficient data manipulation and analysis, especially when working with
numerical data.
Wide Adoption in Data Science:
Pandas is a foundational library in the Python data science ecosystem and is widely used in
combination with other libraries like NumPy, Scikit-learn, and Matplotlib for comprehensive
data analysis and machine learning tasks.
This Python program utilizes Pandas to create, manipulate, and analyse a DataFrame containing sample
data with columns for 'Name', 'Age', 'Gender', and 'City', demonstrating basic operations such as
filtering individuals under 35 and computing the average age by gender.
In [3]: import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Gender': ['F', 'M', 'M', 'M', 'F'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
# Creating a DataFrame from the sample data
df = pd.DataFrame(data)
# Displaying the original DataFrame
print("Original DataFrame:")
print(df)
print()
# Filtering data: selecting individuals younger than 35
young_people = df[df['Age'] < 35]
# Displaying filtered data
print("People younger than 35:")
print(young_people)
print()
# Grouping data by gender and getting the average age
average_age_by_gender = df.groupby('Gender')['Age'].mean()
# Displaying average age by gender
print("Average age by gender:")
print(average_age_by_gender)

Out [3]:
Original DataFrame:
      Name  Age Gender         City
0    Alice   25      F     New York
1      Bob   30      M  Los Angeles
2  Charlie   35      M      Chicago
3    David   40      M      Houston
4    Emily   45      F      Phoenix

People younger than 35:
    Name  Age Gender         City
0  Alice   25      F     New York
1    Bob   30      M  Los Angeles

Average age by gender:
Gender
F    35.0
M    35.0
Name: Age, dtype: float64
In summary, Pandas is a powerful and flexible library that simplifies data manipulation and
analysis in Python. It is an essential tool for data scientists, analysts, and researchers working
with structured data.

5. Statsmodels:
Statsmodels is a Python library for estimating and testing statistical models. It provides classes
and functions for the estimation of many different statistical models, as well as for conducting
hypothesis tests, statistical tests, and statistical data exploration. Let's explore some of the key
features of Statsmodels:
Estimation of Statistical Models:
Statsmodels supports the estimation of a wide range of statistical models, including linear
regression, generalized linear models, robust linear models, and time-series analysis.
This makes it a versatile tool for various statistical modeling tasks.

Linear Regression Models:
Statsmodels provides classes for estimating linear regression models, including ordinary least
squares (OLS) regression. It allows for detailed statistical analysis of the model, including
hypothesis testing and confidence interval estimation.
Generalized Linear Models (GLM):
Statsmodels supports a variety of generalized linear models, such as logistic regression,
Poisson regression, and more. GLM is a framework that extends linear regression to
accommodate different types of response variables and error distributions.
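A minimal logistic-regression sketch in the GLM framework (the data are made up for illustration):
import numpy as np
import statsmodels.api as sm
X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
y = np.array([0, 0, 1, 0, 1, 1])
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(model.params)   # intercept and slope of the logistic fit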
Time Series Analysis:
Statsmodels includes classes and functions for time series analysis, such as autoregressive
integrated moving average (ARIMA) models, seasonal decomposition of time series (STL),
and state space models.
Statistical Tests:
Statsmodels provides a wide range of statistical tests for hypothesis testing and model
evaluation. This includes t-tests, F-tests, chi-square tests, and many others.
Hypothesis Testing:
The library allows users to perform hypothesis tests on model parameters, including testing the
significance of coefficients in regression models. This is crucial for assessing the reliability
and relevance of variables in a model.
Model Diagnostics:
Statsmodels offers tools for diagnosing the quality and appropriateness of statistical models.
This includes checking for multicollinearity, heteroscedasticity, and other assumptions of
statistical models.
ANOVA (Analysis of Variance):
Statsmodels supports analysis of variance, allowing users to assess the variation between
different groups or factors in a dataset.
Econometric Models:
Statsmodels is widely used in econometrics for estimating and analyzing economic models. It
provides specialized classes for econometric modeling and analysis.
Interactive Data Exploration:
Statsmodels integrates with Matplotlib and other visualization libraries, allowing for
interactive exploration and visualization of statistical results.

Out [4]: (OLS regression results: the summary table printed by model.summary(), with coefficients, R-squared, standard errors, t-statistics, and p-values)
Documentation and Community Support:
Statsmodels has comprehensive documentation that helps users understand the library's
functionalities and use it effectively. The library also benefits from an active community,
making it easier to find support and examples.
In [4]: import numpy as np
import statsmodels.api as sm
# Sample data
X = np.array([1, 2, 3, 4, 5]) # Independent variable
y = np.array([2, 3, 4, 5, 6]) # Dependent variable
# Add constant for intercept term
X = sm.add_constant(X)
# Fit the model
model = sm.OLS(y, X).fit()
# Print model summary
print(model.summary())
Statsmodels is a valuable tool for statisticians, economists, data scientists, and researchers who
need to perform detailed statistical analysis and modeling in Python. It complements other
libraries in the scientific Python ecosystem, such as NumPy and Pandas, to provide a
comprehensive set of tools for data analysis and modelling.

RESULT:
NumPy, SciPy, Jupyter, Statsmodels, and Pandas were downloaded and installed successfully, and their features were explored and studied.

Out [1]: First_Array:
[[1 2 4]
 [6 8 9]
 [7 3 5]]
Second_Array:
[[ 0  5  6]
 [ 8  9 10]
 [ 6  5  4]]
Out [2]: [94 76 77 89 81 65]
Out [3]: Arr_1:
[[1 2 4]
 [6 8 9]
 [7 3 5]]
Arr_2:
[[ 0  5  6]
 [ 8  9 10]
 [ 6  5  4]]
ADDITION OF TWO ARRAYS:
[[ 1  7 10]
 [14 17 19]
 [13  8  9]]
MULTIPLICATION OF TWO ARRAYS:
[[ 40  43  42]
 [118 147 152]
 [ 54  87  92]]

2. Working with NumPy Array

AIM:
This Python program employs NumPy to handle arrays, enabling efficient numerical
computation and manipulation of multidimensional data. It demonstrates basic operations such
as array creation, slicing, mathematical computations, and statistical analysis, showcasing
NumPy's versatility and performance in scientific computing tasks.
PROCEDURE:
Step 1: Start
Step 2: Open Google Colab
Step 3: Import NumPy module/library
Step 4: Populate arrays with specific numbers
Step 5: Perform addition and multiplication with arrays
Step 6: Display the array
PROGRAM (Input and Output):
Creating arrays:
In[1]: import numpy as np
First_Array = np.array([[1, 2, 4], [6, 8, 9], [7, 3, 5]])
Second_Array = np.array([[0, 5, 6], [8, 9, 10], [6, 5, 4]])
print('First_Array:\n', First_Array)
print('Second_Array:\n', Second_Array)

Populating arrays with random numbers:

In[2]: random_integers = np.random.randint(low=50, high=101, size=6)
print(random_integers)
Mathematical operations with arrays:
In[3]: Arr_1 = np.array([[1, 2, 4], [6, 8, 9], [7, 3, 5]])
Arr_2 = np.array([[0, 5, 6], [8, 9, 10], [6, 5, 4]])
print('Arr_1:\n', Arr_1)
print('Arr_2:\n', Arr_2)
Arr_3 = Arr_1 + Arr_2
print('ADDITION OF TWO ARRAYS:\n', Arr_3)

Arr_4 = Arr_1 @ Arr_2   # matrix multiplication (use @ or np.dot; * is elementwise)
print('MULTIPLICATION OF TWO ARRAYS:\n', Arr_4)

MATRIX ADDITION:
Example:
A = [[1 2 4]       B = [[0  5  6]
     [6 8 9]            [8  9 10]
     [7 3 5]]           [6  5  4]]
C = A + B = [[ 1  7 10]
             [14 17 19]
             [13  8  9]]
MATRIX MULTIPLICATION:
Example:
C = AB
  = [[(0 + 16 + 24)   (5 + 18 + 20)   (6 + 20 + 16)]
     [(0 + 64 + 54)   (30 + 72 + 45)  (36 + 80 + 36)]
     [(0 + 24 + 30)   (35 + 27 + 25)  (42 + 30 + 20)]]
  = [[ 40  43  42]
     [118 147 152]
     [ 54  87  92]]

RESULT:
We have successfully worked with NumPy and explored its basic features.

Out [1]: DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2    David   35      Chicago
3  Charlie   28      Houston

Out [2]: First three rows:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
2  David   35      Chicago

Out [3]: Specific column:
0      Alice
1        Bob
2      David
3    Charlie
Name: Name, dtype: object

Out [4]: New:
      Name  Age  Salary
0    Alice   26   55000
1      Bob   30   60000
2    David   35   75000
3  Charlie   28   50000

3. Working with Pandas DataFrames

AIM:
This Python program showcases the usage of Pandas library for data manipulation, particularly
with DataFrames. It loads sample data into a DataFrame, filters individuals under 35, computes
the average age by gender, and presents the results through concise data analysis and
manipulation techniques.
PROCEDURE:
Step 1: Start
Step 2: Open Google colab
Step 3: Import NumPy and Pandas libraries
Step 4: Create a DataFrame
Step 5: Display first rows
Step 6: Display specific column
Step 7: Sort the columns
Step 8: Display each output
Step 9: Stop
PROGRAM (Input and Output):
Creating a DataFrame:
In [1]: import pandas as pd
data = {'Name': ['Alice', 'Bob', 'David', 'Charlie'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print('DataFrame:\n', df)
Displaying the first three rows:
In [2]: print('First three rows:\n', df.head(3))
Displaying a specific column:
In [3]: print('Specific column:\n', df['Name'])

Sorting:
In [4]: df['Salary'] = [55000, 60000, 75000, 50000]
df.loc[0, 'Age'] = 26
df = df.drop(columns='City')
print('New:\n', df)
df_sorted = df.sort_values(by='Age', ascending=True)
print('After sorting:\n', df_sorted.head())
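Running the sorting cell prints the frame below (derived by executing the code above):
After sorting:
      Name  Age  Salary
0    Alice   26   55000
3  Charlie   28   50000
1      Bob   30   60000
2    David   35   75000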

RESULT:

We have successfully worked with Pandas and explored its basic features.

CSV Output:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

Excel Output:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

HTML Output: (table of failed banks parsed from the webpage)
4(a). Reading Data from Text Files, Excel, and the Web
Aim:
Pandas, a powerful Python library, facilitates data extraction from various
sources. With its intuitive functions, pandas simplifies reading data from text
files, Excel spreadsheets, and web resources seamlessly. Utilizing its versatile
capabilities, analysts can effortlessly import, manipulate, and analyse datasets
from diverse formats, enhancing productivity and insights.
Procedure:
To read data from a CSV file using the pandas package.
To read data from an Excel file using the pandas package.
To read data from an HTML file using the pandas package.

Program Code:
Data Input and Output:
This notebook is the reference code for getting input and output. Pandas can read a variety of file types using its pd.read_ methods; let's take a look at the most common data types:
import pandas as pd
CSV:
CSV Input
In[1]: df = pd.read_csv('example')   # 'example' is a CSV file in the working directory
df

Excel:
Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; a workbook with images or macros may cause the read_excel method to crash.
➢ Upload the Excel File to Google Colab:
➢ Before reading data from an Excel file, you need to upload the file to your Colab
environment.
➢ Click on the "Files" tab on the left sidebar.
➢ Click on the "Upload" button and select the Excel file from your local machine.
➢ Install and Import Pandas:
➢ Pandas is a powerful library for data manipulation and analysis in Python. In most
cases, Google Colab comes with Pandas pre-installed.

➢ Read Data from Excel:
➢ Assuming you've uploaded an Excel file named 'data.xlsx'
➢ Handle Multiple Sheets:
➢ If your Excel file contains multiple sheets, you can specify the sheet name or use
the default behavior of reading the first sheet.
➢ Save Changes:
➢ If you make changes to the DataFrame and want to save it back to an Excel file.
Excel Input (a sketch following the steps above):
In[2]: pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')
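A hedged sketch covering the steps listed above (the file name 'data.xlsx' and the sheet names are assumptions):
import pandas as pd
df_xlsx = pd.read_excel('data.xlsx', sheet_name='Sheet1')            # read one sheet
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)             # dict of every sheet
df_xlsx.to_excel('data_out.xlsx', sheet_name='Sheet1', index=False)  # save changes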

HTML :

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt run:
pip install lxml
pip install html5lib==1.1
pip install BeautifulSoup4

Then restart Jupyter Notebook. (or use conda install)


Pandas can read tables out of HTML pages.
HTML Input
Pandas read_html function will read tables off of a webpage and return a list of DataFrame
objects:
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list'
df = pd.read_html(url)
df[0]
match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
df_list[0]

Result:
Commands for reading data from CSV, Excel, and HTML files were explored and executed successfully.

Out [1]:
    x1   x2   x3   x4            types
0  5.1  3.5  1.4  0.2      Iris-setosa
1  5.5  2.3  4.0  1.3  Iris-versicolor
2  6.3  2.7  4.9  1.8   Iris-virginica
3  4.6  3.4  1.4  0.3      Iris-setosa
4  5.0  3.4  1.5  0.2      Iris-setosa

Out [2]:
Enter sepal length:5.0
Enter sepal width:4.7
Enter petal length:3.6
Enter petal width:4.2

Out [3]:
    x1   x2   x3   x4            types         d
0  5.1  3.5  1.4  0.2      Iris-setosa  4.721229
1  5.5  2.3  4.0  1.3  Iris-versicolor  3.818377
2  6.3  2.7  4.9  1.8   Iris-virginica  3.624914
3  4.6  3.4  1.4  0.3      Iris-setosa  4.679744
4  5.0  3.4  1.5  0.2      Iris-setosa  4.701064

4(b). Descriptive Analytics: Iris Dataset Exploration

Aim :
The provided Python code reads an iris dataset from a CSV file, takes input for sepal and
petal measurements, calculates Euclidean distances from the input to each data point, and
then sorts the dataset based on these distances.
Procedure:
To understand the idea behind descriptive statistics.
Load the packages we need along with the iris dataset, read from a CSV file into a DataFrame `df`.
Take sepal and petal measurements as input, compute the Euclidean distance from the input to each row, and sort the dataset by that distance.
Basic statistics: count, mean, median, min, max.
Program Code:
In [1]: import pandas as pd
import numpy as np
df=pd.read_csv('/content/iris2.csv')
df.head()
In [2]: sl=float(input('Enter sepal length:'))
sw=float(input('Enter sepal width:'))
pl=float(input('Enter petal length:'))
pw=float(input('Enter petal width:'))
df
In [3]: # squared difference of each feature from the input point
df['a']=((df['x1']-sl)**2.0)
df['b']=((df['x2']-sw)**2.0)
df['c']=((df['x3']-pl)**2.0)
df['e']=((df['x4']-pw)**2.0)
# Euclidean distance: square root of the sum of squared differences
df['d']=np.sqrt(df[['a','b','c','e']].sum(axis=1))
df.drop(df[['a','b','c','e']],axis=1,inplace=True)
df.head()
In [4]: df_sort=df.sort_values(by=['d'])
df_sort.head()
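The distance column in In [3] can also be computed in one line with np.linalg.norm (a sketch assuming the same column names):
point = np.array([sl, sw, pl, pw])
df['d'] = np.linalg.norm(df[['x1', 'x2', 'x3', 'x4']].to_numpy() - point, axis=1)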

Out [4]:
   x1   x2   x3   x4            types         d
  6.2  3.4  5.4  2.3   Iris-virginica  3.159114
  5.9  3.0  4.2  1.5  Iris-versicolor  3.368976
  5.9  3.0  5.1  1.8   Iris-virginica  3.421988
  5.2  2.7  3.9  1.4  Iris-versicolor  3.459769
  6.5  3.0  5.2  2.0   Iris-virginica  3.541186

Result:
Various commands for descriptive analytics on the iris dataset were explored and executed successfully.

Out [1]:
  w1  w2   f
  40  49   6
  50  59   8
  60  69  12
  70  79  14
  80  89   7
  90  99   3

Out [2]:
  w1  w2   f    e     x      fx
  40  49   6   89  44.5   267.0
  50  59   8  109  54.5   436.0
  60  69  12  129  64.5   774.0
  70  79  14  149  74.5  1043.0
  80  89   7  169  84.5   591.5
  90  99   3  189  94.5   283.5

5. Comparative Analysis of Statistical Methods
Aim:
a) Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b) Bivariate analysis: Linear and logistic regression modelling.
c) Multiple Regression analysis.
d) Also compare the results of the above analysis for the two data sets.
Program Code:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
In [1]: import pandas as pd
df=pd.read_csv('/content/dslab.csv')
df.style
In [2]: df['e']=df['w1']+df['w2']
df['x']=df['e']/2
df['fx']=df['f']*df['x']
df
In [3]: g=df['fx'].sum()/df['f'].sum()
print("Mean Weight is")
print(g)
In [4]: df['m']=df['x']-g
df['y']=df['m'].abs()
df['fy']=df['f']*df['y']
df['Y^2']=df['y']**2
df['fY^2']=df['Y^2']*df['f']
df.style
In [5]: # mean deviation = Σf|x - mean| / Σf (frequency-weighted)
M=df['fy'].sum()/df['f'].sum()
print("Mean deviation")
print(M)
In [6]: # variance = Σf(x - mean)^2 / Σf (frequency-weighted)
M1=df['fY^2'].sum()/df['f'].sum()
print("Variance")
print(M1)

Out [3]:
Mean Weight is 67.9
Out [4]:
  w1  w2   f    e     x      fx     m      y     fy     Y^2     fY^2
  40  49   6   89  44.5   267.0  -23.4  23.4  140.4  547.56  3285.36
  50  59   8  109  54.5   436.0  -13.4  13.4  107.2  179.56  1436.48
  60  69  12  129  64.5   774.0   -3.4   3.4   40.8   11.56   138.72
  70  79  14  149  74.5  1043.0    6.6   6.6   92.4   43.56   609.84
  80  89   7  169  84.5   591.5   16.6  16.6  116.2  275.56  1928.92
  90  99   3  189  94.5   283.5   26.6  26.6   79.8  707.56  2122.68

Out [5]:
Mean deviation : 11.536
Out [6]:
Variance : 190.44
Out [7]:
Standard Deviation : 13.8
Out [8]:
Median : 68.66666666666667

In[7]: M2=(M1)**(1/2)   # standard deviation = sqrt(variance)
print("Standard Deviation")
print(M2)
In[8]: lm=59.5   # lower boundary of the median class (60-69)
cf=14    # cumulative frequency before the median class
fm=12    # frequency of the median class
c=10     # class width
me=lm+((((df['f'].sum()/2)-cf)/fm)*c)
print("Median")
print(me)
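The aim also lists mode, skewness, and kurtosis, which the cells above do not compute. One sketch (assuming the same df columns and SciPy 1.9+) expands the class midpoints by their frequencies:
import numpy as np
from scipy import stats
values = np.repeat(df['x'].to_numpy(), df['f'].to_numpy())   # midpoints repeated by frequency
print("Modal class midpoint:", stats.mode(values, keepdims=False).mode)
print("Skewness:", stats.skew(values))
print("Kurtosis:", stats.kurtosis(values))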
MANUAL SOLUTION:

  w1  w2   f
  40  49   6
  50  59   8
  60  69  12
  70  79  14
  80  89   7
  90  99   3

MEAN:
The mean of a grouped frequency distribution is calculated from the class midpoints x = (w1 + w2)/2, weighting each midpoint by its frequency.
Mean = Σfx / Σf
     = (6 * 44.5 + 8 * 54.5 + 12 * 64.5 + 14 * 74.5 + 7 * 84.5 + 3 * 94.5) / 50
     = (267 + 436 + 774 + 1043 + 591.5 + 283.5) / 50
     = 3395 / 50
     = 67.9
Median:
To find the median, we first calculate the cumulative frequencies and then locate the middle value.
Cumulative Frequency: 6, 6+8=14, 14+12=26, 26+14=40, 40+7=47, 47+3=50
N/2 = 25 falls in the interval 60 - 69 (cumulative frequency 26), so that is the median class. We use the formula for the median of a grouped frequency distribution:
Median = L + [(N/2 - F) / f] * w
where: L = Lower boundary of the median class (59.5)
N = Total frequency (50)
F = Cumulative frequency of the class before the median class (14)
f = Frequency of the median class (12)
w = Width of the median class (10)
Median = 59.5 + [(25 - 14) / 12] * 10
       = 59.5 + 9.167
       = 68.667
Mode:
The mode is the value with the highest frequency. In this case, the modal class is 70 - 79, with a frequency of 14.
Variance:
Variance (σ²) = [Σ (f * (x − μ)²)] / N
Where:
Σ denotes the sum of
f is the frequency
x is the midpoint of each class interval
μ is the mean (67.9)
N is the total frequency
Variance = [(6 * (44.5 − 67.9)²) + (8 * (54.5 − 67.9)²) + (12 * (64.5 − 67.9)²) + (14 * (74.5 − 67.9)²) + (7 * (84.5 − 67.9)²) + (3 * (94.5 − 67.9)²)] / 50
         = [(6 * 547.56) + (8 * 179.56) + (12 * 11.56) + (14 * 43.56) + (7 * 275.56) + (3 * 707.56)] / 50
         = [3285.36 + 1436.48 + 138.72 + 609.84 + 1928.92 + 2122.68] / 50
         = 9522 / 50
         = 190.44
Standard deviation:
Standard deviation = √variance = √190.44 = 13.8
These values agree with the program output above.
b. Bivariate analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')

plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()
plot11=df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()
plot20 = sns.boxplot(y=df['Pregnancies'],ax=axes[2][0])
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()
plot2 =sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')

## Blood Pressure variable
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(y=df['BloodPressure'],ax=axes[1][0])
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()

Out[1]:
  X1  X2     Y
   3   8  -3.7
   4   5   3.5
   5   7   2.5
   6   3  11.5
   2   1   5.7

Out[2]:
  X1  X2     Y         Y1  (Y-Y1)^2
   3   8  -3.7  -3.732351  0.001047
   4   5   3.5   3.565580  0.004301
   5   7   2.5   2.503009  0.000009
   6   3  11.5  11.473041  0.000727
   2   1   5.7   5.690721  0.000086

Out[3]:
Standard error: 0.04534783712911137
Error: -0.03917855813224613
plt.show()
fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))
plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(y=df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1])
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
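The aim for part (b) also mentions logistic regression modelling; a minimal sketch using scikit-learn (the chosen feature columns are an assumption):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
features = df[['Pregnancies', 'Glucose', 'BloodPressure']]
target = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))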
c. Multiple Regression analysis:
In[1]: import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("/content/Book1.csv")
df
In[2]: # corrected sums of squares and cross-products (see the manual solution below)
n = len(df)
i = (df['X1']**2).sum() - df['X1'].sum()**2/n            # Σx1² (corrected)
j = (df['X2']**2).sum() - df['X2'].sum()**2/n            # Σx2² (corrected)
x1y = (df['X1']*df['Y']).sum() - df['X1'].sum()*df['Y'].sum()/n
x2y = (df['X2']*df['Y']).sum() - df['X2'].sum()*df['Y'].sum()/n
x1x2 = (df['X1']*df['X2']).sum() - df['X1'].sum()*df['X2'].sum()/n
b1=((j*x1y)-(x1x2*x2y))/((i*j)-(x1x2)**2)
b2=((i*x2y)-(x1x2*x1y))/((i*j)-(x1x2)**2)
k= df['Y'].mean()
l= df['X1'].mean()
m= df['X2'].mean()
a=k-(b1*l)-(b2*m)
df['Y1']=a+(b1*df['X1'])+(b2*df['X2'])   # fitted values (do not overwrite Y)
df['Y-Y1']=(df['Y']-df['Y1'])**2         # squared residuals
p=df['Y-Y1'].sum()
sd=(p/(n-2))**(1/2)
Out[4]: (line plot comparing observed Y and fitted Y1)
In[3]: print("Standard error:",sd)
t=p-sd
print("Error:",t)
df
In[4]: import matplotlib.pyplot as plt
row1 = df['Y']
row2 = df['Y1']
plt.plot(row1, label='Y (observed)')
plt.plot(row2, label='Y1 (fitted)')
plt.xlabel('Observation')
plt.ylabel('Value')
plt.title('Multiple Regression of Y')
plt.legend()
plt.show()
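As a cross-check of the hand computation below, statsmodels fits the same model directly (a sketch assuming the same Book1.csv columns):
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('/content/Book1.csv')
X = sm.add_constant(df[['X1', 'X2']])
ols_model = sm.OLS(df['Y'], X).fit()
print(ols_model.params)   # expected: intercept ≈ 2.80, X1 ≈ 2.28, X2 ≈ -1.67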
MANUAL SOLUTION:

   X1    X2      Y    X1²    X2²   X1X2    X1Y     X2Y
    3     8   -3.7     9     64     24   -11.1   -29.6
    4     5    3.5    16     25     20    14.0    17.5
    5     7    2.5    25     49     35    12.5    17.5
    6     3   11.5    36      9     18    69.0    34.5
    2     1    5.7     4      1      2    11.4     5.7
 Σ:20  Σ:24  Σ:19.5  Σ:90  Σ:148   Σ:99  Σ:95.8  Σ:45.6

y1 = a + b1·x1 + b2·x2

Corrected sums (deviations from the means):
Σx1² = Σx1² − (Σx1)²/n = 90 − (20)²/5 = 10
Σx2² = Σx2² − (Σx2)²/n = 148 − (24)²/5 = 32.8
Σx1y = Σx1y − (Σx1)(Σy)/n = 95.8 − (20)(19.5)/5 = 17.8
Σx2y = Σx2y − (Σx2)(Σy)/n = 45.6 − (24)(19.5)/5 = −48
Σx1x2 = Σx1x2 − (Σx1)(Σx2)/n = 99 − (20)(24)/5 = 3

b1 = (Σx2²·Σx1y − Σx1x2·Σx2y) / (Σx1²·Σx2² − (Σx1x2)²)
   = ((32.8)(17.8) − (3)(−48)) / ((10)(32.8) − 3²)
   = 727.84 / 319
   = 2.28

b2 = (Σx1²·Σx2y − Σx1x2·Σx1y) / (Σx1²·Σx2² − (Σx1x2)²)
   = ((10)(−48) − (3)(17.8)) / 319
   = −533.4 / 319
   = −1.67

a = ȳ − b1·x̄1 − b2·x̄2
  = 3.9 − (2.28)(4) − (−1.67)(4.8)
  = 2.796

So the fitted equation is y′ = 2.796 + 2.28·x1 − 1.67·x2; for example, evaluating it at the means x1 = 4, x2 = 4.8 returns y′ = 3.9 = ȳ.

SEE = √[Σ(y − y′)² / (n − 2)]
    = √(0.00617 / 3)
    ≈ 0.045
(matching the program's standard error of 0.0453; the rounded coefficients above give a slightly larger value)

Out[1]: (distribution and box plots of Glucose, produced by the comparison code below)
d. Also compare the results of the above analysis for the two data sets:
In[1]: #sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')

Out[2]: (distribution and box plots of the non-zero Glucose values)
plt.tight_layout()
plot10=sns.boxplot(y=df['Glucose'],ax=axes[1][0])
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary(Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
In[2]: fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))
plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(y=df[df['Glucose']!=0]['Glucose'],ax=axes[1])
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()

Result:
The datasets were read, and univariate, bivariate, and multiple regression analyses were carried out successfully.

Out[1]: (line plot of passengers per year for May flights)

Out[2]: (density plot of the iris dataset)
6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS
Aim:
Explore diverse plotting functions for analyzing UCI datasets effectively. Apply a range of visualization techniques to gain insight into the data's patterns and trends, enabling comprehensive exploration of the dataset's characteristics through graphical representations, and use a variety of plotting tools to create informative visualizations that aid in interpreting and communicating findings.
Procedure:
1. Install the seaborn package and import it.
2. Visualize normal curves, density or contour plots, correlation and scatter plots, and histograms.
3. 3-D plotting is done using the plotly package.
a. Normal curves
In[1]:#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
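The flights example above is a line plot; an actual normal (bell) curve can be drawn from the standard normal density (a small sketch using SciPy and Matplotlib):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.norm.pdf(x, loc=0, scale=1))
plt.title("Standard normal curve")
plt.show()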

b. Density and contour plots


In[2]: iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
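The call above plots a one-dimensional density per column; for a contour (two-dimensional density) plot, kdeplot also accepts x and y columns (seaborn 0.11+):
sns.kdeplot(data=iris, x="sepal_length", y="sepal_width")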

c. Correlation and scatter plots


In[3]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# correlation visualized using the heatmap function
df = sns.load_dataset("titanic")
plt.figure(figsize=(10,8))
ax = sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")  # correlate numeric columns only (numeric_only needs pandas 1.5+)

plt.title("Correlation Heat Map")
plt.show()
# scatter plot of a categorical variable
sns.catplot(data=df, x="age", y="fare", hue="class")
plt.title("Scatter Plot of Age vs. Fare with Class")
plt.show()

Out[3]: (correlation heat map; scatter plot of age vs. fare by class)

d. Histograms
In[4]:-# histogram of a DataFrame column
df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")

e. Three dimensional plotting


In[5]:-# 3-D plotting using the plotly package
import plotly.express as px
df = sns.load_dataset("iris")
px.scatter_3d(df, x="petal_length", y="petal_width", z="sepal_width",
              size="sepal_length", color="species")

Out[4]: (histogram of passenger ages)

Out[5]: (three-dimensional scatter plot of the iris measurements)

Result:
Various visual plots were explored and executed successfully.

Out[1]: (orthographic "blue marble" globe centered at latitude 50, longitude -100)

Out[2]: (Lambert conformal map centered at latitude 45, longitude -100, with Seattle marked)
7.VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
Aim:
Leverage Basemap to visualize geographic data, creating detailed maps that display spatial
patterns effectively. Utilize Basemap's features to analyze geographical datasets, enabling
insightful exploration and interpretation. Generate visually appealing maps using Basemap,
facilitating the communication of geographic insights.

Procedure:
1. Install the basemap package.
Install the package below. Use Google Colab (in the Anaconda prompt the conda version may need to change, which can affect the compatibility of other packages):
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap
2. Explore various projection options, for example ortho and lcc.
3. Mark the location using longitude and latitude
%matplotlib inline
In[1]: import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)   # scale=0.5 downsamples the background image for speed
In[2]:fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)

Out[3]: (cylindrical world map with shaded relief and a latitude/longitude grid)

Out[4]: (Lambert conformal map centered at latitude 50, longitude 0, with the same grid)
plt.text(x, y, ' Seattle', fontsize=12)
In[3]:-from itertools import chain
def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180)
draw_map(m)
In[4]:-fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,lon_0=0, lat_0=50, lat_1=45,
lat_2=55,width=1.6E7, height=1.2E7)
draw_map(m)

Result:
Visualizing geographic data with Basemap was executed successfully.
