CAREERERA
A Warm Welcome To Careerera Family
• There are dress shoes, hiking boots, sandals, etc. Using EDA, you are
open to the fact that any number of people might buy any number of
different types of shoes.
• You visualize the data using exploratory data analysis to find that
most customers buy 1-3 different types of shoes.
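A minimal sketch of the shoe example, assuming purchases are stored one row per customer and shoe type. The DataFrame, its column names, and the values are hypothetical and only illustrate the idea of counting shoe types per customer before drawing any conclusion.

```python
import pandas as pd

# Hypothetical purchase records: one row per (customer, shoe type) purchase.
purchases = pd.DataFrame({
    "customer": ["ann", "ann", "bob", "cara", "cara", "cara", "dan"],
    "shoe_type": ["dress", "sandal", "hiking", "dress", "hiking", "sandal", "dress"],
})

# Number of distinct shoe types bought by each customer.
types_per_customer = purchases.groupby("customer")["shoe_type"].nunique()

# How many customers bought 1, 2, 3, ... different types.
distribution = types_per_customer.value_counts().sort_index()
print(distribution)
```

Plotting this distribution as a bar chart would be one way to see how many types most customers buy.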
ADVANTAGES OF EDA
1. It gives us valuable insights into the data.
2. It helps us with feature selection.
3. Visualization is an effective way of detecting outliers.
DISADVANTAGES OF EDA
1. If not performed properly, EDA can lead to misleading conclusions about the problem.
2. EDA is less effective when we deal with high-dimensional data.
WHAT IS NUMPY?
• NumPy is a Python library used for working with arrays.
• NumPy is a package that contains classes, functions, variables, and a
large library of mathematical functions for scientific
computing.
• NumPy can be used to create n-dimensional arrays, where n is any
non-negative integer. We can create 1-dimensional, 2-dimensional,
3-dimensional, and higher-dimensional arrays.
WHY USE NumPy?
• In Python we have lists that serve the purpose of arrays, but they are
slow to process.
• NumPy aims to provide an array object that is up to 50x faster than
traditional Python lists.
• The array object in NumPy is called ndarray. It provides many
supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and
resources are very important.
WHY IS NumPy FASTER THAN LISTS?
• NumPy arrays are stored in one contiguous block of memory, unlike
lists, so processes can access and manipulate them very efficiently.
• This behavior is called locality of reference in computer science.
• This is the main reason why NumPy is faster than lists. It is also
optimized to work with the latest CPU architectures.
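The difference can be observed with a rough timing sketch like the one below. The array size and repeat count are arbitrary choices, and the measured times depend on the machine, so no particular speedup is claimed here.

```python
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n)

# A freshly created 1-D ndarray occupies one contiguous block of memory.
print(np_array.flags["C_CONTIGUOUS"])

# Double every element: a Python-level loop versus one vectorized operation.
list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=20)
array_time = timeit.timeit(lambda: np_array * 2, number=20)

print(f"list: {list_time:.4f} s, ndarray: {array_time:.4f} s")
```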
Import NumPy
• There are two ways to import NumPy (the module name is written in
lowercase in code):
• import numpy: this imports the entire NumPy module.
• from numpy import *: this imports all classes, objects,
variables, etc. from the NumPy package. Here * means all.
import numpy
from numpy import *
NumPy as np
• NumPy is usually imported under the np alias.
• alias: In Python, an alias is an alternate name for referring to the same
thing.
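A short sketch of the alias in use. Both names refer to the same module object, so np is only a shorter way to write numpy.

```python
import numpy
import numpy as np

# The alias and the full name point to the same module.
print(np is numpy)
print(np.__version__)
```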
Create a NumPy ndarray Object
• NumPy is used to work with arrays. The array object in NumPy is called
ndarray.
• We can create a NumPy ndarray object by using the array() function.
• To create an ndarray, we can pass a list, tuple or any array-like object
into the array() function, and it will be converted into an ndarray.
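As an illustration, the sketch below builds an ndarray from a list and from a tuple. The variable names are chosen for this example only.

```python
import numpy as np

# A list and a tuple with the same values produce equivalent arrays.
arr_from_list = np.array([1, 2, 3, 4, 5])
arr_from_tuple = np.array((1, 2, 3, 4, 5))

print(arr_from_list)
print(type(arr_from_list))
```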
DIMENSIONS IN ARRAYS
• A dimension in arrays is one level of array depth (nested arrays).
• Nested arrays are arrays that have arrays as their elements.
0-D Arrays
1-D Arrays
2-D Arrays
3-D Arrays
0-D Arrays
• 0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
1-D Arrays
• An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
• These are the most common and basic arrays.
2-D Arrays
• An array that has 1-D arrays as its elements is called a 2-D array.
• These are often used to represent matrices or 2nd-order tensors.
3-D Arrays
• An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
• These are often used to represent a 3rd order tensor.
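The four cases can be sketched as follows. The values are arbitrary. Each extra level of nesting in the input adds one dimension, which shows up as one more entry in the array's shape.

```python
import numpy as np

arr_0d = np.array(42)                        # a single scalar value
arr_1d = np.array([1, 2, 3])                 # elements are 0-D values
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])    # elements are 1-D arrays
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                   [[1, 2, 3], [4, 5, 6]]])  # elements are 2-D arrays

for arr in (arr_0d, arr_1d, arr_2d, arr_3d):
    print(arr.shape)
```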
CHECK NUMBER OF DIMENSIONS
• NumPy arrays provide the ndim attribute, which returns an integer
that tells us how many dimensions the array has.
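A brief sketch of the ndim attribute on arrays of different depth, using arbitrary values:

```python
import numpy as np

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

# ndim is an attribute, not a method, so it is read without parentheses.
print(a.ndim, b.ndim, c.ndim, d.ndim)
```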
HIGHER DIMENSIONAL ARRAYS
• An array can have any number of dimensions.
• When the array is created, you can define the minimum number of
dimensions by using the ndmin argument.
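In np.array(), the argument that sets the minimum number of dimensions is named ndmin. A small sketch with arbitrary values:

```python
import numpy as np

# Ask for at least 5 dimensions; NumPy adds leading axes of length 1.
arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print("number of dimensions:", arr.ndim)
print("shape:", arr.shape)
```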
NumPy ARRAY INDEXING
• For small data sets you might be able to replace the wrong data
one by one, but not for big data sets.
• To replace wrong data for larger data sets you can create some
rules, e.g. set some boundaries for legal values, and replace any
values that are outside of the boundaries.
Replacing Values
• Loop through all values in the "Duration" column.
• If the value is higher than 120, set it to 120:
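A minimal sketch of this rule. The original data set is not shown here, so the DataFrame below is a hypothetical stand-in with a "Duration" column.

```python
import pandas as pd

# Hypothetical workout data; 450 is assumed to be a wrong entry.
df = pd.DataFrame({
    "Duration": [60, 45, 450, 120, 30],
    "Calories": [409.1, 282.4, 406.0, 1000.1, 195.1],
})

# Loop through the rows and cap "Duration" at 120.
for idx in df.index:
    if df.loc[idx, "Duration"] > 120:
        df.loc[idx, "Duration"] = 120

print(df)
```

For larger data sets, the same rule can be applied without an explicit loop, for example with df["Duration"].clip(upper=120).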
Removing Rows
• Another way of handling wrong data is to remove the rows that contain
wrong data.
• This way you do not have to find out what to replace them with, and there
is a good chance you do not need them for your analysis.
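Row removal can be sketched as below, again assuming a hypothetical DataFrame in which any "Duration" above 120 counts as wrong data.

```python
import pandas as pd

# Hypothetical workout data; 450 is assumed to be a wrong entry.
df = pd.DataFrame({
    "Duration": [60, 45, 450, 120, 30],
    "Calories": [409.1, 282.4, 406.0, 1000.1, 195.1],
})

# Find the index labels of the rows that break the rule, then drop them.
wrong_rows = df[df["Duration"] > 120].index
df = df.drop(wrong_rows)

print(df)
```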
Discovering Duplicates
• Duplicate rows are rows that have been registered more than one
time.
• By taking a look at our test data set, we can assume that rows 11 and
12 are duplicates.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row:
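A small sketch of duplicated() on a hypothetical DataFrame (not the test data set referred to above). By default, the first occurrence of a row is not marked; only the later repeats are.

```python
import pandas as pd

# Hypothetical data in which the last row repeats the first one.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Pulse": [110, 117, 109, 110],
})

# True marks a row that already appeared earlier in the DataFrame.
is_duplicate = df.duplicated()
print(is_duplicate)
```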
Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.
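A sketch of drop_duplicates() on a hypothetical DataFrame. The method returns a new DataFrame by default, so the result is assigned to a new name here.

```python
import pandas as pd

# Hypothetical data in which the last row repeats the first one.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Pulse": [110, 117, 109, 110],
})

# Keep the first occurrence of each row and drop the later repeats.
deduplicated = df.drop_duplicates()
print(deduplicated)
```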
Pandas - Data Correlations
• A great aspect of the Pandas module is the corr() method.
• The corr() method calculates the relationship between each pair of
columns in your data set.
• The corr() method works on numeric columns only. In recent pandas
versions (2.0 and later), non-numeric columns are not skipped
automatically; pass numeric_only=True to leave them out.
• Result Explained: The result of the corr() method is a table of
numbers that represent how strong the relationship is between two
columns.
• The number varies from -1 to 1.
• 1 means that there is a 1 to 1 relationship (a perfect correlation), and
for this data set, each time a value went up in the first column, the
other one went up as well.
• 0.9 is also a good relationship, and if you increase one value, the
other will probably increase as well.
• -0.9 would be just as good a relationship as 0.9, but if you increase one
value, the other will probably go down.
• 0.2 means NOT a good relationship: if one value goes up, it does not
mean that the other will.
• What is a good correlation? It depends on the use, but I think it is safe
to say you need at least 0.6 (or at most -0.6) to call it a good
correlation.
• Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000,
which makes sense, each column always has a perfect relationship
with itself.
• Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very
good correlation, and we can predict that the longer you work out,
the more calories you burn, and the other way around: if you burned
a lot of calories, you probably had a long workout.
• Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a
very bad correlation, meaning that we cannot predict the max pulse
by just looking at the duration of the workout, and vice versa.
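The sketch below shows how such a table is produced. The DataFrame is a small hypothetical stand-in with made-up values, so its numbers will not match the 0.922721 and 0.009403 quoted above.

```python
import pandas as pd

# Hypothetical workout data; the values are made up for illustration.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 120, 90],
    "Maxpulse": [130, 145, 135, 150, 140, 132],
    "Calories": [409.1, 479.0, 282.4, 195.1, 1000.1, 700.0],
})

# Pairwise correlation of all columns; every value lies between -1 and 1.
corr = df.corr()
print(corr)

# A single pair can be read from the table by row and column label.
print(corr.loc["Duration", "Calories"])
```

The diagonal of the table is always 1, because each column is compared with itself.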
WHAT IS Matplotlib?
• Matplotlib is a low-level graph plotting library in Python that serves as
a visualization utility.
• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and we can use it freely.
• Matplotlib is mostly written in Python; a few segments are written in
C, Objective-C and JavaScript for platform compatibility.
Import Matplotlib
• Once Matplotlib is installed, import it in your applications by adding
the import module statement:
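A minimal sketch of the import, assuming Matplotlib is already installed. Printing the version is just a quick way to confirm that the import worked.

```python
import matplotlib

# The installed version is stored in the __version__ attribute.
print(matplotlib.__version__)
```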
Pyplot
• EXAMPLE
• Draw a line in a diagram from position (0,0) to position (6,250):
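The line from position (0,0) to position (6,250) can be sketched with the pyplot submodule, which is conventionally imported under the plt alias. The variable names are chosen for this example only.

```python
import matplotlib.pyplot as plt
import numpy as np

# x-coordinates and y-coordinates of the two end points.
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

# plot() returns a list of line objects; unpack the single line.
line, = plt.plot(xpoints, ypoints)
plt.show()
```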
Plotting x and y points
• The plot() function is used to draw points (markers) in a diagram.
• By default, the plot() function draws a line from point to point.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 is an array containing the points on the x-axis.
• Parameter 2 is an array containing the points on the y-axis.
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two
arrays [1, 8] and [3, 10] to the plot function.
• Example
• Draw a line in a diagram from position (1, 3) to position (8, 10):
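A sketch of this example: the first array holds the x-values of both points and the second array holds the y-values, so the points are (1, 3) and (8, 10).

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])   # x-axis values of the two points
ypoints = np.array([3, 10])  # y-axis values of the two points

# By default, plot() connects the points with a line.
line, = plt.plot(xpoints, ypoints)
plt.show()
```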