EDA Unit-3
1. Explain with a suitable example the benefits of data transformation.
Data transformation involves converting raw data into a more suitable format for analysis,
modeling, or visualization. The process may include handling missing values, scaling
variables, encoding categorical data, and creating new features. Here, let's explore the benefits
of data transformation through a practical example:
Example: Predicting House Prices
Consider a dataset containing information about houses for sale, with features such as square
footage, number of bedrooms, neighborhood, and house prices. We'll discuss the benefits of
data transformation in the context of building a predictive model for house prices.
1. Handling Missing Values:
- Scenario: The dataset may have missing values, such as incomplete records for certain
houses.
- Benefit: Data transformation techniques, such as imputation (filling in missing values),
ensure that the model is trained on complete and informative data. For example, you might
replace missing values in the 'square footage' column with the mean value of the available data.
2. Scaling Numerical Variables:
- Scenario: The 'square footage' and 'number of bedrooms' variables may have different
scales.
- Benefit: Scaling ensures that variables with different scales contribute equally to the model.
For instance, the 'square footage' values might be scaled to a standard range, preventing
variables with larger magnitudes from dominating the modeling process.
3. Encoding Categorical Variables:
- Scenario: The 'neighborhood' variable is categorical, and machine learning models typically
require numerical input.
- Benefit: By transforming categorical variables into numerical representations (encoding),
models can effectively use this information. For instance, a technique like one-hot encoding
can be applied to represent each neighborhood as a binary feature, improving the model's ability
to capture neighborhood-related patterns.
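A minimal sketch of one-hot encoding with pandas (the 'neighborhood' values below are illustrative):
```python
import pandas as pd

# Hypothetical 'neighborhood' column with illustrative values
df = pd.DataFrame({'neighborhood': ['Downtown', 'Suburb', 'Downtown', 'Rural']})

# One-hot encode: each neighborhood becomes its own binary (0/1) column
encoded = pd.get_dummies(df, columns=['neighborhood'], prefix='neighborhood')
print(encoded)
```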
4. Creating Interaction Features:
- Scenario: The relationship between the 'number of bedrooms' and 'square footage' might be
more complex than each variable individually suggests.
- Benefit: By creating interaction features, such as the product of 'number of bedrooms' and
'square footage,' the model can capture non-linear relationships and potential synergies between
variables.
5. Normalizing the Target Variable:
- Scenario: House prices may have a skewed distribution.
- Benefit: Normalizing the target variable (e.g., using log transformation) can improve the
model's ability to capture patterns across the entire price range. This is especially beneficial
when the target variable has a wide range of values.
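A minimal sketch of log-transforming the target, assuming a hypothetical 'price' column (np.log1p computes log(1 + x); predictions can be mapped back with np.expm1):
```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed house prices
df = pd.DataFrame({'price': [150000, 220000, 310000, 1200000]})

# Log-transform the target to reduce skew
df['log_price'] = np.log1p(df['price'])
print(df)
```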
6. Handling Outliers:
- Scenario: The dataset may contain outliers in the 'house prices' variable.
- Benefit: Data transformation techniques, such as winsorizing or log transformation, can
mitigate the impact of outliers on the model, making it more robust to extreme values.
7. Improving Model Interpretability:
- Scenario: The model may struggle to interpret the relationships between variables due to
non-linearity or complex interactions.
- Benefit: Feature engineering and transformation can simplify complex relationships,
making the model more interpretable. For example, converting continuous variables into
categorical bins may reveal clearer patterns.
Data transformation is crucial for preparing data to meet the requirements of machine learning
models and analyses. It enhances the quality of input data, improves model performance, and
contributes to the interpretability of the results. In the context of predicting house prices, these
transformations can lead to a more accurate and robust model that effectively captures the
underlying patterns in the data.
2. Explain with a suitable example the challenges of data transformation.
Data transformation is a critical step in the data preprocessing pipeline, but it comes with its own set of challenges. Let's explore these challenges through a practical example involving a dataset
of customer reviews for a product.
Example: Sentiment Analysis on Customer Reviews
Consider a dataset containing customer reviews with text data, ratings, and additional metadata.
The goal is to perform sentiment analysis on the reviews to understand customer sentiments.
Here are some challenges associated with data transformation in this context:
1. Handling Text Data:
- Challenge: Customer reviews are often in unstructured text form, making it challenging to
extract meaningful features for analysis.
- Example: A review might say, "The product is great, but the delivery was slow."
- Solution: Text data requires techniques like text cleaning, tokenization, and perhaps more
advanced natural language processing (NLP) methods to convert it into a format suitable for
analysis.
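A minimal sketch of basic text cleaning and tokenization using only the standard library (a real pipeline might instead use an NLP library such as NLTK or spaCy):
```python
import re

review = "The product is great, but the delivery was slow."

# Lowercase, strip punctuation, and split into word tokens
cleaned = re.sub(r"[^a-z\s]", "", review.lower())
tokens = cleaned.split()
print(tokens)  # ['the', 'product', 'is', 'great', 'but', ...]
```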
2. Dealing with Missing or Incomplete Data:
- Challenge: The dataset may have missing or incomplete reviews or ratings.
- Example: Some customers might provide a rating without leaving a detailed review, or
certain reviews may lack a corresponding rating.
- Solution: Strategies such as imputation or data removal need to be applied to handle missing
or incomplete data while ensuring the quality of the analysis.
3. Handling Categorical Variables:
- Challenge: The dataset may contain categorical variables like product categories or
customer types.
- Example: Product categories could include "Electronics," "Clothing," and "Books."
- Solution: Categorical variables often need to be encoded or transformed using techniques
like one-hot encoding to be effectively incorporated into models.
4. Addressing Class Imbalance:
- Challenge: Sentiment analysis tasks often face class imbalance, where the number of
positive, negative, and neutral reviews is uneven.
- Example: The dataset may have a majority of positive reviews and a smaller number of
negative reviews.
- Solution: Techniques like oversampling, undersampling, or using specialized algorithms
need to be applied to handle class imbalance and prevent the model from being biased towards
the majority class.
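A minimal sketch of random oversampling with scikit-learn's resample (column names and values are illustrative; libraries such as imbalanced-learn offer more advanced options like SMOTE):
```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced review data
df = pd.DataFrame({
    'review': ['great', 'love it', 'fine', 'works well', 'terrible'],
    'sentiment': ['positive', 'positive', 'positive', 'positive', 'negative']
})

majority = df[df['sentiment'] == 'positive']
minority = df[df['sentiment'] == 'negative']

# Randomly oversample the minority class to match the majority class size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced['sentiment'].value_counts())
```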
5. Scaling Numeric Variables:
- Challenge: Numeric variables like review lengths or ratings might have different scales.
- Example: Review lengths could range from a few words to several paragraphs.
- Solution: Scaling or normalization techniques may be necessary to ensure that variables
with different scales are treated equally by the model.
6. Feature Engineering for Sentiment Analysis:
- Challenge: Identifying relevant features for sentiment analysis might be non-trivial.
- Example: Sentiments might be influenced by specific words, phrases, or the overall tone of
the review.
- Solution: Feature engineering involves creating meaningful features from the data, such as
sentiment scores, word counts, or sentiment lexicon-based features.
7. Managing Computational Complexity:
- Challenge: The complexity of NLP tasks, especially with large datasets, can be
computationally intensive.
- Example: Processing and transforming a large volume of customer reviews for sentiment
analysis.
- Solution: Efficient algorithms, parallel processing, or distributed computing may be
required to manage computational complexity.
8. Maintaining Interpretability:
- Challenge: As transformations and feature engineering become more sophisticated,
maintaining the interpretability of the model may become challenging.
- Example: A model built on complex text embeddings is harder to interpret than a simple model with basic features.
- Solution: Balancing the need for model accuracy with the need for interpretability is crucial.
Additionally, using interpretable models or post-hoc interpretability techniques can help.
Addressing these challenges in the context of sentiment analysis on customer reviews requires
a thoughtful and iterative approach to data transformation, feature engineering, and model
development. The goal is to extract meaningful insights from the data while managing the
complexities inherent in real-world datasets.
3. What is NaN? Explain the procedure of dealing with them.
NaN stands for "Not a Number," and it is a special floating-point value in computing that
represents undefined or unrepresentable values, especially in the context of numerical
operations. In the context of data analysis and programming languages like Python, NaN is
often used to represent missing or undefined data.
Dealing with NaN values in a dataset is a crucial step in data preprocessing. Here's a procedure
for handling NaN values:
1. Identify NaN Values:
- Use functions or methods to identify the presence of NaN values in your dataset. In Python,
popular libraries like Pandas provide functions like `isna()`, `isnull()`, or `info()` to identify
NaN values.
```python
import pandas as pd
# Assuming 'df' is your DataFrame
# Check for NaN values in the entire DataFrame
print(df.isna().sum())
# Check for NaN values in a specific column
print(df['column_name'].isna().sum())
```
2. Remove NaN Values:
- If the number of NaN values is relatively small compared to the dataset, you might choose to remove rows or columns containing NaN values. This can be done using the `dropna()` method in Pandas.
```python
# Remove rows with NaN values
df_cleaned = df.dropna()
```
3. Impute NaN Values:
- Replace NaN values with an estimated value, such as the mean, median, or mode of the column, using the `fillna()` method.
```python
# Impute NaN values with the median of the column
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
```
4. Forward or Backward Fill:
- For time-series data, you might consider forward or backward filling, where NaN values are filled with the value from the previous or next time point.
```python
# Forward fill NaN values in a column
df['column_name'] = df['column_name'].ffill()
```
5. Replace NaN Values with a Specific Value:
- For categorical columns, NaN values can be replaced with a placeholder category.
```python
# Replace NaN values with a specific value
df['column_name'] = df['column_name'].fillna('Unknown')
```
6. Keep NaN Values for Analysis:
- In some cases, keeping NaN values might be essential for analysis. Certain algorithms can
handle NaN values, and they might contain meaningful information (e.g., missing responses in
a survey).
7. Advanced Techniques:
- For more sophisticated handling of missing data, you can explore advanced techniques such
as machine learning-based imputation methods or use libraries like scikit-learn's
`SimpleImputer` for more advanced strategies.
```python
from sklearn.impute import SimpleImputer

# Impute missing values in 'column_name' with the column mean
imputer = SimpleImputer(strategy='mean')
df[['column_name']] = imputer.fit_transform(df[['column_name']])
```
Remember that the choice of how to deal with NaN values depends on the nature of the data
and the specific requirements of your analysis or modeling task. Always carefully consider the
implications of your chosen approach on the integrity and validity of the dataset.
4. Explain reshaping and pivoting of data with a suitable example.
Reshaping:
Reshaping refers to changing the layout or structure of a DataFrame, i.e., rearranging rows and columns (for example with stack, unstack, or melt) without altering the underlying data.
Pivoting:
Pivoting is a specific type of reshaping operation where you rotate or flip the data, transforming
it from one arrangement to another. It involves selecting a column to become the new index,
another column to become the new columns, and a third column to become the values in the
new DataFrame.
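A minimal pivoting sketch with pandas (the 'Date', 'Product', and 'Sales' columns are illustrative):
```python
import pandas as pd

# Long-format data: one row per (Date, Product) combination
df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 120]
})

# Pivot: 'Date' becomes the index, 'Product' the columns, 'Sales' the values
wide = df.pivot(index='Date', columns='Product', values='Sales')
print(wide)
```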
Both reshaping and pivoting play essential roles in data manipulation, allowing data scientists
and analysts to efficiently transform and structure data to meet the requirements of various
analytical tasks.
5. Explain various data transformation techniques
Data transformation involves converting raw data into a suitable format for analysis, modeling,
or visualization. Various techniques are employed to handle different aspects of data, including
handling missing values, scaling variables, encoding categorical data, and creating new
features. Here are some common data transformation techniques:
1. Handling Missing Values:
- Imputation: Fill in missing values with estimated or calculated values. Common imputation
methods include mean imputation, median imputation, and mode imputation.
- Deletion: Remove rows or columns with missing values. This is suitable when the number
of missing values is small compared to the size of the dataset.
2. Scaling and Normalization:
- Min-Max Scaling: Rescale numerical features to a specific range (e.g., [0, 1]) using the formula x_scaled = (x - min(x)) / (max(x) - min(x)).
3. Log Transformation:
- Purpose: Mitigate the effects of skewed distributions.
- Formula: x_transformed = log(x), commonly log(x + 1) when the data contains zeros.
6. Feature Scaling for Machine Learning:
- Purpose: Ensure that all features contribute equally to the model.
- Example: In algorithms like k-nearest neighbors, where distances between data points
matter.
- Pandas Example:
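A minimal sketch using plain pandas for min-max scaling (the column name is illustrative; scikit-learn's MinMaxScaler or StandardScaler would serve the same purpose in a modeling pipeline):
```python
import pandas as pd

df = pd.DataFrame({'SquareFootage': [850, 1200, 1600, 2400]})

# Min-max scale the column to the [0, 1] range
col = df['SquareFootage']
df['SquareFootage_scaled'] = (col - col.min()) / (col.max() - col.min())
print(df)
```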
7. Handling Outliers:
- Winsorizing: Replace extreme values with values closer to the mean or median. It helps
mitigate the impact of outliers.
- Trimming: Remove a certain percentage of extreme values from the dataset.
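A minimal winsorizing sketch with SciPy (the column name and the 20% limits are illustrative assumptions):
```python
import pandas as pd
from scipy.stats.mstats import winsorize

df = pd.DataFrame({'house_price': [150000, 180000, 210000, 250000, 5000000]})

# Cap the lowest and highest 20% of values to limit the influence of extremes
df['house_price_winsorized'] = winsorize(df['house_price'], limits=[0.2, 0.2])
print(df)
```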
8. Date and Time Transformations:
- Extracting Date Components: Extract year, month, day, etc., from a date column.
- Time Differences: Calculate the difference between two timestamps.
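A minimal sketch of both operations with pandas (column names and dates are illustrative):
```python
import pandas as pd

df = pd.DataFrame({
    'order_date': pd.to_datetime(['2023-01-05', '2023-02-17']),
    'ship_date': pd.to_datetime(['2023-01-08', '2023-02-20'])
})

# Extract date components
df['order_year'] = df['order_date'].dt.year
df['order_month'] = df['order_date'].dt.month

# Time difference between two timestamps, expressed in days
df['days_to_ship'] = (df['ship_date'] - df['order_date']).dt.days
print(df)
```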
9. Creating Interaction Features:
- Purpose: Capture interactions between variables.
- Example: Creating a new feature by multiplying two existing features.
- Pandas Example:
```python
df['Interaction'] = df['Feature1'] * df['Feature2']
```
10. Handling Skewed Data:
- Box-Cox Transformation: A power transformation that stabilizes the variance and makes
the data more normal.
- Yeo-Johnson Transformation: Similar to Box-Cox but can handle zero and negative values.
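A minimal sketch using scikit-learn's PowerTransformer (the column name is illustrative; method='box-cox' requires strictly positive values, while 'yeo-johnson' also handles zeros and negatives):
```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

df = pd.DataFrame({'income': [20000, 25000, 32000, 40000, 500000]})

# Yeo-Johnson power transformation to reduce skewness
pt = PowerTransformer(method='yeo-johnson')
df['income_transformed'] = pt.fit_transform(df[['income']]).ravel()
print(df)
```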
11. Logistic (Logit) Transformation for Percentage Data:
- Purpose: Stabilize variance for data that represents percentages or proportions.
- Formula: logit(p) = log(p / (1 - p)), applied after clipping p away from exactly 0 and 1.
These techniques are applied based on the characteristics of the data and the requirements of
the analysis or modeling task. The choice of transformation depends on factors such as the
distribution of the data, the nature of variables, and the assumptions of the analytical methods
being used.
6. Explain the use of concentrating along an axis with a suitable example.
"Concentrating along an axis" is not a standard term in data manipulation or programming. In practice, it refers to operations that aggregate or focus on specific subsets of a dataset along a particular axis, for example aggregating along an axis with the groupby function in Pandas.
Grouping and Aggregating Along an Axis (e.g., Sum, Mean):
In Pandas, the groupby operation allows you to split a DataFrame into groups based on some
criteria and then apply a function to each group independently. This can be seen as a way of
"concentrating" or focusing on specific subsets of the data along a particular axis, often with
the goal of aggregation.
Example: Grouping and Aggregating Data
Consider a dataset representing sales data for a retail store:
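A minimal sketch of such a dataset and the aggregation described below (the values are illustrative):
```python
import pandas as pd

# Hypothetical daily sales records for a retail store
df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 120]
})

# Group by 'Date' and sum the 'Sales' within each group
grouped_data = df.groupby('Date')['Sales'].sum()
print(grouped_data)
# Date
# 2023-01-01    250
# 2023-01-02    320
# Name: Sales, dtype: int64
```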
In this example, we are concentrating on the 'Date' axis and aggregating the sum of 'Sales' for
each date. The resulting grouped_data is a Series where the 'Date' column serves as the index,
and the sum of sales is the aggregated value for each date.
7. Explain various methods of Outlier Detection in data
Outliers are data points that significantly deviate from the rest of the dataset. Detecting
outliers is crucial for data cleaning and ensuring the robustness of statistical analyses and
machine learning models. Various methods exist for outlier detection, and the choice of
method depends on the nature of the data and the characteristics of outliers. Here are some
common methods for outlier detection:
1. Z-Score or Standard Score:
- Method: Calculate the z-score for each data point, which represents how many standard
deviations away it is from the mean.
- Threshold: Define a threshold (e.g., z-score > 3 or z-score < -3) to identify outliers.
- Implementation:
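A minimal sketch (the data are illustrative; a threshold of 3 standard deviations is a common convention):
```python
import numpy as np
import pandas as pd

# Illustrative data: mostly moderate values plus one extreme point
rng = np.random.default_rng(0)
df = pd.DataFrame({'value': np.append(rng.normal(50, 5, 200), 120)})

# Z-score: how many standard deviations each point lies from the mean
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()

# Flag points whose absolute z-score exceeds the threshold
outliers = df[z_scores.abs() > 3]
print(outliers)
```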
3. Median Absolute Deviation (MAD):
- Method: Calculate the median absolute deviation from the median and identify outliers based on a threshold.
- Implementation:
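A minimal sketch using the MAD-based modified z-score (the 0.6745 constant and the 3.5 threshold follow the common Iglewicz-Hoaglin convention; the data are illustrative):
```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})

median = df['value'].median()
mad = (df['value'] - median).abs().median()

# Modified z-score; points beyond the threshold are treated as outliers
modified_z = 0.6745 * (df['value'] - median) / mad
outliers = df[modified_z.abs() > 3.5]
print(outliers)
```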
4. Visualization Methods:
- Box Plot: Outliers are often visualized as points beyond the "whiskers" of a box plot.
- Scatter Plot: Visual inspection of a scatter plot can reveal points that deviate from the
overall pattern.
- Histogram: Visualization of the distribution may reveal extreme values.
5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- Method: DBSCAN identifies outliers based on the density of data points. Outliers are
often data points with low local density.
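A minimal sketch with scikit-learn's DBSCAN (eps and min_samples are illustrative and need tuning; points labelled -1 are treated as noise/outliers):
```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense cluster of points plus two far-away points
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
              [8.0, 8.0], [-7.0, 9.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# DBSCAN marks low-density points with the label -1
print(X[labels == -1])
```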
6. Isolation Forest:
- Method: Isolation Forest is an ensemble method that builds isolation trees to identify
anomalies as points that require fewer splits to isolate.
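A minimal sketch with scikit-learn's IsolationForest (the contamination rate is an illustrative assumption; a prediction of -1 marks an anomaly):
```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0, 1, size=(100, 2)),
                    [[8.0, 8.0]]])  # one obvious anomaly

# Fit the isolation forest and predict: 1 = normal, -1 = anomaly
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)
print(X[labels == -1])
```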
8. Statistical Methods:
- Grubbs' Test: Detects a single outlier in a univariate dataset.
- Modified Z-Score Method: Similar to the Z-score method but uses a robust estimate of
standard deviation.
It's essential to consider the characteristics of the data and the assumptions of each method
when choosing an outlier detection approach. Combining multiple methods or using domain-
specific knowledge can enhance the accuracy of outlier detection.
8. Discuss in detail about discretization and binning
1. Discretization:
Definition: Discretization is the process of transforming continuous data into discrete bins or
categories. It involves dividing a continuous range of values into intervals or bins and assigning
each data point to a specific bin.
- Purpose:
- Simplifies the data and reduces its granularity.
- Enables the use of categorical or ordinal variables in analyses.
- Helps handle non-linear relationships in machine learning models.
- Methods of Discretization:
- Equal-Width Binning: Divides the range of values into equal-width intervals. It's a simple
approach but may not be suitable for skewed distributions.
- Equal-Frequency Binning (Quantile Binning): Divides the data into bins containing
approximately the same number of observations. It ensures each bin has a similar frequency of
occurrences.
- Clustering-Based Discretization: Uses clustering algorithms like k-means to group similar
values into bins.
- Decision Tree-Based Discretization: Employs decision trees to find optimal split points
for binning.
- Example (Equal-Width Binning in Python):
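A minimal sketch using pandas' cut (the 'age' values and the number of bins are illustrative):
```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Split the range of ages into 4 equal-width intervals
age_bins = pd.cut(ages, bins=4)
print(age_bins.value_counts().sort_index())
```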
2. Binning:
- Definition: Binning is a specific form of discretization where continuous data is grouped
into bins or intervals based on predefined criteria. It is often used interchangeably with
discretization.
- Purpose:
- Simplifies complex data structures.
- Prepares data for analysis or visualization.
- Enhances interpretability.
- Binning Strategies:
- Fixed-Width Binning: Divides the range into fixed-width intervals.
- Adaptive Binning: Dynamically adjusts bin boundaries based on data distribution.
- Custom Binning: Specifies bin edges manually, allowing for domain-specific
considerations.
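A minimal custom-binning sketch with manually specified edges and labels (the edges and labels are illustrative):
```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Bin edges chosen from domain knowledge rather than the data distribution
edges = [18, 30, 45, 60, 100]
labels = ['Young', 'Adult', 'Middle-aged', 'Senior']
age_groups = pd.cut(ages, bins=edges, labels=labels)
print(age_groups.value_counts())
```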
Considerations:
- The choice between discretization and binning methods depends on the nature of the data,
the analysis goals, and the requirements of downstream tasks.
- The number of bins, bin widths, and bin edges should be chosen thoughtfully based on the
characteristics of the data.
- Discretization and binning introduce information loss, so it's essential to carefully consider
the trade-offs and implications for the specific analysis or modeling task.