
UNIT-1

DATA WRANGLING

Data, in its raw form, often contains errors, is incomplete, or is not in a readily usable format.
The data wrangling process transforms this raw data into a more usable form, enabling
organizations to uncover valuable insights more efficiently. This process not only saves time but
also ensures that the decisions made are based on accurate and high-quality data.
What is Data Wrangling?

Data wrangling, or data munging, is a crucial process in the data analytics workflow that
involves cleaning, structuring, and enriching raw data to transform it into a more suitable format
for analysis. This process includes cleaning the data by removing or correcting inaccuracies,
inconsistencies, and duplicates. It also involves structuring the data, often converting it into a
tabular form that is easier to work with in analytical applications.

Enriching the data is another critical step, where new information is added to make the data more
useful for analysis and validated to ensure its accuracy and quality. Data wrangling makes raw
data more accessible and meaningful, enabling analysts and data scientists to derive valuable
insights more efficiently and accurately.

Data Wrangling Process

Wrangling data involves the systematic and iterative transformation of raw, unstructured, or
messy data into a clean, structured, and usable format for data science and analytics.
Here we describe the 6 key steps:

Step 1: Discover

Initially, your focus is on understanding and exploring the data you’ve gathered. This involves
identifying data sources, assessing data quality, and gaining insights into the structure and format
of the data. Your goal is to establish a foundation for the subsequent data preparation steps by
recognizing potential challenges and opportunities in the data.
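
As a minimal sketch of this discovery step (assuming, purely for illustration, that the gathered data lives in a hypothetical customers.csv file), pandas can be used to profile structure and quality:

import pandas as pd

# Load the raw data (customers.csv is a hypothetical example file)
df = pd.read_csv("customers.csv")

# Inspect structure: column names, data types, and non-null counts
df.info()

# Summary statistics for numeric columns
print(df.describe())

# Quick quality checks: missing values per column and duplicate rows
print(df.isna().sum())
print("Duplicate rows:", df.duplicated().sum())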

Step 2: Structure

In the data structuring step, you organize and format the raw data in a way that facilitates
efficient analysis. The specific form your data will take depends on which analytical model
you’re using, but structuring typically involves reshaping data, handling missing values, and
converting data types. This ensures that the data is presented in a coherent and standardized
manner, laying the groundwork for further manipulation and exploration.
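
A small illustrative sketch of typical structuring operations with pandas (the date, region, and amount columns are hypothetical): converting data types and reshaping long data into a tabular, analysis-ready form.

import pandas as pd

# Hypothetical sales data in "long" form, with everything stored as text
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["North", "South", "North"],
    "amount": ["100", "250", "300"],
})

# Convert data types to what the analysis expects
df["date"] = pd.to_datetime(df["date"])
df["amount"] = pd.to_numeric(df["amount"])

# Reshape: one row per date, one column per region; missing combinations surface as NaN
wide = df.pivot(index="date", columns="region", values="amount")
print(wide)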

Step 3: Clean

Data cleansing is a crucial step to address inconsistencies, errors, and outliers within the dataset.
This involves removing or correcting inaccurate data, handling duplicates, and addressing any
anomalies that could impact the reliability of analyses. By cleaning the data, your focus is on
enhancing data accuracy and reliability for downstream processes.

Step 4: Enrich

Enriching your data involves enhancing it with additional information to provide more context or
depth. This can include merging datasets, extracting relevant features, or incorporating external
data sources. The goal is to augment the original dataset, making it more comprehensive and
valuable for analysis. If you do add data, be sure to structure and clean that new data.
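
As a brief sketch (with hypothetical customer and region tables), enrichment often amounts to a join that adds context, plus a derived feature:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region_code": ["N", "S", "N"]})
regions = pd.DataFrame({"region_code": ["N", "S"],
                        "region_name": ["North", "South"]})

# Left join keeps every customer and adds the region name as extra context
enriched = customers.merge(regions, on="region_code", how="left")

# Derive a simple new feature from the merged information
enriched["is_north"] = enriched["region_name"].eq("North")
print(enriched)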

Step 5: Validate

Validation ensures the quality and reliability of your processed data. You’ll check for
inconsistencies, verify data integrity, and confirm that the data adheres to predefined standards.
Validation helps in building your confidence in the accuracy of the dataset and ensures that it
meets the requirements for meaningful analysis.
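
One lightweight way to express such checks is with plain assertions; the rules below (unique IDs, non-negative amounts, no missing regions) are hypothetical examples of predefined standards, not a general-purpose validator:

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 3],
                   "amount": [100.0, 250.0, 300.0],
                   "region": ["North", "South", "North"]})

# Example validation rules -- adapt these to your own data contract
assert df["customer_id"].is_unique, "customer_id must be unique"
assert (df["amount"] >= 0).all(), "amount must be non-negative"
assert df["region"].notna().all(), "region must not be missing"
print("All validation checks passed")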

Step 6: Publish

Now your curated and validated dataset is prepared for analysis or dissemination to business
users. This involves documenting data lineage and the steps taken during the entire wrangling
process, sharing metadata, and preparing the data for storage or integration into data science and
analytics tools. Publishing facilitates collaboration and allows others to use the data for their
analyses or decision-making processes.
Why Data Wrangling Matters in 2024

The relevance of data wrangling continues to grow in 2024 for several reasons:

1. Volume and Variety of Data: With the explosion of data from the internet, social media, IoT
devices, and many other sources, the volume and variety of data organizations need to
manage and analyze have increased exponentially. Data wrangling helps in handling this vast
amount of varied data efficiently.

2. Advanced Analytics and AI: The advancements in analytics and artificial intelligence (AI)
demand high-quality data. Data wrangling ensures that the data fed into these advanced
models is clean, accurate, and structured, which is critical for the success of AI and machine
learning projects.

3. Faster Decision Making: In today's fast-paced world, making quick, informed decisions is
crucial for staying competitive. Data wrangling accelerates data preparation, enabling
organizations to analyze data and gain insights more rapidly.

4. Compliance and Data Governance: Organizations must ensure their data is handled and
processed correctly, given the increasing data privacy and usage regulations, such as GDPR
and CCPA. Data wrangling ensures compliance by cleaning and structuring data according to
these regulations.

5. Enhanced Data Quality and Accuracy: Data analytics' integrity heavily depends on the
quality and accuracy of the underlying data. Data wrangling helps improve the quality and
accuracy of data, enhancing the reliability of the insights derived from it.

2. DATA CLEAN UP BASICS

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you
might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more
performant dataset.
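
A minimal pandas sketch of this step, with a hypothetical birth_year column standing in for the millennial-customer filter described above:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob", "Cara"],
                   "birth_year": [1990, 1990, 1975, 1995]})

# Drop exact duplicate observations
df = df.drop_duplicates()

# Drop observations that are irrelevant to the question (keep millennials only)
df = df[df["birth_year"].between(1981, 1996)]
print(df)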

Step 2: Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be
analyzed as the same category.
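
As a sketch, the "N/A" versus "Not Applicable" example can be handled by normalizing text case and mapping known variants onto one label (the status column is illustrative):

import pandas as pd

df = pd.DataFrame({"status": ["N/A", "not applicable", "Active", "ACTIVE "]})

# Trim whitespace and normalize capitalization
df["status"] = df["status"].str.strip().str.lower()

# Map known variants onto a single canonical category
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())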

Step 3: Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data-
entry, doing so will help the performance of the data you are working with. However, sometimes
it is the appearance of an outlier that will prove a theory you are working on. Remember: just
because an outlier exists, doesn’t mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.

Step 4: Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There are a few common ways to deal with it. None of them is ideal, but each is worth considering; a short pandas sketch of the first two options follows the list.

1. As a first option, you can drop observations that have missing values, but doing this will
drop or lose information, so be mindful of this before you remove it.
2. As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null
values.
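
A minimal pandas sketch of the first two options on a small hypothetical table:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None],
                   "income": [50000, 62000, None, 48000]})

# Option 1: drop rows that contain missing values (information is lost)
dropped = df.dropna()

# Option 2: impute missing values from other observations (here, the column median)
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)
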
Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation:

• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you.

5 characteristics of quality data

1. Validity. The degree to which your data conforms to defined business rules or
constraints.
2. Accuracy. Ensure your data is close to the true values.
3. Completeness. The degree to which all required data is known.
4. Consistency. Ensure your data is consistent within the same dataset and/or across
multiple data sets.
5. Uniformity. The degree to which the data is specified using the same unit of measure.
Advantages and benefits of data cleaning

Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:

• Removal of errors when multiple sources of data are at play.
• Fewer errors make for happier clients and less-frustrated employees.
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning will make for more efficient business practices and quicker
decision-making.

3. What is Formatting?

Formatting, in the context of data management, refers to the process of structuring and arranging
data to conform to certain rules or guidelines. It is a necessary step in data analysis, ensuring
consistent, clean, and ready-to-use data. It serves as the linchpin for various operations like
data extraction, transformation, and loading (ETL), thereby laying the groundwork for
subsequent data analysis and insights.
Functionality and Features

Formatting allows for standardization and normalization of data. It aids in error detection
and data cleaning, setting the stage for reliable data analytics. In addition, it supports diverse
data types, encompassing structured and unstructured data, facilitating seamless
interoperability amongst various data systems.
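
As an illustrative sketch (not tied to any particular platform, and using made-up column names), formatting a raw extract might mean standardizing column names, parsing dates, and enforcing consistent types before the data moves further through an ETL pipeline:

import pandas as pd

raw = pd.DataFrame({"Order Date": ["01/02/2024", "15/02/2024"],
                    "Amount": ["1,200", "850"]})

# Standardize column names to a consistent convention
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Parse dates into a proper datetime type (day-first format assumed for this example)
raw["order_date"] = pd.to_datetime(raw["order_date"], format="%d/%m/%Y")

# Remove thousands separators and convert amounts to numeric
raw["amount"] = pd.to_numeric(raw["amount"].str.replace(",", ""))
print(raw.dtypes)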

Benefits and Use Cases

Formatting provides numerous benefits, including improved data quality, increased efficiency
in data processing and analytics, and enhanced compatibility between different systems and
platforms. Its uses extend across industries, enabling efficient data analysis for business
intelligence, predictive modeling, machine learning algorithms, and more.

Challenges and Limitations

Despite its benefits, formatting comes with challenges, such as handling massive data volumes,
managing complex data types, and maintaining data integrity during transformation. In
addition, it requires sophisticated tools and technical expertise to manage effectively.

Integration with Data Lakehouse

Formatting plays a vital role in a data lakehouse environment. It facilitates the ingestion of
diverse data types into the lakehouse, transforming them into a structured form suitable for
querying and analysis. By organizing data effectively in a data lakehouse, formatting operations
enable efficient BI reporting, AI modeling, and advanced analytics.

Security Aspects

While handling data formatting, it's critical to consider security. Ensuring data privacy, access
control, and data governance are crucial in the formatting process. Innovative solutions like
Dremio provide built-in data protection measures, offering robust security during data
formatting.

Performance

Efficient formatting significantly impacts data processing performance, allowing for faster
queries, smoother ETL processes, and optimized analytics. Dremio's technology excels in this
area, providing high-speed data formatting and transformation capabilities.

4. Outlier detection
Outliers are samples which deviate extremely from other data samples. The process of detecting
outliers is also known as anomaly detection. Outlier detection is important in medical
applications as well as many other applications such as credit card fraud detection, intrusion
detection, fault detection in smart grids, image processing, network surveillance, and any
application that requires concentrating on uncommon activities/phenomena. Outlier detection can
be considered as the complement of the clustering process, since the former looks for the
anomalies and minorities while the latter focuses on the larger groups (data majority) and
clusters them in groups based on how similar they are to each other.

It is important to distinguish between outliers and noisy data. Outliers are genuine objects in the data set whose unique characteristics set them apart from the remaining objects, suggesting they were generated by a mechanism different from the one that generated the rest of the data. Noisy data, in contrast, result from erroneous values that appear randomly in datasets. Fig. 2.2 illustrates what outliers look like in a neuroimaging dataset (Obafemi-Ajayi et al., 2017).

There are many outlier detection techniques, including statistical, proximity-based, and clustering-based methods (Han, Kamber, & Pei, 2012). In fact, for small datasets, outliers can be detected visually, as shown in the figure, where the data has been visualized using principal component analysis (discussed in Chapter 8). In the figure on the left, the outlier classified as part of cluster 2 lies far away from any other samples. After using a different clustering approach, it is classified in a cluster of its own, which is the most common outcome in outlier detection.

Outlier Detection

Identifying and dealing with outliers is a key part of the data analysis. Outlier detection refers to
identifying data that is significantly different from the majority of your other data. These outliers
can be abnormal data points, fraudulent transactions, faulty sensor readings, etc. Detecting
outliers is important for data cleaning so as to avoid skewing ML analysis.

There are various statistical and ML techniques for detecting outliers. Statistical methods rely on
things like mean, standard deviation, quantiles, etc. to identify outliers. Machine learning methods
use things like isolation forests, one-class SVMs, autoencoders, etc. With so many choices, the
technique depends on factors like data size, type of anomaly, how the anomalies will be treated,
etc.
Statistical Methods

For smaller datasets, simple statistical methods like z-scores and quantile ranges can be used to
identify outliers. For example, z-score measures how many standard deviations an observation is
from the mean. A threshold such as |z| > 3 can be used to flag potential outliers.

Another commonly used method employs the interquartile range (IQR), the spread between the 1st quartile (Q1) and the 3rd quartile (Q3). Any observation falling more than 1.5 * IQR below Q1 or above Q3 can be considered an outlier.
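
A small sketch of both rules on a hypothetical one-dimensional sample (the planted value of 120 is the intended outlier):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(50, 5, 200), [120.0]))

# Z-score rule: flag observations more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR rule: flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)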

Machine Learning Methods

Machine learning models like isolation forests, one-class SVMs, or autoencoders are excellent at
outlier detection. Isolation forests isolate anomalies rather than simply profiling normal data.
They build decision trees that partition data recursively, thus isolating outliers quicker, and with
fewer partitions. One-class SVMs learn a boundary around normal data points; new samples
outside the boundary are flagged as anomalies. Autoencoders learn compressed representations of
data. Samples with high reconstruction error are potential outliers.
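
A brief sketch of the isolation-forest approach with scikit-learn; the toy data and the contamination setting (a rough guess at the share of anomalies) are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),        # normal data
               np.array([[8.0, 8.0], [-9.0, 7.5]])])   # two planted anomalies

# Fit an isolation forest and label each sample: -1 = anomaly, 1 = normal
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)

print("Flagged points:")
print(X[labels == -1])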

5. DUPLICATES

Data cleaning is an essential step in any data science project, as it can improve the quality,
accuracy, and reliability of your data. However, one of the common challenges that data
scientists face is how to handle duplicates in their data sets. Duplicates are records that have
identical or very similar values for some or all of the variables, and they can introduce bias,
noise, and errors in your analysis and modeling. In this article, you will learn the main causes and types of duplicates, how to detect and measure them, and the best ways to handle them depending on your data and objectives.

1. Causes of duplicates
Duplicates can arise from various sources, such as human errors, data entry mistakes, data
integration issues, web scraping errors, or data collection methods. For example, a customer may
fill out a form twice with slightly different information, a data entry operator may copy and paste
the same record multiple times, a data integration process may merge data from different sources
without checking for uniqueness, a web scraper may extract the same page more than once, or a
survey may collect responses from the same respondent using different identifiers. Some of these
causes are easy to avoid or fix, while others may require more complex solutions.

Duplicate data entries are like uninvited guests at a data-driven party, and their origins can be diverse: users inadvertently resubmitting a form, a database merge that did not account for overlapping records, system bugs or software hiccups that save redundant information, or multiple data entry points that lack synchronization. Understanding the causes behind these duplicates is the first pivotal step; by tackling the root, you not only cleanse the current dataset but also fortify it against future redundancies. Is the cause human error, data integration complexity, or a system quirk? Identifying it allows you to put effective preventive measures in place. While some causes are easily prevented or corrected, others demand more intricate solutions.

To handle duplicates in practice (a short pandas sketch follows this list):

• Identify Duplicates: Use Pandas or other data manipulation tools to identify duplicate records based on key attributes.
• Assess Importance: Consider the context and significance of the data. Not all duplicates are errors; some may be legitimate.
• Decide Handling Method: Depending on the context, decide whether to remove duplicates, merge them, or keep them for analysis.
• Remove Exact Duplicates: If duplicates are errors, remove them to maintain data integrity.
• Merge or Aggregate Data: If duplicates represent different aspects of the same entity, merge or aggregate them to consolidate information.
• Use Unique Identifiers: If possible, rely on unique identifiers for data to avoid introducing duplicates.
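
A compact pandas sketch of identifying, removing, and aggregating duplicates on a hypothetical customer_id key:

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3],
                   "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@x.org"],
                   "spend": [100, 100, 50, 75, 25]})

# Identify duplicates on a key attribute
print(df[df.duplicated(subset="customer_id", keep=False)])

# Remove exact duplicates (rows identical across all columns)
deduped = df.drop_duplicates()

# Merge/aggregate rows that describe the same entity
aggregated = deduped.groupby("customer_id", as_index=False).agg(
    email=("email", "first"), spend=("spend", "sum"))
print(aggregated)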

2. Types of duplicates
Duplicates can be classified into two main types: exact duplicates and near duplicates. Exact
duplicates are records that have the same values for all or a subset of the variables, and they are
usually easier to identify and remove. Near duplicates are records that have similar but not
identical values for some or all of the variables, and they are more difficult to detect and handle.
Near duplicates can result from variations in spelling, formatting, punctuation, capitalization,
abbreviations, synonyms, or missing values. For example, two records may have the same name
but different email addresses, or the same address but different phone numbers.
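
Detecting near duplicates usually calls for fuzzy matching. A minimal sketch using Python's standard-library difflib to compare free-text records pairwise; the 0.8 similarity threshold and the sample strings are arbitrary illustrative choices:

from difflib import SequenceMatcher

records = ["John Smith, 12 Main St.",
           "Jon Smith, 12 Main Street",
           "Alice Brown, 9 Oak Ave."]

def similarity(a: str, b: str) -> float:
    # Return a similarity ratio between 0 and 1 for two strings
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair and flag likely near duplicates
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score > 0.8:
            print(f"Possible near duplicate ({score:.2f}): {records[i]!r} vs {records[j]!r}")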

6. Normalization and standardization

Normalization and standardization are two essential techniques used in data preprocessing in
machine learning and data science. Both techniques are used to transform data into a common
scale to make it easier to process and analyze. Although these techniques are often used
interchangeably, they have different applications and can be used in different contexts. In this
article, we will explore the differences between normalization and standardization, their
applications, and how to use them effectively in your data analysis.

What is Normalization?
Normalization in machine learning is a data preprocessing technique used to change the value of
the numerical column in the dataset to a common scale without distorting the differences in the
range of values or losing information.

In simple terms, Normalization refers to the process of transforming features in a dataset to a specific range. This range can be different depending on the chosen normalization technique.

The two most common normalization techniques are Min-Max Scaling and Z-Score
Normalization, which is also called Standardization.
Now, let's discuss Min-Max Scaling.
Min-Max Scaling
This method rescales the features to a fixed range, usually 0 to 1. The formula for calculating the scaled value of a feature is:
Normalized Value = (Value - Min) / (Max - Min)
where,
Value: Original Value of the feature
Min: Minimum value of the feature across all the data points.
Max: Maximum value of the feature across all the data points.
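
A small sketch of min-max scaling, computed both directly from the formula and with scikit-learn's MinMaxScaler (the toy income column is illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [30000, 45000, 60000, 120000]})

# Direct formula: (value - min) / (max - min)
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Equivalent result using scikit-learn
df["income_sklearn"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
print(df)
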
Advantages and Disadvantages of Normalization
Advantages:
• Improves Algorithm Performance: Normalization can lead to faster convergence and improve the performance of machine learning algorithms, especially those that are sensitive to the scale of input features.
• Consistent Scale: It brings all the variables to the same scale, making it easier to compare the importance of features directly.
• Reduces the Impact of Outliers: Methods like Min-Max scaling can reduce the impact of outliers, although this can also be a disadvantage in cases where outliers are important.

Disadvantages:
• Data Dependency: The normalisation process makes the training data dependent on the specific scale, which might not be appropriate for all kinds of data distributions.
• Loss of Information: In some cases, normalization can lead to a loss of information, especially if the data is sparse and the normalization compresses different values into a small range.
• Sensitivity to New Data: The parameters used for normalization (min, max, mean, standard deviation) can change with the introduction of new data, requiring re-normalization with updated parameters.

What is Standardization?
Standardization is a data preprocessing technique used in statistics and machine learning to
transform the features of your dataset so that they have a mean of 0 and a standard deviation of 1.
This process involves rescaling the distribution of values so that the mean of observed values is
aligned to 0 and the standard deviation to 1.
• Standardisation aims to adjust the scale of data without distorting differences in the ranges of
values or losing information.
• Unlike other scaling techniques, standardization maintains all original data points' information
(except for cases of constant columns).
• It ensures that no single feature dominates the model's output due to its scale, leading to more
balanced and interpretable models.
Formula of Standardization
Z = (x-mean)/standard deviation
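
A matching sketch for standardization, computing z-scores directly and with scikit-learn's StandardScaler (toy data again; note that StandardScaler uses the population standard deviation, ddof=0):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [30000, 45000, 60000, 120000]})

# Direct formula: z = (x - mean) / standard deviation (population std to match scikit-learn)
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Equivalent result using scikit-learn
df["income_sklearn"] = StandardScaler().fit_transform(df[["income"]]).ravel()
print(df)
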
Advantages and Disadvantages of Standardization
Advantages:
• Improves Convergence Speed: Standardization can speed up the convergence of many machine learning algorithms by ensuring features have the same scale.
• Handles Outliers Better: It is less sensitive to outliers compared to Min-Max scaling because it scales data based on the distribution's standard deviation.
• Useful for Algorithms Assuming Normal Distribution: Many machine learning algorithms assume that the input features are normally distributed. Standardization makes this assumption more valid.

Disadvantages:
• Not Bound to a Specific Range: Unlike Min-Max scaling, standardization does not bound features to a specific range, which might be a requirement for certain algorithms.
• May Hide Useful Information: In some cases, the process of standardizing can hide useful information about outliers that could be beneficial for the model.
• Requirement for Recalculation: Whenever new data is added to the dataset, the standardization process may need to be recalculated and applied again to maintain consistency.

When to use Normalization?
• When using algorithms that assume the input features are on a similar scale or bounded
range, such as neural networks. These algorithms often assume input values are in the range
[0,1].
• When you want to speed up the convergence of gradient descent by ensuring all features
contribute equally to the cost function.
• If the data doesn't follow a Gaussian distribution.
• For models where the magnitude of variables is important, such as k-nearest neighbours.
When to use Standardization?
• Algorithms that assume the input features are normally distributed with zero mean and unit
variance, such as Support Vector Machines, Logistic Regression, etc.
• Standardization can be a better choice if your data contains many outliers as it scales the data
based on the standard deviation.
• It is often used before applying Principal Component Analysis (PCA) to ensure that each
feature contributes equally to the analysis.
• If the data features exhibit a Gaussian distribution, meaning that the data is normally distributed.
Key Difference Between Normalization and Standardization
• Standardization transforms data to have a mean of 0 and a standard deviation of 1, whereas normalization scales the data to a specific user-defined range, typically [0, 1] or [-1, 1].
• Normalization makes no assumption about the underlying data distribution, while
standardization is often used when the data is assumed to be normally distributed.
• Standardization is preferred for algorithms that are sensitive to feature scale or assume
normality, such as Logistic Regression and Support Vector Machines, while normalization is
better suited for distance-based algorithms like k-nearest neighbours (KNN).
