Unit 1 (DWV)
DATA WRANGLING
Data, in its raw form, often contains errors, is incomplete, or is not in a readily usable format.
The data wrangling process transforms this raw data into a more usable form, enabling
organizations to uncover valuable insights more efficiently. This process not only saves time but
also ensures that the decisions made are based on accurate and high-quality data.
What is Data Wrangling?
Data wrangling, or data munging, is a crucial process in the data analytics workflow that
involves cleaning, structuring, and enriching raw data to transform it into a more suitable format
for analysis. This process includes cleaning the data by removing or correcting inaccuracies,
inconsistencies, and duplicates. It also involves structuring the data, often converting it into a
tabular form that is easier to work with in analytical applications.
Enriching the data is another critical step, where new information is added to make the data more
useful for analysis and validated to ensure its accuracy and quality. Data wrangling makes raw
data more accessible and meaningful, enabling analysts and data scientists to derive valuable
insights more efficiently and accurately.
Data Wrangling Process
Wrangling data involves the systematic and iterative transformation of raw, unstructured, or
messy data into a clean, structured, and usable format for data science and analytics.
Here we describe the 6 key steps:
Step 1: Discover
Initially, your focus is on understanding and exploring the data you’ve gathered. This involves
identifying data sources, assessing data quality, and gaining insights into the structure and format
of the data. Your goal is to establish a foundation for the subsequent data preparation steps by
recognizing potential challenges and opportunities in the data.
Step 2: Structure
In the data structuring step, you organize and format the raw data in a way that facilitates
efficient analysis. The specific form your data will take depends on which analytical model
you’re using, but structuring typically involves reshaping data, handling missing values, and
converting data types. This ensures that the data is presented in a coherent and standardized
manner, laying the groundwork for further manipulation and exploration.
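As a rough illustration, the following Python sketch (using pandas; the data and column names are hypothetical) shows typical structuring operations: converting string amounts to numeric types and reshaping from wide to long format.

import pandas as pd

# Hypothetical raw sales data in wide format (one column per quarter).
raw = pd.DataFrame({
    "store": ["A", "B"],
    "q1_sales": ["1,200", "950"],
    "q2_sales": ["1,310", "1,020"],
})

# Convert string amounts such as "1,200" into numeric values.
for col in ["q1_sales", "q2_sales"]:
    raw[col] = raw[col].str.replace(",", "").astype(float)

# Reshape from wide to long (tidy) format: one row per store per quarter.
tidy = raw.melt(id_vars="store", var_name="quarter", value_name="sales")
print(tidy)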
Step 3: Clean
Data cleansing is a crucial step to address inconsistencies, errors, and outliers within the dataset.
This involves removing or correcting inaccurate data, handling duplicates, and addressing any
anomalies that could impact the reliability of analyses. By cleaning the data, your focus is on
enhancing data accuracy and reliability for downstream processes.
Step 4: Enrich
Enriching your data involves enhancing it with additional information to provide more context or
depth. This can include merging datasets, extracting relevant features, or incorporating external
data sources. The goal is to augment the original dataset, making it more comprehensive and
valuable for analysis. If you do add data, be sure to structure and clean that new data.
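For example, enrichment often amounts to joining an external lookup table onto the working dataset. The sketch below (pandas, with hypothetical customer and region tables) shows a left join that adds a region column and then cleans the newly added data.

import pandas as pd

# Hypothetical working dataset and an external lookup table.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "spend": [120.0, 85.5, 42.0]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Left join keeps every customer and adds a region where a match exists.
enriched = customers.merge(regions, on="customer_id", how="left")

# The added data must itself be structured and cleaned,
# e.g. handle customers with no matching region.
enriched["region"] = enriched["region"].fillna("Unknown")
print(enriched)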
Step 5: Validate
Validation ensures the quality and reliability of your processed data. You’ll check for
inconsistencies, verify data integrity, and confirm that the data adheres to predefined standards.
Validation helps in building your confidence in the accuracy of the dataset and ensures that it
meets the requirements for meaningful analysis.
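A minimal validation sketch, assuming hypothetical column names and business rules, might assert a few predefined standards before the data is published:

import pandas as pd

df = pd.DataFrame({"order_id": [101, 102, 103], "amount": [25.0, 17.5, 40.0]})

# Data integrity: the key column must be unique and non-null.
assert df["order_id"].is_unique, "order_id contains duplicates"
assert df["order_id"].notna().all(), "order_id contains missing values"

# A business rule: order amounts must be positive.
assert (df["amount"] > 0).all(), "amount contains non-positive values"

print("All validation checks passed")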
Step 6: Publish
Now your curated and validated dataset is prepared for analysis or dissemination to business
users. This involves documenting data lineage and the steps taken during the entire wrangling
process, sharing metadata, and preparing the data for storage or integration into data science and
analytics tools. Publishing facilitates collaboration and allows others to use the data for their
analyses or decision-making processes.
Why Data Wrangling Matters in 2024
The relevance of data wrangling continues to grow in 2024 for several reasons:
1. Volume and Variety of Data: With the explosion of data from the internet, social media, IoT
devices, and many other sources, the volume and variety of data organizations need to
manage and analyze have increased exponentially. Data wrangling helps in handling this vast
amount of varied data efficiently.
2. Advanced Analytics and AI: The advancements in analytics and artificial intelligence (AI)
demand high-quality data. Data wrangling ensures that the data fed into these advanced
models is clean, accurate, and structured, which is critical for the success of AI and machine
learning projects.
3. Faster Decision Making: In today's fast-paced world, making quick, informed decisions is
crucial for staying competitive. Data wrangling accelerates data preparation, enabling
organizations to analyze data and gain insights more rapidly.
4. Compliance and Data Governance: Organizations must ensure their data is handled and
processed correctly, given the increasing data privacy and usage regulations, such as GDPR
and CCPA. Data wrangling ensures compliance by cleaning and structuring data according to
these regulations.
5. Enhanced Data Quality and Accuracy: The integrity of data analytics depends heavily on the quality and accuracy of the underlying data. Data wrangling helps improve both, enhancing the reliability of the insights derived from the data.
2. Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations and irrelevant observations. Duplicate observations happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are those that do not fit the specific problem you are trying to analyze. For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This makes analysis more efficient, minimizes distraction from your primary target, and creates a more manageable and more performant dataset.
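As a sketch of this step in pandas (the birth_year column and the millennial cut-off years are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Carol"],
    "birth_year": [1992, 1992, 1965, 1990],
})

# Drop exact duplicate rows created during collection or merging.
df = df.drop_duplicates()

# Keep only millennial customers (hypothetical cut-off years).
df = df[df["birth_year"].between(1981, 1996)]
print(df)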
Step 2: Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find "N/A" and "Not Applicable" both appear, but they should be analyzed as the same category.
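A small sketch of fixing such structural errors, assuming a hypothetical status column:

import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "n/a", "Active"]})

# Normalize capitalization and whitespace, then map variants to one label.
df["status"] = df["status"].str.strip().str.lower()
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())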
Step 3: Filter unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will help the performance of the data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are working on. Remember: just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
Step 4: Handle missing data
You can't ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but all can be considered (a brief sketch follows this list).
1. As a first option, you can drop observations that have missing values, but doing this will drop or lose information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose integrity of the data because you may be operating from assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null values.
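A brief sketch of the first two options, using pandas with hypothetical age and income columns:

import pandas as pd

df = pd.DataFrame({"age": [25, 31, None, 40], "income": [50000, None, 62000, 58000]})

# Option 1: drop observations with missing values (information is lost).
dropped = df.dropna()

# Option 2: impute missing values from other observations, here with each
# column's median (an assumption, not an actual observation).
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)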
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer basic validation questions: Does the data make sense? Does it follow the appropriate rules for its field? Does it prove or disprove your working theory, or bring any insight to light?
False conclusions because of incorrect or "dirty" data can inform poor business strategy and decision-making. False conclusions can also lead to an embarrassing moment in a reporting meeting when you realize your data doesn't stand up to scrutiny. Before you get there, it is important to create a culture of quality data in your organization. To do this, you should document the tools you might use to create this culture and agree on what data quality means to you. The following characteristics define quality data:
1. Validity. The degree to which your data conforms to defined business rules or
constraints.
2. Accuracy. Ensure your data is close to the true values.
3. Completeness. The degree to which all required data is known.
4. Consistency. Ensure your data is consistent within the same dataset and/or across
multiple data sets.
5. Uniformity. The degree to which the data is specified using the same unit of measure.
Advantages and benefits of data cleaning
Having clean data will ultimately increase overall productivity and allow for the highest-quality information in your decision-making, with fewer errors when multiple data sources are combined, more reliable reporting, and less rework downstream.
3. What is Formatting?
Formatting, in the context of data management, refers to the process of structuring and arranging
data to conform to certain rules or guidelines. It is a necessary step in data analysis, ensuring
consistent, clean, and ready-to-use data. It serves as the linchpin for various operations like
data extraction, transformation, and loading (ETL), thereby laying the groundwork for
subsequent data analysis and insights.
Functionality and Features
Formatting allows for standardization and normalization of data. It aids in error detection
and data cleaning, setting the stage for reliable data analytics. In addition, it supports diverse
data types, encompassing structured and unstructured data, facilitating seamless
interoperability amongst various data systems.
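As a simple illustration (the column names and formats are hypothetical), formatting often means standardizing data types so downstream tools can rely on them:

import pandas as pd

df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-05", "2024-03-10"],
    "price": ["19.99", "5", "12.50"],
})

# Standardize date strings into a proper datetime type.
df["order_date"] = pd.to_datetime(df["order_date"])

# Convert textual prices to numeric values for analysis.
df["price"] = pd.to_numeric(df["price"])
print(df.dtypes)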
Formatting provides numerous benefits, including improved data quality, increased efficiency
in data processing and analytics, and enhanced compatibility between different systems and
platforms. Its uses extend across industries, enabling efficient data analysis for business
intelligence, predictive modeling, machine learning algorithms, and more.
Despite its benefits, formatting comes with challenges, such as handling massive data volumes,
managing complex data types, and maintaining data integrity during transformation. In
addition, it requires sophisticated tools and technical expertise to manage effectively.
Formatting plays a vital role in a data lakehouse environment. It facilitates the ingestion of
diverse data types into the lakehouse, transforming them into a structured form suitable for
querying and analysis. By organizing data effectively in a data lakehouse, formatting operations
enable efficient BI reporting, AI modeling, and advanced analytics.
Security Aspects
While handling data formatting, it's critical to consider security. Ensuring data privacy, access
control, and data governance are crucial in the formatting process. Innovative solutions like
Dremio provide built-in data protection measures, offering robust security during data
formatting.
Performance
Efficient formatting significantly impacts data processing performance, allowing for faster
queries, smoother ETL processes, and optimized analytics. Dremio's technology excels in this
area, providing high-speed data formatting and transformation capabilities.
4. Outlier detection
Outliers are samples which deviate extremely from other data samples. The process of detecting
outliers is also known as anomaly detection. Outlier detection is important in medical
applications as well as many other applications such as credit card fraud detection, intrusion
detection, fault detection in smart grids, image processing, network surveillance, and any
application that requires concentrating on uncommon activities/phenomena. Outlier detection can
be considered as the complement of the clustering process, since the former looks for the
anomalies and minorities while the latter focuses on the larger groups (data majority) and
clusters them into groups based on how similar they are to each other.
It is important to distinguish between outliers and noisy data. Outliers are genuine objects in the data set that have unique characteristics that isolate them from the remaining objects, whereas noisy data result from erroneous values that appear randomly in datasets. Outliers reveal that they have been generated by a mechanism different from the one that generated the remaining data. Fig. 2.2 illustrates what outliers look like in a neuroimaging dataset (Obafemi-Ajayi et al., 2017).
There are many outlier detection techniques, including statistical, proximity-based, and clustering-based methods (Han, Kamber, & Pei, 2012). In fact, for small datasets, outliers can be detected visually, as shown in the figure, where the data has been visualized using principal component analysis as discussed in Chapter 8. In the figure on the left, the outlier classified as part of cluster 2 is far away from any other samples. After using a different clustering approach, it is classified on its own, which is the most common case in outlier detection.
Outlier Detection
Identifying and dealing with outliers is a key part of data analysis. Outlier detection refers to
identifying data that is significantly different from the majority of your other data. These outliers
can be abnormal data points, fraudulent transactions, faulty sensor readings, etc. Detecting
outliers is important for data cleaning so as to avoid skewing ML analysis.
There are various statistical and ML techniques for detecting outliers. Statistical methods rely on
things like mean, standard deviation, quantiles, etc. to identify outliers. Machine learning methods
use things like isolation forests, one-class SVMs, autoencoders, etc. With so many choices, the
technique depends on factors like data size, type of anomaly, how the anomalies will be treated,
etc.
Statistical Methods
For smaller datasets, simple statistical methods like z-scores and quantile ranges can be used to
identify outliers. For example, z-score measures how many standard deviations an observation is
from the mean. A threshold like z=3 can be used to detect potential outliers.
Another commonly used method employs the interquartile range (IQR), the spread between the 1st and 3rd quartiles. Any observation more than 1.5 * IQR below the 1st quartile or above the 3rd quartile can be considered an outlier.
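A compact sketch of both statistical approaches with pandas and NumPy (the data is illustrative; z = 3 and 1.5 * IQR are the common conventions mentioned above):

import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is a likely outlier

# Z-score method: flag points more than 3 standard deviations from the mean.
# Note that on very small samples the outlier inflates the standard deviation,
# so the IQR method is often more robust.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR method: flag points beyond 1.5 * IQR below Q1 or above Q3.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)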
Machine Learning Methods
Machine learning models like isolation forests, one-class SVMs, or autoencoders are excellent at outlier detection. Isolation forests isolate anomalies rather than simply profiling normal data. They build decision trees that partition data recursively, isolating outliers more quickly and with
fewer partitions. One-class SVMs learn a boundary around normal data points; new samples
outside the boundary are flagged as anomalies. Autoencoders learn compressed representations of
data. Samples with high reconstruction error are potential outliers.
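A minimal sketch using scikit-learn's IsolationForest (the synthetic data and contamination rate are illustrative assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 normal points plus two obvious anomalies.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = np.array([[8.0, 8.0], [-7.0, 9.0]])
X = np.vstack([normal, anomalies])

# fit_predict returns 1 for inliers and -1 for detected outliers.
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)

print("Detected outliers:", X[labels == -1])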
5. Duplicates
Data cleaning is an essential step in any data science project, as it can improve the quality,
accuracy, and reliability of your data. However, one of the common challenges that data
scientists face is how to handle duplicates in their data sets. Duplicates are records that have
identical or very similar values for some or all of the variables, and they can introduce bias,
noise, and errors in your analysis and modeling. In this section, you will learn what the main causes and types of duplicates are, how to detect and measure them, and how best to handle them depending on your data and objectives.
1. Causes of duplicates
Duplicates can arise from various sources, such as human errors, data entry mistakes, data
integration issues, web scraping errors, or data collection methods. For example, a customer may
fill out a form twice with slightly different information, a data entry operator may copy and paste
the same record multiple times, a data integration process may merge data from different sources
without checking for uniqueness, a web scraper may extract the same page more than once, or a
survey may collect responses from the same respondent using different identifiers. Some of these
causes are easy to avoid or fix, while others may require more complex solutions.
Duplicate data entries are like uninvited guests at a data-driven party. Their origins can be diverse: sometimes it's users inadvertently resubmitting a form, or perhaps a
database merge that didn't account for overlapping records. Other times, system bugs or
software hiccups lead to redundant information being saved. For businesses with multiple
data entry points, a lack of synchronization can easily result in repeated records.
Understanding the causes behind these duplicates is the first pivotal step. By tackling the
root, we not only cleanse our current dataset but fortify it against future redundancies.
Understanding why duplicates emerge is essential: is it human error, data integration complexity, or a system quirk? A customer filling out a form twice with slight variations, a data entry operator duplicating records, an integration process merging data without uniqueness checks, repetitive web scraping, or survey responses from the same respondent under different identifiers can all contribute. Identifying the cause allows you to implement preventive measures effectively; while some causes are easily prevented or corrected, others demand more intricate solutions.
In practice, handling duplicates typically follows these steps (a pandas sketch follows the list):
1. Identify duplicates: Use Pandas or other data manipulation tools to identify duplicate records based on key attributes.
2. Assess importance: Consider the context and significance of the data. Not all duplicates are errors; some may be legitimate.
3. Decide handling method: Depending on the context, decide whether to remove duplicates, merge them, or keep them for analysis.
4. Remove exact duplicates: If duplicates are errors, remove them to maintain data integrity.
5. Merge or aggregate data: If duplicates represent different aspects of the same entity, merge or aggregate them to consolidate information.
6. Use unique identifiers: If possible, rely on unique identifiers to avoid introducing duplicates.
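A brief pandas sketch of identifying, removing, and merging duplicates (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c2@x.com"],
    "spend": [100, 100, 50, 30, 20],
})

# Identify duplicates on a key attribute.
print(df[df.duplicated(subset="customer_id", keep=False)])

# Remove rows that are exact duplicates.
deduped = df.drop_duplicates()

# Merge/aggregate rows that describe the same entity.
merged = deduped.groupby("customer_id", as_index=False).agg(
    email=("email", "first"), spend=("spend", "sum")
)
print(merged)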
2. Types of duplicates
Duplicates can be classified into two main types: exact duplicates and near duplicates. Exact
duplicates are records that have the same values for all or a subset of the variables, and they are
usually easier to identify and remove. Near duplicates are records that have similar but not
identical values for some or all of the variables, and they are more difficult to detect and handle.
Near duplicates can result from variations in spelling, formatting, punctuation, capitalization,
abbreviations, synonyms, or missing values. For example, two records may have the same name
but different email addresses, or the same address but different phone numbers.
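Detecting near duplicates usually requires fuzzy matching rather than exact comparison. A simple sketch using Python's standard-library difflib (the records and the 0.8 similarity threshold are illustrative assumptions):

from difflib import SequenceMatcher

records = [
    "Jon Smith, 12 Main St.",
    "John Smith, 12 Main Street",
    "Ann Lee, 4 Oak Ave",
]

# Compare every pair and flag those above a similarity threshold.
threshold = 0.8
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = SequenceMatcher(None, records[i], records[j]).ratio()
        if score >= threshold:
            print(f"Possible near duplicate ({score:.2f}): {records[i]!r} / {records[j]!r}")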
6. Normalization and Standardization
Normalization and standardization are two essential techniques used in data preprocessing in machine learning and data science. Both techniques transform data onto a common scale to make it easier to process and analyze. Although these techniques are often used interchangeably, they have different applications and can be used in different contexts. In this section, we will explore the differences between normalization and standardization, their applications, and how to use them effectively in your data analysis.
What is Normalization?
Normalization in machine learning is a data preprocessing technique used to change the value of
the numerical column in the dataset to a common scale without distorting the differences in the
range of values or losing information.
The two most common normalization techniques are Min-Max Scaling and Z-Score
Normalization, which is also called Standardization.
Now, let's discuss Min-Max Scaling.
Min-Max Scaling
This method rescales the features to a fixed range, usually 0 to 1. The formula for calculating the scaled value of a feature is:
Normalized Value = (Value - Min) / (Max - Min)
where,
Value: Original Value of the feature
Min: Minimum value of the feature across all the data points.
Max: Maximum value of the feature across all the data points.
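A short sketch of Min-Max Scaling, first by applying the formula directly and then with scikit-learn's MinMaxScaler (the income column is a hypothetical example):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [30000, 45000, 60000, 90000]})

# Apply the formula directly: (Value - Min) / (Max - Min).
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Equivalent result using scikit-learn.
df["income_scaled_sk"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
print(df)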
Advantages and Disadvantages of Normalization
Advantages:
1. Improves Algorithm Performance: Normalization can lead to faster convergence and improve the performance of machine learning algorithms, especially those that are sensitive to the scale of input features.
2. Consistent Scale: It brings all the variables to the same scale, making it easier to compare the importance of features directly.
Disadvantages:
1. Data Dependency: The normalization process makes the training data dependent on the specific scale, which might not be appropriate for all kinds of data distributions.
2. Loss of Information: In some cases, normalization can lead to a loss of information, especially if the data is sparse and the normalization compresses different values into a small range.