SMA Expt 3
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No-03
A.1 Aim:
Data Cleaning and Storage- Pre-process, filter and store social media data for business (Using Python,
MongoDB, R, etc.).
Lab Outcome: Collect, monitor, store and track social media data.
A-2 Prerequisite
Data Mining, Data Analytics
A.3 Outcome
Students will be able to perform cleaning, pre-processing and filtering on social media data.
A.4 Theory:
What is data cleaning?
Data cleaning is a crucial process in data mining and plays an important part in building a
model. Although it is a necessary step, it is often neglected. Data quality is the central issue in
quality information management, and data quality problems can occur anywhere in an
information system; data cleaning addresses these problems. Data cleaning is the process of
fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset. When combining multiple data sources, there are many opportunities for data
to be duplicated or mislabelled, and if data is incorrect, outcomes and algorithms are unreliable
even though they may look correct. There is no single way to prescribe the exact steps in the
data cleaning process, because the process varies from dataset to dataset, but it is crucial to
establish a template for your data cleaning process so you know you are doing it the right way
every time. While the techniques used for data cleaning may vary according to the types of data
your company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove unwanted observations
Remove unwanted observations from your dataset, including duplicate observations and
irrelevant observations.
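This first step can be sketched with pandas; the dataset and column names below are hypothetical, chosen only to illustrate dropping duplicates and filtering irrelevant rows:

```python
import pandas as pd

# Hypothetical sample of social media posts (illustrative only)
posts = pd.DataFrame({
    "user": ["a", "b", "a", "c"],
    "text": ["great product!", "spam link", "great product!", "love it"],
    "topic": ["product", "other", "product", "product"],
})

# Drop exact duplicate rows
posts = posts.drop_duplicates()

# Drop observations irrelevant to the analysis (here: anything not about the product)
posts = posts[posts["topic"] == "product"]
print(len(posts))  # 2 rows remain
```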
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions,
typos or incorrect capitalization. These inconsistencies can cause mislabelled categories or classes.
Step 3: Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within the data
you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data
entry, doing so will improve the performance of the data you are working with.
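One common way to flag such outliers is the interquartile-range (IQR) rule; the engagement counts below are hypothetical, with 9999 standing in for a data-entry error:

```python
import pandas as pd

# Hypothetical engagement counts; 9999 is a data-entry error
likes = pd.Series([10, 12, 15, 11, 14, 9999])

# Flag outliers with the interquartile-range rule
q1, q3 = likes.quantile(0.25), likes.quantile(0.75)
iqr = q3 - q1
mask = likes.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = likes[mask]
print(cleaned.tolist())  # [10, 12, 15, 11, 14]
```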
Step 4: Handle missing data
● There are a few ways to deal with missing data. None of them is optimal, but all can be considered.
● As a first option, you can drop observations that have missing values, but doing this will drop or
lose information, so be mindful of this before you remove it.
● As a second option, you can input missing values based on other observations; again, there is an
opportunity to lose integrity of the data because you may be operating from assumptions and not
actual observations.
● As a third option, you might alter the way the data is used to effectively navigate null values.
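The first two options can be sketched with pandas; the `followers` column and its values are hypothetical, and mean imputation is just one of several possible replacement rules:

```python
import pandas as pd
import numpy as np

# Hypothetical follower counts with gaps
df = pd.DataFrame({"followers": [100, np.nan, 300, np.nan, 500]})

# Option 1: drop observations with missing values (loses information)
dropped = df.dropna()

# Option 2: impute missing values from other observations (here: the column mean)
imputed = df.fillna(df["followers"].mean())

print(len(dropped), imputed["followers"].tolist())
```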
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as part of
basic validation.
o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data to help you for your next theory?
o If not, is that because of a data quality issue?
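The rule-based part of this validation ("does the data follow the appropriate rules for its field?") can be expressed as simple assertions; the columns and the study window below are hypothetical:

```python
import pandas as pd

# Hypothetical cleaned posts: like counts and posting dates
df = pd.DataFrame({
    "likes": [5, 12, 0],
    "date": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-09"]),
})

# Field-level rules: counts are non-negative integers, dates fall in the study window
assert (df["likes"] >= 0).all()
assert df["likes"].dtype.kind == "i"
assert df["date"].between("2024-01-01", "2024-12-31").all()
print("validation passed")
```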
Because of incorrect or noisy data, false conclusions can inform poor business strategy and
decision-making. False conclusions can also lead to an embarrassing moment in a reporting
meeting when you realize your data does not stand up to scrutiny. Before you get there, it is
important to create a culture of quality data in your organization, and to document the tools you
might use to create this culture.
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard or emailed to the concerned lab
in-charge faculty at the end of the practical in case there is no Blackboard access available.)
● Students can use any social media dataset of their choice.
● Perform cleaning, pre-processing and filtering on the chosen data.
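As one possible starting point for this task, a minimal text pre-processing pass for social media posts might look like the sketch below; the function name and the cleaning rules (strip URLs and @mentions, keep hashtag words) are illustrative assumptions, not a prescribed method:

```python
import re

# A minimal text pre-processing pass for social media posts:
# lowercase, strip URLs and @mentions, keep hashtag words, collapse whitespace.
def clean_post(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = text.replace("#", "")               # keep hashtag words, drop the symbol
    return re.sub(r"\s+", " ", text).strip()

print(clean_post("Loved the demo! @acme https://t.co/xyz #GreatProduct"))
# → "loved the demo! greatproduct"
```

The cleaned strings can then be stored in any of the backends named in the Aim (e.g. inserted as documents into a MongoDB collection).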
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
Data Cleaning and Storage- Pre-process, filter and store social media data for business
Improved Data Quality: Clean data is essential for accurate analysis and reliable insights. High-
quality data ensures that decisions made based on the data are well-informed.
Enhanced Analysis: Clean data supports better statistical analysis and modeling, leading to more
accurate predictions and reports.
Efficiency: Clean data reduces the time and effort needed for analysis since users spend less time
checking for errors or inconsistencies.
Trustworthiness: In an environment where decisions are data-driven, stakeholders trust data more
when they know it has been cleaned and validated.
Q2. What are the Steps Involved in Data Cleaning? Enlist the Data Cleaning Techniques.
Ans)
Steps Involved in Data Cleaning:
Data Profiling: Assess the dataset to understand its structure, statistics, and quality.
Identifying Errors: Look for anomalies, duplicates, and inconsistencies in the data.
Handling Missing Values: Address gaps by either removing, imputing, or leaving them as-is,
depending on the situation.
Standardization: Normalize data formats, such as date format, categorical labels, etc.
Validation: Ensure that the data complies with defined rules and constraints (e.g., valid range,
data types).
Deduplication: Identify and remove duplicate records.
Correcting Errors: Fix identified issues based on predefined logic or through manual inspection.
Data Cleaning Techniques:
Outlier Detection: Identifying and managing values that are significantly different from the rest
of the data.
Normalization/Standardization: Scaling values to be on a similar range or format.
Imputation: Filling in missing data with replacement values (mean, median, mode, etc.).
Text Parsing and Tokenization: Breaking down text data into simpler, manageable pieces.
Aggregation: Summarizing data at a higher level to reduce a large volume of data.
Transformation: Applying functions to change the data's structure or format.
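Of the techniques above, normalization is easy to show concretely; the sketch below is a plain min-max rescaling into [0, 1], with hypothetical input values:

```python
# Min-max normalization: rescale values into the [0, 1] range
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
```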
Q3. What are the Steps Involved in Data Pre-processing? Enlist the Data Pre-processing
Techniques.
Ans)
Steps Involved in Data Pre-processing:
Data Collection: Gather raw data from various sources.
Data Integration: Combine data from different sources into a unified dataset.
Data Cleaning: (as detailed above) Remove noise and inconsistencies.
Data Transformation: Convert the data into an appropriate format or structure suited for analysis.
Data Reduction: Reduce the volume of data without significant loss of information, either
through sampling or dimensionality reduction.
Data Discretization: Convert continuous attributes into categorical ones by creating bins or
intervals.
Data Scaling: Normalize or standardize the data to ensure uniformity across various data points.
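The discretization step above can be sketched with pandas; the attribute (post length in characters), bin edges, and labels are hypothetical:

```python
import pandas as pd

# Discretize a continuous attribute (post length) into categorical bins
lengths = pd.Series([12, 45, 80, 200, 270])
bins = pd.cut(lengths, bins=[0, 50, 140, 280], labels=["short", "medium", "long"])
print(bins.tolist())  # ['short', 'short', 'medium', 'long', 'long']
```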
Data Pre-processing Techniques: