SMA Expt 3
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No-03
A.1 Aim:
Data Cleaning and Storage- Pre-process, filter and store social media data for business (Using Python,
MongoDB, R, etc.).
Lab Outcome: Collect, monitor, store and track social media data.
A-2 Prerequisite
Data Mining, Data Analytics
A.3 Outcome
Students will be able to perform cleaning, pre-processing and filtering on social media data.
A.4 Theory:
What is data cleaning?
Data cleaning is a crucial process in data mining and plays an important part in building a
model. Although it is a necessary step, it is often neglected. Data quality is the central issue in
quality information management, and data quality problems can occur anywhere in an
information system; data cleaning addresses these problems. Data cleaning is the process of
fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset. When combining multiple data sources, there are many opportunities for data
to be duplicated or mislabelled, and if data is incorrect, outcomes and algorithms are unreliable
even though they may look correct. There is no single way to prescribe the exact steps in the
data cleaning process, because the process varies from dataset to dataset, but it is crucial to
establish a template for your data cleaning process so you know you are doing it the right way
every time. While the techniques used for data cleaning may vary according to the types of data
your company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove unwanted observations
Remove unwanted observations from your dataset, including duplicate observations and
irrelevant observations.
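This first step can be sketched with pandas; the dataset and column names below are hypothetical, chosen only to illustrate dropping duplicates and filtering irrelevant rows:

```python
import pandas as pd

# Hypothetical sample of social media posts (illustrative only)
posts = pd.DataFrame({
    "user": ["a", "b", "a", "c"],
    "text": ["great product!", "spam link", "great product!", "love it"],
    "topic": ["product", "other", "product", "product"],
})

# Drop exact duplicate rows
posts = posts.drop_duplicates()

# Drop observations irrelevant to the analysis (here: anything not about the product)
posts = posts[posts["topic"] == "product"]
print(len(posts))  # 2 rows remain
```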
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions,
typos or incorrect capitalization. These inconsistencies can cause mislabelled categories or classes.
Step 3: Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within the data
you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data
entry, doing so will improve the performance of the data you are working with.
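One common way to flag such outliers is the interquartile-range (IQR) rule; the engagement counts below are hypothetical, with 9999 standing in for a data-entry error:

```python
import pandas as pd

# Hypothetical engagement counts; 9999 is a data-entry error
likes = pd.Series([10, 12, 15, 11, 14, 9999])

# Flag outliers with the interquartile-range rule
q1, q3 = likes.quantile(0.25), likes.quantile(0.75)
iqr = q3 - q1
mask = likes.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = likes[mask]
print(cleaned.tolist())  # [10, 12, 15, 11, 14]
```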
Step 4: Handle missing data
● There are a few ways to deal with missing data. None of them is optimal, but all can be considered.
● As a first option, you can drop observations that have missing values, but doing this will drop or
lose information, so be mindful of this before you remove it.
● As a second option, you can input missing values based on other observations; again, there is an
opportunity to lose integrity of the data because you may be operating from assumptions and not
actual observations.
● As a third option, you might alter the way the data is used to effectively navigate null values.
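The first two options can be sketched with pandas; the `followers` column and its values are hypothetical, and mean imputation is just one of several possible replacement rules:

```python
import pandas as pd
import numpy as np

# Hypothetical follower counts with gaps
df = pd.DataFrame({"followers": [100, np.nan, 300, np.nan, 500]})

# Option 1: drop observations with missing values (loses information)
dropped = df.dropna()

# Option 2: impute missing values from other observations (here: the column mean)
imputed = df.fillna(df["followers"].mean())

print(len(dropped), imputed["followers"].tolist())
```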
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as part of
basic validation.
o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data to help you for your next theory?
o If not, is that because of a data quality issue?
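The rule-based part of this validation ("does the data follow the appropriate rules for its field?") can be expressed as simple assertions; the columns and the study window below are hypothetical:

```python
import pandas as pd

# Hypothetical cleaned posts: like counts and posting dates
df = pd.DataFrame({
    "likes": [5, 12, 0],
    "date": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-09"]),
})

# Field-level rules: counts are non-negative integers, dates fall in the study window
assert (df["likes"] >= 0).all()
assert df["likes"].dtype.kind == "i"
assert df["date"].between("2024-01-01", "2024-12-31").all()
print("validation passed")
```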
Because of incorrect or noisy data, false conclusions can inform poor business strategy and
decision-making. False conclusions can also lead to an embarrassing moment in a reporting
meeting when you realize your data does not stand up to scrutiny. Before you get there, it is
important to create a culture of quality data in your organization, and to document the tools you
might use to create this culture.
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard or emailed to the concerned lab
in-charge faculty at the end of the practical in case there is no Blackboard access available.)
● Students can use any social media dataset of their choice.
● Perform cleaning, pre-processing and filtering on the chosen data.
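As one possible starting point for this task, a minimal text pre-processing pass for social media posts might look like the sketch below; the function name and the cleaning rules (strip URLs and @mentions, keep hashtag words) are illustrative assumptions, not a prescribed method:

```python
import re

# A minimal text pre-processing pass for social media posts:
# lowercase, strip URLs and @mentions, keep hashtag words, collapse whitespace.
def clean_post(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = text.replace("#", "")               # keep hashtag words, drop the symbol
    return re.sub(r"\s+", " ", text).strip()

print(clean_post("Loved the demo! @acme https://t.co/xyz #GreatProduct"))
# → "loved the demo! greatproduct"
```

The cleaned strings can then be stored in any of the backends named in the Aim (e.g. inserted as documents into a MongoDB collection).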
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
Data Cleaning and Storage- Pre-process, filter and store social media data for business
Improved Data Quality: Clean data is essential for accurate analysis and reliable insights. High-
quality data ensures that decisions made based on the data are well-informed.
Enhanced Analysis: Clean data supports better statistical analysis and modeling, leading to more
accurate predictions and reports.
Efficiency: Clean data reduces the time and effort needed for analysis since users spend less time
checking for errors or inconsistencies.
Trustworthiness: In an environment where decisions are data-driven, stakeholders trust data more
when they know it has been cleaned and validated.
Q2. What are the Steps Involved in Data Cleaning? Enlist the Data Cleaning Techniques.
Ans)
Steps Involved in Data Cleaning:
Data Profiling: Assess the dataset to understand its structure, statistics, and quality.
Identifying Errors: Look for anomalies, duplicates, and inconsistencies in the data.
Handling Missing Values: Address gaps by either removing, imputing, or leaving them as-is,
depending on the situation.
Standardization: Normalize data formats, such as date format, categorical labels, etc.
Validation: Ensure that the data complies with defined rules and constraints (e.g., valid range,
data types).
Deduplication: Identify and remove duplicate records.
Correcting Errors: Fix identified issues based on predefined logic or through manual inspection.
Data Cleaning Techniques:
Outlier Detection: Identifying and managing values that are significantly different from the rest
of the data.
Normalization/Standardization: Scaling values to be on a similar range or format.
Imputation: Filling in missing data with replacement values (mean, median, mode, etc.).
Text Parsing and Tokenization: Breaking down text data into simpler, manageable pieces.
Aggregation: Summarizing data at a higher level to reduce a large volume of data.
Transformation: Applying functions to change the data's structure or format.
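Of the techniques above, normalization is easy to show concretely; the sketch below is a plain min-max rescaling into [0, 1], with hypothetical input values:

```python
# Min-max normalization: rescale values into the [0, 1] range
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
```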
Q3. What are the Steps Involved in Data Pre-processing? Enlist the Data Pre-processing
Techniques.
Ans)
Steps Involved in Data Pre-processing:
Data Collection: Gather raw data from various sources.
Data Integration: Combine data from different sources into a unified dataset.
Data Cleaning: (as detailed above) Remove noise and inconsistencies.
Data Transformation: Convert the data into an appropriate format or structure suited for analysis.
Data Reduction: Reduce the volume of data without significant loss of information, either
through sampling or dimensionality reduction.
Data Discretization: Convert continuous attributes into categorical ones by creating bins or
intervals.
Data Scaling: Normalize or standardize the data to ensure uniformity across various data points.
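The discretization step above can be sketched with pandas; the attribute (post length in characters), bin edges, and labels are hypothetical:

```python
import pandas as pd

# Discretize a continuous attribute (post length) into categorical bins
lengths = pd.Series([12, 45, 80, 200, 270])
bins = pd.cut(lengths, bins=[0, 50, 140, 280], labels=["short", "medium", "long"])
print(bins.tolist())  # ['short', 'short', 'medium', 'long', 'long']
```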
Data Pre-processing Techniques: