SMA Expt 3

LAB MANUAL

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No-03

A.1 Aim:
Data Cleaning and Storage: Pre-process, filter, and store social media data for business (using Python,
MongoDB, R, etc.).

Lab Objective: To understand the fundamental concepts of social media networks.

Lab Outcome: Collect, monitor, store, and track social media data.

A.2 Prerequisite
Data Mining, Data Analytics

A.3 Outcome
Students will be able to perform cleaning, pre-processing, and filtering of social media data.

A.4 Theory:
What is data cleaning?

Data cleaning is a crucial step in data mining and plays an important part in building a model. It is
a necessary process, yet it is often neglected. Data quality is the central issue in quality information
management, and data quality problems can occur anywhere in an information system; data
cleaning addresses these problems. Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When multiple
data sources are combined, there are many opportunities for data to be duplicated or mislabelled.
If the data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
There is no single prescribed sequence of steps in the data cleaning process, because the process
varies from dataset to dataset. It is nevertheless important to establish a template for your data
cleaning process so that you know you are doing it the right way every time. While the techniques
used for data cleaning vary with the types of data your company stores, the following basic steps
provide a framework for your organization.

The steps involved in data cleaning are:

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations.
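A minimal sketch of this step, assuming the posts have already been loaded into a pandas DataFrame named posts (the column names are hypothetical):

import pandas as pd

# Hypothetical sample of social media posts.
posts = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol"],
    "text": ["great product!", "meh", "great product!", "love it"],
})

# Drop exact duplicate rows; alternatively restrict the check to chosen columns.
posts = posts.drop_duplicates()
posts = posts.drop_duplicates(subset=["user", "text"])
print(posts)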
Step 2: Fix structural errors

Structural errors arise when you measure or transfer data and end up with strange naming conventions,
typos, or inconsistent capitalization. These inconsistencies can produce mislabelled categories or classes.
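A small illustrative sketch (the column name and label mapping are assumptions) of normalizing inconsistent capitalization and spelling variants with pandas:

import pandas as pd

posts = pd.DataFrame({"platform": ["Twitter ", "twitter", "Face book", "facebook"]})

# Trim whitespace and lowercase to remove inconsistent capitalization.
posts["platform"] = posts["platform"].str.strip().str.lower()

# Map known spelling variants onto a single canonical label.
posts["platform"] = posts["platform"].replace({"face book": "facebook"})
print(posts["platform"].value_counts())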

Step 3: Filter unwanted outliers

Often there are one-off observations that, at a glance, do not appear to fit within the data you are
analyzing. If you have a legitimate reason to remove an outlier, such as an obvious data-entry error,
doing so will improve the quality of the data you are working with.
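For numeric fields such as like counts, one common heuristic is the interquartile-range (IQR) rule; the sketch below is illustrative and the column name is an assumption:

import pandas as pd

posts = pd.DataFrame({"likes": [3, 5, 4, 6, 5, 4, 9800]})  # 9800 looks like an outlier

q1, q3 = posts["likes"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only observations inside the IQR fence.
filtered = posts[(posts["likes"] >= lower) & (posts["likes"] <= upper)]
print(filtered)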

Step 4: Handle missing data

● There are a few ways to deal with missing data. None of them is optimal, but all can be considered (a short sketch of the first two follows this list).
● As a first option, you can drop observations that have missing values, but doing this loses information, so be mindful of that before removing anything.
● As a second option, you can impute missing values based on other observations; again, there is a risk of losing the integrity of the data, because you may be operating from assumptions rather than actual observations.
● As a third option, you might alter the way the data is used so that null values are handled explicitly.
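A short sketch of the first two options, assuming hypothetical user and likes columns:

import numpy as np
import pandas as pd

posts = pd.DataFrame({"user": ["alice", None, "carol"],
                      "likes": [10, np.nan, 4]})

# Option 1: drop rows that contain any missing values.
dropped = posts.dropna()

# Option 2: impute missing numeric values, e.g. with the column median.
imputed = posts.copy()
imputed["likes"] = imputed["likes"].fillna(imputed["likes"].median())
print(dropped, imputed, sep="\n\n")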

Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation.
o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data to help you for your next theory?
o If not, is that because of a data quality issue?
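Some of these checks can be automated with simple assertions; the sketch below assumes a DataFrame named posts produced by the earlier steps, with hypothetical column names:

# Basic validation/QA checks on the cleaned data.
assert posts["likes"].ge(0).all(), "like counts must be non-negative"
assert posts["user"].notna().all(), "every post must have a user"
assert not posts.duplicated().any(), "no duplicate rows should remain"
print("All validation checks passed.")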

Incorrect or noisy data can lead to false conclusions that inform poor business strategy and
decision-making. False conclusions can also lead to an embarrassing moment in a reporting meeting
when you realize your data does not stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization, and to document the tools and strategy you
use to maintain it.

Process of Data Cleaning:


The following steps show the process of data cleaning in data mining.
1. Monitor the errors: Keep track of where most mistakes arise. This makes it easier to
identify and correct false or corrupt information, which is especially important when
integrating another data source with established management software.
2. Standardize the mining process: Standardize the point of data entry to reduce the
chances of duplication.
3. Validate data accuracy: Analyze and invest in data tools that can clean records in real
time. Such tools use artificial intelligence to check data for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data.
Repeated handling of the same data can be avoided by investing in dedicated data-cleansing
tools that can analyze raw data in bulk and automate the operation.
5. Research the data: Before this activity, the data must be standardized, validated, and
scrubbed for duplicates. There are many approved and authorized third-party sources that
can capture information directly from our databases; they help clean and compile the data
to ensure completeness, accuracy, and reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop helps develop and
strengthen the client relationship and send more targeted data to prospective customers.

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty
at the end of the practical in case there is no Blackboard access available.)

Roll. No.: A17 Name: Laukik Pawar


Class: BE_A Batch: A1
Date of Experiment: Date of Submission:
Grade:
B.1. Study the fundamentals of a social media platform and implement data cleaning, pre-
processing, filtering, and storage of social media data for business:
(Paste your Search material completed during the 2 hours of practical in the lab here)

● Students can use any social media data to perform cleaning, pre-processing and filtering.
● Use the chosen data to perform cleaning, pre-processing, and filtering; a possible end-to-end sketch is shown below.
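One possible end-to-end sketch follows. The file name, column names, keyword, and database/collection names are all assumptions, and a MongoDB instance is presumed to be running locally:

import pandas as pd
from pymongo import MongoClient

# Load raw social media data (hypothetical file and columns).
posts = pd.read_csv("social_media_posts.csv")

# Cleaning and pre-processing.
posts = posts.drop_duplicates()
posts["text"] = posts["text"].str.strip().str.lower()
posts = posts.dropna(subset=["user", "text"])
posts["likes"] = posts["likes"].fillna(0).astype(int)

# Filtering: keep only posts relevant to the business keyword of interest.
posts = posts[posts["text"].str.contains("product", na=False)]

# Store the cleaned records in MongoDB.
client = MongoClient("mongodb://localhost:27017/")
collection = client["sma_lab"]["cleaned_posts"]
collection.insert_many(posts.to_dict("records"))
print(f"Stored {len(posts)} cleaned posts.")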

B.2 Input and Output:


(Command and its output)
B.3 Observations and learning:
(Students are expected to comment on the output obtained with clear observations and
learning for each task/ sub part assigned)
We applied data cleaning and storage techniques to pre-process, filter, and store social media data for business use.

B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
Social media data was successfully cleaned, pre-processed, filtered, and stored for business use.

B.5 Question of Curiosity


(To be answered by student based on the practical performed and learning/observations)

Q1. What is Data Cleaning? Explain its Importance.


Ans)
Data Cleaning refers to the process of detecting and correcting (or removing) corrupt or
inaccurate records from a dataset. This process involves identifying errors, inconsistencies, and
duplicates in the data, as well as handling missing values and standardizing formats.

Importance of Data Cleaning:

Improved Data Quality: Clean data is essential for accurate analysis and reliable insights. High-
quality data ensures that decisions made based on the data are well-informed.

Enhanced Analysis: Clean data supports better statistical analysis and modeling, leading to more
accurate predictions and reports.

Efficiency: Clean data reduces the time and effort needed for analysis since users spend less time
checking for errors or inconsistencies.

Trustworthiness: In an environment where decisions are data-driven, stakeholders trust data more
when they know it has been cleaned and validated.

Regulatory Compliance: In several industries, maintaining accurate records is a legal
requirement, making data cleaning vital for compliance with regulations.

Q2. What are the Steps Involved in Data Cleaning? Enlist the Data Cleaning Techniques.

Ans)
Steps Involved in Data Cleaning:
Data Profiling: Assess the dataset to understand its structure, statistics, and quality.
Identifying Errors: Look for anomalies, duplicates, and inconsistencies in the data.
Handling Missing Values: Address gaps by either removing, imputing, or leaving them as-is,
depending on the situation.
Standardization: Normalize data formats, such as date format, categorical labels, etc.
Validation: Ensure that the data complies with defined rules and constraints (e.g., valid range,
data types).
Deduplication: Identify and remove duplicate records.
Correcting Errors: Fix identified issues based on predefined logic or through manual inspection.
Data Cleaning Techniques:

Outlier Detection: Identifying and managing values that are significantly different from the rest
of the data.
Normalization/Standardization: Scaling values to be on a similar range or format.
Imputation: Filling in missing data with replacement values (mean, median, mode, etc.).
Text Parsing and Tokenization: Breaking down text data into simpler, manageable pieces.
Aggregation: Summarizing data at a higher level to reduce a large volume of data.
Transformation: Applying functions to change the data's structure or format.
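As an illustration of two of these techniques, the sketch below applies mean imputation and simple tokenization to made-up data:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "comment": ["Great phone", "Battery died fast", "OK", "", "Loved it"]})

# Imputation: fill missing ages with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Text parsing/tokenization: split comments into lowercase word lists.
df["tokens"] = df["comment"].str.lower().str.split()
print(df)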

Q3. What are the Steps Involved in Data Pre-processing? Enlist the Data Pre-processing Techniques.

Ans)
Steps Involved in Data Pre-processing:
Data Collection: Gather raw data from various sources.
Data Integration: Combine data from different sources into a unified dataset.
Data Cleaning: (as detailed above) Remove noise and inconsistencies.
Data Transformation: Convert the data into an appropriate format or structure suited for analysis.
Data Reduction: Reduce the volume of data without significant loss of information, either
through sampling or dimensionality reduction.
Data Discretization: Convert continuous attributes into categorical ones by creating bins or
intervals.
Data Scaling: Normalize or standardize the data to ensure uniformity across various data points.
Data Pre-processing Techniques:

Normalization: Scaling the feature values to a range between 0 and 1.
Standardization: Centering the data around the mean and scaling to unit variance.
Encoding Categorical Variables: Using techniques like one-hot encoding, label encoding, etc., to
convert categorical data into a numerical format.
Feature Selection: Selecting a subset of relevant features for use in model construction.
Feature Engineering: Creating new features based on existing data to improve model
performance.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) or t-SNE for
reducing the number of features.
Binning: Grouping values into bins to reduce the influence of minor observation errors.
These processes ensure that the dataset is prepared adequately for analytics or machine learning
processes, leading to more reliable outcomes.
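An illustrative sketch of normalization, standardization, and one-hot encoding using pandas and scikit-learn (the feature names and values are assumptions):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"followers": [120, 4500, 98000],
                   "platform": ["twitter", "facebook", "twitter"]})

# Normalization: rescale follower counts to the [0, 1] range.
df["followers_norm"] = MinMaxScaler().fit_transform(df[["followers"]]).ravel()

# Standardization: zero mean, unit variance.
df["followers_std"] = StandardScaler().fit_transform(df[["followers"]]).ravel()

# One-hot encoding of the categorical platform column.
df = pd.get_dummies(df, columns=["platform"])
print(df)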
