Ds Unit 3 Notes
Big Data refers to extremely large datasets that are complex and grow exponentially over
time. These datasets often exceed the capabilities of traditional data processing tools and
require specialized technologies to manage and analyze. The key traits of Big Data are often
described using the "5 Vs". These traits differentiate Big Data from traditional data, and
understanding them is crucial for managing and leveraging it effectively. Here is a detailed
breakdown of each trait with examples:
1. Volume
Definition:
Volume refers to the sheer amount of data that is generated, stored, and processed within a
system. When we talk about Volume, we’re essentially discussing the quantity of data, which
can be massive in scale.
Explanation:
In the context of Big Data, Volume is one of the key characteristics. Big Data is all about
handling enormous datasets that traditional data processing tools can't manage efficiently, so
the amount of data directly determines the storage and processing infrastructure an
organization needs.
Example: Consider a social media platform like Facebook. Every day, users upload billions
of photos, status updates, and comments. The sheer volume of this data is enormous and
continues to grow exponentially. For instance, Facebook users upload approximately 100
million photos daily. Storing and analyzing this massive volume of data requires scalable
storage solutions and powerful processing capabilities.
Implications and Challenges:
When dealing with large datasets, storing and analyzing them can be quite challenging.
Traditional databases might struggle with the sheer amount of data, leading to performance
issues and slow processing times. To manage this, organizations need to adopt advanced
database management systems that can efficiently handle large volumes of data. Additionally,
scalable storage solutions are necessary to ensure that the data can be stored without running
out of space or compromising on speed.
Technologies: Distributed storage and processing frameworks such as Hadoop HDFS and
scalable cloud object storage are commonly used to handle data at this volume.
2. Variety
Definition: The different forms that data can take. Think of it like different types of food:
some are solid (like a sandwich), some are liquid (like soup), and some are a mix of things
(like a salad). Similarly, data comes in different "flavors" or types:
Structured Data: highly organized data that fits neatly into rows and columns, such as records
in a relational (SQL) database or a spreadsheet.
Semi-Structured Data: data with some organizational markers but no rigid schema, such as
JSON, XML, or log files.
Unstructured Data: data with no predefined structure, such as free text, emails, images, audio,
and video.
The complexity arises from integrating and analyzing these diverse data types to gain
meaningful insights. For instance, combining purchase data with social media sentiment
analysis can provide a more comprehensive view of customer behavior.
Implications: Tools and techniques like data integration platforms, data lakes, and advanced
analytics are used to handle and extract value from this variety of data.
1. Data Integration Platforms
Data Integration Platforms are tools that help combine data from various sources into a
unified view. They allow you to:
• Collect Data: Aggregate data from multiple sources like databases, APIs, and files.
• Transform Data: Clean, filter, and format the data to make it suitable for analysis.
• Load Data: Place the cleaned data into a central repository for easier access and analysis.
Example: Imagine a company collects data from customer feedback forms, sales records, and
social media interactions. A data integration platform would help merge this data into a single
system where it can be analyzed collectively.
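As a rough sketch of the collect, transform, and load steps above, the following Python/Pandas
snippet combines two hypothetical sources; the file names and the customer_id column are made
up for illustration.

import pandas as pd

# Collect: load data from two hypothetical sources
feedback = pd.read_csv('feedback.csv')   # customer feedback forms
sales = pd.read_csv('sales.csv')         # sales records

# Transform: standardize the join key before combining
feedback['customer_id'] = feedback['customer_id'].astype(str).str.strip()
sales['customer_id'] = sales['customer_id'].astype(str).str.strip()

# Load: merge into one unified view and store it in a central file
unified = feedback.merge(sales, on='customer_id', how='outer')
unified.to_csv('unified_customer_view.csv', index=False)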
2. Data Lakes
A Data Lake is a storage repository that holds vast amounts of raw data in its native format
until needed. Unlike traditional databases, data lakes are designed to store structured, semi-
structured, and unstructured data. Here’s why they’re useful:
• Scalability: Data lakes can handle huge volumes of data, scaling up as needed.
• Flexibility: Since the data is stored in its raw form, you can perform different types of
analyses without predefined structures.
• Cost-Effective: Storing data in its native format can be more cost-effective than traditional
database systems.
Example: Suppose a company collects sensor data from manufacturing equipment, social
media posts, and text documents. A data lake can store all this diverse data until it’s needed
for analysis.
3. Advanced Analytics
Advanced Analytics involves using sophisticated techniques and tools to analyze complex
data sets. This includes:
• Machine Learning: Algorithms that can learn from data and make predictions or decisions.
For instance, predicting customer churn based on historical data.
• Data Mining: Discovering patterns and relationships within large datasets. For example,
identifying purchasing patterns among different customer segments.
• Statistical Analysis: Applying statistical methods to understand trends and make inferences.
For example, assessing the impact of a marketing campaign on sales.
Example: After aggregating data in a data lake, a company might use advanced analytics to
predict which products will be popular in the next season based on historical trends and social
media sentiment.
3. Velocity
Definition: Velocity, in the context of data, refers to how quickly data is generated,
processed, and shared. Imagine a stream of data coming from various sources, like social
media, sensors, or financial transactions. This data doesn't just arrive in bulk but flows in
real-time or near-real-time, meaning it's constantly being created and updated.
For example, think about the data generated by a social media platform. Every second, people
are posting updates, likes, comments, and photos. This constant influx of new data creates a
"high velocity" of information. To make sense of this data, it has to be processed and
analyzed quickly. If companies want to use this data to track trends or monitor user behavior,
they need to analyze it almost as soon as it's generated.
Example: working with a stock trading platform. This platform collects data on stock prices
and trading volumes all the time—every second, the data is updated as people buy and sell
stocks.
1. Continuous Data Generation: Every time a trade happens or a new piece of news is
announced about a company, the stock price and trading volume data change. For
instance, if Company X just released a groundbreaking product, its stock price might
start rising quickly as more people buy its shares.
2. Real-Time Analysis: Traders need to keep an eye on this constantly updating data to
make quick decisions. If they see the stock price of Company X rising due to the
news, they might decide to buy more shares before the price goes even higher.
3. Reacting to News: Let’s say there's unexpected news about Company Y—maybe
they’re facing a lawsuit. Traders need real-time data to see how this news impacts the
stock price. If the price starts to drop rapidly, they might decide to sell their shares to
avoid losing money.
Implications: Technologies like Apache Kafka and real-time analytics platforms are used to
handle high-velocity data streams, ensuring timely processing and decision-making.
Apache Kafka
1. What is Apache Kafka? Apache Kafka is a distributed event streaming platform designed
to handle high-throughput, low-latency data feeds. It’s like a messaging system that allows
different applications to send and receive large volumes of data efficiently.
2. Key Concepts:
• Topics: Kafka organizes data into topics, which act like channels where different types of
messages are sent. Each topic can have multiple partitions to help with scaling.
• Producers: These are the applications that send data to Kafka topics.
• Consumers: These are the applications that read data from Kafka topics.
• Brokers: Kafka runs on a cluster of servers called brokers. Each broker is responsible for
storing and managing a portion of the data.
3. How It Works: Producers publish messages to a topic, the brokers in the cluster store and
replicate the topic's partitions, and consumers subscribe to the topic and read the messages at
their own pace. This lets many applications exchange high-velocity data streams without being
directly connected to each other.
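A minimal sketch of this producer/consumer flow using the third-party kafka-python client is
shown below; the broker address localhost:9092 and the topic name stock-trades are
assumptions.

from kafka import KafkaProducer, KafkaConsumer

# Producer: send one trade event to a topic (broker address and topic are assumed)
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('stock-trades', b'{"symbol": "X", "price": 101.5}')
producer.flush()

# Consumer: read events from the same topic as they arrive
consumer = KafkaConsumer('stock-trades', bootstrap_servers='localhost:9092')
for message in consumer:
    print(message.value)   # raw bytes of each trade event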
Real-Time Analytics Platforms
1. What are Real-Time Analytics Platforms? These platforms are designed to analyze data
as it arrives, rather than after it's been stored. They provide insights and decision-making
capabilities on-the-fly.
2. Key Features:
• Low Latency: They process data almost immediately after it’s generated. This is crucial for
applications where timely insights are needed.
• Scalability: They handle large volumes of data and can scale up to accommodate growing
data streams.
• Complex Event Processing: They can identify patterns and anomalies in data streams, which
helps in making quick decisions based on real-time information.
3. How They Work:
• Data Ingestion: Real-time analytics platforms ingest data from various sources, often
including Kafka topics.
• Stream Processing: The platform processes the data in real-time, applying algorithms and
rules to extract insights or trigger actions.
• Visualization: Results are often visualized in dashboards, providing users with live updates
and insights.
• Integration: They integrate with other systems to act on the insights, such as triggering alerts
or adjusting operational parameters.
4. Veracity
Definition: Veracity refers to the quality, accuracy, and reliability of data and data sources. It
involves dealing with data that may be incomplete, inconsistent, or uncertain.
Example: Consider a healthcare system where patient data is collected from various sources,
such as electronic health records, wearable devices, and patient surveys. Some data may be
incomplete or inaccurate, such as incorrect patient details or missing health records.
Implications: Ensuring data quality requires data cleaning, validation, and verification
processes. Tools and techniques for data quality management and governance help maintain
the accuracy and reliability of the data.
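As a small illustration of such checks, the sketch below flags missing and implausible values
in a hypothetical patient dataset; the file name and column names are assumptions.

import pandas as pd

patients = pd.read_csv('patient_records.csv')   # hypothetical file

# Completeness check: count missing values per column
print(patients.isnull().sum())

# Validity check: flag implausible ages
invalid_age = patients[(patients['age'] < 0) | (patients['age'] > 120)]
print('Rows with implausible ages:', len(invalid_age))

# Consistency check: remove exact duplicate records
patients = patients.drop_duplicates()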
5. Value
Definition: Value refers to the usefulness and potential insights that can be derived from
data. It focuses on extracting actionable insights that can drive business decisions and create
value.
Implications: Extracting value from data involves using advanced analytics, machine
learning models, and data visualization tools. The goal is to transform raw data into
actionable insights that drive business strategies and decisions.
Summary
• Volume: The amount of data generated and stored. For example, Facebook's daily uploads.
• Variety: The types of data (structured, semi-structured, unstructured). For example, data from
a retail company's transactions, reviews, and support logs.
• Velocity: The speed at which data is generated and processed. For example, real-time stock
price data in trading.
• Veracity: The quality and reliability of data. For example, ensuring accuracy in healthcare
records.
• Value: The insights and benefits derived from data. For example, personalized marketing
strategies based on customer data.
Understanding these traits helps in selecting the right tools and approaches for managing big
data effectively and deriving meaningful insights.
Web Scraping
Web scraping is the automated extraction of data from websites. It relies on two main
components:
1. Crawler:
o Definition: A crawler, also known as a spider or bot, is an algorithm that navigates
the web by following links from one page to another.
o Function: It collects URLs and gathers data from the pages it visits.
o Example: A crawler might start at a home page and follow links to product pages,
collecting URLs for all the products listed.
2. Scraper:
o Definition: A scraper is a tool or script that extracts specific information from a web
page.
o Function: It parses the HTML content of a page to retrieve data based on the user’s
requirements.
o Example: A scraper might extract product names, prices, and descriptions from a
product listing page.
A typical web scraping workflow involves the following steps:
1. Specify URLs:
o Process: You provide the URLs of the web pages you want to scrape.
o Example: If you want to scrape product information from an online bookstore, you
might provide URLs of product category pages.
2. Load HTML Content:
o Process: The scraper retrieves the HTML content of the provided URLs.
o Example: The scraper downloads the HTML code of a product category page, which
includes product listings in HTML format.
3. Extract Data:
o Process: The scraper parses the HTML to find and extract the required information.
o Example: The scraper identifies and extracts product names, prices, and descriptions
from the HTML.
4. Output Data:
o Process: The extracted data is saved in a structured format like CSV, JSON, or Excel.
o Example: The data is saved in a CSV file with columns for product names, prices,
and descriptions.
Let’s say you want to scrape product details from an online bookstore.
1. Request URL:
o Example URL: https://examplebookstore.com/category/fiction
2. Get HTML:
o Process: The scraper requests the HTML content of the page.
o Example HTML: The HTML might include tags like <div class="product">
containing product details.
3. Extract Data:
o Process: The scraper parses each <div class="product"> block in the HTML and pulls out the
product name, price, and description.
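A minimal sketch of this scraping flow using the requests and BeautifulSoup libraries is shown
below; the URL comes from the example above, while the <h2> and <span class="price"> tags are
assumptions about how the page is laid out.

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: request the URL and get its HTML
url = 'https://examplebookstore.com/category/fiction'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Step 3: extract data from each product block (tag and class names are assumed)
for product in soup.find_all('div', class_='product'):
    name = product.find('h2').get_text(strip=True)
    price = product.find('span', class_='price').get_text(strip=True)
    print(name, price)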
Summary:
This structure helps visualize how the information is laid out on the webpage and how it can
be systematically extracted.
Common use cases of web scraping include:
1. Price Monitoring:
o Example: An e-commerce company monitors competitors’ prices to adjust its own
pricing strategy.
2. Market Research:
o Example: A company analyzes customer reviews from various websites to
understand market trends and customer preferences.
3. News Monitoring:
o Example: A news aggregator collects headlines and articles from multiple news
sources to provide comprehensive news coverage.
4. Sentiment Analysis:
o Example: A brand collects social media mentions to analyze customer sentiment
towards its products.
5. Email Marketing:
o Example: A marketing agency collects email addresses from industry-specific
websites to build a mailing list for promotional campaigns.
Analysis
Key Components:
1. Data Collection:
o Description: Gathering data from various sources relevant to the problem you are
trying to solve. This could involve surveys, sensors, databases, or public datasets.
o Example: If you’re studying the impact of study habits on student grades, you collect
data through a survey asking students about their study hours and their grades.
2. Data Cleaning:
o Description: Preparing the data for analysis by correcting or removing errors,
inconsistencies, or missing values. This ensures that the data is accurate and reliable.
o Example: If some students reported missing grades or study hours as zero, you either
correct these entries if possible or remove them from the dataset to avoid skewing
results.
3. Exploratory Data Analysis (EDA):
o Description: Investigating data sets to summarize their main characteristics and
discover patterns or anomalies. EDA often involves visualizations and basic statistical
analysis.
o Example: You create histograms to see the distribution of study hours and grades or
scatter plots to explore the relationship between study hours and grades.
4. Modeling:
o Description: Applying statistical or machine learning models to analyze data. This
involves selecting appropriate algorithms, training models on the data, and testing
their performance.
o Example: You use a linear regression model to understand how changes in study
hours affect grades, fitting the model to the collected data.
5. Interpretation:
o Description: Understanding the results from the models and analyses, and translating
these findings into meaningful insights or conclusions.
o Example: You interpret the results of your linear regression analysis to determine
that each additional hour of study is associated with a certain increase in grades.
• Statistical Analysis: Includes measures such as mean (average), median (middle value),
mode (most frequent value), and standard deviation (measure of variability).
• Machine Learning Models: Algorithms such as linear regression (predicts a value), decision
trees (classifies data), and clustering (groups similar data points).
• Visualization Tools: Software like Matplotlib or Seaborn in Python, or Excel for creating
charts such as scatter plots, bar charts, and heatmaps.
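As a quick illustration of the statistical measures and visualization tools listed above, here
is a small sketch with made-up study-hours and grade data:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical survey data: weekly study hours and grades
hours = np.array([2, 4, 5, 7, 9, 10])
grades = np.array([55, 62, 66, 74, 83, 88])

print('Mean hours:', np.mean(hours))
print('Median grade:', np.median(grades))
print('Std. dev. of grades:', np.std(grades))

# Scatter plot to explore the relationship
plt.scatter(hours, grades)
plt.xlabel('Study hours per week')
plt.ylabel('Grade')
plt.title('Study hours vs. grades')
plt.show()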
Reporting
Key Components:
1. Summary:
o Description: A brief overview of the main findings from the analysis. It highlights
the most important insights without going into too much detail.
o Example: You summarize that students who study more tend to have higher grades
based on the data analysis.
2. Visualization:
o Description: Creating charts, graphs, and tables to visually represent the data and
findings. Visualizations help in understanding complex information quickly and
clearly.
o Example: You create a bar chart showing the average grades for different ranges of
study hours, making it easy to see the trend.
3. Narrative:
o Description: Writing a clear and engaging explanation of the findings, including
context, methodology, and implications. This helps readers understand the
significance of the results.
• Reports and Dashboards: Tools like Microsoft Excel, Google Sheets, or business
intelligence software like Tableau for creating interactive dashboards.
• Presentations: Software like PowerPoint or Google Slides for presenting findings to an
audience.
• Documentation: Writing detailed reports or summaries that can be shared with stakeholders
or published.
1. Analysis:
• Data Collection: Collect survey data from 100 students about their weekly study hours and
their grades.
• Data Cleaning: Check for missing or inconsistent data. For example, if some entries have
study hours listed as negative values, correct or remove these entries.
• Exploratory Data Analysis: Create a scatter plot of study hours versus grades. Calculate
summary statistics like the mean and standard deviation of study hours and grades.
• Modeling: Apply a linear regression model to predict grades based on study hours. The
model might show that an increase in study hours is associated with an increase in grades.
• Interpretation: The regression results might indicate that for every additional hour spent
studying, the grade improves by 2 points on average.
2. Reporting:
• Summary: “Our analysis of 100 students shows that those who study more hours tend to
achieve higher grades.”
• Visualization: Present a bar chart showing average grades for different ranges of study hours.
Include a scatter plot with a regression line to illustrate the relationship.
• Narrative: “The data suggests a positive correlation between study hours and grades.
Students who study more tend to perform better academically.”
• Recommendations: “Students should aim to increase their study time to improve their
academic performance. Educators might also consider encouraging study habits and providing
resources to support extended study periods.”
1. Data Manipulation
Data manipulation in data science involves several key tasks to ensure that the data is in the
best possible shape for analysis. Just as you might organize your school notes into folders or
binders before studying, data manipulation organizes raw data into a clean, usable form. The
key tasks are:
1. Data Cleaning:
o Handling Missing Values: Missing data can be problematic. You might need
to fill in missing values, drop rows or columns with missing data, or use
algorithms that handle missing values gracefully.
o Removing Duplicates: Sometimes data can have duplicate entries. Identifying
and removing these ensures that your analysis is accurate and not skewed by
repetitive data.
o Correcting Errors: Data might contain errors like incorrect data types (e.g.,
numbers stored as text) or outliers. Correcting these ensures data integrity.
2. Data Transformation:
o Normalization and Scaling: Data often needs to be scaled or normalized to
ensure that different features contribute equally to analysis or modeling. For
example, normalizing test scores so that they fit within a 0 to 1 range.
o Data Aggregation: Summarizing data, such as computing the average score
for each subject, to provide meaningful insights.
o Reshaping Data: Changing the structure of data (e.g., pivoting from long to
wide format) to suit the requirements of specific analyses or visualizations.
3. Data Integration:
o Merging Data: Combining datasets from different sources. For example,
merging student grades with attendance records to get a comprehensive view
of performance.
o Concatenation: Stacking datasets together, which is useful when you have
data split across multiple files or tables.
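A minimal sketch of merging and concatenation with Pandas; the small DataFrames below are made
up for illustration.

import pandas as pd

grades = pd.DataFrame({'student_id': [1, 2, 3], 'grade': [85, 78, 92]})
attendance = pd.DataFrame({'student_id': [1, 2, 3], 'attendance_pct': [95, 80, 99]})

# Merging: combine the two datasets on a common key
combined = pd.merge(grades, attendance, on='student_id')

# Concatenation: stack another batch of the same kind of data on top
more_grades = pd.DataFrame({'student_id': [4, 5], 'grade': [70, 88]})
all_grades = pd.concat([grades, more_grades], ignore_index=True)

print(combined)
print(all_grades)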
Tools Used
1. Python:
o What It Is: Python is a versatile programming language that is widely used in
data science. It is known for its readability and simplicity, making it a popular
choice for both beginners and experienced programmers.
o Why It’s Useful: Python has a vast ecosystem of libraries that facilitate data
manipulation, analysis, and visualization. Its syntax is designed to be intuitive,
allowing you to write efficient code with fewer lines.
2. Pandas:
o What It Is: Pandas is a powerful Python library specifically designed for data
manipulation and analysis. It provides two primary data structures:
▪ Series: A one-dimensional array-like object that can hold any data
type.
▪ DataFrame: A two-dimensional, tabular data structure similar to a
spreadsheet or SQL table. It is particularly useful for handling large
datasets.
o Key Functions:
▪ read_csv() and read_excel(): Functions for loading data from CSV or
Excel files into DataFrames.
▪ dropna(): Method for removing missing values from your dataset.
▪ fillna(): Method for filling in missing values with a specified value or
method.
▪ merge() and concat(): Methods for combining multiple DataFrames.
▪ groupby(): Method for grouping data and performing aggregate
operations.
▪ pivot_table(): Method for creating pivot tables that summarize data.
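As a rough illustration, the sketch below strings several of these functions together; the file
students.csv and its class, gender, grade, and attendance columns are hypothetical.

import pandas as pd

df = pd.read_csv('students.csv')                  # load data (hypothetical file)
df = df.dropna(subset=['grade'])                  # drop rows with a missing grade
df['attendance'] = df['attendance'].fillna(df['attendance'].mean())  # fill gaps

# Group and summarize: average grade per class
print(df.groupby('class')['grade'].mean())

# Pivot table: average grade per class and gender
print(pd.pivot_table(df, values='grade', index='class', columns='gender', aggfunc='mean'))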
What is an IDE?
An IDE (Integrated Development Environment) is a program that combines a code editor, tools
for running code, and debugging features in one place, which makes writing and testing Python
much easier.
Here are some popular IDEs for Python that are user-friendly for beginners:
1. Download PyCharm:
o Go to the PyCharm website.
o Download the Community version (free) for your operating system (Windows,
macOS, or Linux).
2. Install PyCharm:
o Open the downloaded installer file and follow the on-screen instructions to install
PyCharm.
3. Open PyCharm:
o Launch PyCharm from your applications list.
4. Create a New Project:
o Click on "New Project."
o Choose a location for your project and click "Create."
5. Create a New Python File:
o Right-click on your project folder in the "Project" pane.
o Select "New" > "Python File."
o Name your file (e.g., my_script.py).
6. Write Your Code:
o Type your Python code into the new file. For example:
print("Hello, World!")
1. Download VS Code:
o Go to the Visual Studio Code website.
o Download the installer for your operating system.
2. Install and Open VS Code:
o Run the installer, launch VS Code, and install the Python extension from the Extensions view.
3. Create a New Python File and Write Your Code:
o Create a file such as my_script.py and type your code. For example:
print("Hello, World!")
Let’s say you have a CSV file with student grades and attendance records, and you want to
clean and analyze this data.
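A minimal cleaning pass along those lines might look like this; the file name student_data.csv
and the Grade and Attendance column names are assumptions.

import pandas as pd

data = pd.read_csv('student_data.csv')              # load the raw file

data = data.drop_duplicates()                       # remove duplicate rows
data['Grade'] = pd.to_numeric(data['Grade'], errors='coerce')  # fix numbers stored as text
data = data.dropna(subset=['Grade'])                # drop rows with no usable grade
data['Attendance'] = data['Attendance'].fillna(0)   # treat missing attendance as 0

data.to_csv('student_data_clean.csv', index=False)  # save the cleaned dataset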
This process transforms raw data into a structured, clean format, making it ready for analysis
or modeling.
2. Data Analysis
What It Means
Data analysis is about making sense of data to draw meaningful conclusions. After cleaning
and preparing your data, you use analysis to uncover insights, identify trends, and make data-
driven decisions. It's similar to analyzing your test results to understand which subjects you
excel in and which need improvement.
Tools Used
1. Python
Python is a versatile programming language used for a variety of tasks in data analysis. Its
extensive libraries and simple syntax make it ideal for performing complex calculations and
analyses.
• Why Python?
o Readable Syntax: Easy to write and understand code.
o Versatile Libraries: Rich ecosystem for data manipulation, analysis, and
visualization.
2. NumPy
NumPy is a Python library for fast numerical computation on arrays; Pandas is built on top of
it for working with tabular data.
Let's consider a dataset of student test scores to illustrate how Python, Pandas, and NumPy can
be used for data analysis:
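The code that the numbered steps below walk through might look like the following minimal
sketch; the file name student_scores.csv and the Name, Score, and Attendance columns are
assumptions about your data.

import pandas as pd
import matplotlib.pyplot as plt

# Load data (replace the path with the actual path to your CSV file)
data = pd.read_csv('student_scores.csv')

# Descriptive statistics
print(data.describe())

# Correlation analysis (numeric_only keeps just the Score and Attendance columns)
print(data.corr(numeric_only=True))

# Visualization: line graph of scores per student
plt.plot(data['Name'], data['Score'])
plt.xlabel('Student')
plt.ylabel('Score')
plt.title('Student Scores')
plt.show()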
1. Import Libraries:
o pandas is used to manipulate the CSV file data.
o matplotlib.pyplot is used to visualize the student scores.
2. Load Data:
o The code reads the CSV file using pd.read_csv(). Replace the file path with the actual
path to your CSV file.
3. Descriptive Statistics:
o data.describe() prints basic statistics about the scores and attendance.
4. Correlation Analysis:
o data.corr() shows how the variables (Score and Attendance) are related.
5. Visualization:
o A line graph is plotted using plt.plot() to visualize student names and scores.
3. Data Visualization
Data Visualization is the process of representing data visually through charts and graphs.
This helps make complex data more understandable and accessible. Just as you might draw a
graph to track your progress in various subjects, data scientists use visualizations to uncover
patterns, trends, and insights from data.
Tools Used
Matplotlib
• What It Is: Matplotlib is a powerful and flexible Python library used for creating
static, animated, and interactive visualizations. It’s like having a versatile set of tools
to draw any kind of graph you need.
• Key Features:
o Basic Plot Types: You can create line plots, bar charts, histograms, scatter plots, pie
charts, and more.
o Customization: You can adjust colors, labels, legends, and titles to make your graphs
clear and visually appealing.
o Integration: Works well with other libraries like NumPy and Pandas to visualize
data directly from data structures.
• Example Use Case:
import matplotlib.pyplot as plt
# Sample data (illustrative values)
subjects = ['Math', 'Science', 'English']
scores = [85, 90, 78]
plt.bar(subjects, scores)
plt.xlabel('Subjects')
plt.ylabel('Scores')
plt.title('Subject Scores')
plt.show()
• In this example, plt.bar() creates a bar chart showing scores in different subjects. The
plt.xlabel(), plt.ylabel(), and plt.title() functions add labels and a title to the chart.
Seaborn
• What It Is: Seaborn is built on top of Matplotlib and provides a high-level interface
for drawing attractive and informative statistical graphics. It’s like an upgraded
version of Matplotlib that makes it easier to create complex visualizations with less
code.
• Key Features:
o Predefined Themes: Comes with several built-in themes that make your plots look
good without needing much customization.
o Statistical Plots: Includes functions for creating complex visualizations like
heatmaps, violin plots, and pair plots, which are useful for statistical analysis.
o Ease of Use: Simplifies the process of creating plots with a more intuitive API
compared to Matplotlib.
• Install Seaborn in a Jupyter notebook with !pip install seaborn, then:

import seaborn as sns
import matplotlib.pyplot as plt

# Sample data (illustrative values)
data = {'Subjects': ['Math', 'Science', 'English'], 'Scores': [85, 90, 78]}

sns.barplot(x=data['Subjects'], y=data['Scores'], palette='viridis')
plt.xlabel('Subjects')
plt.ylabel('Scores')
plt.title('Subject Scores')
plt.show()
In this example, sns.barplot() creates a bar plot with a color palette applied. The
palette='viridis' argument changes the colors of the bars.
4. Building Models
Building a model means training an algorithm on existing data so that it can make predictions
or decisions about new, unseen data.
For example, if you have historical data on how many hours students studied and their test
scores, you can build a model to predict a student's future test score based on their study
hours. The model learns from the past data and tries to generalize this knowledge to make
predictions about new, unseen data.
Tools Used
• Scikit-learn:
o Overview: Scikit-learn is a widely-used Python library that provides a range of tools
for building machine learning models.
o Features:
▪ Algorithms: Includes many algorithms for classification (e.g., logistic
regression, decision trees), regression (e.g., linear regression), clustering
(e.g., k-means), and more.
▪ Preprocessing: Tools for preparing data, such as normalization and encoding
categorical variables.
▪ Model Evaluation: Functions to assess model performance, including
metrics like accuracy and confusion matrices.
▪ Model Selection: Utilities for splitting data, cross-validation, and
hyperparameter tuning.
Let’s say you want to build a model to predict test scores based on study hours. Here’s a
simplified workflow using Scikit-learn:
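A simplified sketch of that workflow is shown below; the study-hours and score values are made
up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: hours studied (feature) and test scores (target)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([52, 55, 61, 64, 70, 74, 79, 83])

# Split into training and test sets, then fit a linear regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate and predict
print('R^2 on test data:', model.score(X_test, y_test))
print('Predicted score for 9 hours of study:', model.predict([[9]])[0])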
To run Python code and use libraries like Matplotlib, NumPy, Scikit-learn, and NLTK, you
need Python installed on your computer.
1. Matplotlib
What It Is:
• Matplotlib is a tool used to create graphs and charts. It helps you visualize data so you can
understand patterns and trends.
• Before you can use Matplotlib, you need to install it. This is usually done with a command
like pip install matplotlib (but don't worry about this right now; it's often set up for you in
courses).
Explanation:
• Importing: You first import the matplotlib.pyplot module, which contains functions to create
plots.
• Data: Define what data you want to plot. In this case, x is the list of values for the x-axis, and
y is for the y-axis.
• Plotting: Use plt.plot(x, y) to create a line plot.
• Customizing: Add a title and axis labels to make the plot clear.
• Displaying: plt.show() displays the plot on your screen.
2. NumPy
What It Is:
• NumPy is a tool used for numerical calculations. It helps you perform mathematical
operations on large amounts of data quickly.
import numpy as np

# Sample data (illustrative values)
data = np.array([85, 90, 78, 92, 88])

# Perform operations
mean = np.mean(data)    # Calculate the mean
total = np.sum(data)    # Calculate the sum

print('Mean:', mean)
print('Sum:', total)

Output:
Mean: 86.6
Sum: 433
Explanation: np.array() turns a Python list into a NumPy array, and np.mean() and np.sum()
compute the average and total across the whole array at once, which is much faster than looping
over the values in plain Python.
3. Scikit-learn
What It Is:
• Scikit-learn is a toolkit for machine learning. It helps you build models that can make
predictions or classify data.
# 'model' is assumed to be a LinearRegression already fitted on study-hours
# vs. test-score data (see the workflow sketch earlier in these notes)
# Make a prediction
prediction = model.predict([[6]])
print('Predicted test score for 6 hours of study:', prediction[0])
Explanation:
• model.predict([[6]]) asks the trained model to estimate the test score for 6 hours of study.
The value is wrapped in a nested list because Scikit-learn expects a 2D array of samples.
4. NLTK
What It Is:
• NLTK is used for processing and analyzing text data. It helps with tasks like tokenizing
(breaking text into words) and finding patterns in language.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK resources (only needed the first time)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
text = ("NLTK is a great library for performing text processing tasks, "
        "such as tokenization, stemming, and lemmatization.")

# Step 1: Tokenization
words = word_tokenize(text)

# Step 2: Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

# Step 3: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

# Step 4: Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Results
print("Original Words:", words)
print("Filtered Words (no stopwords):", filtered_words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)
1. Remove Stopwords
Stopwords are common words in a language that are often filtered out before processing text.
Examples include “the,” “is,” “in,” “and,” etc. These words are usually removed because they
don’t carry significant meaning and can clutter the analysis.
2. Stemming
Stemming is the process of reducing words to their base or root form. For example,
“running,” “runner,” and “ran” might all be reduced to “run.” The goal is to treat different
forms of a word as the same term. However, stemming can sometimes produce non-words
(e.g., "studies" becomes "studi").
3. Lemmatization
Lemmatization is similar to stemming but more sophisticated. It reduces words to their base
or dictionary form, known as a lemma. For example, “running” would be reduced to “run,”
and “better” would be reduced to “good.” Lemmatization ensures that the resulting words are
valid and meaningful.
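A tiny sketch contrasting the two; this assumes the NLTK wordnet resource from the earlier
example has been downloaded, and note that the lemmatizer needs a part-of-speech hint to map
"better" to "good".

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))                    # 'studi'  (a non-word stem)
print(stemmer.stem('running'))                    # 'run'
print(lemmatizer.lemmatize('running', pos='v'))   # 'run'
print(lemmatizer.lemmatize('better', pos='a'))    # 'good'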
Summary