DS - UNIT - IV - QB & Ans
PART – A
1. What is Matplotlib?
Matplotlib is a popular plotting library for Python, widely used for creating static, animated,
and interactive visualizations in a variety of formats. It provides a flexible and powerful way
to create a wide range of plots, including line graphs, bar charts, histograms, scatter plots, and
more.
1. Prepare Your Data: You need your main data points along with the values that
represent the error (e.g., standard deviation, standard error, or confidence intervals).
2. Choose a Plot Type: Decide on the type of plot that best represents your data (e.g.,
line plot, bar plot, scatter plot).
3. Use Matplotlib: Create the plot with plt.errorbar(), passing the yerr parameter for
vertical error bars or xerr for horizontal error bars.
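The three steps above can be sketched with plt.errorbar(); the data points and error values below are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Step 1: data points plus one error value per point (hypothetical measurements)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.0, 3.1, 4.2, 4.8, 6.1])
y_err = np.array([0.3, 0.2, 0.4, 0.3, 0.5])  # e.g., standard deviation

# Steps 2-3: a line plot with vertical error bars via yerr
plt.errorbar(x, y, yerr=y_err, fmt='o-', capsize=4, label='measurement')
plt.legend()
plt.show()
```

Swapping yerr for xerr draws horizontal error bars instead.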
A density plot is a type of data visualization that displays the distribution of a continuous
variable. It is a smoothed version of a histogram and provides a visual representation of
the probability density function of the variable. Instead of showing counts or frequencies,
a density plot shows the relative likelihood of different values occurring.
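One way to sketch a density plot is with SciPy's gaussian_kde, which smooths a sample into an estimated probability density function (the sample here is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Draw a synthetic sample from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Estimate the density and evaluate it over a grid of values
kde = gaussian_kde(sample)
xs = np.linspace(-4, 4, 200)
density = kde(xs)

plt.plot(xs, density)
plt.fill_between(xs, density, alpha=0.3)
plt.show()
```

Seaborn's kdeplot() produces the same kind of curve in a single call.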
Legends in data visualization are essential components that provide context and clarity to
a chart or graph. They serve as a guide to help viewers understand the meaning of
various elements within the visualization, such as colors, shapes, lines, or patterns that
represent different data series or categories.
1. Enhances Understanding
2. Improves Readability
3. Conveys Meaning
4. Organizes Information
5. Visual Appeal
1. Figure
A figure is the overall window or container that holds one or more plots (axes). It is
essentially the entire canvas on which everything is drawn. Each figure can contain multiple
axes, and you can customize the figure's size, background color, and other properties.
2. Axes
Axes (note the plural) refer to the individual plots within a figure. Each axes can contain
various elements such as lines, markers, text, and more. When you create a plot, it is drawn
on an axes.
import numpy as np
import matplotlib.pyplot as plt
# Plot a sine curve as sample data
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
# Show plot
plt.show()
Seaborn is a powerful and user-friendly Python data visualization library built on top of
Matplotlib. It provides a high-level interface for drawing attractive and informative statistical
graphics. Seaborn simplifies the process of creating complex visualizations and makes it
easier to generate beautiful and informative plots with less code compared to Matplotlib
alone.
17. What is the difference between Matplotlib and Seaborn?
Matplotlib is a low-level, general-purpose plotting library that gives fine-grained control over every element of a figure. Seaborn is a high-level statistical visualization library built on top of Matplotlib: it offers attractive default styles, built-in statistical plot types, and direct support for pandas DataFrames, so it typically needs less code for common statistical graphics.
import matplotlib.pyplot as plt
# Create a figure
fig = plt.figure()
# Create an axes at a specific position [left, bottom, width, height]
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # values are fractions of the figure size, in [0, 1]
# Plot some data
ax.plot([1, 2, 3], [1, 4, 9])
plt.show()
A line plot is a type of data visualization that displays information as a series of data points
called "markers" connected by straight line segments. It's commonly used to represent trends
over time or to compare different sets of data.
PART – B
Line Plot
Definition: A line plot displays data points connected by straight lines. It's often used
to show trends over time or continuous data.
Structure: The x-axis typically represents the independent variable (like time), and
the y-axis represents the dependent variable (like temperature, sales, etc.).
Usage: Useful for illustrating changes and trends, such as stock prices over months or
temperature changes over a week.
Example: If you plot monthly sales for a year, the points represent sales for each
month, and the lines connect these points to show how sales change over the year.
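The monthly-sales example can be sketched as follows (the sales figures are invented):

```python
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sales = [120, 135, 150, 145, 160, 175, 170, 180, 165, 190, 210, 230]

# Markers show each month's value; the line segments show the trend
plt.plot(months, sales, marker='o')
plt.xlabel('Month')
plt.ylabel('Sales (units)')
plt.title('Monthly Sales Over a Year')
plt.show()
```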
Scatter Plot
Definition: A scatter plot shows individual data points plotted on two axes to
represent the relationship between two variables.
Structure: Each point represents a pair of values (x, y). The x-axis is one variable,
and the y-axis is the other variable.
Usage: Great for identifying correlations, trends, or patterns between the two
variables. For instance, you might plot hours studied (x) against exam scores (y) to
see if more studying correlates with higher scores.
Example: If you have data on people's heights and weights, each point on the scatter
plot represents a person's height and weight, helping visualize any relationship
between the two.
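The height-versus-weight example can be sketched with synthetic data (the linear relationship below is illustrative, not a real model):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
heights_cm = rng.normal(170, 10, 50)
# Weight loosely increases with height, plus random noise
weights_kg = 0.9 * heights_cm - 85 + rng.normal(0, 6, 50)

plt.scatter(heights_cm, weights_kg)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
```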
Density Plots
Contour Plots
Histogram
Structure
Axes: The x-axis represents the range of values (data bins), while the y-axis
represents the frequency (number of occurrences) of data points in each bin.
Bars: Each bin is represented by a bar, with the height indicating the frequency of
data points that fall within that range.
Usage
Histograms are used to examine the shape, spread, and central tendency of a distribution and to spot skewness or outliers.
Example
If you have test scores from a class, you can create a histogram with bins such as 0-10, 11-20,
etc. Each bar shows how many students scored within each range. This allows you to see the
overall performance distribution at a glance.
Key Points
Bin Width: The choice of bin width can significantly affect the histogram's
appearance and interpretation. Too wide may obscure details; too narrow may create
noise.
Continuous vs. Discrete Data: Histograms are typically used for continuous data, but
they can also represent discrete data effectively.
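The test-score example above can be sketched with plt.hist(), using synthetic scores and bin edges every 10 points:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
scores = np.clip(rng.normal(65, 15, 100), 0, 100)  # 100 synthetic test scores

bins = range(0, 101, 10)  # bins 0-10, 10-20, ..., 90-100
counts, edges, patches = plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('Score range')
plt.ylabel('Number of students')
plt.show()
```

Changing the step in `bins` is an easy way to see how bin width affects the picture.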
Purpose
Clarification: They clarify the meaning of colors, shapes, or lines used in a chart,
allowing viewers to interpret the data accurately.
Organization: Legends help organize complex information, making it easier to
compare different datasets or categories within a single visualization.
Accessibility: They enhance the accessibility of the chart, especially for viewers who
may not be familiar with the data or its context.
Components of a Legend
1. Labels: Each item in the legend has a corresponding label that describes what it
represents (e.g., "Sales," "Profit," "Temperature").
2. Symbols/Colors: The legend shows the specific colors, patterns, or symbols
associated with each label. For example, a line graph may use different colored lines
to represent various categories, and the legend will match these colors with their
respective categories.
3. Formatting: Legends can vary in formatting, including font size, style, and
background color. Proper formatting ensures readability and accessibility.
Placement
Location: Legends can be placed in various locations relative to the chart: above,
below, to the left, or to the right. The best placement often depends on the type of
visualization and the amount of space available.
Interactive Legends: In some interactive visualizations (like those in web
applications), legends may allow users to toggle the visibility of specific data series
by clicking on the legend items.
Examples
1. Bar Chart: In a bar chart comparing sales figures across different regions, the legend
might differentiate between regions using different colors for each bar (e.g., blue for
North, red for South).
2. Scatter Plot: In a scatter plot showing the relationship between two variables, the
legend might indicate different categories of data points (e.g., circles for one category,
squares for another).
3. Line Graph: In a line graph depicting temperature changes over a year, the legend
could identify different lines representing various cities, each in distinct colors.
Best Practices
Keep It Simple: Use concise labels and avoid cluttering the legend with unnecessary
information.
Match Colors: Ensure that colors in the legend accurately match those in the
visualization for easy identification.
Readable Fonts: Use legible fonts and sizes to ensure that the legend is easy to read.
Consistent Positioning: Place legends in a consistent location across similar charts to
help viewers easily locate them.
Use White Space: Incorporate adequate white space around the legend to enhance
clarity and readability.
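A minimal sketch combining several of these practices (matched colors, concise labels, an explicit legend location); the temperature series are synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(12)
city_a = 10 + 8 * np.sin(x / 2)   # synthetic monthly temperatures
city_b = 15 + 5 * np.sin(x / 2)

plt.plot(x, city_a, color='tab:blue', label='City A')
plt.plot(x, city_b, color='tab:red', label='City B')
plt.legend(loc='upper right', title='City')
plt.show()
```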
Matplotlib is a powerful plotting library in Python that allows for extensive customization of
visualizations. Customizing plots enhances their clarity and effectiveness. Here are key
aspects of customization in Matplotlib:
Creating Figures: Use plt.figure() to create a new figure. You can set the figure size
using figsize=(width, height).
Adding Subplots: Use plt.subplot() to add multiple plots within a single figure,
adjusting layout with plt.subplots_adjust().
Tick Customization: Use plt.xticks() and plt.yticks() to set custom tick locations and
labels.
Rotating Ticks: Rotate tick labels for better readability using the rotation parameter
in plt.xticks() or plt.yticks().
Line Styles: Customize line styles with parameters like linestyle, linewidth, and color
in plotting functions (e.g., plt.plot()).
Markers: Add markers to lines using the marker parameter (e.g., marker='o' for
circles).
Color Customization: Set colors directly in plotting functions using named colors,
hex codes, or RGB values.
Colormaps: Use colormaps for heatmaps or scatter plots to represent data intensity
(e.g., plt.scatter(x, y, c=data, cmap='viridis')).
6. Legends
Adding Legends: Use plt.legend() to add a legend that identifies different data series.
You can customize its location and appearance.
Legend Title: Use the title parameter in plt.legend() to add a title to the legend.
Grid Lines: Add grid lines with plt.grid(), customizing their appearance with
parameters like color, linestyle, and alpha (transparency).
Background Color: Set the figure or axes background color using
plt.gcf().set_facecolor() or ax.set_facecolor().
8. Annotations
Adding Annotations: Use plt.annotate() to attach text (optionally with an arrow) to a
specific data point, or plt.text() to place text at given coordinates.
9. Saving Figures
Exporting: Use plt.savefig('figure.png', dpi=300, bbox_inches='tight') to save a figure
to formats such as PNG, PDF, or SVG.
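A compact sketch touching several customization features from this section, including an annotation and saving the figure (the file name is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(figsize=(6, 4))

# Line style, custom ticks, grid, and a titled legend
ax.plot(x, np.sin(x), linestyle='--', linewidth=2, color='green', label='sin(x)')
ax.set_xticks(range(0, 11, 2))
ax.grid(True, linestyle=':', alpha=0.5)
ax.legend(title='Series')

# Annotate the first peak with an arrow
ax.annotate('peak', xy=(np.pi / 2, 1.0), xytext=(3, 1.2),
            arrowprops=dict(arrowstyle='->'))

# Export the figure to a file
fig.savefig('customized_plot.png', dpi=150, bbox_inches='tight')
```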
Matplotlib provides functionality for creating three-dimensional (3D) plots, which are
particularly useful for visualizing complex data that has three variables. The
mpl_toolkits.mplot3d module extends Matplotlib's capabilities to enable 3D plotting.
1. Setting Up a 3D Plot
To create a 3D plot, you first need to import the necessary modules and create a 3D axis.
Here's how to get started:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
# Create a figure with a 3D axes
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
2. Basic 3D Plots
3D Scatter Plots: Display individual data points in three-dimensional space.
ax.scatter(x, y, z, c='r', marker='o') # x, y, z are your data points
3D Line Plots: Show the relationship between three variables connected by lines.
ax.plot(x, y, z, label='3D Line', color='b')
3D Surface Plots: Visualize a surface defined over a grid of x and y values.
X, Y = np.meshgrid(x_range, y_range)
Z = f(X, Y) # Define your function
ax.plot_surface(X, Y, Z, cmap='viridis')
3D Wireframe Plots: Similar to surface plots, but show only the grid lines, which can
be useful for emphasizing the structure.
ax.plot_wireframe(X, Y, Z, color='black')
3. Customizing 3D Plots
Axis Labels: Label each axis with set_xlabel(), set_ylabel(), and set_zlabel().
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
Title: Add a title with set_title().
ax.set_title('3D Plot Example')
View Angle: Change the view angle using the view_init() method.
ax.view_init(elev=20, azim=30) # Elevation and azimuthal angle
4. Example Code
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# Create a 3D axes and some random sample data
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
x, y, z = np.random.rand(3, 50)
ax.scatter(x, y, z, c='r', marker='o')
# Set labels
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
# Set title
ax.set_title('3D Scatter Plot Example')
plt.show()
5. Limitations
While 3D plots can be visually appealing, they can also become cluttered and difficult to
interpret, especially with large datasets. Additionally, perspective can distort the
representation of data, making it harder to extract insights compared to 2D plots.
Key Features
1. Statistical Plots: Seaborn includes several built-in functions for creating a variety of
statistical plots, such as:
o Scatter Plots: sns.scatterplot() visualizes relationships between two variables.
o Line Plots: sns.lineplot() can display trends over time or ordered categories.
o Bar Plots: sns.barplot() summarizes data using bars to represent means and
confidence intervals.
o Box Plots: sns.boxplot() visualizes distributions through their quartiles and
highlights outliers.
2. Built-in Datasets: Seaborn comes with several built-in datasets (like Titanic and Iris),
which are useful for practice and demonstration.
3. Styling: Seaborn provides beautiful default styles and color palettes, enhancing the
aesthetics of plots without much effort. You can set the style using:
sns.set_style("whitegrid")
4. Color Palettes: Seaborn offers a variety of color palettes (like deep, muted, pastel,
etc.) that can be applied to visualizations for better visual appeal:
sns.set_palette("pastel")
5. Facet Grids: Seaborn’s FacetGrid allows for creating multi-plot grids based on the
values of one or more categorical variables, making it easy to compare distributions
across subsets of data:
g = sns.FacetGrid(data, col="column_name")
g.map(sns.histplot, "variable")
6. Heatmaps: The sns.heatmap() function is excellent for visualizing data matrices and
correlation matrices, providing an intuitive way to see patterns and relationships.
Example Usage
Here's a brief example demonstrating how to use Seaborn for visualizing data:
import seaborn as sns
import matplotlib.pyplot as plt
# Load a built-in dataset and draw a simple scatter plot
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()
Advantages of Seaborn
Concise syntax: Complex statistical plots often require only a single function call.
Attractive defaults: Built-in themes and color palettes produce polished plots out of the box.
Pandas integration: Plotting functions accept DataFrames directly, using column names as arguments.
Statistical awareness: Many plots compute aggregates and confidence intervals automatically.
PART C
Basemap is a Matplotlib toolkit that allows for plotting 2D data on maps. It provides a
flexible interface for creating static geographic plots and visualizing spatial data in various
projections. Although Basemap is somewhat older and has been largely replaced by newer
libraries like Cartopy, it is still widely used in many applications.
Installation
To use Basemap, you may need to install it separately since it’s not included with Matplotlib
by default. You can install it using pip:
pip install basemap basemap-data-hires
Here’s a simple example demonstrating how to create a basic geographic map using
Basemap:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Create a map (Mercator projection over the continental US) and draw coastlines
m = Basemap(projection='merc', llcrnrlat=20, urcrnrlat=55,
            llcrnrlon=-130, urcrnrlon=-60, resolution='l')
m.drawcoastlines()
You can plot geographic data points on the map. For instance, if you have latitude and
longitude data:
python
Copy code
# Sample latitude and longitude data
latitudes = [34.05, 36.16, 40.71] # Los Angeles, Las Vegas, New York
longitudes = [-118.24, -115.15, -74.00]
# Convert latitude and longitude to map projection coordinates
x, y = m(longitudes, latitudes)
m.scatter(x, y, marker='o', color='red')  # draw the points on the map
plt.show()
Advanced Features
1. Shapefiles: You can load and display shapefiles to represent geographic boundaries
or features:
m.readshapefile('path_to_shapefile', 'name', color='blue')
2. Data Visualization: Basemap can be combined with other libraries (like NumPy or
Pandas) to visualize data. For example, you might plot temperature data over
geographic locations using color coding.
3. Animation: Although less common, you can create animated maps using Matplotlib's
animation capabilities in conjunction with Basemap.
An enormous amount of textual data is generated over the internet every day. According to a
Statista study, nearly 9 billion SMS messages were sent in Portugal alone in 2023. Another
study suggests that in the first four months of 2024, about 10 billion emails were sent daily in
the US.
Textual data is important for businesses as it helps them analyze and make better decisions.
For example, capturing company names and line item data from invoices or understanding
the customer's emotion behind a product or service offering can help you process documents
faster and analyze customer feedback appropriately.
The large amount of textual data generated over the Internet is primarily unstructured. A
Seagate report projects that by 2025 about 163 zettabytes of data on the Internet will be
unstructured, which amounts to nearly 80% of all data on the Internet.
Text annotation helps label and classify unstructured data generated across public Internet
domains. By tagging and classifying textual data, text annotation can help businesses
automate their services in various ways. One example is a bank's application of a smart
chatbot that can understand customers’ text queries and provide appropriate automated
responses.
What is text annotation?
Text annotation involves adding footnotes and comments, highlighting passages, and
classifying segments within a larger body of text. It helps summarize texts and highlight
important points, making it easier for readers to digest complex information.
The meaning of text annotation differs slightly in artificial intelligence and machine
learning, where it refers to the process of labeling large bodies of text to create training
data for models. The core reasons to annotate textual information are to highlight and
capture grammar structure, parts of speech, keywords, emotions, sentiment, and so on.
Natural language processing (NLP) combines interpreting textual data with pre-processing
methods. NLP helps contextually understand and interpret textual information accordingly,
making it readable for machines.
Types of Text Annotation
Different text annotation types are designed for different use cases. These methods differ in
how the extracted data has to be labeled and interpreted.
a. Named Entity Recognition (NER)
Named Entity Recognition (NER) is a text annotation method that plays a vital role in various
natural language processing applications. It involves identifying and labeling named entities
such as places, people, dates, and company names.
By classifying and labeling these named entities accurately, an NER-enabled machine can
extract crucial information from documents and better understand the extracted text. The
Part-of-Speech (POS) tagging annotation method can also support NER by interpreting
named entities in the context of a sentence or phrase.
b. Part of Speech (POS) Tagging
Part of Speech (POS) tagging is a text annotation method that grammatically labels words in
a text or phrase, categorizing each word as a noun, verb, adjective, adverb, etc. Through POS
tagging, machines can better understand a phrase or sentence's grammar structure and
meaning.
This resolves the issue of surface-level data extraction wherein data is captured not at face
value but by understanding the deeper context of grammar structure.
c. Sentiment Analysis
Sentiment analysis is a text annotation method that determines the emotional tone of the text.
Text is labeled as positive, negative, neutral, and so on. Businesses use sentiment analysis to
gauge people's attitudes toward their product or service.
Sentiment analysis is important in brand monitoring and reputation management. It helps you
understand public opinion, social media trends, and feedback on offerings.
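A toy lexicon-based scorer illustrates the idea behind sentiment labels; production systems use trained models, and the word lists here are invented:

```python
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text: str) -> str:
    """Label text positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this excellent product"))  # positive
print(sentiment("terrible support and bad response times"))  # negative
```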
d. Intent Recognition
The intent recognition text annotation method determines the intent behind a text—whether it
is a command, request, complaint, suggestion, or feedback.
Intent recognition takes a given query as input and associates the text data and expression
with a given intent. For example, in an automated phone menu, the model learns from speech
data based on key terms to determine what the customer is looking for, such as “pay my
bills” or “speak to a representative.”
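The key-term matching described above can be sketched as a toy intent matcher (the intents and key terms are invented):

```python
# Map each intent to the key terms that signal it (illustrative only)
INTENTS = {
    "pay_bill": {"pay", "bill", "bills", "payment"},
    "talk_to_agent": {"representative", "agent", "human"},
}

def recognize_intent(utterance: str) -> str:
    """Return the intent whose key terms overlap most with the utterance."""
    words = set(utterance.lower().split())
    best, best_overlap = "unknown", 0
    for intent, terms in INTENTS.items():
        overlap = len(words & terms)
        if overlap > best_overlap:
            best, best_overlap = intent, overlap
    return best

print(recognize_intent("I want to pay my bills"))  # pay_bill
```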
e. Relation Extraction
Relation extraction is a method of text annotation that determines the relationship between
two named entities. It helps to understand the data of the named entity contextually and
determines how the two named entities are related to one another.
For example, the phrase “New York is in the US” expresses an “is in” relationship between
New York and the US. This can also be denoted as a triple: (New York, is in, the US).
Another example: “John Doe works at XYZ Inc.” expresses a “works at” relationship between
John Doe and XYZ Inc.
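Triples like these are commonly stored as plain (subject, relation, object) tuples, which makes them easy to query:

```python
# Relation-extraction output represented as (subject, relation, object) triples
triples = [
    ("New York", "is in", "the US"),
    ("John Doe", "works at", "XYZ Inc."),
]

def relations_for(entity, facts):
    """Return every (relation, object) pair in which the entity is the subject."""
    return [(r, o) for (s, r, o) in facts if s == entity]

print(relations_for("New York", triples))  # [('is in', 'the US')]
```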
Text Annotation Techniques
a. Manual Annotation
In manual text annotation, humans add labels or tags to specific parts of the text. This
technique is considered more precise than other text annotation techniques. It applies labels
to the text according to predefined standards and rules, and the results can be used for
various natural language processing (NLP) and machine learning tasks.
b. Active Learning
In the active learning text annotation technique, machine learning models select data samples
to annotate. A small subset of large and challenging data samples is used to learn and label
parts of these texts.
Active learning is scalable and can be replicated for large projects with limited resources
while maintaining the accuracy of labeling data.
c. Crowdsourcing
In crowdsourcing, annotation tasks are distributed among a large group of contributors. It is
an efficient way to scale annotation of data that is simple and easy to categorize using
specific guidelines.
How does Text Annotation Work?
a. Data selection and preparation
The first step in the text annotation process is to choose relevant textual data that must be
interpreted through machine learning.
The textual data that needs to be annotated must be relevant to the domain for which you
need to analyze textual information. The data is cleaned by removing unwanted texts and
symbols, such as punctuations, emoticons, and so on.
It is important to have textual data selected and prepared in advance to clarify the main
objective of text annotation and its application.
b. Defining the annotation type
The second step is to define the type of annotation needed. There are numerous types of text
annotation, such as sentiment analysis, which determines the emotional tone of a text (anger,
sadness, happiness, sarcasm, etc.), or named entity recognition, which labels text into
different categories (person, place, date, etc.).
Different text annotation methods impact the classification of texts as they will label the text
based on contextual understanding of the defined text annotation method.
c. Annotation process
The third step in the text annotation process is to label the parts of the text with the right
interpretations and contextual understanding.
Keyphrasing, language identification, and document classification are different ways to label
texts. Other text parts are tagged and classified based on the type of text annotation method
defined.
d. Quality control
Quality check and control is the last and most crucial step in the text annotation process. The
accuracy of the annotations on the selected textual data is cross-checked, reviewed, and
validated through various validation and review methods, such as rule-based (if-condition)
checks.
Benefits of Text Annotation in Data Extraction
Surface-level data extraction without understanding what the textual data means on a
document can lead to many errors, increasing human intervention and reducing the software's
reliability in getting the job done automatically.
Improves accuracy and efficiency: One benefit of text annotation for data extraction
is that it allows for more precise information. By marking up specific elements such
as entities, relationships, and so on, algorithms can better understand exactly what
information is to be extracted.
Enables targeted data capture: Text annotation takes a very targeted approach to
deciding which types of entities need to be captured and labelled. Named entities such
as supplier name, vendor name, address, and phone number, along with only the line
items required by the organization, will be extracted, improving the relevancy of the
extracted data.
Enhances data quality: Text annotation also improves data quality by providing
structure to unstructured data. This is possible through a framework of organizing and
standardizing extracted data. Data ambiguity can be reduced by defining clear
guidelines, and consistent annotation can make it easy to verify extracted data. This
can improve accuracy and maintain data quality during text annotation.
Challenges in Text Annotation
Data Ambiguity
Words, phrases, or sentences can have many meanings. With contextual information, the
meaning of such texts can be consistent, but errors can occur. Different annotators can
interpret such text differently, and the chances of such errors occurring at scale are high.
Take the phrase “I saw the person with the camera.” It can be interpreted in two ways: the
speaker saw a person who was holding a camera, or the speaker used a camera to see the
person. Such misinterpretations can lead to inaccuracy while training the machines.
Scalability
Text annotation at scale is cumbersome, highly time-consuming, and labour-intensive.
Collecting, organizing, cleaning, and tagging the data takes the most time and effort. As the
volume increases, the requirement for data annotators also increases, making it quite
challenging for organizations to scale their text annotation efforts.
Data Quality
Text annotators are sourced from different parts of the world. Even with standard guidelines,
there can be situations where data quality while labeling text is compromised. This can be
because different people interpret the text differently if the context is missing.
For example, “fare” can be confused with “fair” (as in just or equitable) because the two
words sound alike. Such errors in data quality can lead to plenty of errors while processing
data at scale.
Cost
High-quality text annotators come at a high cost and may still fall short of desired targets.
Balancing accurate and consistent text annotation while maintaining a reasonable fee
structure remains an unresolved challenge for many businesses.
Annotation Guidelines
Annotation guidelines act as standard rules that should be followed during text annotation.
Good guidelines clearly define the rationale and purpose behind each label, provide examples
of how it should be applied, and address common scenarios. Annotators should use the
guidelines as a rule of thumb to ensure quality is not compromised.
Inter-Annotator Agreement (IAA)
Inter-annotator agreement (IAA) measures how consistently different annotators label the same data.
A high IAA score means that the annotators agree, whereas a low IAA indicates
disagreement between the annotators. The agreement or disagreement can be based on
interpretations of the text, the amount of ambiguity on tasks, how clear the guidelines are to
them, and so on.
IAA helps address the challenges of data quality and ambiguity because it provides an
objective measure of annotation consistency.
Active Learning
In active learning, the text annotation process is optimized by selecting the most informative
samples from a large set of unstructured textual data. It tackles the scalability issue as active
learning uses a small data set from a large pool to classify text, which can then be replicated
to a large data set using machine learning algorithms.
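The sample-selection step can be sketched as uncertainty sampling: annotate the items the model is least confident about first. The confidence scores below are stand-ins for real model probabilities.

```python
# Each unlabeled sample carries the model's confidence in its predicted label
unlabeled = [
    ("text A", 0.95),
    ("text B", 0.51),   # the model is unsure, so this is most informative
    ("text C", 0.70),
    ("text D", 0.55),
]

def select_for_annotation(samples, k):
    """Pick the k samples with the lowest model confidence."""
    return [text for text, conf in sorted(samples, key=lambda s: s[1])[:k]]

print(select_for_annotation(unlabeled, 2))  # ['text B', 'text D']
```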
Leveraging Automation
Various automation text annotation tools are available to annotate text efficiently. If your
organization needs to promptly label large volumes of data, these annotation tools are the best
solution.
Applications of Text Annotation in Various Industries
a. Customer Service
In customer service, text annotation helps build smarter customer support systems. The
customer’s intent, entities, and sentiment are better understood using different types of text
annotation.
Chatbots use text annotation to understand customer queries based on the key phrases and
provide personalized recommendations or guide them to support agents depending on the
text's tone.
b. Finance
One of the most prominent use cases of text annotation in banking and finance is fraud
detection. Machine learning models can detect fraud and alert customers by scanning and
understanding the texts exchanged over messaging apps.
The finance industry uses text annotation during data extraction from documents given for
loan applications. Information such as name entities, loan rates, type of assets, and bank
statements is captured and labelled easily. This reduces the overall time spent processing loan
applications, as human intervention at the documentation level is minimal.
c. Healthcare
Many research papers are published annually in healthcare and medical research, with
discoveries that help us live healthier lives. Text annotation is used in the medical field to
analyze text from these research papers.
Information from medical literature needs to be structured and organized so that medical
professionals can make important, life-saving decisions accordingly.
Text annotation can also be used to process electronic health records and other patient data
recorded at healthcare organizations. Patient data is de-identified during annotation to
comply with HIPAA privacy regulations.
d. Legal
The field of law is filled with paperwork and documents. Lawyers, paralegals, and their
teams have to search through boxes of documents to make an argument for their clients in
court. Text annotation can help structure these datasets so lawyers can easily find crucial and
valuable case information. NER-based systems can help law firms sift through documents
swiftly.
Text annotation also allows legal firms to keep digital records of their cases in the cloud.
e. Marketing
Public opinion toward the company or brand, feedback on social media on ad campaigns, and
reviews of products or services are all important elements for a brand to grow and nurture.
Through the sentiment analysis method of text annotation, you can analyze the public
perception of your brand. This can improve the positioning strategy and create advertising
campaigns to generate and increase brand equity.
Subplots are a powerful feature in Matplotlib that allow you to create multiple plots within a
single figure. This is useful for comparing different datasets or visualizations side by side
without creating multiple figures.
Creating Subplots
You can create subplots using the plt.subplot() function or the plt.subplots() function. The
plt.subplots() function is generally preferred because it provides a more flexible interface for
creating a grid of subplots.
1. Using plt.subplots()
The plt.subplots() function creates a grid of subplots and returns a figure and an array of axes
objects, which you can use to plot data on each subplot.
The subplot grid is the arrangement of subplots within a figure created using
matplotlib.pyplot.subplots(). It consists of rows and columns, with each cell representing
a subplot. The user can specify the number of rows and columns depending on the
desired layout.
The subplots are accessed using indexing, similar to accessing elements of a matrix. For
example, to access the subplot in the first row and second column, the index would be
[0, 1]. This allows users to quickly modify and customize individual subplots within the
grid.
The subplot grid can be created using the subplots() function, which returns a Figure
object and an array of Axes objects. The Figure object represents the entire figure, while
the Axes objects represent each subplot. These Axes objects can be used to modify the
properties of each subplot, such as the title, labels, and data.
Example:
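A minimal sketch of a 2x2 subplot grid created with plt.subplots(), indexing each Axes as described above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# 2 rows x 2 columns of Axes, returned as a 2D array
fig, axs = plt.subplots(2, 2, figsize=(8, 6))

axs[0, 0].plot(x, np.sin(x))
axs[0, 0].set_title('sin(x)')
axs[0, 1].plot(x, np.cos(x))   # first row, second column -> index [0, 1]
axs[0, 1].set_title('cos(x)')
axs[1, 0].plot(x, x)
axs[1, 0].set_title('x')
axs[1, 1].plot(x, x ** 2)
axs[1, 1].set_title('x^2')

fig.tight_layout()
plt.show()
```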