WEB MINING
Unit: 2
Social Media Analytics (ACSAI0622N)
Mr. M. Abdul Mateen Siddiqui
(Assistant Professor)
B. Tech. 6th Sem, Department of CSE (Cyber Security)
UNIT-I: SENTIMENT MINING
UNIT-II: WEB MINING
Web Mining Overview, Web Structure Mining, Search
Engine, Web Analytics, Machine Learning for extracting
knowledge from the web, Inverted indices and Boolean
queries. PLSI, Query optimization, SEO, page ranking,
Social Graphs (Interaction, Latent and Following Graphs),
Ethics of Scraping, Static data extraction and Web Scraping
using Python
1. Security
2. Digital Advertising
3. E-Commerce
4. Publishing
5. Massively Multiplayer Online Games
6. Backend Services and Messaging
7. Project Management & Collaboration
8. Real-time Monitoring Services
9. Live Charting and Graphing
10. Group and Private Chat
Apply state-of-the-art mining tools and libraries on realistic data sets as a basis
for business decisions and applications.
PO8 : Ethics
PO10 : Communication
CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 2 2 3 3 - - - - - - -
CO2 3 2 3 2 3 - - - - - - -
CO3 3 2 3 2 3 - - - - - - -
CO4 3 2 3 2 3 - - - - - - -
CO5 3 2 3 3 3 - - - - - - -
Program Specific Outcomes (PSOs)
S. No.  PSO Description
COs - PSOs Mapping
CO.K PSO1 PSO2 PSO3 PSO4
CO1 3 - - -
CO2 3 2 - -
CO3 3 3 - -
CO4 3 3 - -
CO5 3 3 - -
Program Educational Objectives (PEOs)
PEOs  Description
• To produce graduates with a strong foundation of basic science, statistics & engineering and the ability to use modern tools and technologies to solve real-world complex problems and to address ever-changing industrial requirements globally.
• To produce graduates who can inculcate life-long learning for up-skilling and re-skilling and build a successful career as a data scientist, entrepreneur or bureaucrat for the goodwill of society.
• To produce graduates who can exhibit professional ethics and moral values with the capability of working as an individual and as a team to contribute towards the needs of industry and society.
• Students should have knowledge of data analysis tools and web technology.
• Students should have good knowledge of Python programming and Python coding experience.
• https://www.youtube.com/watch?v=KjWu1dZn00
• https://www.youtube.com/watch?v=ntOaoW0T604
Unit Content
• Web Mining
• Web Structure Mining
• Search Engine
• Web Analytics
• Machine Learning for extracting knowledge from the web
• Inverted indices and Boolean queries.
• PLSI
• Query optimization
• Page ranking
• Social graphs
• Ethics of Scraping
• Static Data Extraction
• Web Scraping using Python
One example of web mining is to analyze website traffic and user behavior. By
analyzing clickstream data and other user interactions with a website,
organizations can gain insights into how users navigate their site, what content is
most popular, and where users are dropping off. This information can be used to
optimize website design and improve user experience.
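As a minimal sketch of this kind of analysis (assuming pandas and an invented clickstream table with user_id, page, and timestamp columns), the snippet below counts how popular each page is and on which page each user's recorded session ends, i.e. where users drop off:

```python
import pandas as pd

# Hypothetical clickstream log: one row per page view.
clicks = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "page":     ["/home", "/products", "/cart", "/home", "/products", "/home"],
    "timestamp": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:01", "2025-01-01 10:02",
        "2025-01-01 11:00", "2025-01-01 11:03", "2025-01-01 12:00",
    ]),
})

# Most popular content: number of views per page.
popularity = clicks["page"].value_counts()

# Drop-off points: the last page each user visited in this log.
last_pages = (clicks.sort_values("timestamp")
                    .groupby("user_id")["page"]
                    .last()
                    .value_counts())

print(popularity)
print(last_pages)
```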
Web Structure Mining is one of the three different types of techniques in Web
Mining; in this section we focus purely on Web Structure Mining.
Web Structure Mining is the technique of discovering structure information
from the web. It uses graph theory to analyze the nodes and connections in the
structure of a website. The web graph consists of web pages as nodes and
hyperlinks as edges connecting related pages. Structure mining essentially
produces a structured summary of a particular website: it identifies the
relationships between web pages linked by information or by direct link
connections. Web structure mining can be very useful, for example, to
determine the connection between two commercial websites.
Depending upon the type of web structural data, Web Structure Mining
can be categorized into two types:
1. Extracting patterns from hyperlinks on the Web: The web works
through a system of hyperlinks using the Hypertext Transfer Protocol
(HTTP). A hyperlink is a structural component that connects one web page
to a different location. Any page can create a hyperlink to any other page,
and that page can in turn be linked to some other page. The intertwined,
self-referential nature of the web lends itself to some unique
network-analytical algorithms. The structure of web pages can also be
analyzed to examine the pattern of hyperlinks among pages.
2. Mining the document structure: This is the analysis of the tree-like structure of a web
page to describe HTML or XML tag usage.
There are several terms associated with Web Structure Mining:
Web Graph: the directed graph representing the Web.
Node: a node represents a web page in the graph.
Edge: an edge represents a hyperlink of a web page in the graph (web graph).
In-degree: the number of hyperlinks pointing to a particular node in the graph.
Out-degree: the number of links generated from a particular node in the graph.
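A small illustration of these terms, assuming the networkx library and a made-up four-page website: the directed web graph is built from hyperlinks, and the in-degree and out-degree are then read off per node.

```python
import networkx as nx

# Directed web graph: nodes are pages, edges are hyperlinks.
web = nx.DiGraph()
web.add_edges_from([
    ("home.html", "about.html"),
    ("home.html", "products.html"),
    ("products.html", "contact.html"),
    ("about.html", "home.html"),      # pages may link back (self-referential nature of the web)
])

for page in web.nodes:
    print(page,
          "in-degree:", web.in_degree(page),    # hyperlinks pointing to the page
          "out-degree:", web.out_degree(page))  # hyperlinks generated from the page
```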
Search Engine
A search engine is a software program that provides information
according to the user's query. It finds the various websites or web pages that are
available on the internet and returns results related to the search.
Product managers, data scientists, UX designers and others can use web
analytics if they’re looking to enhance their website or product experience
to meet customer needs. They need to know which website metrics to
track while also being mindful of the shortcomings of web analytics.
◼ Machine Learning-based algorithms autonomously develop their knowledge from the data
patterns they receive, without needing specific initial inputs from the developer. In these
models the machine can establish by itself the patterns to follow to obtain the desired result;
the real factor that distinguishes this kind of artificial intelligence is therefore autonomy. In the
learning process that characterizes these algorithms, the system receives a set of training
data and estimates the relationships between the input and output data: these relationships
represent the parameters of the model estimated by the system.
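As a toy illustration of "estimating the relationships between the input and output data" (a sketch only, using scikit-learn and invented numbers), the snippet below fits a linear model; the learned coefficients and intercept are the parameters the paragraph refers to.

```python
from sklearn.linear_model import LinearRegression

# Invented training data: input (e.g. number of backlinks) vs. output (e.g. daily visits).
X = [[1], [2], [3], [4], [5]]
y = [12, 19, 31, 42, 48]

model = LinearRegression().fit(X, y)   # training: estimate the parameters
print(model.coef_, model.intercept_)   # the learned relationship
print(model.predict([[6]]))            # apply the model to unseen input
```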
• Latent variable model for general co-occurrence data: associate each observation (w, d) with a
latent class variable z ∈ Z = {z_1, …, z_K}.
• Generative model:
  • Select a document d with probability P(d)
  • Pick a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)
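• Combining the three steps, the joint probability of observing a (document, word) pair under PLSI is
P(d, w) = P(d) Σ_{z ∈ Z} P(z|d) P(w|z).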
On Page SEO
Involves optimization done on your side.
Changes are made to elements that appear on the different pages of your
website:
Title, tags, keyword placement, indexing, content planning, display ads,
internal links, visuals.
• SEO stands for "search engine optimization." In simple terms, SEO means the
process of improving your website to increase its visibility in Google, Microsoft
Bing, and other search engines whenever people search for relevant content.
Ultimately, the goal of search engine optimization is to help attract website
visitors who will become customers, clients or an audience that keeps coming back.
• Social SEO is the practice of adding text-based features like captions, alt-text,
and closed captions to your posts to help people browsing social platforms
easily find your content.
• To understand social SEO, you need to understand the basics of traditional SEO. In
digital marketing, SEO stands for search engine optimization. Search engines like
Google or Bing allow you to search for information and then serve up a list of web
results that point you to the content you’re looking for. (Or, at least, the content
algorithms think you would want to see based on the search phrase you used,
your location, previous searches, etc.)
"Scraping" in the web context refers to the process of using automated software
(bots) to extract data or content from a website, essentially "collecting" information
from a webpage by analyzing its underlying HTML code to retrieve specific details like
product prices, news articles, or contact information, which can then be stored and
used for various purposes like market research or price comparison.
Web scraping is an automatic method to obtain large amounts of data from websites.
Most of this data is unstructured data in HTML format, which is then converted
into structured data in a spreadsheet or a database so that it can be used in various
applications. There are many different ways to perform web scraping to obtain data
from websites. These include using online services, particular APIs, or even writing
your own web-scraping code from scratch. Many large websites, like Google, Twitter,
Facebook, Stack Overflow, etc., have APIs that allow you to access their data in a
structured format. This is the best option, but there are other sites that don't allow
users to access large amounts of data in a structured form or that are simply not
that technologically advanced. In that situation, it's best to use web scraping to
scrape the website for data.
Web scraping requires two parts, namely the crawler and the scraper. The crawler is
an artificial intelligence algorithm that browses the web to search for the particular
data required by following the links across the internet. The scraper, on the other
hand, is a specific tool created to extract data from the website. The design of the
scraper can vary greatly according to the complexity and scope of the project so that
it can quickly and accurately extract the data.
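A minimal scraper sketch in Python, assuming the requests and BeautifulSoup (bs4) libraries and a placeholder URL: it downloads a single page, parses the underlying HTML, and extracts headings and links (a crawler would then follow those links across the site).

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"          # placeholder URL, not a real target
response = requests.get(url, timeout=10)
response.raise_for_status()          # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Scraper part: extract specific details from the unstructured HTML.
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)
print(links)                         # a crawler would queue these links and repeat
```

In practice, check a site's robots.txt and terms of service before scraping it; this is part of the ethics of scraping covered in this unit.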
• According to Matt Hartman, one way to think about the complex social graphs of
different platforms is to break them down into four fundamental components.
• The social graph is a graph that represents social relations between entities. In short, it
is a model or representation of a social network, where the word graph has been taken
from graph theory. The social graph has been referred to as "the global mapping of
everybody and how they're related".
• The social graph is an effective and widely used mathematical tool to represent the
relationships among users, which benefits the analysis of social interactions and the
characterization of user behavior. Usually, social networks can be modeled as undirected
graphs (e.g., friendship graph, interaction graph) or directed graphs (e.g., latent graph,
following graph); these are the four different types of social graphs considered here. Based
on these graph types, we discuss the connectivity and interaction among users. Moreover,
the huge size of the social graph challenges the effectiveness of analysis, so graph sampling
and crawling techniques have been proposed to deal with this problem. In this section, we
investigate several measurement, analysis, and modeling works related to the social graph.
• CENTRALIZED
• Today, the most popular social media sites are run by gigantic tech companies like Meta, Google,
Twitter, ByteDance, with Facebook getting the lion’s share. These platforms are centralized since
all of your interactions are hosted in the company’s servers.
• Pros:
• Production and running costs are covered by the platform's owners in order to attract users in the
first place
• If users forget their account credentials, they can ask for a password reset
• Cons:
• Users don’t get a say in how the platform should be run or how profit is shared
• DECENTRALIZED
• Unlike the centralized design, decentralized networks operate on independently run servers
and are usually powered by blockchain. Users can choose a server (service provider) to sign up
with, and then have access to the entire network across many different servers. A case in point
for the federated design is the email protocol: you can sign up with Gmail and still communicate
with a Yahoo user or with anyone who has an email address.
• Pros:
• Enable users to move seamlessly across platforms without rebuilding their social graph at each
destination.
• Cons:
• The production and running costs are split amongst a number of actors.
• Burden of responsibility when it comes to recovering a lost or stolen password
• Interaction Graph:
• This graph explicitly depicts observable interactions between users, like "likes",
comments, or direct messages on a social media platform. It shows who directly
interacts with whom, based on recorded actions.
• Latent Graph:
• This graph represents potential or hidden relationships between users that may not be
explicitly visible through direct interactions but can be inferred based on shared
interests, similar behavior, or other latent factors. It aims to uncover underlying
connections that might not be readily apparent in the observed interaction data.
• Following Graph:
• This graph specifically captures the "follow" relationships between users, where one
user actively chooses to see updates from another. This is particularly relevant on
platforms like Twitter where users follow others to see their content in their feed.
• Applications:
• Recommendation systems: Analyzing interaction and latent graphs
can be used to recommend content or people users might be
interested in based on their connections and behavior.
• Social influence analysis: Studying the structure of a following graph
can help identify influential users within a network.
• Link prediction: By identifying potential latent connections,
algorithms can predict future interactions between users who might
not be directly connected yet.
• Key Features:
1. Nodes: Represent individuals (users, employees, customers, etc.).
2. Edges: Represent interactions such as messages, comments, likes,
meetings, or shared activities.
3. Edge Weights: Often indicate the frequency or intensity of interactions
(e.g., number of messages exchanged, duration of calls).
4. Temporal Aspect: Some interaction graphs evolve over time, capturing
changing relationships.
Social Graphs
• Applications:
• Social Media Analysis: Understanding real engagement beyond just
followers or friends.
• Cybersecurity: Detecting unusual or suspicious communication
patterns.
• Organizational Analysis: Identifying key influencers or
communication bottlenecks in a company.
• Recommender Systems: Suggesting friends, collaborators, or content
based on past interactions.
• Example
• If the social graph consists of users A, B, and C, and:
• A follows B
• B follows C
• C does not follow anyone
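The tiny example above can be written down directly as a following graph; a sketch using networkx (an assumed library choice), printing who each user follows and who follows them:

```python
import networkx as nx

# Following graph from the example: an edge X -> Y means "X follows Y".
follows = nx.DiGraph([("A", "B"), ("B", "C")])   # C follows no one, so it has no outgoing edges

for user in follows.nodes:
    print(user,
          "follows:", list(follows.successors(user)),
          "followers:", list(follows.predecessors(user)))
```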
Daily Quiz
2. Which one of the following refers to querying the unstructured textual data?
A. Information access
B. Information update
C. Information retrieval
D. None of these
3. Which of the following is an essential process in which intelligent methods are applied to extract data
patterns?
A. Warehousing
B. Data Mining
C. Text Mining
D. Data Selection
4. For what purpose do the analysis tools pre-compute the summaries of the huge amount of data?
A. In order to maintain consistency
B. For authentication
C. For data access
D. To obtain the query responses
9. Which of the following is the local method for improving recall of an information retrieval system?
a) Query expansion
b) Relevance feedback
c) Ontology based model
d) None of the above
15. The process of removing most common words (and, or, the, etc.) by an
information retrieval system before indexing is known as
a) Lemmatization
b) Stop word removal
c) Inverted indexing
d) Normalization
20. What will be the sum of the degrees of all vertices of an undirected graph G
with n vertices and e edges?
A) 2e
B) 2ne
C) ne
D) none of these
2) Which of the following processes is not involved in the data mining process?
A) Data exploration
B) Data transformation
C) Data archaeology
D) Knowledge extraction
11) In any directed graph, if all edges are reciprocal, the reciprocity |E_r|/|E| can have a maximum of
A)1
B)0
C)2
D)None of the above
20) ________________ is a cross-platform, user-friendly tool that allows you to draw social
networks
A) VOSViewer
B) Social Network Visualizer
C) Commetrix
D) Cuttlefish
• Data Mining: the process of discovering hidden and actionable patterns from
data
• Aggregation – It is performed when multiple features need to be combined into a
single one or when the scale of the features changes
• A decision tree is learned from the dataset (training data with known classes).
• The learned tree is later applied to predict the class attribute value of new data
(test data with unknown classes), where only the feature values are known.
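A compact version of this train-then-predict workflow is sketched below, using scikit-learn's DecisionTreeClassifier on invented feature and class values:

```python
from sklearn.tree import DecisionTreeClassifier

# Training data with known classes.
X_train = [[25, 0], [40, 1], [35, 1], [22, 0]]   # invented feature values
y_train = ["no", "yes", "yes", "no"]             # known class attribute

tree = DecisionTreeClassifier().fit(X_train, y_train)   # learn the tree

# Test data with unknown classes: only the feature values are known.
X_test = [[30, 1]]
print(tree.predict(X_test))                              # predicted class attribute value
```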
• A search engine is a software system designed to carry out web searches. The
most productive way to conduct a search on the internet is through a search
engine
• Vector Space Model: In the vector space model, we are given a set of documents D.
Each document is a set of words.
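A minimal vector space sketch, assuming scikit-learn and three made-up documents: each document becomes a vector of term weights, and cosine similarity then compares the document vectors pairwise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy document set D.
D = [
    "web mining extracts knowledge from the web",
    "search engines rank web pages",
    "social media analytics studies user behavior",
]

vectors = TfidfVectorizer().fit_transform(D)   # each document -> a term-weight vector
print(cosine_similarity(vectors))              # pairwise document similarity
```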