SMA Unit 2
WEB-MINING
Unit: 2
Name
Atul Pratap Singh
Qualification
B.Tech, M.Tech, Ph. D
Designation Assistant Professor
Department CSE[AI]
Teaching
Experience 16.6 years.
UNIT-I:SENTIMENT MINING
UNIT-II:WEB MINING
Web Mining Overview, Web Structure Mining, Search
Engine, Web Analytics, Machine Learning for extracting
knowledge from the web, Inverted indices and Boolean
queries. PLSI, Query optimization, SEO, page ranking,
Social Graphs (Interaction, Latent and Following Graphs),
Ethics of Scraping, Static data extraction and Web Scraping
using Python
1.Security
2. Digital Advertising
3. E-Commerce
4. Publishing
5. Massively Multiplayer Online Games
6. Backend Services and Messaging
7. Project Management & Collaboration
8. Real time Monitoring Services
9.Live Charting and Graphing
10. Group and Private Chat
Apply state-of-the-art mining tools and libraries to realistic data sets as a basis
for business decisions and applications.
PO10 : Communication
PO11 : Project management and finance
PO12 : Life-long learning
CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 2 2 3 3 - - - - - - -
CO2 3 2 3 2 3 - - - - - - -
CO3 3 2 3 2 3 - - - - - - -
CO4 3 2 3 2 3 - - - - - - -
CO5 3 2 3 3 3 - - - - - - -
Program Specific
S. No. PSO Description
Outcomes (PSO)
COs - PSOs Mapping
CO1 3 - - -
CO2 3 2 - -
CO3 3 3 - -
CO4 3 3 - -
CO5 3 3 - -
Program Educational Objectives (PEOs)
PEOs Description
To produce graduates with a strong foundation in basic science, statistics, and
engineering, and the ability to use modern tools and technologies to solve complex
real-world problems and to address ever-changing industrial requirements globally.
• Students should have knowledge of data analysis tools and web technology.
• Students should have good knowledge of Python Programming and Python coding experience.
• https://www.youtube.com/watch?v=KjWu1dZn00
• https://www.youtube.com/watch?v=ntOaoW0T604
Unit Content
• Web Search
• Data Mining
• and Machine Learning for extracting knowledge from the web,
• Inverted indices and Boolean queries.
• PLSI,
• Query optimization,
• page ranking,
• Essentials of Social graphs,
• Social Networks,
• Models,
• Information Diffusion in social media.
06/19/2025 Dr. Atul Pratap Singh Social Media Analytics Unit 2 24
Unit Objective
• A search engine is a software system designed to carry out web searches. The
most productive way to conduct a search on the internet is through a search
engine. A web search engine is a software system designed to search for
information on the World Wide Web. The search results are generally presented
as a list of results, often referred to as search engine results pages (SERPs).
The information may be a mix of web pages, images, and other types of files.
Some search engines also mine data available in databases or open directories.
• Many search engines are available, and some of them may seem familiar to you.
Well-known web search engines include Google, Bing, Yahoo, Ask.com, and AOL.com.
For the purpose of this course, we will search using the Google Chrome web
browser, first with the Google search engine and then with Microsoft’s Bing
search engine.
Data Mining(CO2)
Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through data
analysis. Data mining techniques and tools enable enterprises to predict future
trends and make more-informed business decisions.
Social media data mining typically involves the collection, processing, and analysis
of raw data obtained from social media platforms such as Facebook, Instagram, Twitter,
TikTok, LinkedIn, YouTube, and others, to uncover meaningful patterns and trends, draw
conclusions, and provide insightful and actionable information.
Social media data mining harvests various types of social data that are either
publicly available (e.g., age, gender, job profession, geographic location, etc.) or
are generated on a daily basis on social media platforms (e.g., comments, likes,
clicks, etc.).
Typically, the data represents people’s attitudes, connections, behavior, and
feelings towards a certain topic, product, or service. Depending on the social media
platform in question, this data may include the number of followers, comments,
likes, or shares on Facebook; retweets or the number of impressions on Twitter;
or engagement rates and hashtag usage on Instagram.
In computing, data is information that has been translated into a form that is efficient for
movement or processing.
• For each feature type, there exists a set of permissible operations (statistics) using the feature
values and transformations that are allowed.
• Nominal (categorical). These features take values that are often represented as strings. For
instance, a customer’s name is a nominal feature. In general, only a few statistics can be
computed on nominal features. Examples are the chi-square statistic ($\chi^2$) and the mode
(most common feature value).
For example, one can find the most common first name among customers. The only possible
transformation on the data is comparison. For example, we can check whether our customer’s
name is John or not. Nominal feature values are often presented in a set format.
• Ordinal. Ordinal features lay data on an ordinal scale. In other words, the feature values have
an intrinsic order to them. In our example, Money Spent is an ordinal feature because a High
value for Money Spent is more than a Low one.
• Vector Space Model. In the vector space model, we are given a set of documents
D. Each document is a set of words. The goal is to convert these textual
documents to [feature] vectors. We can represent document i with vector $d_i$,
$$d_i = (w_{1,i}, w_{2,i}, \ldots, w_{N,i}), \quad (5.1)$$
where $w_{j,i}$ represents the weight for word j that occurs in document i and N is
the number of words used for vectorization. To compute $w_{j,i}$, we can set it to 1
when word j exists in document i and 0 when it does not. We can also set it to the
number of times word j is observed in document i. A more generalized approach is to
use the term frequency-inverse document frequency (TF-IDF) weighting scheme. In the
TF-IDF scheme, $w_{j,i}$ is calculated as
$$w_{j,i} = tf_{j,i} \times idf_j,$$
• where $tf_{j,i}$ is the frequency of word j in document i and $idf_j$ is the inverse
document frequency of word j across all documents,
$$idf_j = \log_2 \frac{|D|}{|\{\text{document} \in D \mid j \in \text{document}\}|}.$$
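The TF-IDF weighting above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the three toy documents, the whitespace tokenizer, and the function name `tf_idf` are assumptions made only for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return (vocabulary, vectors); vectors[i][j] is the TF-IDF weight of word j in document i."""
    # Assumption: lowercase + whitespace splitting is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = {w: sum(1 for doc in tokenized if w in doc) for w in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # term frequency tf_{j,i}
        vectors.append([counts[w] * math.log2(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus.
docs = ["web mining web search", "web graph", "web search engine"]
vocab, vecs = tf_idf(docs)
for doc, vec in zip(docs, vecs):
    print(doc, [round(x, 3) for x in vec])
```

Note that a word occurring in every document (here "web") gets weight 0, because its inverse document frequency is $\log_2 1 = 0$.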
Data Quality When preparing data for use in data mining algorithms, the following
four data quality aspects need to be verified:
• Noise is the distortion of the data. This distortion needs to be removed or its
adverse effect alleviated before running data mining algorithms because it may
adversely affect the performance of the algorithms. Many filtering algorithms are
effective in combating noise effects.
• Outliers are instances that are considerably different from other instances in the
dataset. Consider an experiment that measures the average number of followers of
users on Twitter. A celebrity with many followers can easily distort the average
number of followers per individual. Since the celebrities are outliers, they need to
be removed from the set of individuals to accurately measure the average number
of followers. Note that in special cases, outliers represent useful patterns, and the
decision to remove them depends on the context of the data mining problem.
• Missing Values are feature values that are missing in instances. For
example, individuals may avoid reporting profile information on
social media sites, such as their age, location, or hobbies. To solve this
problem, we can (1) remove instances that have missing values, (2)
estimate missing values (e.g., replacing them with the most common
value), or (3) ignore missing values when running data mining
algorithms.
• Duplicate data occurs when there are multiple instances with the exact
same feature values. Duplicate blog posts, duplicate tweets, or profiles
on social media sites with duplicate information are all instances of
this phenomenon.
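The duplicate, missing-value, and outlier checks described above can be illustrated with plain Python. This is only a sketch under stated assumptions: the profile records are invented, and the "10 times the median" outlier cutoff is an ad hoc rule chosen for the example, not a method taken from these notes.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical profiles: (user, followers, location); None marks a missing value.
profiles = [
    ("ana", 120, "Delhi"),
    ("bob", 80, None),
    ("cat", 100, "Delhi"),
    ("cat", 100, "Delhi"),          # exact duplicate of the previous instance
    ("star", 5_000_000, "Mumbai"),  # celebrity account (outlier)
]

# Duplicate data: keep the first occurrence of each instance, preserving order.
unique = list(dict.fromkeys(profiles))

# Missing values: option (2) from the text, replace with the most common value.
known = [loc for _, _, loc in unique if loc is not None]
most_common = Counter(known).most_common(1)[0][0]
filled = [(u, f, loc if loc is not None else most_common) for u, f, loc in unique]

# Outliers: assumed ad hoc rule, drop values above 10x the median.
followers = [f for _, f, _ in filled]
cutoff = 10 * median(followers)
typical = [f for f in followers if f <= cutoff]

print("mean with outlier:   ", mean(followers))
print("mean without outlier:", mean(typical))
```

Whether the celebrity account should really be dropped depends on the question being asked, as the text notes.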
Data Preprocessing. Often, the data provided for data mining is not immediately ready. Data
preprocessing (and transformation) prepares the data for mining. Typical data
preprocessing tasks are as follows:
• Aggregation. This task is performed when multiple features need to be combined into a
single one or when the scale of the features changes. For instance, when storing image
dimensions for a social media website, one can store image width and height or,
alternatively, store image area (width × height). Storing image area saves storage space
and tends to reduce data variance; hence, the data has higher resistance to distortion and
noise.
• Discretization. Consider a continuous feature such as money spent in our previous
example. This feature can be converted into discrete values – High, Normal, and Low –
by mapping different ranges to different discrete values. The process of converting
continuous features to discrete ones and deciding the continuous range that is being
assigned to a discrete value is called discretization.
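Aggregation and discretization as described above might look as follows in Python. The thresholds (100 and 500) and the sample values are hypothetical; in practice the ranges would be chosen from the data or from domain knowledge.

```python
def discretize(amount, low_max=100.0, high_min=500.0):
    """Map a continuous amount to Low / Normal / High (thresholds are assumptions)."""
    if amount < low_max:
        return "Low"
    if amount < high_min:
        return "Normal"
    return "High"

def aggregate_area(width, height):
    """Aggregate two features (width, height) into a single feature (area)."""
    return width * height

# Hypothetical values for the Money Spent feature.
money_spent = [25.0, 100.0, 320.5, 500.0, 1200.0]
labels = [discretize(x) for x in money_spent]

# Hypothetical image dimensions in pixels.
images = [(640, 480), (1920, 1080)]
areas = [aggregate_area(w, h) for w, h in images]

print(labels)
print(areas)
```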
• Feature Selection. Often, not all features gathered are useful. Some may be irrelevant,
or there may be a lack of computational power to make use of all the features, among
many other reasons. In these cases, a subset of features is selected that could ideally
enhance the performance of the selected data mining algorithm. In our example,
customer’s name is an irrelevant feature to the value of the class attribute and the task
of predicting whether the individual will buy the given book or not.
• Feature Extraction. In contrast to feature selection, feature extraction converts the
current set of features to a new set of features that can perform the data mining task
better. A transformation is performed on the data, and a new set of features is extracted.
The example we provided for aggregation is also an example of feature extraction
where a new feature (area) is constructed from two other features (width and height).
• Sampling. Often, processing the whole dataset is expensive. With the massive growth
of social media, processing large streams of data is nearly impossible.
• PLSI is a latent variable model for general co-occurrence data: associate each
observation (w, d) with a class variable $z \in Z = \{z_1, \ldots, z_K\}$.
• Generative model:
• Select a document d with probability P(d).
• Pick a latent class z with probability P(z|d).
• Generate a word w with probability P(w|z).
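Under the generative model above, the probability of seeing word w in document d is obtained by summing over the latent classes: $P(w|d) = \sum_z P(z|d)\,P(w|z)$, and $P(d, w) = P(d)\,P(w|d)$. The sketch below evaluates these two formulas for made-up parameters; all probability tables are hypothetical, and in practice they would be estimated from data (typically with the EM algorithm), which is not shown here.

```python
# Hypothetical PLSI parameters: 2 documents, 2 latent classes, 3 words.
p_d = {"d1": 0.5, "d2": 0.5}
p_z_given_d = {
    "d1": {"z1": 0.9, "z2": 0.1},
    "d2": {"z1": 0.2, "z2": 0.8},
}
p_w_given_z = {
    "z1": {"web": 0.6, "graph": 0.3, "query": 0.1},
    "z2": {"web": 0.1, "graph": 0.2, "query": 0.7},
}
words = ["web", "graph", "query"]

def p_word_given_doc(w, d):
    """P(w|d) = sum over z of P(z|d) * P(w|z)."""
    return sum(p_z_given_d[d][z] * p_w_given_z[z][w] for z in p_z_given_d[d])

def p_joint(d, w):
    """P(d, w) = P(d) * P(w|d)."""
    return p_d[d] * p_word_given_doc(w, d)

# The joint distribution over all (d, w) pairs should sum to 1.
total = sum(p_joint(d, w) for d in p_d for w in words)
print(p_word_given_doc("web", "d1"), total)
```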
Query Optimization(CO2)
• Query optimization is the process of finding the most efficient way to execute a query,
improving query performance through rational use of system resources and guided by
performance metrics. The purpose of query tuning is to decrease the response time of the
query, prevent excessive consumption of resources, and identify poorly performing queries.
• In the context of query optimization, query processing determines how to retrieve data
from SQL Server faster by analyzing the execution steps of the query, the available
optimization techniques, and other information about the query.
• Query optimization tips for better performance
• Monitoring metrics can be used to evaluate query runtime, detect performance pitfalls, and
show how they can be improved. For example, they include:
• Execution plan: Shows step by step how SQL Server executes the query, including the
indexes scanned to retrieve data, and provides a detailed overview of metrics collected
during query execution.
• Input/Output statistics: Used to identify the number of logical and physical read
operations during query execution, which helps users detect cache/memory capacity issues.
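The idea of reading an execution plan before and after tuning can be tried without a SQL Server installation. The sketch below deliberately swaps in SQLite (through Python's standard `sqlite3` module) and its `EXPLAIN QUERY PLAN` statement, so the plan wording differs from SQL Server's; the `posts` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, likes INTEGER)")
conn.executemany(
    "INSERT INTO posts (author, likes) VALUES (?, ?)",
    [(f"user{i % 50}", i) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT likes FROM posts WHERE author = 'user7'"

plan_before = plan(query)  # no index on author yet: the table is scanned
conn.execute("CREATE INDEX idx_posts_author ON posts(author)")
plan_after = plan(query)   # the optimizer can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan text depends on the SQLite version, but the first plan reports a scan of `posts` and the second a search using `idx_posts_author`.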
2. Which one of the following refers to querying the unstructured textual data?
A. Information access
B. Information update
C. Information retrieval
D. None of these
3. Which of the following is an essential process in which the intelligent methods are applied to extract data patterns?
A. Warehousing
B. Data Mining
C. Text Mining
D. Data Selection
Daily Quiz(CO2)
4. For what purpose do analysis tools pre-compute summaries of huge amounts of data?
A. In order to maintain consistency
B. For authentication
C. For data access
D. To obtain query responses
9. Which of the following is the local method for improving recall of an information retrieval system?
a) Query expansion
b) Relevance feedback
c) Ontology based model
d) None of the above
10. ___________ is considered the most popular social network for business-to-business
marketing.
a) Facebook
b) Orkut
c) Ryze
d) LinkedIn
15. The process of removing most common words (and, or, the, etc.) by an information retrieval
system before indexing is known as
a) Lemmatization
b) Stop word removal
c) Inverted indexing
d) Normalization
16. PageRank is a metric for ________documents based on their quality
A. ranking hypertext
B. ranking document structure
C. ranking web content
D. None of these
17. The main purpose for structure mining is to extract previously unknown
relationships between
A. Web pages
B. Web hyperlinks
C. Web data
D. Web contents
18. Web structure mining is the process of discovering ____ information from the web
A. Semi structured
B. Unstructured
C. Structured
D. None of the above
20. What is the sum of the degrees of all vertices of an undirected graph G that has n
vertices and e edges?
A) 2e
B) 2ne
C) ne
D) none of these
Essentials of Social graphs(CO2)
Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph.
The entities are the nodes, and an edge connects two nodes if the nodes are related by the
relationship that characterizes the network. If there is a degree associated with the relationship,
this degree is represented by labeling the edges. Often, social graphs are undirected, as for the
Facebook friends graph. But they can be directed graphs, as for example the graphs of followers
on Twitter or Google+.
• Degree and Degree Distribution: The number of edges connected to one node is the degree of
that node. The degree of a node $v_i$ is often denoted using $d_i$. In the case of directed edges,
nodes have in-degrees (edges pointing toward the node) and out-degrees (edges pointing away from
the node). These values are denoted $d_i^{in}$ and $d_i^{out}$, respectively. In social media, degree
represents the number of friends a given user has. For example, on Facebook, degree represents
the user’s number of friends, and on Twitter in-degree and out-degree represent the number of
followers and followees, respectively. In any undirected graph, the summation of all node degrees
is equal to twice the number of edges.
• Theorem 2.1. The summation of degrees in an undirected graph is twice the number of edges,
$$\sum_i d_i = 2|E|. \quad (2.3)$$
Proof. Any edge has two endpoints; therefore, when calculating the degrees $d_i$ and $d_j$ for
any connected nodes $v_i$ and $v_j$, the edge between them contributes 1 to both $d_i$ and $d_j$;
hence, if the edge is removed, $d_i$ and $d_j$ become $d_i - 1$ and $d_j - 1$, and the summation
$\sum_k d_k$ becomes $\sum_k d_k - 2$. Hence, by removal of all m edges, the degree summation
becomes smaller by 2m. However, we know that when all edges are removed the degree summation
becomes zero; therefore, the degree summation is $2 \times m = 2|E|$.
• Graph Representation
• Adjacency Matrix. A simple way of representing graphs is to use an adjacency
matrix (also known as a sociomatrix). Figure 2.4 depicts an example of a graph
and its corresponding adjacency matrix. A value of 1 in the adjacency
matrix indicates a connection between nodes $v_i$ and $v_j$, and a 0 denotes no
connection between the two nodes. When generalized, any real number can be
used to show the strength of connections between two nodes.
• Adjacency List. In an adjacency list, every node is linked with a list of all the
nodes that are connected to it. The list is often sorted based on node order or some
other preference.
• Edge List. Another simple and common approach to storing large graphs is to
save all edges in the graph. This is known as the edge list representation.
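The three representations above, together with the degree-sum property (the sum of all degrees equals 2|E|), can be checked on a small example. The four-node graph below is hypothetical and is not the graph of Figure 2.4.

```python
# Edge list of a hypothetical undirected graph.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v3", "v4")]
nodes = sorted({v for edge in edges for v in edge})
index = {v: i for i, v in enumerate(nodes)}

# Adjacency matrix: 1 marks a connection; symmetric for undirected graphs.
matrix = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    matrix[index[a]][index[b]] = 1
    matrix[index[b]][index[a]] = 1

# Adjacency list: each node mapped to the sorted list of its neighbors.
adjacency = {v: [] for v in nodes}
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)
for v in adjacency:
    adjacency[v].sort()

# Degree of each node and the degree-sum check.
degrees = {v: len(adjacency[v]) for v in nodes}
print(degrees, sum(degrees.values()), 2 * len(edges))
```

The edge list is the most compact form for sparse graphs, while the matrix answers "are these two nodes connected?" in constant time.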
• Types of Graphs
In general, there are many basic types of graphs. In this section we discuss several basic types of
graphs.
Null Graph. A null graph is a graph where the node set is empty (there are no nodes). Obviously,
since there are no nodes, there are also no edges. Formally, $G(V, E),\ V = E = \emptyset$. (2.11)
Empty Graph. An empty or edgeless graph is one where the edge set is empty: $G(V, E),\ E = \emptyset$.
(2.12) Note that the node set can be non-empty. A null graph is an empty graph but not vice versa.
Directed/Undirected/Mixed Graphs. Graphs that we have discussed thus far rarely had directed edges.
As mentioned, graphs that only have directed edges are called directed graphs and ones that only have
undirected ones are called undirected graphs.
Mixed graphs have both directed and undirected edges. In directed graphs, we can have two edges
between i and j (one from i to j and one from j to i), whereas in undirected graphs only one edge can
exist. As a result, the adjacency matrix for directed graphs is not in general symmetric (i connected to j
does not mean j is connected to i, i.e., $A_{i,j} \neq A_{j,i}$), whereas the adjacency matrix for
undirected graphs is symmetric ($A = A^T$).
Weighted Graphs. A weighted graph is one in which edges are associated with
weights. For example, a graph could represent a map, where nodes are cities and
edges are routes between them. The weight associated with each edge represents the
distance between these cities. Formally, a weighted graph can be represented as G(V,
E, W), where W represents the weights associated with each edge, |W| = |E|.
Adjacent Nodes and Incident Edges.
Two nodes v1 and v2 in graph G(V, E) are adjacent when v1 and v2 are connected via
an edge:
v1 is adjacent to v2 ≡ e(v1, v2) ∈ E. (2.13)
Two edges e1(a, b) and e2(c, d) are incident when they share one endpoint (i.e., are
connected via a node):
e1(a, b) is incident to e2(c, d) ≡ (a = c) ∨ (a = d) ∨ (b = c) ∨ (b = d). (2.14)
• Traversing an Edge. An edge in a graph can be traversed when one starts at one of its end-nodes,
moves along the edge, and stops at its other end-node. So, if an edge e(a, b) connects nodes a and b,
then visiting e can start at a and end at b. Alternatively, in an undirected graph we can start at b
and end the visit at a.
• Walk, Path, Trail, Tour, and Cycle. A walk is a sequence of incident edges traversed one after
another. In other words, if in a walk one traverses edges $e_1(v_1, v_2), e_2(v_2, v_3), e_3(v_3, v_4),
\ldots, e_n(v_n, v_{n+1})$, we have $v_1$ as the walk’s starting node and $v_{n+1}$ as the walk’s
ending node. When a walk does not end where it started ($v_1 \neq v_{n+1}$), it is called an open walk.
When a walk returns to where it started ($v_1 = v_{n+1}$), it is called a closed walk. Similarly, a walk
can be denoted as a sequence of nodes, $v_1, v_2, v_3, \ldots, v_n$. In this representation, the edges
that are traversed are $e_1(v_1, v_2), e_2(v_2, v_3), \ldots, e_{n-1}(v_{n-1}, v_n)$. The length of a
walk is the number of edges traversed during the walk and in our case is n − 1. A trail is a walk where
no edge is traversed more than once; therefore, all walk edges are distinct.
• A closed trail (one that ends where it started) is called a tour or circuit. A walk where nodes
and edges are distinct is called a path, and a closed path is called a cycle. The length of a
path or cycle is the number of edges traversed in the path or cycle. In a directed graph, we
have directed paths because traversal of edges is only allowed in the direction of the edges.
In Figure 2.7, v4, v3, v6, v4, v2 is a walk; v4, v3 is a path; v4, v3, v6, v4, v2 is a trail; and
v4, v3, v6, v4 is both a tour and a cycle. A graph has a Hamiltonian cycle if it has a cycle
such that all the nodes in the graph are visited. It has an Eulerian tour if it has a tour in
which all the edges are traversed exactly once.
• Special Graphs Using general concepts defined thus far, many special graphs can be
defined. These special graphs can be used to model different problems. We review some
well-known special graphs and their properties in this section.
• Trees and Forests. Trees are special cases of undirected graphs. A tree is a connected graph
structure that has no cycle in it. In a tree, there is exactly one path between any pair of nodes. A
graph consisting of a set of disconnected trees is called a forest.
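The walk, trail, tour, path, and cycle definitions given earlier in this section can be turned into a small checker. The sketch assumes an undirected graph and a walk given as a node sequence; it does not verify that consecutive nodes are actually adjacent in any particular graph, and the function name `classify_walk` is made up for this example.

```python
def classify_walk(nodes):
    """Classify a walk given as a node sequence v1, ..., vn (undirected graph assumed)."""
    # Each traversed edge as an unordered pair of consecutive nodes.
    edges = [frozenset(pair) for pair in zip(nodes, nodes[1:])]
    closed = nodes[0] == nodes[-1]
    distinct_edges = len(set(edges)) == len(edges)
    # For a closed walk the repeated start/end node is counted once.
    inner = nodes[:-1] if closed else nodes
    distinct_nodes = len(set(inner)) == len(inner)
    return {
        "length": len(edges),
        "closed": closed,
        "trail": distinct_edges,
        "tour": distinct_edges and closed,
        "path": distinct_edges and distinct_nodes and not closed,
        "cycle": distinct_edges and distinct_nodes and closed,
    }

# The node sequences quoted in the text for Figure 2.7.
walk = classify_walk(["v4", "v3", "v6", "v4", "v2"])
short = classify_walk(["v4", "v3"])
loop = classify_walk(["v4", "v3", "v6", "v4"])
print(walk, short, loop, sep="\n")
```

The results agree with the text: the first sequence is a trail but not a path (v4 repeats), the second is a path, and the third is both a tour and a cycle.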
• Special Subgraphs Some subgraphs are frequently used because of their properties. Two such
subgraphs are discussed here.
• Spanning Tree:- For any connected graph, the spanning tree is a subgraph and a tree that
includes all the nodes of the graph. Obviously, when the original graph is not a tree, then its
spanning tree includes all the nodes, but not all the edges. There may exist multiple spanning
trees for a graph. For a weighted graph and one of its spanning trees, the weight of that
spanning tree is the summation of the edge weights in the tree. Among the many spanning
trees found for a weighted graph, the one with the minimum weight is called the minimum
spanning tree (MST).
• Complete Graphs:- A complete graph is a graph where for a set of nodes V, all possible edges
exist in the graph. In other words, all pairs of nodes are connected with an edge. Hence,
$|E| = \binom{|V|}{2}$. Complete graphs with n nodes are often denoted as $K_n$, for example
$K_1$, $K_2$, $K_3$, and $K_4$.
• Planar Graphs:- A graph that can be drawn in such a way that no two edges cross each other
(other than the endpoints) is called planar. A graph that is not planar is denoted as nonplanar.
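To make the minimum spanning tree definition above concrete, the sketch below builds one for a small complete graph. The notes do not name an algorithm; Kruskal's algorithm with a simple union-find structure is used here as one standard choice, and the edge weights on $K_4$ are hypothetical.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm. weighted_edges holds (weight, u, v) tuples.

    Returns (total_weight, tree_edges) for a connected undirected graph.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, total = [], 0
    for weight, u, v in sorted(weighted_edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
            total += weight
    return total, tree

# Complete graph K4 with hypothetical weights: |E| = 4 * 3 / 2 = 6 edges.
nodes = ["a", "b", "c", "d"]
edges = [(1, "a", "b"), (4, "a", "c"), (3, "a", "d"),
         (2, "b", "c"), (5, "b", "d"), (6, "c", "d")]
total, tree = minimum_spanning_tree(nodes, edges)
print(total, tree)
```

A spanning tree of a connected graph with |V| nodes always has |V| − 1 edges, which is why only three of the six edges are kept.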
• Bipartite Graphs. A bipartite graph G(V, E) is a graph where the node set can be partitioned into two sets such
that, for all edges, one endpoint is in one set and the other endpoint is in the other set. In other words, edges
only connect nodes from different sets; no edge joins two nodes of the same set.
• Regular Graphs A regular graph is one in which all nodes have the same degree. A regular graph where all
nodes have degree 2 is called a 2-regular graph. More generally, a graph where all nodes have degree k is
called a k-regular graph.
• We discuss two traversal algorithms: depth-first search (DFS) and breadth-first search (BFS).
• Depth-First Search (DFS). Depth-first search (DFS) starts from a node $v_i$, selects one of its neighbors
$v_j \in N(v_i)$, and performs DFS on $v_j$ before visiting other neighbors in $N(v_i)$. In other words, DFS
explores as deep as possible in the graph using one neighbor before backtracking to other neighbors. Consider
a node $v_i$ that has neighbors $v_j$ and $v_k$; that is, $v_j, v_k \in N(v_i)$. Let $v_{j(1)} \in N(v_j)$ and
$v_{j(2)} \in N(v_j)$ denote neighbors of $v_j$ such that $v_i \neq v_{j(1)} \neq v_{j(2)}$. Then for a depth-first
search starting at $v_i$ that visits $v_j$ next, nodes $v_{j(1)}$ and $v_{j(2)}$ are visited before visiting
$v_k$. In other words, a deeper node $v_{j(1)}$ is preferred to a neighbor $v_k$ that is closer to $v_i$.
Depth-first search can be used both for trees and graphs, but is better visualized using trees.
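The visiting order described above can be reproduced with a short sketch. The adjacency list mirrors the example in the text (node names `vi`, `vj`, `vk`, `vj1`, `vj2` stand for $v_i$, $v_j$, $v_k$, $v_{j(1)}$, $v_{j(2)}$); breadth-first search is included only for contrast, since the notes name it but do not describe it.

```python
from collections import deque

def dfs(adjacency, start):
    """Depth-first search: fully explore one neighbor before the next."""
    visited, order = set(), []

    def visit(v):
        visited.add(v)
        order.append(v)
        for neighbor in adjacency[v]:
            if neighbor not in visited:
                visit(neighbor)

    visit(start)
    return order

def bfs(adjacency, start):
    """Breadth-first search: visit all nodes at distance 1, then 2, and so on."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for neighbor in adjacency[v]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

# Adjacency list for the example in the text.
adjacency = {
    "vi": ["vj", "vk"],
    "vj": ["vi", "vj1", "vj2"],
    "vk": ["vi"],
    "vj1": ["vj"],
    "vj2": ["vj"],
}
dfs_order = dfs(adjacency, "vi")
bfs_order = bfs(adjacency, "vi")
print(dfs_order)
print(bfs_order)
```

DFS reaches `vj1` and `vj2` before `vk`, as the text states, while BFS visits `vk` first because it is closer to the start node.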
Algorithm 2.4 Dijkstra’s Shortest Path Algorithm
Require: Start node s, weighted graph/tree G(V, E, W)
return Shortest paths and distances from s to all other nodes.
for v ∈ V do
    distance[v] = ∞;
    predecessor[v] = −1;
end for
distance[s] = 0;
unvisited = V;
while unvisited ≠ ∅ do
    smallest = arg min_{v ∈ unvisited} distance[v];
    if distance[smallest] == ∞ then
        break;
    end if
    unvisited = unvisited \ {smallest};
    currentDistance = distance[smallest];
    for adjacent node to smallest: neighbor ∈ unvisited do
        newPath = currentDistance + w(smallest, neighbor);
        if newPath < distance[neighbor] then
            distance[neighbor] = newPath;
            predecessor[neighbor] = smallest;
        end if
    end for
end while
Return distance[] and predecessor[] arrays
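A direct Python transcription of the pseudocode above may help when experimenting. It is a sketch under a few assumptions: the graph is stored as a dictionary of neighbor-to-weight dictionaries, weights are non-negative, and `None` is used instead of −1 for "no predecessor". The example graph is hypothetical.

```python
import math

def dijkstra(graph, source):
    """graph: {node: {neighbor: weight}} with non-negative weights."""
    distance = {v: math.inf for v in graph}
    predecessor = {v: None for v in graph}  # None plays the role of -1
    distance[source] = 0
    unvisited = set(graph)
    while unvisited:
        # Pick the unvisited node with the smallest tentative distance.
        smallest = min(unvisited, key=distance.get)
        if distance[smallest] == math.inf:
            break  # the remaining nodes are unreachable from the source
        unvisited.remove(smallest)
        for neighbor, weight in graph[smallest].items():
            if neighbor in unvisited:
                new_path = distance[smallest] + weight
                if new_path < distance[neighbor]:
                    distance[neighbor] = new_path
                    predecessor[neighbor] = smallest
    return distance, predecessor

# Hypothetical weighted undirected graph; "x" is an isolated node.
graph = {
    "s": {"a": 4, "b": 1},
    "a": {"s": 4, "b": 2, "c": 5},
    "b": {"s": 1, "a": 2, "c": 8},
    "c": {"a": 5, "b": 8},
    "x": {},
}
distance, predecessor = dijkstra(graph, "s")
print(distance)
print(predecessor)
```

Scanning all unvisited nodes for the minimum keeps the code close to the pseudocode; for large graphs a priority queue (for example `heapq`) is the usual replacement.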
Edges can have directions. A directed edge is sometimes called an arc. Edges are represented
using their end-points, e.g., e(v2, v1). In undirected graphs, e(v1, v2) and e(v2, v1) represent
the same edge.
[Chart: percentage by social media platform (metric not stated): Facebook 72%, Pinterest 31%, Instagram 28%, LinkedIn 25%, Twitter 23%]
• In social media, many social networks contain millions of nodes and billions of
edges. These complex networks have billions of friendships, the reasons for
existence of most of which are obscure. Humbled by the complexity of these
networks and the difficulty of independently analyzing each one of these friendships,
we can design models that generate, on a smaller scale, graphs similar to real-world
networks. On the assumption that these models simulate properties observed in
real-world networks well, the analysis of real-world networks boils down to a
cost-efficient measuring of different properties of simulated networks. In addition,
these models
• allow for a better understanding of phenomena observed in real-world networks by
providing concrete mathematical explanations and
• allow for controlled experiments on synthetic networks when real-world networks
are not available.
• We discuss three principal network models in this chapter: the random graph model,
the small-world model, and the preferential attachment model.
Network Models (CO2)
• Random Graphs: We start with the most basic assumption on how friendships can be formed:
edges (i.e., friendships) between nodes (i.e., individuals) are formed randomly. The random graph
model follows this basic assumption. In reality, friendships in real-world networks are far from
random.
• By assuming random friendships, we simplify the process of friendship formation in real-world
networks, hoping that these random friendships ultimately create networks that exhibit common
characteristics observed in real-world networks. Formally, we can assume that for a graph with a fixed
number of nodes n, any of the $\binom{n}{2}$ edges can be formed independently, with probability p.
This graph is called a random graph and we denote it as the G(n, p) model.
• This model was first proposed independently by Edgar Gilbert [100] and Solomonoff and Rapoport
[262]. Another way of randomly generating graphs is to assume that both the number of nodes n and
the number of edges m are fixed. However, we need to determine which m edges are selected from
the set of $\binom{n}{2}$ possible edges. Let Ω denote the set of graphs with n nodes and m edges. To
generate a random graph, we can uniformly select one of the graphs in Ω. The number of graphs with
n nodes and m edges (i.e., |Ω|) is
$$|\Omega| = \binom{\binom{n}{2}}{m}. \quad (4.3)$$
The uniform random graph selection probability is $\frac{1}{|\Omega|}$.
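Both ways of generating a random graph described above can be sketched with the standard library. The function names `gnp` and `gnm`, the seed, and the sizes are assumptions for illustration only.

```python
import random
from itertools import combinations
from math import comb

def gnp(n, p, rng):
    """G(n, p): each of the C(n, 2) possible edges is formed independently with probability p."""
    return [edge for edge in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng):
    """Fixed n and m: choose m of the C(n, 2) possible edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)

rng = random.Random(42)          # fixed seed so runs are repeatable
n, m = 6, 4
possible_edges = comb(n, 2)      # number of possible edges
omega_size = comb(possible_edges, m)  # |Omega|: graphs with n nodes and m edges

graph_p = gnp(n, 0.3, rng)
graph_m = gnm(n, m, rng)
print(possible_edges, omega_size)
print(graph_p)
print(graph_m)
```

In G(n, p) the number of edges varies from run to run, with expected value $p\binom{n}{2}$; in the fixed-m variant it is always exactly m.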
• Small-World Model: The assumption behind the random graph model is that
connections in real-world networks are formed at random. Although unrealistic,
random graphs can model average path lengths in real-world networks properly,
but underestimate the clustering coefficient. To mitigate this problem, Duncan J.
Watts and Steven Strogatz in 1997 proposed the small-world model. In real-world
interactions, many individuals have a limited and often roughly fixed number of
connections. Individuals connect with their parents, brothers, sisters, grandparents,
and teachers, among others. Thus, instead of assuming random connections, as we
did in random graph models, one can assume an egalitarian model in real-world
networks, where people have the same number of neighbors (friends). This again is
unrealistic; however, it models more accurately the clustering coefficient of
real-world networks. In graph theory terms, this assumption is equivalent to
embedding individuals in a regular network, such as a regular ring lattice.
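The high clustering of a regular ring lattice can be checked numerically. The sketch below builds a lattice in which every node is linked to its k nearest ring neighbors and computes the standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected); that definition, the function names, and the sizes n = 20, k = 4 are assumptions, since this section does not spell them out. Only the regular lattice is built here, not the full small-world model.

```python
from itertools import combinations

def ring_lattice(n, k):
    """Regular ring lattice: each node is linked to k/2 neighbors on each side (k even)."""
    adjacency = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            u = (v + step) % n
            adjacency[v].add(u)
            adjacency[u].add(v)
    return adjacency

def local_clustering(adjacency, v):
    """Fraction of pairs of v's neighbors that are connected to each other."""
    neighbors = adjacency[v]
    if len(neighbors) < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return linked / (len(neighbors) * (len(neighbors) - 1) / 2)

lattice = ring_lattice(20, 4)
degrees = {len(neighbors) for neighbors in lattice.values()}
clustering = local_clustering(lattice, 0)
print(degrees, clustering)
```

For comparison, a random graph with the same number of nodes and edges would be expected to show a much lower value, which is the shortcoming of the random graph model mentioned above.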
• Preferential attachment model (algorithm)
Require: Graph G(V0, E0), where |V0| = m0 and dv ≥ 1 ∀ v ∈ V0, number of expected connections
m ≤ m0, time to run the algorithm t
return A scale-free network
// Initial graph with m0 nodes with degrees at least 1
G(V, E) = G(V0, E0);
for 1 to t do
    V = V ∪ {vi}; // add new node vi
    while di ≠ m do
        Connect vi to a random node vj ∈ V, i ≠ j (i.e., E = E ∪ {e(vi, vj)}) with probability
        P(vj) = dj / Σk dk
    end while
end for
Return G(V, E)
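The algorithm above can be sketched in Python as follows. A few simplifications are assumed and are not part of the pseudocode: node identifiers are consecutive integers, the degree weights are frozen while a single new node picks its m targets, and duplicate targets are simply redrawn. The initial triangle, m = 2, and t = 10 are hypothetical inputs.

```python
import random

def preferential_attachment(initial_edges, m, t, rng):
    """Grow a graph for t steps; each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    edges = list(initial_edges)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    next_id = max(degree) + 1            # assumption: integer node ids
    for _ in range(t):
        new_node = next_id
        next_id += 1
        existing = list(degree)
        weights = [degree[v] for v in existing]   # P(vj) = dj / sum_k dk
        targets = set()
        while len(targets) < m:                   # requires m <= number of existing nodes
            targets.add(rng.choices(existing, weights=weights)[0])
        degree[new_node] = 0
        for v in targets:
            edges.append((new_node, v))
            degree[new_node] += 1
            degree[v] += 1
    return edges, degree

rng = random.Random(7)                   # fixed seed so runs are repeatable
initial = [(0, 1), (1, 2), (2, 0)]       # m0 = 3 nodes, each with degree >= 1
edges, degree = preferential_attachment(initial, m=2, t=10, rng=rng)
print(len(edges), sorted(degree.values(), reverse=True))
```

Because well-connected nodes are more likely to be chosen, early nodes tend to accumulate high degrees, which is the "rich get richer" effect behind scale-free networks.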
Information Diffusion in social media. (CO2)
• Diffusion is the process by which information is spread from one place to another
through interactions. Its study draws on techniques from many fields, such as
sociology, epidemiology, and ethnography. Diffusion is not limited to information:
the spread of a contagious disease, which everyone is interested in avoiding, follows
the same kind of process. The diffusion process involves three main elements as follows:
• Sender. A sender (or a group of senders) is responsible for initiating the diffusion
process.
• Receiver. A receiver (or a group of receivers) receives the diffusion information from
the sender. Commonly, the number of receivers is higher than the number of senders.
• Medium. This is the channel through which the diffusion information is sent from the
sender to the receiver. This can be TV, newspaper, social media (e.g., a tweet on
Twitter), social ties, air (in the case of a disease spreading process), etc.
• A diffusion starts with an adopter (or a small number of adopters) who spreads the
innovation to others. Innovation typically represents newness. It is not the same thing
as invention; it is both a process and an outcome, and it involves discontinuous
change.
2) Which of the following process is not involved in the data mining process?
A) Data exploration
B) Data transformation
C) Data archaeology
D) Knowledge extraction
3) Which of the following process uses intelligent methods to extract data patterns?
A) Data mining
B) Text mining
C) Warehousing
D) Data selection
11) In any directed graph if all edges are reciprocal, can have maximum of |E|=
A)1
B)0
C)2
D)None of the above
14) Which of the following is not an appropriate measure for securing social networking accounts?
A) Strong passwords
B) Link your account with a phone number
C) Never write your password anywhere
D) Always maintain a soft copy of all your passwords in your PC
15) ________________ is a popular tool to block social-media websites from tracking your browsing activities.
A ) Fader
B) Blur
C) Social-Media Blocker
D) Ad-blocker
MCQs(CO2)
16) Increase your security for social media account by always ____________ as you step away from the
system.
A) signing in
B) logging out
C) signing up
D) logging in
20) ________________ is a cross-platform, user-friendly tool that allows you to draw social
networks
A) VOSViewer
B) Social Network Visualizer
C) Commetrix
D) Cuttlefish
• Data Mining: the process of discovering hidden and actionable patterns from data
• Aggregation – It is performed when multiple features need to be combined into a
single one or when the scale of the features changes
• A decision tree is learned from the dataset – (training data with known classes)
• The learned tree is later applied to predict the class attribute value of new data –
(test data with unknown classes) – Only the feature values are known
• A search engine is a software system designed to carry out web searches. The
most productive way to conduct a search on the internet is through a search
engine
• Vector Space Model In the vector space model, we are given a set of documents
D. Each document is a set of words.