
Indian Journal of Natural Sciences www.tnsroindia.org.in ©IJONS

Vol.14 / Issue 81 / Dec / 2023 International Bimonthly (Print) – Open Access ISSN: 0976 – 0997
REVIEW ARTICLE

Web Mining: Analyzing Websites and Collecting Knowledge from the Internet

S. Jaiganesh1* and L.R. Arvind Babu2

1 Department of Computer Application, Annamalai University, Annamalai nagar, Tamil Nadu, India.
2 Department of Computer and Information Science, Annamalai University, Annamalai nagar, Tamil Nadu, India.

Received: 18 Oct 2023 Revised: 25 Oct 2023 Accepted: 30 Oct 2023

*Address for Correspondence


S. Jaiganesh
Department of Computer Application,
Annamalai University,
Annamalai nagar, Tamil Nadu, India.

This is an Open Access Journal / article distributed under the terms of the Creative Commons Attribution License
(CC BY-NC-ND 3.0) which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited. All rights reserved.

ABSTRACT
With the growth of the World Wide Web (WWW), it has become more challenging for online search engines to provide
users with useful information. Web mining, one of the data mining techniques, is defined as the extraction of
hidden information from web sites and services. Depending on the kind of information being mined, web mining
may be divided into three categories: web content mining, web structure mining, and web usage mining.
The most common application of web mining is in search engines. To rank their search results, they use a
number of page ranking algorithms that are based either on the content of websites or on the web's link
structure. This article examines page ranking algorithms.

Index Terms: Information Retrieval, Page Rank, Search Engines, Web Mining, Web Page Ranking, User Profile,
World Wide Web.

Keywords: Web Mining, Web Page Ranking, User Profiles, Page rank.

INTRODUCTION

The WWW has billions of web pages, each containing a vast quantity of information. Depending on their individual
architectures, search engines carry out a variety of operations to extract the necessary information from the WWW.
These procedures can be challenging and time consuming. Each search engine's procedure starts with crawling,
followed by indexing, searching, and sorting/ranking of information. A crawler accesses a website, downloads all
its web pages, and then extracts the necessary information from those pages. The data gathered by the crawler must
be organized in some way before the search engine can use it; the data is indexed to cut down on the amount of
time needed to search through it.
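The indexing step described above can be illustrated with a minimal inverted index. This is only a sketch under simple assumptions: the page URLs and texts are hypothetical, terms are split on whitespace, and a query matches a page only if the page contains every query term. The function names `build_inverted_index` and `search` are illustrative and do not come from the article.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each lowercase term to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the pages that contain every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "a.html": "web mining extracts knowledge from the web",
    "b.html": "page ranking orders search results",
    "c.html": "web page ranking uses link structure",
}
index = build_inverted_index(pages)
print(search(index, "web ranking"))
```

Because each term points directly to the pages that contain it, a query only touches the entries for its own terms instead of scanning every downloaded page.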


The WWW is nowadays a common and interactive medium for sharing data. The Web is vast, varied, and constantly
changing, and it offers access to a huge quantity of information from any location at any time. A large number of
individuals use the internet to find information. However, even after clicking on multiple links, people frequently
find only a large number of irrelevant and useless documents. Web mining techniques are employed to obtain useful
data from the Web.

Overview of Web Mining


Web mining is the automatic discovery and extraction of knowledge from the Web using data mining techniques.
The following activities are included in web mining [10][7][11][16]:
1. Resource finding: locating the desired Web documents.
2. Information selection and preprocessing: automatically selecting and preprocessing specific information from
the retrieved Web resources.
3. Generalization: automatically discovering general patterns both within a single Web site and across multiple
sites.
4. Analysis: validating and/or interpreting the extracted patterns.
Web mining is divided into three categories: Web Content (WC), Web Usage (WU), and Web Structure (WS).

Web content (WC)


The process of obtaining valuable information from the content of web documents is referred to as web content
mining. Web pages may contain text, images, audio, video, and structured data such as tables and lists. Content
mining may be applied both to web documents and to search engine results pages. The two main approaches to
content mining are the agent-based approach and the database approach. There are three kinds of agents:
intelligent search agents, information filtering and categorization agents, and personalized web agents.
Intelligent search agents use domain characteristics and user profiles to search automatically for information in
response to a specific query. Information filtering and categorization agents use a variety of methods to filter
data according to predefined rules. Personalized web agents learn each user's preferences and find documents that
are relevant to that user's profile. The database approach relies on a well-formed database with defined domains,
schemas, and attributes. Content mining becomes challenging because the web holds unstructured, structured,
semi-structured, and multimedia data. [10] [16].

Techniques for Mining Unstructured Data: Text is an example of unstructured data that may be used for content
mining. Data mining yields previously unknown information, and text mining is the process of extracting such
previously unknown information from various text sources. Content mining therefore requires both data mining and
text mining methods; basic content mining is a form of text mining. Text mining techniques include information
extraction, topic tracking, summarization, categorization, clustering, and information visualization.

Techniques for Mining Structured Data: Three techniques for mining structured data are web crawlers, wrapper
generation, and page content mining.

Techniques for Mining Semi-Structured Data: Semi-structured data mining methods include the Object Exchange Model
(OEM), Top Down Extraction, and Web Data Extraction Language.

Techniques for Mining Multimedia Data: Multimedia Miner, color histogram matching, and shot boundary detection are
a few techniques for multimedia data mining.

Web Usage (WU)


Web usage mining is the practice of taking the secondary data produced by user interactions while browsing the Web
and turning it into valuable information. It processes information from client-side cookies, user profiles,
referrer logs, agent logs, server access logs, and metadata. [7] [16].


Web usage mining can be divided into three phases:
1. Preprocessing. The raw data is typically noisy, inconsistent, and incomplete. In this phase the available data
is prepared according to the needs of the following phase. Preprocessing comprises data cleaning, integration,
transformation, and reduction.
2. Pattern discovery. To discover user patterns, a variety of techniques and algorithms from statistics, data
mining, machine learning, and pattern recognition may be used.
3. Pattern analysis. This phase seeks to understand, visualize, and interpret the discovered patterns. [13] [16].
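As an illustration of the first phase, the sketch below cleans a few server access log entries and groups the requested pages by client host. It assumes the logs follow the Common Log Format, and the cleaning rules (dropping malformed lines, unsuccessful requests, and static resources) are illustrative choices rather than rules prescribed by the article. The sample log lines and the names `LOG_PATTERN`, `STATIC_SUFFIXES`, and `preprocess` are hypothetical.

```python
import re
from collections import defaultdict

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)
# Requests for these resources say little about browsing behaviour.
STATIC_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")

def preprocess(lines):
    """Clean raw log lines and group the requested pages by client host."""
    pages_by_host = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # incomplete or malformed entry
        if match["status"] != "200":
            continue  # keep successful requests only
        path = match["path"]
        if path.lower().endswith(STATIC_SUFFIXES):
            continue  # drop images, style sheets, and scripts
        pages_by_host[match["host"]].append(path)
    return dict(pages_by_host)

sample_log = [
    '10.0.0.1 - - [18/Oct/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [18/Oct/2023:10:00:01 +0000] "GET /logo.png HTTP/1.1" 200 2048',
    '10.0.0.2 - - [18/Oct/2023:10:00:05 +0000] "GET /missing.html HTTP/1.1" 404 0',
    '10.0.0.1 - - [18/Oct/2023:10:00:09 +0000] "GET /papers.html HTTP/1.1" 200 900',
    'corrupted line',
]
sessions = preprocess(sample_log)
print(sessions)
```

The cleaned page sequences per host are the kind of input that the pattern discovery phase would then work on.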

Web Structure (WS)


The aim of web structure mining is to generate a structural summary of a website and its web pages. It seeks to
discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks,
web structure mining classifies web pages and produces information such as the similarity and relationships
between different websites. This kind of mining can be performed at the document level (intra-page) or at the
hyperlink level (inter-page). Understanding the structure of Web data is crucial for information retrieval. In
contrast to conventional collections of text documents, the Web comprises a range of objects with essentially no
unifying structure and with far greater variation in authoring style and content.

Web pages are the objects of the WWW, and the links between them are in-links, out-links, and co-citations; a
co-citation occurs when two pages are both linked to by the same page. The following list of link mining tasks
that are applicable to web structure mining is not exhaustive. [2] [13] [16]

1. Link-based classification. This is the most recent extension of a classic data mining task to linked domains.
The aim is to predict the category of a web page using the words that occur on the page, the links between
pages, anchor text, HTML tags, and other possible attributes of the web page.
2. Link-based cluster analysis. The aim of cluster analysis is to find naturally occurring sub-classes. The data is
split into groups so that similar objects are grouped together and dissimilar objects fall into different groups.
Unlike the preceding task, link-based cluster analysis is unsupervised and may be used to find hidden patterns
in data.
3. Link type. A wide range of tasks concern the prediction of links, such as predicting the type of link between
two entities or predicting the purpose of a link.
4. Link strength. Links may be associated with weights.
5. Link cardinality. The main task here is to predict the number of links between objects.

Web structure mining has several applications, including the following:
a. Ranking the results of a user's search,
b. Deciding which pages to add to the collection, categorizing pages, and finding related pages,
c. Finding duplicated websites and measuring the similarity between them.
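The link notions used in this section (in-links, out-links, and co-citation) can be made concrete on a small hypothetical link graph. In the sketch below the graph is stored as a mapping from each page to the set of pages it links to; the page names and the helper names `in_links` and `co_citations` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def in_links(graph):
    """Invert an out-link mapping: page -> set of pages that link to it."""
    incoming = {page: set() for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming

def co_citations(graph):
    """Count, for each pair of pages, how many pages link to both of them."""
    counts = defaultdict(int)
    for targets in graph.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return dict(counts)

# Hypothetical link graph: page -> set of pages it links to (its out-links).
graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"B", "C"},
}
print(in_links(graph))
print(co_citations(graph))
```

Here pages B and C are co-cited twice, once by A and once by D, which is the kind of signal that link-based clustering or related-page discovery could build on.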

Page Ranking Methodologies


Effective searching for query words depends heavily on effective ranking of the results. The ranking of web pages
is complicated by a number of issues, including the fact that some web pages are built only for navigation and
that other web pages are not self-descriptive. Several methods have been presented in the literature for ranking
web pages. [7][16]

The following three algorithms are crucial:


1. Page Rank
2. Weighted Page Rank (WPR)
3. Hyperlink-Induced Topic Search (HITS)


Google's Page Rank, an algorithm introduced by Brin and Page in 1998, and Hyperlink-Induced Topic Search (HITS), an
algorithm proposed by Kleinberg in 1998, are two graph-based page ranking algorithms that are widely and
successfully employed in the field of web structure mining. Both algorithms give all links equal weight when
determining the rank score.

LITERATURE REVIEW

The online search engine is the user interface through which the user queries the information; it is the channel
that connects the user to the information repository. When a user submits a query to a search engine, a huge
number of web pages relevant to that query are available, often numbering in the millions. However, the user needs
only a few of these web pages. Search engines therefore use a ranking algorithm to sort the results that are
shown, so that the user sees the most significant and useful results first. Numerous algorithms have been created
for ranking web pages; a few of them are Page Rank, HITS, SALSA, Randomized HITS, Subspace HITS, and SimRank.

Page Rank Algorithm


The Page Rank algorithm assigns a Page Rank score to more than 25 billion web pages on the WWW. To determine an
overall ranking score for each web page, Google's search algorithm combines precomputed Page Rank scores with
text-matching scores. Although numerous other criteria are taken into account when determining overall ranking,
Google asserts that Page Rank is the core of its search engine software. A simplified definition of Page Rank is
as follows.
$$PR(u) = c \sum_{v \in B(u)} \frac{PR(v)}{N_v}$$
Here u represents a web page and B(u) is the set of pages that point to u. PR(u) and PR(v) are the rank scores of
pages u and v, respectively. Nv denotes the number of outgoing links of page v, and c is a normalization factor.
In Page Rank, the rank score of a page p is divided equally among its outgoing links, and the values assigned to
the outgoing links of page p are in turn used to compute the ranks of the pages that p points to. Page Rank was
later modified in response to the observation that not all users follow direct links on the web. The modified form
is given by the following equation.

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}$$
where d is the damping factor, typically set to 0.85. One may think of d as the probability that a user follows
the links, and of (1 − d) as the page rank contribution from pages that are not directly linked.
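The damped formula, PR(u) = (1 − d) + d Σ PR(v)/Nv over the pages v in B(u), can be evaluated by simple repeated substitution. The sketch below is one minimal way to do that, not the implementation used by any search engine. It assumes a hypothetical three-page graph in which every page has at least one outgoing link and every link target is itself a key of the graph; the initial score of 1.0 and the fixed iteration count are also assumptions.

```python
def page_rank(graph, d=0.85, iterations=50):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v for v in B(u))."""
    pages = list(graph)
    rank = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # B(u): pages v whose out-links include u; N_v = len(graph[v]).
            incoming = sum(
                rank[v] / len(graph[v]) for v in pages if u in graph[v]
            )
            new_rank[u] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Hypothetical graph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this example page C ends up with the highest score because it receives links from both A and B, while B receives only half of A's score.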

Weighted Page Rank Algorithm


Weighted Page Rank (WPR) is an extension of the standard Page Rank algorithm proposed by Wenpu Xing and Ali
Ghorbani. It assumes that the more popular a web page is, the more links other pages tend to have to it, or the
more it is linked to by them. Instead of dividing a page's rank value evenly among its outgoing linked pages, this
method assigns larger rank values to the more important (popular) pages.

Each outlink page receives a value proportional to its popularity. Popularity is measured by the number of inbound
and outbound links and is recorded as W^in(v, u) and W^out(v, u), respectively. W^in(v, u) is the weight of the
link (v, u), calculated as

$$W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p}$$


where I_u and I_p denote the number of inbound links of page u and page p, respectively, and R(v) denotes the
reference page list of page v, that is, the pages that v links to. W^out(v, u) is the weight of the link (v, u),
calculated from the number of outbound links of page u and the total number of outbound links of all reference
pages of page v:

$$W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

where O_u and O_p denote the number of outbound links of page u and page p, respectively. Taking these link weights
into account, the Page Rank formula becomes

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v)\, W^{in}_{(v,u)}\, W^{out}_{(v,u)}$$
The Weighted Page Rank algorithm, introduced by Wenpu Xing and Ali Ghorbani, is thus a weighted version of the
Page Rank algorithm. Instead of dividing the rank value of a page equally among its outgoing linked pages, it
gives each outgoing link a value proportional to the importance of the page it points to, so that the more
significant pages receive higher rank values. Weight is given to both the backlinks and the forward links. An
incoming link (backlink) is any link that points to a given page, while an outgoing link (forward link) is any
link that points away from that page. Because this method uses two factors, backlinks and forward links, it is
reported to be more effective than the Page Rank algorithm. The popularity derived from the number of in-links and
out-links is recorded as W^in and W^out, respectively. W^in(v, u) is the weight of the link (v, u), determined by
the number of in-links of page u and the total number of in-links of all reference pages of page v. [2][3][16].
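A small sketch can make the two link weights and the weighted update concrete. It reads the definitions above as follows: for a link (v, u), W^in is the in-link count of u divided by the summed in-link counts of all pages that v links to, and W^out is the same ratio for out-link counts. The three-page graph, the starting score of 1.0, the iteration count, and the names `link_weights` and `weighted_page_rank` are assumptions for illustration; every link target is assumed to be a key of the graph.

```python
def link_weights(graph):
    """Return W_in and W_out for every link (v, u) of the graph."""
    in_count = {p: 0 for p in graph}
    for targets in graph.values():
        for u in targets:
            in_count[u] += 1
    out_count = {p: len(targets) for p, targets in graph.items()}
    w_in, w_out = {}, {}
    for v, targets in graph.items():
        # Sums over R(v), the pages that v links to.
        total_in = sum(in_count[p] for p in targets)
        total_out = sum(out_count[p] for p in targets)
        for u in targets:
            w_in[(v, u)] = in_count[u] / total_in if total_in else 0.0
            w_out[(v, u)] = out_count[u] / total_out if total_out else 0.0
    return w_in, w_out

def weighted_page_rank(graph, d=0.85, iterations=50):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) * W_in(v,u) * W_out(v,u))."""
    w_in, w_out = link_weights(graph)
    rank = {p: 1.0 for p in graph}  # assumed starting value
    for _ in range(iterations):
        new_rank = {p: 1 - d for p in graph}
        for (v, u), weight_in in w_in.items():
            new_rank[u] += d * rank[v] * weight_in * w_out[(v, u)]
        rank = new_rank
    return rank

# Hypothetical graph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
w_in, w_out = link_weights(graph)
wpr = weighted_page_rank(graph)
print(w_in[("A", "C")], w_out[("A", "C")])
print({page: round(score, 3) for page, score in wpr.items()})
```

For the link from A to C, page C has two in-links while A's two reference pages have three in-links in total, so W^in is 2/3; both reference pages have one out-link each, so W^out is 1/2.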

HITS (Hyperlink Induced Topic Search)


Kleinberg presented this method in 1997. The algorithm first collects the root set, which consists of the hits
returned by the search engine for the query. Next it constructs the base set, which contains all the pages that
point to the root set; its size should range from 1000 to 5000 pages. In the third step, the focused graph is built
from the graph structure of the base set. Intrinsic links, that is, links between pages of the same domain, are
removed. The hub and authority scores are then calculated iteratively. In the HITS concept, Kleinberg identifies
two kinds of pages from the Web's hyperlink structure: authorities (pages that are good sources of content) and
hubs (pages that are good sources of links).

For a given query, HITS finds authorities and hubs. According to Kleinberg, a good hub is a page that points to
many good authorities, and a good authority is a page that is pointed to by many good hubs. Although HITS offers
good search results for a wide range of queries, it does not work well in all cases, for the following three
reasons: [1][13] [16]
1. Mutually reinforcing relationships between hosts. Sometimes a set of documents on one host points to a single
document on a second host, or a single document on one host points to a set of documents on a second host.
2. Automatically generated links. Web documents created by tools often contain links that were inserted by the
tool.
3. Non-relevant nodes. Sometimes pages point to other pages that are not relevant to the query topic.
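The iterative calculation of hub and authority scores mentioned above can be sketched as follows. The example starts from an already constructed focused graph, so root-set collection, base-set expansion, and removal of intrinsic links are not shown. The graph, the starting score of 1.0, the fixed iteration count, and the Euclidean normalization are assumptions for illustration.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) scores for a graph given as page -> out-links."""
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that point to p.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # Hub score: sum of the authority scores of the pages that p points to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical focused graph: three hub-like pages pointing to two targets.
graph = {
    "H1": ["X", "Y"],
    "H2": ["X", "Y"],
    "H3": ["X"],
    "X": [],
    "Y": [],
}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Page X becomes the strongest authority because all three hub-like pages point to it, and H1 and H2 score higher as hubs than H3 because they point to both authorities.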

ANALYSIS OF THE ISSUE


All of these algorithms, including Page Rank (PR), Weighted Page Rank (WPR), and Hyperlink-Induced Topic Search
(HITS), may perform satisfactorily in some cases, but frequently the user does not find the information they are
looking for. When using a search engine such as Google to look up a topic, we are all faced with the problem of
being presented with millions of search results. It is not practicable to search manually through all of these
millions of web pages for the necessary information [1] [16].


Even when we click on the first few links in the search results, we may not find the necessary information.
Consequently, we see a need for a system that returns the pertinent information in response to the query we
submit. By "relevant search" we mean that indexing should be done based on the intrinsic meaning of the query,
which must therefore be understood. The internet is the main source of information, which presents another set of
challenges: extracting the truly necessary information by reading and analyzing pages manually is difficult. Many
search engines return a lengthy list of documents, the majority of which are irrelevant. Therefore, ranking web
pages to improve search engine results is the major issue. The PageRank algorithm still has certain limitations in
how well it reflects the link relationships between web pages on the Internet and in how well it can uncover the
importance of Web pages.

PROPOSED WORK

Due to the growing amount of content available online, the World Wide Web has evolved into one of the most
important platforms for knowledge discovery and information retrieval. Standard search engines frequently return a
large number of pages in response to user queries, but users always want the best results quickly. Data mining
and deep learning techniques must therefore be applied to web data and documents, as they are essential for
locating the right web pages. Relevancy refers to how well a web page meets the user's informational needs after
being accessed. When applied to search results, ranking strategies make it easier for users to navigate the
result list. Page Ranking based on Visits to Links makes use of both user browsing information and the link
structure of pages to return, at the top of the result list, the pages that are most relevant to the user's
information needs. Its aim is to place the most important web pages or information in front of users. Links that
have a high likelihood of being clicked contribute more to the rank of the pages they point to. By contrast,
because the Page Rank technique relies solely on the link structure of the Web graph, the rank value of a page is
the same whether or not users ever visit it.

Ordering pages by Visits to Links is more goal-oriented than ordering them by link structure alone. Under Page
Ranking based on Visits to Links, a user cannot intentionally increase a page's rank by visiting it repeatedly,
since the page's rank is determined by the likelihood of visits (not the number of visits) on the pages that have
back links to it. The main issue is the regular crawling of web servers needed to compile an accurate and
up-to-date visit count for web pages. Specialized crawlers must be developed in order to retrieve the relevant
data from pages.
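The article does not state a formula for Page Ranking based on Visits to Links, so the following is only a sketch of one plausible reading. It assumes that each page passes on its rank in proportion to the share of recorded clicks that each of its out-links received, in place of the equal split 1/Nv of standard Page Rank. The visit counts, the starting score of 1.0, and the name `visits_page_rank` are hypothetical.

```python
def visits_page_rank(visits, d=0.85, iterations=50):
    """Rank pages where visits[v][u] is the click count on the link v -> u.

    Assumed update rule (not given in the article):
    PR(u) = (1 - d) + d * sum(PR(v) * visits[v][u] / total_visits(v)).
    """
    pages = set(visits)
    for targets in visits.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        new_rank = {p: 1 - d for p in pages}
        for v, targets in visits.items():
            total = sum(targets.values())
            if total == 0:
                continue  # no recorded visits on any link of v
            for u, count in targets.items():
                # Share of v's link visits that went to u (a likelihood).
                new_rank[u] += d * rank[v] * count / total
        rank = new_rank
    return rank

# Hypothetical visit counts: 90% of the clicks on A's links go to B.
visits = {
    "A": {"B": 90, "C": 10},
    "B": {"A": 5},
    "C": {"A": 5},
}
ranks = visits_page_rank(visits)
print({page: round(score, 3) for page, score in sorted(ranks.items())})
```

Under this reading, B and C have identical link structure but B ranks higher because the link to it is followed far more often. Multiplying every count on A's links by the same factor leaves the scores unchanged, which matches the remark that the likelihood of visits matters rather than their quantity.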

CONCLUSION

Over the years, the World Wide Web has grown more and more packed with information, making it challenging to find
the information one needs. Search engines aim to meet the demands of their users by giving them relevant
information. Discovering Web content and retrieving users' interests and needs are therefore becoming more and
more crucial. This article has covered several link analysis methods, namely the Page Rank, Weighted Page Rank,
and HITS algorithms. Page Ranking based on Visits to Links determines a web page's rank value based on user visits
to its inbound links. Ordering pages in this way makes the results more relevant and, as a result, gives the user
better search results.

REFERENCES

1. Ashish Jain, Rajeev Sharma, Gireesh Dixit, Varsha Tomar, "Page Ranking Algorithms in Web Mining,
Limitations of Existing Methods and a New Method for Indexing Web Pages", 2013 IEEE International
Conference on Communication Systems and Network Technologies.


2. Seifedine Kadry, Ali Kalakech, "On the Improvement of Weighted Page Content Rank", Journal of Advances in
Computer Networks, Vol. 1, No. 2, June 2013.
3. Rashmi Rani, Vinod Jain, "Weighted PageRank using the Rank Improvement", International Journal of Scientific
and Research Publications, Volume 3, Issue 7, July 2013.
4. Preeti Chopra, Md. Ataullah, "A Survey on Improving the Efficiency of Different Web Structure Mining
Algorithms", International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249 – 8958,
Volume-2, Issue-3, February 2013.
5. B. Aysha Banu, Dr. M. Chitra, "A Novel Ensemble Vision Based Deep Web Data Extraction Technique for
Web Mining Applications", 2012 IEEE International Conference on Advanced Communication Control and
Computing Technologies (ICACCCT).
6. P. Sudhakar, G. Poonkuzhali, R. Kishore Kumar, "Content Based Ranking for Search Engines", Proceedings of the
International MultiConference of Engineers and Computer Scientists 2012, Vol I, Hong Kong.
7. Dilip Kumar Sharma, A. K. Sharma, "A Comparative Analysis of Web Page Ranking Algorithms", (IJCSE)
International Journal on Computer Science and Engineering, Vol. 02, No. 08, 2010, 2670-2676.
8. Mohamed-K Hussein, Mohamed-H Mousa, "An Effective Web Mining Algorithm using Link Analysis",
(IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 1 (3), 2010, 190-197.
9. Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar, "Web Mining Using Topic Sensitive Weighted
PageRank", International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February 2012, ISSN
2229-5518.
10. Faustina Johnson, Santosh Kumar Gupta, "Web Content Mining Techniques: A Survey", International Journal of
Computer Applications (0975 – 888), Volume 47, No. 11, June 2012.
11. Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar, "An Effective Algorithm for Web Mining Based on Topic
Sensitive Link Analysis", International Journal of Advanced Research in Computer Science and Software
Engineering, Volume 2, Issue 4, April 2012, ISSN: 2277 128X.
12. V. Lakshmi Praba, T. Vasantha, "Evaluation of Web Searching Method Using a Novel WPRR
Algorithm for Two Different Case Studies", ICTACT Journal on Soft Computing, April 2012, Volume: 02,
Issue: 03.
13. Miguel Gomes da Costa Júnior, Zhiguo Gong, "Web Structure Mining: An Introduction", Proceedings of the 2005
IEEE International Conference on Information Acquisition, June 27 - July 3, 2005, Hong Kong and Macau, China.
14. Neelam Tyagi, Simple Sharma, "Comparative Study of Various Page Ranking Algorithms in Web Structure
Mining (WSM)", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN:
2278-3075, Volume-1, Issue-1, June 2012.
15. Ms. N. Preethi, Dr. T. Devi, "New Integrated Case And Relation Based (CARE) Page Rank Algorithm", 2013
International Conference on Computer Communication and Informatics (ICCCI-2013), Jan. 04 – 06, 2013,
Coimbatore, India.
16. Nilima V. Pardakhe, Prof. R. R. Keole, "Enhancement of the Web Search Engine Results using Page Ranking
Algorithm", International Journal of Innovative Research in Computer Science & Technology (IJIRCST), ISSN:
2347-5552, Volume-2, Issue-2, March 2014.


Fig 1. Web Mining
[Figure: taxonomy of web mining, with the following branches]
Web Content: Web Page Content (identifies information within a web page); Search Result (categorizes documents
using phrases in titles and snippets)
Web Structure
Web Usage / Web Log: General Access Pattern Tracking (understands access patterns and trends to improve
structure); Customized Usage Tracking (analyzes the access patterns of a user to improve response)
