Web Mining Analyzing Websites and Collec
Vol.14 / Issue 81 / Dec / 2023 International Bimonthly (Print) – Open Access ISSN: 0976 – 0997
REVIEW ARTICLE
¹Department of Computer Application, Annamalai University, Annamalai Nagar, Tamil Nadu, India.
²Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu, India.
This is an Open Access Journal / article distributed under the terms of the Creative Commons Attribution License
(CC BY-NC-ND 3.0) which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited. All rights reserved.
ABSTRACT
With the growth of the WWW, it has become more challenging for online search engines to provide users
with useful information. Web mining, one of the data mining techniques, is defined as the extraction of
hidden information from web sites and services. Based on the kind of information being mined, web mining
may be divided into three categories: web content mining, web structure mining, and web usage mining.
The most common application of web mining is in search engines. To rank their search results, they use a
number of page ranking algorithms that are based either on the content of web pages or on the web's link
structure. This article examines page ranking algorithms.
Index Terms: Information Retrieval, PageRank, Search Engines, Web Mining, Web Page Ranking, User Profile,
World Wide Web.
Keywords: Web Mining, Web Page Ranking, User Profiles, PageRank.
INTRODUCTION
The WWW has billions of web pages, each containing a vast quantity of information. Depending on their individual
architectures, search engines carry out a variety of operations to extract the necessary information from the WWW.
These procedures can be challenging and time consuming. Each search engine's procedure starts with crawling,
followed by indexing, searching, and sorting/ranking of the results. A crawler accesses a website, downloads its web
pages, and extracts the necessary information from them. The data gathered by the crawler must be organized before
the search engine can use it; the data is indexed to cut down on the amount of time needed to search through it.
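The indexing and searching steps described above can be illustrated with a minimal sketch. It assumes the crawler has already downloaded the pages; the page names and texts are hypothetical, and the inverted index with boolean AND search is only one simple way of organizing crawled data, not the design of any particular search engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each term to the set of page names that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the pages that contain every term of the query (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages standing in for what a crawler downloaded.
pages = {
    "a.html": "Web mining extracts hidden information from web pages",
    "b.html": "Search engines rank web pages with link analysis",
    "c.html": "Link analysis uses the link structure of the web",
}
index = build_index(pages)
```

Because each term maps directly to the pages that contain it, a query only touches the entries for its own terms instead of scanning every page. The matching pages would still have to be ordered by a ranking algorithm.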
Indian Journal of Natural Sciences www.tnsroindia.org.in ©IJONS
Jaiganesh and Arvind Babu
The WWW is a popular and interactive medium for sharing data nowadays. The Web is vast, varied, and constantly
changing, and it offers access to a huge quantity of information from any location at any time. A large number of
people use the internet to find information. However, even after following multiple links, users frequently find
only a large number of irrelevant documents. Web mining techniques are employed to extract useful information from
the Web.
Techniques for Mining Unstructured Data: Text is an example of unstructured data that may be used for content
mining. Data mining uncovers previously unknown information, and text mining is the process of extracting
previously unknown information from various text sources. Content mining therefore requires both data mining and
text mining methods, and basic content mining is a form of text mining. The text mining techniques used include
information extraction, topic tracking, summarization, categorization, clustering, and information visualization.
Techniques for Structured Data Mining: Three techniques for mining structured data are web crawlers, wrapper
generation, and page content mining.
Semi-structured Data Mining Methods: These include the Object Exchange Model (OEM), Top Down Extraction, and Web
Data Extraction Language.
Multimedia Data Mining Methods: Multimedia Miner, color histogram matching, and shot boundary detection are a few
techniques for multimedia data mining.
Web usage mining can be divided into three phases, each with its own difficulties:
1. Preprocessing. The available data is typically noisy, inconsistent, and incomplete. In this phase the data is
prepared according to the needs of the next phase. It comprises data cleaning, integration, transformation, and
reduction.
2. Pattern discovery. To discover user patterns, a variety of techniques and algorithms from statistics, data
mining, machine learning, and pattern recognition can be used.
3. Pattern analysis. This phase seeks to understand, visualize, and interpret the discovered patterns [13] [16].
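As a rough illustration of the preprocessing phase, the sketch below cleans a small web log and groups the remaining requests into user sessions. The record layout, the list of ignored file suffixes, and the 30-minute inactivity timeout are all assumptions made for this example; real server logs need parsing first and may call for different rules.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (client_ip, unix_time, url, status).
LOG = [
    ("10.0.0.1", 0,    "/index.html", 200),
    ("10.0.0.1", 5,    "/logo.png",   200),  # embedded image, not a page view
    ("10.0.0.1", 60,   "/news.html",  200),
    ("10.0.0.2", 90,   "/index.html", 200),
    ("10.0.0.1", 4000, "/index.html", 200),  # long gap: starts a new session
    ("10.0.0.2", 100,  "/missing",    404),  # failed request
]

IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")
SESSION_TIMEOUT = 1800  # seconds; an assumed 30-minute inactivity threshold

def clean(records):
    """Data cleaning: keep successful requests for pages only."""
    return [r for r in records
            if r[3] == 200 and not r[2].lower().endswith(IGNORED_SUFFIXES)]

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group each client's requests into sessions split by inactivity gaps."""
    by_client = defaultdict(list)
    for ip, ts, url, _ in sorted(records, key=lambda r: (r[0], r[1])):
        by_client[ip].append((ts, url))
    sessions = []
    for ip, visits in by_client.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions

sessions = sessionize(clean(LOG))
```

The resulting sessions are the kind of input that the pattern discovery phase would then work on.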
In web structure mining, the objects are the web pages of the WWW, and the links are in-links, out-links, and
co-citations, where a co-citation means that two pages are both linked to by the same page. The following list of
link mining tasks that may be used for web structure mining is not exhaustive [2] [13] [16]:
1. Link-based classification. This is the most recent extension of a traditional data mining task to linked
domains. The aim is to predict the category of a web page using the words that appear on the page, the links
between pages, anchor text, HTML tags, and other possible attributes of the page.
2. Link-based cluster analysis. The aim of cluster analysis is to find naturally occurring sub-classes. The data
is split into groups so that similar objects are grouped together and dissimilar objects fall into different
groups. Unlike the preceding task, link-based cluster analysis is unsupervised and may be used to find hidden
patterns in the data.
3. Link type. A range of tasks concern predicting the existence of links, such as predicting the type of link
between two objects or predicting the purpose of a link.
4. Link strength. Links may be associated with weights.
5. Link cardinality. The fundamental problem is to predict the number of links between objects.
Web structure mining has several applications, including the following:
a. Ranking the results of a user's query,
b. Deciding which pages to add to the collection, categorizing pages, and finding related pages,
c. Finding duplicated websites and measuring the similarity between them.
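The link notions used in this section (out-links, in-links, and co-citation) can be made concrete with a short sketch. The graph below is a hypothetical four-page example, and the two helper functions are illustrative names, not part of any library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical web graph: page -> pages it links to (its out-links).
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["B", "C"],
}

def in_links(graph):
    """Invert the out-link map: page -> set of pages that link to it."""
    incoming = defaultdict(set)
    for source, targets in graph.items():
        for target in targets:
            incoming[target].add(source)
    return incoming

def co_citations(graph):
    """Count, for each pair of pages, how many pages link to both of them."""
    counts = defaultdict(int)
    for targets in graph.values():
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts

incoming = in_links(GRAPH)
cocited = co_citations(GRAPH)
```

Here pages B and C are co-cited twice (by A and by D), which is the kind of signal that could be used to find related pages.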
Google's PageRank, an algorithm introduced by Brin and Page in 1998, and hypertext-induced topic selection (HITS),
an algorithm proposed by Kleinberg in 1998, are two graph-based page ranking algorithms that are widely and
successfully used in the field of web structure mining. Both algorithms give all links equal weight when computing
the rank score.
LITERATURE REVIEW
The online search engine is the user interface through which the user queries the information; it is the channel
that connects the user to the information repository. When a user submits a query to a search engine, a huge
number of web pages relevant to that query are available, often millions, yet the user needs only a few of them.
Search engines therefore use a ranking algorithm to sort the results that are shown, so that the user sees the most
important and useful results first. Numerous algorithms have been created for ranking web pages; a few of them are
PageRank, HITS, SALSA, Randomized HITS, Subspace HITS, and SimRank.
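The PageRank of a page u is computed as

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}$$

where B(u) is the set of pages that link to u, N_v is the number of outgoing links of page v, and d is the damping
factor, typically set to 0.85. One may think of d as the probability that a visitor follows the links on a page,
and of (1 − d) as the PageRank contributed by pages that are not directly linked.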
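The PageRank formula can be evaluated by simple repeated substitution. The following is a minimal sketch under stated assumptions: the three-page graph is hypothetical, every page starts with rank 1, and a fixed number of iterations is used instead of a convergence test. Pages without out-links simply pass on no rank here, which is a simplification.

```python
def pagerank(graph, d=0.85, iterations=100):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v for v in B(u))."""
    pages = set(graph) | {u for targets in graph.values() for u in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for v, targets in graph.items():
            if not targets:
                continue  # simplification: a page without out-links passes on nothing
            share = d * rank[v] / len(targets)  # d * PR(v) / N_v
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank

# Hypothetical graph: page -> pages it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
```

In this example page C ends up with the highest rank because it receives links from both A and B, while B, which receives only half of A's rank, ends up lowest.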
In the Weighted PageRank algorithm, each out-link page is assigned a value proportional to its popularity.
Popularity is measured by the numbers of in-links and out-links and is recorded as $W^{in}_{(v,u)}$ and
$W^{out}_{(v,u)}$, respectively. $W^{in}_{(v,u)}$ is the weight of link (v, u), calculated from the number of
in-links of page u and the number of in-links of all reference pages of page v:

$$W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p}$$
where I_u and I_p are the numbers of in-links of page u and page p, respectively, and R(v) denotes the set of
reference pages of page v, that is, the pages that v links to. $W^{out}_{(v,u)}$ is the weight of link (v, u),
calculated from the number of out-links of page u and the number of out-links of all reference pages of page v:

$$W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

where O_u and O_p are the numbers of out-links of page u and page p, respectively. Using both weights, the rank of
page u becomes

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v) \, W^{in}_{(v,u)} \, W^{out}_{(v,u)}$$
The Weighted PageRank algorithm, introduced by Wenpu Xing and Ali Ghorbani, is a weighted version of the PageRank
algorithm. Instead of dividing a page's rank value equally among the pages it links to, this method assigns larger
rank values to the more important pages: each out-link page receives a value proportional to its popularity.
Weight is given to both the backlinks and the forward links in this method. An incoming link (backlink) is any
link that points to a given page, while an outgoing link (forward link) is any link that points away from that
page. Because the method uses both factors, it is reported to be more effective than the PageRank algorithm.
Popularity is determined by the numbers of in-links and out-links and is recorded as W^in and W^out, respectively.
W^in(v, u) is the weight of link (v, u), determined by the number of in-links of page u and the total number of
in-links of all of page v's reference pages [2][3][16].
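A minimal sketch of Weighted PageRank follows, using the in-link and out-link weights described above, with R(v) taken as the pages v links to. The graph is hypothetical, the helper names are illustrative, and treating a weight as 0 when its denominator is 0 is an assumption of this sketch rather than something the text specifies.

```python
def link_counts(graph):
    """Return (out_links, in_count, out_count) for every page in the graph."""
    pages = set(graph) | {u for targets in graph.values() for u in targets}
    out_links = {p: set(graph.get(p, ())) for p in pages}
    in_count = {p: 0 for p in pages}
    for v in pages:
        for u in out_links[v]:
            in_count[u] += 1
    out_count = {p: len(out_links[p]) for p in pages}
    return out_links, in_count, out_count

def link_weight(count, out_links, v, u):
    """count[u] divided by the sum of count over R(v), the pages v links to."""
    total = sum(count[p] for p in out_links[v])
    # Assumption: a zero denominator yields weight 0.
    return count[u] / total if total else 0.0

def weighted_pagerank(graph, d=0.85, iterations=100):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) * W_in(v,u) * W_out(v,u))."""
    out_links, in_count, out_count = link_counts(graph)
    rank = {p: 1.0 for p in out_links}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in out_links}
        for v in out_links:
            for u in out_links[v]:
                w_in = link_weight(in_count, out_links, v, u)
                w_out = link_weight(out_count, out_links, v, u)
                new_rank[u] += d * rank[v] * w_in * w_out
        rank = new_rank
    return rank

# Hypothetical graph: page -> pages it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
wpr = weighted_pagerank(GRAPH)
```

In this example page A links to B and C, and C has two in-links while B has one, so the in-link weight of (A, C) is 2/3 and that of (A, B) is 1/3. More of A's rank therefore flows to the more popular page C.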
HITS identifies authorities and hubs for a given query. According to Kleinberg, a good authority is a page that is
pointed to by many good hubs, and a good hub is a page that points to many good authorities. Although HITS gives
good search results for a wide range of queries, it does not perform well in all situations, for the following
three reasons [1][13][16]:
1. Mutually reinforcing relationships between hosts. A set of documents on one host may point to a single
document on a different host, or a single document on one host may point to a set of documents on a different
host.
2. Automatically generated links. Web documents created by tools often contain links that were inserted by the
tool.
3. Non-relevant nodes. Sometimes pages link to other pages that are unrelated to the topic of the query.
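The mutual definition of hubs and authorities given above can be sketched as an iterative computation. The graph is hypothetical, a fixed iteration count replaces a convergence test, and the scores are normalized to unit Euclidean length on each round, which is a common choice but an assumption of this example.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) scores for every page in the graph."""
    pages = set(graph) | {u for targets in graph.values() for u in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that point to it.
        auth = {p: 0.0 for p in pages}
        for v, targets in graph.items():
            for u in targets:
                auth[u] += hub[v]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[u] for u in graph.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical query subgraph: three hub-like pages and two authority-like pages.
GRAPH = {"H1": ["A1", "A2"], "H2": ["A1", "A2"], "H3": ["A1"]}
hub, auth = hits(GRAPH)
```

Every link counts the same here, so a tool that inserts many links, or a host whose pages all point to one document, can inflate scores. This is the weakness behind the first two problems listed above.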
We may not find the necessary information when we click on the first few links in the search results.
Consequently, we see a need for a system that returns information relevant to the query we submit. By "relevant
search" we mean that indexing should be based on the intrinsic meaning of the query, which must therefore be
understood. The internet is the main information source, which presents another set of challenges: manually
reading and analyzing pages to extract the information that is really needed is difficult. Many search engines
return a lengthy list of documents, most of which are irrelevant. Ranking web pages to improve search engine
results is therefore the major issue. The PageRank algorithm is still limited in how well it reflects the link
relationships between web pages on the Internet and in how well it uncovers the importance of web pages.
PROPOSED WORK
Due to the growing amount of content available online, the World Wide Web has evolved into one of the most
important platforms for knowledge discovery and information retrieval. Standard search engines frequently return a
large number of pages in response to user queries, but users want the best results quickly. Data mining and deep
learning techniques must therefore be applied to web data and documents, as they are essential for locating the
right web page. Relevancy refers to how well a web page meets the user's informational needs once it has been
accessed. When applied to search results, ranking strategies make it easier for users to navigate the result list.
Page Ranking based on Visits to Links returns the pages most relevant to the user's information needs at the top of
the result list, using both user browsing information and the link structure of pages. Its aim is to place the
most important web pages or information in front of users. Links that have a high probability of being clicked
contribute more to the rank of the pages they point to. In contrast, the PageRank technique relies solely on the
link structure of the Web graph, so the rank value of a page is the same whether or not users visit it.
Ordering pages by visits to links makes the ranking more target-oriented. In Page Ranking based on Visits to Links,
a page's rank is determined by the probability of visits (not the raw count of visits) on the back-links that point
to it, so a user cannot intentionally increase a page's rank by visiting it repeatedly. The main issue is that web
servers must be crawled regularly to compile an accurate and up-to-date visit count for each site. Specialized
crawlers must be developed in order to retrieve the relevant data from pages.
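The text does not give a formula for Page Ranking based on Visits to Links, so the sketch below is only one plausible reading and should be treated as an assumption. It follows the PageRank iteration but, instead of splitting a page's rank equally among its out-links, passes it on in proportion to the share of recorded clicks each link received. The click counts are hypothetical.

```python
def visit_based_rank(visits, d=0.85, iterations=100):
    """visits[v][u] is the recorded number of clicks on the link v -> u.

    Assumed update rule (not taken from the text):
        rank(u) = (1 - d) + d * sum(rank(v) * clicks(v, u) / total_clicks(v))
    """
    pages = set(visits) | {u for links in visits.values() for u in links}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for v, links in visits.items():
            total = sum(links.values())
            if total == 0:
                continue  # assumption: links that were never clicked pass on no rank
            for u, clicks in links.items():
                new_rank[u] += d * rank[v] * clicks / total
        rank = new_rank
    return rank

# Hypothetical click counts gathered from server logs.
VISITS = {
    "A": {"B": 90, "C": 10},
    "B": {"A": 5},
    "C": {"A": 5},
}
ranks = visit_based_rank(VISITS)
```

Because only the proportion of clicks matters, multiplying every count on a page by the same factor leaves the ranks unchanged. This is consistent with the point above that the rank depends on the probability of visits rather than on their raw number.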
CONCLUSION
As the years went by, the World Wide Web became more and more packed with information, making it challenging to
find the information one needs. Search engines aim to meet the demands of their users by giving them relevant
information. Discovering web content and identifying user interests and needs are therefore becoming more and more
crucial. This article covered several link analysis methods, namely the PageRank, Weighted PageRank, and HITS
algorithms. Page Ranking based on Visits to Links determines a web page's rank value from user visits to its
inbound links. Ordering the pages in this way makes the top results more relevant and, as a result, gives the user
better search results.
REFERENCES
1. Ashish Jain, Rajeev Sharma, Gireesh Dixit, Varsha Tomar, "Page Ranking Algorithms in Web Mining,
Limitations of Existing Methods and a New Method for Indexing Web Pages", 2013 IEEE International
Conference on Communication Systems and Network Technologies.
2. Seifedine Kadry, Ali Kalakech, "On the Improvement of Weighted Page Content Rank", Journal of Advances in
Computer Networks, Vol. 1, No. 2, June 2013.
3. Rashmi Rani, Vinod Jain, "Weighted PageRank using the Rank Improvement", International Journal of Scientific
and Research Publications, Volume 3, Issue 7, July 2013.
4. Preeti Chopra, Md. Ataullah, "A Survey on Improving the Efficiency of Different Web Structure Mining
Algorithms", International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249 – 8958,
Volume 2, Issue 3, February 2013.
5. B. Aysha Banu, M. Chitra, "A Novel Ensemble Vision Based Deep Web Data Extraction Technique for
Web Mining Applications", 2012 IEEE International Conference on Advanced Communication Control and
Computing Technologies (ICACCCT).
6. P. Sudhakar, G. Poonkuzhali, R. Kishore Kumar, "Content Based Ranking for Search Engines", Proceedings of the
International MultiConference of Engineers and Computer Scientists 2012, Vol. I, Hong Kong.
7. Dilip Kumar Sharma, A. K. Sharma, "A Comparative Analysis of Web Page Ranking Algorithms", (IJCSE)
International Journal on Computer Science and Engineering, Vol. 02, No. 08, 2010, 2670-2676.
8. Mohamed-K Hussein, Mohamed-H Mousa, "An Effective Web Mining Algorithm using Link Analysis",
(IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 1 (3), 2010, 190-197.
9. Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar, "Web Mining Using Topic Sensitive Weighted
PageRank", International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February 2012, ISSN
2229-5518.
10. Faustina Johnson, Santosh Kumar Gupta, "Web Content Mining Techniques: A Survey", International Journal of
Computer Applications (0975 – 888), Volume 47, No. 11, June 2012.
11. Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar, "An Effective Algorithm for Web Mining Based on Topic
Sensitive Link Analysis", International Journal of Advanced Research in Computer Science and Software
Engineering, Volume 2, Issue 4, April 2012, ISSN: 2277 128X.
12. V. Lakshmi Praba, T. Vasantha, "Evaluation of Web Searching Method Using a Novel WPRR
Algorithm for Two Different Case Studies", ICTACT Journal on Soft Computing, April 2012, Volume 02,
Issue 03.
13. Miguel Gomes da Costa Júnior, Zhiguo Gong, "Web Structure Mining: An Introduction", Proceedings of the 2005
IEEE International Conference on Information Acquisition, June 27 - July 3, 2005, Hong Kong and Macau, China.
14. Neelam Tyagi, Simple Sharma, "Comparative Study of Various Page Ranking Algorithms in Web Structure
Mining (WSM)", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN:
2278-3075, Volume 1, Issue 1, June 2012.
15. N. Preethi, T. Devi, "New Integrated Case And Relation Based (CARE) Page Rank Algorithm", 2013
International Conference on Computer Communication and Informatics (ICCCI-2013), Jan. 04 – 06, 2013,
Coimbatore, India.
16. Nilima V. Pardakhe, R. R. Keole, "Enhancement of the Web Search Engine Results using Page Ranking
Algorithm", International Journal of Innovative Research in Computer Science & Technology (IJIRCST), ISSN:
2347-5552, Volume 2, Issue 2, March 2014.
[Figure: labels recovered from the diagram: Web Page Content; Search Result; General Access Pattern Tracking;
Customized Usage Tracking.]