A Design of Faceted Search Engine - A Review
Research paper
Abstract
The World Wide Web (WWW) allows people to share information and data from large database repositories globally. The amount of information already spans billions of databases. We need to search this information with specialized tools known generically as search engines (SEs). With the huge volume of data to be handled, search engines need to retrieve meaningful information intelligently, returning only the information of interest to the searcher. Facets (particular aspects or features of the thing being searched) can play an important role in helping the user understand an information space better. Query techniques within faceted search make the search results immediate and the interaction between searcher and search engine uninterrupted and focused. They can contribute to the user's understanding of the researched terms or topics. Furthermore, they are more fun and interesting to use because users directly manipulate the search controls, and the results can be displayed through a choice of presentations such as text displays, transition animations, and graphs, which brings the process closer to the experience of playing a game. This paper reviews the design of faceted search engines.
Keywords: Information Retrieval; Search Engine; Exploratory Search; Faceted Search Engine.
1. Introduction
Since the advent of the WWW, people have been increasingly using the Internet as the medium to find, discover/encounter, explore, exchange, and make sense of information. Because of this, people now rely heavily on online resources to fulfill many kinds of information needs [1]. There has been a shift from using the Web only for single query-based searches to using it for more complex and exploratory searches to satisfy information needs. However, online SEs and other search tools have been primarily limited to retrieving information in the form of a set of ranked documents for a given query in an effective and efficient manner [2]. One important aspect that is beyond the present scope of SEs is analyzing the underlying search process of each user performing an information search task. Although SEs have evolved in smarter ways to keep track of user search history and preferences in order to suggest queries effectively and personalize the search results, they do not focus on the user's information search task. Thus, they fail to provide search path suggestions such as what query to execute next, which queries to exclude, which Web pages offer useful information for the task, or what information to consider relevant to achieve the user's task goal. Traditional faceted navigation styles allow one to drill down into a subject matter to find very specific documents. One limitation of this, however, is the possibility of obtaining a very "narrow" view of the issue, which is recognized in Kules and Shneiderman's study [3].

2. Motivation
We have many SEs that give information according to the ranked retrieval (ranked list) model. Generally, the query response and results are arranged in a ranking based on scoring functions that combine different characteristics of the documents and queries. However, conventional SEs still have some constraints that demand further study, as described in the following questions:
"Results are presented according to their rank, so one of the main problems is how to rank the results returned by an SE or a combination of SEs. How do searchers think differently about their search strategy when categorized overviews are available to augment the result list, and how can better accuracy in the search be achieved?" [4, 5].

3. Related Work
This section reviews related work on Information Retrieval, Defining Relevance, Set Retrieval, and Ranked Retrieval.

3.1. Information Retrieval
Information Retrieval (IR) is the process of searching within a document collection for a particular information need, which is called a query [6]. It is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Copyright © 2018 Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original work is properly cited.
490 International Journal of Engineering & Technology
IR typically seeks to find documents in a given collection that are about a given topic or that satisfy a given information need. The topic or information need is expressed by a query generated by the user [7]. Documents that satisfy the given query in the judgment of the user are said to be relevant. Documents that are not about the given topic are said to be non-relevant. In this section we survey related work on understanding and designing search interfaces and on techniques to augment search results, but first we need to establish what counts as relevance in order to focus on improving it.

3.2. Defining Relevance
Probably the first notion to be defined is the notion of relevance in an IR system, that is, what it means for an SE to retrieve documents that are relevant to the user [7]. The notion of relevance itself has been the source of intense debate amongst researchers, who often disagree on how to measure it [8, 9]. However, the general consensus has been to characterize relevance either through a purely cognitive point of view or solely through a benchmarking approach. The former, which will be addressed later, naturally leads to the design of search user interfaces and to evaluation methods that favor user studies. In this setting, precision and recall provide a natural metric of relevance.

3.3. Set Retrieval
In a Boolean set retrieval model [10], a user enters a query made up of Boolean operators such as AND, NOT, and OR and gets the documents that match that query. The documents are returned as an unordered set, and the precision and/or recall depends on the user's ability to write complex Boolean queries. Boolean search systems can additionally be extended with field operators to search within specific fields of the document collection. For example, a user can find terms within the title, text body, author, and other areas of the documents of interest. The difficulty the general public has with using Boolean search models is well documented [11]. In practice, set retrieval suffers from a clear trade-off between high precision and high recall. Because the documents returned lack any ordering, a user can achieve either very high precision by formulating a very restrictive query or high recall by choosing a very loose one. Users usually have to be experts in formulating complex Boolean queries in order to retrieve the most relevant set. It is important to note, however, that if ranking the returned documents is not required due to the nature of those documents, and the domain of interest is reserved to experts, set retrieval can be a fine approach to search. For example, PubMed (www.Pubmed.com) from the United States National Library of Medicine offers an advanced search feature to help users build queries made of Boolean expressions. The user is able to create complex queries restricted to specific fields and made of AND, OR, and NOT operators (see Figure 1). This advanced search feature is helpful to non-expert users, considering that PubMed ranks the articles found by date only.
In order to circumvent the difficulties of the Boolean set model, an interesting compromise consists of ranking the search results. The query can remain fairly loose, but the results returned can be ranked according to some metric. In that case a user looking for books may enter some keywords related to the book and have the results ordered by popularity, price, or location. In the following section we cover these so-called ranked retrieval models.

3.4. Ranked Retrieval
The first one is the vector space model approach developed by Salton, Wong, and Yang (1975) [13]. In the vector space model, each document is represented by a vector. Each index in the vector corresponds to a word or term found in the document collection. Each component of the vector is a numerical value which reflects the importance, or weight, of the term in the document. The query becomes a vector which is then compared to all the document vectors in the set. A similarity measure, usually the cosine of the angle between vectors, is used to match the query against the documents. The results are then ranked according to how close they are to the user's query. However, the question of properly weighting each term within the document and the collection still remains.
Another major contribution to ranked retrieval and to the vector space model is the work on tf-idf by Sparck Jones [14]. Tf-idf stands for term frequency multiplied by inverse document frequency. Let us assume we have a document collection $D$ of documents $d$, each containing terms $t$. The term frequency $\mathrm{tf}(t, d)$ of a term $t$ within a document $d$ is the number of times $t$ appears in $d$ divided by the total number of terms in $d$.

$$\mathrm{tf}(t, d) = \frac{|\{t \in d\}|}{|d|} \qquad (1)$$

where $\{\cdot\}$ and $|\cdot|$ denote set definition and cardinality of a set respectively, so that $|\{t \in d\}|$ counts the occurrences of $t$ in $d$. A high term frequency indicates that a term is more representative of the document content. On the other hand, we can define the document frequency $\mathrm{df}(t, D)$ of a term $t$ within a document collection $D$ as the fraction of documents in $D$ that contain $t$. The inverse document frequency is the logged reciprocal of this expression.

$$\mathrm{idf}(t, D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right) \qquad (2)$$

The inverse document frequency emphasizes rare terms over common ones. The $\text{tf-idf}(t, d, D)$ of a term $t$ within a document $d$ in the collection $D$ is the term frequency multiplied by the inverse document frequency.

$$\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \qquad (3)$$

Intuitively, a term with high tf-idf is a term which is representative of the document content while not being too popular across the whole corpus. This measure therefore favours document-specific terms: terms that are frequent in the document but rare in the collection. The terms in the vector space model can now be weighted by tf-idf, and a similarity measure can then be used to rank each document according to the user's query. The vector space model and tf-idf proved to be highly successful for ranking results in a set of documents which had no explicit connections with respect to each other.
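The weighting and ranking steps described above can be illustrated with a minimal Python sketch. It is not taken from the reviewed systems: the function names (`tf`, `idf`, `tfidf_vector`, `cosine`, `rank`), the whitespace tokenization, and the choice to give zero weight to terms that appear in no document are all assumptions made for illustration. The sketch follows the term frequency, inverse document frequency, and tf-idf definitions of this section and ranks documents by cosine similarity to the query.

```python
import math


def tf(term, tokens):
    """Term frequency: occurrences of term divided by the number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency: log(|D| / number of documents containing term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: a term seen in no document carries no weight
    return math.log(len(corpus) / n_containing)


def tfidf_vector(tokens, corpus, vocabulary):
    """One tf-idf weight per vocabulary term, in vocabulary order."""
    return [tf(term, tokens) * idf(term, corpus) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    corpus = [doc.lower().split()for doc in documents]  # naive tokenization
    vocabulary = sorted({term for doc in corpus for term in doc})
    query_vec = tfidf_vector(query.lower().split(), corpus, vocabulary)
    scored = [
        (cosine(query_vec, tfidf_vector(doc, corpus, vocabulary)), index)
        for index, doc in enumerate(corpus)
    ]
    # Highest score first; ties are broken by original document order.
    return [index for score, index in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    docs = [
        "faceted search engine design",
        "boolean set retrieval model",
        "ranked retrieval vector space model",
    ]
    print(rank("faceted search", docs))
```

Note that a term occurring in every document receives an inverse document frequency of log(1) = 0, so it does not influence the ranking, which matches the emphasis on rare terms described above.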
Fig. 1: The Boolean Search Interface of PubMed [12]

However, with the advent of the WWW and hypertext collections, researchers started to develop ranking methods based on a notion of document authority. For example, a hypertext collection can be modeled as a graph with links as edges and documents as nodes. That graph can then be harnessed to rank documents based on a certain notion of authority, independently of the user's query. In this respect, Jon Kleinberg's HITS algorithm [15] and Larry Page and Sergey Brin's PageRank [16] were the two most notable measures of authority (see Figure 2). The latter measure formed the basis of Google's search engine.

Fig. 3: The Lookup-Based Model According to Bates [20]

The lookup-based model has been identified as best suited for question answering tasks and fact finding [21]. In fact, the process must start with a carefully specified query and should end with precise results. But the results returned, together with their potential relationships, are not intended to be analyzed further with more scrutiny.
The series of steps of the design process is shown in detail in Table 2:
Creating a hierarchy is a similar solution for presenting facets (static, dynamic, and grouping) that can achieve static ordering while simultaneously ranking the facet. Hierarchical facet values can be used in grouping even for facets that initially lack order. For example, a tree that displays the location of facet values can be formed. The designer can create and enforce any number of hierarchical values that are deemed useful.

6.2. Exploration of Various Faceted Search Approaches
Faceted search (FS) allows users to explore or navigate within the document collection. However, most mainstream search systems feature only a fixed mode of interaction. For example, search results are most often depicted as a list of text with minimal interactions, such as sorting or paging. To obtain a new understanding of the data, it is necessary to allow multiple interaction modes. According to White and Roth [24], an exploratory search engine should increase user responsibility and control. This should include letting the user select how the data is visualized depending on the task of interest.

7. Summary
In this paper, we discuss exploratory search and then focus on faceted search, looking beyond the traditional faceted search interface. First, we review ranked retrieval and exploratory search engines. To promote exploration, the interface should provide instant feedback on the user's potential actions. We also provide a design of search user interfaces, because the user interface of an SE forms the first and last impressions made on a user and is a critical focal point for the user experience at every stage of the search. It is through the interface that queries are formed and converted into informative answers. The recommendations made in this paper can serve as a guide for creating an interface that fosters improvements to all aspects and stages of the user's search. Better interface designs assist users in articulating better queries, help them understand the results, and facilitate query modifications if necessary. FS combines faceted navigation with full-text search to help users work with semi-structured content, whereas full-text search alone targets unstructured content.

Acknowledgment
This research was sponsored and supported under the Universiti Tenaga Nasional (UNITEN) internal grant no J510050783 (2018). Many thanks to the Innovation & Research Management Center (iRMC), UNITEN, which provided assistance and expertise during the research.

References
[1] J. Curran, N. Fenton, and D. Freedman, Misunderstanding the internet: Routledge, 2016.
[2] A. Selcuk, C. Örencik, and E. Savas, "Private search over big data leveraging distributed file system and parallel processing," 2015.
[3] B. Kules and B. Shneiderman, "Users can change their web search tactics: Design guidelines for categorized overviews," Information Processing & Management, vol. 44, pp. 463-484, 2008.
[4] D. Bakrola and S. Gandhi, "Enhancing Web Search Results Using Aggregated Search," in Proceedings of International Conference on ICT for Sustainable Development, 2016, pp. 675-688.
[5] N. Ibrahim, A. H. Chaibi, and H. B. Ghézala, "Scientometric re-ranking approach to improve search results," Procedia Computer Science, vol. 112, pp. 447-456, 2017.
[6] A. N. Langville and C. D. Meyer, Google's PageRank and beyond: The science of search engine rankings: Princeton University Press, 2011.
[7] C. D. Manning, P. Raghavan, and H. Schütze, "Introduction to information retrieval," ed: Cambridge University Press, 2008.
[8] S. Mizzaro, "Relevance: The whole history," JASIS, vol. 48, pp. 810-832, 1997.
[9] T. Saracevic, "Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance," Journal of the American Society for Information Science and Technology, vol. 58, pp. 2126-2144, 2007.
[10] A. Singhal, "Modern information retrieval: A brief overview," IEEE Data Eng. Bull., vol. 24, pp. 35-43, 2001.
[11] D. Wolfram, A. Spink, B. J. Jansen, and T. Saracevic, "Vox populi: The public searching of the web," JASIST, vol. 52, pp. 1073-1074, 2001.
[12] "PubMed," 2017.
[13] G. Salton, A. Wong, and C.-S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, pp. 613-620, 1975.
[14] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of documentation, vol. 28, pp. 11-21, 1972.
[15] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, pp. 604-632, 1999.
[16] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer networks, vol. 56, pp. 3825-3833, 1998.
[17] J. Park and S.-H. Yook, "Bayesian Inference of Natural Rankings in Incomplete Competition Networks," Scientific Reports, vol. 4, p. 6212, 2014.
[18] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems (TOIS), vol. 20, pp. 422-446, 2002.
[19] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, "Ensemble selection from libraries of models," in Proceedings of the twenty-first international conference on Machine learning, 2004, p. 18.
[20] M. J. Bates, "The design of browsing and berrypicking techniques for the online search interface," Online review, vol. 13, pp. 407-424, 1989.
[21] G. Marchionini, "Exploratory search: from finding to understanding," Communications of the ACM, vol. 49, pp. 41-46, 2006.
[22] E. Goodman, M. Kuniavsky, and A. Moed, "Observing the user experience," Burlington, Massachusetts: Morgan Kaufmann, 2012.
[23] J. Nielsen, "Usability 101: Introduction to usability," ed, 2003.
[24] R. W. White and R. A. Roth, "Exploratory search: beyond the query-response paradigm (Synthesis lectures on information concepts, retrieval & services)," Morgan and Claypool Publishers, vol. 3, 2009.
[25] J. Nielsen, "Guerrilla HCI: Using discount usability engineering to penetrate the intimidation barrier," Cost-justifying usability, pp. 245-272, 1994.
[26] B. Shneiderman and C. Plaisant, "Designing the user interface," Reading, Mass.: Addison Wesley Longman, 1998 edn., 2005.
[27] K. Franzen and J. Karlgren, "Verbosity and interface design," SICS Research Report, 2000.
[28] M. Hassenzahl, "The interplay of beauty, goodness, and usability in interactive products," Human-computer interaction, vol. 19, pp. 319-349, 2004.
[29] T. Ben-Bassat, J. Meyer, and N. Tractinsky, "Economic and subjective measures of the perceived value of aesthetics and usability," ACM Transactions on Computer-Human Interaction (TOCHI), vol. 13, pp. 210-234, 2006.
[30] A. Aizpurua, M. Arrue, and M. Vigo, "Prejudices, memories, expectations and confidence influence experienced accessibility on the Web," Computers in Human Behavior, vol. 51, pp. 152-160, 2015.
[31] H. C. L. Hsieh and N. C. Cheng, "A Theoretical Model for the Design of Aesthetic Interaction," in International Conference on Human-Computer Interaction, 2016, pp. 178-187.
[32] G. Hotchkiss, T. Sherman, R. Tobin, C. Bates, and K. Brown, "Search engine results: 2010," Enquiro Search Solutions, pp. 1-61, 2010.