(12) United States Patent
     Sercinoglu et al.
(10) Patent No.: US 9,262,767 B2
(45) Date of Patent: *Feb. 16, 2016

(54) SYSTEMS AND METHODS FOR GENERATING STATISTICS FROM SEARCH ENGINE QUERY LOGS
(75) Inventors: Olcan Sercinoglu, Mountain View, CA (US); Artem Boytsov, San Francisco, CA (US); Jeffrey A. Dean, Palo Alto, CA (US)
(73) Assignee: GOOGLE INC., Mountain View, CA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 276 days. This patent is subject to a terminal disclaimer.
(21) Appl. No.: 13/396,511
(22) Filed: Feb. 14, 2012
(65) Prior Publication Data: US 2012/0215765 A1, Aug. 23, 2012

Related U.S. Application Data
(63) Continuation of application No. 11/746,049, filed on May 8, 2007, now Pat. No. 8,126,874.
(60) Provisional application No. 60/746,886, filed on May 9, 2006.

(51) Int. Cl.: G06F 7/00 (2006.01); G06F 17/30 (2006.01); G06Q 30/02 (2012.01)
(52) U.S. Cl.: CPC G06Q 30/02 (2013.01); G06F 17/30395 (2013.01); G06F 17/30693 (2013.01); G06F 17/30864 (2013.01)
(58) Field of Classification Search: CPC G06F 17/30864; G06F 17/30693; G06F 17/30395; G06F 17/30554; G06F 17/30592; G06F 17/30536; G06F 17/30595; G06F 17/30463; Y10S 707/99935; Y10S 707/99934; Y10S 707/99945; Y10S 707/99944; Y10S 707/99943; Y10S 707/99942; H04J 1/10
     USPC 707/721, 769
     See application file for complete search history.

Primary Examiner: Daniel Kuddus

(57) ABSTRACT
A computer-implemented method includes calculating first statistics about a user-identified event within a first subset of a database of events; selecting a second subset of the database of events based on said first statistics; calculating second statistics about the user-identified event within the second subset of the database of events; merging the first and second statistics as statistics of the user-identified event within the entire database of events; and generating a result including at least a portion of the merged statistics of the user-identified event.

24 Claims, 15 Drawing Sheets
FIGURE 1 (Sheet 1 of 15). Flowchart of the query log pre-processing pipeline:
  Query Log Sampling Process (optional)
  -> Sub-Sampled Query Log (optional)
  -> Query Session Extraction Process
  -> Query Session Partition Process
  -> Partitioned Query Session Records 120 (Partition 0 Session Records 122, Partition 1 Session Records, ..., Partition L-1 Session Records)
  -> Partition Sorting Process 124
  -> Sorted Partitions of Session Records 126, 128 (Sorted Partition 0, Sorted Partition 1, ..., Sorted Partition L-1)
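
As a rough illustration of the sampling and partitioning stages in FIGURE 1, the Python sketch below draws a random sample within each geographical region and then assigns every user's records to one partition, sorted by user. It is a minimal sketch under stated assumptions, not the patented implementation: the names `QueryRecord`, `sample_by_region`, and `partition_by_user` are hypothetical, plain query records stand in for session records, and the 20 percent rate, the fixed seed, and the use of `random.Random` for the user-to-partition assignment are illustrative choices.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    # Hypothetical layout: user id (cookie proxy), region derived from the IP, query text.
    user: str
    region: str
    query: str


def sample_by_region(records, rate, rng):
    """Randomly keep about `rate` of the records within each geographical region."""
    by_region = defaultdict(list)
    for rec in records:
        by_region[rec.region].append(rec)
    sampled = []
    for region_records in by_region.values():
        # Assumption: keep at least one record per region so no region vanishes.
        k = max(1, round(rate * len(region_records)))
        sampled.extend(rng.sample(region_records, k))
    return sampled


def partition_by_user(records, num_partitions, rng):
    """Assign each user to one random partition, then sort each partition by user."""
    assignment = {}
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        if rec.user not in assignment:
            assignment[rec.user] = rng.randrange(num_partitions)
        partitions[assignment[rec.user]].append(rec)
    for part in partitions:
        part.sort(key=lambda r: r.user)  # keeps each user's records contiguous
    return partitions


rng = random.Random(0)
log = [QueryRecord(f"u{i % 10}", "US" if i % 2 else "FR", f"q{i}") for i in range(100)]
sample = sample_by_region(log, rate=0.2, rng=rng)
parts = partition_by_user(sample, num_partitions=4, rng=rng)
```

Because the sample is drawn region by region, each region keeps its share of the original log, and because a user is assigned to exactly one partition, per-user counts can later be computed without cross-partition deduplication.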
FIGURE 2A (Sheet 2 of 15). Query Log 210:
  Query1, ts1, IP adr1, lang1, cookie1/userid1
  Query2, ts2, IP adr2, lang2, cookie2/userid2
  ...
  QueryN, tsN, IP adrN, langN, cookieN/useridN

FIGURE 2B. Session Record 220:
  Cookie/userid, IP adr, lang
  QueryA, tsA
  QueryB, tsB
  ...
  QueryZ, tsZ

FIGURE 2C. Session Record 230:
  Temporal Values (e.g., Day, Week, Month, and Year)
  Geographical Values (e.g., City, Region, and Country)
  Language Value (e.g., English)
  <q>Query String A</q>
  <q>Query String B</q>
  ...
  <q>Query String Z</q>
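
The session record of FIGURE 2C replaces raw timestamps and IP addresses with derived values, which the description expresses as name-value pairs such as "CITY=London" and "YEAR=2005" and as interval counts from Jan. 1, 2000 (month value 75 for March 2006). The Python sketch below shows one way such values could be produced. It is only a sketch: the function names, the `MONTH=` and `LANG=` parameter names, and the choice to count January 2000 as month 1 (picked because it reproduces the month-75 example) are assumptions, not details taken from the patent.

```python
from datetime import datetime, timezone


def month_index(ts, epoch_year=2000):
    """Months counted from January of epoch_year, with that January as month 1."""
    return (ts.year - epoch_year) * 12 + ts.month


def session_tokens(first_query_ts, city, language, queries):
    """Build name=value tokens for a session, followed by its query strings."""
    tokens = [
        f"YEAR={first_query_ts.year}",
        f"MONTH={month_index(first_query_ts)}",  # hypothetical parameter name
        f"CITY={city}",
        f"LANG={language}",                      # hypothetical parameter name
    ]
    return tokens + list(queries)


ts = datetime(2006, 3, 14, tzinfo=timezone.utc)
tokens = session_tokens(ts, "London", "English", ["ipod video", "london hotels"])
```

Prefixing the parameter name keeps a derived value such as `CITY=London` distinct from the ordinary query term `London`, so both can live in the same index.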
FIGURE 3 (Sheet 3 of 15). Block diagram of Information Service 300. Labeled elements: Statistics Query / Query Results; Statistics Query Processing System; Query Log Server(s) 312; Query String; Tokenspace Inverse Index; Encoded Tokens / Tokenspace Repository; Encoding/Decoding System; Sorted Session Records; Query Session Processing System; Lexicon Builder; Mappings; Tokens. Other reference numerals appearing in the figure: 302, 308, 310, 314, 320, 340.
FIGURE 4A (Sheet 4 of 15). Tokenized Session Records for Partition M (410), listed by Session ID:
  0      Tokens for Session 0      (Sessions for User 0 begin)
  1      Tokens for Session 1
  ...
  A      Tokens for Session A      (Sessions for User 0 end)
  A+1    Tokens for Session A+1    (Sessions for User 1 begin)
  ...
  B      Tokens for Session B      (Sessions for User 1 end)
  ...
         Tokens for Session Y-1    (Sessions for User L)
  ...
  Z      Tokens for Session Z

FIGURE 4B. Tokenized Session Record 420:
  Tokenized Temporal Values
  Tokenized Geographical Values
  Tokenized Language Value
  Tokenized Query String A
  Tokenized Query String B
  ...
  Tokenized Query String Z
FIGURE 5A (Sheet 5 of 15). Query Term (e.g., "Paris" or "City=Paris") -> Lexicon (token to Token ID map) -> Token ID -> Tokenspace Inverse Index (Token ID 0: Repository Position List; Token ID 1: Repository Position List; ...; Token ID Z: Repository Position List) -> List of Positions. Reference numerals appearing in the figure: 502, 504, 506.

FIGURE 5B. Session ID to Token Range Map 508 (Two-Way Mapping): List of Positions <-> List of Sessions
  Session 0: Rep. Position
  Session 1: Rep. Position
  ...
  Session N: Rep. Position

FIGURE 5C. User ID to Session ID Map 510 (Two-Way Mapping): List of Sessions <-> List of Users
  User 0: Session A
  User 1: Session B
  ...
  User N: Session Z
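
FIGURES 5A to 5C describe a lookup chain: a query term yields a list of repository positions, positions map to sessions, and sessions map to users. The Python sketch below imitates that chain on a toy in-memory repository and applies it to the London/iPod example from the description, computing both a session-based and a user-based percentage. It is an illustrative sketch only: the data, the names `inverse_index`, `session_starts`, `session_user`, `sessions_with`, and `users_of`, and the use of binary search (`bisect`) for the position-to-session step are assumptions rather than the patent's own data layout.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical tokenized sessions, stored back to back in one token repository.
sessions = [
    ("user0", ["CITY=London", "ipod", "video"]),
    ("user0", ["CITY=London", "ipod", "nano"]),
    ("user1", ["CITY=London", "hotels"]),
    ("user2", ["CITY=Paris", "ipod"]),
    ("user3", ["CITY=London", "restaurants"]),
]

inverse_index = defaultdict(list)   # token -> ascending repository positions
session_starts = []                 # session id -> first repository position
session_user = []                   # session id -> user id
position = 0
for user, tokens in sessions:
    session_starts.append(position)
    session_user.append(user)
    for token in tokens:
        inverse_index[token].append(position)
        position += 1


def sessions_with(token):
    """Map each repository position of `token` back to the session containing it."""
    return {bisect_right(session_starts, p) - 1 for p in inverse_index.get(token, [])}


def users_of(session_ids):
    """Map session ids to the set of distinct users that own them."""
    return {session_user[s] for s in session_ids}


london = sessions_with("CITY=London")
london_ipod = london & sessions_with("ipod")
session_share = len(london_ipod) / len(london)
user_share = len(users_of(london_ipod)) / len(users_of(london))
```

In this toy data, two of the four London sessions mention "ipod", but both belong to the same user, so the session-based share (one half) differs from the user-based share (one third), which is the distinction the description draws between the two percentages.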
FIGURE 6 (Sheet 6 of 15). Hierarchy of query servers within Query Log Servers 312:
  Statistics Query / Statistics Results <-> Frontend Server 602 (generate image, HTML, data, etc.), which also connects to News Backend Query Server 620
  -> Query Distributor and Results Aggregator (Root Query Server) 604
  -> Query Distributors and Results Aggregators (Intermediate Query Servers) 606
  -> Leaf Query Servers 608 for Partition 0 through Partition L-1
  -> Inverted Indexes 610 for Partition 0 through Partition L-1
FIGURE 7 (Sheet 7 of 15). Overview flowchart. The figure brackets the earlier steps as Step One and the later steps as Step Two.
  702  Receive a statistics query from a requestor (e.g., a Boolean expression)
  704  Generate a list of sub-queries (e.g., Query + Timeframe, Query + Location, Query + Language)
  706  Identify a respective subset of database partitions (and their corresponding leaf query servers) for each of the sub-queries (considering load balancing)
  708  Execute one of the sub-queries at the subset of leaf query servers against their associated database partitions
  710  Has the query result met the predefined requirements? (No: continue to 712)
  712  Identify a second subset of additional database partitions (and their corresponding leaf query servers)
  714  Execute the sub-query at the second subset of leaf query servers against their associated database partitions
  716  Have the aggregated query results met the predefined requirements? (No: 718 Report insufficient data)
  720  All sub-queries executed? (Yes: continue to 722)
  722  Normalize the aggregated query results
  724  Return the normalized query results to the requestor
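
The two-step flow of FIGURE 7 can be sketched for a single sub-query as follows. This is a simplified Python illustration, not the patented method: the "predefined requirements" are assumed here to be a minimum number of distinct matching users (in the spirit of the privacy thresholds mentioned in the Background), normalization is assumed to mean dividing by the number of sessions scanned, and the names `run_on_partitions`, `two_step_query`, `first_subset`, and `min_users` are hypothetical.

```python
def run_on_partitions(term, partitions):
    """Count matching sessions, matching users, and scanned sessions over some partitions."""
    matched_sessions, matched_users, scanned = 0, set(), 0
    for partition in partitions:
        for user, tokens in partition:
            scanned += 1
            if term in tokens:
                matched_sessions += 1
                matched_users.add(user)
    return matched_sessions, matched_users, scanned


def two_step_query(term, partitions, first_subset=1, min_users=3):
    """Step one scans a few partitions; step two scans the rest only if needed."""
    hits, users, scanned = run_on_partitions(term, partitions[:first_subset])
    if len(users) < min_users:
        # Step two: the first result was too thin, so widen the scan and merge.
        more_hits, more_users, more_scanned = run_on_partitions(
            term, partitions[first_subset:]
        )
        hits += more_hits
        users |= more_users
        scanned += more_scanned
    if len(users) < min_users:
        return None  # insufficient data: too few distinct users to report
    return hits / scanned  # normalized: fraction of scanned sessions that match


# Each session is (user, set of tokens); every user lives in exactly one partition.
partitions = [
    [("u0", {"ipod"}), ("u1", {"ipod"}), ("u2", {"hotels"})],
    [("u3", {"ipod"}), ("u4", {"ipod", "nano"}), ("u5", {"ipod"})],
]
common = two_step_query("ipod", partitions)              # needs step two
early = two_step_query("ipod", partitions, min_users=2)  # step one suffices
rare = two_step_query("nano", partitions)                # insufficient data
```

A frequent term is answered from the first subset alone, which is what keeps response times short, while a rare term triggers the wider scan and is suppressed entirely if it still falls below the threshold.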
FIGURE 8 (Sheet 8 of 15). Flowchart: Executing a sub-query at a leaf query server.
  802  Receive a sub-query from a query distributor
  804  Identify a list of repository positions matching the sub-query
  806  Select the first repository position in the list as the current position
  808  Retrieve from the database partition the session record containing the current repository position
  810  Aggregate the session record by its parameter values respectively
  811  Predetermined number met?
       (decision, partly illegible) "... one in the list of repository positions?"
  814  Select the next position in the list corresponding to a different user
  816  Determine the total number of users that have been scanned in the database partition
  818  Return the normalized query results to the query distributor
FIGURE 9 (Sheet 9 of 15). Flowchart: Two-pass query process. The figure brackets the earlier steps as Pass One and the later steps as Pass Two.
  902  Execute a query at a subset of leaf query servers against their associated database partitions
  904  Select from the query result a subset of parameter values meeting predefined criteria
  906  Generate a set of subsequent queries based on the subset of parameter values
  908  Identify a respective second subset of database partitions (and their corresponding leaf query servers) for each of the subsequent queries
  910  Execute each of the subsequent queries at the second subset of leaf query servers against their associated database partitions
  912  Return the query results
FIGURE 11A (Sheet 13 of 15). Root or Intermediate Query Server 1100: CPU 1102; Communication interface(s) 1104; Memory 1110 containing Operating System, Communications Module, Sub-Query Distributor, Sub-Query Result Aggregator, Sub-Query Generator, Sub-Query Executor, and Load Balance Analyzer.

FIGURE 11B. Leaf Query Server 1150: CPU 1152; Communication interface(s) 1154; Memory 1160 containing Operating System, Communications Module, Sub-Query Executor, Database Partition, Token Repository, Inverse Index, Session ID Range Map, and User ID Range Map.
FIGURE 12 (Sheet 14 of 15). Partition Server 1200: CPU 1202; Communication interface(s) 1204; Memory 1210 containing Operating System, Communications Module, Query Log Sampler, Query Session Extractor, Query Session Partitioner, Query Session Sorter, Database Partitions, Token Repository, Inverse Index, Session ID Range Map, and User ID Range Map.
SYSTEMS AND METHODS FOR GENERATING STATISTICS FROM SEARCH ENGINE QUERY LOGS

RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 11/746,049, filed May 8, 2007, now U.S. Pat. No. 8,126,874, which claims priority to U.S. Provisional Application Ser. No. 60/746,886, filed May 9, 2006, which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates generally to search engines, and in particular to systems and methods for generating statistics from search engine query logs.

BACKGROUND OF THE INVENTION

Data aggregation is a process in which information is gathered and expressed in a summary form for purposes such as statistical data analysis. It often reveals useful information hidden in a large volume of original data records. For example, from a database containing millions of sales records generated by an on-line store, a marketing analyst can learn information about a particular group of consumers, such as trends and patterns in their shopping habits, by aggregating the related sales records based on specific variables such as product type information, product pricing information, customer age, customer gender, geographic location (e.g., store location or purchaser's address) and any other customer and/or product information available in the database.

As another example, a web search engine may receive millions of queries per day from users around the world. For each query, the search engine generates a query record in its query log. The query record may include one or more query terms, a timestamp indicating when the query is received by the search engine, an IP address identifying a unique device (e.g., a PC or a cell phone) from which the query terms are submitted, and an identifier associated with a user who submits the query terms (e.g., a user identifier in a web browser cookie; in some cases the user identifier may also be associated with a toolbar or other application or service to which the user has subscribed). Appropriate aggregation of these query records can also unveil interesting or useful information about the web search engine users. For instance, a publisher can gauge the popularity of a newly released book in a specific city from the frequencies of relevant queries submitted by users from that city within a given time period.

For the same query log, social scientists, marketers, and politicians may have dramatically different interests and therefore require different types of data aggregations to meet their needs. Some types of "data mining" of a search engine's log records may be useful only if the statistical inquiries receive substantially instantaneous responses (e.g., in less than five seconds). But most of the conventional data aggregation techniques are incapable of deriving reliable statistical information from a large number of query records substantially instantaneously.

Another concern with data mining search engine query logs or commercial transaction logs is the protection of user privacy. Even if the log records do not contain user names or the like, returning statistical information or trends information based on very small numbers of users or transactions (e.g., fewer than twenty transactions) may inadvertently disclose information that can be traced back to an individual or small groups of users (e.g., fewer than a predefined number of distinct users, such as twenty, one hundred or two hundred distinct users). It is therefore important that any log record data mining tool include safeguards for preventing the disclosure of information that may be traced back to individuals or small groups of users.

SUMMARY

In a first aspect of the present invention, a computer-implemented method comprises calculating first statistics about a user-identified event within a first subset of a database of events; selecting a second subset of the database of events based on said first statistics; calculating second statistics about the user-identified event within the second subset of the database of events; merging the first and second statistics as statistics of the user-identified event within the entire database of events; and generating a result including at least a portion of the merged statistics of the user-identified event.

In a second aspect of the present invention, a computer-implemented method comprises identifying in an inverse index file a sequence of positions associated with an indexed item, wherein the sequence of positions corresponds to a set of events from a sequence of randomly-arranged events, each event including at least one occurrence of the indexed item and having a respective sequence number; selecting in the sequence of randomly-arranged events an event corresponding to a predefined position in the sequence of positions and its sequence number; determining a number of positions in the sequence of positions that precede the predefined position; and determining an occurrence frequency for the indexed item in the sequence of randomly-arranged events using the selected event's sequence number and the determined number of positions.

BRIEF DESCRIPTION OF THE DRAWINGS

The aforementioned features and advantages of the invention as well as additional features and advantages thereof will be more clearly understood hereinafter as a result of a detailed description of preferred embodiments of the invention when taken in conjunction with the drawings.

FIG. 1 is a flowchart illustrating a process of converting a query log into multiple sorted partitions of session records in accordance with some embodiments of the present invention.

FIG. 2A is a block diagram of a data structure for storing a query log in accordance with some embodiments of the present invention.

FIG. 2B is a block diagram of a data structure for storing a query session record in accordance with some embodiments of the present invention.

FIG. 2C is a block diagram of an alternative data structure for storing the query session record in accordance with some embodiments of the present invention.

FIG. 3 is a block diagram of an exemplary information service that generates statistics in response to a query in accordance with some embodiments of the present invention.

FIG. 4A is a block diagram of a data structure for storing tokenized session records within a database partition in accordance with some embodiments of the present invention.

FIG. 4B is a block diagram of a data structure for storing a tokenized session record in accordance with some embodiments of the present invention.

FIG. 5A is a flow diagram of a process of identifying a list of repository positions in response to one or more query terms in accordance with some embodiments of the present invention.
FIG. 5B is a block diagram of a data structure for translating or mapping repository positions to session record identifiers and vice versa in accordance with some embodiments of the present invention.

FIG. 5C is a block diagram of a data structure for translating or mapping session record identifiers to user identifiers and vice versa in accordance with some embodiments of the present invention.

FIG. 6 is a block diagram of an exemplary hierarchical structure of multiple query servers in accordance with some embodiments of the present invention.

FIG. 7 is a flowchart illustrating a process of generating statistics from session records in response to a statistics query in accordance with some embodiments of the present invention.

FIG. 8 is a flowchart illustrating a process of executing a statistics sub-query at a leaf query server in accordance with some embodiments of the present invention.

FIG. 9 is a flowchart illustrating a two-pass or multi-pass process of generating statistics related to a specific parameter of the session records in accordance with some embodiments of the present invention.

FIGS. 10A, 10B and 10C are exemplary screenshots of webpages containing trends statistics and geographical distribution statistics for queries concerning a particular query term in accordance with some embodiments of the present invention.

FIG. 11A is a block diagram of an exemplary query distribution and aggregation server in accordance with some embodiments of the present invention.

FIG. 11B is a block diagram of an exemplary leaf query server in accordance with some embodiments of the present invention.

FIG. 12 is a block diagram of an exemplary partition server in accordance with some embodiments of the present invention.

FIG. 13 is a diagram of a bell-shaped curve representing a normal distribution of random data samples.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DESCRIPTION OF EMBODIMENTS

In order to generate statistics from a large database efficiently, some data pre-processing operations may be necessary. FIG. 1 is a flowchart illustrating such a data pre-processing procedure for converting a web search engine's query log into multiple sorted partitions of session records. Many users submit queries from clients 102 to a search engine 106 through a communication network 106 (e.g., the Internet). For each query, the search engine 106 generates a query record in a query log file 108. As shown in FIG. 2A, a query log 210 contains many query records. Each query record has multiple attributes, including one or more query terms of a search query, a timestamp, an IP address, and a web cookie or other user identifier. In some embodiments, each query record includes a language identifier, identifying the language associated with the search query. The timestamp indicates when the search query is received by the search engine, and the IP address maps to a unique device (e.g., a personal computer, cell phone, or other client device) from which the search query terms are submitted. In some embodiments, the web cookie uniquely identifies a user of the web search engine and therefore can be used as a proxy for the user's identifier. Alternately, each query log record contains a user identifier.

To get reliable statistical information from the query log 108, it is not always necessary to survey all the query records (also herein called log records or transaction records) in the query log. As long as the statistical information is derived from a sufficient number of samples in the query log, the information is as reliable as information derived from all the log records. Moreover, it takes less time and fewer computer resources to survey a sub-sampled query log. Therefore, a query log sampling process 110 can be employed to sub-sample the query log 108 and produce a sub-sampled query log 112. For example, the sub-sampled query log 112 may contain ten percent or twenty percent of the log records in the original query log 108. Note that the sampling process is optional. In some embodiments, the entire query log 108 is used to generate statistical information.

The query log sampling process 110 may utilize any of a number of different random (or pseudo-random) sampling schemes to produce a sub-sampled, but diversified and robust, query log 112. In some embodiments, the query log sampling process 110 performs a uniform sub-sampling of the query log, e.g., selecting every fifth query record as the sub-sampled query log. In some other embodiments, the query log sampling process 110 first separates all the query records by their associated geographical regions (which are based on the IP addresses associated with the records). For each geographical region, the query log sampling process 110 randomly selects a certain percentage of query records as the sub-sampled query log. This sampling scheme ensures that query records from different geographical regions are proportionally included in the sub-sampled query log, based on their ratios in the original query log. These sub-sampling schemes can also be based on other predefined criteria. There are also many other sampling schemes known to one skilled in the art that may be used here to produce the sub-sampled query log 112.

In some embodiments, the query log sampling process 110 implements a sampling strategy to achieve a sub-sampled query log 112 that satisfies one or more predefined diversity requirements. Exemplary requirements include limiting the number of query records in the query log 112 having a particular IP address for a given time period. These requirements can effectively increase the sub-sampled query log 112's diversity and prevent the sub-sampled query log 112 from being corrupted by bogus query data associated with malicious operations such as query spam.

Very often, a user may submit multiple related queries to the search engine 106 within a short timeframe in order to find information of interest. For example, the user may first submit a query "French restaurant, Palo Alto, Calif.", looking for information about French restaurants in Palo Alto, Calif. Subsequently, the same user may submit a new query "Italian restaurant, Palo Alto, Calif.", looking for information about Italian restaurants in Palo Alto, Calif. These two queries are logically related since they both concern a search for restaurants in Palo Alto, Calif. This relationship may be demonstrated by the fact that the two queries are submitted closely in time or the two queries share some query terms (e.g., "restaurant" and "Palo Alto").

In some embodiments, these related queries are grouped together into a query session to characterize a user's search activities more accurately. A query session comprises one or more queries from a single user, including either all queries submitted over a short period of time (e.g., ten minutes), or a sequence of queries having overlapping or shared query terms that may extend over a somewhat longer period of time (e.g., queries submitted by a single user over a period of up to two hours). Queries concerning different topics or interests are assigned to different sessions, unless the queries are submitted in very close succession and are not otherwise assigned to a session that includes other similar queries.
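
The session definition above can be illustrated with a small heuristic in Python. This is a sketch under stated assumptions rather than the patent's extraction process: a query joins the current session if it arrives within ten minutes of the previous query, or if it shares at least one whitespace-separated term with the previous query and the session would still span no more than two hours. The names `Query`, `split_sessions`, `SHORT_GAP`, and `LONG_SPAN`, and the exact way the two time limits are combined, are hypothetical.

```python
from dataclasses import dataclass

SHORT_GAP = 10 * 60       # seconds; queries this close always share a session
LONG_SPAN = 2 * 60 * 60   # seconds; longest session held together by shared terms


@dataclass
class Query:
    user: str
    ts: int      # seconds since an arbitrary epoch
    text: str


def split_sessions(queries):
    """Group one user's queries into sessions, processing them in time order."""
    sessions = []
    for q in sorted(queries, key=lambda item: item.ts):
        if sessions:
            current = sessions[-1]
            gap = q.ts - current[-1].ts    # time since the previous query
            span = q.ts - current[0].ts    # session length if q is added
            shared = set(q.text.lower().split()) & set(current[-1].text.lower().split())
            if gap <= SHORT_GAP or (shared and span <= LONG_SPAN):
                current.append(q)
                continue
        sessions.append([q])
    return sessions


history = [
    Query("u0", 0, "french restaurant palo alto"),
    Query("u0", 1200, "italian restaurant palo alto"),  # 20 min later, shared terms
    Query("u0", 5000, "ipod video"),                    # unrelated and not close in time
]
result = split_sessions(history)
```

Here the two restaurant queries fall into one session because they share terms, even though they are twenty minutes apart, while the unrelated later query starts a second session for the same user.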
The same user looking for Palo Alto restaurants may submit a query "iPod Video" later for information about the new product made by Apple Computer. This new query is related to a different interest or topic than Palo Alto restaurants, and is therefore not grouped into the same session as the restaurant-related queries. Therefore the queries from a single user may be associated with multiple sessions. Two sessions associated with the same user will share the same cookie, but will have different session identifiers.

As shown in FIG. 1, a query session extraction process 114 is invoked to classify the individual query records into different query session records 116. A query session record includes queries closely spaced in time and/or queries that are related to the same user interest. In some embodiments, the query session extraction process is based on heuristics. For example, consecutive queries belong to the same session if they share some query terms or if they are submitted within a predefined time period (e.g., ten minutes) even though there is no common query term among them. FIG. 2B depicts one data structure 220 for storing a session record. Since the query records within the session record belong to the same user, they must have the same cookie, which is used here as a proxy for the user's identifier, as well as the same IP address. Each query record includes one or more query terms and a timestamp. In some embodiments, a query record 220 also includes a language identifier.

FIG. 2C illustrates an alternative data structure 230 for storing the session record. In this embodiment, the timestamps associated with the query records in the session record are translated into certain temporal values (e.g., day, week, month, and year) and the IP addresses are converted into certain geographical values (e.g., city, region, and country). The query record 230 may also include a language value (e.g., English, French, Spanish, etc.). In some embodiments, the temporal values are represented in the form of absolute values that represent the interval from a predefined moment in history, e.g., Jan. 1, 2000, to the time represented by the query record's timestamp. For example, the month value "75" corresponds to March, 2006. This absolute expression is convenient for data aggregation operations, and facilitates the aggregation of information from session records for particular days, weeks, months or years. In some embodiments, the temporal values for each session record are determined in accordance with the timestamp for the first query in the session. In other embodiments the timestamp of the last query in the session or an average of the timestamps in the session is used to determine the temporal values.

In some embodiments, a temporal or geographical value also includes a corresponding parameter name, and thus is represented as a name-value pair. For example, the city value of a session record is expressed as "CITY=London" and the year value is expressed as "YEAR=2005". This type of expression is easily distinguishable from regular query terms like "London" or "2005". As will be explained below,

rants"), or (B) for all sessions that contain at least one query that contains a user-specified Boolean combination of query terms. The first exemplary request ignores the query boundary markers in the session records, while the second request only matches sessions that have query terms, not separated by any query boundary markers, which satisfy the user-specified Boolean combination of query terms.

In most cases, the number of query session records 116 is too large to be processed by a single computer server efficiently. Accordingly, a query session partition process 118 (FIG. 1) is employed to divide the query session records into many database partitions 120, each database partition corresponding to a portion of the session records. In some embodiments, the query session records 116 are divided or distributed into L partitions 122. Each partition of session records is assigned to a respective computer server. Each partition 122 includes a non-overlapping subset of the query session records 116. In some embodiments, the session records constituting the partition are randomly chosen from the query session records 116 by the query session partition process 118. Alternately, the users identified in the query log entries are randomly assigned to the L partitions, and then all the query session records of the users are distributed to the partitions to which the corresponding users have been assigned.

In some embodiments, the data aggregation occurs at the query session record level. For example, in order to compute the percentage of query sessions containing the query term "iPod" among all query sessions associated with the city of London, a computer server only needs to check the inverse index to determine a first number of session records containing both the term "iPod" and the term "CITY=London" and a second number of session records containing the term "CITY=London". Dividing the first number by the second number gives the percentage of the query sessions from London that contain the term "iPod". It may be noted that this percentage may be somewhat different from the percentage of users from London who have submitted at least one query having the term "iPod".

The reason is that one London user may be associated with multiple query sessions including the term "iPod". As noted above, the query session extraction process 114 is a heuristic procedure. It does not guarantee that all the queries corresponding to one user interest always fall into the same session. For example, the user may submit one iPod-related query on Day One followed by some other queries unrelated to iPods, and then another iPod-related query on Day Two. In this case, the two iPod-related queries will be assigned to two different query sessions, and the aforementioned approach of calculating the query session-based percentage would count both query sessions.

In order to compute a user-based percentage, as opposed to a query session-based percentage, a partition sorting process 124 is required. The partition sorting process 124 sorts the query sessions within a database partition by users. As a
this name plus value expression corresponds to a single token 55 result, each partition 122 of session records in the database
in the lexicon of the log records database and the token is 120 becomes a sorted partition 128 in the database 126. The
different from the tokens corresponding to the query terms query sessions associated with one particular user are
“London” and “2005'. Therefore, the database of logs grouped together as one contiguous set. Different sets asso
records can be searched for instances of a single token in ciated with different users are arranged in a random order
order to identify session records having a particular temporal 60 Such that any portion of the sorted partition does not have any
parameter value or a particular geographic parameter value. biased data distribution. Similarly, the query sessions associ
Another characteristic of the session record data structure ated with each respective user are also arranged in a random
230 is that each of the queries in the session record is sepa order for similar purposes. But since one user may have
rated from the others by a query boundary marker, Such as multiple query sessions matching the query term "iPod.
<q>. In this way, a set of session records can be searched 65 having a total number of matching sessions from the inverse
either (A) for all sessions that contain a user-specified Bool index is not enough to calculate the percentage of users who
ean combination of query terms (e.g., “Paris' and “restau have submitted queries that include the term "iPod.” Rather, it
is necessary to access all the query session records in a sorted database partition 128 to avoid counting more than once those users having more than one matching query session record. Although this percentage estimate is more computationally expensive, it is clearly a more accurate indication of a query term's popularity among users of the search engine. A more detailed description of the sorted session records within a partition is provided below in connection with FIG. 4A.
FIG. 3 is a block diagram of an exemplary information service 300. The information service 300 generates statistics about one or more query terms in response to a query string containing the query terms. The information service 300 includes a query session processing system 320 and a statistics query processing system 340.
The query session processing system 320 generally includes one or more database partitions 302 of sorted session records, a lexicon builder 306, an encoding/decoding system 304 and a tokenspace repository 310 containing tokenized session records. The encoding/decoding system 304 retrieves session records from the one or more database partitions 302, parses the session records into tokens, encodes the tokens into a compressed format using the lexicon mappings 308 from the lexicon builder 306, and then stores the encoded tokens into the tokenspace repository 310. The lexicon builder 306 generates the lexicon mappings 308 used for encoding a set of query session records by parsing the query session records.
A "token" can be any object typically found in a query session record, including but not limited to terms, phrases, temporal values, geographical values, language values and the like. After parsing, a set of query session records is represented as a sequence of tokens. In some embodiments, every token has the same fixed size (e.g., 32 bits). Furthermore, each token in the sequence of tokens has a token position in the tokenspace repository 310. The token position of a token also represents the position of the token in the set of query session records. For example, the first token in the set of query session records may be assigned a position of 0, the second token in the set of query session records may be assigned a position of 1, and so on. It is noted that in some implementations, a completely different set of computer servers is used for encoding query session records than the computer servers used for decoding query session records.
The statistics query processing system 340 includes one or more query log servers 312 coupled to the encoding/decoding system 304 and a tokenspace inverse index 314. The tokenspace inverse index 314 maps all the tokenIDs in the set of query session records to their positions within the tokenspace repository 310 (which contains the query session records). Conceptually, the inverse index 314 contains a list of token positions for each tokenID. For efficiency, the list of token positions for each tokenID may be encoded so as to reduce the amount of space occupied by the inverse index 314.
In some embodiments, the query log servers 312 parse a statistics query into multiple query terms that are transformed by the query log servers 312 into a query expression (e.g., a Boolean tree expression). A lookup operation is performed so as to retrieve from the tokenspace inverse index 314 the token positions in the tokenspace repository 310 of the query terms, as described below in connection with FIG. 5A. The query log servers 312 use the token positions to retrieve relevant tokenized query session records from the tokenspace repository 310, perform data aggregation operations on the retrieved query session records, and return statistical information based on the data aggregation operations as the query result.
As noted above, multiple query session records may correspond to the same user within a database partition and user-based statistical information is generally preferable to query session-based statistical information. Accordingly, the query session records within a database partition are sorted by user to achieve better performance when aggregated. FIG. 4A depicts an exemplary data structure 410 for storing the tokenized query session records associated with different users within a database partition M, each query session record having a unique session ID (sometimes called a Doc ID). In some embodiments, integer numbers starting from zero are used as the session IDs. Therefore, the session ID of a particular session record also indicates the number of query session records preceding it in the database partition. The query session records are sorted by user such that the query session records associated with one user have contiguous session IDs. For example, the first (A+1) query session records correspond to a first user having a unique ID "User0", the next (B-A) query session records correspond to a second user having an ID "User1", and the last (Z-Y) query session records correspond to the last user having an ID "UserL". In some embodiments, integer numbers starting from zero are also used for representing user IDs. Similarly, the user ID associated with a particular session record also indicates the number of users corresponding to the query sessions preceding the current one. As will be explained below, this naming convention makes it easier to normalize data aggregation results.
FIG. 4B is a data structure 420 for storing an individual tokenized session record in the data structure 410 of FIG. 4A. This data structure has the same set of components as the data structure 230 of FIG. 2C except that every component in the data structure 230 has been encoded (tokenized) to occupy less space. When a repository position identified in the tokenspace inverse index 314 corresponds to a location within the data structure 420, the query log servers 312 retrieve the tokenized session record and aggregate its parameter values accordingly. For example, if the data aggregation is to calculate the total number of users from London and the tokenized geographical values of a respective session record include the tokenized version of "CITY=London", the counter corresponding to the total number will increase by one.
As noted above, the tokenspace inverse index 314 contains the repository positions for different token IDs. But the typical goal of a statistics query is to find out the percentage of users submitting certain queries among a specific group of users meeting certain criteria. Some mapping mechanisms are necessary to bridge the gap between the domain of repository positions and that of users submitting queries. FIGS. 5A-5C are block diagrams illustrating an exemplary process of generating statistical information in response to a statistics query.
FIG. 5A is a block diagram of an embodiment of the first stage of the exemplary process with a tokenspace repository. The process involves a lexicon 502, a tokenspace inverse index 506, and a tokenID to index record map 504. Query terms or strings are received by the lexicon 502, which translates the query terms into tokenIDs using a translation table or mapping built from entries of the lexicon 502. A map 504 then maps the token IDs to index records stored in the inverse index 506. Each index record identified using the map 504 contains a list of token positions, which directly correspond to token positions in the tokenspace repository 310.
The token positions in the tokenspace inverse index 506 refer to the exact locations of the query terms in different tokenized query session records in the tokenspace repository 310. In order to aggregate the parameter values of each query session record containing one of the token positions, the query log servers 312 need to identify the starting position of the query session record in the tokenspace repository 310.
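The tokenspace repository, the inverse index, and the need to find the session that contains a given token position can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy sessions, the function names, the use of plain strings instead of fixed-size encoded token IDs, and the choice of a binary search over a sorted list of session starting positions. It computes the session-based share of London sessions that contain the term "ipod".

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical toy sessions. Name-value tokens such as "CITY=London" are
# single tokens, distinct from the plain query term "london".
sessions = [
    ["CITY=London", "YEAR=2005", "<q>", "ipod", "nano"],
    ["CITY=Paris", "YEAR=2005", "<q>", "paris", "restaurants"],
    ["CITY=London", "YEAR=2006", "<q>", "london", "weather", "<q>", "ipod"],
    ["CITY=London", "YEAR=2006", "<q>", "tube", "map"],
]


def build_repository(sessions):
    """Flatten sessions into one token sequence and record each start position."""
    repository, session_starts = [], []
    for tokens in sessions:
        session_starts.append(len(repository))
        repository.extend(tokens)
    return repository, session_starts


def build_inverse_index(repository):
    """Map every token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(repository):
        index[token].append(position)
    return index


def session_of(position, session_starts):
    """Return the ID of the session whose token range contains `position`."""
    return bisect_right(session_starts, position) - 1


def sessions_containing(token, index, session_starts):
    """Return the set of session IDs with at least one occurrence of `token`."""
    return {session_of(p, session_starts) for p in index.get(token, [])}


repository, session_starts = build_repository(sessions)
index = build_inverse_index(repository)

london = sessions_containing("CITY=London", index, session_starts)
ipod_in_london = sessions_containing("ipod", index, session_starts) & london
share = len(ipod_in_london) / len(london)
print(f"{share:.1%} of London sessions contain 'ipod'")
```

Because session IDs here are simply positions in the sorted list of starting positions, the session ID also equals the number of sessions that precede it, which mirrors the zero-based numbering described for FIG. 4A. Note that this is a session-based figure; a user who appears in several matching sessions is counted once per session.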
FIG. 5B is a data structure for storing a two-way look-up map 508 between the session IDs of the query session records in the tokenspace repository and the starting positions of the query session records in the repository. Each entry in the two-way look-up map 508 includes a session ID and a starting repository position for the corresponding query session record. The position of the last token in any query session record is the position immediately prior to the starting position identified by the next entry in the look-up map 508.
Assuming that a repository position corresponding to the query term "iPod" is 113, a lookup of the map 508 indicates that this position value is greater than the starting position of SessionM and smaller than the starting position of SessionM+1. Therefore, the query session record SessionM must contain the query term "iPod". One of the query log servers 312 then visits the data structure 410 of FIG. 4A to retrieve the tokenized session record for SessionM and aggregates the parameter values of the retrieved session record.
FIG. 5C is a data structure for storing a two-way look-up map 510 between the user IDs and the starting session IDs of the corresponding users. Each entry in the two-way look-up map 510 includes a user ID and a starting session ID for the corresponding user. The session ID of the last session record of any user is the value immediately prior to the starting session ID of the next entry in the look-up map 510. For example, the starting session ID of the user "User0" in FIG. 4A is 0, the starting session ID of the user "User1" is A+1, and the starting session ID of the user "UserL" is Y+1.
If the goal is to generate user-based statistical information, after aggregating one session record associated with a particular user, a query log server must skip the remaining query sessions associated with the same user and move to the matching session record associated with the next user. Suppose that the list of repository positions matching the query term "iPod" identified in the inverse index 506 includes 113, 134, 153, and 178, and they are found in the query sessions M, M+2, M+6, and M+8, respectively. In other words, the first instance of the query term "iPod" is in the query session M, the second instance is in the session M+2, the third instance is in the query session M+6, and the fourth instance is in the query session M+8. Further, suppose that the starting session IDs for users A and A+1 are M-2 and M+7, respectively. The query log server first aggregates the query session M. Since the ID of session M is larger than that of session M-2 and smaller than that of session M+7, session M belongs to user A. By the same token, both sessions M+2 and M+6 belong to user A. Therefore, after aggregating session M, the query log server skips the other two instances of the term "iPod" in the sessions M+2 and M+6 since they are associated with the same user A. The next query session visited by the query log server is the query session M+8, which belongs to a different user having a user ID "A+1". The user ID "A+1" also indicates the total number of users that have been scanned for the current data aggregation result. As will be explained below in connection with FIG. 8, this value can be used to normalize the data aggregation result.
As noted above, the sheer volume of the query session records is often beyond the capacity of a single computer server. Accordingly, the query session records are broken into multiple partitions, each partition being assigned to a respective query log server. In some embodiments, a small number of partitions, e.g., two, three or four, may be assigned to the same query log server. For purposes of explaining the operation of a statistics query processing system, each partition may be treated as a distinct entity executing on a distinct query log server.
In order to increase the throughput of the statistics query processing system 340 of FIG. 3, the query log servers are arranged in a hierarchical structure (e.g., a tree). FIG. 6 is a block diagram of such an exemplary hierarchical structure in accordance with some embodiments of the present invention. The query log servers 312 are coupled to a frontend server 602. The frontend server 602 is coupled to a news backend server 620. The news backend server 620 is responsible for generating news-related statistical information responsive to a statistics query. For example, given a statistics query that contains only one query term, "iPod", the news backend server 620 counts the number of occurrences of the term within a particular week's news coverage. For each week, the news backend server also selects a piece of news from a credible news source as the representative news item or story of that week. In some embodiments, the representative news story is the story having the "most representative headline," and the "most representative headline" may be defined as the headline whose terms appear most frequently among all the matching news items (i.e., news items matching the statistics query) during that week. A more detailed description of the news backend server can be found in U.S. patent application Ser. No. 11/239,684, entitled "Labeling Events in Historic News", filed Sep. 30, 2005, which is incorporated herein by reference in its entirety. The frontend server 602 is responsible for initially parsing the statistics query and is also responsible for preparing query results (e.g., generating images and HTML files, etc.) based on the query results generated by the news backend server 620 and the query log servers 312, respectively.
Within the hierarchical structure of the query log servers 312, there is a root query server 604. The root query server 604 is responsible for generating a set of sub-queries in response to a statistics query, identifying a subset of intermediate query servers 606 for a selected subset of sub-queries, distributing the selected sub-queries to the identified intermediate query servers 606, and aggregating the query results returned by different intermediate query servers 606. An intermediate query server 606 is responsible for identifying a respective subset of leaf query servers 608 for each sub-query, distributing the sub-query to the leaf query servers, and aggregating the query results returned by different leaf query servers 608. A leaf query server 608 is responsible for aggregating data or information from its associated partition of sorted query session records 610 in accordance with the sub-query assigned to the leaf query server. A more detailed description of the data aggregation process is provided below in connection with FIG. 8.
The number of layers in the hierarchical structure may depend on the volume of query session records in the database. The example shown in FIG. 6 is only for illustrative purposes. In some embodiments, if the volume of the query session records is limited, the intermediate layer of query servers 606 is not necessary. In these embodiments, the root query server 604 is directly responsible for distributing sub-queries to different leaf query servers 608 and aggregating their query results accordingly. In some other embodiments, more intermediate layers of query servers 606 may be necessary to process a large volume of query session records. One skilled in the art will appreciate that a specific hierarchical structure design is chosen to maximize the system's throughput.
For the purposes of explaining the operation of an exemplary embodiment of the system shown in FIG. 6, the exemplary embodiment will be assumed to have 1024 database partitions, each of which contains data for at least one million query sessions. Thus, when a sub-query is executed against
25 database partitions, about 2.5 percent of the database is being sampled by the execution of that sub-query. In addition, while the number of leaf query servers 608 may be the same as the number of database partitions, in some embodiments, some or all of the leaf query servers 608 service two or more (e.g., two to four) database partitions.
As noted above, the entire volume of query session records is broken into L partitions. In some embodiments, the value "L" is also the number of leaf query servers in the hierarchical structure. Query session records are randomly assigned to the leaf query servers. The query session records at a particular leaf query server are sorted by their respective users, but are otherwise distributed randomly. For example, one set of session records associated with a user from the United States may be preceded by a set of session records associated with a user from Germany and followed by another set of session records associated with a user from India. The session records associated with a particular user are also distributed randomly according to the same or similar principle.
Due to the random distribution of query records among the database partitions, and the fact that each database partition contains millions of query session records, each database partition has coverage of representative query sessions very similar to that of the entire database. A statistical estimate based on a subset (e.g., one, or a few) of the database partitions is often sufficiently close to a statistical estimate based on all the database partitions. In other words, there is no need to invoke all the leaf query servers for every single statistical query. This can significantly boost the query processing system's throughput. On the other hand, an estimate based on too few data samples is not a reliable approximation of an estimate based on the entire database. There must be a sufficient number of data samples to ensure the reliability of an estimate based on a subset of a database, and also to ensure that reported statistics cannot be traced back to individual users or even to small groups of such users.
In statistics, the standard deviation is a common measure of the distribution of data samples within a dataset and the normal distribution, also called Gaussian distribution, is an extremely important probability distribution in many fields. FIG. 13 is a diagram of a bell-shaped curve 1300 representing the probability density of a random variable X having a normal distribution. The value μ in the middle corresponds to the mean of all the data samples within a dataset of the random variable X and the value σ represents the standard deviation of the dataset. If the random variable X takes on data values {x_1, x_2, ..., x_N}, below are the respective definitions of the mean μ and the standard deviation σ:
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \]
As shown in FIG. 13, in the case of a normal distribution, one standard deviation σ from the mean μ accounts for 68.26% of the set of data values. In other words, nearly 70% of all the data samples generated by the random variable X are within the interval [μ-σ, μ+σ]. Similarly, two standard deviations 2σ from the mean μ account for 95.5%, and three standard deviations 3σ from the mean μ account for 99.7% of the entire set of data values. A large standard deviation indicates that the data samples are far from the mean and a small standard deviation indicates that they are clustered closely around the mean. For a random variable having a normal distribution, the more data points that are sampled, the more likely we are to get a small standard deviation and therefore a more reliable estimate of the mean of the random variable.
In the present invention, the distribution of millions of session records and their associated users within a database partition can be approximated by a normal distribution. The formula above can be used to provide guidance with respect to the number of users required to produce a reliable statistical estimate. For example, to estimate the probability of a query including a specific term, a leaf query server may need to scan through tens of thousands of session records until a predetermined number of queries containing the specific term are found. In some embodiments, this predetermined number is one thousand, two thousand, or a number larger than two thousand (e.g., at least 2,500 positive samples). The magnitude of the predetermined number is, at least in part, dependent upon the ratio of the sub-sampled query log versus the entire query log and their absolute sizes. There are many known statistical theories on how to determine the number of data samples for given reliability requirements.
In order to identify at least a predetermined number of matching query session records, a two-step approach is described below in connection with FIG. 7. The first operation is to search a selected subset of the database partitions for matching query sessions. The second operation is to search additional database partitions (by employing additional leaf query servers) to conduct a new search if the first search operation fails to return at least the predetermined number of matching query session records. Therefore, the first operation is always required and the second operation is conditional.
As shown in FIG. 7, there are several sub-operations within each primary operation. Within the first operation, the root query server first receives a statistics query from a requestor (702). The query may include one or more query terms. These query terms are combined together using one or more Boolean operators such as AND, OR, etc. Next, the root query server generates a list of sub-queries based on the statistical query (704). For example, given the statistical query having only one term "iPod", the root query server may generate a series of sub-queries for a predetermined period of time, each sub-query for determining the percentage of users having submitted at least one iPod-related query in a given week. Similarly, the root query server may also generate sub-queries to determine the percentage of users having submitted at least one iPod-related query from a set of distinct locations, or using a set of distinct languages.
The root query server (or the root query server in conjunction with one or more intermediate query servers) identifies a subset of the database partitions, and their corresponding leaf query servers, for each sub-query (706). In doing so, the root query server checks the workload and free resources available at different leaf query servers and then selects one or more database partitions against which to execute the sub-query to achieve an overall load balance. Since there may be multiple leaf query servers executing the same sub-query, each leaf query server is only responsible for identifying a fraction of the predetermined number of matching query sessions. For example, if the original predetermined number for the entire database is 2500 and 25 database partitions are selected to execute a sub-query, each corresponding leaf query server only needs to identify 100 instances of matching query session records within each of its selected database partitions. Alternately, each leaf query server may be asked to identify up to N instances of matching query session records
in each of its selected database partitions, where N is larger (e.g., 10 percent to 100 percent larger) than the predefined number divided by the number of database partitions being queried. Next, the sub-queries are executed by their respective subsets of leaf query servers (708). A more detailed description of this sub-operation is provided below in connection with FIG. 8.
After completing the assigned sub-query, a selected subset of leaf query servers return their query results to the root query server. The root query server then aggregates these query results and checks if the aggregated query result has met some predetermined requirements (710). There are two possible outcomes for the aggregated query result. If the leaf query servers have found their shares of matching query session records in their database partitions, the aggregated query result is deemed a reliable approximation of the query result derived from the entire database. In this event, Step Two is skipped. The root query server checks if all sub-queries have been executed (720). If not, the root query server returns to operation 708 and asks the leaf query servers to execute another sub-query. In some embodiments, operation 708 launches all of the identified sub-queries prior to receiving the results for any of these sub-queries. In these embodiments, the loop control operation 720 is not needed.
If the total number of matching query session records identified by the leaf query servers has not reached the predetermined number, the root query server identifies a second subset of additional database partitions and their corresponding leaf query servers (712). For example, if the average number of matching query sessions identified in the initially selected 25 database partitions is only 50 (for a total of approximately 1250 matching query session records), the root query server identifies at least another 25 database partitions on which to execute the same sub-query. In some embodiments, the root query server may identify more than 25 (e.g., 30) database partitions (and their leaf query servers) to make sure that a sufficient number of matching query sessions are found. Next, the newly-selected leaf query servers execute the same sub-query against their associated database partitions (714).
The root query server then checks if the aggregated query result has met the predetermined requirements (716). If the number of matching query sessions is still below the predetermined number (716, No) and the root query server predicts that there is not sufficient data in the entire database (718), the root query server makes a mark in an entry in the query result corresponding to the sub-query. The frontend server, upon receiving the query result, can use other measures (e.g., data interpolation) to fix the missing entry, or it may omit the query result corresponding to the sub-query that returned an insufficient number of results. After executing all the sub-queries (720, Yes), the root query server normalizes the aggregated query results (722) and returns them to the requestor through the frontend server (724).
FIG. 8 is a flowchart illustrating how a leaf query server executes a sub-query in accordance with some embodiments of the present invention. Upon receipt of a sub-query from a query distributor (e.g., the root query server), the leaf query server visits the tokenspace inverse index 314 and identifies a list of repository positions for the sub-query (804). If the sub-query is a Boolean expression including more than one query term, the leaf query server identifies the repository position lists corresponding to different query terms and then
identifiers, the leaf query server consults the two look-up maps (mapping repository positions to session IDs, and mapping session IDs to user IDs) discussed above with reference to FIGS. 5B and 5C. Moreover, the leaf query server must access the query session records themselves in order to aggregate their parameter values (e.g., time, location and language).
The leaf query server selects the first repository position in the list as the current repository position (806). If the sub-query is a "discovery" sub-query, the leaf query server identifies the query session (using look-up map 508) that encompasses the current repository position and then retrieves the query session record from the database partition (808). Next, the leaf query server aggregates the query session record by its parameter values (810). For example, if the sub-query is to count the unique users whose query session records fall into a particular week (e.g., the sub-query is "iPOD AND week=103"), the leaf query server simply increments a count value. If the sub-query is a discovery sub-query (e.g., identify the cities of the users who submitted matching queries), the leaf query server identifies the value of the session parameter (e.g., city) and adds a count to a corresponding counter.
Depending upon the number of database partitions selected for a particular sub-query, each leaf query server is responsible for identifying a predetermined number of matching query sessions in the corresponding database partition. As noted above, this predetermined number is typically a fraction of the number for the entire database. Every time the leaf query server finds a matching query session, it needs to check if the current count of matching query sessions has met the predetermined number (811). If so, the leaf query server does not have to proceed to a subsequent repository position in the list. Rather, the leaf query server moves to determine the total number of users that have been scanned (816). Otherwise, the aggregation moves on to the next repository position in the list (814).
After finishing the data aggregation operation, the leaf query server checks if the current repository position is the last one in the list (812). If not, the leaf query server proceeds to a subsequent repository position in the list (814). Note that if the leaf query server generates user-based query results, the next repository position that matches the sub-query may not be the correct one since it may be associated with the same session (e.g., one query session record may include multiple instances of a query term like "iPod") or a different session that corresponds to the same user. Therefore, the leaf query server must find the first repository position corresponding to a different user than the previously processed session record. In some embodiments, from the current user, the leaf query server finds an entry including the next user in the look-up map 510. The same entry also includes the starting session ID corresponding to the next user. Based on the starting session ID, the leaf query server visits the look-up map 508 to identify the starting repository position corresponding to the starting session ID. Based on this starting repository position, the leaf query server then examines its list of repository positions to find the first repository position in the list of query-matching repository positions that is beyond that starting position.
Alternately, operation 804 may identify a list of sessions that match the sub-query, and operations 806 and 808 utilize the identified list of matching sessions.
If all the repository positions (or matching sessions) in the
generates a list of query sessions that satisfy the Boolean list have been processed (812, Yes), the leaf query server then
expression. As noted above, the list of repository positions for determines the total number of users that have been scanned
any given query term (i.e., token) corresponds to the locations 65 in the database partition (816). This sub-operation also
of the query term(s) in the matching query session records. To involves the two look-up maps 508 and 510. The leaf query
map the repository positions to session records and/or user server first checks the look-up map 508 to find the session ID
whose record contains the last repository position in the list. From the session ID, the leaf query server finds the corresponding user ID in the look-up map 510. The user ID associated with the smaller starting session ID corresponds to the total number of users that have been scanned in the database partition. Finally, the leaf query server normalizes the query result using the total number of scanned users and returns the normalized query result to the query distributor (818).

Generally, there are two types of normalization in connection with the generation of user-based statistics. For example, in order to find the popularity of the query term "iPod" in different weeks, the root query server generates at least two sub-queries for each particular week. One sub-query N("iPod", "WEEK=xyz") estimates the number of users that have submitted at least one iPod-related query during the week and the other sub-query N("WEEK=xyz") estimates the number of users that have submitted any query during the same week. Since there may be a significant growth of users visiting the web search engine from week to week, the number corresponding to the query N("iPod", "WEEK=1000") may also increase from week to week without indicating any increase in popularity of the query. Without normalization, changes in the numbers corresponding to the query N("iPod", "WEEK=xyz") from week to week may not correspond to changes in popularity. Therefore, the popularity of the query term "iPod" has to be expressed in a normalized fashion as

\[
\frac{N(\text{"iPod"},\ \text{"WEEK=xyz"})}{N(\text{"WEEK=xyz"})}
\]

As noted above, when a leaf query server executes a sub-query N("iPod", "WEEK=xyz"), it may stop short of determining the total number of users who have submitted at least one iPod-related query during the week "xyz". Rather, the leaf query server may stop its search when it finds a predetermined number of such users (e.g., 100 if 25 database partitions are being searched). But in order to find the 100 users, the leaf query server must scan a larger number of users, i.e., the total number of scanned users determined at sub-operation 816. If the query is very popular, the leaf query server may not need to scan too many users to reach the goal of 100. But if the query is not popular, the leaf query server must scan a substantial number of users to reach the goal of 100. In other words, the total number of scanned users for each sub-query indicates the popularity of the sub-query. A higher total number of scanned users corresponds to a less popular sub-query. Therefore, the number corresponding to the query N("iPod", "WEEK=xyz") should be expressed in a normalized fashion as

\[
N(\text{"iPod"},\ \text{"WEEK=xyz"}) = \frac{C_{\text{Predetermined number}}}{C_{\text{Total Users Scanned}}}
\]

where $C_{\text{Predetermined number}}$ corresponds to the predetermined number of matching query session records the leaf query server was asked to find and $C_{\text{Total Users Scanned}}$ represents the total number of users that the leaf query server has scanned in order to reach the predetermined number of matching query sessions. In some embodiments, the two normalizations are merged into one sub-operation 722 of FIG. 7. In this case, the leaf query server returns the total number of scanned users for each individual sub-query at operation 818 of FIG. 8.

In some embodiments, the execution of one statistics query as described above in connection with FIG. 7 may bring forward a set of subsequent candidate statistics queries. For example, when a user submits a statistics query for the term "iPod", besides the term's popularity over time, the user may be interested in learning the top 10 cities that have the largest number of users submitting at least one iPod-related query. But this information cannot be generated during the first pass of executing the statistics query for "iPod" because there could be thousands of cities having at least one user submitting an iPod-related query. This list of cities is unknown at the time of executing the query for "iPod". Even if this list is known, it would not be practical for the root query server to generate even hundreds of sub-queries in order to identify and generate statistics for the top 10 cities. This massive number of sub-queries would adversely affect the throughput of the query log servers.

FIG. 9 is a flowchart illustrating a two-pass process of identifying the top N candidates without generating a large number of sub-queries. The first pass of the process includes three operations 902, 904, and 906. Operation 902 is similar to the process shown in FIG. 7. The root query server executes a statistics query at a selected subset of leaf query servers (i.e., against a selected subset of the database partitions). For example, the statistics query has only one query term "iPod". The leaf query servers identify the users having at least one iPod-related query and aggregate the users by their associated cities, regions, countries and languages. Next, the root query server selects from the aggregated query results a subset of parameter values meeting predefined criteria (904). For example, the root query server selects the top 25 cities, regions, countries and/or languages based on the aggregated query results. Since a database partition is like a smaller version of the entire database, it is very likely that the top 25 candidates identified by the leaf query servers encompass the top 10 candidates, although the candidates' order may not be exactly the same. For each of the top 25 candidates in different categories, the root query server generates a new subsequent statistics query (906). The first pass of the process is also referred to as the discovery pass. Its goal is to reduce the number of candidates from several thousand to a few dozen by performing a "discovery" search of a small subset of the database partitions.

The second pass of the process is referred to as the refinement pass. From the query results of the discovery pass, the root query server not only has a list of candidates for each parameter category, but also has reasonable estimates of the densities of query session records containing the different candidates. For each subsequent statistics query, the root query server identifies (i.e., selects) a respective subset of database partitions (and their corresponding leaf query servers) (908). The number of the database partitions depends on the density estimate associated with the subsequent query. A query having a lower density estimate needs to be executed against more database partitions to find the predetermined number of matching query session records. Next, the root query server executes each of the subsequent queries at the corresponding subset of leaf query servers against the identified database partitions (910). Finally, the root query server returns the query results associated with the subsequent queries to the requestor (912). Since the second pass has found a sufficient number of matching records for each query, the query result generated at the second pass is more reliable than that of the first pass. But the first pass substantially reduces the number of candidates and makes it possible to execute the second pass efficiently.
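The early-terminating user scan of FIG. 8 and the two normalizations described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the specification: the `Session` record, the function names, and the representation of a sub-query as a set of tokens that must all be present are hypothetical, and a plain in-memory list sorted by user stands in for the inverse index 314 and the look-up maps 508 and 510.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Session:
    """Hypothetical query session record; the field names are illustrative."""
    user_id: int
    tokens: frozenset


def scan_partition(sessions, required, target):
    """Count users with at least one matching session, stopping at `target`.

    `sessions` must be sorted by user_id, mirroring a partition sorted by user.
    Returns (matched_users, scanned_users).
    """
    matched = scanned = 0
    for _user, user_sessions in groupby(sessions, key=lambda s: s.user_id):
        scanned += 1
        # A user counts once, however many of their sessions match.
        if any(required <= s.tokens for s in user_sessions):
            matched += 1
            if matched >= target:
                break  # predetermined number reached; stop scanning
    return matched, scanned


def density(sessions, required, target):
    """Matched users divided by scanned users (second normalization)."""
    matched, scanned = scan_partition(sessions, required, target)
    return matched / scanned if scanned else 0.0


def popularity(sessions, term, week, target):
    """N(term, week) / N(week), each first normalized by scanned users."""
    week_density = density(sessions, frozenset({week}), target)
    if week_density == 0.0:
        return 0.0
    return density(sessions, frozenset({term, week}), target) / week_density
```

In this sketch, a popular sub-query reaches `target` after scanning few users and so yields a high density, while an unpopular one scans many users and yields a low density, which matches the relationship stated in the text.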
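The two-pass process of FIG. 9 (a discovery pass followed by a refinement pass) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: every function and parameter name is hypothetical, each partition is modeled as a list of `(city, terms)` pairs with one pair per user, and the refinement pass sizes its partition subset from the density estimate rather than scanning until a predetermined number of matches is found.

```python
from collections import Counter
from math import ceil


def discovery_pass(partitions, term, sample_size, candidate_k):
    """Pass 1: aggregate matching users by city over a few partitions.

    Returns {city: estimated density} for the top `candidate_k` cities.
    """
    counts = Counter()
    users_seen = 0
    for part in partitions[:sample_size]:
        users_seen += len(part)
        for city, terms in part:
            if term in terms:
                counts[city] += 1
    return {city: n / users_seen for city, n in counts.most_common(candidate_k)}


def partitions_needed(density, users_per_partition, target, total):
    """A lower density estimate calls for more partitions to reach `target`."""
    expected = density * users_per_partition
    if expected <= 0:
        return total
    return min(total, max(1, ceil(target / expected)))


def refinement_pass(partitions, term, candidates, target, top_n):
    """Pass 2: re-run one query per candidate on a density-sized subset."""
    users_per_partition = sum(len(p) for p in partitions) / len(partitions)
    results = {}
    for city, dens in candidates.items():
        k = partitions_needed(dens, users_per_partition, target, len(partitions))
        matched = scanned = 0
        for part in partitions[:k]:
            scanned += len(part)
            matched += sum(1 for c, terms in part if c == city and term in terms)
        results[city] = matched / scanned  # normalized by users scanned
    return sorted(results, key=results.get, reverse=True)[:top_n]
```

The design point this illustrates is that the discovery pass only has to produce a candidate list generous enough to contain the true top N (e.g., 25 candidates for a top-10 request), so it can run on a small sample, while the refinement pass spends its effort on a few dozen queries instead of thousands.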
FIG. 10 is an exemplary screenshot of a webpage containing statistics about users related to a particular query term "iPod". The webpage includes various types of statistical information. It takes less than a second for the statistical query processing system to generate this webpage.

Some of the statistical information is time-based. For example, the curve 1002 represents the popularity of the term "iPod" over a time period of more than two years. Each data point on the curve 1002 corresponds to a ratio between the number of users that have submitted at least one iPod-related query during a particular week and the number of users that have submitted any query during that week. The curve 1002 has both peaks and troughs suggesting that the term's popularity indeed varies with time.

Below the curve 1002 is another curve 1004. This curve represents the volume of news coverage of iPod during the same time period. Each data point on the curve 1004 represents the number of occurrences of the term "iPod" in that week's news coverage. The spikes on the curve 1004 indicate increases in news coverage concerning the iPod product during that week. The frontend server selects the weeks corresponding to six spikes on the curve 1004 and marks the corresponding weeks on the curve 1002 using labels A-F. For some of the spikes on the news curve 1004, there is a corresponding increase of users submitting the iPod-related query. For example, the label C indicates that the increase in iPod-related queries is in synch with the news coverage spike. The representative news during that week is that Apple releases its new generation of product, iPod Nano and iPod phone.

In contrast, although label D corresponds to the highest news coverage spike on the curve 1004, there is no significant increase of iPod-related queries. The representative iPod-related news is that Apple releases its iPod Video. A comparison of the two curves 1002 and 1004 may provide valuable insights to various persons (e.g., marketing professional, social scientist, etc.) and organizations.

Below the two curves are three tabs, 1006 for cities, 1008 for countries and 1010 for languages. Under the Cities tab 1006 are the top 10 cities that have the largest number of users that have submitted at least one iPod-related query. Note that the number next to each city is not necessarily the actual number of users from that city. It is a value representing the scale of one city versus a reference value. To the right of the numbers is a bar chart 1012. The bar chart 1012 illustrates the volume of users from the top 10 cities in a more intuitive manner. When a user clicks the Countries tab 1008 or the Language tab 1010, the statistics query processing system receives a new statistics request and then generates query results using the two-pass process described above in connection with FIG. 9.

On the top-right corner of the screenshot are two dropdown lists, 1016 for date sub-ranges (labeled "years") and 1018 for countries. A user can request that the prior inquiry be re-executed for a particular month, in which case a new webpage is provided that reflects statistics based on the query session records for that month, as shown in FIG. 10B. Similarly, a user can request that the prior inquiry be re-executed for a particular country, in which case a new webpage is provided that reflects statistics based on the query session records for that country, as shown in FIG. 10C.

FIG. 11A is a block diagram of an exemplary query distribution and aggregation server (e.g., a root query server or intermediate query server) 1100 in accordance with some embodiments of the present invention, which typically includes one or more processing units (CPUs) 1102, one or more network or other communications interfaces 1104, memory 1110, and one or more communication buses 1112 for interconnecting these components. Memory 1110 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices. Memory 1110, or alternatively non-volatile memory within memory 1110, comprises a computer readable storage medium having stored thereon data representing sequences of executable instructions. Memory 1110 or the computer readable storage medium of memory 1110 preferably stores the following programs, modules and data structures, or a subset or superset thereof:
  an operating system 1114 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
  a network communication module (or instructions) 1116 that is used for connecting the query server 1100 to other computers via the one or more communication network interfaces 1104;
  a sub-query distributor 1118 for assigning a sub-query to a selected subset of database partitions and their corresponding leaf query servers;
  a sub-query result aggregator 1120 for aggregating query results generated by leaf query servers;
  a sub-query generator 1122 for generating a set of sub-queries for a statistics query; and
  a load balance analyzer 1126 for analyzing the workload at different leaf query servers and selecting a subset of the leaf query servers based upon their workload.

FIG. 11B is a block diagram of an exemplary leaf query server 1150 in accordance with some embodiments of the present invention, which typically includes one or more processing units (CPUs) 1152, one or more network or other communications interfaces 1154, memory 1160, and one or more communication buses 1162 for interconnecting these components. Memory 1160 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices. Memory 1160, or alternatively non-volatile memory within memory 1160, comprises a computer readable storage medium having stored thereon data representing sequences of executable instructions. Memory 1160 or the computer readable storage medium of memory 1160 preferably stores the following programs, modules and data structures, or a subset or superset thereof:
  an operating system 1164 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
  a network communication module (or instructions) 1166 that is used for connecting the query server 1100 to other computers via the one or more communication network interfaces 1104;
  a sub-query executor 1168 for executing a sub-query against a database partition; and
  one or more database partitions 1170, each of which may include a token repository 1172, an inverse index 1174, a session ID range map (session ID to token position map) 1176, and a user ID range map (user ID to session ID map) 1178.

FIG. 12 is a block diagram of an exemplary query session partition server 1200 in accordance with some embodiments of the present invention, which typically includes one or more processing units (CPUs) 1202, one or more network or other communications interfaces 1204, memory 1210, and one or more communication buses 1212 for interconnecting these components. Memory 1210 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices. Memory 1210, or alternatively non-volatile memory within memory
1210, comprises a computer readable storage medium having stored thereon data representing sequences of executable instructions. Memory 1210 or the computer readable storage medium of memory 1210 preferably stores the following programs, modules and data structures, or a subset or superset thereof:
  an operating system 1214 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
  a network communication module (or instructions) 1216 that is used for connecting the partition server 1200 to other computers via the one or more communication network interfaces 1204;
  a query log sampler 1218 for sampling a web search engine's query log;
  a query session extractor 1220 for grouping different query records into different query sessions in accordance with predefined criteria;
  a query session partitioner 1222 for breaking the entire query session database into multiple partitions;
  a query session sorter 1224 for sorting the query session records within a partition by their associated users; and
  multiple database partitions 1226, each database partition having a token repository 1228, an inverse index 1230, a session ID range map 1232 and a user ID range map 1234.

Although some of the various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art and so do not present an exhaustive list of alternatives. Moreover, it should be recognized that the stages could be implemented in hardware, firmware, software or any combination thereof.

The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.

What is claimed is:

1. A computer-implemented method, comprising:
  at a server system, distinct from a client system, the server system having one or more processors and memory storing programs executed by the one or more processors:
    responding to a request from the client system for statistics of occurrence of a user-identified event, the responding including:
      accessing a database of events;
      identifying instances of occurrence of the user-identified event in at least a subset of the database of events to produce representative statistics of occurrence of the user-identified event within the entire database of events;
      generating a result including at least a portion of the representative statistics of occurrence of the user-identified event, the result including sub-results comprising at least representative statistics of occurrence of the user-identified event during each of a plurality of time periods, representative statistics of occurrence of the user-identified event in each of a plurality of geographic locations or representative statistics of occurrence of the user-identified event with respect to each of a plurality of languages;
      sending to the client system for display at the client system the result, including the sub-results;
      wherein the sub-results comprise normalized representative statistics of occurrence of the user-identified event that indicate differences in popularity of the user-identified event during the plurality of time periods, differences in popularity of the user-identified event in the a plurality of geographic locations and/or differences in popularity of the user-identified event with respect to the plurality of languages.

2. The method of claim 1, wherein an event in the database of events includes one or more mini-sessions, each mini-session having a user identifier, a temporal value, a geographic location value, and one or more query strings.

3. The method of claim 2, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple time periods, and the method including aggregating instances of the event in each of the plurality of respective time periods based on their associated mini-session temporal values.

4. The method of claim 2, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple geographical regions, and the method including aggregating instances of the event in each of the plurality of respective geographical regions based on their associated mini-session geographic location values.

5. The method of claim 1, further comprising:
  generating a series of sub-queries based on the received request and one or more of a timeframe of query submission, a geographic location of query submission, and a language of query submission, wherein the instances of occurrence of the user-identified event correspond to instances of the sub-queries within the database of events.

6. The method of claim 1, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods.

7. The method of claim 1, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods and each of said plurality of geographic locations.

8. The method of claim 1, wherein the user identified-event comprises submission of search engine queries identified in accordance with the request.

9. A server system for generating user requested statistics, comprising:
  memory;
  one or more processors;
  one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including:
  instructions for responding to a request from a client system for statistics of occurrence of a user-identified event, the responding including:
    instructions for accessing a database of events;
    instructions for identifying instances of occurrence of the user-identified event in at least a subset of the
    database of events to produce representative statistics of occurrence of the user-identified event within the entire database of events;
    instructions for generating a result including at least a portion of the representative statistics of occurrence of the user-identified event, the result including sub-results comprising at least representative statistics of occurrence of the user-identified event during each of a plurality of time periods, representative statistics of occurrence of the user-identified event in each of a plurality of geographic locations or representative statistics of occurrence of the user-identified event with respect to each of a plurality of languages; and
    instructions for sending to the client system for display at the client system the result, including the sub-results;
    wherein the sub-results comprise normalized representative statistics of occurrence of the user-identified event that indicate differences in popularity of the user-identified event during the plurality of time periods, differences in popularity of the user-identified event in the a plurality of geographic locations and/or differences in popularity of the user-identified event with respect to the plurality of languages.

10. The system of claim 9, wherein an event in the database of events includes one or more mini-sessions, each mini-session having a user identifier, a temporal value, a geographic location value, and one or more query strings.

11. The system of claim 10, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple time periods, the one or more programs including instructions for aggregating instances of the event in each of the plurality of respective time periods based on their associated mini-session temporal values.

12. The system of claim 10, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple geographical regions, the one or more programs including instructions for aggregating instances of the event in each of the plurality of respective geographical regions based on their associated mini-session geographic location values.

13. The system of claim 9, the one or more programs including:
  instructions for generating a series of sub-queries based on the received request and one or more of a timeframe of query submission, a geographic location of query submission, and a language of query submission, wherein the instances of occurrence of the user-identified event correspond to instances of the sub-queries within the database of events.

14. The system of claim 9, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods.

15. The system of claim 9, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods and each of said plurality of geographic locations.

16. The system of claim 9, wherein the user identified-event comprises submission of search engine queries identified in accordance with the request.

17. A non-transitory computer readable storage medium storing one or more programs for execution by one or more processors of a server computer system, the one or more programs comprising:
  instructions for responding to a request from a client system for statistics of occurrence of a user-identified event, the responding including:
    instructions for accessing a database of events;
    instructions for identifying instances of occurrence of the user-identified event in at least a subset of the database of events to produce representative statistics of occurrence of the user-identified event within the entire database of events;
    instructions for generating a result including at least a portion of the representative statistics of occurrence of the user-identified event, the result including sub-results comprising at least representative statistics of occurrence of the user-identified event, during each of a plurality of time periods, representative statistics of occurrence of the user-identified event in each of a plurality of geographic locations or representative statistics of occurrence of the user-identified event with respect to each of a plurality of languages; and
    instructions for sending to the client system for display at the client system the result, including the sub-results;
    wherein the sub-results comprise normalized representative statistics of occurrence of the user-identified event that indicate differences in popularity of the user-identified event during the plurality of time periods, differences in popularity of the user-identified event in the a plurality of geographic locations and/or differences in popularity of the user-identified event with respect to the plurality of languages.

18. The non-transitory computer readable storage medium of claim 17, wherein an event in the database of events includes one or more mini-sessions, each mini-session having a user identifier, a temporal value, a geographic location value, and one or more query strings.

19. The non-transitory computer readable storage medium of claim 18, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple time periods, the one or more programs including instructions for aggregating instances of the event in each of the plurality of respective time periods based on their associated mini-session temporal values.

20. The non-transitory computer readable storage medium of claim 18, wherein the representative statistics of occurrence of the user-identified event includes numbers of instances of the event within multiple geographical regions, the one or more programs including instructions for aggregating instances of the event in each of the plurality of respective geographical regions based on their associated mini-session geographic location values.

21. The non-transitory computer readable storage medium of claim 17, the one or more programs including:
  instructions for generating a series of sub-queries based on the received request and one or more of a timeframe of query submission, a geographic location of query submission, and a language of query submission, wherein the instances of occurrence of the user-identified event correspond to instances of the sub-queries within the database of events.

22. The non-transitory computer readable storage medium of claim 17, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods.

23. The non-transitory computer readable storage medium of claim 17, wherein the result includes sub-results, comprising representative statistics of occurrence of the user-identified event, for each of said plurality of time periods and each of said plurality of geographic locations.

24. The non-transitory computer readable storage medium of claim 17, wherein the user identified-event comprises submission of search engine queries identified in accordance with the request.



