US8171032

The patent US 8,171,032 B2, granted on May 1, 2012, relates to a system for providing customized electronic information, particularly for identifying desirable objects like news articles. It describes a method for constructing target profiles based on word frequency and user interest summaries, allowing for a ranked listing of relevant objects for users. Additionally, it includes a pseudonym proxy server to protect user privacy regarding their interest summaries.
US008171032B2

(12) United States Patent
     Herz

(10) Patent No.: US 8,171,032 B2
(45) Date of Patent: *May 1, 2012

(54) PROVIDING CUSTOMIZED ELECTRONIC INFORMATION

(75) Inventor: Frederick S. M. Herz, Davis, WV (US)

(73) Assignee: Pinpoint, Incorporated, Chicago, IL (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
    This patent is subject to a terminal disclaimer.

(21) Appl. No.: 12/221,507

(22) Filed: Aug. 4, 2008

(65) Prior Publication Data
     US 2008/0294584 A1   Nov. 27, 2008

Related U.S. Application Data

(60) Continuation of application No. 10/262,123, filed on Oct. 1, 2002, now Pat. No. 7,483,871, which is a division of application No. 08/985,732, filed on Dec. 5, 1997, now Pat. No. 6,460,036, which is a continuation-in-part of application No. 08/346,425, filed on Nov. 29, 1994, now Pat. No. 5,758,257.

(60) Provisional application No. 60/032,462, filed on Dec. 9, 1996.

(51) Int. Cl.
     G06F 7/30 (2006.01)
(52) U.S. Cl. 707/748
(58) Field of Classification Search: 707/736, 707/748, 749
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
4,170,782 A   10/1979   Miller
4,264,924 A   4/1981   Freeman
4,381,522 A   4/1983   Lambert
4,506,387 A   3/1985   Walter
4,521,806 A   6/1985   Abraham
4,529,870 A   7/1985   Chaum
4,558,302 A   12/1985   Welch
4,567,512 A   1/1986   Abraham
4,590,516 A   5/1986   Abraham
4,602,279 A   7/1986   Freeman
4,654,815 A   3/1987   Marin et al.
(Continued)

Primary Examiner: Thu-Nguyet Le
(74) Attorney, Agent, or Firm: Wolf, Greenfield & Sacks, P.C.

(57) ABSTRACT

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.

9 Claims, 13 Drawing Sheets
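The abstract describes target profiles built from how often each word appears in an article relative to its frequency across all articles, and a rank ordered listing produced by evaluating those profiles against a user's target profile interest summary. The following Python sketch illustrates one way this could look. The TF-IDF weighting, the cosine similarity measure, the representation of the interest summary as a list of search profiles, and every function name are assumptions made for illustration; the text above does not fix any of them.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def build_target_profiles(articles):
    """Return one {word: weight} profile per article (TF-IDF is an assumed weighting)."""
    docs = [Counter(tokenize(a)) for a in articles]
    n = len(docs)
    # Number of articles in which each word occurs at least once.
    doc_freq = Counter(word for d in docs for word in d)
    profiles = []
    for d in docs:
        total = sum(d.values())
        profiles.append(
            {w: (c / total) * math.log(n / doc_freq[w]) for w, c in d.items()}
        )
    return profiles


def cosine(p, q):
    # Cosine similarity between two sparse {word: weight} profiles.
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def rank_targets(profiles, interest_summary):
    """Score each target by its best match to any search profile; best first."""
    scores = [max(cosine(p, s) for s in interest_summary) for p in profiles]
    return sorted(range(len(profiles)), key=lambda i: scores[i], reverse=True)


articles = [
    "stock market prices fall",
    "football team wins match",
    "market prices rise on stock news",
]
profiles = build_target_profiles(articles)
# Assumed form of a target profile interest summary: a list of search profiles.
interest_summary = [{"stock": 1.0, "market": 1.0}]
ranking = rank_targets(profiles, interest_summary)
```

Taking the maximum over search profiles reflects the idea that a user has several separate areas of interest; averaging would be an equally plausible choice under these assumptions.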


U.S. PATENT DOCUMENTS 5,410,344 A 4/1995 Graves et al.


5,420,806 A 5/1995 Shou et al.
:: A : al 5420,807 A 5/1995 Shou et al.
470 1745 A 10/1987 Waterworth
4,704,725 A 1 1/1987 Harvey et al.
:4;478 A
5.444499 A
$22. ?iù,
8/1995 Saitoh
4,706,080 A 1 1/1987 Sincoskie III
4,706,121. A 1 1/1987 Young 5,446,891 A 8/1995 Kaplanet al.
4745,549 A 5/1988 Hashimoto 5,446.919 A 8/1995 Wilkins
4.751.578 A 68 1988 Rei 1 5,455,576 A 10/1995 Clark, III et al.
4759,063 A T. 1988 ????????t al. 5,455,862 A 10/1995 Hoskinson
4761684 A 8, 1988 ??????' ?????al 5,459,306 A 10/1995 Stein et al.
4.7639. A 8, 1988 ????????????al 5469.206 A 11/1995 Strubbe et al.
4814,746 A 3, 1989 Miller et al. 5,477.541. A 12/1995 White et al.
4,831,526 A 5/1989 Luchs et al 5.483.278 A 1/1996 Strubbe et al.
4.847.619 A
4,853,678 A
7/1989 Kato et al.
8/1989 Bishop, Jr. et al.
šiöiö
I- W
 SIG ÄRour et al.
4870,579 A 9/1989 Hey 5,519,858 A 5/1996 Walton et al.
4876,541. A
4,881.075 A
10/1989 Storer
11/1989 Weng
5.534.911 A
5,537,315 A
7/1996 Levitan
7/1996 Mitcham
4.885757. A 12/1989 Provence 5,537,586 A 7/1996 Amram et al.
4.906.991 A 3/1990 Fiala etal 5,539,734. A 7/1996 Burwell et al.
4914698 A
4,926,480 A
4/1990 Chaum
5/1990 Ch
5,541,638 A
5,541,911 A
7/1996 Story
7/1996 Nilakantan et al.
4947.430 A 8, 1990 Ä 5,546,370 A 8, 1996 Ishikawa
4965.825 A 10, 1990 H 3.?? al 5,559,549 A 9, 1996 Hendricks et al.
4977,455. A 13/1990 {{{lwyet 5,561,421 A 10/1996 Smith et al.
4979 is A 12/1990 ??t†p; 5,565,809 A 10/1996 Shou et al.
4987593 A 1, 1991 E. pir 5,583,763 A 12/1996 Atcheson et al.
4,988.99s. A 1, 1991 5.586,121 A 12/1996 Moura et al.
4,996.642 A 2, 1991 H 5,600,364 A 2f1997 Hendricks et al.
50o1478 ? 3, 1991 ? 5,600,798 A 2/1997 Cherukuri et al.
5.003.307 A 3, 1991 Vi, 1 5,613,209 A 3, 1997 Peterson et al.
5.003551 A 3/1991 K Âà: 5,614,940 A * 3/1997 Cobbley et al. ............... 725, 138
5.016,009 A 5, 1991 (???????????l. 5,617.565 A 4/1997 Augenbraun et al.
5,023,610 A
5.038.211 A
6/1991 Rubowal.
8/1991 Hallenbeck
5.638.457 A
5,642.484 A
6/1997 Deaton
6/1997 Harrison, III et al.
5,047,867 A 9, 1991 5,642,485 A 6/1997 Deaton et al.
5,049.881 A 9/1991 Gibsonet al. 5,644,723. A 7/1997 Deaton et al.
5,058, 137 A 10/1991 Shah 5,649,114 A 7/1997 Deaton et al.
5075,771 A 13, 1991 Hashimoto 5,649,186 A 7/1997 Ferguson
5,087,913. A 2f1992 Eastm 5,659,469 A 8, 1997 Deaton et al.
5.109414 ? 4, 1992 ? al al 5,675,662 A 10/1997 Deaton et al.
5.126.739 A 6, 1992 (W????????? i 5,680,116 A 10/1997 Hashimoto et al.
5.13039 A 7, 1992 Y??? et al. 5,687,322 A 1 1/1997 Deaton et al.
? ?. ?. ?. ? 5,689,648 .A
1 1/1997 Diaz et al
?A 2: ???i?rman et al. 5,696,965 A * 12/1997 Dedrick .......................... 707/10
5,144,663 A
5,151,789 A
9, 1992 Kudelski et al.
9, 1992 Y
5,710,884. A 1/1998
5,710,887 A
1/1998 Chelliah
Dedrick et al.
5.53591 A 10/1992 ?????? 5,717.923 A 2/1998 Dedrick
5.155.484 A 10/1992
5,155,591 A 10/1992 Wachob
?hambers, IV 72421 A.
5,724,567 A 1998 Pedrick
3/1998 Rose et al. ............................ 1.1
5,179,378 A
5.201.010 A
1/1993 Ranganathan et al.
4, 1993 Deaton et all i I MY
A SE i ????al.
eilly et al.
5,208877 A 5/1993 Murphveta 5,754,938 A 5/1998 Herz et al.
5235924 A 6, 1993 R a. 5,754,939 A 5/1998 Herz et al.
5,230,020 A 7, 1993 ÈÄRi?al 5,758,257 A 5/1998 Herz et al.
5,233,654 A.
5.237,157 A
8/1993 Harvey et al.
8/1993 Kaplan
????????
sy
A g ?.
edric
5243341 A 9/1993 Seroussi et al 5,798,785 A 8, 1998 Hendricks et al.
5.245.420.
5.245,656 A
A 9/1993 Harneveral
9, 1993 }; {{{??
5,802,054 A
5,805,156 A
9,9, 1998 Bellenger
1998 Richmond et al.
5,251.324 A 10/1993 McMullan, Jr. 5,809,478 A 9/1998 Greco et al.
5,262.776 A 11, 1993 Kutk 5,812,776 A 9, 1998 Gifford
5,276.736 A 1, 1994 ??ilitih, 5,835.087 A 11/1998 Herz et al.
5,301.109 A
5,309.437 A
4,5/1994
1994 Perlman
Landaueretetal.al. ????????????
sy sy
A ?? et al.
5,321,833 A 6/1994 Chang et al.   5,855,008 A 12/1998 Goldhaber et al.
5.323,240 A 6, 1994 Mi 5,856,981 A 1/1999 Voelker
5,331,554 A 7/1994 Graham et al.   5,873,066 A 2/1999 Underwood et al.
5,331,556 A 7/1994 Black, Jr. et al.   5,875,108 A 2/1999 Hoffberg et al.
5,341,427 A 8/1994 Hardy et al.   5,884,270 A 3/1999 Walker et al.
5,347,304 A 9/1994 Moura et al.   5,892,924 A 4/1999 Lyon et al.
5,351,075 A 9/1994 Herz et al.   5,903,559 A 5/1999 Acharya et al.
5,353,121 A 10/1994 Young et al.   5,933,811 A 8/1999 Angles et al.
5,371,794 A 12/1994 Diffie et al.   5,945,988 A 8/1999 Williams et al.
5,373,290 A 12/1994 Lempel et al.   5,960,411 A 9/1999 Hartman et al.
5,373,558 A 12/1994 Chaum   5,961,593 A 10/1999 Gabber et al.
5,374,951 A 12/1994 Welsh   5,991,735 A 11/1999 Gerace
5,388,165 A 2/1995 Deaton et al.   5,991,740 A 11/1999 Messer

6,014,090 A 1/2000 Rosen et al.   6,256,675 B1 7/2001 Rabinovich
6,029,141 A 2/2000 Bezos et al.   6,275,824 B1 8/2001 O'Flaherty et al.
6,047,327 A 4/2000 Tso et al.   6,377,972 B1 4/2002 Guo et al.
6,052,064 A 4/2000 Budnik et al.   6,381,465 B1 4/2002 Chern et al.
6,052,718 A 4/2000 Gifford   6,397,040 B1 5/2002 Titmuss et al.
6,055,513 A 4/2000 Katz et al.   6,456,852 B2 9/2002 Bar et al.
6,058,379 A 5/2000 Odom et al.   6,553,376 B1 4/2003 Lewis et al.
6,064,970 A 5/2000 McMillan et al.   6,563,910 B2 5/2003 Menard et al.
6,064,980 A 5/2000 Jacobi et al.   6,571,279 B1 5/2003 Herz et al.
6,088,722 A 7/2000 Herz et al.   6,591,103 B1 7/2003 Dunn et al.
6,134,658 A 10/2000 Multerer et al.   6,697,824 B1 2/2004 Bowman-Amuah
6,154,745 A 11/2000 Kari et al.   6,708,213 B1 3/2004 Bommaiah et al.
6,182,068 B1 * 1/2001 Culliss ... 707/721   7,242,988 B1 7/2007 Hoffberg et al.
6,199,045 B1 3/2001 Giniger et al.   2003/0101124 A1 5/2003 Semiret et al.
6,243,467 B1 6/2001 Reiter et al.
* cited by examiner
U.S. Patent May 1, 2012 Sheet 1 of 13 US 8,171,032 B2

U.S. Patent May 1, 2012 Sheet 2 of 13 US 8,171,032 B2

U.S. Patent May 1, 2012 Sheet 3 of 13 US 8,171,032 B2

FIG. 3
U.S. Patent May 1, 2012 Sheet 4 of 13 US 8,171,032 B2

FIG. 5

501 RETRIEVE NEW DOCUMENT FROM DOCUMENT SOURCE
502 CALCULATE DOCUMENT PROFILES
503 CLUSTER DOCUMENTS INTO A HIERARCHICAL CLUSTER
504 GENERATE LABELS FOR EACH CLUSTER
505 GENERATE MENUS FROM CLUSTER STRUCTURE AND LABELS
506 MONITOR DOCUMENT ACTIVITY AND ADJUST PROFILE
U.S. Patent May 1, 2012 Sheet 5 of 13 US 8,171,032 B2

FIG. 6

(a, b, c, d, e, f, g, h, i, j, k, l)

(a, b, c, d, e, f)
(a, b, d) (g, h, i, j, k)
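FIG. 6 shows target objects grouped into nested clusters. The description defines three summary quantities for a single cluster: a cluster profile (the average of the members' attributes), a cluster variance, and a cluster diameter (the maximum distance between the profiles of any two members). The Python sketch below computes them under stated assumptions: profiles are equal-length numeric vectors, distance is Euclidean, and variance is taken as the mean squared distance of members from the cluster profile. These choices and the function names are illustrative and are not taken from the patent.

```python
import math
from itertools import combinations


def distance(p, q):
    # Assumed distance measure: Euclidean distance between numeric profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_profile(profiles):
    # Average each attribute over all target objects in the cluster.
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cluster_variance(profiles):
    # Assumed definition: mean squared distance from the cluster profile.
    center = cluster_profile(profiles)
    return sum(distance(p, center) ** 2 for p in profiles) / len(profiles)


def cluster_diameter(profiles):
    # Maximum distance between the profiles of any two members.
    return max((distance(p, q) for p, q in combinations(profiles, 2)), default=0.0)


cluster = [[0.0, 0.0], [4.0, 0.0], [4.0, 3.0]]
center = cluster_profile(cluster)
variance = cluster_variance(cluster)
diameter = cluster_diameter(cluster)
```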
U.S. Patent May 1, 2012 Sheet 6 of 13 US 8,171,032 B2
U.S. Patent May 1, 2012 Sheet 7 of 13 US 8,171,032 B2

FIG. 10

1101 USER LOGIN
1102 USER ACCESSES NEWS
1103 COMPARE PROFILES AND SELECT ARTICLES
1104 PRESENT LIST TO USER
1105 USER SELECTS ARTICLE
1106 SERVER DELIVERS ARTICLE TO USER
1107 MONITOR WHICH ARTICLES ARE READ
1108 UPDATE USER PREFERENCE PROFILES
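The loop in FIG. 10 (compare profiles, present a list, monitor which articles are read, update the user's preference profile) can be sketched in Python as follows. This is only an illustration: the dot-product score, the exponential blending rule, the `rate` parameter, and all names are assumptions, since the figure does not say how the comparison or the update is computed.

```python
def score(interest, profile):
    # Assumed comparison for step 1103: dot product of sparse profiles.
    return sum(weight * profile.get(word, 0.0) for word, weight in interest.items())


def rank_articles(interest, article_profiles):
    # Steps 1103 and 1104: order article indices from best to worst match.
    return sorted(
        range(len(article_profiles)),
        key=lambda i: score(interest, article_profiles[i]),
        reverse=True,
    )


def update_interest(interest, read_profile, rate=0.5):
    # Steps 1107 and 1108: blend the profile of an article the user read
    # into the preference profile (assumed update rule).
    words = set(interest) | set(read_profile)
    return {
        w: (1 - rate) * interest.get(w, 0.0) + rate * read_profile.get(w, 0.0)
        for w in words
    }


article_profiles = [{"sports": 1.0}, {"finance": 1.0}]
interest = {"sports": 1.0}
before = rank_articles(interest, article_profiles)

# The user passively signals interest by reading the finance article twice.
for read_index in [1, 1]:
    interest = update_interest(interest, article_profiles[read_index])
after = rank_articles(interest, article_profiles)
```

No explicit rating is requested from the user in this sketch; the only feedback is which article was opened, in keeping with the passive monitoring the figure describes.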
U.S. Patent May 1, 2012 Sheet 8 of 13 US 8,171,032 B2

FIG. 12

1201 DETERMINE SET OF ATTRIBUTES FOR TARGET OBJECT
1202 ASSIGN ATTRIBUTE WEIGHT AS A FUNCTION OF THE USER
1203 COMPUTE WEIGHTED SUM OF SELECTED NORMATIVE ATTRIBUTES OF TARGET OBJECT
1204 RETRIEVE SUMMARIZED WEIGHTED RELEVANCE FEEDBACK DATA
1205 COMPUTE TOPICAL INTEREST OF TARGET OBJECT FOR SELECTED USER BASED ON RELEVANCE FEEDBACK FROM ALL USERS
U.S. Patent May 1, 2012 Sheet 9 of 13 US 8,171,032 B2

FIG. 13A

13A-00 INITIALIZE LIST OF TARGET OBJECTS TO THE EMPTY LIST
13A-01 INITIALIZE CURRENT TREE TO THE HIERARCHICAL CLUSTER TREE OF ALL OBJECTS
13A-02 SCAN CURRENT TREE FOR TARGET OBJECTS SIMILAR TO P, USING PROCESS 13B
13A-03 RETURN LIST OF TARGET OBJECTS
U.S. Patent May 1, 2012 Sheet 10 of 13 US 8,171,032 B2

FIG. 13B

13B-00 i = 1, n = NUMBER OF CHILD SUBTREES OF CURRENT TREE
RETRIEVE ith CHILD SUBTREE OF CURRENT TREE
13B-02 CALCULATE d(P, p) WHERE p IS PROFILE OF ith CHILD SUBTREE
DOES ith CHILD SUBTREE CONTAIN ONLY ONE TARGET OBJECT?
YES: 13B-05 ADD TARGET OBJECT TO LIST OF TARGET OBJECTS
SCAN ith CHILD SUBTREE FOR TARGET OBJECTS SIMILAR TO P BY INVOKING PROCESS 13B RECURSIVELY
RETURN
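The search of FIGS. 13A and 13B walks a hierarchical cluster tree, computes d(P, p) for each child subtree, adds single-object subtrees to the result list, and recurses into larger ones. The Python sketch below is a hedged reading of those boxes. The `ClusterNode` class, the Euclidean distance, and in particular the fixed `threshold` used to decide whether a subtree is close enough to P are assumptions; the extracted figure text does not show how that decision is made.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    profile: list                 # cluster profile, or the object's profile at a leaf
    target: str = None            # set only when the subtree holds one target object
    children: list = field(default_factory=list)


def d(P, p):
    # Assumed distance between profiles (math.dist needs Python 3.8+).
    return math.dist(P, p)


def scan(tree, P, threshold, found):
    """Process 13B: examine each child subtree of the current tree."""
    for child in tree.children:
        if d(P, child.profile) > threshold:   # assumed pruning test
            continue
        if child.target is not None:
            found.append(child.target)        # 13B-05
        else:
            scan(child, P, threshold, found)  # recursive invocation


def find_similar(root, P, threshold):
    found = []                        # 13A-00: start with the empty list
    scan(root, P, threshold, found)   # 13A-01 and 13A-02
    return found                      # 13A-03


near = ClusterNode(
    profile=[0.5, 0.0],
    children=[ClusterNode([0.0, 0.0], target="a"), ClusterNode([1.0, 0.0], target="b")],
)
far = ClusterNode(
    profile=[10.5, 10.0],
    children=[ClusterNode([10.0, 10.0], target="c"), ClusterNode([11.0, 10.0], target="d")],
)
root = ClusterNode(profile=[5.5, 5.0], children=[near, far])

wide = find_similar(root, [0.0, 0.0], threshold=2.0)
narrow = find_similar(root, [0.0, 0.0], threshold=0.6)
```

Pruning on the distance to a cluster profile can miss a nearby object inside a distant-looking cluster, so a practical version would probably widen the test using something like the cluster diameter; that refinement is left out here.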
U.S. Patent May 1, 2012 Sheet 11 of 13 US 8,171,032 B2

FIG. 14

USER GENERATES A PSEUDONYM
1401 PSEUDONYM IS ENCRYPTED
USER SELECTS SERVICE PROVIDER IDENTIFIER
1403 USER BLINDS PSEUDONYM & PROVIDER IDENTIFIER WITH RANDOM FACTOR
1404 TRANSMIT SIGNED MESSAGE TO VALIDATING AGENCY SERVER
1405 VALIDATION SERVER RECEIVES AND VERIFIES MESSAGE
1406 VALIDATION SERVER SIGNS PSEUDONYM AND RETURNS TO USER
USER IS IN RECEIPT OF VALIDATED PSEUDONYM
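Steps 1403 through 1406 of FIG. 14 have the user blind the pseudonym with a random factor, a validation server sign the blinded value, and the user end up with a validated pseudonym. One well-known way to achieve this is an RSA blind signature in the style of Chaum, sketched below in Python. Whether the patent uses exactly this scheme is not stated in the figure, so the scheme, the toy key, the integer encoding of the pseudonym, and all names are assumptions for illustration only.

```python
import math
import secrets

# Toy RSA key for the validating agency. Far too small for real use.
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)


def blind(message, n, e):
    # Step 1403: multiply the message by r^e for a random r coprime to n.
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            return (message * pow(r, e, n)) % n, r


def sign(blinded, d, n):
    # Step 1406: the server signs without learning the underlying message.
    return pow(blinded, d, n)


def unblind(blind_signature, r, n):
    # The user removes the random factor to obtain an ordinary RSA signature.
    return (blind_signature * pow(r, -1, n)) % n


def verify(message, signature, e, n):
    return pow(signature, e, n) == message % n


# Integer standing in for the encoded pseudonym and provider identifier.
pseudonym = 1234
blinded, r = blind(pseudonym, n, e)
signature = unblind(sign(blinded, d, n), r, n)
```

Because the server only ever sees `blinded`, it cannot later link the signed pseudonym to the request that produced it, which is the privacy property the proxy server arrangement relies on under this reading.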
U.S. Patent May 1, 2012 Sheet 12 of 13 US 8,171,032 B2

FIG. 15

1500 CLIENT PROCESSOR FORMS ENCRYPTED MESSAGE WITH SIGNED VALIDATED PSEUDONYM
1501 MESSAGE IS ROUTED TO PROXY SERVER
PROXY SERVER DECODES MESSAGE
1503 PROXY SERVER FORWARDS MESSAGE TO IDENTIFIED INFORMATION SERVER
1504 INFORMATION SERVER PROCESSES RECEIVED REQUEST
1505 INFORMATION SERVER TRANSMITS RESPONSE TO PROXY SERVER
1506 PROXY SERVER CREATES RESPONSE MESSAGE TO USER
1507 CLIENT PROCESSOR TABULATES USER INTEREST
1508 CLIENT PROCESSOR TRANSMITS MESSAGE TO PROXY SERVER TO UPDATE PROFILE INTEREST SUMMARY
U.S. Patent May 1, 2012 Sheet 13 of 13 US 8,171,032 B2

FIG. 16

1600 USER ESTABLISHES COMMUNICATION CONNECTION WITH NETWORK VENDOR
1601 USER ACTIVATES BROWSING PROGRAM
1602 USER INPUTS QUERY
1603 NETWORK VENDOR FORWARDS QUERY TO IDENTIFIED GENERAL INFORMATION SERVER
1604 GENERAL INFORMATION SERVER MATCHES QUERY PROFILE AGAINST CLUSTER PROFILES TO LOCATE SPECIFIC INFORMATION SERVER TO SERVE THE RECEIVED QUERY
1605 SPECIFIC INFORMATION SERVER DETERMINES DEGREE OF MATCH WITH SPECIFIC CLUSTER
1606 NETWORK VENDOR TRANSMITS COMPUTED DEGREE OF MATCH FOR EACH INFORMATION SERVER TO USER
1607 USER SELECTS IDENTIFIED CLUSTER
1608 CLIENT PROCESSOR TRANSMITS SELECTION TO NETWORK VENDOR
1609 NETWORK VENDOR RETRIEVES IDENTIFIED TARGET OBJECT AND TRANSMITS TO CLIENT PROCESSOR
PROVIDING CUSTOMIZED ELECTRONIC INFORMATION

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 10/262,123, entitled "CUSTOMIZED ELECTRONIC NEWSPAPERS AND ADVERTISEMENTS," filed on Oct. 1, 2002, now U.S. Pat. No. 7,483,871, which is a divisional of U.S. application Ser. No. 08/985,732, entitled "SYSTEM AND METHOD FOR PROVIDING CUSTOMIZED ELECTRONIC NEWSPAPERS AND TARGET ADVERTISEMENTS," filed on Dec. 5, 1997, now U.S. Pat. No. 6,460,036, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 60/032,462, entitled "SYSTEM FOR THE AUTOMATIC DETERMINATION OF CUSTOMIZED PRICES AND PROMOTIONS," filed on Dec. 9, 1996, and is a continuation-in-part of U.S. application Ser. No. 08/346,425, entitled "SYSTEM AND METHOD FOR SCHEDULING BROADCAST OF AND ACCESS TO VIDEO PROGRAMS AND OTHER DATA USING CUSTOMER PROFILES," filed on Nov. 29, 1994, now U.S. Pat. No. 5,758,257.

FIELD OF INVENTION

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically based proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.

Problem

It is a problem in the field of electronic media to enable a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy searching for the information. Electronic media, such as on-line information sources, provide a vast amount of information to users, typically in the form of "articles," each of which comprises a publication item or document that relates to a specific topic. The difficulty with electronic media is that the amount of information available to the user is overwhelming and the article repository systems that are connected on-line are not organized in a manner that sufficiently simplifies access to only the articles of interest to the user. Presently, a user either fails to access relevant articles because they are not easily identified or expends a significant amount of time and energy to conduct an exhaustive search of all articles to identify those most likely to be of interest to the user. Furthermore, even if the user conducts an exhaustive search, present information searching techniques do not necessarily accurately extract only the most relevant articles, but also present articles of marginal relevance due to the functional limitations of the information searching techniques. There is also no existing system which automatically estimates the inherent quality of an article or other target object to distinguish among a number of articles or target objects identified as of possible interest to a user.

Therefore, in the field of information retrieval, there is a long-standing need for a system which enables users to navigate through the plethora of information. With commercialization of communication networks, such as the Internet, the growth of available information has increased. Customization of the information delivery process to the user's unique tastes and interests is the ultimate solution to this problem. However, the techniques which have been proposed to date either only address the user's interests on a superficial level or provide greater depth and intelligence at the cost of unwanted demands on the user's time and energy. While many researchers have agreed that traditional methods have been lacking in this regard, no one to date has successfully addressed these problems in a holistic manner and provided a system that can fully learn and reflect the user's tastes and interests. This is particularly true in a practical commercial context, such as on-line services available on the Internet. There is a need for an information retrieval system that is largely or entirely passive, unobtrusive, undemanding of the user, and yet both precise and comprehensive in its ability to learn and truly represent the user's tastes and interests. Present information retrieval systems require the user to specify the desired information retrieval behavior through cumbersome interfaces.

Users may receive information on a computer network either by actively retrieving the information or by passively receiving information that is sent to them. Just as users of information retrieval systems face the problem of too much information, so do users who are targeted with electronic junk mail by individuals and organizations. An ideal system would protect the user from unsolicited advertising, both by automatically extracting only the most relevant messages received by electronic mail, and by preserving the confidentiality of the user's preferences, which should not be freely available to others on the network.

Researchers in the field of published article information retrieval have devoted considerable effort to finding efficient and accurate methods of allowing users to select articles of interest from a large set of articles. The most widely used methods of information retrieval are based on keyword matching: the user specifies a set of keywords which the user thinks are exclusively found in the desired articles and the information retrieval computer retrieves all articles which contain those keywords. Such methods are fast, but are notoriously unreliable, as users may not think of the right keywords, or the keywords may be used in unwanted articles in an irrelevant or unexpected context. As a result, the information retrieval computers retrieve many articles which are unwanted by the user. The logical combination of keywords and the use of wild-card search parameters help improve the accuracy of keyword searching but do not completely solve the problem of inaccurate search results. Starting in the 1960s, an alternate approach to information retrieval was developed: users were presented with an article and asked if it contained the information they wanted, or to quantify how close the information contained in the article was to what they
wanted. Each article was described by a profile which comprised either a list of the words in the article or, in more advanced systems, a table of word frequencies in the article. Since a measure of similarity between articles is the distance between their profiles, the measured similarity of article profiles can be used in article retrieval. For example, a user searching for information on a subject can write a short description of the desired information. The information retrieval computer generates an article profile for the request and then retrieves articles with profiles similar to the profile generated for the request. These requests can then be refined using "relevance feedback", where the user actively or passively rates the articles retrieved as to how close the information contained therein is to what is desired. The information retrieval computer then uses this relevance feedback information to refine the request profile and the process is repeated until the user either finds enough articles or tires of the search.

A number of researchers have looked at methods for selecting articles of most interest to users. An article titled "Social Information filtering: algorithms for automating word of mouth" was published at the CHI-95 Proceedings by Patti Maes et al. and describes the Ringo information retrieval system which recommends musical selections. The Ringo system requires active feedback from the users: users must manually specify how much they like or dislike each musical selection. The Ringo system maintains a complete list of users' ratings of music selections and makes recommendations by finding which selections were liked by multiple people. However, the Ringo system does not take advantage of any available descriptions of the music, such as structured descriptions in a database, or free text, such as that contained in music reviews. An article titled "Evolving agents for personalized information filtering", published at the Proc. 9th IEEE Conf. on AI for Applications by Sheth and Maes, described the use of agents for information filtering which use genetic algorithms to learn to categorize Usenet news articles. In this system, users must define news categories and the users actively indicate their opinion of the selected articles. Their system uses a list of keywords to represent sets of articles and the records of users' interests are updated using genetic algorithms.

A number of other research groups have looked at the automatic generation and labeling of clusters of articles for the purpose of browsing through the articles. A group at Xerox PARC published a paper titled "Scatter/gather: a cluster-based approach to browsing large article collections" at the 15th Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al. 1992). This group developed a method they call "scatter/gather" for performing information retrieval searches. In this method, a collection of articles is "scattered" into a small number of clusters, the user then chooses one or more of these clusters based on short summaries of the cluster. The selected clusters are then "gathered" into a subcollection, and then the process is repeated. Each iteration of this process is expected to produce a small, more focused collection. The cluster "summaries" are generated by picking those words which appear most frequently in the cluster and the titles of those articles closest to the center of the cluster. However, no feedback from users is collected or stored, so no performance improvement occurs over time.

Apple's Advanced Technology Group has developed an interface based on the concept of a "pile of articles". This interface is described in an article titled "A pile metaphor for supporting casual organization of information in Human factors in computer systems" published in CHI 92 Conf. Proc. 627-634 by Mander, R., G. Salomon and Y. Wong, 1992. Another article titled "Content awareness in a file system interface: implementing the pile metaphor for organizing information" was published in 16th Ann. Int'l SIGIR '93, ACM 260-269 by Rose E. D. et al. The Apple interface uses word frequencies to automatically file articles by picking the pile most similar to the article being filed. This system functions to cluster articles into subpiles, determine key words for indexing by picking the words with the largest TF/IDF (where TF is term (word) frequency and IDF is the inverse document frequency) and label piles by using the determined key words.

Numerous patents address information retrieval methods, but none develop records of a user's interest based on passive monitoring of which articles the user accesses. None of the systems described in these patents present computer architectures to allow fast retrieval of articles distributed across many computers. None of the systems described in these patents address issues of using such article retrieval and matching methods for purposes of commerce or of matching users with common interests or developing records of users' interests. U.S. Pat. No. 5,321,833 issued to Chang et al. teaches a method in which users choose terms to use in an information retrieval query, and specify the relative weightings of the different terms. The Chang system then calculates multiple levels of weighting criteria. U.S. Pat. No. 5,301,109 issued to Landauer et al. teaches a method for retrieving articles in a multiplicity of languages by constructing "latent vectors" (SVD or PCA vectors) which represent correlations between the different words. U.S. Pat. No. 5,331,554 issued to Graham et al. discloses a method for retrieving segments of a manual by comparing a query with nodes in a decision tree. U.S. Pat. No. 5,331,556 addresses techniques for deriving morphological part-of-speech information and thus to make use of the similarities of different forms of the same word (e.g. "article" and "articles").

Therefore, there presently is no information retrieval and delivery system operable in an electronic media environment that enables a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy.

Solution

The above-described problems are solved and a technical advance achieved in the field by the system for customized electronic identification of desirable objects in an electronic media environment, which system enables a user to access target objects of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy. Profiles of the target objects are stored on electronic media and are accessible via a data communication network. In many applications, the target objects are informational in nature, and so may themselves be stored on electronic media and be accessible via a data communication network.

Relevant definitions of terms for the purpose of this description include: (a.) an object available for access by the user, which may be either physical or electronic in nature, is termed a "target object", (b.) a digitally represented profile indicating that target object's attributes is termed a "target profile", (c.) the user looking for the target object is termed a "user", (d.) a profile holding that user's attributes, including age/zip code/etc. is termed a "user profile", (e.) a summary of digital profiles of target objects that a user likes and/or dislikes, is termed the "target profile interest summary" of that user, (f.) a profile consisting of a collection of attributes, such that a user likes target objects whose profiles are similar to this collection of attributes, is termed a "search profile" or in some contexts a "query" or "query profile," (g.) a specific embodiment of the target profile interest summary which comprises
a set of search profiles is termed the "search profile set" of a user, (h.) a collection of target objects with similar profiles, is termed a "cluster," (i.) an aggregate profile formed by averaging the attributes of all target objects in a cluster, termed a "cluster profile," (j.) a real number determined by calculating the statistical variance of the profiles of all target objects in a cluster, is termed a "cluster variance," (k.) a real number determined by calculating the maximum distance between the profiles of any two target objects in a cluster, is termed a "cluster diameter."

The system for electronic identification of desirable objects of the present invention automatically constructs both a target profile for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects available on the electronic media.

Because people have multiple interests, a target profile interest summary for a single user must represent multiple areas of interest, for example, by consisting of a set of individual search profiles, each of which identifies one of the user's areas of interest. Each user is presented with those target objects whose profiles most closely match the user's interests as described by the user's target profile interest summary. Users' target profile interest summaries are automatically updated on a continuing basis to reflect each user's changing interests. In addition, target objects can be grouped

user profiles, which do not include enough information to identify the individual users in question, in order to carry out standard kinds of demographic analysis and market research on the resulting database of partial user profiles.

In the preferred embodiment of the invention, the system for customized electronic identification of desirable objects uses a fundamental methodology for accurately and efficiently matching users and target objects by automatically calculating, using and updating profile information that describes both the users' interests and the target objects' characteristics. The target objects may be published articles, purchasable items, or even other people, and their properties are stored, and/or represented and/or denoted on the electronic media as (digital) data. Examples of target objects can include, but are not limited to: a newspaper story of potential interest, a movie to watch, an item to buy, e-mail to receive, or another person to correspond with. In all these cases, the information delivery process in the preferred embodiment is based on determining the similarity between a profile for the target object and the profiles of target objects for which the user (or a similar user) has provided positive feedback in the past. The individual data that describe a target object and constitute the target object's profile are herein termed "attributes" of the target object. Attributes may include, but are not limited to, the following: (1) long pieces of text (a newspaper story, a movie review, a product description or an advertisement), (2) short pieces of text (name of a movie's director, name of town from which an advertisement was placed, name of the language in which an article was written), (3) numeric measurements (price of a product, rating given to a movie, reading level of a book), (4) associations with other types of objects (list of actors in a movie, list of persons who have read a document). Any of these attributes, but especially the numeric ones, may correlate with the quality of the target object, such as measures of its popularity (how often it is
into clusters based on their similarity to each other, for accessed) or of user satisfaction (number of complaints
example, based on similarity of their topics in the case where received).
the target objects are published articles, and menus automati The preferred embodiment of the system for customized
cally generated for each cluster of target objects to allow users electronic identification of desirable objects operates in an
to navigate throughout the clusters and manually locate target 40 electronic media environment for accessing these target
objects of interest. For reasons of confidentiality and privacy, objects, which may be news, electronic mail, other published
a particular user may not wish to make public all of the documents, or product descriptions. The system in its broad
interests recorded in the user's target profile interest Sum est construction comprises three conceptual modules, which
mary, particularly when these interests are determined by the may be separate entities distributed across many implement
user's purchasing patterns. The user may desire that all or part 45 ing systems, or combined into a lesser Subset of physical
of the target profile interest Summary be kept confidential, entities. The specific embodiment of this system disclosed
Such as information relating to the user's political, religious, herein illustrates the use of a first module which automati
financial or purchasing behavior, indeed, confidentiality with cally constructs a “target profile' for each target object in the
respect to purchasing behavior is the user's legal right in electronic media based on various descriptive attributes of the
many states. It is therefore necessary that data in a user's 50 target object. A second module uses interest feedback from
target profile interest Summary be protected from unwanted users to construct a “target profile interest Summary for each
disclosure except with the users agreement. At the same user, for example in the form of a “search profile set con
time, the user's target profile interest Summaries must be sisting of a plurality of search profiles, each of which corre
accessible to the relevant servers that perform the matching of sponds to a single topic of high interest for the user. The
target objects to the users, if the benefit of this matching is 55 system further includes a profile processing module which
desired by both providers and consumers of the target objects. estimates each users interest in various target objects by
The disclosed system provides a solution to the privacy prob reference to the users’ target profile interest Summaries, for
lem by using a proxy server which acts as an intermediary example by comparing the target profiles of these target
between the information provider and the user. The proxy objects against the search profiles in users search profile sets,
server dissociates the user's true identity from the pseudonym 60 and generates for each user a customized rank-ordered listing
by the use of cryptographic techniques. The proxy server also of target objects most likely to be of interest to that user. Each
permits users to control access to their target profile interest user's target profile interest Summary is automatically
Summaries and/or user profiles, including provision of this updated on a continuing basis to reflect the user's changing
information to marketers and advertisers if they so desire, interests.
possibly in exchange for cash or other considerations. Mar 65 Target objects may be of various sorts, and it is sometimes
keters may purchase these profiles in order to target adver advantageous to use a single system that delivers and/or clus
tisements to particular users, or they may purchase partial ters target objects of several distinct sorts at once, in a unified
framework. For example, users who exhibit a strong interest in certain novels may also show an interest in certain movies, presumably of a similar nature. A system in which some target objects are novels and other target objects are movies can discover such a correlation and exploit it in order to group particular novels with particular movies, e.g., for clustering purposes, or to recommend the movies to a user who has demonstrated interest in the novels. Similarly, if users who exhibit an interest in certain World Wide Web sites also exhibit an interest in certain products, the system can match the products with the sites and thereby recommend to the marketers of those products that they place advertisements at those sites, e.g., in the form of hypertext links to their own sites.

The ability to measure the similarity of profiles describing target objects and a user's interests can be applied in two basic ways: filtering and browsing. Filtering is useful when large numbers of target objects are described in the electronic media space. These target objects can for example be articles that are received or potentially received by a user, who only has time to read a small fraction of them. For example, one might potentially receive all items on the AP news wire service, all items posted to a number of news groups, all advertisements in a set of newspapers, or all unsolicited electronic mail, but few people have the time or inclination to read so many articles. A filtering system in the system for customized electronic identification of desirable objects automatically selects a set of articles that the user is likely to wish to read. The accuracy of this filtering system improves over time by noting which articles the user reads and by generating a measurement of the depth to which the user reads each article. This information is then used to update the user's target profile interest summary. Browsing provides an alternate method of selecting a small subset of a large number of target objects, such as articles. Articles are organized so that users can actively navigate among groups of articles by moving from one group to a larger, more general group, to a smaller, more specific group, or to a closely related group. Each individual article forms a one-member group of its own, so that the user can navigate to and from individual articles as well as larger groups. The methods used by the system for customized electronic identification of desirable objects allow articles to be grouped into clusters and the clusters to be grouped and merged into larger and larger clusters. These hierarchies of clusters then form the basis for menuing and navigational systems to allow the rapid searching of large numbers of articles. This same clustering technique is applicable to any type of target objects that can be profiled on the electronic media.

There are a number of variations on the theme of developing and using profiles for article retrieval, with the basic implementation of an on-line news clipping service representing the preferred embodiment of the invention. Variations of this basic system are disclosed and comprise a system to filter electronic mail, an extension for retrieval of target objects such as purchasable items which may have more complex descriptions, a system to automatically build and alter menuing systems for browsing and searching through large numbers of target objects, and a system to construct virtual communities of people with common interests. These intelligent filters and browsers are necessary to provide a truly passive, intelligent system interface. A user interface that permits intuitive browsing and filtering represents for the first time an intelligent system for determining the affinities between users and target objects. The detailed, comprehensive target profiles and user-specific target profile interest summaries enable the system to provide responsive routing of specific queries for user information access. The information maps so produced and the application of users' target profile interest summaries to predict the information consumption patterns of a user allows for pre-caching of data at locations on the data communication network and at times that minimize the traffic flow in the communication network to thereby efficiently provide the desired information to the user and/or conserve valuable storage space by only storing those target objects (or segments thereof) which are relevant to the user's interests.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 illustrates in block diagram form a typical architecture of an electronic media system in which the system for customized electronic identification of desirable objects of the present invention can be implemented as part of a user server system;
FIG. 2 illustrates in block diagram form one embodiment of the system for customized electronic identification of desirable objects;
FIGS. 3 and 4 illustrate typical network trees;
FIG. 5 illustrates in flow diagram form a method for automatically generating article profiles and an associated hierarchical menu system;
FIGS. 6-9 illustrate examples of menu generating process;
FIG. 10 illustrates in flow diagram form the operational steps taken by the system for customized electronic identification of desirable objects to screen articles for a user;
FIG. 11 illustrates a hierarchical cluster tree example;
FIG. 12 illustrates in flow diagram form the process for determination of likelihood of interest by a specific user in a selected target object;
FIGS. 13A-B illustrate in flow diagram form the automatic clustering process;
FIG. 14 illustrates in flow diagram form the use of the pseudonymous server;
FIG. 15 illustrates in flow diagram form the use of the system for accessing information in response to a user query; and
FIG. 16 illustrates in flow diagram form the use of the system for accessing information in response to a user query when the system is a distributed network implementation.

DETAILED DESCRIPTION

Measuring Similarity

This section describes a general procedure for automatically measuring the similarity between two target objects, or, more precisely, between target profiles that are automatically generated for each of the two target objects. This similarity determination process is applicable to target objects in a wide variety of contexts. Target objects being compared can be, as an example but not limited to: textual documents, human beings, movies, or mutual funds. It is assumed that the target profiles which describe the target objects are stored at one or more locations in a data communication network on data storage media associated with a computer system. The computed similarity measurements serve as input to additional processes, which function to enable human users to locate desired target objects using a large computer system. These additional processes estimate a human user's interest in various target objects, or else cluster a plurality of target objects into logically coherent groups. The methods used by these additional processes might in principle be implemented on either a single computer or on a computer network. Jointly or separately, they form the underpinning for various sorts of database systems and information retrieval systems.
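
As an illustration of how similarity measurements might feed an interest-estimating process, the following minimal sketch ranks target objects by their distance to the nearest search profile in a user's search profile set. It assumes that profiles have already been reduced to equal-length lists of numbers, and it uses plain Euclidean distance purely as a stand-in for whatever metric an implementation adopts. The names `euclidean` and `rank_targets` and the sample data are hypothetical and do not come from the specification.

```python
import math

def euclidean(a, b):
    """Stand-in distance between two equal-length numeric profiles (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_targets(target_profiles, search_profiles, distance=euclidean):
    """Rank target objects by distance to the user's nearest search profile.

    target_profiles: dict mapping a target-object id to its numeric profile.
    search_profiles: list of numeric profiles, one per area of interest.
    Returns (id, distance) pairs, closest (most likely of interest) first.
    """
    scored = []
    for obj_id, profile in target_profiles.items():
        nearest = min(distance(profile, sp) for sp in search_profiles)
        scored.append((obj_id, nearest))
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical data: three articles and a user with two areas of interest.
articles = {
    "article-1": [0.9, 0.1, 0.0],
    "article-2": [0.0, 0.2, 0.8],
    "article-3": [0.4, 0.5, 0.1],
}
user_search_profiles = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_targets(articles, user_search_profiles)
for obj_id, dist in ranking:
    print(f"{obj_id}: {dist:.3f}")
```

Taking the minimum over the search profiles reflects the idea that a user has several distinct interests and that a target object only needs to lie close to one of them.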
Target Objects and Attributes

In classical Information Retrieval (IR) technology, the user is a literate human and the target objects in question are textual documents stored on data storage devices interconnected to the user via a computer network. That is, the target objects consist entirely of text, and so are digitally stored on the data storage devices within the computer network. However, there are other target object domains that present related retrieval problems that are not capable of being solved by present information retrieval technology which are applicable to targeting of articles and advertisements to readers of an on-line newspaper:
(a.) the user is a film buff and the target objects are movies available on videotape.
(b.) the user is a consumer and the target objects are used cars being sold.
(c.) the user is a consumer and the target objects are products being sold through promotional deals.
(d.) the user is an investor and the target objects are publicly traded stocks, mutual funds and/or real estate properties.
(e.) the user is a student and the target objects are classes being offered.
(f.) the user is an activist and the target objects are Congressional bills of potential concern.
(g.) the user is a net-surfer and the target objects are links to pages, servers, or newsgroups available on the World Wide Web which are linked from pages and articles of the on-line newspaper.
(h.) the user is a philanthropist and the target objects are charities.
(i.) the user is ill and the target objects are ads for medical specialists.
(j.) the user is an employee and the target objects are classifieds for potential employers.
(k.) the user is an employer and the target objects are classifieds for potential employees.
(l.) the user is a lonely heart and the target objects are classifieds for potential conversation partners.
(m.) the user is in search of an expert and the target objects are users, with known retrieval habits, of a document retrieval system.
(n.) the user is in need of insurance and the target objects are classifieds for insurance policy offers.

In all these cases, the user wishes to locate some small subset of the target objects, such as the target objects that the user most desires to rent, buy, investigate, meet, read, give mammograms to, insure, and so forth. The task is to help the user identify the most interesting target objects, where the user's interest in a target object is defined to be a numerical measurement of the user's relative desire to locate that object rather than others.

The generality of this problem motivates a general approach to solving the information retrieval problems noted above. It is assumed that many target objects are known to the system for customized electronic identification of desirable objects, and that specifically, the system stores (or has the ability to reconstruct) several pieces of information about each target object. These pieces of information are termed “attributes”: collectively, they are said to form a profile of the target object, or a “target profile.” For example, where the system for customized electronic identification of desirable objects is activated to identify selection of interest, a particular category of on-line products for review or purchase by the user, it can be appreciated that there are certain unique sets of attributes which are pertinent to the particular product category of choice. For the application as part of a movie critic column (where the system identifies movie titles and reviews which are most interesting to the users), the system is likely to be concerned with values of attributes such as these:
(a.) title of movie,
(b.) name of director,
(c.) Motion Picture Association of America (MPAA) child appropriateness rating (0=G, 1=PG, ...),
(d.) date of release,
(e.) number of stars granted by a particular critic,
(f.) number of stars granted by a second critic,
(g.) number of stars granted by a third critic,
For example, a customized financial news column may be presented to the user in the form of articles which are of interest to the user. In this case, however, those stocks which are most interesting to the user may accordingly be presented as well.
(h.) full text of review by the third critic,
(i.) list of customers who have previously rented this movie,
(j.) list of actors.

Each movie has a different set of values for these attributes. This example conveniently illustrates three kinds of attributes. Attributes c-g are numeric attributes, of the sort that might be found in a database record. It is evident that they can be used to help the user identify target objects (movies) of interest. For example, the user might previously have rented many Parental Guidance (PG) films, and many films made in the 1970s. This generalization is useful: new films with values for one or both attributes that are numerically similar to these (such as MPAA rating of 1, release date of 1975) are judged similar to the films the user already likes, and therefore of probable interest. Attributes a-b and h are textual attributes. They too are important for helping the user locate desired films. For example, perhaps the user has shown a past interest in films whose review text (attribute h) contains words like “chase,” “explosion,” “explosions,” “hero,” “gripping,” and “superb.” This generalization is again useful in identifying new films of interest. Attribute i is an associative attribute. It records associations between the target objects in this domain, namely movies, and ancillary target objects of an entirely different sort, namely humans. A good indication that the user wants to rent a particular movie is that the user has previously rented other movies with similar attribute values, and this holds for attribute i just as it does for attributes a-h. For example, if the user has often liked movies that customer C17 and customer C190 have rented, then the user may like other such movies, which have similar values for attribute i. Attribute j is another example of an associative attribute, recording associations between target objects and actors. Notice that any of these attributes can be made subject to authentication when the profile is constructed, through the use of digital signatures; for example, the target object could be accompanied by a digitally signed note from the MPAA, which note names the target object and specifies its authentic value for attribute c.

These three kinds of attributes are common: numeric, textual, and associative. In the classical information retrieval problem, where the target objects are documents (or more generally, coherent document sections extracted by a text segmentation method), the system might only consider a single, textual attribute when measuring similarity: the full text of the target object. However, a more sophisticated system would consider a longer target profile, including numeric and associative attributes:
(a.) full text of document (textual),
(b.) title (textual),
(c.) author (textual),
(d.) language in which document is written (textual),
(e.) date of creation (numeric),
(f.) date of last update (numeric),
(g.) length in words (numeric),
(h.) reading level (numeric),
(i.) quality of document as rated by a third party editorial agency (numeric),
(j.) list of other readers who have retrieved this document (associative).

As another domain example, consider a domain where the user is an advertiser and the target objects are potential customers. The system might store the following attributes for each target object (potential customer):
(a.) first two digits of zip code (textual),
(b.) first three digits of zip code (textual),
(c.) entire five-digit zip code (textual),
(d.) distance of residence from advertiser's nearest physical storefront (numeric),
(e.) annual family income (numeric),
(f.) number of children (numeric),
(g.) list of previous items purchased by this potential customer (associative),
(h.) list of filenames stored on this potential customer's client computer (associative),
(i.) list of movies rented by this potential customer (associative),
(j.) list of investments in this potential customer's investment portfolio (associative),
(k.) list of documents retrieved by this potential customer (associative),
(l.) written response to Rorschach inkblot test (textual),
(m.) multiple-choice responses by this customer to 20 self-image questions (20 textual attributes).

As always, the notion is that similar consumers buy similar products. It should be noted that diverse sorts of information are being used here to characterize consumers, from their consumption patterns to their literary tastes and psychological peculiarities, and that this fact illustrates both the flexibility and power of the system for customized electronic identification of desirable objects of the present invention. Diverse sorts of information can be used as attributes in other domains as well (as when physical, economic, psychological and interest-related questions are used to profile the applicants to a dating service, which is indeed a possible domain for the present system), and the advertiser domain is simply an example.

As a final domain example, consider a domain where the user is a stock market investor and the target objects are publicly traded corporations. A great many attributes might be used to characterize each corporation, including but not limited to the following:
(a.) type of business (textual),
(b.) corporate mission statement (textual),
(c.) number of employees during each of the last 10 years (ten separate numeric attributes),
(d.) percentage growth in number of employees during each of the last 10 years,
(e.) dividend payment issued in each of the last 40 quarters, as a percentage of current share price,
(f.) percentage appreciation of stock value during each of the last 40 quarters, list of shareholders (associative),
(g.) composite text of recent articles about the corporation in the financial press (textual).

It is worth noting some additional attributes that are of interest in some domains. In the case of documents and certain other domains, it is useful to know the source of each target object (for example, refereed journal article vs. UPI newswire article vs. Usenet newsgroup posting vs. question-answer pair from a question-and-answer list vs. tabloid newspaper article vs. . . . ); the source may be represented as a single-term textual attribute. Important associative attributes for a hypertext document are the list of documents that it links to, and the list of documents that link to it. Documents with similar citations are similar with respect to the former attribute, and documents that are cited in the same places are similar with respect to the latter. A convention may optionally be adopted that any document also links to itself. Especially in systems where users can choose whether or not to retrieve a target object, a target object's popularity (or circulation) can be usefully measured as a numeric attribute specifying the number of users who have retrieved that object. Related measurable numeric attributes that also indicate a kind of popularity include the number of replies to a target object, in the domain where target objects are messages posted to an electronic community such as a computer bulletin board or newsgroup, and the number of links leading to a target object, in the domain where target objects are interlinked hypertext documents on the World Wide Web or a similar system. A target object may also receive explicit numeric evaluations (another kind of numeric attribute) from various groups, such as the Motion Picture Association of America (MPAA), as above, which rates movies' appropriateness for children, or the American Medical Association, which might rate the accuracy and novelty of medical research papers, or a random survey sample of users (chosen from all users or a selected set of experts), who could be asked to rate nearly anything. Certain other types of evaluation, which also yield numeric attributes, may be carried out mechanically. For example, the difficulty of reading a text can be assessed by standard procedures that count word and sentence lengths, while the vulgarity of a text could be defined as (say) the number of vulgar words it contains, and the expertise of a text could be crudely assessed by counting the number of similar texts its author had previously retrieved and read using the invention, perhaps confining this count to texts that have high approval ratings from critics. Finally, it is possible to synthesize certain textual attributes mechanically, for example to reconstruct the script of a movie by applying speech recognition techniques to its soundtrack or by applying optical character recognition techniques to its closed-caption subtitles.

Decomposing Complex Attributes

Although textual and associative attributes are large and complex pieces of data, for information retrieval purposes they can be decomposed into smaller, simpler numeric attributes. This means that any set of attributes can be replaced by a (usually larger) set of numeric attributes, and hence that any profile can be represented as a vector of numbers denoting the values of these numeric attributes. In particular, a textual attribute, such as the full text of a movie review, can be replaced by a collection of numeric attributes that represent scores to denote the presence and significance of the words “aardvark,” “aback,” “abacus,” and so on through “zymurgy” in that text. The score of a word in a text may be defined in numerous ways. The simplest definition is that the score is the rate of the word in the text, which is computed by computing the number of times the word occurs in the text, and dividing this number by the total number of words in the text. This sort of score is often called the “term frequency” (TF) of the word. The definition of term frequency may optionally be modified to weight different portions of the text unequally: for example, any occurrence of a word in the text's title might be counted as a 3-fold or more generally k-fold occurrence (as if the title had been repeated k times
within the text), in order to reflect a heuristic assumption that the words in the title are particularly important indicators of the text's content or topic.

However, for lengthy textual attributes, such as the text of an entire document, the score of a word is typically defined to be not merely its term frequency, but its term frequency multiplied by the negated logarithm of the word's “global frequency,” as measured with respect to the textual attribute in question. The global frequency of a word, which effectively measures the word's uninformativeness, is a fraction between 0 and 1, defined to be the fraction of all target objects for which the textual attribute in question contains this word. This adjusted score is often known in the art as TF/IDF (“term frequency times inverse document frequency”). When global frequency of a word is taken into account in this way, the common, uninformative words have scores comparatively close to zero, no matter how often or rarely they appear in the text. Thus, their rate has little influence on the object's target profile. Alternative methods of calculating word scores include latent semantic indexing or probabilistic models.

Instead of breaking the text into its component words, one could alternatively break the text into overlapping word bigrams (sequences of 2 adjacent words), or more generally, word n-grams. These word n-grams may be scored in the same way as individual words. Another possibility is to use character n-grams. For example, this sentence contains a sequence of overlapping character 5-grams which starts “for e”, “or ex”, “r exa”, “ exam”, “examp”, etc. The sentence may be characterized, imprecisely but usefully, by the score of each possible character 5-gram (“aaaaa”, “aaaab”, . . . “zzzzz”) in the sentence. Conceptually speaking, in the character 5-gram case, the textual attribute would be decomposed into at least 26^5 = 11,881,376 numeric attributes. Of course, for a given target object, most of these numeric attributes have values of 0, since most 5-grams do not appear in the target object attributes. These zero values need not be stored anywhere. For purposes of digital storage, the value of a textual attribute could be characterized by storing the set of character 5-grams that actually do appear in the text, together with the nonzero score of each one. Any 5-gram that is not included in the set can be assumed to have a score of zero. The decomposition of textual attributes is not limited to attributes whose values are expected to be long texts. A simple, one-term textual attribute can be replaced by a collection of numeric attributes in exactly the same way. Consider again the case where the target objects are movies. The “name of director” attribute, which is textual, can be replaced by numeric attributes giving the scores for “Federico-Fellini,” “Woody-Allen,” “Terence-Davies,” and so forth, in that attribute. For these one-term textual attributes, the score of a word is usually defined to be its rate in the text, without any consideration of global frequency. Note that under these conditions, one of the scores is 1, while the other scores are 0 and need not be stored. For example, if Davies did direct the film, then it is “Terence-Davies” whose score is 1, since “Terence-Davies”

tive attribute may be decomposed into a number of component associations. For instance, in a domain where the target objects are movies, a typical associative attribute used in profiling a movie would be a list of customers who have rented that movie. This list can be replaced by a collection of numeric attributes, which give the “association scores” between the movie and each of the customers known to the system. For example, the 165th such numeric attribute would be the association score between the movie and customer #165, where the association score is defined to be 1 if customer #165 has previously rented the movie, and 0 otherwise. In a subtler refinement, this association score could be defined to be the degree of interest, possibly zero, that customer #165 exhibited in the movie, as determined by relevance feedback (as described below). As another example, in a domain where target objects are companies, an associative attribute indicating the major shareholders of the company would be decomposed into a collection of association scores, each of which would indicate the percentage of the company (possibly zero) owned by some particular individual or corporate body. Just as with the term scores used in decomposing lengthy textual attributes, each association score may optionally be adjusted by a multiplicative factor: for example, the association score between a movie and customer #165 might be multiplied by the negated logarithm of the “global frequency” of customer #165, i.e., the fraction of all movies that have been rented by customer #165. Just as with the term scores used in decomposing textual attributes, most association scores found when decomposing a particular value of an associative attribute are zero, and a similar economy of storage may be gained in exactly the same manner by storing a list of only those ancillary objects with which the target object has a nonzero association score, together with their respective association scores.

Similarity Measures

What does it mean for two target objects to be similar? More precisely, how should one measure the degree of similarity? Many approaches are possible and any reasonable metric that can be computed over the set of target object profiles can be used, where target objects are considered to be similar if the distance between their profiles is small according to this metric. Thus, the following preferred embodiment of a target object similarity measurement system has many variations.

First, define the distance between two values of a given attribute according to whether the attribute is a numeric, associative, or textual attribute. If the attribute is numeric, then the distance between two values of the attribute is the absolute value of the difference between the two values. (Other definitions are also possible: for example, the distance between prices p1 and p2 might be defined by |p1-p2|/(max(p1,p2)+1), to recognize that when it comes to customer interest, $5000 and $5020 are very similar, whereas $3 and $23 are not.) If the attribute is associative, then its value V may be decomposed as described above into a collection of
constitutes 100% of the words in the textual value of the real numbers, representing the association scores between the
“name of director' attribute. It might seem that nothing has target object in question and various ancillary objects. V may
been gained over simply regarding the textual attribute as therefore be regarded as a vector with components V, V, V,
having the string value “Terence-Davies.” However, the trick etc., representing the association scores between the object
of decomposing every non-numeric attribute into a collection 60 and ancillary objects 1, 2, 3, etc., respectively. The distance
of numeric attributes proves useful for the clustering and between two vector values V and U of an associative attribute
decision tree methods described later, which require the is then computed using the angle distance measure, arccos
attribute values of different objects to be averaged and/or (VU/sqrt((Vv)(UU)). (Note that the three inner products in
ordinally ranked. Only numeric attributes can be averaged or this expression have the form XYXY,+X, Y,+X
ranked in this way. 65 Y+..., and that for efficient computation, terms of the form
Just as a textual attribute may be decomposed into a num X, Y, may be omitted from this sum if either of the scores X,
ber of component terms (letter or word n-grams), an associa and Y, is zero.) Finally, if the attribute is textual, then its value
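As an illustration of the numeric and angle distance measures defined under "Similarity Measures", the following is a minimal Python sketch, not the patented implementation. It assumes that each associative or textual attribute value is stored as a dictionary holding only its nonzero scores, so that zero terms are skipped automatically; the function names are hypothetical.

```python
import math


def inner(x, y):
    """Sparse inner product: only keys stored in both dicts contribute."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    return sum(score * y[key] for key, score in x.items() if key in y)


def angle_distance(v, u):
    """Angle in radians between two sparse score vectors (dicts of nonzero scores)."""
    denom = math.sqrt(inner(v, v) * inner(u, u))
    if denom == 0.0:
        raise ValueError("angle is undefined for an all-zero vector")
    cosine = inner(v, u) / denom
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))


def numeric_distance(a, b):
    """Default distance between two numeric attribute values."""
    return abs(a - b)


def relative_price_distance(p1, p2):
    """Alternative price distance |p1 - p2| / (max(p1, p2) + 1)."""
    return abs(p1 - p2) / (max(p1, p2) + 1)
```

Under this sketch, two vectors with no ancillary objects in common are at an angle of 90 degrees, and `relative_price_distance(5000, 5020)` is far smaller than `relative_price_distance(3, 23)`, matching the price example in the text.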
US 8,171,032 B2
15 16
V may be decomposed as described above into a collection of real numbers, representing the scores of various word n-grams or character n-grams in the text. Then the value V may again be regarded as a vector, and the distance between two values is again defined via the angle distance measure. Other similarity metrics between two vectors, such as the dice measure, may be used instead. It happens that the obvious alternative metric, Euclidean distance, does not work well: even similar texts tend not to overlap substantially in the content words they use, so that texts encountered in practice are all substantially orthogonal to each other, assuming that TF/IDF scores are used to reduce the influence of non-content words. The scores of two words in a textual attribute vector may be correlated; for example, "Kennedy" and "JFK" tend to appear in the same documents.

Thus it may be advisable to alter the text somewhat before computing the scores of terms in the text, by using a synonym dictionary that groups together similar words. The effect of this optional pre-alteration is that two texts using related words are measured to be as similar as if they had actually used the same words. One technique is to augment the set of words actually found in the article with a set of synonyms or other words which tend to co-occur with the words in the article, so that "Kennedy" could be added to every article that mentions "JFK." Alternatively, words found in the article may be wholly replaced by synonyms, so that "JFK" might be replaced by "Kennedy" or by "John F. Kennedy" wherever it appears. In either case, the result is that documents about Kennedy and documents about JFK are adjudged similar. The synonym dictionary may be sensitive to the topic of the document as a whole; for example, it may recognize that "crane" is likely to have a different synonym in a document that mentions birds than in a document that mentions construction. A related technique is to replace each word by its morphological stem, so that "staple", "stapler", and "staples" are all replaced by "staple." Common function words ("a", "and", "the", ...) can influence the calculated similarity of texts without regard to their topics, and so are typically removed from the text before the scores of terms in the text are computed. A more general approach to recognizing synonyms is to use a revised measure of the distance between textual attribute vectors V and U, namely $\arccos\left(AV(AU)^t / \sqrt{AV(AV)^t \, AU(AU)^t}\right)$, where the matrix A is the dimensionality-reducing linear transformation (or an approximation thereto) determined by collecting the vector values of the textual attribute, for all target objects known to the system, and applying singular value decomposition to the resulting collection. The same approach can be applied to the vector values of associative attributes.

The above definitions allow us to determine how close together two target objects are with respect to a single attribute, whether numeric, associative, or textual. The distance between two target objects X and Y with respect to their entire multi-attribute profiles $P_X$ and $P_Y$ is then denoted d(X, Y) or $d(P_X, P_Y)$ and defined as:

$$\Big( \big((\text{distance with respect to attribute } a)(\text{weight of attribute } a)\big)^k + \big((\text{distance with respect to attribute } b)(\text{weight of attribute } b)\big)^k + \big((\text{distance with respect to attribute } c)(\text{weight of attribute } c)\big)^k + \ldots \Big)^{1/k}$$

where k is a fixed positive real number, typically 2, and the weights are non-negative real numbers indicating the relative importance of the various attributes. For example, if the target objects are consumer goods, and the weight of the "color" attribute is comparatively very small, then color is not a consideration in determining similarity: a user who likes a brown massage cushion is predicted to show equal interest in the same cushion manufactured in blue, and vice versa. On the other hand, if the weight of the "color" attribute is comparatively very high, then users are predicted to show interest primarily in products whose colors they have liked in the past: a brown massage cushion and a blue massage cushion are not at all the same kind of target object, however similar in other attributes, and a good experience with one does not by itself inspire much interest in the other.

Target objects may be of various sorts, and it is sometimes advantageous to use a single system that is able to compare target objects of distinct sorts. For example, in a system where some target objects are novels while other target objects are movies, it is desirable to judge a novel and a movie similar if their profiles show that similar users like them (an associative attribute). However, it is important to note that certain attributes specified in the movie's target profile are undefined in the novel's target profile, and vice versa: a novel has no "cast list" associative attribute and a movie has no "reading level" numeric attribute. In general, a system in which target objects fall into distinct sorts may sometimes have to measure the similarity of two target objects for which somewhat different sets of attributes are defined. This requires an extension to the distance metric d(*, *) defined above. In certain applications, it is sufficient when carrying out such a comparison simply to disregard attributes that are not defined for both target objects: this allows a cluster of novels to be matched with the most similar cluster of movies, for example, by considering only those attributes that novels and movies have in common.

However, while this method allows comparisons between (say) novels and movies, it does not define a proper metric over the combined space of novels and movies and therefore does not allow clustering to be applied to the set of all target objects. When necessary for clustering or other purposes, a metric that allows comparison of any two target objects (whether of the same or different sorts) can be defined as follows. If a is an attribute, then let Max(a) be an upper bound on the distance between two values of attribute a; notice that if attribute a is an associative or textual attribute, this distance is an angle determined by arccos, so that Max(a) may be chosen to be 180 degrees, while if attribute a is a numeric attribute, a sufficiently large number must be selected by the system designers. The distance between two values of attribute a is given as before in the case where both values are defined; the distance between two undefined values is taken to be zero; finally, the distance between a defined value and an undefined value is always taken to be Max(a)/2. This allows us to determine how close together two target objects are with respect to an attribute a, even if attribute a does not have a defined value for both target objects. The distance d(*, *) between two target objects with respect to their entire multi-attribute profiles is then given in terms of these individual attribute distances exactly as before. It is assumed that one attribute in such a system specifies the sort of target object ("movie", "novel", etc.), and that this attribute may be highly weighted if target objects of different sorts are considered to be very different despite any attributes they may have in common.

Utilizing the Similarity Measurement

Matching Buyers and Sellers

A simple application of the similarity measurement is a system to match buyers with sellers in small-volume markets, such as used cars and other used goods, artwork, or employment. Sellers submit profiles of the goods (target objects) they want to sell, and buyers submit profiles of the goods (target objects) they want to buy. Participants may submit or withdraw these profiles at any time. The system for customized
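The weighted multi-attribute distance d(X, Y), together with the Max(a)/2 rule for attributes that are undefined in one of the two profiles, could be sketched in Python roughly as follows. This is an illustrative sketch under simplifying assumptions: only numeric attributes are handled, an undefined attribute is simply absent from the profile dictionary, and the attribute names, weights, and Max(a) values are hypothetical.

```python
def attribute_distance(a, b, max_dist):
    """Distance for one numeric attribute; None marks an undefined value."""
    if a is None and b is None:
        return 0.0               # two undefined values: distance zero
    if a is None or b is None:
        return max_dist / 2.0    # defined versus undefined: Max(a)/2
    return abs(a - b)            # both defined: absolute difference


def profile_distance(p, q, weights, max_dists, k=2.0):
    """((d_a * w_a)^k + (d_b * w_b)^k + ...)^(1/k) over all weighted attributes."""
    total = 0.0
    for name, weight in weights.items():
        d = attribute_distance(p.get(name), q.get(name), max_dists[name])
        total += (d * weight) ** k
    return total ** (1.0 / k)


# Hypothetical example: a novel and a movie that share only the "price" attribute.
novel = {"price": 10.0, "reading_level": 8.0}
movie = {"price": 13.0}
weights = {"price": 1.0, "reading_level": 0.5}
max_dists = {"price": 100.0, "reading_level": 12.0}
distance = profile_distance(novel, movie, weights, max_dists)
```

Here the "reading_level" attribute contributes Max(a)/2 = 6 scaled by its weight, so the novel and the movie remain comparable even though their profiles define different attributes.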
electronic identification of desirable objects computes the similarities between seller-submitted profiles and buyer-submitted profiles, and when two profiles match closely (i.e., the similarity is above a threshold), the corresponding seller and buyer are notified of each other's identities. To prevent users from being flooded with responses, it may be desirable to limit the number of notifications each user receives to a fixed number, such as ten per day.

Filtering: Relevance Feedback

A filtering system is a device that can search through many target objects and estimate a given user's interest in each target object, so as to identify those that are of greatest interest to the user. The filtering system uses relevance feedback to refine its knowledge of the user's interests: whenever the filtering system identifies a target object as potentially interesting to a user, the user (if an on-line user) provides feedback as to whether or not that target object really is of interest. Such feedback is stored long-term in summarized form, as part of a database of user feedback information, and may be provided either actively or passively. In active feedback, the user explicitly indicates his or her interest, for instance, on a scale of -2 (active distaste) through 0 (no special interest) to 10 (great interest). In passive feedback, the system infers the user's interest from the user's behavior. For example, if target objects are textual documents, the system might monitor which documents the user chooses to read, or not to read, and how much time the user spends reading them. A typical formula for assessing interest in a document via passive feedback, in this domain, on a scale of 0 to 10, might be:
+2 if the second page is viewed,
+2 if all pages are viewed,
+2 if more than 30 seconds was spent viewing the document,
+2 if more than one minute was spent viewing the document,
+2 if the minutes spent viewing the document are greater than half the number of pages.
If the target objects are electronic mail messages, interest points might also be added in the case of a particularly lengthy or particularly prompt reply. If the target objects are purchasable goods, interest points might be added for target objects that the user actually purchases, with further points in the case of a large-quantity or high-price purchase. In any domain, further points might be added for target objects that the user accesses early in a session, on the grounds that users access the objects that most interest them first. Other potential sources of passive feedback include an electronic measurement of the extent to which the user's pupils dilate while the user views the target object or a description of the target object. It is possible to combine active and passive feedback. One option is to take a weighted average of the two ratings. Another option is to use passive feedback by default, but to allow the user to examine and actively modify the passive feedback score. In the scenario above, for instance, an uninteresting article may sometimes remain on the display device for a long period while the user is engaged in unrelated business; the passive feedback score is then inappropriately high, and the user may wish to correct it before continuing. In the preferred embodiment of the invention, a visual indicator, such as a sliding bar or indicator needle on the user's screen, can be used to continuously display the passive feedback score estimated by the system for the target object being viewed, unless the user has manually adjusted the indicator by a mouse operation or other means in order to reflect a different score for this target object, after which the indicator displays the active feedback score selected by the user, and this active feedback score is used by the system instead of the passive feedback score. In a variation, the user cannot see or adjust the indicator until just after the user has finished viewing the target object. Regardless of how a user's feedback is computed, it is stored long-term as part of that user's target profile interest summary.

Filtering: Determining Topical Interest Through Similarity

Relevance feedback only determines the user's interest in certain target objects: namely, the target objects that the user has actually had the opportunity to evaluate (whether actively or passively). For target objects that the user has not yet seen, the filtering system must estimate the user's interest. This estimation task is the heart of the filtering problem, and the reason that the similarity measurement is important. More concretely, the preferred embodiment of the filtering system is a news clipping service that periodically presents the user with news articles of potential interest. The user provides active and/or passive feedback to the system relating to these presented articles. However, the system does not have feedback information from the user for articles that have never been presented to the user, such as new articles that have just been added to the database, or old articles that the system chose not to present to the user. Similarly, in the dating service domain where target objects are prospective romantic partners, the system has only received feedback on old flames, not on prospective new loves.

As shown in flow diagram form in FIG. 12, the evaluation of the likelihood of interest in a particular target object for a specific user can automatically be computed. The interest that a given target object X holds for a user U is assumed to be a sum of two quantities: q(U, X), the intrinsic "quality" of X, plus f(U, X), the "topical interest" that users like U have in target objects like X. For any target object X, the intrinsic quality measure q(U, X) is easily estimated at steps 1201-1203 directly from numeric attributes of the target object X. The computation process begins at step 1201, where certain designated numeric attributes of target object X are specifically selected, which attributes by their very nature should be positively or negatively correlated with users' interest. Such attributes, termed "quality attributes," have the normative property that the higher (or in some cases lower) their value, the more interesting a user is expected to find them. Quality attributes of target object X may include, but are not limited to, target object X's popularity among users in general, the rating a particular reviewer has given target object X, the age (time since authorship, also known as outdatedness) of target object X, the number of vulgar words used in target object X, the price of target object X, and the amount of money that the company selling target object X has donated to the user's favorite charity. At step 1202, each of the selected attributes is multiplied by a positive or negative weight indicative of the strength of user U's preference for those target objects that have high values for this attribute, which weight must be retrieved from a data file storing quality attribute weights for the selected user. At step 1203, a weighted sum of the identified weighted selected attributes is computed to determine the intrinsic quality measure q(U, X). At step 1204, the summarized weighted relevance feedback data is retrieved, wherein some relevance feedback points are weighted more heavily than others and the stored relevance data can be summarized to some degree, for example by the use of search profile sets. The more difficult part of determining user U's interest in target object X is to find or compute at step 1205 the value of f(U, X), which denotes the topical interest that users like U generally have in target objects like X. The method of determining a user's interest relies on the following heuristic: when X and Y are similar target objects (have similar attributes), and U and V are similar users (have similar
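The example passive-feedback formula for documents and the weighted-sum quality measure q(U, X) of steps 1201-1203 can be illustrated with the short Python sketch below. It is a sketch under stated assumptions rather than the system's actual code: pages are assumed to be viewed in order, so "the second page is viewed" is modeled as at least two pages viewed, and the attribute names and weights are hypothetical.

```python
def passive_feedback_score(pages_viewed, total_pages, seconds_viewed):
    """Passive interest score on a 0-to-10 scale, two points per satisfied rule."""
    score = 0
    if pages_viewed >= 2:                          # the second page is viewed
        score += 2
    if pages_viewed >= total_pages:                # all pages are viewed
        score += 2
    if seconds_viewed > 30:                        # more than 30 seconds spent
        score += 2
    if seconds_viewed > 60:                        # more than one minute spent
        score += 2
    if seconds_viewed / 60.0 > total_pages / 2.0:  # minutes > half the page count
        score += 2
    return score


def quality(quality_attributes, user_weights):
    """q(U, X): sum of each selected quality attribute times user U's weight for it."""
    return sum(user_weights[name] * value
               for name, value in quality_attributes.items()
               if name in user_weights)


# Hypothetical usage: a 4-page document skimmed for 45 seconds, then read in full.
skimmed = passive_feedback_score(pages_viewed=2, total_pages=4, seconds_viewed=45)
read_fully = passive_feedback_score(pages_viewed=4, total_pages=4, seconds_viewed=200)
q = quality({"popularity": 0.8, "vulgar_words": 3.0},
            {"popularity": 2.0, "vulgar_words": -0.5})
```

A negative weight, as for the hypothetical "vulgar_words" attribute here, lowers q(U, X) for a user who dislikes vulgarity.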
attributes), then topical interest f(U, X) is predicted to have a similar value to the value of topical interest f(V, Y). This heuristic leads to an effective method because estimated values of the topical interest function f(*, *) are actually known for certain arguments to that function: specifically, if user V has provided a relevance-feedback rating of r(V, Y) for target object Y, then insofar as that rating represents user V's true interest in target object Y, we have r(V, Y) = q(V, Y) + f(V, Y) and can estimate f(V, Y) as r(V, Y) - q(V, Y). Thus, the problem of estimating topical interest at all points becomes a problem of interpolating among these estimates of topical interest at selected points, such as the feedback estimate of f(V, Y) as r(V, Y) - q(V, Y). This interpolation can be accomplished with any standard smoothing technique, using as input the known point estimates of the value of the topical interest function f(*, *), and determining as output a function that approximates the entire topical interest function f(*, *).

Not all point estimates of the topical interest function f(*, *) should be given equal weight as inputs to the smoothing algorithm. Since passive relevance feedback is less reliable than active relevance feedback, point estimates made from passive relevance feedback should be weighted less heavily than point estimates made from active relevance feedback, or even not used at all. In most domains, a user's interests may change over time and, therefore, estimates of topical interest that derive from more recent feedback should also be weighted more heavily. A user's interests may vary according to mood, so estimates of topical interest that derive from the current session should be weighted more heavily for the duration of the current session, and past estimates of topical interest made at approximately the current time of day or on the current weekday should be weighted more heavily. Finally, in domains where users are trying to locate target objects of long-term interest (investments, romantic partners, pen pals, employers, employees, suppliers, service providers) from the possibly meager information provided by the target profiles, the users are usually not in a position to provide reliable immediate feedback on a target object, but can provide reliable feedback at a later date. An estimate of topical interest f(V, Y) should be weighted more heavily if user V has had more experience with target object Y. Indeed, a useful strategy is for the system to track long-term feedback for such target objects. For example, if target profile Y was created in 1990 to describe a particular investment that was available in 1990, and that was purchased in 1990 by user V, then the system solicits relevance feedback from user V in the years 1990, 1991, 1992, 1993, 1994, 1995, etc., and treats these as successively stronger indications of user V's true interest in target profile Y, and thus as indications of user V's likely interest in new investments whose current profiles resemble the original 1990 investment profile Y. In particular, if in 1994 and 1995 user V is well-disposed toward his or her 1990 purchase of the investment described by target profile Y, then in those years and later, the system tends to recommend additional investments when they have profiles like target profile Y, on the grounds that they too will turn out to be satisfactory in 4 to 5 years. It makes these recommendations both to user V and to users whose investment portfolios and other attributes are similar to user V's. The relevance feedback provided by user V in this case may be either active (feedback = satisfaction ratings provided by the investor V) or passive (feedback = difference between average annual return of the investment and average annual return of the Dow Jones index portfolio since purchase of the investment, for example).

To effectively apply the smoothing technique, it is necessary to have a definition of the similarity distance between (U, X) and (V, Y), for any users U and V and any target objects X and Y. We have already seen how to define the distance d(X, Y) between two target objects X and Y, given their attributes. We may regard a pair such as (U, X) as an extended object that bears all the attributes of target X and all the attributes of user U; then the distance between (U, X) and (V, Y) may be computed in exactly the same way. This approach requires user U, user V, and all other users to have some attributes of their own stored in the system: for example, age (numeric), social security number (textual), and list of documents previously retrieved (associative). It is these attributes that determine the notion of "similar users." Thus it is desirable to generate profiles of users (termed "user profiles") as well as profiles of target objects (termed "target profiles"). Some attributes employed for profiling users may be related to the attributes employed for profiling target objects: for example, using associative attributes, it is possible to characterize target objects such as X by the interest that various users have shown in them, and simultaneously to characterize users such as U by the interest that they have shown in various target objects. In addition, user profiles may make use of any attributes that are useful in characterizing humans, such as those suggested in the example domain above where target objects are potential consumers. Notice that user U's interest can be estimated even if user U is a new user or an off-line user who has never provided any feedback, because the relevance feedback of users whose attributes are similar to U's attributes is taken into account.

For some uses of filtering systems, when estimating topical interest, it is appropriate to make an additional "presumption of no topical interest" (or "bias toward zero"). To understand the usefulness of such a presumption, suppose the system needs to determine whether target object X is topically interesting to the user U, but that users like user U have never provided feedback on target objects even remotely like target object X. The presumption of no topical interest says that if this is so, it is because users like user U are simply not interested in such target objects and therefore do not seek them out and interact with them. On this presumption, the system should estimate topical interest f(U, X) to be low. Formally, this example has the characteristic that (U, X) is far away from all the points (V, Y) where feedback is available. In such a case, topical interest f(U, X) is presumed to be close to zero, even if the value of the topical interest function f(*, *) is high at all the faraway surrounding points at which its value is known. When a smoothing technique is used, such a presumption of no topical interest can be introduced, if appropriate, by manipulating the input to the smoothing technique. In addition to using observed values of the topical interest function f(*, *) as input, the trick is to also introduce fake observations of the form topical interest f(V, Y) = 0 for a lattice of points (V, Y) distributed throughout the multidimensional space. These fake observations should be given relatively low weight as inputs to the smoothing algorithm. The more strongly they are weighted, the stronger the presumption of no interest.

The following provides another simple example of an estimation technique that has a presumption of no interest. Let g be a decreasing function from non-negative real numbers to non-negative real numbers, such as $g(x) = e^{-x}$ or $g(x) = \min(1, x^{-k})$ where k > 1. Estimate topical interest f(U, X) with the following g-weighted average:
$$f(U, X) = \frac{\sum \big(r(V, Y) - q(V, Y)\big)\, g\big(\mathrm{distance}((U, X), (V, Y))\big)}{\sum g\big(\mathrm{distance}((U, X), (V, Y))\big)}$$

Here the summations are over all pairs (V, Y) such that user V has provided feedback r(V, Y) on target object Y, i.e., all pairs (V, Y) such that relevance feedback r(V, Y) is defined. Note that both with this technique and with conventional smoothing techniques, the estimate of the topical interest f(U, X) is not necessarily equal to r(U, X) - q(U, X), even when r(U, X) is defined.

Filtering: Adjusting Weights and Residue Feedback

The method described above requires the filtering system to measure distances between (user, target object) pairs, such as the distance between (U, X) and (V, Y). Given the means described earlier for measuring the distance between two multi-attribute profiles, the method must therefore associate a weight with each attribute used in the profile of (user, target object) pairs, that is, with each attribute used to profile either users or target objects. These weights specify the relative importance of the attributes in establishing similarity or difference, and therefore, in determining how topical interest is generalized from one (user, target object) pair to another. Additional weights determine which attributes of a target object contribute to the quality function q, and by how much.

It is possible and often desirable for a filtering system to store a different set of weights for each user. For example, a user who thinks of two-star films as having materially different topic and style from four-star films wants to assign a high weight to "number of stars" for purposes of the similarity distance measure d(*, *); this means that interest in a two-star film does not necessarily signal interest in an otherwise similar four-star film, or vice versa. If the user also agrees with the critics, and actually prefers four-star films, the user also wants to assign "number of stars" a high positive weight in the determination of the quality function q. In the same way, a user who dislikes vulgarity wants to assign the "vulgarity score" attribute a high negative weight in the determination of the quality function q, although the "vulgarity score" attribute does not necessarily have a high weight in determining the topical similarity of two films.

Attribute weights (of both sorts) may be set or adjusted by the system administrator or the individual user, on either a temporary or a permanent basis. However, it is often desirable for the filtering system to learn attribute weights automatically, based on relevance feedback. The optimal attribute weights for a user U are those that allow the most accurate prediction of user U's interests. That is, with the distance measure and quality function defined by these attribute weights, user U's interest in target object X, q(U, X) + f(U, X), can be accurately estimated by the techniques above. The effectiveness of a particular set of attribute weights for user U can therefore be gauged by seeing how well it predicts user U's known interests.

Formally, suppose that user U has previously provided feedback on target objects $X_1, X_2, X_3, \ldots, X_n$, and that the feedback ratings are $r(U, X_1), r(U, X_2), r(U, X_3), \ldots, r(U, X_n)$. Values of feedback ratings r(*, *) for other users and other target objects may also be known. The system may use the following procedure to gauge the effectiveness of the set of attribute weights it currently stores for user U: (i) For each $1 \le i \le n$, use the estimation techniques to estimate $q(U, X_i) + f(U, X_i)$ from all known values of feedback ratings r. Call this estimate $a_i$. (ii) Repeat step (i), but this time make the estimate for each $1 \le i \le n$ without using the feedback rating $r(U, X_j)$ as input, for any j such that the distance $d(X_i, X_j)$ is smaller than a fixed threshold. That is, estimate each $q(U, X_i) + f(U, X_i)$ from other values of feedback rating r only; in particular, do not use $r(U, X_i)$ itself. Call this estimate $b_i$. The difference $a_i - b_i$ is herein termed the "residue feedback $r_{\mathrm{res}}(U, X_i)$" of user U on target object $X_i$. (iii) Compute user U's error measure, $(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 + \ldots + (a_n - b_n)^2$.

A gradient-descent or other numerical optimization method may be used to adjust user U's attribute weights so that this error measure reaches a (local) minimum. This approach tends to work best if the smoothing technique used in estimation is such that the value of f(V, Y) is strongly affected by the point estimate r(V, Y) - q(V, Y) when the latter value is provided as input. Otherwise, the presence or absence of the single input feedback rating $r(U, X_i)$ in steps (i)-(ii) may not make $a_i$ and $b_i$ very different from each other. A slight variation of this learning technique adjusts a single global set of attribute weights for all users, by adjusting the weights so as to minimize not a particular user's error measure but rather the total error measure of all users. These global weights are used as a default initial setting for a new user who has not yet provided any feedback. Gradient descent can then be employed to adjust this user's individual weights over time.

Even when the attribute weights are chosen to minimize the error measure for user U, the error measure is generally still positive, meaning that residue feedback from user U has not been reduced to 0 on all target objects. It is useful to note that high residue feedback from a user U on a target object X indicates that user U liked target object X unexpectedly well given its profile, that is, better than the smoothing model could predict from user U's opinions on target objects with similar profiles. Similarly, low residue feedback indicates that user U liked target object X less than was expected. By definition, this unexplained preference or dispreference cannot be the result of topical similarity, and therefore must be regarded as an indication of the intrinsic quality of target object X. It follows that a useful quality attribute for a target object X is the average amount of residue feedback $r_{\mathrm{res}}(V, X)$ from users on that target object, averaged over all users V who have provided relevance feedback on the target object. In a variation of this idea, residue feedback is never averaged indiscriminately over all users to form a new attribute, but instead is smoothed to consider users' similarity to each other. Recall that the quality measure q(U, X) depends on the user U as well as the target object X, so that a given target object X may be perceived by different users to have different quality. In this variation, as before, q(U, X) is calculated as a weighted sum of various quality attributes that are dependent only on X, but then an additional term is added, namely an estimate of $r_{\mathrm{res}}(U, X)$ found by applying a smoothing algorithm to known values of $r_{\mathrm{res}}(V, X)$. Here V ranges over all users who have provided relevance feedback on target object X, and the smoothing algorithm is sensitive to the distances d(U, V) from each such user V to user U.

Using the Similarity Computation for Clustering

A method for defining the distance between any pair of target objects was disclosed above. Given this distance measure, it is simple to apply a standard clustering algorithm, such as k-means, to group the target objects into a number of clusters, in such a way that similar target objects tend to be grouped in the same cluster. It is clear that the resulting clusters can be used to improve the efficiency of matching buyers and sellers in the application described in section "Matching Buyers and Sellers" above: it is not necessary to
f(U, X) from all known values offeedback ratings r. Call this 65 compare every buy profile to every sell profile, but only to
estimatea, (ii) Repeat step (i), but this time make the estimate compare buy profiles and sell profiles that are similar enough
for each 1<=i-n without using the feedback ratings r(U.X.) to appear in the same cluster. As explained below, the results
of the clustering procedure can also be used to make filtering more efficient, and in the service of querying and browsing tasks.
The k-means clustering method is familiar to those skilled in the art. Briefly put, it finds a grouping of points (target profiles, in this case, whose numeric coordinates are given by numeric decomposition of their attributes as described above) to minimize the distance between points in the clusters and the centers of the clusters in which they are located. This is done by alternating between assigning each point to the cluster which has the nearest center and then, once the points have been assigned, computing the (new) center of each cluster by averaging the coordinates of the points (target profiles) located in this cluster. Other clustering methods can be used, such as “soft” or “fuzzy” k-means clustering, in which objects are allowed to belong to more than one cluster. This can be cast as a clustering problem similar to the k-means problem, but now the criterion being optimized is a little different:
\sum_{C} \sum_{i} I_{iC} \, d(x_i, \bar{x}_C)
where C ranges over cluster numbers, i ranges over target objects, x_i is the numeric vector corresponding to the profile of target object number i, \bar{x}_C is the mean of all the numeric vectors corresponding to target profiles of target objects in cluster number C, termed the “cluster profile” of cluster C, d(*, *) is the metric used to measure distance between two target profiles, and I_{iC} is a value between 0 and 1 that indicates how much target object number i is associated with cluster number C, where I is an indicator matrix with the property that for each i, \sum_{C} I_{iC} = 1. For k-means clustering, I_{iC} is either 0 or 1.
Any of these basic types of clustering might be used by the system:
1) Association-based clustering, in which profiles contain only associative attributes, and thus distance is defined entirely by associations. This kind of clustering generally (a) clusters target objects based on the similarity of the users who like them or (b) clusters users based on the similarity of the target objects they like. In this approach, the system does not need any information about target objects or users, except for their history of interaction with each other.
2) Content-based clustering, in which profiles contain only non-associative attributes. This kind of clustering (a) clusters target objects based on the similarity of their non-associative attributes (such as word frequencies) or (b) clusters users based on the similarity of their non-associative attributes (such as demographics and psychographics). In this approach, the system does not need to record any information about users' historical patterns of information access, but it does need information about the intrinsic properties of users and/or target objects.
3) Uniform hybrid method, in which profiles may contain both associative and non-associative attributes. This method combines 1a and 2a, or 1b and 2b. The distance d(P_X, P_Y) between two profiles P_X and P_Y may be computed by the general similarity-measurement methods described earlier.
4) Sequential hybrid method. First apply the k-means procedure to do 1a, so that articles are labeled by cluster based on which user read them, then use supervised clustering (maximum likelihood discriminant methods) using the word frequencies to do the process of method 2a described above. This tries to use knowledge of who read what to do a better job of clustering based on word frequencies. One could similarly combine the methods 1b and 2b described above.
Hierarchical clustering of target objects is often useful. Hierarchical clustering produces a tree which divides the target objects first into two large clusters of roughly similar objects; each of these clusters is in turn divided into two or more smaller clusters, which in turn are each divided into yet smaller clusters until the collection of target objects has been entirely divided into “clusters” consisting of a single object each, as diagrammed in FIG. 8. In this diagram, the node d denotes a particular target object d, or equivalently, a single-member cluster consisting of this target object. Target object d is a member of the cluster (a, b, d), which is a subset of the cluster (a, b, c, d, e, f), which in turn is a subset of all target objects. The tree shown in FIG. 8 would be produced from a set of target objects such as those shown geometrically in FIG. 7. In FIG. 7, each letter represents a target object, and axes X1 and X2 represent two of the many numeric attributes on which the target objects differ. Such a cluster tree may be created by hand, using human judgment to form clusters and subclusters of similar objects, or may be created automatically in either of two standard ways: top-down or bottom-up. In top-down hierarchical clustering, the set of all target objects in FIG. 7 would be divided into the clusters (a, b, c, d, e, f) and (g, h, i, j, k). The clustering algorithm would then be reapplied to the target objects in each cluster, so that the cluster (g, h, i, j, k) is subpartitioned into the clusters (g, k) and (h, i, j), and so on to arrive at the tree shown in FIG. 8. In bottom-up hierarchical clustering, the set of all target objects in FIG. 7 would be grouped into numerous small clusters, namely (a, b), d, (c, f), e, (g, k), (h, i), and j. These clusters would then themselves be grouped into the larger clusters (a, b, d), (c, e, f), (g, k), and (h, i, j), according to their cluster profiles. These larger clusters would themselves be grouped into (a, b, c, d, e, f) and (g, k, h, i, j), and so on until all target objects had been grouped together, resulting in the tree of FIG. 8. Note that for bottom-up clustering to work, it must be possible to apply the clustering algorithm to a set of existing clusters. This requires a notion of the distance between two clusters. The method disclosed above for measuring the distance between target objects can be applied directly, provided that clusters are profiled in the same way as target objects. It is only necessary to adopt the convention that a cluster's profile is the average of the target profiles of all the target objects in the cluster; that is, to determine the cluster's value for a given attribute, take the mean value of that attribute across all the target objects in the cluster. For the mean value to be well-defined, all attributes must be numeric, so it is necessary as usual to replace each textual or associative attribute with its decomposition into numeric attributes (scores), as described earlier. For example, the target profile of a single Woody Allen film would assign “Woody-Allen” a score of 1 in the “name-of-director” field, while giving “Federico-Fellini” and “Terence-Davies” scores of 0. A cluster that consisted of 20 films directed by Allen and 5 directed by Fellini would be profiled with scores of 0.8, 0.2, and 0 respectively, because, for example, 0.8 is the average of 20 ones and 5 zeros.
Searching for Target Objects
Given a target object with target profile P, or alternatively given a search profile P, a hierarchical cluster tree of target objects makes it possible for the system to search efficiently for target objects with target profiles similar to P. It is only necessary to navigate through the tree, automatically, in search of such target profiles. The system for customized electronic identification of desirable objects begins by considering the largest, top-level clusters, and selects the cluster whose profile is most similar to target profile P. In the event of a near-tie, multiple clusters may be selected. Next, the system considers all subclusters of the selected clusters, and this time selects the subcluster or subclusters whose profiles are closest to target profile P. This refinement process is iterated until the clusters selected on a given step are sufficiently small, and these are the desired clusters of target objects with profiles most similar to target profile P. Any hierarchical cluster tree therefore serves as a decision tree for identifying target objects. In pseudo-code form, this process is as follows (and in flow diagram form in FIGS. 13A and 13B):
1. Initialize list of identified target objects to the empty list at step 13A00.
2. Initialize the current tree T to be the hierarchical cluster tree of all objects at step 13A01 and at step 13A02 scan the current cluster tree for target objects similar to P, using the process detailed in FIG. 13B. At step 13A03, the list of target objects is returned.
3. At step 13B00, the variable i is set to 1, and each child subtree T_i of the root of tree T is retrieved in turn.
4. At step 13B02, calculate d(P, p_i), the similarity distance between P and p_i.
5. At step 13B03, if d(P, p_i) < t, a threshold, branch to one of two options.
6. If tree T_i contains only one target object at step 13B04, add that target object to list of identified target objects at step 13B05 and advance to step 13B07.
7. If tree T_i contains multiple target objects at step 13B04, scan the ith child subtree for target objects similar to P by invoking the steps of the process of FIG. 13B recursively and then recurse to step 3 (step 13A01 in FIG. 13A) with T bound for the duration of the recursion to tree T_i, in order to search in tree T_i for target objects with profiles similar to P.
In step 5 of this pseudo-code, smaller thresholds are typically used at lower levels of the tree, for example by making the threshold an affine function or other function of the cluster variance or cluster diameter of the cluster p_i. If the cluster tree is distributed across a plurality of servers, as described in the section of this description titled “Network Context of the Browsing System,” this process may be executed in distributed fashion as follows: steps 3-7 are executed by the server that stores the root node of hierarchical cluster tree T, and the recursion in step 7 to a subcluster tree T_i involves the transmission of a search request to the server that stores the root node of tree T_i, which server carries out the recursive step upon receipt of this request. Steps 1-2 are carried out by the processor that initiates the search, and the server that executes step 6 must send a message identifying the target object to this initiating processor, which adds it to the list.
Assuming that low-level clusters have already been formed through clustering, there are alternative search methods for identifying the low-level cluster whose profile is most similar to a given target profile P. A standard back-propagation neural net is one such method: it should be trained to take the attributes of a target object as input, and produce as output a unique pattern that can be used to identify the appropriate low-level cluster. For maximum accuracy, low-level clusters that are similar to each other (close together in the cluster tree) should be given similar identifying patterns. Another approach is a standard decision tree that considers the attributes of target profile P one at a time until it can identify the appropriate cluster. If profiles are large, this may be more rapid than considering all attributes. A hybrid approach to searching uses distance measurements as described above to navigate through the top few levels of the hierarchical cluster tree, until it reaches a cluster of intermediate size whose profile is similar to target profile P, and then continues by using a decision tree specialized to search for low-level subclusters of that intermediate cluster.
One use of these searching techniques is to search for target objects that match a search profile from a user's search profile set. This form of searching is used repeatedly in the news clipping service, active navigation, and Virtual Community Service applications, described below. Another use is to add a new target object quickly to the cluster tree. An existing cluster that is similar to the new target object can be located rapidly, and the new target object can be added to this cluster. If the object is beyond a certain threshold distance from the cluster center, then it is advisable to start a new cluster. Several variants of this incremental clustering scheme can be used, and can be built using variants of subroutines available in advanced statistical packages. Note that various methods can be used to locate the new target objects that must be added to the cluster tree, depending on the architecture used. In one method, a “webcrawler” program running on a central computer periodically scans all servers in search of new target objects, calculates the target profiles of these objects, and adds them to the hierarchical cluster tree by the above method. In another, whenever a new target object is added to any of the servers, a software “agent” at that server calculates the target profile and adds it to the hierarchical cluster tree by the above method.
Rapid Profiling
In some domains, complete profiles of target objects are not always easy to construct automatically. When target objects are multi-media games, for example, an attribute such as genre (a single textual term such as “action”, “suspense/thriller”, “word games”, etc.) may be a matter of judgment and opinion. More significantly, if each title has an associated attribute that records the positive or negative relevance feedback to that title from various human users (consumers), then all the association scores of any newly introduced titles are initially zero, so that it is initially unclear what other titles are similar to the new title with respect to the users who like them. Indeed, if this associative attribute is highly weighted, the initial lack of relevance feedback information may be difficult to remedy, due to a vicious circle in which users of moderate-to-high interest are needed to provide relevance feedback but relevance feedback is needed to identify users of moderate-to-high interest.
Fortunately, however, it is often possible in principle to determine certain attributes of a new target object by extraordinary methods, including but not limited to methods that consult a human. For example, the system can in principle determine the genre of a title by consulting one or more randomly chosen individuals from a set of known human experts, while to determine the numeric association score between a new title and a particular user, it can in principle show the title to that user and obtain relevance feedback. Since such requests inconvenience people, however, it is important not to determine all difficult attributes this way, but only the ones that are most important in classifying the article. “Rapid profiling” is a method for selecting those numeric attributes that are most important to determine. (Recall that all attributes can be decomposed into numeric attributes, such as association scores or term scores.) First, a set of existing target objects that already have complete or largely complete profiles are clustered using a k-means algorithm. Next, each of the resulting clusters is assigned a unique identifying number, and each clustered target object is labeled with the identifying number of its cluster. Standard methods then allow construction of a single decision tree that can determine any
target object's cluster number, with substantial accuracy, by considering the attributes of the target object, one at a time. Only attributes that can if necessary be determined for any new target object are used in the construction of this decision tree. To profile a new target object, the decision tree is traversed downward from its root as far as is desired. The root of the decision tree considers some attribute of the target object. If the value of this attribute is not yet known, it is determined by a method appropriate to that attribute; for example, if the attribute is the association score of the target object with user #4589, then relevance feedback (to be used as the value of this attribute) is solicited from user #4589, perhaps by the ruse of adding the possibly uninteresting target object to a set of objects that the system recommends to the user's attention, in order to find out what the user thinks of it. Once the root attribute is determined, the rapid profiling method descends the decision tree by one level, choosing one of the decision subtrees of the root in accordance with the determined value of the root attribute. The root of this chosen subtree considers another attribute of the target object, whose value is likewise determined by an appropriate method. The process can be repeated to determine as many attributes as desired, by whatever methods are available, although it is ordinarily stopped after a small number of attributes, to avoid the burden of determining too many attributes.
It should be noted that the rapid profiling method can be used to identify important attributes in any sort of profile, and not just profiles of target objects. In particular, recall that the disclosed method for determining topical interest through similarity requires users as well as target objects to have profiles. New users, like new target objects, may be profiled or partially profiled through the rapid profiling process. For example, when user profiles include an associative attribute that records the user's relevance feedback on all target objects in the system, the rapid profiling procedure can rapidly form a rough characterization of a new user's interests by soliciting the user's feedback on a small number of significant target objects, and perhaps also by determining a small number of other key attributes of the new user, by on-line queries, telephone surveys, or other means. Once the new user has been partially profiled in this way, the methods disclosed above predict that the new user's interests resemble the known interests of other users with similar profiles. In a variation, each user's user profile is subdivided into a set of long-term attributes, such as demographic characteristics, and a set of short-term attributes that help to identify the user's temporary desires and emotional state, such as the user's textual or multiple-choice answers to questions whose answers reflect the user's mood. A subset of the user's long-term attributes are determined when the user first registers with the system, through the use of a rapid profiling tree of long-term attributes. In addition, each time the user logs on to the system, a subset of the user's short-term attributes are additionally determined, through the use of a separate rapid profiling tree that asks about short-term attributes.
Market Research
A technique similar to rapid profiling is of interest in market research (or voter research). Suppose that the target objects are consumers. A particular attribute in each target profile indicates whether the consumer described by that target profile has purchased product X. A decision tree can be built that attempts to determine what value a consumer has for this attribute, by consideration of the other attributes in the consumer's profile. This decision tree may be traversed to determine whether additional users are likely to purchase product X. More generally, the top few levels of the decision tree provide information, valuable to advertisers who are planning mass-market or direct-mail campaigns, about the most significant characteristics of consumers of product X. Similar information can alternatively be extracted from a collection of consumer profiles without recourse to a decision tree, by considering attributes one at a time, and identifying those attributes on which product X's consumers differ significantly from its non-consumers. These techniques serve to characterize consumers of a particular product; they can be equally well applied to voter research or other survey research, where the objective is to characterize those individuals from a given set of surveyed individuals who favor a particular candidate, hold a particular opinion, belong to a particular demographic group, or have some other set of distinguishing attributes. Researchers may wish to purchase batches of analyzed or unanalyzed user profiles from which personal identifying information has been removed. As with any statistical database, statistical conclusions can be drawn, and relationships between attributes can be elucidated using knowledge discovery techniques which are well known in the art.
Supporting Architecture
The following section describes the preferred computer and network architecture for implementing the methods described in this patent.
Electronic Media System Architecture
FIG. 1 illustrates in block diagram form the overall architecture of an electronic media system, known in the art, in which the system for customized electronic identification of desirable objects of the present invention can be used to provide user customized access to target objects that are available via the electronic media system. In particular, the electronic media system comprises a data communication facility that interconnects a plurality of users with a number of information servers. The users are typically individuals, whose personal computers (terminals) T1-Tn are connected via a data communications link, such as a modem and a telephone connection established in well-known fashion, to a telecommunication network N. User information access software is resident on the user's personal computer and serves to communicate over the data communications link and the telecommunication network N with one of the plurality of network vendors V1-Vk (America Online, Prodigy, CompuServe, other private companies or even universities) who provide data interconnection service with selected ones of the information servers I1-Im. The user can, by use of the user information access software, interact with the information servers I1-Im to request and obtain access to data that resides on mass storage systems SS1-SSm that are part of the information server apparatus. New data is input to this system by users via their personal computers T1-Tn and by commercial information services by populating their mass storage systems SS1-SSm with commercial data. Each user terminal T1-Tn and the information servers I1-Im have phone numbers or IP addresses on the network N which enable a data communication link to be established between a particular user terminal T1-Tn and the selected information server I1-Im. A user's electronic mail address also uniquely identifies the user and the user's network vendor V1-Vk in an industry-standard format such as: username@aol.com or username@netcom.com. The network vendors V1-Vk provide access passwords for their subscribers (selected users), through which the users can access the information servers I1-Im. The subscribers pay the network vendors V1-Vk for the access services on a fee schedule that typically includes a monthly subscription fee and usage-based charges.
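By way of illustration only, the addressing and billing conventions just described can be sketched in Python. The names `Vendor`, `split_address`, `monthly_charge_cents`, and `reachable_servers`, the server labels, and every fee amount below are hypothetical and are not taken from this description; the sketch also assumes that usage is billed per whole hour.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Vendor:
    """A network vendor (one of V1-Vk); all field values are illustrative."""
    domain: str                       # domain part of subscribers' mail addresses
    monthly_fee_cents: int            # flat monthly subscription fee
    usage_rate_cents_per_hour: int    # usage-based charge (assumed hourly)
    servers: Set[str] = field(default_factory=set)  # information servers offered


def split_address(address: str) -> Tuple[str, str]:
    """Split 'username@domain' into (username, vendor domain)."""
    username, at, domain = address.partition("@")
    if not (username and at and domain):
        raise ValueError(f"not a user@vendor address: {address!r}")
    return username, domain


def monthly_charge_cents(vendor: Vendor, hours_used: int) -> int:
    """Monthly subscription fee plus usage-based charges."""
    return vendor.monthly_fee_cents + hours_used * vendor.usage_rate_cents_per_hour


def reachable_servers(address: str, vendors: Dict[str, Vendor]) -> Set[str]:
    """Information servers reachable through the vendor named in the address."""
    _, domain = split_address(address)
    return set(vendors[domain].servers)


# Hypothetical vendor and fee schedule, for demonstration only.
aol = Vendor("aol.com", monthly_fee_cents=995, usage_rate_cents_per_hour=295,
             servers={"I1", "I2"})
vendors = {aol.domain: aol}
print(split_address("username@aol.com"))
print(monthly_charge_cents(aol, hours_used=4))
print(sorted(reachable_servers("username@aol.com", vendors)))
```

Amounts are kept in integer cents so that the fee arithmetic is exact.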
A difficulty with this system is that there are numerous information servers I1-Im, located around the world, each of which provides access to a set of information of differing format, content and topics and via a cataloging system that is typically unique to the particular information server I1-Im. The information is comprised of individual “files”, which can contain audio data, video data, graphics data, text data, structured database data and combinations thereof. In the terminology of this patent, each target object is associated with a unique file: for target objects that are informational in nature and can be digitally represented, the file directly stores the informational content of the target object, while for target objects that are not stored electronically, such as purchasable goods, the file contains an identifying description of the target object. Target objects stored electronically as text files can include commercially provided news articles, published documents, letters, user-generated documents, descriptions of physical objects, or combinations of these classes of data. The organization of the files containing the information and the native format of the data contained in files of the same conceptual type may vary by information server I1-Im.
Thus, a user can have difficulty in locating files that contain the desired information, because the information may be contained in files whose information server cataloging may not enable the user to locate them. Furthermore, there is no standard catalog that defines the presence and services provided by all information servers I1-Im. A user therefore does not have simple access to information but must expend a significant amount of time and energy to excerpt a segment of the information that may be relevant to the user from the plethora of information that is generated and populated on this system. Even if the user commits the necessary resources to this task, existing information retrieval processes lack the accuracy and efficiency to ensure that the user obtains the desired information. It is obvious that within the constructs of this electronic media system, the three modules of the system for customized electronic identification of desirable objects can be implemented in a distributed manner, even with various modules being implemented on and/or by different vendors within the electronic media system. For example, the information servers I1-Im can include the target profile generation module while the network vendors V1-Vk may implement the user profile generation module, the target profile interest summary generation module, and/or the profile processing module. A module can itself be implemented in a distributed manner, with numerous nodes being present in the network N, each node serving a population of users in a particular geographic area. The totality of these nodes comprises the functionality of the particular module. Various other partitions of the modules and their functions are possible and the examples provided herein represent illustrative examples and are not intended to limit the scope of the claimed invention. For the purposes of pseudonymous creation and update of users' target profile interest summaries (as described below), the vendors V1-Vk may be augmented with some number of proxy servers, which provide a mechanism for ongoing pseudonymous access and profile building through the method described herein. At least one trusted validation server must be in place to administer the creation of pseudonyms in the system.
An important characteristic of this system for customized electronic identification of desirable objects is its responsiveness, since the intended use of the system is in an interactive mode. The system utility grows with the number of the users and this increases the number of possible consumer/product relationships between users and target objects. A system that serves a large group of users must maintain interactive performance and the disclosed method for profiling and clustering target objects and users can in turn be used for optimizing the distribution of data among the members of a virtual community and through a data communications network, based on users' target profile interest summaries.
Network Elements and System Characteristics
The various processors interconnected by the data communication network N as shown in FIG. 1 can be divided into two classes and grouped as illustrated in FIG. 2: clients and servers. The clients C1-Cn are individual users' computer systems which are connected to servers S1-S5 at various times via data communications links. Each of the clients Ci is typically associated with a single server Sj, but these associations can change over time. The clients C1-Cn both interface with users and produce and retrieve files to and from servers. The clients C1-Cn are not necessarily continuously on-line, since they typically serve a single user and can be movable systems, such as laptop computers, which can be connected to the data communications network N at any of a number of locations. Clients could also be a variety of other computers, such as computers and kiosks providing access to customized information as well as targeted advertising to many users, where the users identify themselves with passwords or with smartcards. A server Si is a computer system that is presumed to be continuously on-line and functions to both collect files from various sources on the data communication network N for access by local clients C1-Cn and collect files from local clients C1-Cn for access by remote clients. The server Si is equipped with persistent storage, such as a magnetic disk data storage medium, and is interconnected with other servers via data communications links. The data communications links can be of arbitrary topology and architecture, and are described herein for the purpose of simplicity as point-to-point links or, more precisely, as virtual point-to-point links. The servers S1-S5 comprise the network vendors V1-Vk as well as the information servers I1-Im of FIG. 1 and the functions performed by these two classes of modules can be merged to a greater or lesser extent in a single server Si or distributed over a number of servers in the data communication network N. Prior to proceeding with the description of the preferred embodiment of the invention, a number of terms are defined. FIG. 3 illustrates in block diagram form a representation of an arbitrarily selected network topology for a plurality of servers A-D, each of which is interconnected to at least one other server and typically also to a plurality of clients p-s. Servers A-D are interconnected by a collection of point-to-point data communications links, and server A is connected to client r, server B is connected to clients p-q, while server D is connected to client s. Servers transmit encrypted or unencrypted messages amongst themselves: a message typically contains the textual and/or graphic information stored in a particular file, and also contains data which describe the type and origin of this file, the name of the server that is supposed to receive the message, and the purpose for which the file contents are being transmitted. Some messages are not associated with any file, but are sent by one server to other servers for control reasons, for example to request transmission of a file or to announce the availability of a new file. Messages can be forwarded by a server to another server, as in the case where server A transmits a message to server D via a relay node of either server C or servers B, C. It is generally preferable to have multiple paths through the network, with each path being characterized by its performance capability and cost to enable the network N to optimize traffic routing.
Proxy Servers and Pseudonymous Transactions
While the method of using target profile interest summaries presents many advantages to both target object providers
and users, there are important privacy issues for both users and providers that must be resolved if the system is to be used freely and without inhibition by users without fear of invasion of privacy. It is likely that users desire that some, if not all, of the user-specific information in their user profiles and target profile interest summaries remain confidential, to be disclosed only under certain circumstances related to certain types of transactions and according to their personal wishes for differing levels of confidentiality regarding their purchases and expressed interests.
However, complete privacy and inaccessibility of user transactions and profile summary information would hinder implementation of the system for customized electronic identification of desirable objects and would deprive the user of many of the advantages derived through the system's use of user-specific information. In many cases, complete and total privacy is not desired by all parties to a transaction. For example, a buyer may desire to be targeted for certain mailings that describe products that are related to his or her interests, and a seller may desire to target users who are predicted to be interested in the goods and services that the seller provides. Indeed, the usefulness of the technology described herein is contingent upon the ability of the system to collect and compare data about many users and many target objects. A compromise between total user anonymity and total public disclosure of the user's search profiles or target profile interest summary is a pseudonym. A pseudonym is an artifact that allows a service provider to communicate with users and build and accumulate records of their preferences over time, while at the same time remaining ignorant of the users' true identities, so that users can keep their purchases or preferences private. A second and equally important requirement of a pseudonym system is that it provide for digital credentials, which are used to guarantee that the user represented by a particular pseudonym has certain properties. These credentials may be granted on the basis of the results of activities and transactions conducted by means of the system for customized electronic identification of desirable objects, or on the basis of other activities and transactions conducted on the network N of the present system, or on the basis of users' activities outside of network N. For example, a service provider may require proof that the purchaser has sufficient funds on deposit at his/her bank, which might possibly not be on a network, before agreeing to transact business with that user. The user, therefore, must provide the service provider with proof of funds (a credential) from the bank, while still not disclosing the user's true identity to the service provider.
Our method solves the above problems by combining the pseudonym granting and credential transfer methods taught by D. Chaum and J. H. Evertse, in the paper titled “A secure and privacy-protecting protocol for transmitting personal information between organizations,” with the implementation of a set of one or more proxy servers distributed throughout the network N. Each proxy server, for example S2 in FIG. 2, is a server which communicates with clients and other servers S5 in the network either directly or through anonymizing mix paths as detailed in the paper by D. Chaum titled “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms,” published in Communications of the ACM, Volume 24, Number 2, February 1981. Any server in the network N may be configured to act as a proxy server in addition to its other functions. Each proxy server provides service to a set of users, which set is termed the “user base” of that proxy server. A given proxy server provides three sorts of service to each user U in its user base, as follows:
1. The first function of the proxy server is to bidirectionally transfer communications between user U and other entities such as information servers (possibly including the proxy server itself) and/or other users. Specifically, letting S denote the server that is directly associated with user U's client processor, the proxy server communicates with server S (and thence with user U), either through anonymizing mix paths that obscure the identity of server S and user U, in which case the proxy server knows user U only through a secure pseudonym, or else through a conventional virtual point-to-point connection, in which case the proxy server knows user U by user U's address at server S, which address may be regarded as a non-secure pseudonym for user U.
2. A second function of the proxy server is to record user-specific information associated with user U. This user-specific information includes a user profile and target profile interest summary for user U, as well as a list of access control instructions specified by user U, as described below, and a set of one-time return addresses provided by user U that can be used to send messages to user U without knowing user U's true identity. All of this user-specific information is stored in a database that is keyed by user U's pseudonym (whether secure or non-secure) on the proxy server.
3. A third function of the proxy server is to act as a selective forwarding agent for unsolicited communications that are addressed to user U: the proxy server forwards some such communications to user U and rejects others, in accordance with the access control instructions specified by user U.
Our combined method allows a given user to use either a single pseudonym in all transactions where he or she wishes to remain pseudonymous, or else different pseudonyms for different types of transactions. In the latter case, each service provider might transact with the user under a different pseudonym for the user. More generally, a coalition of service providers, all of whom match users with the same genre of target objects, might agree to transact with the user using a common pseudonym, so that the target profile interest summary associated with that pseudonym would be complete with respect to said genre of target objects. When a user employs several pseudonyms in order to transact with different coalitions of service providers, the user may freely choose a proxy server to service each pseudonym; these proxy servers may be the same or different. From the service provider's perspective, our system provides security, in that it can guarantee that users of a service are legitimately entitled to the services used and that no user is using multiple pseudonyms to communicate with the same provider. This uniqueness of pseudonyms is important for the purposes of this application, since the transaction information gathered for a given individual must represent a complete and consistent picture of a single user's activities with respect to a given service provider or coalition of service providers; otherwise, a user's target profile interest summary and user profile would not be able to represent the user's interests to other parties as completely and accurately as possible.
The service provider must have a means of protection from users who violate previously agreed upon terms of service. For example, if a user that uses a given pseudonym engages in activities that violate the terms of service, then the service provider should be able to take action against the user, such as denying the user service and blacklisting the user from transactions with other parties that the user might be tempted to defraud. This type of situation might occur when a user employs a service provider for illegal activities or defaults in payments to the service provider. The method of the paper titled “Security without identification: Transaction systems to
make Big-Brother obsolete”, published in the Communications of the ACM, 28(10), October 1985; pp. 1030-1044, incorporated herein, provides for a mechanism to enforce protection against this type of behavior through the use of resolution credentials, which are credentials that are periodically provided to individuals contingent upon their behaving consistent with the agreed upon terms of service between the user and information provider and network vendor entities (such as regular payment for services rendered, civil conduct, etc.). For the user's safety, if the issuer of a resolution credential refuses to grant this resolution credential to the user, then the refusal may be appealed to an adjudicating third party. The integrity of the user profiles and target profile interest summaries stored on proxy servers is important: if a seller relies on such user-specific information to deliver promotional offers or other material to a particular class of users, but not to other users, then the user-specific information must be accurate and untampered with in any way. The user may likewise wish to ensure that other parties not tamper with the user's user profile and target profile interest summary, since such modification could degrade the system's ability to match the user with the most appropriate target objects. This is done by providing for the user to apply digital signatures to the control messages sent by the user to the proxy server. Each pseudonym is paired with a public cryptographic key and a private cryptographic key, where the private key is known only to the user who holds that pseudonym; when the user sends a control message to a proxy server under a given pseudonym, the proxy server uses the pseudonym's public key to verify that the message has been digitally signed by someone who knows the pseudonym's private key. This prevents other parties from masquerading as the user.
Our approach, as disclosed in this application, provides an improvement over the prior art in privacy-protected pseudonyms for network subscribers such as taught in U.S. Pat. No. 5,245,656, which provides for a name translator station to act as an intermediary between a service provider and the user. However, while U.S. Pat. No. 5,245,656 provides that the information transmitted between the end user U and the service provider be doubly encrypted, the fact that a relationship exists between user U and the service provider is known to the name translator, and this fact could be used to compromise user U, for example if the service provider specializes in the provision of content that is not deemed acceptable by user U's peers. The method of U.S. Pat. No. 5,245,656 also omits a method for the convenient updating of pseudonymous user profile information, such as is provided in this application, and does not provide for assurance of unique and credentialed registration of pseudonyms from a credentialing agent as is also provided in this application, and does not provide a means of access control to the user based on profile information and conditional access as will be subsequently described. The method described by Loeb et al. also does not describe any provision for credentials, such as might be used for authenticating a user's right to access particular target objects, such as target objects that are intended to be available only upon payment of a subscription fee, or target objects that are intended to be unavailable to younger users.
Proxy Server Description
In order that a user may ensure that some or all of the information in the user's user profile and target profile interest summary remain dissociated from the user's true identity, the user employs as an intermediary any one of a number of proxy servers available on the data communication network N of FIG. 2 (for example, server S2). The proxy servers function to disguise the true identity of the user from other parties on the data communication network N. The proxy server represents a given user to either single network vendors and information servers or coalitions thereof. A proxy server, e.g. S2, is a server computer with CPU, main memory, secondary disk storage and network communication function and with a database function which retrieves the target profile interest summary and access control instructions associated with a particular pseudonym P, which represents a particular user U, and performs bi-directional routing of commands, target objects and billing information between the user at a given client (e.g. C3) and other network entities such as network vendors V1-Vk and information servers I1-Im. Each proxy server maintains an encrypted target profile interest summary associated with each allocated pseudonym in its pseudonym database D. The actual user-specific information and the associated pseudonyms need not be stored locally on the proxy server, but may alternatively be stored in a distributed fashion and be remotely addressable from the proxy server via point-to-point connections.
The proxy server supports two types of bi-directional connections: point-to-point connections and pseudonymous connections through mix paths, as taught by D. Chaum in the paper titled “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms”, Communications of the ACM, Volume 24, Number 2, February 1981. The normal connections between the proxy server and information servers, for example a connection between proxy server S2 and information server S4 in FIG. 2, are accomplished through the point-to-point connection protocols provided by network N as described in the “Electronic Media System Architecture” section of this application. The normal type of point-to-point connections may be used between S2-S4, for example, since the dissociation of the user and the pseudonym need only occur between the client C3 and the proxy server S2, where the pseudonym used by the user is available. Knowing that an information provider such as S4 communicates with a given pseudonym P on proxy server S2 does not compromise the true identity of user U. The bidirectional connection between the user and the proxy server S2 can also be a normal point-to-point connection, but it may instead be made anonymous and secure, if the user desires, through the consistent use of an anonymizing mix protocol as taught by D. Chaum in the paper titled “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms”, Communications of the ACM, Volume 24, Number 2, February 1981. This mix procedure provides untraceable secure anonymous mail between two parties with blind return addresses through a set of forwarding and return routing servers termed “mixes”. The mix routing protocol, as taught in the Chaum paper, is used with the proxy server S2 to provide a registry of persistent secure pseudonyms that can be employed by users other than user U, by information providers I1-Im, by vendors V1-Vk and by other proxy servers to communicate with the users in the proxy server's user base on a continuing basis. The security provided by this mix path protocol is distributed and resistant to traffic analysis attacks and other known forms of analysis which may be used by malicious parties to try and ascertain the true identity of a pseudonym bearer. Breaking the protocol requires a large number of parties to maliciously collude or be cryptographically compromised. In addition an extension to the method is taught where the user can include a return path definition in the message so the information server S4 can return the requested information to the user's client processor C3. We utilize this feature in a novel fashion to provide for access and reachability control under user and proxy server control.
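The nested enveloping used on a mix path can be illustrated with a minimal sketch. It is not the protocol of the Chaum paper: that scheme seals each layer with the public key of a mix, whereas this sketch assumes, purely for brevity, that the sender shares a symmetric key with each mix and uses a toy XOR keystream that is not secure. The names toy_seal, wrap, and peel, the mix names, and the keys are all hypothetical. The sketch shows only the layering idea: each mix removes one envelope and learns nothing but the next hop.

```python
import hashlib
import json


def _keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from key (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def toy_seal(key: bytes, data: bytes) -> bytes:
    """XOR data with a keystream; stands in for real public-key sealing."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


toy_open = toy_seal  # XOR with the same keystream undoes the sealing


def wrap(route, final_dest, payload):
    """Build nested envelopes for a route given as [(mix_name, key), ...].

    Returns the first hop and the outermost envelope. The payload ends up
    in the most deeply nested layer.
    """
    next_hop, blob = final_dest, payload
    for name, key in reversed(route):
        layer = json.dumps({"next": next_hop, "body": blob.hex()}).encode()
        blob = toy_seal(key, layer)
        next_hop = name
    return next_hop, blob


def peel(key, blob):
    """Remove one envelope layer; return (next hop, inner envelope)."""
    layer = json.loads(toy_open(key, blob).decode())
    return layer["next"], bytes.fromhex(layer["body"])


if __name__ == "__main__":
    # Hypothetical route from the user's local server to proxy server S2.
    route = [("mix1", b"key-1"), ("mix2", b"key-2")]
    hop, env = wrap(route, "S2", b"request R")
    for _, key in route:
        hop, env = peel(key, env)
    print(hop, env)
```

A return path could be sketched in the same way, with the user pre-building envelopes that the proxy server forwards without being able to open them.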
Validation and Allocation of a Unique Pseudonym
Chaum's pseudonym and credential issuance system, as described in a publication by D. Chaum and J. H. Evertse, titled “A secure and privacy-protecting protocol for transmitting personal information between organizations,” has several desirable properties for use as a component in our system. The system allows for individuals to use different pseudonyms with different organizations (such as banks and coalitions of service providers). The organizations which are presented with a pseudonym have no more information about the individual than the pseudonym itself and a record of previous transactions carried out under that pseudonym. Additionally, credentials, which represent facts about a pseudonym that an organization is willing to certify, can be granted to a particular pseudonym, and transferred to other pseudonyms that the same user employs. For example, the user can use different pseudonyms with different organizations (or disjoint sets of organizations), yet still present credentials that were granted by one organization, under one pseudonym, in order to transact with another organization under another pseudonym, without revealing that the two pseudonyms correspond to the same user. Credentials may be granted to provide assurances regarding the pseudonym bearer's age, financial status, legal status, and the like. For example, credentials signifying “legal adult” may be issued to a pseudonym based on information known about the corresponding user by the given issuing organization. Then, when the credential is transferred to another pseudonym that represents the user to another disjoint organization, presentation of this credential on the other pseudonym can be taken as proof of legal adulthood, which might satisfy a condition of terms of service. Credential issuing organizations may also certify particular facts about a user's demographic profile or target profile interest summary, for example by granting a credential that asserts “the bearer of this pseudonym is either well-read or is middle-aged and works for a large company”; by presenting this credential to another entity, the user can prove eligibility for (say) a discount without revealing the user's personal data to that entity.
Additionally, the method taught by Chaum provides for assurances that no individual may correspond with a given organization or coalition of organizations using more than one pseudonym; that credentials may not be feasibly forged by the user; and that credentials may not be transferred from one user's pseudonym to a different user's pseudonym. Finally, the method provides for expiration of credentials and for the issuance of “black marks” against individuals who do not act according to the terms of service that they are extended. This is done through the resolution credential mechanism as described in Chaum's work, in which resolutions are issued periodically by organizations to pseudonyms that are in good standing. If a user is not issued this resolution credential by a particular organization or coalition of organizations, then this user cannot have it available to be transferred to other pseudonyms which he uses with other organizations. Therefore, the user cannot convince these other organizations that he has acted in accordance with terms of service in other dealings. If this is the case, then the organization can use this lack of resolution credential to infer that the user is not in good standing in his other dealings. In one approach organizations (or other users) may issue a list of quality related credentials based upon the experience of transaction (or interaction) with the user which may act similarly to a letter of recommendation as in a resume. If such a credential is issued from multiple organizations, their values become averaged. In an alternative variation organizations may be issued credentials from users such as customers which may be used to indicate to other future users quality of service which can be expected by subsequent users on the basis of various criteria.
In our implementation, a pseudonym is a data record consisting of two fields. The first field specifies the address of the proxy server at which the pseudonym is registered. The second field contains a unique string of bits (e.g., a random binary number) that is associated with a particular user; credentials take the form of public-key digital signatures computed on this number, and the number itself is issued by a pseudonym administering server Z, as depicted in FIG. 2, and detailed in a generic form in the paper by D. Chaum and J. H. Evertse, titled “A secure and privacy-protecting protocol for transmitting personal information between organizations.” It is possible to send information to the user holding a given pseudonym, by enveloping the information in a control message that specifies the pseudonym and is addressed to the proxy server that is named in the first field of the pseudonym: the proxy server may forward the information to the user upon receipt of the control message.
While the user may use a single pseudonym for all transactions, in the more general case a user has a set of several pseudonyms, each of which represents the user in his or her interactions with a single provider or coalition of service providers. Each pseudonym in the pseudonym set is designated for transactions with a different coalition of related service providers, and the pseudonyms used with one provider or coalition of providers cannot be linked to the pseudonyms used with other disjoint coalitions of providers. All of the user's transactions with a given coalition can be linked by virtue of the fact that they are conducted under the same pseudonym, and therefore can be combined to define a unified picture, in the form of a user profile and a target profile interest summary, of the user's interests vis-a-vis the service or services provided by said coalition. There are other circumstances for which the use of a pseudonym may be useful and the present description is in no way intended to limit the scope of the claimed invention; for example, the previously described rapid profiling tree could be used to pseudonymously acquire information about the user which is considered by the user to be sensitive such as that information which is of interest to such entities as insurance companies, medical specialists, family counselors or dating services.
Detailed Protocol
In our system, the organizations that the user U interacts with are the servers S1-Sn on the network N. However, rather than directly corresponding with each server, the user employs a proxy server, e.g. S2, as an intermediary between the local server of the user's own client and the information provider or network vendor. Mix paths as described by D. Chaum in the paper titled “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms”, Communications of the ACM, Volume 24, Number 2, February 1981 allow for untraceability and security between the client, such as C3, and the proxy server, e.g. S2. Let S(M, K) represent the digital signing of message M by modular exponentiation with key K as detailed in a paper by Rivest, R. L., Shamir, A., and Adleman, L. titled “A method for obtaining digital signatures and public-key cryptosystems”, published in the Comm. ACM 21, 2 (Feb. 1978), 120-126. Once a user applies to server Z for a pseudonym P and is granted a signed pseudonym signed with the private key SK_Z of server Z, the following protocol takes place to establish an entry for the user U in the proxy server S2's database D.
1. The user now sends proxy server S2 the pseudonym, which has been signed by Z to indicate the authenticity and uniqueness of the pseudonym. The user also generates a PK_P, SK_P key pair for use with the granted pseudonym, where SK_P is the private key associated with the pseud
onym and PK_P is the public key associated with the pseudonym. The user forms a request to establish pseudonym P on proxy server S2, by sending the signed pseudonym S(P, SK_Z) to the proxy server S2 along with a request to create a new database entry, indexed by P, and the public key PK_P. It envelopes the message and transmits it to a proxy server S2 through an anonymizing mix path, along with an anonymous return envelope header.
2. The proxy server S2 receives the database creation entry request and associated certified pseudonym message. The proxy server S2 checks to ensure that the requested pseudonym P is signed by server Z and if so grants the request and creates a database entry for the pseudonym, as well as storing the user's public key PK_P to ensure that only the user U can make requests in the future using pseudonym P.
3. The structure of the user's database entry consists of a user profile as detailed herein, a target profile interest summary as detailed herein, and a Boolean combination of access control criteria as detailed below, along with the associated public key for the pseudonym P.
4. At any time after the database entry for pseudonym P is established, the user U may provide proxy server S2 with credentials on that pseudonym, provided by third parties, which credentials make certain assertions about that pseudonym. The proxy server may verify those credentials and make appropriate modifications to the user's profile as required by these credentials such as recording the user's new demographic status as an adult. It may also store those credentials, so that it can present them to service providers on the user's behalf.
The above steps may be repeated, with either the same or a different proxy server, each time user U requires a new pseudonym for use with a new and disjoint coalition of providers. In practice there is an extremely small probability that a given pseudonym may have already been allocated, due to the random nature of the pseudonym generation process carried out by Z. If this highly unlikely event occurs, then the proxy server S2 may reply to the user with a signed message indicating that the generated pseudonym has already been allocated, and asking for a new pseudonym to be generated.
Pseudonymous Control of an Information Server
Once a proxy server S2 has authenticated and registered a user's pseudonym, the user may begin to use the services of the proxy server S2, in interacting with other network entities such as service providers, as exemplified by server S4 in FIG. 2, an information service provider node connected to the network. The user controls the proxy server S2 by forming digitally encoded requests that the user subsequently transmits to the proxy server S2 over the network N. The nature and format of these requests will vary, since the proxy server may be used for any of the services described in this application, such as the browsing, querying, and other navigational functions described below.
In a generic scenario, the user wishes to communicate under pseudonym P with a particular information provider or user at address A, where P is a pseudonym allocated to the user and A is either a public network address at a server such as S4, or another pseudonym that is registered on a proxy server such as S4. (In the most common version of this scenario, address A is the address of an information provider, and the user is requesting that the information provider send target objects of interest.) The user must form a request R to proxy server S2, that requests proxy server S2 to send a message to address A and to forward the response back to the user. The user may thereby communicate with other parties, either non-pseudonymous parties, in the case where address A is a public network address, or pseudonymous parties, in the case where address A is a pseudonym held by, for example, a business or another user who prefers to operate pseudonymously.
In other scenarios, the request R to proxy server S2 formed by the user may have different content. For example, request R may instruct proxy server S2 to use the methods described later in this description to retrieve from the most convenient server a particular piece of information that has been multicast to many servers, and to send this information to the user. Conversely, request R may instruct proxy server S2 to multicast to many servers a file associated with a new target object provided by the user, as described below. If the user is a subscriber to the news clipping service described below, request R may instruct proxy server S2 to forward to the user all target objects that the news clipping service has sent to proxy server S2 for the user's attention. If the user is employing the active navigation service described below, request R may instruct proxy server S2 to select a particular cluster from the hierarchical cluster tree and provide a menu of its subclusters to the user, or to activate a query that temporarily affects proxy server S2's record of the user's target profile interest summary. If the user is a member of a virtual community as described below, request R may instruct proxy server S2 to forward to the user all messages that have been sent to the virtual community.
Regardless of the content of request R, the user, at client C3, initiates a connection to the user's local server S1, and instructs server S1 to send the request R along a secure mix path to the proxy server S2, initiating the following sequence of actions:
1. The user's client processor C3 forms a signed message S(R, SK_P), which is paired with the user's pseudonym P and (if the request R requires a response) a secure one-time set of return envelopes, to form a message M. It protects the message M with a multiply enveloped route for the outgoing path. The enveloped routes provide for secure communication between S1 and the proxy server S2. The message M is enveloped in the most deeply nested message and is therefore difficult to recover should the message be intercepted by an eavesdropper.
2. The message M is sent by client C3 to its local server S1, and is then routed by the data communication network N from server S1 through a set of mixes as dictated by the outgoing envelope set and arrives at the selected proxy server S2.
3. The proxy server S2 separates the received message M into the request message R, the pseudonym P, and (if included) the set of envelopes for the return path. The proxy server S2 uses pseudonym P to index and retrieve the corresponding record in proxy server S2's database, which record is stored in local storage at the proxy server S2 or on other distributed storage media accessible to proxy server S2 via the network N. This record contains a public key PK_P, user-specific information, and credentials associated with pseudonym P. The proxy server S2 uses the public key PK_P to check that the signed version S(R, SK_P) of request message R is valid.
4. Provided that the signature on request message R is valid, the proxy server S2 acts on the request R. For example, in the generic scenario described above, request message R includes an embedded message M1 and an address A to whom message M1 should be sent; in this case, proxy server S2 sends message M1 to the server named in address A, such as server S4. The communication is done using signed and optionally encrypted messages over the normal point to point con
US 8,171,032 B2
39 40
nections provided by the data communication network cating that the advertisement has been transmitted to a
N. When necessary in order to act on embedded message user with a particular predicted level of interest. The
M1, server S4 may exchange or be caused to exchange message may also indicate the identity of target object
further signed and optionally encrypted messages with X. In return, the advertiser may transmit an electronic
proxy server S2, Still over normal point to point connec- 5 payment to proxy server S2, proxy server S2 retains a
tions, in order to negotiate the release of user-specific service fee for itself, optionally forwards a service fee to
information and credentials from proxy server S2. In information server S4, and the balance is forwarded to
particular, server S4 may require server S2 to Supply the user or used to credit the user's account on the proxy
credentials proving that the user is entitled to the infor Se?“Ver.
mation requested—for example, proving that the user is 10 9. If the response M2 contains or identifies a target object,
a Subscriberin good standing to a particular information the passive and/or active relevance feedback that the user
service, that the user is old enough to legally receive provides on this object is tabulated by a process on the
adult material, and that the user has been offered a par user's client processor C3. A Summary of such relevance
ticular discount (by means of a special discount creden feedback information, digitally signed by client proces
tial issued to the user's pseudonym). 15 sor C3 with a proprietary private key SK, is periodi
. If proxy server S2 has sent a message to a server S4 and cally transmitted through an a secure mix path to the
server S4 has created a response M2 to message M1 to be proxy server S2, whereupon the search profile genera
sent to the user, then serverS4 transmits the response M2 tion module 202 resident on server S2 updates the appro
to the proxy server S2 using normal network point-to priate target profile interest Summary associated with
point connections. pseudonym P. provided that the signature on the Sum
. The proxy server S2, upon receipt of the response M2, mary message can be authenticated with the correspond
creates a return message Mr comprising the response ing public key PKC. which is available to all tabulating
M2 embedded in the return envelope set that was earlier process that are ensured to have integrity.
transmitted to proxy server S2 by the user in the original When a consumer enters into a financial relationship with
message M. It transmits the return message Mr along the 25 a particular information server based on both parties agreeing
pseudonymous mix path specified by this return enve to terms for the relationship, a particular pseudonym may be
lope set, so that the response M2 reaches the user at the extended for the consumer with respect to the given provider
users client processor C3. as detailed in the previous section. When entering into Such a
. The response M2 may contain a request for electronic relationship, the consumer and the service provider agree to
payment to the information server S4. The user may then 30 certain terms. However, if the user violates the terms of this
respond by means of a message M3 transmitted by the relationship, the service provider may decline to provide Ser
same means as described for message M1 above, which vice to the pseudonym under which it transacts with the user.
message M3 encloses some form of anonymous pay In addition, the service provider has the recourse of refusing
ment. Alternatively, the proxy server may respond auto to provide resolution credentials to the pseudonym, and may
matically with Such a payment, which is debited from an 35 choose to do so until the pseudonym bearer returns to good
account maintained by the proxy server for this user. Standing.
. Either the response message M2 from the information Pre-Fetching of Target Objects
server S4 to the user, ora subsequent message sent by the In some circumstances, a user may request access in
proxy server S2 to the user, may contain advertising sequence to many files, which are stored on one or more
material that is related to the user's request and/or is 40 information servers. This behavior is common when navigat
targeted to the user. Typically, if the user has just ing a hypertext system such as the World WideWeb, or when
retrieved a target object X, then (a) either proxy server using the target object browsing system described below.
S2 or information serverS4 determines a weighted set of In general, the user requests access to a particular target
advertisements that are “associated with target objectX object or menu of target objects; once the corresponding file
(b) a subset of this set is chosen randomly, where the 45 has been transmitted to the user's client processor, the user
weight of an advertisement is proportional to the prob views its contents and makes another such request, and so on.
ability that it is included in the subset, and (c) proxy Each request may take many seconds to satisfy, due to
server S2 selects from this subset just those advertise retrieval and transmission delays. However, to the extent that
ments that the user is most likely to be interested in. In the sequence of requests is predictable, the system for cus
the variation where proxy server S2 determines the set of 50 tomized electronic identification of desirable objects can
advertisements associated with target objectX, then this respond more quickly to each request, by retrieving or start
set typically consists of all advertisements that the proxy ing to retrieve the appropriate files even before the user
server's owner has been paid to disseminate and whose requests them. This early retrieval is termed “pre-fetching of
target profiles are within a threshold similarity distance files.
of the target profile of target object X. In the variation 55 Pre-fetching of locally stored data has been heavily studied
where proxy server S4 determines the set of advertise in memory hierarchies, including CPU caches and secondary
ments associated with target object X, advertisers typi storage (disks), for several decades. A leader in this area has
cally purchase the right to include advertisements in this been A. J. Smith of Berkeley, who identified a variety of
set. In either case, the weight of an advertisement is schemes and analyzed opportunities using extensive traces in
determined by the amount that an advertiser is willing to 60 both databases and CPU caches. His conclusion was that
pay. Following step (c), proxy server S2 retrieves the general schemes only really paid off where there was some
Selected advertising material and transmits it to the reasonable chance that sequential access was occurring, e.g.,
user's client processor C3, where it will be displayed to in a sequential read of data. As the balances between various
the user, within a specified length of time after it is latencies in the memory hierarchy shifted during the late
received, by a trusted process running on the user's 65 1980s and early 1990’s, J. M. Smith and others identified
client processor C3. When proxy server S2 transmits an further opportunities for pre-fetching of both locally stored
advertisement, it sends a message to the advertiser, indi data and network data. In particular, deeper analysis of pat
US 8,171,032 B2
41 42
terns in work by Blaha showed the possibility of using expert request for file Fis said to "trigger” files G1 . . . Gk. Proxy
systems for deep pattern analysis that could be used for pre server Spre-fetches each of these triggered files Gias follows:
fetching. Work by J. M. Smith proposed the use of reference 1. Unless file Gi is already stored locally (e.g., due to
history trees to anticipate references in storage hierarchies previous pre-fetch), proxy server S retrieves file Gi from
where there was some historical data. Recent work by Touch 5 an appropriate information server and stores it locally.
and the Berkeley work addressed the case of data on the 2. Proxy server S timestamps its local copy of file Gi as
World-WideWeb, where the large size of images and the long having just been pre-fetched, so that file Gi will be
latencies provide extra incentive to pre-fetch; Touch’s tech retained in local storage for a minimum of approxi
nique is to pre-send when large bandwidths permit some 10
mately t minutes before being deleted.
speculation using HTML storage references embedded in Whenever user U (or, in principle, any other user registered
WEB pages, and the Berkeley work uses techniques similar to with proxy server S) requests proxy server S to retrieve a file
J. M. Smith's reference histories specialized to the semantics that has been pre-fetched and not yet deleted, proxy server S
of HTML data. can then retrieve the file from local storage rather than from
Successful pre-fetching depends on the ability of the sys 15 another server. In a variation on steps 1-2 above, proxy server
tem to predict the next action or actions of the user. In the S pre-fetches a file Gi somewhat differently, so that pre
context of the system for customized electronic identification fetched files are stored on the user's client processor q rather
of desirable objects, it is possible to cluster users into groups than on server S:
according to the similarity of their user profiles. Any of the 1. If proxy server S has not pre-fetched file Gi in the past t
well-known pre-fetching methods that collect and utilize minutes, it retrieves file Gi and transmits it to user Us
aggregate statistics on past user behavior, in order to predict client processor q
future user behavior, may then be implemented in So as to 2. Upon receipt of the message sent in step 1, client q stores
collect and utilize a separate set of statistics for each cluster of a local copy of file Gi if one is not currently stored.
users. In this way, the system generalizes its access pattern 3. Proxy server S notifies client q that client q should
statistics from each user to similarusers, without generalizing 25 timestamp its local copy of file Gi; this notification may
among users who have substantially different interests. The be combined with the message transmitted in step 1, if
system may further collect and utilize a similar set of statistics any.
that describes the aggregate behavior of all users; in cases 4. Upon receipt of the message sent in step 3, client q
where the system cannot confidently make a prediction as to timestamps its local copy of file Gi as having just been
what a particular user will do, because the relevant statistics 30 pre-fetched, so that file Gi will be retained in local stor
concerning that user's user cluster are derived from only a age for a minimum of approximately t minutes before
small amount of data, the system may instead make its pre being deleted.
dictions based on the aggregate statistics for all users, which During the period that client q retains file Gi in local storage,
are derived from a larger amount of data. For the sake of client q can respond to any request for file Gi (by user U or, in
concreteness, we now describe a particular instantiation of a 35 principle, any other user of client q) immediately and without
pre-fetching system, that both employs these insights and that the assistance of proxy server S.
makes its pre-fetching decisions through accurate measure The difficult task is for proxy server S, each time it retrieves
ment of the expected cost and benefit of each potential pre a file Fin response to a request, to identify the files G1 . . . Gk
fetch. that should be triggered by the request for file F and pre
Pre-fetching exhibits a cost-benefit tradeoff. Let t denote 40 fetched immediately. Proxy server S employs a cost-benefit
the approximate number of minutes that pre-fetched files are analysis, performing each pre-fetch whose benefit exceeds a
retained in local storage (before they are deleted to make user-determined multiple of its cost; the user may set the
room for other pre-fetched files). If the system elects to pre multiplier low for aggressive prefetching or high for conser
fetch a file corresponding to a target object X, then the user vative prefetching. These pre-fetches may be performed in
benefits from a fast response at no extra cost, provided that the 45 parallel. The benefit of pre-fetching file Gi immediately is
user explicitly requests target object X soon thereafter. How defined to be the expected number of seconds saved by such
ever, if the user does not request target object X within t a pre-fetch, as compared to a situation where Gi is left to be
minutes of the pre-fetch, then the pre-fetch was worthless, retrieved later (either by a later pre-fetch, or by the user's
and its cost is an added cost that must be borne (directly or request) if at all. The cost of pre-fetching file Gi immediately
indirectly) by the user. The first scenario therefore provides 50 is defined to be the expected cost for proxy server S to retrieve
benefit at no cost, while the second scenario incurs a cost at no file Gi, as determined for example by the network locations of
benefit. The system tries to favor the first scenario by pre server S and file Gi and by information provider charges,
fetching only those files that the user will access anyway. times 1 minus the probability that proxy server S will have to
Depending on the user's wishes, the system may pre-fetch retrieve file Gi within t minutes (to satisfy either a later
either conservatively, where it controls costs by pre-fetching 55 pre-fetch or the user's explicit request) if it is not pre-fetched
only files that the user is extremely likely to request explicitly OW.
(and that are relatively cheap to retrieve), or more aggres The above definitions of cost and benefit have some attrac
sively, where it also pre-fetches files that the user is only tive properties. For example, if users tend to retrieve either file
moderately likely to request explicitly, thereby increasing F1 or file F2 (say) after file F, and tend only in the former case
both the total cost and (to a lesser degree) the total benefit to 60 to subsequently retrieve file G1, then the system will gener
the user. ally not pre-fetch G1 immediately after retrieving file F: for,
In the system described herein, pre-fetching for a user U is to the extent that the user is likely to retrieve file F2, the cost
accomplished by the user's proxy server S. Whenever proxy of the pre-fetch is high, and to the extent that the user is likely
server S retrieves a user-requested file F from an information to retrieve file F1 instead, the benefit of the pre-fetch is low,
server, it uses the identity of this file F and the characteristics 65 since the system can save as much or nearly as much time by
of the user, as described below, to identify a group of other waiting until the user chooses F1 and pre-fetching G1 only
files G1 ... Gk that the user is likely to access soon. The user's then.
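The retention rule in steps 1-2 above (retrieve a triggered file unless it is already stored locally, then timestamp it so that it is kept for roughly t minutes) can be illustrated with a small Python sketch. This is only a sketch under stated assumptions, not the implementation described in this application: the class name `PrefetchStore`, the `fetch` callable standing in for an information server, the injectable `clock`, and the choice to delete a copy as soon as its retention period has elapsed are all hypothetical.

```python
import time


class PrefetchStore:
    """Local storage for pre-fetched files (hypothetical illustration)."""

    def __init__(self, retain_minutes, fetch, clock=time.time):
        self.retain_seconds = retain_minutes * 60  # t minutes, in seconds
        self.fetch = fetch      # assumed callable: file name -> contents
        self.clock = clock      # injectable clock so the sketch can be tested
        self.files = {}         # file name -> (contents, pre-fetch timestamp)

    def prefetch(self, name):
        """Steps 1-2: retrieve unless stored locally, then (re)timestamp."""
        if name in self.files:
            contents = self.files[name][0]
        else:
            contents = self.fetch(name)
        self.files[name] = (contents, self.clock())

    def get(self, name):
        """Serve from local storage when possible, else from the server."""
        if name in self.files:
            return self.files[name][0]
        return self.fetch(name)

    def evict_expired(self):
        """Delete copies whose retention period has elapsed (assumed policy)."""
        now = self.clock()
        expired = [n for n, (_, ts) in self.files.items()
                   if now - ts >= self.retain_seconds]
        for n in expired:
            del self.files[n]
        return expired


# Usage with a fake server and a fake clock (both assumptions of this sketch).
fetched = []

def fake_fetch(name):
    fetched.append(name)
    return "contents of " + name

now = [0.0]
store = PrefetchStore(retain_minutes=10, fetch=fake_fetch, clock=lambda: now[0])
store.prefetch("G1")
store.prefetch("G1")      # already local: re-timestamped, not re-fetched
first = store.get("G1")   # served from local storage
now[0] = 601.0            # just over t = 10 minutes later
expired = store.evict_expired()
```

The text only guarantees a minimum retention of approximately t minutes, so a real system could keep files longer; the sketch deletes them at the first opportunity purely for simplicity.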
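The cost-benefit rule defined above can be sketched in a few lines of Python. The sketch assumes that the three quantities named in the text are already available as numbers for each candidate file: the expected seconds saved, the expected retrieval cost, and the probability that the file would have to be retrieved within t minutes anyway. The names `Candidate`, `prefetch_cost`, `should_prefetch`, and `triggered_files`, and the sample values, are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A file Gi that a request for file F might trigger (hypothetical type)."""
    name: str
    expected_seconds_saved: float   # benefit, as defined in the text
    expected_retrieval_cost: float  # network and provider charges (units assumed)
    p_needed_within_t: float        # chance Gi must be retrieved within t minutes anyway


def prefetch_cost(c):
    """Cost = expected retrieval cost times (1 - probability of needing Gi anyway)."""
    return c.expected_retrieval_cost * (1.0 - c.p_needed_within_t)


def should_prefetch(c, multiplier):
    """Pre-fetch when the benefit exceeds the user-determined multiple of the cost."""
    return c.expected_seconds_saved > multiplier * prefetch_cost(c)


def triggered_files(candidates, multiplier):
    """Names of the candidates worth pre-fetching, in input order."""
    return [c.name for c in candidates if should_prefetch(c, multiplier)]


candidates = [
    Candidate("G1", expected_seconds_saved=4.0,
              expected_retrieval_cost=2.0, p_needed_within_t=0.75),
    Candidate("G2", expected_seconds_saved=1.0,
              expected_retrieval_cost=5.0, p_needed_within_t=0.25),
]
aggressive = triggered_files(candidates, multiplier=0.1)    # low multiplier
conservative = triggered_files(candidates, multiplier=1.0)  # higher multiplier
```

With these sample values the low multiplier selects both files, while the higher multiplier selects only G1, whose cost is small because the user is likely to need it anyway.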
The proxy server S may estimate the necessary costs and benefits by adhering to the following discipline:
1. Proxy server S maintains a set of disjoint clusters of the users in its user base, clustered according to their user profiles.
2. Proxy server S maintains an initially empty set PFT of "pre-fetch triples" <C,F,G>, where F and G are files, and where C identifies either a cluster of users or the set of all users in the user base of proxy server S. Each pre-fetch triple in the set PFT is associated with several stored values specific to that triple. Pre-fetch triples and their associated values are maintained according to the rules in 3 and 4.
3. Whenever a user U in the user base of proxy server S makes a request R2 for a file G, or a request R2 that triggers file G, then proxy server S takes the following actions:
a. For C being the user cluster containing user U, and then again for C being the set of all users:
b. For any request R0 for a file, say file F, made by user U during the t minutes strictly prior to the request R2:
c. If the triple <C,F,G> is not currently a member of the set PFT, it is added to the set PFT with a count of 0, a trigger-count of 0, a target-count of 0, a total benefit of 0, and a timestamp whose value is the current date and time.
d. The count of the triple <C,F,G> is increased by one.
e. If file G was not triggered or explicitly retrieved by any request that user U made strictly in between requests R0 and R2, then the target-count of the triple <C,F,G> is increased by one.
f. If request R2 was a request for file G, then the total benefit of triple <C,F,G> is increased either by the time elapsed between request R0 and request R2, or by the expected time to retrieve file G, whichever is less.
g. If request R2 was a request for file G, and G was triggered or explicitly retrieved by one or more requests that user U made strictly in between requests R0 and R2, with R1 denoting the earliest such request, then the total benefit of triple <C,F,G> is decreased either by the time elapsed between request R1 and request R2, or by the expected time to retrieve file G, whichever is less.
4. If a user U requests a file F, then the trigger-count is incremented by one for each triple currently in the set PFT such that the triple has form <C,F,G>, where user U is in the set or cluster identified by C.
5. The "age" of a triple <C,F,G> is defined to be the number of days elapsed between its timestamp and the current date and time. If the age of any triple <C,F,G> exceeds a fixed constant number of days, and also exceeds a fixed constant multiple of the triple's count, then the triple may be deleted from the set PFT.
Proxy server S can therefore decide rapidly which files G should be triggered by a request for a given file F from a given user U, as follows.
1. Let C0 be the user cluster containing user U, and C1 be the set of all users.
2. Server S constructs a list L of all triples <C0,F,G> such that <C0,F,G> appears in set PFT with a count exceeding a fixed threshold.
3. Server S adds to list L all triples <C1,F,G> such that <C0,F,G> does not appear on list L and <C1,F,G> appears in set PFT with a count exceeding another fixed threshold.
4. For each triple <C,F,G> on list L:
5. Server S computes the cost of triggering file G to be the expected cost of retrieving file G, times 1 minus the quotient of the target-count of <C,F,G> by the trigger-count of <C,F,G>.
6. Server S computes the benefit of triggering file G to be the total benefit of <C,F,G> divided by the count of <C,F,G>.
7. Finally, proxy server S uses the computed cost and benefit, as described earlier, to decide whether file G should be triggered.
The approach to pre-fetching just described has the advantage that all data storage and manipulation concerning pre-fetching decisions by proxy server S is handled locally at proxy server S. However, this "user-based" approach does lead to duplicated storage and effort across proxy servers, as well as incomplete data at each individual proxy server. That is, the information indicating what files are frequently retrieved after file F is scattered in an uncoordinated way across numerous proxy servers. An alternative, "file-based" approach is to store all such information with file F itself. The difference is as follows. In the user-based approach, a pre-fetch triple <C,F,G> in server S's set PFT may mention any file F and any file G on the network, but is restricted to clusters C that are subsets of the user base of server S. By contrast, in the file-based approach, a pre-fetch triple <C,F,G> in server S's set PFT may mention any user cluster C and any file G on the network, but is restricted to files F that are stored on server S. (Note that in the file-based approach, user clustering is network wide, and user clusters may include users from different proxy servers.) When a proxy server S2 sends a request to server S to retrieve file F for a user U, server S2 indicates in this message the user U's user cluster C0, as well as the user U's value for the user-determined multiplier that is used in cost-benefit analysis. Server S can use this information, together with all its triples in its set PFT of the form <C0,F,G> and <C1,F,G>, where C1 is the set of all users everywhere on the network, to determine (exactly as in the user-based approach) which files G1 ... Gk are triggered by the request for file F. When server S sends file F back to proxy server S2, it also sends this list of files G1 ... Gk, so that proxy server S2 can proceed to pre-fetch files G1 ... Gk.
The file-based approach requires some additional data transmission. Recall that under the user-based approach, server S must execute steps 3c-3g above for any ordered pair of requests R0 and R2 made within t minutes of each other by a user who employs server S as a proxy server. Under the file-based approach, server S must execute steps 3c-3g above for any ordered pair of requests R0 and R2 made within t minutes of each other, by any user on the network, such that R0 requests a file stored on server S. Therefore, when a user makes a request R2, the user's proxy server must send a notification of request R2 to all servers S such that, during the preceding t minutes (where the variable t may now depend on server S), the user has made a request R0 for a file stored on server S. This notification need not be sent immediately, and it is generally more efficient for each proxy server to buffer up such notifications and send them periodically in groups to the appropriate servers.
Access And Reachability Control of Users and User-Specific Information
Although users' true identities are protected by the use of secure mix paths, pseudonymity does not guarantee complete privacy. In particular, advertisers can in principle employ user-specific data to barrage users with unwanted solicitations. The general solution to this problem is for proxy server S2 to act as a representative on behalf of each user in its user base, permitting access to the user and the user's private data only in accordance with criteria that have been set by the user. Proxy server S2 can restrict access in two ways:
1. The proxy server S2 may restrict access by third parties to server S2's pseudonymous database of user-specific information. When a third party such as an advertiser sends a message to server S2 requesting the release of user-specific information for a pseudonym P, server S2 refuses to honor the request unless the message includes credentials for the accessor adequate to prove that the accessor is entitled to this information. The user associated with pseudonym P may at any time send signed control messages to proxy server S2, specifying the credentials or Boolean combinations of credentials that proxy server S2 should thenceforth consider to be adequate grounds for releasing a specified subset of the information associated with pseudonym P. Proxy server S2 stores these access criteria with its database record for pseudonym P. For example, a user might wish proxy server S2 to release purchasing information only to selected information providers, to charitable organizations (that is, organizations that can provide a government-issued credential that is issued only to registered charities), and to market researchers who have paid user U for the right to study user U's purchasing habits.
2. The proxy server S2 may restrict the ability of third parties to send electronic messages to the user. When a third party such as an advertiser attempts to send information (such as a textual message or a request to enter into spoken or written real-time communication) to pseudonym P, by sending a message to proxy server S2 requesting proxy server S2 to forward the information to the user at pseudonym P, proxy server S2 will refuse to honor the request, unless the message includes credentials for the accessor adequate to meet the requirements the user has chosen to impose, as above, on third parties who wish to send information to the user. If the message does include adequate credentials, then proxy server S2 removes a single-use pseudonymous return address envelope from its database record for pseudonym P, and uses the envelope to send a message containing the specified information along a secure mix path to the user of pseudonym P. If the envelope being used is the only envelope stored for pseudonym P, or more generally if the supply of such envelopes is low, proxy server S2 adds a notation to this message before sending it, which notation indicates to the user's local server that it should send additional envelopes to proxy server S2 for future use.
In a more general variation, the user may instruct the proxy server S2 to impose more complex requirements on the granting of requests by third parties, not simply Boolean combinations of required credentials. The user may impose any Boolean combination of simple requirements that may include, but are not limited to, the following:
(a.) the accessor (third party) is a particular party
(b.) the accessor has provided a particular credential
(c.) satisfying the request would involve disclosure to the accessor of a certain fact about the user's user profile
(d.) satisfying the request would involve disclosure to the accessor of the user's target profile interest summary
(e.) satisfying the request would involve disclosure to the accessor of statistical summary data, which data are computed from the user's user profile or target profile interest summary together with the user profiles and target profile interest summaries of at least n other users in the user base of the proxy server
(f.) the content of the request is to send the user a target object, and this target object has a particular attribute (such as high reading level, or low vulgarity, or an authenticated Parental Guidance rating from the MPAA)
(g.) the content of the request is to send the user a target object, and this target object has been digitally signed with a particular private key (such as the private key used by the National Pharmaceutical Association to certify approved documents)
(h.) the content of the request is to send the user a target object, and the target profile has been digitally signed by a profile authentication agency, guaranteeing that the target profile is a true and accurate profile of the target object it claims to describe, with all attributes authenticated
(i.) the content of the request is to send the user a target object, and the target profile of this target object is within a specified distance of a particular search profile specified by the user
(j.) the content of the request is to send the user a target object, and the proxy server S2, by using the user's stored target profile interest summary, estimates the user's likely interest in the target object to be above a specified threshold
(k.) the accessor indicates its willingness to make a particular payment to the user in exchange for the fulfillment of the request
The steps required to create and maintain the user's access control requirements are as follows:
1. The user composes a Boolean combination of predicates that apply to requests; the resulting complex predicate should be true when applied to a request that the user wants proxy server S2 to honor, and false otherwise. The complex predicate may be encoded in another form, for efficiency.
2. The complex predicate is signed with SK, and transmitted from the user's client processor C3 to the proxy server S2 through the mix path enclosed in a packet that also contains the user's pseudonym P.
3. The proxy server S2 receives the packet, verifies its authenticity using PK, and stores the access control instructions specified in the packet as part of its database record for pseudonym P.
The proxy server S2 enforces access control as follows:
1. The third party (accessor) transmits a request to proxy server S2 using the normal point-to-point connections provided by the network N. The request may be to access the target profile interest summaries associated with a set of pseudonyms P1 ... Pn, or to access the user profiles associated with a set of pseudonyms P1 ... Pn, or to forward a message to the users associated with pseudonyms P1 ... Pn. The accessor may explicitly specify the pseudonyms P1 ... Pn, or may ask that P1 ... Pn be chosen to be the set of all pseudonyms registered with proxy server S2 that meet specified conditions.
2. The proxy server S2 indexes the database record for each pseudonym Pi (1<=i<=n), retrieves the access requirements provided by the user associated with Pi, and determines whether and how the transmitted request should be satisfied for Pi. If the requirements are satisfied, S2 proceeds with steps 3a-3c.
3a. If the request can be satisfied but only upon payment of a fee, the proxy server S2 transmits a payment request to the accessor, and waits for the accessor to send the payment to the proxy server S2. Proxy server S2 retains
a service fee and forwards the balance of the payment to the user associated with pseudonym Pi, via an anonymous return packet that this user has provided.
3b. If the request can be satisfied but only upon provision of a credential, the proxy server S2 transmits a credential request to the accessor, and waits for the accessor to send the credential to the proxy server S2.
3c. The proxy server S2 satisfies the request by disclosing user-specific information to the accessor, by providing the accessor with a set of single-use envelopes to communicate directly with the user, or by forwarding a message to the user, as requested.
4. Proxy server S2 optionally sends a message to the accessor, indicating why each of the denied requests for P1 ... Pn was denied, and/or indicating how many requests were satisfied.
5. The active and/or passive relevance feedback provided by any user U with respect to any target object sent by any path from the accessor is tabulated by the above-described tabulating process resident on user U's client processor C3.
As described above, a summary of such information is periodically transmitted to the proxy server S2 to enable the proxy server S2 to update that user's target profile interest summary and user profile.
The access control criteria can be applied to solicited as well as unsolicited transmissions. That is, the proxy server can be used to protect the user from inappropriate or misrepresented target objects that the user may request. If the user requests a target object from an information server, but the target object turns out not to meet the access control criteria, then the proxy server will not permit the information server to transmit the target object to the user, or to charge the user for such transmission. For example, to guard against target objects whose profiles have been tampered with, the user may specify an access control criterion that requires the provider to prove the target profile's accuracy by means of a digital signature from a profile authentication agency. As another example, the parents of a child user may instruct the proxy
known: for example, "FAX trees" are in common use in political organizations, and multicast trees are widely used in distribution of multimedia data in the Internet; for example, see "Scalable Feedback Control for Multicast Video Distribution in the Internet," (Jean-Chrysostome Bolot, Thierry Turletti, & Ian Wakeman, Computer Communication Review, Vol. 24, #4, Oct. 94, Proceedings of SIGCOMM 94, pp. 58-67) or "An Architecture For Wide-Area Multicast Routing," (Stephen Deering, Deborah Estrin, Dino Farinacci, Van Jacobson, Ching-Gung Liu, & Liming Wei, Computer Communication Review, Vol. 24, #4, Oct. 94, Proceedings of SIGCOMM 94, pp. 126-135). While there are many possible trees that can be overlaid on a graph representation of a network, both the nature of the networks (e.g., the cost of transmitting data over a link) and their use (for example, certain nodes may exhibit more frequent intercommunication) can make one choice of tree better than another for use as a multicast tree. One of the most difficult problems in practical network design is the construction of "good" multicast trees, that is, tree choices which exhibit low cost (due to data not traversing links unnecessarily) and good performance (due to data frequently being close to where it is needed).
Constructing a Multicast Tree
Algorithms for constructing multicast trees have either been ad-hoc, as is the case of the Deering, et al. Internet multicast tree, which adds clients as they request service by grafting them into the existing tree, or by construction of a minimum cost spanning tree. A distributed algorithm for creating a spanning tree (defined as a tree that connects, or "spans," all nodes of the graph) on a set of Ethernet bridges was developed by Radia Perlman ("Interconnections: Bridges and Routers," Radia Perlman, Addison-Wesley, 1992). Creating a minimal-cost spanning tree for a graph depends on having a cost model for the arcs of the graph (corresponding to communications links in the communications network). In the case of Ethernet bridges, the default cost (more complicated costing models for path costs are discussed on pp. 72-73 of Perlman) is calculated as a simple
server that only target objects that have been digitally signed 40 distance measure to the root; thus the spanning tree mini
by a recognized child protection organization may be trans mizes the cost to the root by first electing a unique root and
mitted to the user; thus, the proxy server will not let the user then constructing a spanning tree based on the distances from
retrieve pornography, even from a rogue information server the root. In this algorithm, the root is elected by recourse to a
that is willing to provide pornography to users who have not numeric ID contained in "configuration messages': the server
Supplied an adulthood credential. 45 whose ID has minimum numeric value is chosen as the root.
Distribution of Information with Multicast Trees Several problems exist with this algorithm in general. First,
The graphical representation of the network N presented in the method of using an ID does not necessarily select the best
FIG. 3 shows that at least one of the data communications root for the nodes interconnected in the tree. Second, the cost
links can be eliminated, as shown in FIG. 4, while still model is simplistic.
enabling the network N to transmit messages among all the 50 We first show how to use the similarity-based methods
servers A-D. By elimination, we mean that the link is unused described above to select the servers most interested in a
in the logical design of the network, rather than a physical group of target objects, herein termed “core servers' for that
disconnection of the link. The graphs that result when all group. Next we show how to construct an unrooted multicast
redundant data communications links are eliminated are tree that can be used to broadcast files to these core servers.
termed “trees” or “connected acyclic graphs.” A graph where 55 Finally, we show how files corresponding to target objects are
a message could be transmitted by a server through other actually broadcast through the multicast tree at the initiative
servers and then return to the transmitting server over a dif of a client, and how these files are later retrieved from the core
ferent originating data communications link is termed a servers when clients request them.
“cycle. A tree is thus an acyclic graph whose edges (links) Since the choice of core servers to distribute a file to
connect a set of graph “nodes' (servers). The tree can be used 60 depends on the set of users who are likely to retrieve the file
to efficiently broadcast any data file to selected servers in a set (that is, the set of users who are likely to be interested in the
of interconnected servers. corresponding target object), a separate set of core servers and
The tree structure is attractive in a communications net hence a separate multicast tree may be used for each topical
work because much information distribution is multicast in group of target objects. Throughout the description below,
nature—that is, a piece of information available at a single 65 servers may communicate among themselves through any
source must be distributed to a multiplicity of points where path over which messages can travel; the goal of each multi
the information can be accessed. This technique is widely cast tree is to optimize the multicast distribution of files
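As an illustration only, the bridge-style spanning tree described above (elect the node with the lowest numeric ID as root, then keep each node's cheapest path to that root) can be sketched in Python. The function names `elect_root` and `spanning_tree`, the link-cost table, and the centralized shortest-path computation are assumptions made for this sketch; the cited bridge algorithm is distributed and exchanges configuration messages instead.

```python
import heapq

def elect_root(node_ids):
    """Elect the node with the smallest numeric ID as the root."""
    return min(node_ids)

def spanning_tree(links, root):
    """Return {node: parent} for the tree of cheapest paths to `root`.

    `links` maps an undirected pair (a, b) to a positive link cost.
    """
    neighbours = {}
    for (a, b), cost in links.items():
        neighbours.setdefault(a, []).append((b, cost))
        neighbours.setdefault(b, []).append((a, cost))
    best = {root: 0}
    parent = {root: None}
    queue = [(0, root)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best[node]:
            continue  # stale queue entry, a cheaper path was already found
        for nxt, cost in neighbours.get(node, []):
            if dist + cost < best.get(nxt, float("inf")):
                best[nxt] = dist + cost
                parent[nxt] = node
                heapq.heappush(queue, (dist + cost, nxt))
    return parent

# Hypothetical four-node network with one redundant, expensive link (3-4).
links = {(1, 2): 1, (2, 3): 1, (1, 4): 1, (3, 4): 5}
root = elect_root([1, 2, 3, 4])
tree = spanning_tree(links, root)
```

In this hypothetical network the link between nodes 3 and 4 ends up unused, which is the sense of "elimination" used in the text. The sketch also shows the weakness the text points out: the root is chosen by ID alone, with no regard to where the traffic is.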
corresponding to target objects of the corresponding topic. Note that this problem is completely distinct from selecting a multiplicity of spanning trees for the complete set of interconnected nodes as disclosed by Sincoskie in U.S. Pat. No. 4,706,080 and the publication titled "Extended Bridge Algorithms for Large Networks" by W. D. Sincoskie and C. J. Cotton, published January 1988 in IEEE Network on pages 16-24. The trees in this disclosure are intentionally designed to interconnect a selected subset of nodes in the system, and are successful to the degree that this subset is relatively small.
Multicast Tree Construction Procedure
A set of topical multicast trees for a set of homogenous target objects may be constructed or reconstructed at any time, as follows. The set of target objects is grouped into a fixed number of topical clusters C1 ... Cp with the methods described above, for example, by choosing C1 ... Cp to be the result of a k-means clustering of the set of target objects, or alternatively a covering set of low-level clusters from a hierarchical cluster tree of these target objects. A multicast tree MT(C) is then constructed from each cluster C in C1 ... Cp, by the following procedure:
1. Given a set of proxy servers, S1 ... Sn, and a topical cluster C. It is assumed that a general multicast tree MT_full that contains all the proxy servers S1 ... Sn has previously been constructed by well-known methods.
2. Each pair <Si, C> is associated with a weight, w(Si, C), which is intended to covary with the expected number of users in the user base of proxy server Si who will subsequently access a target object from cluster C. This weight is computed by proxy server Si in any of several ways, all of which make use of the similarity measurement computation described herein.
One variation makes use of the following steps:
(a) Proxy server Si randomly selects a target object T from cluster C.
(b) For each pseudonym in its local database, with associated user U, proxy server Si applies the techniques disclosed above to user U's stored user profile and target profile interest summary in order to estimate the interest w(U, T) that user U has in the selected target object T. The aggregate interest w(Si, T) that the user base of proxy server Si has in the target object T is defined to be the sum of these interest values w(U, T). Alternatively, w(Si, T) may be defined to be the sum of values s(w(U, T)) over all U in the user base. Here s(*) is a sigmoidal function that is close to 0 for small arguments and close to a constant p for large arguments; thus s(w(U, T)) estimates the probability that user U will access target object T, which probability is assumed to be independent of the probability that any other user will access target object T. In a variation, w(Si, T) is made to estimate the probability that at least one user from the user base of Si will access target object T: then w(Si, T) may be defined as the maximum of values w(U, T), or as 1 minus the product over the users U of the quantity (1-s(w(U, T))).
(c) Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thereby computed in step (b) to determine the desired quantity w(Si, C), which quantity represents the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C.
In another variation, where target profile interest summaries are embodied as search profile sets, the following procedure is followed to compute w(Si, C):
(a) For each search profile Ps in the locally stored search profile set of any user in the user base of proxy server Si, proxy server Si computes the distance d(Ps, Pc) between the search profile and the cluster profile Pc of cluster C.
(b) w(Si, C) is chosen to be the maximum value of (-d(Ps, Pc)/r) across all such search profiles Ps, where r is computed as an affine function of the cluster diameter of cluster C. The slope and/or intercept of this affine function are chosen to be smaller (thereby increasing w(Si, C)) for servers Si for which the target object provider wishes to improve performance, as may be the case if the users in the user base of proxy server Si pay a premium for improved performance, or if performance at Si will otherwise be unacceptably low due to slow network connections.
In another variation, the proxy server Si is modified so that it maintains not only target profile interest summaries for each user in its user base, but also a single aggregate target profile interest summary for the entire user base. This aggregate target profile interest summary is determined in the usual way from relevance feedback, but the relevance feedback on a target object, in this case, is considered to be the frequency with which users in the user base retrieved the target object when it was new. Whenever a user retrieves a target object by means of a request to proxy server Si, the aggregate target profile interest summary for proxy server Si is updated. In this variation, w(Si, C) is estimated by the following steps:
(a) Proxy server Si randomly selects a target object T from cluster C.
(b) Proxy server Si applies the techniques disclosed above to its stored aggregate target profile interest summary in order to estimate the aggregate interest w(Si, T) that its aggregated user base had in the selected target object T, when new; this may be interpreted as an estimate of the likelihood that at least one member of the user base will retrieve a new target object similar to T.
(c) Proxy server Si repeats steps (a)-(b) for several target objects T selected randomly from cluster C, and averages the several values of w(Si, T) thereby computed in step (b) to determine the desired quantity w(Si, C), which quantity represents the expected aggregate interest by the user base of proxy server Si in the target objects of cluster C.
3. Those servers Si from among S1 ... Sn with the greatest weights w(Si, C) are designated "core servers" for cluster C. In one variation, where it is desired to select a fixed number of core servers, those servers Si with the greatest values of w(Si, C) are selected. In another variation, the value of w(Si, C) for each server Si is compared against a fixed threshold W, and those servers Si such that w(Si, C) equals or exceeds W are selected as core servers. If cluster C represents a narrow and specialized set of target objects, as often happens when the clusters C1 ... Cp are numerous, it is usually adequate to select only a small number of core servers for cluster C, thereby obtaining substantial advantages in computational efficiency in steps 4-5 below.
4. A complete graph G(C) is constructed whose vertices are the designated core servers for cluster C. For each pair of core servers, the cost of transmitting a message between those core servers along the cheapest path is estimated, and the weight of the edge connecting those core servers is taken to be this cost. The cost is determined as a suitable function of average transmission charges, average transmission delay, and worst-case or near-worst-case transmission delay.
5. The multicast tree MT(C) is computed by standard methods to be the minimum spanning tree (or a near-minimum spanning tree) for G(C), where the weight of an edge between two core servers is taken to be the cost of transmitting a message between those two core servers. Note that MT(C) does not contain as vertices all proxy servers S1 ... Sn, but only the core servers for cluster C.
6. A message M is formed describing the cluster profile for cluster C, the core servers for cluster C and the topology of the multicast tree MT(C) constructed on those core servers. Message M is broadcast to all proxy servers S1 ... Sn by means
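Steps 2 through 5 of the construction procedure above (per-server weights, core-server selection, the complete cost graph, and its minimum spanning tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the logistic form of `sigmoid`, its parameters, and all function names are hypothetical, and Prim's algorithm is used as one of the "standard methods" for a minimum spanning tree.

```python
import math

def sigmoid(x, p_max=1.0, midpoint=0.0, steepness=1.0):
    """Assumed sigmoid s(*): near 0 for small x, near p_max for large x."""
    return p_max / (1.0 + math.exp(-steepness * (x - midpoint)))

def aggregate_interest(user_interests):
    """w(Si, T) as 1 minus the product over users of (1 - s(w(U, T)))."""
    none_access = 1.0
    for w in user_interests:
        none_access *= 1.0 - sigmoid(w)
    return 1.0 - none_access

def cluster_weight(samples):
    """w(Si, C): average of w(Si, T) over sampled target objects T.

    `samples` holds, for each sampled T, the list of values w(U, T).
    """
    return sum(aggregate_interest(ws) for ws in samples) / len(samples)

def select_core_servers(weights, count):
    """Step 3, fixed-count variation: servers with the greatest w(Si, C)."""
    return sorted(weights, key=weights.get, reverse=True)[:count]

def minimum_spanning_tree(nodes, cost):
    """Steps 4-5: Prim's algorithm on the complete graph over `nodes`."""
    nodes = list(nodes)
    if not nodes:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: cost(*edge),
        )
        in_tree.add(b)
        edges.append((a, b))
    return edges

# Hypothetical weights for four proxy servers and pairwise path costs.
weights = {"A": 0.9, "B": 0.7, "C": 0.1, "D": 0.6}
core = select_core_servers(weights, 3)
costs = {frozenset("AB"): 1, frozenset("AD"): 4, frozenset("BD"): 2}
tree = minimum_spanning_tree(core, lambda a, b: costs[frozenset((a, b))])
```

The threshold variation of step 3 would replace `select_core_servers` with a filter that keeps every server whose weight equals or exceeds the threshold. Because only the core servers become vertices, the spanning-tree computation stays small even when the number of proxy servers is large.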
of the general multicast tree MT_full. Each proxy server Si, upon receipt of message M, extracts the cluster profile of cluster C, and stores it on a local storage device, together with certain other information that it determines from message M, as follows. If proxy server Si is named in message M as a core server for cluster C, then proxy server Si extracts and stores the subtree of MT(C) induced by all core servers whose path distance from Si in the graph MT(C) is less than or equal to d, where d is a constant positive integer (usually from 1 to 3). If message M does not name proxy server Si as a core server for MT(C), then proxy server Si extracts and stores a list of one or more nearby core servers that can be inexpensively contacted by proxy server Si over virtual point-to-point links.
In the network of FIG. 3, to illustrate the use of trees, as applied to the system of the present invention, consider the following simple example where it is assumed that client r provides on-line information for the network, such as an electronic newspaper. This information can be structured by client r into a prearranged form, comprising a number of files, each of which is associated with a different target object. In the case of an electronic newspaper, the files can contain textual representations of stock prices, weather forecasts, editorials, etc. The system determines likely demand for the target objects associated with these files in order to optimize the distribution of the files through the network N of interconnected clients p-s and proxy servers A-D. Assume that cluster C consists of text articles relating to the aerospace industry; further assume that the target profile interest summaries stored at proxy servers A and B for the users at clients p and r indicate that these users are strongly interested in such articles. Then the proxy servers A and B are selected as core servers for the multicast tree MT(C). The multicast tree MT(C) is then computed to consist of the core servers, A and B, connected by an edge that represents the least costly virtual point-to-point link between A and B (either the direct path A-B or the indirect path A-C-B, depending on the cost).
Global Requests to Multicast Trees
One type of message that may be transmitted to any proxy server S is termed a "global request message." Such a message M triggers the broadcast of an embedded request R to all core servers in a multicast tree MT(C). The content of request R and the identity of cluster C are included in the message M, as is a field indicating that message M is a global request message. In addition, the message M contains a field S_last which is unspecified except under certain circumstances described below, when it names a specific core server. A global request message M may be transmitted to proxy server S by a user registered with proxy server S, which transmission may take place along a pseudonymous mix path, or it may be transmitted to proxy server S from another proxy server, along a virtual point-to-point connection.
When a proxy server S receives a message M that is marked as a global request message, it acts as follows:
1. If proxy server S is not a core server for topic C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and transmits a copy of message M over a virtual point-to-point connection to core server S'. If this transmission fails, proxy server S repeats the procedure with other core servers on its list.
2. If proxy server S is a core server for topic C, it executes the following steps:
(a) Act on the request R that is embedded in message M.
(b) Set S_curr to be S.
(c) Retrieve the locally stored subtree of MT(C), and extract from it a list L of all core servers that are directly linked to S_curr in this subtree.
(d) If the message M specifies a value for S_last, and S_last appears on the list L, remove S_last from the list L. Note that list L may be empty before this step, or may become empty as a result of this step.
(e) For each server Si in list L, transmit a copy of message M from server S to server Si over a virtual point-to-point connection, where the S_last field of the copy of message M has been altered to S_curr. If Si cannot be reached in a reasonable amount of time by any virtual point-to-point connection (for example, server Si is broken), recurse to step (c) above with S_orig bound to S_curr and S_curr bound to Si for the duration of the recursion.
When server S' in step 1 or a server Si in step 2(e) receives a copy of the global request message M, it acts according to exactly the same steps. As a result, all core servers eventually receive a copy of global request message M and act on the embedded request R, unless some core servers cannot be reached. Even if a core server is unreachable, step (e) ensures that the broadcast can continue to other core servers in most circumstances, provided that d > 1; higher values of d provide additional insurance against unreachable core servers.
Multicasting Files
The system for customized electronic information of desirable objects executes the following steps in order to introduce a new target object into the system. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client processor q, as illustrated in FIG. 3, or an automatic software process resident on a client or server processor q.
1. Processor q forms a signed request R, which asks the receiver to store a copy of a file F on its local storage device. File F, which is maintained by client q on storage at client q or on storage accessible by client q over the network, contains the informational content of or an identifying description of a target object, as described above. The request R also includes an address at which entity E may be contacted (possibly a pseudonymous address at some proxy server D), and asks the receiver to store the fact that file F is maintained by an entity at said address.
2. Processor q embeds request R in a message M1, which it pseudonymously transmits to the entity E's proxy server D as described above. Message M1 instructs proxy server D to broadcast request R along an appropriate multicast tree.
3. Upon receipt of message M1, proxy server D examines the doubly embedded file F and computes a target profile P for the corresponding target object. It compares the target profile P to each of the cluster profiles for topical clusters C1 ... Cp described above, and chooses Ck to be the cluster with the smallest similarity distance to profile P.
4. Proxy server D sends itself a global request message M instructing itself to broadcast request R along the topical multicast tree MT(Ck).
5. Proxy server D notifies entity E through a pseudonymous communication that file F has been multicast along the topical multicast tree for cluster Ck.
As a result of the procedure that server D and other servers follow for acting on global request messages, step 4 eventually causes all core servers for topic Ck to act on request R and therefore store a local copy of file F. In order to make room for file F on its local storage device, a core server Si may have to delete a less useful file. There are several ways to choose a file to delete. One option, well known in the art, is for Si to choose to delete the least recently accessed file. In another variation, Si deletes a file that it believes few users will access. In this variation, whenever a server Si stores a copy of a file F, it also computes and stores the weight w(Si, C), where C is a cluster consisting of the single target object associated with file F. Then, when server Si needs to delete a file, it chooses to delete the file F with the lowest weight w(Si, C). To reflect the fact that files are accessed less as they age, server Si periodically multiplies its stored value of w(Si, C) by a decay factor, such as 0.95, for each file F that it then stores. Alternatively, instead of using a decay factor, server Si may periodically recompute aggregate interest w(Si, C) for each file F that it stores; the aggregate interest changes over
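The handling of a global request by core servers, steps 2(a) through 2(e) under "Global Requests to Multicast Trees" above, can be illustrated with a small simulation. It is a sketch under assumptions: the function `broadcast`, the dictionary representation of MT(C), and the `unreachable` set are hypothetical, and the whole tree is held in one dictionary, whereas each real core server would consult only its locally stored subtree of depth d.

```python
def broadcast(tree, start, unreachable=frozenset()):
    """Return the core servers that acted on the request, in delivery order.

    `tree` maps each core server to the set of its neighbours in MT(C).
    """
    acted = []

    def receive(server, last):
        acted.append(server)      # step (a): act on the embedded request R
        forward(server, last)     # step (b): S_curr starts as the server itself

    def forward(curr, last):
        # Steps (c)-(d): neighbours of S_curr, minus the server named in S_last.
        for nxt in sorted(tree[curr] - {last}):
            if nxt in unreachable:
                # Step (e) fallback: treat the broken server as S_curr and
                # contact its other neighbours directly. This needs a stored
                # subtree that reaches beyond immediate neighbours (d > 1).
                forward(nxt, curr)
            else:
                receive(nxt, curr)  # the copy carries S_last = S_curr

    receive(start, None)
    return acted

# Hypothetical multicast tree: A-B, B-C, B-D, D-E.
mt = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B", "E"}, "E": {"D"}}
all_up = broadcast(mt, "A")
b_down = broadcast(mt, "A", unreachable={"B"})
```

With every server reachable, each core server acts on the request exactly once because a copy is never sent back toward the server it came from. With B unreachable, A delivers copies to C and D on B's behalf, so the broadcast still reaches every working core server.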
time because target objects typically have an age attribute that the system considers in estimating user interest, as described above.
If entity E later wishes to remove file F from the network, for example because it has just multicast an updated version, it pseudonymously transmits a digitally signed global request message to proxy server D, requesting all proxy servers in the multicast tree MT(Ck) to delete any local copy of file F that they may be storing.
Queries to Multicast Trees
In addition to global request messages, another type of message that may be transmitted to any proxy server S is termed a "query message." When transmitted to a proxy server, a query message causes a reply to be sent to the originator of the message; this reply will contain an answer to a given query Q if any of the servers in a given multicast tree MT(C) are able to answer it, and will otherwise indicate that no answer is available. The query and the cluster C are named in the query message. In addition, the query message contains a field S_last, which is unspecified except under certain circumstances described below, when it names a specific core server. When a proxy server S receives a message M that is marked as a query message, it acts as follows:
1. Proxy server S sets A to be the return address for the client or server that transmitted message M to server S. A may be either a network address or a pseudonymous address.
2. If proxy server S is not a core server for cluster C, it retrieves its locally stored list of nearby core servers for topic C, selects from this list a nearby core server S', and transmits a copy of the locate message M over a virtual point-to-point connection to core server S'. If this transmission fails, proxy server S repeats the procedure with other core servers on its list. Upon receiving a reply, it forwards this reply to address A.
3. If proxy server S is a core server for cluster C, and it is able to answer query Q using locally stored information, then it transmits a "positive" reply to A, containing the answer.
4. If proxy server S is a core server for topic C, but it is unable to answer query Q using locally stored information, then it carries out a parallel depth-first search by executing the following steps:
(a) Set L to be the empty list.
(b) Retrieve the locally stored subtree of MT(C). For each server Si directly linked to S in this subtree, other than S_last (if specified), add the ordered pair (Si, S) to the list L.
(c) If L is empty, transmit a "negative" reply to address A, saying that server S cannot locate an answer to query Q, and terminate the execution of step 4; otherwise proceed to step (d).
(d) Select a list L1 of one or more server pairs (Ai, Bi) from the list L. For each server pair (Ai, Bi) on the list L1, form a locate message M(Ai, Bi), which is a copy of message M whose S_last field has been modified to specify Bi, and transmit this message M(Ai, Bi) to server Ai over a virtual point-to-point connection.
(e) For each reply received (by S) to a message sent in step (d), act as follows:
(i) If a "positive" reply arrives to a locate message M(Ai, Bi), then forward this reply to A, and terminate step 4, immediately.
(ii) If a "negative" reply arrives to a locate message M(Ai, Bi), then remove the pair (Ai, Bi) from the list L1.
(iii) If the message M(Ai, Bi) could not be successfully delivered to Ai, then remove the pair (Ai, Bi) from the list L1, and add the pair (Ci, Ai) to the list L1 for each Ci other than Bi that is directly linked to Ai in the locally stored subtree of MT(C).
(f) Once L1 no longer contains any pair (Ai, Bi) for which a message M(Ai, Bi) has been sent, or after a fixed period of time has elapsed, return to step (c).
Retrieving Files from a Multicast Tree
When a processor q in the network wishes to retrieve the file associated with a given target object, it executes the following steps. These steps are initiated by an entity E, which may be either a user entering commands via a keyboard at a client q, as illustrated in FIG. 3, or an automatic software process resident on a client or server processor q.
1. Processor q forms a query Q that asks whether the recipient (a core server for cluster C) still stores a file F that was previously multicast to the multicast tree MT(C); if so, the recipient server should reply with its own server name. Note that processor q must already know the name of file F and the identity of cluster C; typically, this information is provided to entity E by a service such as the news clipping service or browsing system described below, which must identify files to the user by (name, multicast topic) pair.
2. Processor q forms a query message M that poses query Q to the multicast tree MT(C).
3. Processor q pseudonymously transmits message M to the user's proxy server D, as described above.
4. Processor q receives a response M2 to message M.
5. If the response M2 is "positive," that is, it names a server S that still stores file F, then processor q pseudonymously instructs the user's proxy server D to retrieve file F from server S. If the retrieval fails because server S has deleted file F since it answered the query, then client q returns to step 1.
6. If the response M2 is "negative," that is, it indicates that no server in MT(C) still stores file F, then processor q forms a query Q that asks the recipient for the address A of the entity that maintains file F; this entity will ordinarily maintain a copy of file F indefinitely. All core servers in MT(C) ordinarily retain this information (unless instructed to delete it by the maintaining entity), even if they delete file F for space reasons. Therefore, processor q should receive a response providing address A, whereupon processor q pseudonymously instructs the user's proxy server D to retrieve file F from address A.
When multiple versions of a file F exist on local servers throughout the data communication network N, but are not marked as alternate versions of the same file, the system's ability to rapidly locate files similar to F (by treating them as target objects and applying the methods disclosed in "Searching for Target Objects" above) makes it possible to find all the alternate versions, even if they are stored remotely. These related data files may then be reconciled by any method. In a simple instantiation, all versions of the data file would be replaced with the version that had the latest date or version number. In another instantiation, each version would be automatically annotated with references or pointers to the other versions.

News Clipping Service

The system for customized electronic identification of desirable objects of the present invention can be used in the electronic media system of FIG. 1 to implement an automatic news clipping service which learns to select (filter) news articles to match a user's interests, based solely on which articles the user chooses to read. The system for customized electronic identification of desirable objects generates a target profile for each article that enters the electronic media system, based on the relative frequency of occurrence of the words contained in the article. The system for customized electronic identification of desirable objects also generates a search profile set for each user, as a function of the target profiles of the articles the user has accessed and the relevance feedback the user has provided on these articles. As new articles are received for storage on the mass storage systems SS1-SSm of the information servers I1-Im, the system for customized electronic identification of desirable objects generates their target profiles. The generated target profiles are later compared to the search profiles in the users' search profile sets, and those new articles whose target profiles are closest
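The search that a core server performs for a query message, steps 3 and 4 under "Queries to Multicast Trees" above, can be sketched as follows. The sketch is sequential rather than parallel and handles one server pair at a time, which is a simplification; the function `locate`, the `holders` set (servers able to answer), and the `unreachable` set are hypothetical names introduced for illustration.

```python
from collections import deque

def locate(tree, server, holders, last=None, unreachable=frozenset()):
    """Return a core server that can answer, or None for a "negative" reply.

    `tree` maps each core server to the set of its neighbours in MT(C).
    """
    if server in holders:
        return server  # step 3: "positive" reply
    # Step 4(b): pairs (neighbour, sender), skipping the server named in S_last.
    pending = deque((n, server) for n in sorted(tree[server]) if n != last)
    while pending:  # step 4(c): an empty list means a "negative" reply
        target, via = pending.popleft()  # step 4(d), one pair at a time
        if target in unreachable:
            # Step 4(e)(iii): replace the failed pair with pairs for the
            # failed server's other neighbours.
            pending.extend((c, target) for c in sorted(tree[target]) if c != via)
            continue
        reply = locate(tree, target, holders, last=via, unreachable=unreachable)
        if reply is not None:
            return reply  # step 4(e)(i): forward the "positive" reply
    return None

# Hypothetical multicast tree: A-B, B-C, B-D, D-E; only E still stores file F.
mt = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B", "E"}, "E": {"D"}}
found = locate(mt, "A", holders={"E"})
found_around_b = locate(mt, "A", holders={"E"}, unreachable={"B"})
missing = locate(mt, "A", holders=set())
```

A `None` result corresponds to the "negative" response in step 6 of the retrieval procedure, after which the requester asks for the address of the entity that maintains the file.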
(most similar) to the closest search profile in a user's search profile set are identified to that user for possible reading. The computer program providing the articles to the user monitors how much the user reads (the number of screens of data and the number of minutes spent reading), and adjusts the search profiles in the user's search profile set to more closely match what the user apparently prefers to read. The details of the method used by this system are disclosed in flow diagram form in FIG. 5. This method requires selecting a specific method of calculating user-specific search profile sets, of measuring similarity between two profiles, and of updating a user's search profile set (or more generally target profile interest summary) based on what the user read, and the examples disclosed herein are examples of the many possible implementations that can be used and should not be construed to limit the scope of the system.
Initialize Users' Search Profile Sets
The news clipping service instantiates target profile interest summaries as search profile sets, so that a set of high-interest search profiles is stored for each user. The search profiles associated with a given user change over time. As in any application involving search profiles, they can be initially determined for a new user (or explicitly altered by an existing user) by any of a number of procedures, including the following preferred methods: (1) asking the user to specify search profiles directly by giving keywords and/or numeric attributes, (2) using copies of the profiles of target objects or target clusters that the user indicates are representative of his or her interest, (3) using a standard set of search profiles copied or otherwise determined from the search profile sets of people who are demographically similar to the user.
Retrieve New Articles from Article Source
Articles are available on-line from a wide variety of sources. In the preferred embodiment, one would use the current day's news as supplied by a news source, such as the AP or Reuters news wire. These news articles are input to the electronic media system by being loaded into the mass storage system SS4 of an information server S4. The article profile module 201 of the system for customized electronic identification of desirable objects can reside on the information server S4 and operates pursuant to the steps illustrated in the
branches further down the tree represent divisions of the set of target objects into successively smaller subclusters of target objects. Each cluster has a cluster profile, so that at each node of the tree, the average target profile (centroid) of all target objects stored in the subtree rooted at that node is stored. This average of target profiles is computed over the representation of target profiles as vectors of numeric attributes, as described above.
Compare Current Articles' Target Profiles to a User's Search Profiles
The process by which a user employs this apparatus to retrieve news articles of interest is illustrated in flow diagram form in FIG. 11. At step 1101, the user logs into the data communication network N via their client processor C and activates the news reading program. This is accomplished by the user establishing a pseudonymous data communications connection as described above to a proxy server S2, which provides front-end access to the data communication network N. The proxy server S2 maintains a list of authorized pseudonyms and their corresponding public keys and provides access and billing control. The user has a search profile set stored in the local data storage medium on the proxy server S2. When the user requests access to "news" at step 1102, the profile matching module 203 resident on proxy server S2 sequentially considers each search profile p from the user's search profile set to determine which news articles are most likely of interest to the user. The news articles were automatically clustered into a hierarchical cluster tree at an earlier step so that the determination can be made rapidly for each user. The hierarchical cluster tree serves as a decision tree for determining which articles' target profiles are most similar to search profile p: the search for relevant articles begins at the top of the tree, and at each level of the tree the branch or branches are selected which have cluster profiles closest to p. This process is recursively executed until the leaves of the tree are reached, identifying individual articles of interest to the user, as described in the section "Searching for Target Objects" above.
A variation on this process exploits the fact that many users have similar interests. Rather than carry out steps 5-9 of the above process separately for each search profile of each user,
flow diagram of FIG. 5, where, as each article is received at it is possible to achieve added efficiency by carrying out these
step 501 by the information server S, the article profile steps only once for each group of similar search profiles,
module 201 at step 502 generates a target profile for the article thereby satisfying many users” needs at once. In this varia
and stores the target profile in an article indexing memory 45 tion, the system begins by non-hierarchically clustering all
(typically part of mass storage system SS for later use in the search profiles in the search profile sets of a large number
selectively delivering articles to users. This method is equally of users. For each cluster k of search profiles, with cluster
useful for selecting which articles to read from electronic profilep, it uses the method described in the section “Search
news groups and electronic bulletin boards, and can be used ing for Target Objects to locate articles with target profiles
as part of a system for Screening and organizing electronic 50 similar to p. Each located article is then identified as of
mail (“e-mail'). interest to each user who has a search profile represented in
Calculate Article Profiles cluster k of search profiles.
A target profile is computed for each new article, as Notice that the above variation attempts to match clusters
described earlier. The most important attribute of the target of search profiles with similar clusters of articles. Since this is
profile is a textual attribute that stands for the entire text of the 55 asymmetrical problem, it may instead be given asymmetrical
article. This textual attribute is represented as described ear Solution, as the following more general variation shows. At
lier, as a vector of numbers, which numbers in the preferred Some point before the matching process commences, all the
embodiment include the relative frequencies (TF/IDF scores) news articles to be considered are clustered into a hierarchical
of word occurrences in this article relative to other compa tree, termed the “target profile cluster tree,” and the search
rable articles. The server must count the frequency of occur 60 profiles of all users to be considered are clustered into a
rence of each word in the article in order to compute the second hierarchical tree, termed the “search profile cluster
TF/IDF scores. tree.” The following steps serve to find all matches between
These news articles are then hierarchically clustered in a individual target profiles from any target profile cluster tree
hierarchical cluster tree at step 503, which serves as a decision and individual search profiles from any search profile cluster
tree for determining which news articles are closest to the 65 tree: 1. For each child subtree S of the root of the search
users interest. The resulting clusters can be viewed as a tree profile cluster tree (or, let S be the entire search profile cluster
in which the top of the tree includes all target objects and tree if it contains only one search profile): 2. Compute the
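The numbered two-tree matching steps (1 through 8) can be sketched in Python as shown here. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Cluster` class and the helper names are hypothetical, distance is taken to be Euclidean, and the step-6 threshold is assumed to be `t` plus twice the larger cluster radius, which is one affine function of cluster size that never prunes a leaf pair lying closer together than `t`.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Cluster:
    """Hypothetical cluster-tree node; a leaf holds exactly one profile."""
    profiles: list                      # every profile vector in this subtree
    owner: str = ""                     # user id or article id (leaves only)
    children: list = field(default_factory=list)


def leaf(owner, profile):
    return Cluster([tuple(profile)], owner)


def node(*children):
    # Assumes internal nodes are built from at least two children.
    merged = [p for child in children for p in child.profiles]
    return Cluster(merged, children=list(children))


def mean(cluster):
    """Cluster profile: the average of all profiles in the subtree."""
    n = len(cluster.profiles)
    return tuple(sum(col) / n for col in zip(*cluster.profiles))


def radius(cluster):
    """Largest distance from the cluster profile to any member profile."""
    center = mean(cluster)
    return max(dist(center, p) for p in cluster.profiles)


def subtrees(tree):
    """Child subtrees of the root, or the tree itself if it holds one profile."""
    return [tree] if len(tree.profiles) == 1 else tree.children


def find_matches(search_tree, target_tree, t):
    matches = []
    for s in subtrees(search_tree):                      # step 1
        p_s = mean(s)                                    # step 2
        for tt in subtrees(target_tree):                 # step 3
            p_t = mean(tt)                               # step 4
            # Assumed threshold: affine in the greater cluster radius.
            threshold = t + 2 * max(radius(s), radius(tt))
            if dist(p_s, p_t) < threshold:               # steps 5 and 6
                if len(s.profiles) == 1 and len(tt.profiles) == 1:
                    matches.append((s.owner, tt.owner))  # step 7
                else:
                    matches.extend(find_matches(s, tt, t))  # step 8
    return matches


users = node(node(leaf("alice", (1.0, 0.0)), leaf("carol", (0.9, 0.1))),
             leaf("bob", (0.0, 1.0)))
articles = node(leaf("a1", (1.0, 0.05)),
                node(leaf("a2", (0.0, 0.95)), leaf("a3", (5.0, 5.0))))
print(sorted(find_matches(users, articles, t=0.2)))
```

With a single user the outer loop collapses to one search profile, which corresponds to the single-user search the text describes; with a single article the inner loop collapses instead.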
cluster profile P_S to be the average of all search profiles in subtree S
3. For each subcluster (child subtree) T of the root of the target profile cluster tree (or, let T be the entire target profile cluster tree if it contains only one target profile):
4. Compute the cluster profile P_T to be the average of all target profiles in subtree T
5. Calculate d(P_S, P_T), the distance between P_S and P_T
6. If d(P_S, P_T) < t, a threshold,
7. If S contains only one search profile and T contains only one target profile, declare a match between that search profile and that target profile,
8. otherwise recurse to step 1 to find all matches between search profiles in tree S and target profiles in tree T.
The threshold used in step 6 is typically an affine function or other function of the greater of the cluster variances (or cluster diameters) of S and T. Whenever a match is declared between a search profile and a target profile, the target object that contributed the target profile is identified as being of interest to the user who contributed the search profile. Notice that the process can be applied even when the set of users to be considered or the set of target objects to be considered is very small. In the case of a single user, the process reduces to the method given for identifying articles of interest to a single user. In the case of a single target object, the process constitutes a method for identifying users to whom that target object is of interest.
Present List of Articles to User
Once the profile correlation step is completed for a selected user or group of users, at step 1104 the profile processing module 203 stores a list of the identified articles for presentation to each user. At a user's request, the profile processing system 203 retrieves the generated list of relevant articles and presents this list of titles of the selected articles to the user, who can then select at step 1105 any article for viewing. (If no titles are available, then the first sentence(s) of each article can be used.) The list of article titles is sorted according to the degree of similarity of the article's target profile to the most similar search profile in the user's search profile set. The resulting sorted list is either transmitted in real time to the user client processor C, if the user is present at their client processor C, or can be transmitted to a user's mailbox, resident on the user's client processor C or stored within the server S for later retrieval by the user; other methods of transmission include facsimile transmission of the printed list or telephone transmission by means of a text-to-speech system. The user can then transmit a request by computer, facsimile, or telephone to indicate which of the identified articles the user wishes to review, if any. The user can still access all articles in any information server S to which the user has authorized access; however, those lower on the generated list are simply further from the user's interests, as determined by the user's search profile set. The server S retrieves the article from the local data storage medium or from an information server S and presents the article one screen at a time to the user's client processor C. The user can at any time select another article for reading or exit the process.
Monitor Which Articles Are Read
The user's search profile set generator 202 at step 1107 monitors which articles the user reads, keeping track of how many pages of text are viewed by the user, how much time is spent viewing the article, and whether all pages of the article were viewed. This information can be combined to measure the depth of the user's interest in the article, yielding a passive relevance feedback score, as described earlier. Although the exact details depend on the length and nature of the articles being searched, a typical formula might be:
measure of article attractiveness = 0.2 if the second page is accessed
+ 0.2 if all pages are accessed
+ 0.2 if more than 30 seconds was spent on the article
+ 0.2 if more than one minute was spent on the article
+ 0.2 if the minutes spent in the article are greater than half the number of pages.
The computed measure of article attractiveness can then be used as a weighting function to adjust the user's search profile set to thereby more accurately reflect the user's dynamically changing interests.
Update User Profiles
Updating of a user's generated search profile set can be done at step 1108 using the method described in copending U.S. patent application Ser. No. 08/346,425. When an article is read, the server S shifts each search profile in the set slightly in the direction of the target profiles of those nearby articles for which the computed measure of article attractiveness was high. Given a search profile with attributes u_{Ik} from a user's search profile set, and a set of J articles available with attributes d_{jk} (assumed correct for now), where I indexes users, j indexes articles, and k indexes attributes, user I would be predicted to pick a set of P distinct articles to minimize the sum of d(u_I, d_j) over the chosen articles j. The user's desired attributes u_{Ik} and an article's attributes d_{jk} would be some form of word frequencies such as TF/IDF and potentially other attributes such as the source, reading level, and length of the article, while d(u_I, d_j) is the distance between these two attribute vectors (profiles) using the similarity measure described above. If the user picks a different set of P articles than was predicted, the user search profile set generation module should try to adjust u and/or d to more accurately predict the articles the user selected. In particular, u_I and/or d_j should be shifted to increase their similarity if user I was predicted not to select article j but did select it, and perhaps also to decrease their similarity if user I was predicted to select article j but did not. A preferred method is to shift u for each wrong prediction that user I will not select article j, using the formula:
u'_{Ik} = u_{Ik} - e (u_{Ik} - d_{jk})
Here u_I is chosen to be the search profile from user I's search profile set that is closest to target profile d_j. If e is positive, this adjustment increases the match between user I's search profile set and the target profiles of the articles user I actually selects, by making u_I closer to d_j for the case where the algorithm failed to predict an article that the viewer selected. The size of e determines how many example articles one must see to change the search profile substantially. If e is too large, the algorithm becomes unstable, but for sufficiently small e, it drives u to its correct value. In general, e should be proportional to the measure of article attractiveness; for example, it should be relatively high if user I spends a long time reading article j. One could in theory also use the above formula to decrease the match in the case where the algorithm predicted an article that the user did not read, by making e negative in that case. However, there is no guarantee that u will move in the correct direction in that case. One can also shift the attribute weights w_{Ik} of user I by using a similar algorithm:
w'_{Ik} = (w_{Ik} - e |u_{Ik} - d_{jk}|) / \sum_k (w_{Ik} - e |u_{Ik} - d_{jk}|)
This is particularly important if one is combining word frequencies with other attributes. As before, this increases the match if e is positive for the case where the algorithm failed to predict an article that the user read, this time by decreasing the weights on those characteristics for which the user's target profile u_I differs from the article's profile d_j. Again, the size of e determines how many example articles one must see to replace what was originally believed. Unlike the procedure for adjusting u, one can also make use of the fact that the above
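The passive feedback score from "Monitor Which Articles Are Read" and the two update rules from "Update User Profiles" can be sketched in Python as shown here. This is an illustration under stated assumptions: the function names, the `rate` constant, and the use of plain lists for profiles are hypothetical, and the subscripted forms of the update formulas are a reading of the text rather than a confirmed transcription.

```python
def attractiveness(pages_viewed, total_pages, minutes_spent):
    """Score in [0, 1]: 0.2 for each of the five conditions in the text."""
    conditions = [
        pages_viewed >= 2,                # the second page is accessed
        pages_viewed >= total_pages,      # all pages are accessed
        minutes_spent > 0.5,              # more than 30 seconds on the article
        minutes_spent > 1.0,              # more than one minute on the article
        minutes_spent > total_pages / 2,  # minutes exceed half the page count
    ]
    return 0.2 * sum(conditions)


def shift_profile(u, d, e):
    """u'_k = u_k - e (u_k - d_k): moves u toward d when e is positive."""
    return [u_k - e * (u_k - d_k) for u_k, d_k in zip(u, d)]


def update_weights(w, u, d, e):
    """w'_k = (w_k - e |u_k - d_k|) / sum over k, so the weights sum to one.

    Assumes e is small enough that every numerator stays positive.
    """
    raw = [w_k - e * abs(u_k - d_k) for w_k, u_k, d_k in zip(w, u, d)]
    total = sum(raw)
    return [r / total for r in raw]


# Usage: the user read an unpredicted article closely, so e is scaled by
# the attractiveness score (rate is an assumed tuning constant).
rate = 0.1
score = attractiveness(pages_viewed=3, total_pages=3, minutes_spent=2.0)
e = rate * score
search_profile = shift_profile([1.0, 0.0], [0.0, 1.0], e)
weights = update_weights([0.5, 0.5], [1.0, 0.0], [1.0, 1.0], e)
print(score, search_profile, weights)
```

Scaling `e` by the score follows the statement that e should be proportional to the measure of article attractiveness. A negative `e` passed to `update_weights` increases the weights where the profiles differ, which is the mismatch case the text discusses.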
algorithm decreases the match if e is negative, for the case where the algorithm predicted an article that the user did not read. The denominator of the expression prevents weights from shrinking to zero over time by renormalizing the modified weights w_I so that they sum to one. Both u and w can be adjusted for each article accessed. When e is small, as it should be, there is no conflict between the two parts of the algorithm. The selected user's search profile set is updated at step 1108.
Further Applications of the Filtering Technology
The news clipping service may deliver news articles (or advertisements and coupons for purchasables) to off-line users as well as to users who are on-line. Although the off-line users may have no way of providing relevance feedback, the user profile of an off-line user U may be similar to the profiles of on-line users, for example because user U is demographically similar to these other users, and the level of user U's interest in particular target objects can therefore be estimated via the general interest-estimation methods described earlier. In one application, the news clipping service chooses a set of news articles (respectively, advertisements and coupons) that are predicted to be of interest to user U, thereby determining the content of a customized newspaper (respectively, advertising/coupon circular) that may be printed and physically sent to user U via other methods. In general, the target objects included in the printed document delivered to user U are those with the highest median predicted interest among a group G of users, where group G consists of either the single off-line user U, a set of off-line users who are demographically similar to user U, or a set of off-line users who are in the same geographic area and thus on the same newspaper delivery route. In a variation, user group G is clustered into several subgroups G1 ... Gk; an average user profile Pi is created from each subgroup Gi; for each article T and each user profile Pi, the interest in T by a hypothetical user with user profile Pi is predicted, and the interest of article T to group G is taken to be the maximum interest in article T by any of these k hypothetical users; finally, the customized newspaper for user group G is constructed from those articles of greatest interest to group G.
The filtering technology of the news clipping service is not limited to news articles provided by a single source, but may be extended to articles or target objects collected from any number of sources. For example, rather than identifying new news articles of interest, the technology may identify new or updated World Wide Web pages of interest. In a second application, termed "broadcast clipping," where individual users desire to broadcast messages to all interested users, the pool of news articles is replaced by a pool of messages to be broadcast, and these messages are sent to the broadcast-clipping-service subscribers most interested in them. In a third application, the system scans the transcripts of all real-time spoken or written discussions on the network that are currently in progress and designated as public, and employs the news-clipping technology to rapidly identify discussions that the user may be interested in joining, or to rapidly identify and notify users who may be interested in joining an ongoing discussion. In a fourth application, the method is used as a post-process that filters and ranks in order of interest the many target objects found by a conventional database search, such as a search for all homes selling for under $200,000 in a given area, for all 1994 news articles about Marcia Clark, or for all Italian-language films. In a fifth application, the method is used to filter and rank the links in a hypertext document by estimating the user's interest in the document or other object associated with each link. In a sixth application, paying advertisers, who may be companies or individuals, are the source of advertisements or other messages, which take the place of the news articles in the news clipping service. A consumer who buys a product is deemed to have provided positive relevance feedback on advertisements for that product, and a consumer who buys a product apparently because of a particular advertisement (for example, by using a coupon clipped from that advertisement) is deemed to have provided particularly high relevance feedback on that advertisement. Such feedback may be communicated to a proxy server by the consumer's client processor (if the consumer is making the purchase electronically), by the retail vendor, or by the credit card reader (at the vendor's establishment) that the consumer uses to pay for the purchase. Given a database of such relevance feedback, the disclosed technology is then used to match advertisements with those users who are most interested in them; advertisements selected for a user are presented to that user by any one of several means, including electronic mail, automatic display on the user's screen, or printing them on a printer at a retail establishment where the consumer is paying for a purchase. The threshold distance used to identify interest may be increased for a particular advertisement, causing the system to present that advertisement to more users, in accordance with the amount that the advertiser is willing to pay.
A further use of the capabilities of this system is to manage a user's investment portfolio. Instead of recommending articles to the user, the system recommends target objects that are investments. As illustrated above by the example of stock market investments, many different attributes can be used together to profile each investment. The user's past investment behavior is characterized in the user's search profile set or target profile interest summary, and this information is used to match the user with stock opportunities (target objects) similar in nature to past investments. The rapid profiling method described above may be used to determine a rough set of preferences for new users. Quality attributes used in this system can include negatively weighted attributes, such as a measurement of fluctuations in dividends historically paid by the investment, a quality attribute that would have a strongly negative weight for a conservative investor dependent on a regular flow of investment income. Furthermore, the user can set filter parameters so that the system can monitor stock prices and automatically take certain actions, such as placing buy or sell orders, or e-mailing or paging the user with a notification, when certain stock performance characteristics are met. Thus, the system can immediately notify the user when a selected stock reaches a predetermined price, without the user having to monitor the stock market activity. The user's investments can be profiled in part by a "type of investment" attribute (to be used in conjunction with other attributes), which distinguishes among bonds, mutual funds, growth stocks, income stocks, etc., to thereby segment the user's portfolio according to investment type. Each investment type can then be managed to identify investment opportunities and the user can identify the desired ratio of investment capital for each type, e.g., in accordance with the system's automatic recommendation for relative distribution of investment capital as indicated by the relative level of user interest for each type.
In one application, the system may also keep track of, recommend, and notify the user of (or page the user for new releases and new articles) important articles which are most interesting to other users who have a stock portfolio similar to that of the user. Relevance feedback in this application determines the relevance of the associative attributes (each stock) with the relevant textual attribute contained in the free text of the article's other descriptors plus any relevant numeric attributes
contained in the articles. Additionally, one could bias the weighting values of users providing relevance feedback to favor those who have invested in similar types of stocks and who have a proven track record of success through their trading decisions. Another application for which this pre-adjusted relevance feedback is useful is in recommending and/or automatically trading the most interesting stocks to users using the collaborative filtering methods described above. In this case, however, the relevance feedback to the system is biased toward those users who have been most successful in their past trading decisions with regard to similar types of stocks, in accordance with the similarity techniques for articles and stocks which are most relevant to one another.
Because there are numerous methods which are used to attempt to predict for users both stocks and optimal times to buy or trade, the current user customization techniques are best implemented as an enhancement feature to provide the user not only with quality but also with customization.
In the preferred implementation for an on-line newspaper or news filter, each of these capabilities for customized recommendation and notification of investment-related articles, stock recommendations, and automated monitoring and trading features is provided to the user as an integrated financial news and investment service. Additionally, in accordance with the virtual communities section described below, users sharing common portfolios may wish to correspond on-line to share advice or experiences with other similar users. Again, users who have a past track record of success may also be identifiable through these virtual communities in conjunction with their participation in these communities, or their comments and advice relating to specific stocks may be ascribed to those stocks (and made publicly available).
Other On-Line Newspaper Interface Features:
In accordance with current on-line news interface features, several implementation features of the present system include the following:
1. Automatically create a "customized newspaper".
User profiling enabling custom recommendations may be achieved by purely passive means of user activity data or, if desired, it can refine and automate the selection process of articles within user-selected categories of interest as well as recommend articles within different categories which the user is likely to prefer as evidenced through past behaviors. Applications include:
(a) Presentation of new articles and corresponding advertisements which are of highest interest to the user.
(b) Recommending (highlighting) these articles from the directory.
2. A customized search engine which offers search results which are tailored and relevancy ranked to user preferences.
3. Using a survey for off-line users for subsequent issues, a card inserted into each issue identifies or prioritizes the most interesting articles/ads.
E-Mail Filter
In addition to the news clipping service described above, the system for customized electronic identification of desirable objects functions in an e-mail environment in a similar but slightly different manner. The news clipping service selects and retrieves news information that would not otherwise reach its subscribers. But at the same time, large numbers of e-mail messages do reach users, having been generated and sent by humans or automatic programs. These users need an e-mail filter, which automatically processes the messages received. The necessary processing includes a determination of the action to be taken with each message, including, but not limited to: filing the message, notifying the user of receipt of a high priority message, automatically responding to a message. The e-mail filter system must not require too great an investment on the part of the user to learn and use, and the user must have confidence in the appropriateness of the actions automatically taken by the system. The same filter may be applied to voice mail messages or facsimile messages that have been converted into electronically stored text, whether automatically or at the user's request, via the use of well-known techniques for speech recognition or optical character recognition.
The filtering problem can be defined as follows: a message processing function MPF(*) maps from a received message (document) to one or more of a set of actions. The actions, which may be quite specific, may be either predefined or customized by the user. Each action A has an appropriateness function F_A(*, *) such that F_A(U, D) returns a real number, representing the appropriateness of selecting action A on behalf of user U when user U is in receipt of message D. For example, if D comes from a credible source and is marked urgent, then discarding the message has a high cost to the user and has low appropriateness, so that F_discard(U, D) is small, whereas alerting the user of receipt of the message is highly appropriate, so that F_alert(U, D) is large. Given the determined appropriateness function, the function MPF(D) is used to automatically select the appropriate action or actions. As an example, the following set of actions might be useful:
1. Urgently notify user of receipt of message
2. Insert message into queue for user to read later
3. Insert message into queue for user to read later, and suggest that user reply
4. Insert message into queue for user to read later, and suggest that user forward it to individual R
5. Summarize message and insert summary into queue
6. Forward message to user's secretary
7. File message in directory X
8. File message in directory Y
9. Delete message (i.e., ignore message and do not save)
10. Notify sender that further messages on this subject are unwanted
Notice that actions 9 and 10 in the sample list above are designed to filter out messages that are undesirable to the user or that are received from undesirable sources, such as pesky salespersons, by deleting the unwanted message and/or sending a reply that indicates that messages of this type will not be read. The appropriateness functions must be tailored to describe the appropriateness of carrying out each action given the target profile for a particular document, and then a message processing function MPF can be found which is in some sense optimal with respect to the appropriateness function. One reasonable choice of MPF always picks the action with highest appropriateness, and in cases where multiple actions are highly appropriate and are also compatible with each other, selects more than one action: for example, it may automatically reply to a message and also file the same message in directory X, so that the value of MPF(D) is the set {reply, file in directory X}. In cases where the appropriateness of even the most appropriate action falls below a user-specified threshold, as should happen for messages of an unfamiliar type, the system asks the user for confirmation of the action(s) selected by MPF. In addition, in cases where MPF selects one action over another action that is nearly as appropriate, the system also asks the user for confirmation: for example, mail should not be deleted if it is nearly as appropriate to let the user see it.
It is possible to write appropriateness functions manually, but the time necessary and lack of user expertise render this solution impractical. The automatic training of this system is
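The "one reasonable choice of MPF" described in the E-Mail Filter section can be sketched in Python as shown here. This is a minimal illustration under stated assumptions: the function name `mpf`, the `margin` parameter used to decide what counts as "nearly as appropriate", and the representation of compatibility as a set of action pairs are all hypothetical. The appropriateness scores F_A(U, D) are taken as given inputs rather than estimated.

```python
def mpf(appropriateness, compatible, min_score, margin):
    """Select actions for one message and report whether to ask the user.

    appropriateness: dict mapping action name -> F_A(U, D) for this message
    compatible:      set of frozenset pairs of actions that may be combined
    min_score:       user-specified threshold for the best action
    margin:          assumed gap within which a rival counts as nearly
                     as appropriate as the best action
    """
    ranked = sorted(appropriateness.items(), key=lambda kv: kv[1], reverse=True)
    best_action, best_score = ranked[0]
    selected = [best_action]
    rivals = []
    for action, score in ranked[1:]:
        if best_score - score > margin:
            continue  # clearly less appropriate than the best action
        if all(frozenset((action, s)) in compatible for s in selected):
            selected.append(action)   # highly appropriate and compatible
        else:
            rivals.append(action)     # nearly as appropriate but conflicting
    needs_confirmation = best_score < min_score or bool(rivals)
    return selected, needs_confirmation


# Usage: replying and filing are compatible, so both are selected.
scores = {"reply": 0.9, "file in directory X": 0.85, "delete": 0.1}
pairs = {frozenset(("reply", "file in directory X"))}
print(mpf(scores, pairs, min_score=0.5, margin=0.1))
```

In the usage example the returned set corresponds to the text's {reply, file in directory X} case. When deletion only narrowly beats queuing the message for the user, the sketch returns the single best action together with a request for confirmation.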
preferable, using the automatic user profiling system described above. Each received document is viewed as a target object whose profile includes such attributes as the entire text of the document (represented as TF/IDF scores), document sender, date sent, document length, date of last document received from this sender, key words, list of other addressees, etc. It was disclosed above how to estimate an interest function on profiled target objects, using relevance feedback together with measured similarities among target objects and among users. In the context of the e-mail filter, the task is to estimate several appropriateness functions F_A(*, *), one per action. This is handled with exactly the same method as was used earlier to estimate the topical interest function f(*, *). Relevance feedback in this case is provided by the user's observed actions over time: whenever user U chooses action A on document D, either freely or by choosing or confirming an action recommended by the system, this is taken to mean that the appropriateness of action A on document D is high, particularly if the user takes this action A immediately after seeing document D. A presumption of no appropriateness (corresponding to the earlier presumption of no interest) is used so that action A is considered inappropriate on a document unless the user or similar users have taken action A on this document or similar documents. In particular, if no similar document has been seen, no action is considered especially appropriate, and the e-mail filter asks the user to specify the appropriate action or confirm that the action chosen by the e-mail filter is the appropriate one.
Thus, the e-mail filter learns to take particular actions on e-mail messages that have certain attributes or combinations of attributes. For example, messages from John Doe that originate in the (212) area code may prompt the system to forward a copy by fax transmission to a given fax number, or to file the message in directory X on the user's client processor. A variation allows active requests of this form from the user, such as a request that any message from John Doe be forwarded to a desired fax number until further notice. This active user input requires the use of a natural language or form-based interface for which specific commands are associated with particular attributes and combinations of attributes.
Update Notification
A very important and novel characteristic of the architecture is the ability to identify new or updated target objects that are relevant to the user, as determined by the user's search profile set or target profile interest summary. ("Updated target objects" include revised versions of documents and new models of purchasable goods.) The system may notify the user of these relevant target objects by an electronic notification such
well-specified goal. The tree's division of target objects into coherent clusters provides an efficient method whereby the user can locate a target object of interest. The user first chooses one of the highest level (largest) clusters from a menu, and is presented with a menu listing the subclusters of said cluster, whereupon the user may select one of these subclusters. The system locates the subcluster, via the appropriate pointer that was stored with the larger cluster, and allows the user to select one of its subclusters from another menu. This process is repeated until the user comes to a leaf of the tree, which yields the details of an actual target object. Hierarchical trees allow rapid selection of one target object from a large set. In ten menu selections from menus of ten items (subclusters) each, one can reach 10^10 = 10,000,000,000 (ten billion) items. In the preferred embodiment, the user views the menus on a computer screen or terminal screen and selects from them with a keyboard or mouse. However, the user may also make selections over the telephone, with a voice synthesizer reading the menus and the user selecting subclusters via the telephone's touch-tone keypad. In another variation, the user simultaneously maintains two connections to the server, a telephone voice connection and a fax connection; the server sends successive menus to the user by fax, while the user selects choices via the telephone's touch-tone keypad.
Just as user profiles commonly include an associative attribute indicating the user's degree of interest in each target object, it is useful to augment user profiles with an additional associative attribute indicating the user's degree of interest in each cluster in the hierarchical cluster tree. This degree of interest may be estimated numerically as the number of subclusters or target objects the user has selected from menus associated with the given cluster or its subclusters, expressed as a proportion of the total number of subclusters or target objects the user has selected. This associative attribute is particularly valuable if the hierarchical tree was built using "soft" or "fuzzy" clustering, which allows a subcluster or target object to appear in multiple clusters: if a target document appears in both the "sports" and the "humor" clusters, and the user selects it from a menu associated with the "humor" cluster, then the system increases its association between the user and the "humor" cluster but not its association between the user and the "sports" cluster.
Labeling Clusters
Since a user who is navigating the cluster tree is repeatedly expected to select one of several subclusters from a menu, these subclusters must be usefully labeled (at step 503), in such a way as to suggest their content to the human user. It is straightforward to include some basic information about each
as an e-mail message or facsimile transmission. In the Varia 50 subcluster in its label, such as the number of target objects the
tion where the system sends an e-mail message, the user's subcluster contains (possibly just 1) and the number of these
e-mail filter can then respond appropriately to the notifica that have been added or updated recently. However, it is also
tion, for instance, by bringing the notification immediately to necessary to display additional information that indicates the
the user's personal attention, or by automatically Submitting cluster's content. This content-descriptive information may
an electronic request to purchase the target object named in 55 be provided by a human, particularly for large or frequently
the notification. A simple example of the latter response is for accessed clusters, but it may also be generated automatically.
the e-mail filter to retrieve an on-line document at a nominal The basic automatic technique is simply to display the clus
or Zero charge, or request to buy a purchasable of limited ter’s “characteristic value” for each of a few highly weighted
quantity Such as a used product or an auctionable. attributes. With numeric attributes, this may be taken to mean
60 the cluster's average value for that attribute: thus, if the “year
Active Navigation (Browsing) of release' attribute is highly weighted in predicting which
movies a user will like, then it is useful to display average year
Browsing by Navigating Through a Cluster Tree of release as part of each cluster's label. Thus the user sees
A hierarchical cluster tree imposes a useful organization on that one cluster consists of movies that were released around
a collection of target objects. The tree is of direct use to a user 65 1962, while another consists of movies from around 1982.
who wishes to browse through all the target objects in the tree. For short textual attributes, such as "title of movie' or "title of
Such a user may be exploring the collection with or without a document, the system candisplay the attribute’s value for the
cluster member (target object) whose profile is most similar to the cluster's profile (the mean profile for all members of the cluster), for example, the title of the most typical movie in the cluster. For longer textual attributes, a useful technique is to select those terms for which the amount by which the term's average TF/IDF score across members of the cluster exceeds the term's average TF/IDF score across all target objects is greatest, either in absolute terms or else as a fraction of the standard deviation of the term's TF/IDF score across all target objects. The selected terms are replaced with their morphological stems, eliminating duplicates (so that if both "slept" and "sleeping" were selected, they would be replaced by the single term "sleep") and optionally eliminating close synonyms or collocates (so that if both "nurse" and "medical" were selected, they might both be replaced by a single term such as "nurse," "medical," "medicine," or "hospital"). The resulting set of terms is displayed as part of the label. Finally, if freely redistributable thumbnail photographs or other graphical images are associated with some of the target objects in the cluster for labeling purposes, then the system can display as part of the label the image or images whose associated target objects have target profiles most similar to the cluster profile.
Users' navigational patterns may provide some useful feedback as to the quality of the labels. In particular, if users often select a particular cluster to explore, but then quickly backtrack and try a different cluster, this may signal that the first cluster's label is misleading. Insofar as other terms and attributes can provide "next-best" alternative labels for the first cluster, such "next-best" labels can be automatically substituted for the misleading label. In addition, any user can locally relabel a cluster for his or her own convenience. Although a cluster label provided by a user is in general visible only to that user, it is possible to make global use of these labels via a "user labels" textual attribute for target objects, which attribute is defined for a given target object to be the concatenation of all labels provided by any user for any cluster containing that target object. This attribute influences similarity judgments: for example, it may induce the system to regard target articles in a cluster often labeled "Sports News" by users as being mildly similar to articles in an otherwise dissimilar cluster often labeled "International News" by users, precisely because the "user labels" attribute in each cluster profile is strongly associated with the term "News." The "user labels" attribute is also used in the automatic generation of labels, just as other textual attributes are, so that if the user-generated labels for a cluster often include "Sports," the term "Sports" may be included in the automatically generated label as well.
It is not necessary for menus to be displayed as simple lists of labeled options; it is possible to display or print a menu in a form that shows in more detail the relation of the different menu options to each other. Thus, in a variation, the menu options are visually laid out in two dimensions or in a perspective drawing of three dimensions. Each option is displayed or printed as a textual or graphical label. The physical coordinates at which the options are displayed or printed are generated by the following sequence of steps: (1) construct for each option the cluster profile of the cluster it represents, (2) construct from each cluster profile its decomposition into a numeric vector, as described above, (3) apply singular value decomposition (SVD) to determine the set of two or three orthogonal linear axes along which these numeric vectors are most greatly differentiated, and (4) take the coordinates of each option to be the projected coordinates of that option's numeric vector along said axes. Step (3) may be varied to determine a set of, say, 6 axes, so that step (4) lays out the options in a 6-dimensional space; in this case the user may view the geometric projection of the 6-dimensional layout onto any plane passing through the origin, and may rotate this viewing plane in order to see differing configurations of the options, which emphasize similarity with respect to differing attributes in the profiles of the associated clusters. In the visual representation, the sizes of the cluster labels can be varied according to the number of objects contained in the corresponding clusters. In a further variation, all options from the parent menu are displayed in some number of dimensions, as just described, but with the option corresponding to the current menu replaced by a more prominent subdisplay of the options on the current menu; optionally, the scale of this composite display may be gradually increased over time, thereby increasing the area of the screen devoted to showing the options on the current menu, and giving the visual impression that the user is regarding the parent cluster and "zooming in" on the current cluster and its subclusters.
Further Navigational
It should be appreciated that a hierarchical cluster-tree may be configured with multiple cluster selections branching from each node or the same labeled clusters presented in the form of single branches for multiple nodes ordered in a hierarchy. In one variation, the user is able to perform lateral navigation between neighboring clusters as well, by requesting that the system search for a cluster whose cluster profile resembles the cluster profile of the currently selected cluster. If this type of navigation is performed at the level of individual objects (leaf ends), then automatic hyperlinks may then be created as navigation occurs. This is one way that nearest neighbor clustering navigation may be performed. For example, in a domain where target objects are home pages on the World Wide Web, a collection of such pages could be laterally linked to create a "virtual mall."
The simplest way to use the automatic menuing system described above is for the user to begin browsing at the top of the tree and move to more specific subclusters. However, in a variation, the user optionally provides a query consisting of textual and/or other attributes, from which query the system constructs a profile in the manner described herein, optionally altering textual attributes as described herein before decomposing them into numeric attributes. Query profiles are similar to the search profiles in a user's search profile set, except that their attributes are explicitly specified by a user, most often for one-time usage, and unlike search profiles, they are not automatically updated to reflect changing interests. A typical query in the domain of text articles might have "Tell me about the relation between Galileo and the Medici family" as the value of its "text of article" attribute, and 8 as the value of its "reading difficulty" attribute (that is, 8th-grade level). The system uses the method of section "Searching for Target Objects" above to automatically locate a small set of one or more clusters with profiles similar to the query profile; for example, the articles they contain are written at roughly an 8th-grade level and tend to mention Galileo and the Medicis. The user may start browsing at any of these clusters, and can move from it to subclusters, superclusters, and other nearby clusters. For a user who is looking for something in particular, it is generally less efficient to start at the largest cluster and repeatedly select smaller subclusters than it is to write a brief description of what one is looking for and then to move to nearby clusters if the objects initially recommended are not precisely those desired.
Although it is customary in information retrieval systems to match a query to a document, an interesting variation is possible where a query is matched to an already answered question. The relevant domain is a customer service center,
electronic newsgroup, or Better Business Bureau where questions are frequently answered. Each new question-answer pair is recorded for future reference as a target object, with a textual attribute that specifies the question together with the answer provided. As explained earlier with reference to document titles, the question should be weighted more heavily than the answer when this textual attribute is decomposed into TF/IDF scores. A query specifying "Tell me about the relation between Galileo and the Medici family" as the value of this attribute therefore locates a cluster of similar questions together with their answers. In a variation, each question-answer pair may be profiled with two separate textual attributes, one for the question and one for the answer. A query might then locate a cluster by specifying only the question attribute, or for completeness, both the question attribute and the (lower-weighted) answer attribute, to be the text "Tell me about the relation between Galileo and the Medici family."
The filtering technology described earlier can also aid the user in navigating among the target objects. When the system presents the user with a menu of subclusters of a cluster C of target objects, it can simultaneously present an additional menu of the most interesting target objects in cluster C, so that the user has the choice of accessing a subcluster or directly accessing one of the target objects. If this additional menu lists n target objects, then for each i between 1 and n inclusive, in increasing order, the i-th most prominent choice on this additional menu, which choice is denoted Top(C, i), is found by considering all target objects in cluster C that are further than a threshold distance t from all of Top(C, 1), Top(C, 2), ... Top(C, i-1), and selecting the one in which the user's interest is estimated to be highest. If the threshold distance t is 0, then the menu resulting from this procedure simply displays the n most interesting objects in cluster C, but the threshold distance may be increased to achieve more variety in the target objects displayed. Generally the threshold distance t is chosen to be an affine function or other function of the cluster variance or cluster diameter of the cluster C.
As a novelty feature, the user U can "masquerade" as another user V, such as a prominent intellectual or a celebrity supermodel; as long as user U is masquerading as user V, the filtering technology will recommend articles not according to user U's preferences, but rather according to user V's preferences. Provided that user U has access to the user-specific data of user V, for example because user V has leased these data to user U for a financial consideration, then user U can masquerade as user V by instructing user U's proxy server S to temporarily substitute user V's user profile and target profile interest summary for user U's. In a variation, user U has access to an average user profile and a composite target profile interest summary for a group G of users; by instructing proxy server S to substitute these for user U's user-specific data, user U can masquerade as a typical member of group G, as is useful in exploring group preferences for sociological, political, or market research. More generally, user U may "partially masquerade" as another user V or group G, by instructing proxy server S to temporarily replace user U's user-specific data with a weighted average of user U's user-specific data and the user-specific data for user V and group G.
Menu Organization
Although the topology of a hierarchical cluster tree is fixed by the techniques that build the tree, the hierarchical menu presented to the user for the user's navigation need not be exactly isomorphic to the cluster tree. The menu is typically a somewhat modified version of the cluster tree, reorganized manually or automatically so that the clusters most interesting to a user are easily accessible by the user. In order to automatically reorganize the menu in a user-specific way, the system first attempts automatically to identify existing clusters that are of interest to the user. The system may identify a cluster as interesting because the user often accesses target objects in that cluster, or, in a more sophisticated variation, because the user is predicted to have high interest in the cluster's profile, using the methods disclosed herein for estimating interest from relevance feedback.
Several techniques can then be used to make interesting clusters more easily accessible. The system can at the user's request or at all times display a special list of the most interesting clusters, or the most interesting subclusters of the current cluster, so that the user can select one of these clusters based on its label and jump directly to it. In general, when the system constructs a list of interesting clusters in this way, the i-th most prominent choice on the list, which choice is denoted Top(i), is found by considering all appropriate clusters C that are further than a threshold distance t from all of Top(1), Top(2), ... Top(i-1), and selecting the one in which the user's interest is estimated to be highest. Here the threshold distance t is optionally dependent on the computed cluster variance or cluster diameter of the profiles in the latter cluster. Several techniques that reorganize the hierarchical menu tree are also useful. First, menus can be reorganized so that the most interesting subcluster choices appear earliest on the menu, or are visually marked as interesting; for example, their labels are displayed in a special color or typeface, or are displayed together with a number or graphical image indicating the likely level of interest. Second, interesting clusters can be moved to menus higher in the tree, i.e., closer to the root of the tree, so that they are easier to access if the user starts browsing at the root of the tree. Third, uninteresting clusters can be moved to menus lower in the tree, to make room for interesting clusters that are being moved higher. Fourth, clusters with an especially low interest score (representing active dislike) can simply be suppressed from the menus; thus, a user with children may assign an extremely negative weight to the "vulgarity" attribute in the determination of q, so that vulgar clusters and documents will not be available at all. As the interesting clusters and the documents in them migrate toward the top of the tree, a customized tree develops that can be more efficiently navigated by the particular user. If menus are chosen so that each menu item is chosen with approximately equal probability, then the expected number of choices the user has to make is minimized. If, for example, a user frequently accessed target objects whose profiles resembled the cluster profile of cluster (a, b, d) in FIG. 8 then the menu in FIG. 9 could be modified to show the structure illustrated in FIG. 10.
In the variation where the general techniques disclosed herein for estimating a user's interest from relevance feedback are used to identify interesting clusters, it is possible for a user U to supply "temporary relevance feedback" to indicate a temporary interest that is added to his or her usual interests. This is done by entering a query as described above, i.e., a set of textual and other attributes that closely match the user's interests of the moment. This query becomes "active," and affects the system's determination of interest in either of two ways. In one approach, an active query is treated as if it were any other target object, and by virtue of being a query, it is taken to have received relevance feedback that indicates especially high interest. In an alternative approach, target objects X whose target profiles are similar to an active query's profile are simply considered to have higher quality q(U, X), in that q(U, X) is incremented by a term that increases with target object X's similarity to the query profile. Either strategy affects the usual interest estimates: clusters that match user
U's usual interests (and have high quality q()) are still considered to be of interest, and clusters whose profiles are similar to an active query are adjudged to have especially high interest. Clusters that are similar to both the query and the user's usual interests are most interesting of all. The user may modify or deactivate an active query at any time while browsing. In addition, if the user discovers a target object or cluster X of particular interest while browsing, he or she may replace or augment the original (perhaps vague) query profile with the target profile of target object or cluster X, thereby amplifying or refining the original query to indicate a particular interest in objects similar to X. For example, suppose the user is browsing through documents, and specifies an initial query containing the word "Lloyd's," so that the system predicts documents containing the word "Lloyd's" to be more interesting and makes them more easily accessible, even to the point of listing such documents or clusters of such documents, as described above. In particular, certain articles about insurance containing the phrase "Lloyd's of London" are made more easily accessible, as are certain pieces of Welsh fiction containing phrases like "Lloyd's father." The user browses while this query is active, and hits upon a useful article describing the relation of Lloyd's of London to other British insurance houses; by replacing or augmenting the query with the full text of this article, the user can turn the attention of the system to other documents that resemble this article, such as documents about British insurance houses, rather than Welsh folktales.
In a system where queries are used, it is useful to include in the target profiles an associative attribute that records the associations between a target object and whatever terms are employed in queries used to find that target object. The association score of target object X with a particular query term T is defined to be the mean relevance feedback on target object X, averaged over just those accesses of target object X that were made while a query containing term T was active, multiplied by the negated logarithm of term T's global frequency in all queries. The effect of this associative attribute is to increase the measured similarity of two documents if they are good responses to queries that contain the same terms. A further maneuver can be used to improve the accuracy of responses to a query: in the summation used to determine the quality q(U, X) of a target object X, a term is included that is proportional to the sum of association scores between target object X and each term in the active query, if any, so that target objects that are closely associated with terms in an active query are determined to have higher quality and therefore higher interest for the user. To complement the system's automatic reorganization of the hierarchical cluster tree, the user can be given the ability to reorganize the tree manually, as he or she sees fit. Any changes are optionally saved on the user's local storage device so that they will affect the presentation of the tree in future sessions. For example, the user can choose to move or copy menu options to other menus, so that useful clusters can thereafter be chosen directly from the root menu of the tree or from other easily accessed or topically appropriate menus. In another example, the user can select clusters C1, C2, ... Ck listed on a particular menu M and choose to remove these clusters from the menu, replacing them on the menu with a single aggregate cluster M containing all the target objects from clusters C1, C2, ... Ck. In this case, the immediate subclusters of new cluster M are either taken to be clusters C1, C2, ... Ck themselves, or else, in a variation similar to the "scatter-gather" method, are automatically computed by clustering the set of all the subclusters of clusters C1, C2, ... Ck according to the similarity of the cluster profiles of these subclusters.
Electronic Mall
In one application, the browsing techniques described above may be applied to a domain where the target objects are purchasable goods. When shoppers look for goods to purchase over the Internet or other electronic media, it is typically necessary to display thousands or tens of thousands of products in a fashion that helps consumers find the items they are looking for. The current practice is to use hand-crafted menus and sub-menus in which similar items are grouped together. It is possible to use the automated clustering and browsing methods described above to more effectively group and present the items. Purchasable items can be hierarchically clustered using a plurality of different criteria. Useful attributes for a purchasable item include but are not limited to a textual description and predefined category labels (if available), the unit price of the item, and an associative attribute listing the users who have bought this item in the past. Also useful is an associative attribute indicating which other items are often bought on the same shopping "trip" as this item; items that are often bought on the same trip will be judged similar with respect to this attribute, so tend to be grouped together. Retailers may be interested in utilizing a similar technique for purposes of predicting both the nature and relative quantity of items which are likely to be popular to their particular clientele. This prediction may be made by using aggregate purchasing records as the search profile set from which a collection of target objects is recommended. Estimated customer demand which is indicative of (relative) inventory quantity for each target object item is determined by measuring the cluster variance of that item compared to another target object item (which is in stock).
As described above, hierarchically clustering the purchasable target objects results in a hierarchical menu system, in which the target objects or clusters of target objects that appear on each menu can be labeled by names or icons and displayed in a two-dimensional or three-dimensional menu in which similar items are displayed physically near each other or on the same graphically represented "shelf." As described above, this grouping occurs both at the level of specific items (such as standard size Ivory soap or large Breck shampoo) and at the level of classes of items (such as soaps and shampoos). When the user selects a class of items (for instance, by clicking on it), then the more specific level of detail is displayed. It is neither necessary nor desirable to limit each item to appearing in one group; customers are more likely to find an object if it is in multiple categories. Non-purchasable objects such as artwork, advertisements, and free samples may also be added to a display of purchasable objects, if they are associated with (liked by) substantially the same users as are the purchasable objects in the display.
Network Context of the Browsing System
The files associated with target objects are typically distributed across a large number of different servers S1-So and clients C1-Cn. Each file has been entered into the data storage medium at some server or client in any one of a number of ways, including, but not limited to: scanning, keyboard input, e-mail, FTP transmission, automatic synthesis from another file under the control of another computer program. While a system to enable users to efficiently locate target objects may store its hierarchical cluster tree on a single centralized machine, greater efficiency can be achieved if the storage of the hierarchical cluster tree is distributed across many machines in the network. Each cluster C, including single-member clusters (target objects), is digitally represented by a file F, which is multicast to a topical multicast tree MT(C1); here cluster C1 is either cluster C itself or some supercluster of
cluster C. In this way, file F is stored at multiple servers, for redundancy. The file F that represents cluster C contains at least the following data:
1. The cluster profile for cluster C, or data sufficient to reconstruct this cluster profile.
2. The number of target objects contained in cluster C.
3. A human-readable label for cluster C, as described in section "Labeling Clusters" above.
4. If the cluster is divided into subclusters, a list of pointers to files representing the subclusters. Each pointer is an ordered pair naming, first, a file, and second, a multicast tree or a specific server where that file is stored.
5. If the cluster consists of a single target object, a pointer to the file corresponding to that target object.
The process by which a client machine can retrieve the file F from the multicast tree MT(C1) is described above in section "Retrieving Files from a Multicast Tree." Once it has retrieved file F, the client can perform further tasks pertaining to this cluster, such as displaying a labeled menu of subclusters, from which the user may select subclusters for the client to retrieve next.
The advantage of this distributed implementation is threefold. First, the system can be scaled to larger cluster sizes and numbers of target objects, since much more searching and data retrieval can be carried out concurrently. Second, the system is fault-tolerant in that partial matching can be achieved even if portions of the system are temporarily unavailable. It is important to note here the robustness due to redundancy inherent in our design: data is replicated at tree sites so that even if a server is down, the data can be located elsewhere.
The distributed hierarchical cluster tree can be created in a distributed fashion, that is, with the participation of many processors. Indeed, in most applications it should be recreated from time to time, because as users interact with target objects, the associative attributes in the target profiles of the target objects change to reflect these interactions; the system's similarity measurements can therefore take these interactions into account when judging similarity, which allows a more perspicuous cluster tree to be built. The key technique is the following procedure for merging n disjoint cluster trees, represented respectively by files F1 ... Fn in distributed fashion as described above, into a combined cluster tree that contains all the target objects from all these trees. The files F1 ... Fn are described above, except that the cluster labels are not included in the representation. The following steps are executed by a server S1, in response to a request message from another server S0, which request message includes pointers to the files F1 ... Fn.
1. Retrieve files F1 ... Fn.
2. Let

file G that represents the merged tree. Add this pointer to list M.
15. For each file Fi from among F1 ... Fn:
16. If list M does not include a pointer to file Fi, send a message to the server or servers storing Fi instructing them to delete file Fi.
17. Create and store a file F that represents a new cluster, whose subcluster pointers are exactly the subcluster pointers on list M.
18. Send a reply message to server S0, which reply message contains a pointer to file F and indicates that file F represents the merged cluster tree.
With the help of the above procedure, and the multicast tree MTfull that includes all proxy servers in the network, the distributed hierarchical cluster tree for a particular domain of target objects is constructed by merging many local hierarchical cluster trees, as follows.
1. One server S (preferably one with good connectivity) is elected from the tree.
2. Server S sends itself a global request message that causes each proxy server in MTfull (that is, each proxy server in the network) to ask its clients for files for the cluster tree.
3. The clients of each proxy server transmit to the proxy server any files that they maintain, which files represent target objects from the appropriate domain that should be added to the cluster tree.
4. Server S forms a request R1 that, upon receipt, will cause the recipient server S1 to take the following actions: (a) Build a hierarchical cluster tree of all the files stored on server S1 that are maintained by users in the user base of S1. These files correspond to target objects from the appropriate domain. This cluster tree is typically stored entirely on S1, but may in principle be stored in a distributed fashion. (b) Wait until all servers to which the server S1 has propagated request R1 have sent the recipient reply messages containing pointers to cluster trees. (c) Merge together the cluster tree created in step 5(a) and the cluster trees supplied in step 5(b), by sending any server (such as S1 itself) a message requesting such a merge, as described above. (d) Upon receiving a reply to the message sent in (c), which reply includes a pointer to a file representing the merged cluster tree, forward this reply to the sender of request R1, unless this is S1 itself.
5. Server S sends itself a global request message that causes all servers in MTfull to act on embedded request R1.
6. Server S receives a reply to the message it sent in 5(c). This reply includes a pointer to a file F that represents the completed hierarchical cluster tree. Server S multicasts file F to all proxy servers in MTfull.
Once the hierarchical cluster tree has been created as above, server S can send additional messages through the cluster tree, to arrange that multicast trees MT(C) are created for sufficiently large clusters C, and that each file F is multicast to the tree MT(C), where C is the smallest cluster containing file F.
L and M be empty lists. 3. For each file Fi from among Matching Users for Virtual Communities
F1 ... Fn: 4. Iffile Ficontains pointers to subcluster files, add 50
these pointers to list L. 5. If file Firepresents a single target Virtual Communities
object, add a pointer to file Fito list L. 6. For each pointer X Computer users frequently join other users for discussions
on list L. retrieve the file that pointer P points to and extract on computer bulletin boards, newsgroups, mailing lists, and
the cluster profile POX) that this file stores. 7. Apply a clus real-time chat sessions over the computer network, which
tering algorithm to group the pointers X on list Laccording to 55 may be typed (as with Internet Relay Chat (IRC)), spoken (as
the distances between their respective cluster profiles POX). 8. with Internet phone), or videoconferenced. These forums are
For each (nonempty) resulting group C of pointers: 9. If C herein termed “virtual communities.” In current practice,
contains only one pointer, add this pointer to list M: 10. each virtual community has a specified topic, and users dis
otherwise, if C contains exactly the same Subcluster pointers cover communities of interest by word of mouth or by exam
as does one of the files Fi from among F1 ... Fn, then add a 60 ining a long list of communities (typically hundreds or thou
pointer to file Fi to list M: 11. otherwise: 12. Select an arbi sands). The users then must decide for themselves which of
trary server S2 on the network, for example by randomly thousands of messages they find interesting from among
selecting one of the pointers in group C and choosing the those posted to the selected virtual communities, that is, made
serverit points to. 13. Senda request message to server S2 that publicly available to members of those communities. If they
includes the Subcluster pointers in group C and requests 65 desire, they may also write additional messages and post them
server S2 to merge the corresponding Subcluster trees. 14. to the virtual communities of their choice. The existence of
Receive a response from server S2, containing a pointer to a thousands of Internet bulletin boards (also termed news
groups) and countless more Internet mailing lists and private bulletin board services (BBS’s) demonstrates the very strong interest among members of the electronic community in forums for the discussion of ideas about almost any subject imaginable. Presently, virtual community creation proceeds in a haphazard form, usually instigated by a single individual who decides that a topic is worthy of discussion. There are protocols on the Internet for voting to determine whether a newsgroup should be created, but there is a large hierarchy of newsgroups (which begin with the prefix “alt.”) that do not follow this protocol.
The system for customized electronic identification of desirable objects described herein can of course function as a browser for bulletin boards, where target objects are taken to be bulletin boards, or subtopics of bulletin boards, and each target profile is the cluster profile for a cluster of documents posted on some bulletin board. Thus, a user can locate bulletin boards of interest by all the navigational techniques described above, including browsing and querying. However, this method only serves to locate existing virtual communities. Because people have varied and varying complex interests, it is desirable to automatically locate groups of people with common interests in order to form virtual communities. The Virtual Community Service (VCS) described below is a network-based agent that seeks out users of a network with common interests, dynamically creates bulletin boards or electronic mailing lists for those users, and introduces them to each other electronically via e-mail. It is useful to note that once virtual communities have been created by VCS, the other browsing and filtering technologies described above can subsequently be used to help a user locate particular virtual communities (whether pre-existing or automatically generated by VCS); similarly, since the messages sent to a given virtual community may vary in interest and urgency for a user who has joined that community, these browsing and filtering technologies (such as the e-mail filter) can also be used to alert the user to urgent messages and to screen out uninteresting ones.
The functions of the Virtual Community Service are general functions that could be implemented on any network ranging from an office network in a small company to the World Wide Web or the Internet. The four main steps in the procedure are:
1. Scan postings to existing virtual communities.
2. Identify groups of users with common interests.
3. Match users with virtual communities, creating new virtual communities when necessary.
4. Continue to enroll additional users in the existing virtual communities.
More generally, users may post messages to virtual communities pseudonymously, even employing different pseudonyms for different virtual communities. (Posts not employing a pseudonymous mix path may, as usual, be considered to be posts employing a non-secure pseudonym, namely the user's true network address.) Therefore, the above steps may be expressed more generally as follows:
1. Scan pseudonymous postings to existing virtual communities.
2. Identify groups of pseudonyms whose associated users have common interests.
3. Match pseudonymous users with virtual communities, creating new virtual communities when necessary.
4. Continue to enroll additional pseudonymous users in the existing virtual communities.
Each of these steps can be carried out as described below.
Scanning
Using the technology described above, Virtual Community Service constantly scans all the messages posted to all the newsgroups and electronic mailing lists on a given network, and constructs a target profile for each message found. The network can be the Internet, or a set of bulletin boards maintained by America Online, Prodigy, or CompuServe, or a smaller set of bulletin boards that might be local to a single organization, for example a large company, a law firm, or a university. The scanning activity need not be confined to bulletin boards and mailing lists that were created by Virtual Community Service, but may also be used to scan the activity of communities that predate Virtual Community Service or are otherwise created by means outside the Virtual Community Service system, provided that these communities are public or otherwise grant their permission.
The target profile of each message includes textual attributes specifying the title and body text of the message. In the case of a spoken rather than written message, the latter attribute may be computed from the acoustic speech data by using a speech recognition system. The target profile also includes an associative attribute listing the author(s) and designated recipient(s) of the message, where the recipients may be individuals and/or entire virtual communities; if this attribute is highly weighted, then the system tends to regard messages among the same set of people as being similar or related, even if the topical similarity of the messages is not clear from their content, as may happen when some of the messages are very short. Other important attributes include the fraction of the message that consists of quoted material from previous messages, as well as attributes that are generally useful in characterizing documents, such as the message's date, length, and reading level.
Virtual Community Identification
Next, Virtual Community Service attempts to identify groups of pseudonymous users with common interests. These groups, herein termed “pre-communities,” are represented as sets of pseudonyms. Whenever Virtual Community Service identifies a pre-community, it will subsequently attempt to put the users in said pre-community in contact with each other, as described below. Each pre-community is said to be “determined” by a cluster of messages, pseudonymous users, search profiles, or target objects.
In the usual method for determining pre-communities, Virtual Community Service clusters the messages that were scanned and profiled in the above step, based on the similarity of those messages' computed target profiles, thus automatically finding threads of discussion that show common interests among the users. Naturally, discussions in a single virtual community tend to show common interests; however, this method uses all the texts from every available virtual community, including bulletin boards and electronic mailing lists. Indeed, a user who wishes to initiate or join a discussion on some topic may send a “feeler message” on that topic to a special mailing list designated for feeler messages; as a consequence of the scanning procedure described above, the feeler message is automatically grouped with any similarly profiled messages that have been sent to this special mailing list, to topical mailing lists, or to topical bulletin boards. The clustering step employs “soft clustering,” in which a message may belong to multiple clusters and hence to multiple virtual communities. Each cluster of messages that is found by Virtual Community Service and that is of sufficient size (for example, 10-20 different messages) determines a pre-community whose members are the pseudonymous authors and recipients of the messages in the cluster. More precisely, the pre-community consists of the various pseudonyms under which the messages in the cluster were sent and received.
Alternative methods for determining a pre-community, which do not require the scanning step above, include the following:
1. Pre-communities can be generated by grouping together users who have similar interests of any sort, not merely individuals who have already written or received mes
sages about similar topics. If the user profile associated with each pseudonym indicates the user's interests, for example through an associative attribute that indicates the documents or Web sites a user likes, then pseudonyms can be clustered based on the similarity of their associated user profiles, and each of the resulting clusters of pseudonyms determines a pre-community comprising the pseudonyms in the cluster.
2. If each pseudonym has an associated search profile set formed through participation in the news clipping service described above, then all search profiles of all pseudonymous users can be clustered based on their similarity, and each cluster of search profiles determines a pre-community whose members are the pseudonyms from whose search profile sets the search profiles in the cluster are drawn. Such groups of people have been reading about the same topic (or, more generally, accessing similar target objects) and so presumably share an interest.
3. If users participate in a news clipping service or any other filtering or browsing system for target objects, then an individual user can pseudonymously request the formation of a virtual community to discuss a particular cluster of one or more target objects known to that system. This cluster of target objects determines a pre-community consisting of the pseudonyms of users determined to be most interested in that cluster (for example, users who have search profiles similar to the cluster profile), together with the pseudonym of the user who requested formation of the virtual community.
Matching Users with Communities
Once Virtual Community Service identifies a cluster C of messages, users, search profiles, or target objects that determines a pre-community M, it attempts to arrange for the members of this pre-community to have the chance to participate in a common virtual community V. In many cases, an existing virtual community V may suit the needs of the pre-community M. Virtual Community Service first attempts to find such an existing community V. In the case where cluster C is a cluster of messages, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the mean profile of the set of messages recently posted to virtual community V; in the case where cluster C is a cluster of users, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the mean user profile of the active members of virtual community V; in the case where the cluster C is a cluster of search profiles, V may be chosen to be any existing virtual community such that the cluster profile of cluster C is within a threshold distance of the cluster profile of the largest cluster resulting from clustering all the search profiles of active members of virtual community V; and in the case where the cluster C is a cluster of one or more target objects chosen from a separate browsing or filtering system, V may be chosen to be any existing virtual community initiated in the same way from a cluster whose cluster profile in that other system is within a threshold distance of the cluster profile of cluster C. The threshold distance used in each case is optionally dependent on the cluster variance or cluster diameter of the profile sets whose means are being compared.
If no existing virtual community V meets these conditions and is also willing to accept all the users in pre-community M as new members, then Virtual Community Service attempts to create a new virtual community V. Regardless of whether virtual community V is an existing community or a newly created community, Virtual Community Service sends an e-mail message to each pseudonym P in pre-community M whose associated user U does not already belong to virtual community V (under pseudonym P) and has not previously turned down a request to join virtual community V. The e-mail message informs user U of the existence of virtual community V, and provides instructions which user U may follow in order to join virtual community V if desired; these instructions vary depending on whether virtual community V is an existing community or a new community. The message includes a credential, granted to pseudonym P, which credential must be presented by user U upon joining the virtual community V, as proof that user U was actually invited to join. If user U wishes to join virtual community V under a different pseudonym Q, user U may first transfer the credential from pseudonym P to pseudonym Q, as described above. The e-mail message further provides an indication of the common interests of the community, for example by including a list of titles of messages recently sent to the community, or a charter or introductory message provided by the community (if available), or a label generated by the methods described above that identifies the content of the cluster of messages, user profiles, search profiles, or target objects that was used to identify the pre-community M.
If Virtual Community Service must create a new community V, several methods are available for enabling the members of the new community to communicate with each other. If the pre-community M is large, for example containing more than 50 users, then Virtual Community Service typically establishes either a multicast tree, as described below, or a widely-distributed bulletin board, assigning a name to the new bulletin board. If the pre-community M has fewer members, for example 2-50, Virtual Community Service typically establishes either a multicast tree, as described below, or an e-mail mailing list. If the new virtual community V was determined by a cluster of messages, then Virtual Community Service kicks off the discussion by distributing these messages to all members of virtual community V. In addition to bulletin boards and mailing lists, alternative forums that can be created and in which virtual communities can gather include real-time typed or spoken conversations (or engagement in distributed multi-user applications, including video games) over the computer network and physical meetings, any of which can be scheduled by a partly automated process wherein Virtual Community Service requests meeting time preferences from all members of the pre-community M and then notifies these individuals of an appropriate meeting time.
Continued Enrollment
Even after creation of a new virtual community, Virtual Community Service continues to scan other virtual communities for new messages whose target profiles are similar to the community's cluster profile (average message profile). Copies of any such messages are sent to the new virtual community, and the pseudonymous authors of these messages, as well as users who show high interest in reading such messages, are informed by Virtual Community Service (as for pre-community members, above) that they may want to join the community. Each such user can then decide whether or not to join the community. In the case of Internet Relay Chat (IRC), if the target profiles of messages in a real-time dialog are (or become) similar to that of a user, VCS may also send an urgent e-mail message to such user whereby the user may be automatically notified as soon as the dialog appears, if desired.
With these facilities, Virtual Community Service provides automatic creation of new virtual communities in any local or wide-area network, as well as maintenance of all virtual communities on the network, including those not created by Virtual Community Service. The core technology underlying Virtual Community Service is creating a search and clustering mechanism that can find articles that are “similar” in that the users share interests. This is precisely what was described
above. One must be sure that Virtual Community Service does not bombard users with notices about communities in which they have no real interest. On a very small network a human could be “in the loop,” scanning proposed virtual communities and perhaps even giving them names. But on larger networks Virtual Community Service has to run in fully automatic mode, since it is likely to find a large number of virtual communities.
Delivering Messages to a Virtual Community
Once a virtual community has been identified, it is straightforward for Virtual Community Service to establish a mailing list so that any member of the virtual community may distribute e-mail to all other members. Another method of distribution is to use a conventional network bulletin board or newsgroup to distribute the messages to all servers in the network, where they can be accessed by any member of the virtual community. However, these simple methods do not take into account cost and performance advantages which accrue from optimizing the construction of a multicast tree to carry messages to the virtual community. Unlike a newsgroup, a multicast tree distributes messages to only a selected set of servers, and unlike an e-mail mailing list, it does so efficiently.
A separate multicast tree MT(V) is maintained for each virtual community V, by use of the following four procedures.
1. To construct or reconstruct this multicast tree, the core servers for virtual community V are taken to be those proxy servers that serve at least one pseudonymous member of virtual community V. Then the multicast tree MT(V) is established via steps 4-6 in the section “Multicast Tree Construction Procedure” above.
2. When a new user joins virtual community V, which is an existing virtual community, the user sends a message to the user's proxy server S. If user's proxy server S is not already a core server for V, then it is designated as a core server and is added to the multicast tree MT(V), as follows. If more than k servers have been added since the last time the multicast tree MT(V) was rebuilt, where k is a function of the number of core servers already in the tree, then the entire tree is simply rebuilt via steps 4-6 in the section “Multicast Tree Construction Procedure” above. Otherwise, server S retrieves its locally stored list of nearby core servers for V, and chooses a server S1. Server S sends a control message to S1, indicating that it would like to be added to the multicast tree MT(V). Upon receipt of this message, server S1 retrieves its locally stored subtree G1 of MT(V), and forms a new graph G from G1 by removing all degree-1 vertices other than S1 itself. Server S1 transmits graph G to server S, which stores it as its locally stored subtree of MT(V). Finally, server S sends a message to itself and to all servers that are vertices of graph G, instructing these servers to modify their locally stored subtrees of MT(V) by adding S as a vertex and adding an edge between S1 and S.
3. When a user at a client q wishes to send a message F to virtual community V, client q embeds message F in a request R instructing the recipient to store message F locally, for a limited time, for access by members of virtual community V. Request R includes a credential proving that the user is a member of virtual community V or is otherwise entitled to post messages to virtual community V (for example is not “black marked” by that or other virtual community members). Client q then broadcasts request R to all core servers in the multicast tree MT(V), by means of a global request message transmitted to the user's proxy server as described above. The core servers satisfy request R, provided that they can verify the included credential.
4. In order to retrieve a particular message sent to virtual community V, a user U at client q initiates the steps described in section “Retrieving Files from a Multicast Tree,” above. If user U does not want to retrieve a particular message, but rather wants to retrieve all new messages sent to virtual community V, then user U pseudonymously instructs its proxy server (which is a core server for V) to send it all messages that were multicast to MT(V) after a certain date. In either case, user U must provide a credential proving user U to be a member of virtual community V, or otherwise entitled to access messages on virtual community V.

SUMMARY

A method has been presented for automatically selecting articles of interest to a user. The method generates sets of search profiles for the users based on such attributes as the relative frequency of occurrence of words in the articles read by the users, and uses these search profiles to efficiently identify future articles of interest. The method is characterized by passive monitoring (users do not need to explicitly rate the articles), multiple search profiles per user (reflecting interest in multiple topics) and use of elements of the search profiles which are automatically determined from the data (notably, the TF/IDF measure based on word frequencies and descriptions of purchasable items). A method has also been presented for automatically generating menus to allow users to locate and retrieve articles on topics of interest. This method clusters articles based on their similarity, as measured by the relative frequency of word occurrences. Clusters are labeled either with article titles or with key words extracted from the article. The method can be applied to large sets of articles distributed over many machines.
It has been further shown how to extend the above methods from articles to any class of target objects for which profiles can be generated, including news articles, reference or work articles, electronic mail, product or service descriptions, people (based on the articles they read, demographic data, or the products they buy), and electronic bulletin boards (based on the articles posted to them). A particular consequence of being able to group people by their interests is that one can form virtual communities of people of common interest, who can then correspond with one another via electronic mail.

I claim:
1. A method for providing a user with customized electronic information accessible via an electronic data communication network, wherein user devices of said user and other users are connected via said communication network to a server system which provides the user devices with access to a plurality of target information objects, said method comprising:
   generating target profiles for said target information objects, each said target profile being generated by a computer system running a profile generation algorithm, from the contents of an associated one of said target information objects, each said target profile having a set of numeric values each of which represents a degree to which a corresponding attribute is present in the target information object;
   generating at least one user target profile interest summary for a user, each said user target profile interest summary being generated based on said target profiles associated with said target information objects accessed by the user, each said user target profile interest summary having a set of numeric values each of which represents the user's preference for a corresponding attribute, wherein said numeric values are generated without requiring the user to explicitly indicate the user's preference;
   determining differences between corresponding numeric values of said at least one user target profile interest summary and said target profiles; and
   automatically creating a customized electronic delivery of target information objects by presenting said user with a customized selection, as a function of said differences, of selected ones of the plurality of target information objects.
2. The method of claim 1, wherein one or more of said target information objects comprise advertisements.
3. The method of claim 1, further comprising an act of:
   automatically transmitting to the user a notification to signal newly available target information objects of potential interest to the user, as determined by at least one user target profile interest summary.
4. The method of claim 1, further comprising an act of:
   identifying to the user the customized selection of target information objects by providing hyperlinks to said selected target information objects.
5. The method of claim 1, further comprising an act of:
   enabling the user to perform an on-line search of said plurality of target information objects via said communication network; and
   delivering results of said on-line search based upon a predicted level of interest by the user in the target information objects within the search results.
6. The method of claim 1, wherein said target information objects comprise news articles.
7. A system for providing a user with a customized electronic information service accessible via an electronic data communication network, wherein users are connected via user terminals and said communication network to a server system which provides a user with access to target information objects, the system comprising:
   a target profile generator which generates target profiles for said plurality of target information objects, said generator comprising a computer system running a profile generation algorithm on the contents of one or more target information objects, each said target profile having a set of numeric values each of which represents a degree to which a corresponding attribute is present in the target information object;
   a user target profile interest summary generator which generates, for a user at a user terminal, based on target information object profiles, a user target profile interest summary, said user target profile interest summary having a set of numeric values each of which represents the user's preference for a corresponding attribute, wherein said numeric values are generated without requiring the user to explicitly indicate the user's preference;
   a correspondence-determining module which determines differences between the numeric values of at least one user target profile interest summary and the target profiles; and
   a customized information delivery subsystem which automatically creates and delivers to a user via said communication network a customized selection, as a function of said differences, of target information objects.
8. The system of claim 7, further comprising a module which automatically transmits a notification to the user to identify newly received ones of target information objects of possible interest to the user, as determined using at least one user target profile interest summary.
9. The system of claim 7, wherein said target information objects comprise news articles.
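
The comparison recited in claims 1 and 7 (numeric target profiles, a numeric user interest summary, and a selection made as a function of the differences between them) can be illustrated with a minimal sketch. The sketch is not taken from the patent text: the Euclidean distance, the dictionary representation of a profile, the attribute names, and the function names profile_distance and customized_selection are all assumptions chosen for illustration. The claims require only that the selection be some function of the differences.

```python
from math import sqrt


def profile_distance(user_summary, target_profile):
    """Distance between two profiles given as {attribute: numeric value} dicts.

    Assumption: an attribute missing from a profile counts as 0.0, and the
    per-attribute differences are combined with a Euclidean norm.
    """
    attributes = set(user_summary) | set(target_profile)
    return sqrt(sum(
        (user_summary.get(a, 0.0) - target_profile.get(a, 0.0)) ** 2
        for a in attributes
    ))


def customized_selection(user_summary, target_profiles, top_n=3):
    """Return the ids of the top_n target objects closest to the user summary.

    target_profiles maps a (hypothetical) object id to its target profile.
    """
    ranked = sorted(
        target_profiles,
        key=lambda obj_id: profile_distance(user_summary, target_profiles[obj_id]),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical attribute weights, for illustration only.
    user = {"sports": 0.9, "finance": 0.1}
    targets = {
        "a": {"sports": 0.8, "finance": 0.2},
        "b": {"sports": 0.1, "finance": 0.9},
        "c": {"sports": 0.9},
    }
    print(customized_selection(user, targets, top_n=2))  # prints ['c', 'a']
```

In this sketch a smaller distance is treated as a better match, so the ranked list plays the role of the customized selection. Any other monotone function of the per-attribute differences, such as a weighted distance, could be substituted without changing the structure.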
