Goldstein 2006
Abstract

We define genres of email as well as a subset of "speech acts" relevant to email, enhanced for email-specific discourse. After creating a ground truth set of emails based on these email acts, we compare the performance of two classifiers (Random Forests and SVM-light) in identifying the primary communicative intent of the email and its corresponding genre. We experiment with using feature sets derived from two verb lexicons as well as a feature set containing selected characteristics of email. Results show better classifier accuracy using the verb lexicon with the smaller number of classes over the larger, and that using part-of-speech tagging to focus on selecting only verbs causes a slight drop in performance. Using the email characteristics set alone results in better performance than either of the verb lexicons alone, but the best results are obtained using a combination of the smaller verb lexicon and the email characteristics set.

1. Introduction and related work

Today the Internet is used on a daily basis by over 70% of adult Americans, and almost 60% use email on a typical day [27]. These figures represent a dramatic increase over usage levels of only five years ago. Recently, several events have focused public attention on email. On both the national and local level, email communication is being combed for information that may be used for legal or other purposes [26]. The most prominent of these events, the collapse of the Enron Corporation, resulted in the courts making available to public access in 2004 a corpus of over 500,000 corporate Enron emails. The presence of this corpus and the need to develop tools for email processing have stimulated interest in research into email.

Understanding the structure and functions of email will aid in the development of much-needed tools for the categorization and summarization of email. There has been a great deal of work in email filtering and spam detection [29]. The more demanding task of categorizing email is receiving more attention [21]. And with the availability of large corpora, there is growing interest in using email analysis to determine the structure of social networks [16].

Historically, email structure and function have been of interest to a variety of research communities. Among them are linguists, social scientists studying communication and organizational behavior, and those interested in genre studies. With the availability of the Enron and other large email corpora, utilizing the analytical techniques from both the computational linguistics and computer science communities is more feasible. A long-standing question involves the basic structure of email: is email a form of writing, a form of speech, or a new, hybrid genre [5] [33]? It is appealing to characterize email as a genre. Writing and, less frequently, speech can be characterized by genre:

    [A genre is] a patterning of communication created by a combination of the individual, social and technical forces implicit in a recurring communicative situation. A genre structures communication by creating shared expectations about the form and content of the interaction, thus easing the burden of production and interpretation. [18]

If, in fact, email is a distinctive genre that has emerged as the result of a new communicative medium [30], researchers can be guided by an expected form and content. Myka [24] argues that email is an amalgam of several genres, a theory we support, and Crystal [15] observes that the email genre is still evolving.

Email can be considered to be an amalgam of speech and writing. Biber [7] used statistical techniques to analyze the linguistic features of twenty-three spoken and written genres and found that
Proceedings of the 39th Hawaii International Conference on System Sciences - 2006
[Figure: flowchart of annotation questions about the sender S and recipient R, branching to the transmissive acts T1-T5, directives D1-D2 and I1-I2, commissives C1-C2, assertions A1-A2, expressives B1-B3, and the backward response codes R1-R5 (no reply expected) and F1-F5 (reply expected).]

Figure 1: Email Speech Acts for Email. 12 Main Categories, 30 Subcategories consisting of 23 traditional speech acts and 7 email specific acts. The gray color shading indicates a backward communicative function.
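The branching logic of Figure 1 can be illustrated in code. The following is a minimal, hypothetical sketch (not the authors' annotation tool): each parameter stands for one yes/no question in the flowchart, and the function returns the main category letter, giving precedence to backward (response) functions as the annotation guidelines require.

```python
# Hypothetical sketch of the Figure 1 decision procedure; the parameter
# names paraphrase the flowchart questions and are our own invention.

def classify_email_act(
    responds_to_request=False,    # backward function: S responds to R's communication
    forwards_document=False,      # "Is the primary intent of S to forward another document?"
    wants_action=False,           # "Does S want R to do something?"
    wants_info=False,             # ...and the requested action is providing information
    offers_or_commits=False,      # "Does S offer or commit to do something?"
    states_fact=False,            # "Does S make a clear statement or give info about the world?"
    expresses_feelings=False,     # "Does S express his/her feelings?"
):
    """Return the main category letter, with backward functions taking precedence."""
    if responds_to_request:
        return "R/F"   # backward function; Rn vs Fn depends on whether a reply is expected
    if forwards_document:
        return "T"     # transmissives T1-T5
    if wants_action:
        return "I" if wants_info else "D"   # information directives I1-I2 vs action directives D1-D2
    if offers_or_commits:
        return "C"     # C1 offer, C2 commit
    if states_fact:
        return "A"     # A1 statement, A2 opinion
    if expresses_feelings:
        return "B"     # B1 apology, B2 thank you, B3 other
    return "O"         # other forward function

# Example: a plain request for information
print(classify_email_act(wants_action=True, wants_info=True))  # I
```

The subcategory digits (e.g. T1 vs T4, R2 vs F2) would be resolved by the further questions in the flowchart; the sketch only covers the main-category branching.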
method matches stemmed words in the text to the verbs in the verb classes; the other uses the TnT part-of-speech tagger [8] to identify that the part-of-speech class is indeed a verb (since noun roots can appear identical to verb roots). Membership in each of the LCS verb classes is one feature of the feature vector. We make a pass through the text, identifying verbs (by comparing a stemmed word in the text to a stemmed version of the verbs in LCS). If a verb is in multiple classes, each class is proportionally incremented so that each verb contributes one unit. The final counts are normalized by the word count of the sender's body of the email (which excludes punctuation, signature block, included text from another email, etc.).

Ballmer and Brennenstuhl (BB) [4] present a classification of speech acts based on linguistic activities and aspects. These were originally devised for German verbs and then translated into English, resulting in many multi-word "verbs" in the 600 categories and 24 classes. As a first pass, all multi-word entries in the BB classes were ignored. As with LCS, each BB class was supplemented by irregular tenses of verbs, and each class is used as a single feature in the feature set, resulting in 24 features. Verb counts for the features were computed using the same process as for LCS.

4.2 Other Characteristics

Certain types of email contain features that indicate particular communicative intents. These features can be presentation oriented (which include features found in header information or in punctuation) or text oriented. For example, the presence of "Re:" in the subject line usually indicates a response, although sometimes users will shift topics or introduce new topics and not change the subject line. Similarly, "Fwd:" often signals a transmissive intent. The presence of interrogative sentences (detected by the presence of question marks in the body of the email) can identify an information request.

In addition, certain words can indicate various email speech acts. For example, "Thanks" and its variants can indicate an acknowledgement. "Attached", "Enclosed", and "Here is" signal the presence of an attachment, whose name may appear in the header or in the body of the text depending on the email software.

Table 3 summarizes an initial list of additional email features, based on form and content, to assist in genre identification (refer to Table 2). All features are either binary or were normalized over the document length (word count or sentence count). All words/phrases that were used are listed in the table. In the future, we plan to do further analysis to expand this list.

Table 3: List of 16 email speech act features.

  EMAIL CHARACTERISTICS FEATURES (EF)
  1.  Presence of Re:
  2.  Presence of Fwd:
  3.  Attachment signified in header info or by an insertion in text body
  4.  Fraction of interrogative sentences (sentences ending in '?' / total sentences)
  5.  Fraction of "I" or "we" (count of words / total word count)
  6.  Fraction of "You" (count of words / total word count)
  7.  Attachment indicators such as "attached", "here is", "enclosed"
  8.  Apology indicators such as "sorry", "apology", "apologies"
  9.  Opinion indicators: "think", "feel", "believe", "opinion", "comment"
  10. Politeness indicators such as "please"
  11. Gratitude indicators such as "thank"
  12. Action indicators such as "can you", "would you"
  13. Commitment indicators such as "I can", "I will"
  14. Information indicators such as "information", "info", "send"
  15. Auto-reply indicators such as "out of the office", "away"
  16. Email length

5. Classifiers

Two different classifiers were used for our experiments: Support Vector Machine [19] and Random Forests [9]. We used SVM-light [32] with a radial basis function and the default settings. SVM-light builds binary models, so in our case, where we have five classes, a model must be produced for each class.

In contrast, Random Forests [28] grows many classification trees, and each tree gives a vote for a particular class. The forest chooses the classification having the most votes. We use 100 trees in our experiments. Random Forests takes far less training time than SVM-light.

6. Experiments

The experiments on the data described in Section 4 were run using the ten-fold cross-validation method. This splits the data into training and test sets with a 90% training, 10% testing portion. Experiments are repeated ten times, so that all the data is used both in training and testing but not all at the same time.

We compared classifier performance using the LCS verb features, the BB verb features (Section 4.1) and the email characteristics feature set EF (Table 3), as well as combinations of these feature sets. The results for the Random Forests classifier are presented in Table 4. Note that in experiments where a classification is required, i.e., there is no "unknown", Recall (percentage of emails correctly classified), Precision (percentage of classifications that were
correct) and F1 (the harmonic mean of recall and precision, F1 = 2*R*P / (R + P)) are all equal.

Table 4: Results (precision) of Random Forests Classifier for identifying the five email act classes (T, I&D, A, R, F) without and with part-of-speech tagging TnT.

             No TnT   TnT
  EF          .57     N/A
  BB          .38     .30
  BB+EF       .63     .52
  LCS         .35     .34
  LCS+EF      .57     .54

Table 4 shows that the email characteristics feature set (16 features) does very well (.57), outperforming both LCS and BB (.35 and .38 respectively). This is not a surprise, since many of the features give clear indications of the appropriate category, such as Re: for a response. Adding the verb features to the email characteristics feature set slightly increases performance, for both verb feature sets. BB+EF results in the best feature set combination, with a precision of .63. TnT decreases performance when used in combination with BB, but not when combined with LCS. In the future we hope to determine what characteristics of the verb lexicons are contributing to such results.

We also compared Random Forests and SVM-light for the LCS feature set (Table 5). The results indicate that these two classifiers are often very close in performance, results that we have seen in other unpublished experiments on different data sets. The one exception to this was LCS.

Table 5: Results (precision) of Random Forests compared to SVM-light for LCS on the five email act classes (T, I&D, A, R, F).

               SVM-light   Random Forests
  LCS            .28           .35
  LCS+TnT        .35           .36
  LCS+EF         .59           .57
  LCS+EF+TnT     .55           .54

The confusion matrix for Random Forests on the email characteristics feature set EF is displayed in Table 6. Tables 7 and 8 show the matrices for the verb feature sets BB and LCS respectively. The results in Table 9 clearly indicate the improved performance of the combination of BB and the email feature set EF.

From Tables 6, 7 and 8 we can see that using BB alone and LCS alone results in higher classification accuracy for Directives (I&D) and Assertions (A) than that of just EF. This indicates that we would either need to expand our email characteristics set to include more distinguishing features for I&D and A, or use verb classes of the type found in BB to assist in such characterizations. However, using the verb classes of BB combined with EF resulted in a decrease in performance for Responses with expectation of reply. We need to investigate why.

Only for Assertions (A) do both the feature sets LCS and BB perform clearly better than EF. Both verb lexicons have difficulty with responses (both R and F). LCS outperformed BB only for responses R. It is the intent of LCS to classify verbs by analyzing the grammatical structure in which they appear. We did not grammatically parse the text of the email; this may have contributed to the poor functioning of LCS.

Table 6: Confusion matrix for EF (Random Forests) – 16 features.

  Class   # Items    T     R     F    I&D    A
  T          55     81%    2%    6%    7%    3%
  R          56     11%   67%   15%    2%    4%
  F          56      3%   11%   83%    2%    2%
  I&D        53     14%    6%   33%   27%   21%
  A          50     29%   11%   11%   22%   28%

Table 7: Confusion matrix for BB only, no TnT (Random Forests) – 24 features.

  Class   # Items    T     R     F    I&D    A
  T          55     68%    5%   10%    8%    8%
  R          56     38%   10%   19%   17%   16%
  F          56     14%   14%   32%   24%   16%
  I&D        53      8%   14%   26%   31%   21%
  A          50      9%    8%   18%   18%   48%

Table 8: Confusion matrix for LCS only, no TnT (Random Forests) – 77 features.

  Class   # Items    T     R     F    I&D    A
  T          55     62%   13%    6%   12%    7%
  R          56     27%   19%   18%   19%   16%
  F          56     17%   22%   18%   25%   18%
  I&D        53     16%   18%   24%   32%   10%
  A          50     12%   13%   12%   18%   44%

Table 9: Confusion matrix for BB+EF, no TnT (Random Forests) – 40 features.

  Class   # Items    T     R     F    I&D    A
  T          55     82%    2%    4%    8%    4%
  R          56     12%   65%   14%    2%    8%
  F          56      5%   14%   71%    7%    3%
  I&D        53      8%    9%   18%   43%   22%
  A          50      9%   11%    4%   23%   53%

For the email feature set (EF), Responses with expectation of reply (F) are also often confused with
Is this email from S to S? [S]
    A tickler or reminder with some information content. Attachments forwarded to oneself should be coded as T4.

Is S an individual? [N]
    If S is a business, listserv, or some other institution.

Is the primary intent of S to forward another document?
    S may add some minimal explanation, but not enough to warrant a separate email. The value is in the attachment(s).
    S is sending info and asks for a response. [T1]
    S is sending info as a response to a request from R. [T2]
    S includes a short explanation (not just a list of topics). [T3]
    S is sending information "FYI". [T4]
    S is re-transmitting after a failed attempt ("Here it is again.") or with correction ("Here's the corrected version."). [T5]

To enable social network analysis, give precedence in annotating to backwards functions, i.e., if an Rn or Fn is encountered, code the entire email as Rn or Fn.

Is S responding to R's communication?
    R may have used email or another method originally. The presence of Re: in the subject may indicate a response. The request from R must be explicitly evident: "As you asked, …"
In these "backward function" annotations, we differentiate between those cases where a response is expected, i.e., includes a forward function (Fn), and where none is expected (Rn), a dead-end response. The forward function should be explicitly stated and may be a request for information, statement or directive directed at the recipient, who is expected to reply. Open-ended invitations to reply ("Let me know if you need more") should be coded as Rn.

Is S providing info in response to an explicit request?¹ [R1/F1]
    S is providing an answer to a question or providing info that R requested. "The sales figures are…"; consists of statements of fact; the refusal to provide info or declaration of lack of knowledge should be coded as R5.

Is S offering or committing to do something?¹ [R2/F2]
    S is offering or committing to do something, including providing additional information or attending a meeting. "I'll get the data to you by Friday." Include here statements of S's already completed actions in response to R's initiative: "I've sent that report this AM." Include conditional commitments: "If you send the report, I'll go."

Is S agreeing or disagreeing? [R3/F3]
    S may give a "yea," "nay," or a mixed response or commentary to R's statement(s), idea(s), directive(s), or commitment(s); "on topic". Includes corrections.

Is S suggesting R take some other action? [R4/F4]
    Include directions that R take some action (other than a commitment previously proposed by R). "See HR about that." "You need to send the report to…" Include suggestions that R provide additional info. "Please send me the data to which you're referring."

Some other response? [R5/F5]
    Includes simple acknowledgement, thanks. Include emotional responses and "chit-chat."

(Otherwise, S is initiating dialogue.)

Code as the forward speech act first encountered in the series of questions. This gives "precedence" to those speech acts that require action or response from the recipient.

Does S want R to do something?
    S wants the dialog to result in R's doing something, verbally or otherwise.

    Perform some action other than provide info?
        S tells R to do something other than assembling and transmitting data. "Please send this to all committee members." "Please clean the labs." [D1]
        S asks R to do something and R can reject the suggestion. "Would you send me the book?" "Can you come…" Verification from R is expected. [D2]

    Request information?
        A directive to provide information. The form of the information is assumed to be unknown to the sender; requests for specific reports or documents (containing desired information) should be coded as D1 or D2. "Send me the figures." [I1]
        Answer a simple question or provide other information, including feedback (comments). "Who should we invite?", "Can you tell me what is required?", "What do you think about this?" [I2]

Does S offer or commit to do something?
    Offer—may or may not. "Would you like me to…", "I can…" Include conditional commitments: "If you send me the report, I'll go to the meeting." [C1]
    Commit—definitely will. "I'll send it tonight", "I'll attend." [C2]

Does S make a clear statement or give info about the world?
    S making statements of fact. [A1]
    S is stating his/her opinions. [A2]

Does S express his/her feelings?
    "I'm sorry…" [B1]
    "Thanks so much." [B2]
    "I'm exhausted." "Welcome." "It's a shame." [B3]

Does the message perform the act?
    The statement of the words actually accomplishes an act. "You are awarded the contract", "You're guilty". Probably very rare in email. [V1]

Another forward function
    Includes general friendly hellos, introductions. [O1]

¹ Exception to the chronology of annotation implicit in the flowchart: in cases where a response includes both information (R1 or F1) AND an offer or commitment (R2 or F2), code the entire email according to the act that appeared first in the email.
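The precedence rules above (backward functions Rn/Fn take priority over forward functions, and among forward acts the one appearing first in the email wins) can be sketched in code. This is a minimal, hypothetical illustration, not the authors' annotation tool; it assumes the acts detected in an email are available as an ordered list of codes.

```python
# Hypothetical sketch of the Appendix A precedence rules for coding a
# whole email from the acts detected in it, in order of appearance.

BACKWARD = {"R1", "R2", "R3", "R4", "R5", "F1", "F2", "F3", "F4", "F5"}

def code_email(acts_in_order):
    """acts_in_order: speech-act codes in the order they appear in the email."""
    # Rule 1: any backward function codes the entire email.
    for act in acts_in_order:
        if act in BACKWARD:
            return act
    # Rule 2: otherwise, the first forward act encountered codes the email.
    return acts_in_order[0] if acts_in_order else "O1"

print(code_email(["A1", "F1", "C2"]))  # F1: backward function takes precedence
print(code_email(["A1", "I2"]))        # A1: first forward act in the email
```

The footnoted exception (a response containing both R1/F1 and R2/F2 is coded by whichever appeared first) is already handled here, since the first backward act in document order wins.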
Appendix B: Email Classes as Related to Shallow Discourse Annotation and Speech Acts

  Email Category            Dialogue Act Modeling (DAMSL) /        Speech Acts
                            Switchboard SWBD-DAMSL Annotation      Austin        Vendler       Bach&Harnish

  Forward-Communicative Function
  Statements                Statement-Assert                       Expositive    Expositive    Constative
                            Statement-Reassert                     "             "             "
                            Statement-Other                        "             "             "
  A1 Statement              Statement-non-opinion                  "             "             "
  A2 Opinion                Statement-opinion                      "             "             "

  Influencing-addressee-future-action
  I1 & I2 For information   Info-request                           Exercitive    Interrog.     Directive-Query
  I1 & I2                   Yes/No Question                        "             "             "
                            Wh-question                            "             "             "
                            Declarative Wh-Question                "             "             "
                            Open-Question                          "             "             "
                            Or-Question                            "             "             "
                            Or-clause                              "             "             "
                            Declarative Yes/No Question            "             "             "
                            Tag-Question                           "             "             "
                            Rhetorical questions                   "             "             "
  For action                Action-Directive                       "             Exercitive    Directive-Request
  D2 Open-option            Open-option                            "             "             "
  D1 Action-directive       Action-directive                       "             "             "

  Committing-speaker-future-action
  C1 Offer                  Offer                                  Commissive    Commissive    Commissive
  C2 Commit                 Commit                                 "             "             "
                            Offers, Options, & Commits             "             "             "

  Other
  Conv-opening              Conv-opening                           "             "             "
  Conv-closing              Conv-closing                           "
  V1 Explicit-performance                                          Verdictive    Verdictive
  B2 Thanking               Thanking / You're welcome              Behabitive    Behabitive    Interpersonal
  B1 Apology                Apology                                "             "             "
  B3 Exclamation            Exclamation                            "             "             "
  O1 Other                  Other-forward-function