
Proceedings of the 39th Hawaii International Conference on System Sciences - 2006

Using Speech Acts to Categorize Email and Identify Email Genres


Jade Goldstein, U.S. Department of Defense, jadeg@acm.org
Roberta Evans Sabin, Computer Science Department, Loyola College in Maryland, Baltimore, MD 21210, res@loyola.edu

Abstract

We define genres of email as well as a subset of "speech acts" relevant to email, enhanced for email-specific discourse. After creating a ground truth set of emails based on these email acts, we compare the performance of two classifiers (Random Forests and SVM-light) in identifying the primary communicative intent of the email and its corresponding genre. We experiment with feature sets derived from two verb lexicons as well as a feature set containing selected characteristics of email. Results show better classifier accuracy using the verb lexicon with the smaller number of classes over the larger, and that using part-of-speech tagging to select only verbs causes a slight drop in performance. Using the email characteristics set alone results in better performance than either of the verb lexicons alone, but the best results are obtained using a combination of the smaller verb lexicon and the email characteristics set.

1. Introduction and related work

Today the Internet is used on a daily basis by over 70% of adult Americans, and almost 60% use email on a typical day [27]. These figures represent a dramatic increase over usage levels of only five years ago. Recently, several events have focused public attention on email. On both the national and local level, email communication is being combed for information that may be used for legal or other purposes [26]. The most prominent of these events, the collapse of the Enron Corporation, resulted in the courts making a corpus of over 500,000 corporate Enron emails available to public access in 2004. The presence of this corpus and the need to develop tools for email processing have stimulated interest in research into email.

Understanding the structure and functions of email will aid in the development of much needed tools for the categorization and summarization of email. There has been a great deal of work in email filtering and spam detection [29]. The more demanding task of categorizing email is receiving more attention [21]. And with the availability of large corpora, there is growing interest in using email analysis to determine the structure of social networks [16].

Historically, email structure and function have been of interest to a variety of research communities. Among them are linguists, social scientists studying communication and organizational behavior, and those interested in genre studies. With the availability of the Enron and other large email corpora, utilizing the analytical techniques of both the computational linguistics and computer science communities is more feasible. A long-standing question involves the basic structure of email: is email a form of writing, a form of speech, or a new, hybrid genre [5] [33]? It is appealing to characterize email as a genre. Writing and, less frequently, speech can be characterized by its genre:

    [A genre is] a patterning of communication created by a combination of the individual, social and technical forces implicit in a recurring communicative situation. A genre structures communication by creating shared expectations about the form and content of the interaction, thus easing the burden of production and interpretation. [18]

If, in fact, email is a distinctive genre that has emerged as the result of a new communicative medium [30], researchers can be guided by an expected form and content. Myka [24] argues that email is an amalgam of several genres, a theory we support, and Crystal [15] observes that the email genre is still evolving.

Email can be considered to be an amalgam of speech and writing. Biber [7] used statistical techniques to analyze the linguistic features of twenty-three spoken and written genres and found that
0-7695-2507-5/06/$20.00 (C) 2006 IEEE 1


the relationship among these genres is complex and that there is no simple dichotomy between speech and writing. Collot and Belmore [13] extended Biber's work to examine electronic messages posted to an electronic bulletin board. They found that the messages most closely resembled interviews and letters and, in the dimension measuring the level of interaction and personal affect (Biber's Dimension 1), more closely resembled the spoken genres than the written ones. Baron [6] cites linguistic features of much email that are speech-like: informality of style, the psychological assumption that the medium is ephemeral, and a high level of candor.

Following some recent work of Cohen [12], we believe that email most closely resembles speech, and we look to analyze email in terms of speech acts. Our research differs from Cohen's in that we focus on verbs and the combination of verbs with email-specific features. The philosopher John Austin posited that a speaker, in using words, performs an act in making the utterance [2]. He suggested that this act can be categorized as one of five "illocutionary" acts and identified a class of verbs with each. His approach has been amplified by others, among them Vendler [36] and Bach and Harnish [3]. (See Table 1 for a comparison.)

An email might be considered to be a sequence of one or more utterances (a "soliloquy" of sorts) and thus a sequence of speech acts. Characterizing an email by its most important speech act and its genre could provide a way of categorizing email in terms of the intended action of the sender and the expected action on the part of the recipient. Such information could be utilized to triage large volumes of incoming email, to produce "to-do" lists for the recipient [14], or to track responses to the user's requests for information or action.

2. Annotation Guidelines

Computational linguists have used annotation schemes that label speech acts at the utterance level. Dialogue Act Markup in Several Layers (DAMSL) [1] and Dialogue Act Modeling [31] are two such schemes. A set of detailed annotation guidelines based on DAMSL expanded the number of possible categories [20]. The table in Appendix B indicates the relationship between the categories defined in these methodologies and traditional speech acts.

We extended these schemas for email, to allow for annotation of the primary communicative intents of an email, which we will refer to as the "email act." Appendix B shows the comparison of our email acts to categories of Dialogue Acts, DAMSL and SWBD-DAMSL and the corresponding traditional speech acts. In dialog act analysis, there are two primary utterance characterizations: forward-communicative, describing the effect on the subsequent dialogue and interaction, and backward-communicative, describing how the current utterance relates to previous discourse.

In emails, responses often contain a mixture of speech acts, e.g., answers and comments to a sender's email as well as additional questions for the sender. We found it necessary to expand the backward communicative function of speech acts to allow for this additional forward-communicative function. This resulted in the forward-backward communicative category (response with expectation of reply). We found when annotating emails for overall intent that it is often very difficult to ascertain the primary intent of an email containing both answers and questions, and adding this category improved inter-annotator reliability. If an email were allowed multiple categories or were annotated section by section (or possibly paragraph by paragraph), there might not have been this need.
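Inter-annotator agreement of the kind discussed here is conventionally measured with Cohen's kappa [11], which corrects raw agreement for chance. A minimal standard-library sketch of the statistic (the label lists below are toy examples, not the study's annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy subcategory labels for six emails from two annotators.
a = ["T4", "R1", "F1", "A1", "T4", "I2"]
b = ["T4", "R1", "F1", "A2", "T4", "I2"]
print(round(cohens_kappa(a, b), 2))  # → 0.79
```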

Table 1: Comparison of Traditional Speech Act Categories by Author.

SA Category   | Austin     | Vendler       | Bach & Harnish         | Description
Assertive     | Expositive | Expositive    | Constative             | expound views: state, contend, insist, deny, remind, guess
Commissive    | Commissive | Commissive    | Commissive             | commit the speaker: promise, guarantee, refuse, decline
Behabitive    | Behabitive | Behabitive    | Interpersonal          | reaction to others: thank, congratulate, criticize
Interrogative |            | Interrogative | Directive/Query        | ask, question
Exercitive    | Exercitive | Exercitive    | Directive/Request      | exercise power, rights or influences: order, request, beg
Verdictive    | Verdictive |               | Verdictive & Operative | giving a verdict: rank, grade, define, call, analyze
[Figure 1 is a flowchart for selecting the primary email act (S = sender, R = recipient). Decision questions (Are S and R distinct? Is S an individual? Is the primary intent of S to forward another document? Is S responding to R's communication? Is S providing info in response to an explicit request? Does S want R to do something? Is S offering or committing to do something? Does S make a clear statement or give info about the world? Does S express his/her feelings? Does the message perform the act, e.g., "You're fired"?) lead to the subcategories: S (Self); N (Non-personal); T1 (includes a request for response), T2 (attachment is a response), T3 (explanation included), T4 (simple FYI), T5 (re-transmission); I1 (directive for info), I2 (request info); D1 (directive), D2 (open option); C1 (commit), C2 (offer); A1 (statement), A2 (opinion); B1 (apology), B2 (thank you), B3 (other); V1 (verdictive); O1 (other forward function); and, for responses, the backward communicative functions R1-R5 (no reply expected) and F1-F5 (reply expected). Code by traversing the flowchart in the order shown, with a single exception (noted in the Annotation Guidelines). Within categories (e.g., D1, D2), code with the first classification that applies to the email; for example, an email containing both a thank you and an apology would be coded as B1.]

Figure 1: Email Speech Acts. 12 main categories, 30 subcategories consisting of 23 traditional speech acts and 7 email-specific acts. (In the original figure, gray shading indicates a backward communicative function.)
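The traversal-with-precedence rule can be read as an ordered rule list: walk the questions in order and emit the first code that applies. A simplified sketch (the predicate keys are hypothetical stand-ins for an annotator's judgments, not the full guidelines, and the subcategory refinements are omitted):

```python
# Ordered coding rules; the first predicate that fires determines the code,
# mirroring "code with the first classification that applies to the email".
RULES = [
    ("S",   lambda e: e.get("sender") == e.get("recipient")),    # email to self
    ("N",   lambda e: not e.get("sender_is_individual", True)),  # non-personal
    ("T",   lambda e: e.get("intent_is_forwarding", False)),     # transmissive
    ("R",   lambda e: e.get("is_response", False) and not e.get("expects_reply", False)),
    ("F",   lambda e: e.get("is_response", False) and e.get("expects_reply", False)),
    ("I&D", lambda e: e.get("requests_info_or_action", False)),
    ("C",   lambda e: e.get("offers_or_commits", False)),
    ("A",   lambda e: e.get("states_info_or_opinion", False)),
    ("B",   lambda e: e.get("expresses_feelings", False)),
    ("V",   lambda e: e.get("performs_act", False)),
]

def code_email(email):
    for code, applies in RULES:
        if applies(email):
            return code
    return "O"  # other forward function

# A dead-end reply from one individual to another is coded R.
reply = {"sender": "a@x.org", "recipient": "b@y.org",
         "sender_is_individual": True,
         "is_response": True, "expects_reply": False}
print(code_email(reply))  # → R
```

Placing R and F ahead of the initiating codes reflects the guideline (Appendix A) that backward functions take precedence in annotation.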

We also added three email-specific categories: T (Transmissives), in which the sender's intent is to forward information to a recipient; S (Self), sending email to oneself (reminders); and N (Non-personal), emails such as newsletters or items from list servers.

We suggest that email, unique in its form, be categorized by several new genres (Table 2). Some of these genres correspond to recognized written and spoken genres [7], e.g., email conversations to telephone and face-to-face conversations. Others are novel and peculiar to email, such as spam, which includes advertising and phishing. We believe that email conversations are a distinct genre with two specific forms: one includes explicit threads and the other does not. Email conversations have content-based subgenres that can best be characterized by the primary intent of the email, i.e., the primary email "speech" act. Email conversations can be formal or informal, which might necessitate additional genres.

The flowchart of Figure 1 describes how to select the primary discourse intent of the email. Although this intent does not distinguish all the genres listed in Table 2, we believe that additional features would allow us to distinguish between cases in which there are multiple suggested genres. Appendix A provides details of the annotation methodology, including precedence among the 30 subcategories of email acts, 23 of which are speech acts (S, N and T are not speech acts).

Table 2: The 12 main email acts and their corresponding genres.

Category                        | Example                          | Suggested genres
(S) Self                        | Emails to self                   | Email reminders/notes
(N) Non-personal                | Bulk emails                      | Spam (advertising, phishing, etc.), e-newsletters
(T) Transmissives               | Forwarding documents             | Digital cover letter/memo with attachments
(R) Responses                   | Provide info to question         | Email conversations
(F) Response w/ forward function| Provide info to question and ask questions | Email conversations
(I) Info request                | Asks for information             | Email conversations
(D) Directive                   | Ask someone to do something      | Email conversations
(C) Commits                     | Commit/offer to do something     | Email conversations
(A) Assertions                  | Make statements/state opinions   | Email conversations
(B) Behabitive                  | Express feelings                 | Email conversations
(V) Verdictive                  | Statement accomplishes the act, e.g., paper notifications | Official digital documents, digital letters
(O) Other                       | Hellos, introductions            | Email conversations

3. Corpus

We prepared a set of approximately 280 randomly selected, redacted emails from the authors' personal email collections. The first 160 emails were used to develop and refine the annotation guidelines. One hundred emails were used to test inter-annotator agreement. Using the kappa statistic [11], we obtain a kappa of .89 for the 30 subcategories, indicating high agreement between the two annotators.

For such a small set of emails, it was difficult to obtain enough samples for each category. (This set is part of a larger annotation project, which will eventually result in sufficient samples in each category.) Accordingly, we merged all subcategories into main categories and eliminated some categories with very small samples, namely S, N, B, V, O. Commits were only found in responses and were annotated as R2 or F2 (see Figure 1). D and I (Directives and Information requests) were merged into one category, I&D. We then supplemented any deficient main categories by hand-selecting samples in the resulting five categories. The result was 50-56 emails for each of the five main categories: transmissives (T), requests, either directives or information requests (I&D), assertions (A), responses (R), and responses with an expectation of reply (F).

4. Features

To detect the overall intent of the email, each email message is represented by a set of features. The focus is only on the text that the sender has written; as such, we have implemented methods to eliminate signature blocks and included text. We investigate two different types of feature sets: one based on verbs and the other based on email-specific characteristics.

4.1 Verbs

Levin [22] shows, for a large set of English verbs (about 3200), the correlations between the semantics of verbs and their syntactic behavior and the interpretation of their arguments. From these correlations, she defines classes of verbs. Each verb class has subclasses; using the most general class results in 48 total classes. These general classes of verbs have been reformulated by Dorr [17] in the Lexical Conceptual Structure (LCS) Database by using alternations, resulting in 29 additional classes, for a total of 77 classes [23]. Each class is used as a single feature in the feature set. From the email text, we count the stemmed verbs using a simple string-matching algorithm against the verbs in LCS, supplemented with the expansions of irregular verbs [10]. We use two methods of computing counts.
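A minimal sketch of this per-class counting, with a toy three-class lexicon and a trivial suffix-stripping stemmer standing in for the LCS classes and a real stemmer: a verb found in several classes splits its single unit of credit among them, and totals are normalized by the sender's word count.

```python
# Toy verb-class lexicon (hypothetical stand-ins for LCS classes).
LEXICON = {
    "send":    {"send", "mail", "forward", "ship"},
    "request": {"ask", "request", "beg"},
    "state":   {"say", "state", "claim", "report"},
}

def stem(word):
    # Trivial suffix stripper; a real system would use a proper stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def verb_class_features(text):
    words = text.lower().split()
    counts = {c: 0.0 for c in LEXICON}
    for w in words:
        s = stem(w)
        classes = [c for c, verbs in LEXICON.items() if s in verbs]
        # A matched verb contributes one unit, split across its classes.
        for c in classes:
            counts[c] += 1.0 / len(classes)
    n = len(words) or 1  # normalize by the sender's word count
    return {c: counts[c] / n for c in counts}

feats = verb_class_features("please send the report you requested")
print({c: round(v, 2) for c, v in feats.items()})
# → {'send': 0.17, 'request': 0.17, 'state': 0.17}
```

Note that "report" matches a verb stem here even though it is used as a noun, which is exactly the noun/verb ambiguity the part-of-speech tagging variant addresses.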

One method matches stemmed words in the text to the verbs in the verb classes; the other uses the TnT part-of-speech tagger [8] to verify that the part of speech is indeed a verb (since noun roots can appear identical to verb roots). Membership in each of the LCS verb classes is one feature of the feature vector. We make a pass through the text, identifying verbs (by comparing a stemmed word in the text to a stemmed version of the verbs in LCS). If a verb is in multiple classes, each class is proportionally incremented so that each verb contributes one unit. The final counts are normalized by the word count of the sender's body of the email (which excludes punctuation, the signature block, text included from another email, etc.).

Ballmer and Brennenstuhl (BB) [4] present a classification of speech acts based on linguistic activities and aspects. These were originally devised for German verbs and then translated into English, resulting in many multi-word "verbs" in the 600 categories and 24 classes. As a first pass, all multi-word entries in the BB classes were ignored. As with LCS, each BB class was supplemented by irregular tenses of verbs and each class is used as a single feature in the feature set, resulting in 24 features. Verb counts for the features were computed using the same process as for LCS.

4.2 Other Characteristics

Certain types of email contain features that indicate particular communicative intents. These features can be presentation-oriented (which include features found in header information or in punctuation) or text-oriented. For example, the presence of "Re:" in the subject line usually indicates a response, although sometimes users will shift topics or introduce new topics and not change the subject line. Similarly, "Fwd:" often signals a transmissive intent. The presence of interrogative sentences (detected by the presence of question marks in the body of the email) can identify an information request.

In addition, certain words can indicate various email speech acts. For example, "Thanks" and its variants can indicate an acknowledgement. "Attached", "Enclosed", and "Here is" signal the presence of an attachment, whose name may appear in the header or in the body of the text depending on the email software.

Table 3 summarizes an initial list of additional email features, based on form and content, to assist in genre identification (refer to Table 2). All features are either binary or were normalized over the document length (word count or sentence count). All words/phrases that were used are listed in the table. In the future, we plan to do further analysis to expand this list.

Table 3: List of 16 email speech act features.

EMAIL CHARACTERISTICS FEATURES (EF)
- Presence of Re:
- Presence of Fwd:
- Attachment signified in header info or by an insertion in text body
- Fraction of interrogative sentences (sentences ending in '?' / total sentences)
- Fraction of "I" or "we" (count of words / total word count)
- Fraction of "you" (count of words / total word count)
- Attachment indicators such as "attached", "here is", "enclosed"
- Apology indicators such as "sorry", "apology", "apologies"
- Opinion indicators: "think", "feel", "believe", "opinion", "comment"
- Politeness indicators such as "please"
- Gratitude indicators such as "thank"
- Action indicators such as "can you", "would you"
- Commitment indicators such as "I can", "I will"
- Information indicators such as "information", "info", "send"
- Auto-reply indicators such as "out of the office", "away"
- Email length

5. Classifiers

Two different classifiers were used in our experiments: Support Vector Machines [19] and Random Forests [9]. We used SVM-light [32] with a radial basis function and the default settings. SVM-light builds binary models, so in our case, with five classes, a model must be produced for each class.

In contrast, Random Forests [28] grows many classification trees and each tree gives a vote for a particular class. The forest chooses the classification having the most votes. We use 100 trees in our experiments. Random Forests takes far less training time than SVM-light.

6. Experiments

The experiments on the data described in Section 4 were run using the ten-fold cross-validation method. This splits the data into training and test sets with a 90% training, 10% testing portion. Experiments are repeated ten times, so that all the data is used in both training and testing, but not at the same time. We compared classifier performance using the LCS verb features, the BB verb features (Section 4.1), and the email characteristics feature set EF (Table 3), as well as combinations of these feature sets. The results for the Random Forests classifier are presented in Table 4. Note that in experiments where a classification is required, i.e., there is no "unknown", Recall (percentage of emails correctly classified), Precision (percentage of classifications that were correct) and F1 (the harmonic mean of recall and precision, F1 = 2*R*P / (R + P)) are all equal.
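The ten-fold protocol can be sketched as an index-splitting routine (a generic illustration of the procedure, not the authors' code; SVM-light and the Random Forests implementation they used are separate packages [28] [32]):

```python
import random

def ten_fold_splits(n_items, seed=0):
    """Yield (train, test) index lists for ten-fold cross-validation:
    each run trains on ~90% of the items and tests on the held-out ~10%."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]  # ten roughly equal folds
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Across the ten runs, every email lands in exactly one test fold.
n = 265  # roughly the size of the five-category sample set
seen = [i for _, test in ten_fold_splits(n) for i in test]
print(sorted(seen) == list(range(n)))  # → True
```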

Table 4: Results (precision) of the Random Forests classifier for identifying the five email act classes (T, I&D, A, R, F) without and with part-of-speech tagging (TnT).

           No TnT   TnT
EF         .57      N/A
BB         .38      .30
BB+EF      .63      .52
LCS        .35      .34
LCS+EF     .57      .54

Table 4 shows that the email characteristics feature set (16 features) does very well (.57), outperforming both LCS and BB (.35 and .38, respectively). This is not a surprise, since many of the features give clear indications of the appropriate category, such as Re: for a response. Adding the verb features to the email characteristics feature set slightly increases performance, for both verb feature sets. BB+EF results in the best feature set combination, with a precision of .63. TnT decreases performance when used in combination with BB, but not when combined with LCS. In the future we hope to determine what characteristics of the verb lexicons are contributing to such results.

We also compared Random Forests and SVM-light on the LCS feature set (Table 5). The results indicate that these two classifiers are often very close in performance, a result we have seen in other unpublished experiments on different data sets. The one exception to this was LCS alone.

Table 5: Results (precision) of Random Forests compared to SVM-light for LCS on the five email act classes (T, I&D, A, R, F).

             SVM-light   Random Forests
LCS          .28         .35
LCS+TnT      .35         .36
LCS+EF       .59         .57
LCS+EF+TnT   .55         .54

The confusion matrix for Random Forests on the email characteristics feature set EF is displayed in Table 6. Tables 7 and 8 show the matrices for the verb feature sets BB and LCS, respectively. The results in Table 9 clearly indicate the improved performance of the combination of BB and the email feature set EF.

From Tables 6, 7 and 8 we can see that using BB alone or LCS alone results in higher classification accuracy for Directives (I&D) and Assertions (A) than just EF. This indicates that we would either need to expand our email characteristics set to include more distinguishing features for I&D and A, or use verb classes of the type found in BB to assist in such characterizations. However, using the verb classes of BB combined with EF resulted in a decrease in performance for Responses with expectation of reply; we need to investigate why. Only for Assertions (A) do both feature sets LCS and BB perform clearly better than EF. Both verb lexicons have difficulty with responses (both R and F). LCS outperformed BB only for responses (R). The intent of LCS is to classify verbs by analyzing the grammatical structure in which they appear; we did not grammatically parse the text of the email, and this may have contributed to the poor functioning of LCS.

Table 6: Confusion matrix for EF (Random Forests) – 16 features.

Class   # Items   T     R     F     I&D   A
T       55        81%   2%    6%    7%    3%
R       56        11%   67%   15%   2%    4%
F       56        3%    11%   83%   2%    2%
I&D     53        14%   6%    33%   27%   21%
A       50        29%   11%   11%   22%   28%

Table 7: Confusion matrix for BB only, no TnT (Random Forests) – 24 features.

Class   # Items   T     R     F     I&D   A
T       55        68%   5%    10%   8%    8%
R       56        38%   10%   19%   17%   16%
F       56        14%   14%   32%   24%   16%
I&D     53        8%    14%   26%   31%   21%
A       50        9%    8%    18%   18%   48%

Table 8: Confusion matrix for LCS only, no TnT (Random Forests) – 77 features.

Class   # Items   T     R     F     I&D   A
T       55        62%   13%   6%    12%   7%
R       56        27%   19%   18%   19%   16%
F       56        17%   22%   18%   25%   18%
I&D     53        16%   18%   24%   32%   10%
A       50        12%   13%   12%   18%   44%

Table 9: Confusion matrix for BB+EF, no TnT (Random Forests) – 40 features.

Class   # Items   T     R     F     I&D   A
T       55        82%   2%    4%    8%    4%
R       56        12%   65%   14%   2%    8%
F       56        5%    14%   71%   7%    3%
I&D     53        8%    9%    18%   43%   22%
A       50        9%    11%   4%    23%   53%
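The row-percentage format of Tables 6-9 can be reproduced from raw predictions as follows (toy labels for illustration, not the paper's data):

```python
from collections import defaultdict

CLASSES = ["T", "R", "F", "I&D", "A"]

def confusion_rows(gold, pred):
    """Row-normalized confusion matrix: for each gold class, the percentage
    of its emails assigned to each predicted class (the format of Tables 6-9)."""
    counts = defaultdict(lambda: defaultdict(int))
    for g, p in zip(gold, pred):
        counts[g][p] += 1
    rows = {}
    for g in CLASSES:
        total = sum(counts[g].values()) or 1
        rows[g] = {p: round(100 * counts[g][p] / total) for p in CLASSES}
    return rows

# Toy example in which R emails are often mislabeled F.
gold = ["T", "T", "R", "R", "R", "F", "A"]
pred = ["T", "T", "R", "F", "F", "F", "A"]
print(confusion_rows(gold, pred)["R"])
# → {'T': 0, 'R': 33, 'F': 67, 'I&D': 0, 'A': 0}
```

Because every email is forced into exactly one of the five classes, the overall accuracy read off such a matrix is the same number reported in Section 6 as recall, precision, and F1.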

For the email feature set (EF), Responses with expectation of reply (F) are also often confused with Responses (R), and vice versa, a result we might expect since both classes involve backward communicative functions. We assume that the use of question marks as a feature in EF helps to distinguish between these two classes, as compared to LCS and BB. For the verb features only (BB and LCS), Responses (R) are most often misidentified as Transmissives (T), and Responses with an expectation of reply (F) are most often confused with I&D (requests and directives).

Overall, the Transmissive category (T) is the easiest to identify. We believe this is due to the fact that the form and content of this genre often clearly identify the category. The remaining four categories constitute the email conversation genre.

7. Conclusion & Future Directions

We have shown that verb classes as a feature set, combined with an email-specific characteristics set, can give reasonable classification performance for five email act categories and two genres. In the future, we want to investigate methods to improve our performance results as well as develop methods to distinguish the various genres listed for each email act category in Table 2. We would like to incorporate grammatical structure in identifying LCS class membership, and to fold in the multi-word "verbs" of BB and/or replace multi-word verbs with equivalent single words (e.g., "anger" to replace "make angry") to investigate whether this would improve results. We would also like to compare these results for verb classification to those obtained by using VerbNet [37]. We also plan to collect and annotate more email in the specific categories to be able to test the classifiers on all the subcategories of Figure 1.

We believe our findings support the characterization of email as an amalgam of unique communicative genres, where the common genre, email conversations, is most similar to spoken communication. The methods we employed, built upon methods used for the analysis of dialog, are practical and produce meaningful categorizations of email. We hope our research spurs additional work to understand the form and content, the genre, of email.

8. Acknowledgments

We wish to thank Bonnie Dorr for her thoughtful contributions to this paper. We also thank Gary Ciany and Pat Sponaugle for their software contributions.

9. References

[1] Allen, J. and Core, M., Draft of DAMSL: Dialog Act Markup in Several Layers, 1997.
[2] Austin, J. L., How to Do Things with Words, Harvard University Press, Boston, MA, 1962.
[3] Bach, K. and Harnish, R. M., Linguistic Communication and Speech Acts, The MIT Press, Cambridge, MA, 1979.
[4] Ballmer, T. T. and Brennenstuhl, W., Speech Act Classification: A Study in the Lexical Analysis of English Speech Acts, Springer-Verlag, Berlin, 1981.
[5] Baron, N. S., Alphabet to Email: How Written English Evolved and Where It's Heading, Routledge, London, 2000.
[6] Baron, N. S., "Why Email Looks Like Speech," in New Media Language, Aitchison, J. and Lewis, D. (eds.), Routledge, London, 2003.
[7] Biber, D., Variation across Speech and Writing, Cambridge University Press, Cambridge, UK, 1988.
[8] Brants, T., "TnT: A Statistical Part-of-Speech Tagger," Proceedings of ANLP-2000, Seattle, WA, 2000.
[9] Breiman, L., Random Forests, U.C. Berkeley Technical Report for Version 3, Berkeley, CA, 2001.
[10] Byrd, P., "Irregular verbs," http://www2.gsu.edu/~wwwesl/egw/verbs.htm, 2005.
[11] Cohen, J., "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 1960, 37-46.
[12] Cohen, W. W., Carvalho, V. R., and Mitchell, T. M., "Learning to Classify Email into 'Speech Acts'," Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2004), 2004.
[13] Collot, M. and Belmore, N., "Electronic Language: A New Variety of English," in Computer-Mediated Communication, Herring, S. C. (ed.), John Benjamins, Amsterdam, 1996.
[14] Corston-Oliver, S., Ringger, E., Gamon, M., and Campbell, R., "Task-focused summarization of email," Proceedings of the Text Summarization Branches Out Workshop, ACL 2004, 2004.
[15] Crystal, D., Language and the Internet, Cambridge University Press, Cambridge, UK, 2001.
[16] Diesner, J. and Carley, K. M., "Exploration of Communication Networks from the Enron Email Corpus," Proceedings of the Workshop on Link Analysis, Counterterrorism, and Security, Newport Beach, CA, April 2005.
[17] Dorr, B. and Jones, D., "Acquisition of semantic lexicons: Using word sense disambiguation to improve precision," Proceedings of the SIGLEX Workshop on Breadth and Depth of Semantic Lexicons, Santa Cruz, CA, 1996.
[18] Erikson, T., "Making Sense of Computer-Mediated Communication (CMC): Conversations as Genres," Proceedings of the Hawaii International Conference on System Sciences (HICSS 2000), 2000.
[19] Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the European Conference on Machine Learning (ECML), 1998.
[20] Jurafsky, D., Shriberg, E., and Biasca, D., Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual, Draft 13, Institute of Cognitive Science Tech Report 97-02, University of Colorado, Boulder, CO, 1997.
[21] Klimt, B. and Yang, Y., "The Enron Corpus: A New Dataset for Email Classification Research," Proceedings of the European Conference on Machine Learning (ECML), 2004.
[22] Levin, B., English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, Chicago, IL, 1993.
[23] LCS Database web pages: www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html
[24] Myka, "Postings on a Genre of Email," in Genre and Writing, Bishop, W. and Ostrom, H. (eds.), Boynton/Cook-Heinemann, Portsmouth, NH, 1997.
[25] Nenkova, A. and Bagga, A., "Email Classification for Contact Centers," Proceedings of the 2003 ACM Symposium on Applied Computing, 2003, 789-792.
[26] Olesker, M., "E-mails show Steffen not 'irrelevant,' 'mid-level'," Baltimore Sunpaper, Baltimore, MD, Mar 14, 2005.
[27] Pew Internet and American Life Project, 2005. http://www.pewinternet.org
[28] Random Forests web pages: www.stat.berkeley.edu/users/breiman/RandomForests
[29] Schneider, K.-M., "A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering," Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, Hungary, 2003.
[30] Shepherd, M. and Watters, C., "The Functionality Attribute of Cybergenres," Proceedings of the 32nd Hawaii International Conference on System Sciences (HICSS 1999), 1999.
[31] Stolcke, A., et al., "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, 26(3), 2000, 339-373.
[32] SVM-light web pages: http://www-ai.cs.uni-dortmund.de/SOFTWARE/SVM_LIGHT/svm_light.html
[33] Taylor, P., "Social Epistemic Rhetoric and Chaotic Discourse," in Re-Imagining Computers and Composition, Hawisher, G. and LeBlanc, P. (eds.), Boynton/Cook-Heinemann, Portsmouth, NH, 1992.
[34] TnT web pages: www.coli.uni-saarland.de/~thorsten/TnT
[35] Turenne, N., "Learning Semantic Classes for Improving Email Classification," Proceedings of the Text Mining and Link Analysis Workshop, 2003.
[36] Vendler, Z., Res Cogitans, Cornell University Press, Ithaca, NY, 1972.
[37] VerbNet web pages: www.cis.upenn.edu/~mpalmer/project_pages/VerbNet.htm

Appendix A: Annotation Guidelines

Flowchart Question | Clarification on answering "Yes" | Annotate as

Is this email from S to S? | A tickler or reminder with some information content. Attachments forwarded to oneself should be coded as T4. | S
Is S an individual? | If S is a business, listserv, or some other institution. | N
Is the primary intent of S to forward another document? | S may add some minimal explanation, but not enough to warrant a separate email. The value is in the attachment(s). |
 | S is sending info and asks for a response. | T1
 | S is sending info as a response to a request from R. | T2
 | S includes a short explanation (not just a list of topics). | T3
 | S is sending information "FYI". | T4
 | S is re-transmitting after a failed attempt ("Here it is again.") or with a correction ("Here's the corrected version."). | T5

To enable social network analysis, give precedence in annotating to backwards functions, i.e., if an Rn or Fn is encountered, code the entire email as Rn or Fn.

Is S responding to R's communication? | R may have used email or another method originally. The presence of "Re:" in the subject may indicate a response. The request from R must be explicitly evident: "As you asked, …" |

In these "backward function" annotations, we differentiate between cases where a response is expected, i.e., the email includes a forward function (Fn), and cases where none is expected (Rn), a dead-end response. The forward function should be explicitly stated and may be a request for information, a statement, or a directive aimed at the recipient, who is expected to reply. Open-ended invitations to reply ("Let me know if you need more") should be coded as Rn.
Is S providing info in response to an explicit request?¹ | S is providing an answer to a question or providing info that R requested: "The sales figures are…"; consists of statements of fact. The refusal to provide info or a declaration of lack of knowledge should be coded as R5. | R1, F1
Is S offering or committing to do something?¹ | S is offering or committing to do something, including providing additional information or attending a meeting: "I'll get the data to you by Friday." Include here statements of S's already completed actions in response to R's initiative: "I've sent that report this AM." Include conditional commitments: "If you send the report, I'll go." | R2, F2
Is S agreeing or disagreeing? | S may give a "yea," "nay," or a mixed response or commentary to R's statement(s), idea(s), directive(s), or commitment(s); "on topic". Includes corrections. | R3, F3
Is S suggesting R take some other action? | Include directions that R take some action (other than a commitment previously proposed by R): "See HR about that." "You need to send the report to…" Include suggestions that R provide additional info: "Please send me the data to which you're referring." | R4, F4
Some other response? | Includes simple acknowledgement, thanks. Include emotional responses and "chit-chat." | R5, F5

(S is initiating dialogue)

Code as the forward speech act first encountered in the series of questions. This gives "precedence" to those speech acts that require action or response from the recipient.
Does S want R to do something? | S wants the dialog to result in R's doing something, verbally or otherwise. |
Perform some action other than provide info? | S tells R to do something other than assembling and transmitting data: "Please send this to all committee members." "Please clean the labs." | D1
 | S asks R to do something and R can reject the suggestion: "Would you send me the book?" "Can you come…" Verification from R is expected. | D2
Request information? | A directive to provide information. The form of the information is assumed to be unknown to the sender: requests for specific reports or documents (containing desired information) should be coded as D1 or D2. "Send me the figures." | I1
 | Answer a simple question or provide other information, including feedback (comments): "Who should we invite?", "Can you tell me what is required?", "What do you think about this?" | I2
Does S offer or commit to do something? | Offer (may or may not): "Would you like me to…", "I can…" Include conditional commitments: "If you send me the report, I'll go to the meeting." | C1
 | Commit (definitely will): "I'll send it tonight", "I'll attend." | C2
Does S make a clear statement or give info about the world? | S is making statements of fact. | A1
 | S is stating his/her opinions. | A2
Does S express his/her feelings? | "I'm sorry…" | B1
 | "Thanks so much." | B2
 | "I'm exhausted." "Welcome." "It's a shame." | B3
Does the message perform the act? | The statement of the words actually accomplishes an act: "You are awarded the contract", "You're guilty". Probably very rare in email. | V1
Another forward function | Includes general friendly hellos, introductions. | O1

¹ Exception to the chronology of annotation implicit in the flowchart: in cases where a response includes both information (R1 or F1) AND an offer or commitment (R2 or F2), code the entire email according to the act that appeared first in the email.
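The flowchart and its precedence rules (backwards functions outrank forward-only codes; among forward acts, the first question answered "Yes" wins) can be read as a deterministic decision procedure. The following is a minimal sketch only, assuming a hypothetical `email` dictionary: in the study itself every question is answered by a human annotator, not by field lookups.

```python
# Hypothetical sketch of the Appendix A decision procedure. All field names
# on the `email` dict are our own stand-ins for annotator judgments.

# Forward speech acts in the order the flowchart asks about them; the first
# one present "wins", giving precedence to acts requiring a response.
FORWARD_ORDER = ["D1", "D2", "I1", "I2", "C1", "C2",
                 "A1", "A2", "B1", "B2", "B3", "V1", "O1"]

def annotate(email):
    """Return one code per email, honoring the guideline's precedence rules."""
    # Self-addressed mail and non-individual senders short-circuit everything.
    if email["sender"] == email["recipient"]:
        return "S"
    if not email["sender_is_individual"]:
        return "N"
    # Primary intent is forwarding a document: one of the T1-T5 genres.
    if email.get("forward_type"):
        return email["forward_type"]
    # Backwards functions (Rn/Fn) take precedence over forward-only codes.
    # Per footnote 1, acts are listed in order of appearance in the email,
    # so the first listed act decides the code.
    backward = email.get("backward_acts", [])
    if backward:
        return backward[0]
    # Otherwise: first forward act encountered in the question series.
    present = set(email.get("forward_acts", []))
    for code in FORWARD_ORDER:
        if code in present:
            return code
    return "O1"  # some other forward function (hellos, introductions)
```

For example, an email containing both a statement (A1) and a commitment (C2) is coded C2, because the commit question precedes the statement question in the flowchart.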

Appendix B: Email Classes as Related to Shallow Discourse Annotation and Speech Acts

Email Category | Dialogue Act (DA) Modeling | DAMSL | Switchboard SWBD-DAMSL Annotation | Austin | Vendler | Bach&Harnish

Forward-Communicative Function
 | Statements | Statement-Assert | | Expositive | Expositive | Constative
 | | Statement-Reassert | | " | " | "
 | | Statement-Other | | " | " | "
A1 Statement | | | Statement-non-opinion | " | " | "
A2 Opinion | | | Statement-opinion | " | " | "
Influencing-addressee-future-action
I1 & I2 | For information | Info-request | | Exercitive | Interrog. | Directive-Query
I1 & I2 | Yes/No Question | | Yes/No Question | " | " | "
 | Wh-question | | Wh-question | " | " | "
 | | | Declarative Wh-Question | " | " | "
 | Open-Question | | Open-Question | " | " | "
 | | | Or-Question | " | " | "
 | Or-clause | | Or-clause | " | " | "
 | Declarative Yes-No-Question | | Declarative-Question | " | " | "
 | Tag-Question | | Tag-Question | " | " | "
 | Rhetorical questions | | Rhetorical questions | | |
 | For action | Action-Directive | | " | Exercitive | Directive-Request
D2 | Open-option | | Open-option | " | " | "
D1 | Action-directive | | Action-directive | " | " | "
Committing-speaker-future-action
C1 Offer | Offer | | | Commissive | Commissive | Commissive
C2 Commit | Commit | | | " | " | "
 | Offers, Options, & Commits | | | " | " | "
Other | Conv-opening | Conv-opening | Conv-opening | | |
 | Conv-closing | Conv-closing | Conv-closing | | |
V1 Explicit-performance | | | | Verdictive | Verdictive |
B2 Thanking | Thanking/You're welcome | | | Behabitive | Behabitive | Interpersonal
B1 Apology | Apology | | | " | " | "
B3 Exclamation | Exclamation | | | " | " | "
O1 Other | Other-forward-funct | | Other-forward-function | | |

Backwards-Communicative Function
R3, F3 | Agreement | | | Assertive | Expositive | Constative
 | Agreement/Accept | Accept | Accept | " | " | "
 | | Accept-part | Accept-part | " | " | "
 | Maybe/Accept-Part | Maybe | Maybe | " | " | "
 | | Reject-part | Reject-part | " | " | "
 | Reject | Reject | Reject | " | " | "
 | Hold before answering/agreement | Hold before answering | Hold before answering/agreement | " | " | "
 | | | Dispreferred answer | " | " | "
Understanding
R3, F3, R5, F5 | Signal-non-understanding | Signal-non-understanding | Signal-non-understanding | " | " | "
R3, F3, R5, F5 | Signal-understanding | | | " | " | "
 | | Backchannel/Ack., Acknowledge | | | |
 | | Backchannel/Question | | | |
R5 & F5 | Response Ack. | | Ack-answer | | |
 | | Repeat-rephrase | Repeat-rephrase | | |
 | Collaborative Completion | Completion | Completion | | |
R3 & F3 | Summarize/reformulate | | Summarize/reformulate | Assertive | Expositive | Constative
 | Appreciation | | Appreciation | Behabitive | Behabitive | Interpersonal
 | | | Sympathy | " | " | "
 | Downplayer | | Downplayer | " | " | "
 | Correct-misspeaking | | Correct-misspeaking | Assertive | Expositive | Constative
 | | | Repeat phrase | | |
Answer
R1 & F1 | Yes Answers | Answer | Yes-Answer | Assertive | Expositive | Constative
R1 & F1 | No Answers | " | No-Answer | " | " | "
 | Affirm non-yes answers | | Affirm non-yes answers | " | " | "
 | Neg non-no answers | | Neg non-no answers | " | " | "
R2, F2, R4, F4 | Other answers | | Other answers | " | " | "
 | | | No plus expansion | " | " | "
R5 & F5 (Other backwards) | | | Yes plus expansion | " | " | "
 | | | Statement expanding y/n answer | " | " | "
 | | | Expansions of y/n answers | " | " | "
 | | | Dispreferred answers | " | " | "
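For programmatic use, part of the mapping above can be transcribed as a lookup from email class to the classical speech-act taxonomies. This is a sketch of our own: the dict layout is a convenience, only a subset of rows is shown, and abbreviated labels are expanded (e.g. "Interrogative" for "Interrog.").

```python
# A partial transcription of the Appendix B mapping as a lookup table, so a
# predicted email class can be related back to the speech-act taxonomies of
# Austin, Vendler, and Bach & Harnish. The dict itself is our convenience,
# not an artifact of the paper.

SPEECH_ACT_MAP = {
    # email class: (Austin, Vendler, Bach & Harnish)
    "A1": ("Expositive", "Expositive", "Constative"),       # Statement
    "A2": ("Expositive", "Expositive", "Constative"),       # Opinion
    "I1": ("Exercitive", "Interrogative", "Directive-Query"),
    "I2": ("Exercitive", "Interrogative", "Directive-Query"),
    "D1": ("Exercitive", "Exercitive", "Directive-Request"),
    "D2": ("Exercitive", "Exercitive", "Directive-Request"),
    "C1": ("Commissive", "Commissive", "Commissive"),       # Offer
    "C2": ("Commissive", "Commissive", "Commissive"),       # Commit
    "B1": ("Behabitive", "Behabitive", "Interpersonal"),    # Apology
    "B2": ("Behabitive", "Behabitive", "Interpersonal"),    # Thanking
    "B3": ("Behabitive", "Behabitive", "Interpersonal"),    # Exclamation
    "V1": ("Verdictive", "Verdictive", None),               # Explicit performative
}

def austin_class(email_code):
    """Map an email class code to its Austin speech-act category."""
    return SPEECH_ACT_MAP[email_code][0]
```

A lookup like `austin_class("C2")` then recovers "Commissive", tying an email-level commit back to Austin's taxonomy.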
