USING e-ANNOTATION TOOLS FOR ELECTRONIC PROOF CORRECTION

Once you have Acrobat Reader open on your computer, click on the Comment tab at the right of the toolbar. This will open up a panel down the right side of the document. The majority of tools you will use for annotating your proof will be in the Annotations section, pictured opposite. We've picked out some of these tools below:

1. Replace (Ins) Tool – for replacing text.
Strikes a line through text and opens up a text box where replacement text can be entered.
How to use it:
• Highlight a word or sentence.
• Click on the Replace (Ins) icon in the Annotations section.
• Type the replacement text into the blue box that appears.

2. Strikethrough (Del) Tool – for deleting text.
Strikes a red line through text that is to be deleted.
How to use it:
• Highlight a word or sentence.
• Click on the Strikethrough (Del) icon in the Annotations section.

3. Add note to text Tool – for highlighting a section to be changed to bold or italic.
Highlights text in yellow and opens up a text box where comments can be entered.
How to use it:
• Highlight the relevant section of text.
• Click on the Add note to text icon in the Annotations section.
• Type instruction on what should be changed regarding the text into the yellow box that appears.

4. Add sticky note Tool – for making notes at specific points in the text.
Marks a point in the proof where a comment needs to be highlighted.
How to use it:
• Click on the Add sticky note icon in the Annotations section.
• Click at the point in the proof where the comment should be inserted.
• Type the comment into the yellow box that appears.

5. Attach File Tool – for inserting large amounts of text or replacement figures.
Inserts an icon linking to the attached file in the appropriate place in the text.
How to use it:
• Click on the Attach File icon in the Annotations section.
• Click on the proof where you'd like the attached file to be linked.
• Select the file to be attached from your computer or network.
• Select the colour and type of icon that will appear in the proof. Click OK.

6. Drawing Markups Tools – for drawing shapes, lines and freeform annotations on proofs and commenting on these marks.
Allows shapes, lines and freeform annotations to be drawn on proofs and for comments to be made on these marks.
How to use it:
• Click on one of the shapes in the Drawing Markups section.
• Click on the proof at the relevant point and draw the selected shape with the cursor.
• To add a comment to the drawn shape, move the cursor over the shape until an arrowhead appears.
• Double click on the shape and type any text in the red box that appears.
Journal of Sociolinguistics, 2016: 1–35

THEME SERIES: Interaction: Talk and beyond
Edited by Devyani Sharma

This article is part of a running Theme Series entitled 'Interaction: Talk and Beyond', which explores non-verbal dimensions of talk in interaction. A full Series introduction can be found in issue 20/3 (2016).

Cans and cants: Computational potentials for multimodality with a case study in head position¹

Rob Voigt, Penelope Eckert, Dan Jurafsky and Robert J. Podesva
Stanford University, California, U.S.A.
As the study of embodiment and multimodality in interaction grows in importance, there is a need for novel methodological approaches to understand how multimodal variables pattern together along social and contextual lines, and how they systematically coalesce in communicative meanings. In this work, we propose to adopt computational tools to generate replicable annotations of bodily variables: these can be examined statistically to understand their patterning with other variables across diverse speakers and interactional contexts, and can help organize qualitative analyses of large datasets. We demonstrate the possibilities thereby with a case study in head cant (side-to-side tilt of the head) in a dataset of video blogs and laboratory-collected interactions, computationally extracting cant and prosody from video and audio and analyzing their interactions, looking at gender in particular. We find that head cant indexes an orientation towards the interlocutor and a sense of shared understanding, can serve a 'bracketing' function in interaction (for speakers to create parentheticals or asides), and has gendered associations with prosodic markers and interactional discourse particles.

Embodiment and multimodality are receiving increasing attention in the study of talk-in-interaction. To better understand how multimodal variables vary and take on meaning across communicative settings, this paper introduces a new methodology. By annotating embodied variables with replicable computational methods, we can use statistical analysis to understand how embodied variables interact with other linguistic variables, to understand their variation across individuals and contexts, and to further support qualitative analysis of large corpora. We collected video blogs and laboratory-recorded interactions as data, focusing on the interaction of head cant, prosody, and gender. The results show that head cant is associated with the speaker's attention to and empathy with the interlocutor, and that it signals supplementary information in discourse (similar to the function of parentheses in written language). In addition, head cant correlates with prosody and discourse markers, and this correlation is moderated by speaker gender.

KEYWORDS: Embodiment, computer vision, multimodality, head cant, body positioning, prosody, gender, interaction
1. INTRODUCTION

Language is a multilayered, multimodal system; in spoken talk, meanings – and particularly social meanings – are conveyed not only by phonetics, syntax, and pragmatics, but also by facial expression, gesture, and movement. A growing body of research takes seriously the consequences of this fact by addressing the central issue of embodiment: the complex ways in which the meaning-making capacity of language is tied to the physical bodies of those who use language. In the view from the cognitive sciences, this implies that the full sensory experience of any event is deeply intertwined with, and even to an extent may inescapably constitute, the mental representations of that event (Glenberg and Kaschak 2003; Matlock, Ramscar and Boroditsky 2003; Barsalou 2008). In linguistics, this line of analysis takes form in the concept of multimodality, whereby the production of meaning is always in progress and can recruit resources from diverse semiotic modes including but not necessarily privileging spoken language (Kress and Van Leeuwen 2001). Multimodality in linguistics has a long history, even if the term is relatively new. At least as early as Birdwhistell's kinesics (1952, 1970), linguistic anthropologists have recognized the rich communicative capacity of the body, and later researchers such as McNeill (1992, 2008) and Kendon (1995, 2004) argued for the integration of gesture and spoken language as two parts of one system.

Though numerous experimental studies have provided convincing pieces of evidence for the claim that speech and bodily movements and postures are tightly connected (see, for instance, Mendoza-Denton and Jannedy 2011; Loehr 2012; Voigt, Podesva and Jurafsky 2014), our understanding of how they interact moment to moment and coalesce into meaningful signs is based primarily on observational study. Scholars of conversation analysis (CA) in particular have explored such moment-to-moment multimodality. The first article in this Series (Mondada 2016) employs just such a CA approach to consider embodiment and interactional multimodality as related to the 'ecology of the activity' taking place in interaction, looking at full-body physical positioning as a crucial resource for meaning-making. Indeed, as Mondada (2016: 341) notes, in linguistic communication, 'potentially every detail can be turned into a resource for social interaction.'

But with this unlimited potential comes a set of daunting analytical challenges. Kendon (1994) provides an early review of observational studies, noting the difficulty of accurate and consistent transcription of bodily gestures. In a review of this 'embodied turn,' Nevile (2015) notes that despite the importance of preserving the structure of the original interaction via holistic qualitative analysis of every detail, a consistent methodological framework for the analysis of embodiment remains elusive. Given the detailed analysis that qualitative work requires, analyses are based on a small number of relatively brief interactions in a limited range of contexts.

We propose that computational methodologies, by allowing for the study of particular interactional details on a large scale, will afford robust comparisons of elements of bodily movement and the interaction of movement with speech, across diverse communicative contexts. Linguists have long embraced the utility of computational tools for analyzing sound and text in interactional and non-interactional domains. In this paper we develop ways to advance the linguistic study of embodiment by drawing on tools from computer vision to augment sound and text with a third modality, video data.

Computer vision technologies are reaching a point of maturity sufficient for the analysis of embodiment from video data. These technologies can carry out some annotation tasks with high accuracy, including extracting broad characteristics of the context from video, like recognizing objects (Viola and Jones 2001; Krizhevsky, Sutskever and Hinton 2012; Girshick et al. 2014; Russakovsky et al. 2015) or tracking people in a scene (McKenna et al. 2000; Pellegrini, Ess and Van Gool 2010; Tang, Andriluka and Schiele 2014), which may help establish the setting of a given dataset. They can also extract person-level features like finding boundaries for faces and tracking head movements (Kim et al. 2008; Murphy-Chutorian and Trivedi 2009), identifying smiling and other emotive expressions (Shan 2012; Dhall et al. 2014), and tracking hands and categorizing hand gestures (Suarez and Murphy 2012; Rautaray and Agrawal 2015), which may identify axes along which meaningful variation may occur.

Since such models can generate reliable annotations of real-world phenomena, there is a sense in which they are consonant with the aims of qualitative approaches; they can help us to holistically identify and capture meaningful axes of variation. Annotations derived from computational tools, moreover, have a number of advantages. They are replicable, allowing scholars to repeat prior measures and apply them to obtain information about new speakers or contexts. And they offer scale and statistical power for descriptive observation: since we expect social and ideological structures to show themselves in aggregate as well as in individual expression, automatic tools that can operate on more data than could be analyzed by hand allow us to test many hypotheses about variation across multiple contexts and speakers.

But beyond the possibility for statistical results, in this work we aim to demonstrate that we can view computational tools as a powerful complement to quantitative and qualitative analysis of smaller and more localized data sets. If we accept the premises that first, every detail of an interaction is potentially recruitable for meaning-making and second, that variation may reflect large-scale social and ideological structures, then a broad view of the possibilities of computational methodologies is inevitably a step forward. Any interactional feature that can be recorded and defined cleanly is potentially available for computational modeling, and such modeling allows us to put such features under the microscope and uncover something about how these features combine to produce social meaning.

We demonstrate the possibilities of such an analysis in this paper by analyzing one such interactional variable – head cant (colloquially, side-to-side tilt of the head) – in a multimodal dataset of 65 different speakers. We use computational tools to examine how visual, textual, and acoustic properties combine in interaction, and how these interactions correlate with social and interactional factors. Of course, a statistical association does not directly reveal social meaning, but indicates that meaning may be at work at the local level. Thus, we allow our statistical analysis to guide us in our choice of specific examples for qualitative analysis. Our analysis confirms head cant's role as an interactional variable, its robust connection to prosodic variation, and its participation in communicative and social meanings having to do with floor management and with a frame of shared understanding between the speaker and interlocutor.

In section 2, we explain our methodology in detail, as well as the dataset to which we apply it, which includes 65 speakers across two distinct interactional contexts: YouTube video blog (henceforth 'vlog') monologues; and experimentally-collected laboratory dialogues. We take advantage of a computer vision algorithm to calculate head cant annotations automatically and use these annotations to both generate statistical results and guide a qualitative analysis, exploring the interactional functions of head cant in three stages. In section 3, we consider the simple question of the distribution of head canting: is cant more prevalent when an interlocutor is physically present? Is head cant a listening gesture? In section 4, we explore high-level statistical connections between head canting and prosodic features indicative of conversational engagement. Then, in section 5, we draw upon those connections to engage in a quantitatively-guided qualitative analysis of head cant. This involves identifying particular functions of head cant, discussing them in context, and providing statistical support for these where possible.
1.1 Head movement and posture as interactional variables

Language researchers have long known that movements of the head can participate in a diverse field of meanings. McClave (2000) provides a comprehensive review, cataloguing an extensive list of functions of head movement: as signals for turn-taking; as semantic and syntactic boundary markers; to locate discourse referents; or to communicate meanings like inclusivity, intensification, and uncertainty. Kendon (2002) looks at the functions of head shakes in particular, suggesting they participate in implied negation and superlative expressions, and noting in particular their common usage in 'multimodal constructions' in which head shaking co-occurs with particular linguistic features to jointly build an 'expressive unit.' Cvejic, Kim and Davis (2010) use optical markers on the heads of participants to show that head movement tends to co-occur with prosodically focused words.

In this study, we focus on head cant as a resource for the production of meaning. By head cant, we mean left-right lateral displacement of the head. Head cant is distinct from head tilt, the term in the literature for up-down (raised vs. lowered) displacement, which has been shown to be associated with perceived dominance (Mignault and Chaudhuri 2003; Bee, Franke and André 2009).

Head cant as an interactional posture is not as richly studied as head movements more broadly. However, researchers such as Goffman (1979) have identified gendered ideological associations of head cant in depictions of women in advertising, noting that it 'can be read as an acceptance of subordination, an expression of ingratiation, submissiveness, and appeasement' (1979: 46). This association is likely not new, as evidenced by art historical work from Costa, Menzani and Ricci Bitti (2001) finding that women were depicted in postures with head cant more often than men in a large-scale historical analysis of paintings. Moreover, these patterns are consistent: a systematic analysis by Kang (1997) found few differences between advertisements in 1979 and 1991, and even today gendered associations of the type identified by Goffman can be easily found in advertising from around the world.

Folk tellings, as well, tend to draw explicit links between head cant and both gender and sexuality, as in this excerpt from Body Language for Dummies:

    Although men tilt their heads in an upward movement, mostly as a sign of recognition, women tilt their heads to the side in appeasement and as a playful or flirtatious gesture. When a woman tilts her head she exposes her neck, making herself look more vulnerable and less threatening. (Kuhnke 2012)

In fact, head cant has been shown to be gendered in its distribution in multiple experimental settings. Mills (1984) found women used more head cant in self-posed photographs than men, in conjunction with increased smiling and postures oriented away from the camera. Grammer (1990) manually coded head cant among a number of other posture and movement variables, and found that when used by women – but not men – it functioned in part as an indicator of romantic interest between different-sex strangers.

Since the gender binary abstracts over a wide range of local practices, a binary gender finding is never the end of the story, but indicates that some kind of meaning associated with gender is at work at the local interactional level. Thus, head cant's meaning-making potentials are not by any means limited to associations with gender and sexuality.

In this work we propose that many of the gendered associations of head cant may stem from a deeper relationship between head cant and what Tannen and Wallat (1987), building on the work of Goffman (1974), call the 'interactive frame' – or the definition of what is taking place at a given interactional moment – as well as the entailed alignment or orientation to one's interlocutor, or what Goffman (1981) calls 'footing.' In particular, head cant appears to participate in communicating orientation towards the interlocutor and a sense of shared understanding, in some cases even serving a relatively explicit 'bracketing' function which speakers use to create parentheticals, asides, and confessions.
2. CASE STUDY METHODOLOGY

In this study we investigate head cant as an interactional feature and a semiotic resource. In this section we describe the selection of data, preprocessing to prepare the data for analysis, and our computational methodology for extracting head cant measurements.

2.1 Data

We compare two interactional contexts: two-person dialogues between friends recorded in a laboratory setting; and video blog monologues on YouTube with no apparent physically present interlocutor. We refer to these settings throughout the paper as 'Lab' and 'Vlog,' respectively. The two settings allow us to compare speakers who are anticipating and getting immediate feedback from an interlocutor with those who are not. Our dataset in total from these sources includes more than 18 hours of speech from 65 speakers.
Laboratory dialogues. The first interactional context is dyadic interactions between familiars recorded in the Interactional Sociophonetics Laboratory at Stanford University in California. The lab has the acoustical specifications of a sound-proof recording booth to ensure high quality audio recordings, but is staged as a living room to facilitate less self-conscious interactions. In addition to being audio recorded via lavalier wireless microphones, interactants were videorecorded by concealed video cameras (though their presence was known to all participants) positioned to capture head-on images. As many computer vision algorithms have been developed for video blog data, it was imperative that speakers not be positioned at a significant angle to the camera lens.

Participants engaged in two conversational tasks. First, speakers discussed their answers to a variety of 'would you rather . . .' questions, such as 'Would you rather always be overdressed, or always be underdressed?' This task, which lasted approximately five minutes, gave participants an opportunity to relax into the recording environment and enabled the researcher to adjust audio recording levels as needed. For the remainder of the approximately 30-minute recording session, speakers asked each other a variety of questions presented on a large rolodex on a coffee table positioned between the interactants. Questions, like 'How has the way you dress changed since high school?', were chosen to encourage speakers to reflect on identity construction without asking them about it explicitly. Participants were informed that they could use questions as prompts as desired, but that their conversation did not need to stick to the prompts at all. Following the recording session, participants filled in online surveys designed to collect demographic information as well as assessments of the interaction.

Data for 33 speakers are considered here. Of these, 22 were women, and 11 men. The great majority of the dyads were between friends or close friends (according to participant characterizations of the relationship), with a handful between romantic partners or family members. The majority of speakers were undergraduates aged 18–22; the remainder of speakers were mostly in their mid to late twenties. Although the results below focus on gender, the corpus was reasonably diverse with respect to several other variables. Speakers represented a range of racial groups. The majority self-identified as white, a sizeable minority (of nine) as multiracial, and the remainder as African American, Asian American, or Latinx. The majority of speakers were from the West Coast of the U.S.A., though a significant group (of eight) were from the South; the remainder were from the Northeast and Midwest.

Data were recorded directly onto a Mac Pro located in a room outside the living room space. Each speaker was recorded onto separate audio and video tracks. Each audio track was orthographically transcribed in Elan (Lausberg and Sloetjes 2009) and force-aligned using FAVE to automatically determine the timing for each word in the transcript based on its alignment with the audio file (Rosenfelder et al. 2011).
Video blog monologues. Video blogs ('vlogs') are a form of computer-mediated communication in which people record videos of themselves discussing their lives or other topics of interest, to be shared with close friends or the public at large. For this study, we manually collected a dataset of 32 vlogs from different speakers. Since vlogs can be about a wide variety of topics, for the greatest comparability with our laboratory data we focused on vlogs about three emotive topics tied up in identity: high school students discussing their first day of school; students discussing their experiences studying for and taking the MCATs; and pregnancy vlogs in which pregnant women discuss various stages and milestones of their pregnancies. Vlogs on such topics by women are far more prevalent than those by men; therefore, in this study our Vlog dataset is composed entirely of women. The dataset consists of mostly white speakers (with a handful of Asian American speakers and one African American speaker) ranging in age from mid-teens to approximately 40 years old.

Web video data is in an important sense 'naturalistic.' YouTube has over one billion users and hundreds of millions of hours of video watched per day,² and individual vloggers may post often and over a long period of time. This makes vlogs an everyday speech event, part of vloggers' regular repertoire. So, while the language may be highly performative, a YouTube performance is naturally and regularly occurring in the world rather than elicited by a researcher.

Digital communications researchers and anthropologists have theorized about social phenomena like the construction of identity in such public online spaces, including vlogs in particular (Kollock and Smith 2002; Griffith and Papacharissi 2009; Biel and Gatica-Perez 2013; Burgess and Green 2013). Linguistic phenomena in such data, however, are generally underexplored. On the computational side, researchers have investigated tasks such as multimodal sentiment prediction (Wöllmer et al. 2013; Poria et al. 2016), but these tend to focus on predictive tasks and binary judgements like positive versus negative.

On the linguistic side, Androutsopoulos (2010) discussed some parameters of the unique 'sociolinguistic ecology' and multimodal 'spectacle' of Web 2.0 environments, emphasizing the importance of interaction. Frobenius (2014) considered audience design in vlogs, showing that the unique circumstance of a monologue with no feedback from an audience leads to interesting interactional phenomena; for example, the observation that prosodic shifts such as differences in volume can distinguish different intended audiences for particular utterances. The interactional component of vlogging is dynamic, sometimes going so far as to produce an actual asynchronous conversational context in a back-and-forth of videos (Harley and Fitzpatrick 2009).

Indeed, the vlogging world is permeated with the idea of interaction. Comments, shares, and 'likes' are often explicitly mentioned by vloggers as a crucial means of building a dialogue between the vlogger and the audience. Duman and Locher (2008) explored this 'video exchange is conversation' metaphor in detail in Hillary Clinton and Barack Obama's 2008 campaign clips on YouTube.
2.2 Preprocessing

The Vlog data sometimes presents potential problems for the computer vision algorithm to be used, due to sections of excessive cuts or additional visual effects such as introductory splash screens. Therefore, we manually determined appropriate start and end times for each video, clipping them to extract the largest possible contiguous sections without such effects.

To perform our computational analyses we first needed to define our units of analysis, and in this study we used pause-bounded units, henceforth referred to simply as 'phrases.' Since we have manual transcripts for data from the Lab setting, we performed forced alignment (Rosenfelder et al. 2011) to obtain boundaries for each spoken word from each speaker. We then used a transcript-based method to extract phrases, defining a phrase as any continuous set of words such that no word is more than 100 milliseconds apart from the words surrounding it.
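As a concrete illustration of this grouping step, the following is a minimal Python sketch, not the authors' actual pipeline. It assumes force-aligned word timings are already available as (word, start, end) tuples in seconds; the function and variable names are illustrative.

```python
# Sketch: group force-aligned words into pause-bounded "phrases"
# (no more than 100 ms between consecutive words). Assumes word
# timings are (word, start_sec, end_sec) tuples sorted by start time.
MAX_GAP = 0.100  # 100 milliseconds

def words_to_phrases(words, max_gap=MAX_GAP):
    phrases = []
    current = []
    for word, start, end in words:
        if current and start - current[-1][2] > max_gap:
            phrases.append(current)   # gap too long: close the current phrase
            current = []
        current.append((word, start, end))
    if current:
        phrases.append(current)
    return phrases

# Example: three words with a 400 ms pause before the third
aligned = [("so", 0.00, 0.20), ("yeah", 0.25, 0.60), ("anyway", 1.00, 1.40)]
print([[w for w, _, _ in p] for p in words_to_phrases(aligned)])
# -> [['so', 'yeah'], ['anyway']]
```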
We did not, however, need manual transcripts to carry out many of the analyses we were interested in. Since we did not have manual transcripts for the Vlog data, we used an automatic heuristic based on the silence detection function in Praat (Boersma and Weenink 2015) to extract phrases. We generated phrases by running silence detection on the audio channel of each video, defining sounding portions as phrases. The more accurate phrases in our Lab data, extracted by the forced alignment method above, had an average length of 1.50 seconds. We approximated this in the Vlog data by setting the same 100 millisecond minimum boundary between sounding portions used above and starting with a silence threshold of -25dB. We iteratively ran silence detection, increasing or decreasing the silence threshold by 1dB and re-running, until the average phrase length was as close as possible to 1.50 seconds.
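A sketch of this threshold-tuning idea in Python is shown below. It substitutes a very simplified energy-threshold detector for Praat's actual silence-detection function (which the authors used), and the frame size, search range, and helper names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

FRAME = 0.01     # assumed 10 ms intensity frames
MIN_GAP = 0.100  # 100 ms minimum break between phrases
TARGET = 1.50    # target mean phrase length (seconds)

def phrases_from_intensity(intensity_db, threshold_db):
    """Simplified stand-in for Praat silence detection: a frame is
    'sounding' if it is within `threshold_db` dB of the file's peak;
    sounding runs separated by less than MIN_GAP stay in one phrase."""
    sounding = intensity_db > intensity_db.max() + threshold_db
    phrases, start, gap = [], None, 0
    for i, s in enumerate(sounding):
        if s:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap * FRAME >= MIN_GAP:
                phrases.append((start * FRAME, (i - gap + 1) * FRAME))
                start, gap = None, 0
    if start is not None:
        phrases.append((start * FRAME, len(sounding) * FRAME))
    return phrases

def tune_threshold(intensity_db, lo=-45.0, hi=-5.0, step=1.0):
    """Search thresholds in 1 dB steps for the one whose mean phrase
    length is closest to TARGET (a grid version of the incremental
    raise/lower search described above)."""
    def mean_len(db):
        spans = phrases_from_intensity(intensity_db, db)
        return np.mean([e - s for s, e in spans]) if spans else np.inf
    return min(np.arange(lo, hi + step, step),
               key=lambda db: abs(mean_len(db) - TARGET))
```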
While this procedure may have smoothed over some individual variation in phrasal pacing, our primary need was for consistent units of analysis, which we defined using phonetic rather than intonational, syntactic, or discursive criteria for delineating phrase boundaries. In the analyses to follow we used the transcript-based phrases for the Lab data and the silence-detection-based phrases for the Vlog data; however, the results presented in the following sections held even if we also used silence-detection-based units for the Lab data, further suggesting that these units of analysis are roughly equivalent.
2.3 Head cant feature extraction

We calculated head cant by adapting the shape-fitting algorithm of Kazemi and Sullivan (2014), as implemented in the open-source machine learning library dlib (King 2009). This algorithm is relatively computationally efficient and robust to differences in video quality, lighting, and occlusion, which made it feasible for the contextual diversity of our data (Figure 1).

For each frame of video in the dataset, we first used the standard face-detection implementation in dlib to find the speaker's face. We then used the aforementioned shape-fitting algorithm on the detected face, with a model pre-trained on the facial landmark data from the 300 Videos in the Wild dataset (Shen et al. 2015), which outputs locations of 68 facial landmark points per frame.
We could then calculate head cant using the points for the far corner of the left and right eyes by triangulation (Figure 1). Assuming (as we do in this dataset) a speaker roughly facing the camera, the cant angle is the arctangent of the vertical displacement of these eye corner points over their horizontal distance. We took the absolute value of these measurements as in this work we were interested in head cant primarily as displacement from an upright posture.

Figure 1: Left, shape-fitting output from Kazemi and Sullivan (2014), showing robustness to occlusion. Right, visualization of our adaptation for the calculation of head cant angle on a vlog from our dataset, calculated by first fitting a shape model of the face to find landmark points as on the left, and then triangulating cant angle from the corners of the eyes.
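A minimal sketch of this cant computation in Python with dlib is given below. It assumes the standard 68-point landmark indexing in which points 36 and 45 are the two outer eye corners, plus a locally available predictor file; the file names and frame-reading loop are illustrative assumptions, not the authors' exact pipeline.

```python
import math
import cv2    # used here only for reading video frames
import dlib

detector = dlib.get_frontal_face_detector()
# 68-point shape predictor model file (assumed to be downloaded locally)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def head_cant_degrees(frame_bgr):
    """Return absolute head cant (degrees) for the first detected face,
    or None if detection or shape fitting fails for this frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None                      # track failures, as described above
    shape = predictor(gray, faces[0])
    # Outer eye corners in the 68-point scheme (points 36 and 45)
    left, right = shape.part(36), shape.part(45)
    dy, dx = right.y - left.y, right.x - left.x
    # Cant = arctangent of vertical displacement over horizontal distance
    return abs(math.degrees(math.atan2(dy, dx)))

cap = cv2.VideoCapture("vlog.mp4")   # hypothetical input video
cants = []
ok, frame = cap.read()
while ok:
    cants.append(head_cant_degrees(frame))
    ok, frame = cap.read()
cap.release()
```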
This method allowed us to generate a continuous estimate of head cant throughout all the videos in our dataset, analogous to measures of acoustic prosody like pitch and loudness, albeit at a coarser resolution of once per frame (30Hz for a video at 30 frames per second). This method inevitably suffers from some limitations, since by the nature of large-scale automatic modeling we expect the model to introduce noise. At moments of severe occlusion – such as if a speaker turns fully away from the camera – or due to peculiarities of the algorithm's classification process, we may have failed to detect a face in a given frame or failed to accurately fit the shape model. We handled this by simply keeping track of these failures, and found that they occurred in approximately six percent of frames in the dataset. In the statistical analyses correlating head cant and prosody in section 4, we removed phrases from the analysis where more than half of the video frames that occurred during the phrase constitute classification failures of this type and as such have no accurate measurement.

A related limitation lies in the fact that head cant is naturally implicated in other bodily movements and postures, and our measurements may have been affected by this. Body cant, in particular, where the speaker's entire body is tilted and thus necessarily the head as well, presents an interesting difficulty in this regard. Nevertheless, in our qualitative analyses of the data we found this phenomenon to be relatively rare, and indeed this challenge is perhaps inherent to the study of embodiment. Even if we were hand-labeling the entire dataset, it is unclear whether a body cant of 20 degrees with a relative head cant of 0 degrees should be labeled as a head cant of 0 or 20, since the head is straight relative to the body but at a 20 degree angle relative to the floor.

This difficulty becomes even more stark when we consider the potential for perceptual entanglements. If one speaker's head is canted their spatial coordinates are necessarily rotated, so should their interlocutor's head cant best be conceived of relative to that rotated perception, or relative to some 'objective' standard like the floor or other contextual grounding? All of the above likely constitutes a direction for future research in its own right, so in this work we sidestepped the issue by taking our computational method at face value.
3. HEAD CANT IN AND OUT OF INTERACTION

In framing the importance of head cant as an object of study in section 1, we postulated it to be an 'interactional variable,' playing a role in functions such as turn management between interlocutors. For example, head cant could function as a listening posture, signaling the listener role, or it could also signal interest in what the interlocutor is saying. We expected that neither of these functions would be present in the Vlogs, which have no explicit interlocutor, but that either or both could be present in the Lab data.

To explore this potential difference between datasets, we randomly sampled 5,000 individual frames of video from each speaker in the dataset, and determined whether the head cant measured in that frame occurred during a spoken phrase or not. As shown in Figure 2, speakers in the laboratory setting used more head cant overall than those in vlogs, with a mean cant of 6.4 degrees as compared to vloggers' mean cant of 4.5 degrees (two-sided t-test, t = -105.4, df = 323,430, p < 0.001).
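As a sketch of this kind of frame-level comparison in Python, the sampled frames can be compared with a two-sided t-test via SciPy (a Welch variant is shown); the file name, DataFrame layout, and column names are assumptions for illustration, not the authors' actual variables.

```python
import pandas as pd
from scipy import stats

# Assumed layout: one row per sampled frame, with the setting
# ('Lab' or 'Vlog') and the measured cant in degrees.
frames = pd.read_csv("sampled_frames.csv")   # hypothetical file

vlog = frames.loc[frames["setting"] == "Vlog", "cant_deg"]
lab = frames.loc[frames["setting"] == "Lab", "cant_deg"]

# Two-sided t-test comparing overall cant across the two settings
t, p = stats.ttest_ind(vlog, lab, equal_var=False)
print(f"t = {t:.1f}, p = {p:.3g}")
```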
We observed no statistical difference between speech and non-speech segments in the vlogs, while the Lab participants used more head cant while not speaking than while speaking (two-sided t-test, t = -21.425, df = 135,860, p < 0.001). Moreover, we saw gender effects within the laboratory data. While men and women appeared to use nearly the same mean head cant of around six degrees during speech segments, an ANOVA analysis revealed a significant interaction effect with gender: men in our dataset used more head cant while not speaking than did women (F = 192.5, p < 0.001).

The relatively low amount of cant in the vlogs suggests that the movements that people make while speaking and listening in the lab dialogues have an interactive signaling effect. It supports an association between listening and head cant, and it may suggest that cant is playing a role in floor management. It could also, though, reflect the importance of an interlocutor in supporting whatever other functions cant is playing.
Figure 2: Distribution of head cant by gender and interactional context, distinguishing between speech context, that is, whether the speaker is currently speaking or not. Error bars represent 95 percent confidence intervals; these intervals are small since the number of observations is very high.

Our results provide an interesting contrast with the results of Hadar et al. (1983), who used a polarized-light goniometer technique to measure head movements during conversation, finding evidence for constant movement during talk, while listening was marked by the absence of head movement. Together these results suggest that listening may be marked more by static but perhaps meaningful postures (such as head cant) while speaking may be marked by dynamic movements.
These findings also begin to challenge the gendered associations of head cant mentioned previously. In our data men use more head cant overall, an effect driven by their use during non-speech portions of the interaction. Given this, we raise the question of whether women and men are doing more or less of the same thing, or whether they are actually using cant differently. We will explore these questions in the following sections.
4. HEAD CANT AND PROSODY

In the previous section we established a relationship between head cant and the simple fact of speaking, showing that this relationship is affected by in-person interaction. In this section, we delve further by exploring not just the incidence of speech but also some aspects of the nature of the speech produced. Beyond the straightforward distribution of cant, we can join our computational methodology in the visual modality for measuring head cant with existing common computational methodology in the acoustic modality: the automatic extraction of measurements of prosodic features of speech like pitch and loudness.

Prosodic variation on its own, of course, can communicate a number of social meanings, such as emotion (Scherer 2003) and attitude (Krahmer and Swerts 2005). Most relevant for our work, a number of previous studies show strong correlations between increased values of various prosodic variables and a speaker's general level of interest (Jeon, Xia and Liu 2010; Schuller et al. 2010; Wang and Hirschberg 2011), or engagement and excitement; for instance, Trouvain and Barry (2000) show that horse race commentators speak with higher pitch and pitch range, greater loudness, and increased speech rate as the races reach their peak of excitement in the finale.

Like head cant, pitch and loudness are continuously varying signals that are inevitably implicated in speech: by comparing these signals on a large scale we can aim to uncover cross-modal synchrony in the sense of McNeill's view of gestures and speech as fundamentally part of the same system. Existing studies have detailed particular elements of the strong relationship between the two modalities:

• Mendoza-Denton and Jannedy (2011) show evidence for the co-occurrence of pitch accents and gestural apices;
• Loehr (2012) similarly shows that gestural phrases align with intermediate phrases;
• Voigt, Podesva and Jurafsky (2014) show that increased overall body movement in a phrase predicts greater pitch and loudness mean and variability.

Our automatic annotations allow us to compare these signals – head cant and acoustic prosody – statistically on a large scale, which will allow us to understand both their overall relationship and how that relationship may differ or interact with contextual variables like the gender of the speaker or the context.
4.1 Methodological setup

In order to understand the joint influence of prosodic features and contextual factors, we built statistical models in which these features act as independent variables predicting head cant. We used Praat (Boersma and Weenink 2015) to extract F0 (hereafter 'pitch') and intensity (hereafter 'loudness') measurements throughout the audio track of each video. We then z-scored (subtracted the speaker's mean and divided by their standard deviation) all pitch, loudness, and head cant measurements, to convert them into an equivalent scale of units of standard deviations across speakers. We did this so that genders and speakers are comparable: variation is only relevant with reference to some perceived baseline, which in this case we modelled as being speaker-specific. We then calculated the mean pitch and loudness, z-scored by speaker, for every phrase in the dataset (automatically identified as described in section 2.2). Our preprocessing resulted in a total of 17,533 usable phrases, representing more than 7.5 hours of continuous speech from our 65 speakers.

We modelled the interaction between these variables with a linear mixed-effects regression as implemented in the lme4 package in R (Bates et al. 2015); reported p-values were calculated with Satterthwaite's approximation using the lmerTest package (Kuznetsova et al. 2013). Our regression used phrases as observations, including speakers as random effects. We modelled mean z-scored head cant in the phrase as the dependent variable, with independent variables of the z-scored pitch and loudness mean in the phrase, as well as the gender of the speaker, source of the video (Vlog or Lab), and the log duration of the phrase. We also included interaction effects between the prosodic (pitch, loudness) and contextual (gender, source, phrase duration) features in the model. Visual inspection of diagnostic plots confirmed that our fitted model met the model assumptions. We discuss particular results inline; a full regression table is given in an appendix.
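The published model was fit with lme4/lmerTest in R; purely to illustrate the per-speaker z-scoring and the model structure, here is an analogous sketch in Python using statsmodels' MixedLM with a random intercept for speaker. The file name and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per phrase, with raw per-phrase means and metadata.
phrases = pd.read_csv("phrases.csv")   # hypothetical file

# z-score pitch, loudness, and cant within speaker, as described above
for col in ["pitch", "loudness", "cant"]:
    grouped = phrases.groupby("speaker")[col]
    phrases[col + "_z"] = (phrases[col] - grouped.transform("mean")) / grouped.transform("std")
phrases["log_dur"] = np.log(phrases["duration"])

# Random intercept for speaker; prosodic-by-contextual interaction terms.
model = smf.mixedlm(
    "cant_z ~ (pitch_z + loudness_z) * (gender + source + log_dur)",
    data=phrases,
    groups=phrases["speaker"],
).fit()
print(model.summary())
```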
4.2 Results

We found associations between head cant and both pitch and loudness in a phrase, again differing by both gender and interactional context, visualized in Figure 3.
Overall, we found that the higher the pitch in a phrase, the higher the degree of head cant (estimate 0.058, p < 0.001). This effect was modulated by gender with a significant interaction effect (estimate -0.042, p = 0.002), such that the trend held for women but substantially less for men in our dataset. For women an increase of one standard deviation in mean pitch during a phrase predicted an increase in head cant of nearly a degree. Pitch is quite regularly invoked as a fundamental gender difference, and one might be tempted to connect pitch and cant as jointly expressing something like femininity. If pitch in this case were directly associated with gender, though, one would expect the men's pitch to decrease rather than to simply increase less than the women's; this recalls the body of research suggesting that linguistic features do not map directly onto aspects of identity but rather that the relationship is complex and indirect (for instance, Ochs 1992).
While women and men differed only in significance in the pitch pattern, they showed opposite effects in the relation between cant and loudness. We found no overall fixed effect for loudness (estimate -0.088, p = 0.58), but there were strong interactions between loudness and both gender (estimate 0.083, p = 0.003) and context (estimate -0.109, p < 0.001). To some degree, these findings are in accord with prior work comparing the degree to which prosodic variables like pitch and loudness are 'socially loaded' differentially by gender: for example, in a study of speed dates McFarland, Jurafsky and Rawlings (2013) found that perceived friendliness is marked by pitch maximum and variability in women as compared to loudness variability in men. Women in Vlogs displayed a strong negative relationship between loudness and head cant: for speakers in this category a decrease in loudness of one standard deviation during a phrase predicted an increase in head cant of more than a degree. Recall that these speakers are also using significantly less cant overall than those in the Lab data. One interpretation of this result might be that, while cant as a socially meaningful variable for these speakers is, in general, rarer, it may have a more marked effect in the times it is used.

Figure 3: Marginal effects of pitch and loudness on head cant across genders and interactional contexts holding other factors constant. All variables are z-scored by speaker, and observations in the model are silence-bounded phrases. Ribbons represent estimated 95 percent confidence intervals around the trend line.
Women in the Lab interactional context showed a slight negative relationship between loudness and canting, but one far less strong than for Vlog women. The men in the Lab context, on the other hand, showed a strong positive relationship between loudness and head canting.

In summary, for both men and women, and for both Vlog and Lab settings, head cant was associated with higher pitch, although the effect was weaker for men. Head cant was generally associated with quieter speech in women, but louder speech in men.³ These results suggest that women and men were doing quite different things with cant, since the gender difference is not a matter of degree but a reversal of effect.

While these statistical correlations cannot tell the whole story, they point out the tight connection across these modalities. To truly understand the meanings we have to delve deeper.
5. QUANTITATIVELY-GUIDED EXPLORATION

In the previous section, we identified several high-level relationships between head cant and prosodic features, demonstrating how these differed by gender and interactional context. To explore these in greater detail, in this section we use larger-scale 'distant reading' as a guide to facts on the ground. With this method we uncover a relationship between head cant and meanings having to do with shared understanding, and further show its gendered distribution with particular discourse particles for women (you know and I mean) as compared to conversational acknowledgements for men (mmhmm and yeah).

We postulate that the statistical relationships between prosody and cant discovered in the previous section are in part an indication that these features are participating in 'multimodal Gestalts' (Mondada 2014). That is, it is because these features are independently important as resources for the generation of social meaning that they tend to cluster in meaningful and consistent ways across speakers. This clustering may reflect sites of particularly rich generation of social meanings. But what is happening at these moments? To answer this question, in this section we use cross-modal quantitative measurements to directly observe these moments of meaningful co-variation.

In what we call a quantitatively-guided exploratory analysis, we developed a qualitative analysis by observing selections of randomly sampled phrases in context from the dataset that met some conditions observed in our broader statistical trends. This is an analytic move analogous to Labov's (2001) search for the social forces in sound change by using the data on variation to identify the leaders in change, and looking close up at those leaders for commonalities in their social characteristics.
Since we found women to combine greater head cant with high pitch and low intensity, and men to combine head cant with high pitch and high intensity, we extracted phrases high in head cant co-occurring with high pitch and low intensity on the one hand, and with high pitch and high intensity on the other. We defined 'high' and 'low' as the top and bottom 30 percent, respectively, and considered five categories: high pitch alone; high intensity alone; low intensity alone; high pitch with high intensity; and high pitch with low intensity. For each category we randomly sampled and qualitatively examined at least 100 of these central exemplars.
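One plausible reading of this sampling step, sketched in Python over the same assumed per-phrase DataFrame as in section 4.1, is to combine a high-cant cutoff with each prosodic condition; the column names, category labels, and exact masks are illustrative assumptions rather than the authors' definitions.

```python
# Quantile cutoffs on the speaker-normalized per-phrase measures
hi_cant  = phrases["cant_z"]     >= phrases["cant_z"].quantile(0.70)
hi_pitch = phrases["pitch_z"]    >= phrases["pitch_z"].quantile(0.70)
hi_loud  = phrases["loudness_z"] >= phrases["loudness_z"].quantile(0.70)
lo_loud  = phrases["loudness_z"] <= phrases["loudness_z"].quantile(0.30)

categories = {
    "high pitch":                 hi_cant & hi_pitch,
    "high intensity":             hi_cant & hi_loud,
    "low intensity":              hi_cant & lo_loud,
    "high pitch, high intensity": hi_cant & hi_pitch & hi_loud,
    "high pitch, low intensity":  hi_cant & hi_pitch & lo_loud,
}

# Randomly sample up to 100 central exemplars per category for close reading
samples = {
    name: phrases[mask].sample(n=min(100, int(mask.sum())), random_state=0)
    for name, mask in categories.items()
}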
One of the strongest trends we observed in this examination was that head cant is implicated both in floor management and in processes of signaling shared understanding – and that the two cannot be easily separated. Head cant appears to frequently be called on to establish that the speaker and the interlocutor share (or ought to share) some pre-existing knowledge about the discourse at hand. In this way we can view head cant as participating in shifts in footing, in the sense of Goffman (1981); that is, head cant may subtly modify the alignment or 'interactional frame' (Tannen and Wallat 1987) taken up by a speaker in a given utterance.
5.1 The framing of shared understanding

The interactional frame of 'shared understanding' in which head cant participates can take many forms. It appears to carry overtones of friendship, a sense of obviousness, or a taking into confidence, and can appear in the context of repetition or restating. In turn it may be used for many purposes: to induce the interlocutor to interpret a claim as properly belonging to a shared understanding; to propose a presupposition of such understanding that softens an utterance for stylistic purposes; or to indicate dismissiveness of the obvious thing.

For example, in the Vlog context, consider a YouTuber named Nat in a video entitled '5 Weeks Pregnancy Vlog.' Nat's vlog records the journey of her pregnancy, discussing physical and emotional changes throughout and chronicling milestones along the way. Considering the audience design which might influence her linguistic choices, it's worth noting that Nat's channel is surprisingly popular, with over 40,000 subscribers, and as of this writing the video in question has had more than 28,000 views; however, the video in question is only the third published by her channel, so perhaps it had far less viewership when it was made.

In Example 1, below, Nat is ten minutes into the video, and is talking about telling her two best friends about her new pregnancy, both of whom have children of their own, as well as her husband Weston telling his friends. This moment follows a long and detailed account of telling her parents, and their excited reactions. In contrast, she gives the story of telling those friends in a few brief sentences, ending with:

Example 1
1  Nat:  and they're of course very excited
2        and very supportive and Weston told his two best friends
During this segment, Nat uses head cant in alternating directions with reduced loudness and variable pitch (Figure 4). The overall effect is to create a sense of obviousness but gratefulness in describing the reactions of her friends to hearing of her new pregnancy, which is strengthened by co-occurrence with the explicit 'of course.' Given the excited reactions of her parents she just described in detail, and the knowledge Nat expects to share with her imagined interlocutor that friends are generally excited about pregnancies, these head cants contribute to framing the content of her utterance as almost going without saying. We note that cant here is combined with semi-closed eyes and a smile. While we cannot comment authoritatively on eye and mouth features since we do not have equivalent data on them, it may be these features that contribute intimacy and positive affect. We note that these features can also be measured automatically, and ultimately an understanding of body movement is going to require careful analysis of multiple and co-occurring gestures, or constructions.
Moments later, Nat uses head cant again as she reiterates a point made earlier in the discourse: the pregnancy is still meant to be kept a secret to everyone but the couple's parents and very best friends. Earlier Nat has mentioned this fact several times, but tags it on with a clearly conspiratorial stylistic move generated by not only her words but the near-whispered tone, head cant combined with forward tilt, sly smile, wide open eyes, and a finger to the lips (Figure 5). We note that unlike the earlier uses of cant, here it participates in a combination of gestures constituting a highly enregistered or conventionally 'iconic' sign. One question we might ask is whether complex enregistered signs like this one occur more frequently in monologues than in face-to-face interaction, which would support the hypothesis that the lack of an interlocutor calls for less subtle gestures.

Figure 4: Nat's two head cants from 10:43–10:50 in Example 1 (panels: 'and very supportive'; 'told his two best friends') (https://youtu.be/fLS8RFnCcII?t=10m43s)

Figure 5: Nat's cant from 11:01–11:05 (panel: 'knowing they're supposed to be quiet') (https://youtu.be/fLS8RFnCcII?t=11m1s)
Nat is invoking a set of shared beliefs about the social bonds associated with pregnancy. Example 2 from the Lab setting is a little more risky, as two interlocutors jointly confirm shared knowledge that might be face-threatening. Two friends, a White female (speaker A) and a Hispanic male (speaker B), are discussing the question of whether they have ever been mistaken for a person of another race. The conversation has turned to talking about racial diversity in AP ('advanced placement' or college-level) classes, as the Hispanic male describes being mistaken for Asian by virtue of being in those classes, and in other circumstances being mistaken for 'every race except White.' After a brief joking digression about how the White female speaker could never be mistaken for Black, she responds to the issue by bringing up the case at her high school:

Example 2
1  A:  it was really weird at our school, cause, like
2      my school was like,
3  B:  mostly. . .
4  A:  a hundred percent White pretty much
5  B:  White. . . yeah.
Immediately after the first 'like' in line 1, above, the speaker makes a shift to a head-canted posture, and simultaneously her loudness decreases, her speech rate increases, and her voice gets very creaky. These conditions hold through the end of line 4, and her head holds in the canted posture as well. Her cant marks a particular type of almost conspiratorial side comment, as if she is making an overly obvious confession, the content of which her interlocutor already knows (indeed, he produces simultaneous speech conveying the same proposition), intensified by the exaggerated 'a hundred percent' (Figure 6).

Figure 6: A head cant and its overlapping, softer-spoken response as both speakers discuss the ethnic diversity of their high schools in Example 2 (panels: 'our school, cause, like'; 'a hundred percent white'; 'mostly...'; 'White… yeah.').
At the same time, the information in the canted clause is highly relevant to the following discourse and is by no means obvious. After this comment, the speaker goes on to discuss more detailed specifics about the diversity of her high school and childhood community overall before returning to the topic of AP classes, suggesting this is information of which her interlocutor was not previously aware even though they are friends.

This example illustrates how head cant is used to establish footing for an interactional frame of shared understanding, as opposed to an indication of actual common ground in the sense of Clark and Brennan (1991). The speaker is drawing upon head cant as an interactional resource to frame the revelation of the lack of diversity at her high school in a particular way: as obvious, expected, and perhaps even somewhat embarrassing.
44
© 2016 John Wiley & Sons Ltd
THEME SERIES: INTERACTION 21

This is further evidenced by her interlocutor's reaction during the phrase, wherein he bobs his head in a minor mirroring head cant immediately following the speaker's initial cant, and speaks overlapping with her in a low and creaky voice: in line 3 canting on 'mostly...', and in line 5 saying 'White' aloud almost exactly in time with speaker A, smiling gently at the end (Figure 6). Her head cant marks an invitation to the frame, and he participates; he in fact already knows the point at hand and doesn't have to wait for her to say 'White' but instead speaks it in time with her.
While head cant may enable the speaker and her interlocutor to orient to the lack of diversity at her school (and, more specifically, may allow the speaker to distance herself from it), growing evidence suggests that creaky voice may function similarly. Lee (2015) observes that creak is often used to produce parenthetical speech, and that in much the same way that creak distances parenthetical speech from the primary thread of discourse, so too can speakers use it to distance themselves from their interlocutors or the topics about which they are talking. Similar claims have been made by D'Onofrio, Hilton and Pratt (2013), who show that two adolescent girls used creak regularly to distance themselves from statements that made them potentially vulnerable, and Zimman (2015), who showed that a transmasculine speaker telling a narrative about visiting family used more creak when referencing familial tensions. Thus, our argument about the interactional function of head cant is independently supported by the use of creaky voice in a completely different modality (speech).
5.2 Discourse particles
The hypothesis we've been exploring is that head cant functions in part to establish or index an interactional frame of shared understanding. We might, therefore, expect that cant would co-occur with discourse particles serving the same function. Since we have time-aligned transcripts for the Lab data, we can expand our quantitatively-guided investigation beyond cant and prosody to include the words used in the interactions. Schiffrin (1987) describes an interactional frame of shared understanding in the discourse particles you know and I mean. She suggests that you know directly acts as a mechanism for reaching a state of shared understanding where the speaker knows that the interlocutor has knowledge of the topic at hand. I mean focuses on other-orientation in the adjustment of footing towards the production of talk that the interlocutor will understand.

However, these particles can also serve as fillers, markers of upcoming delay or hesitation, or floor holders analogous to particles such as like, um, and uh (Clark and Fox Tree 2002; Fox Tree 2007). To approach this distinction, we compared the use of high cant in conjunction with I mean and you know on the one hand, and um, uh, and like on the other. We defined 'high cant' as phrases with a mean cant in the top 30 percent of all phrases.
Table 1: Odds ratios for discourse particles appearing in the top 30 percent of phrases with the highest head cant as compared to the bottom 70 percent. Values higher than 1.0 indicate a positive association with canted phrases, while those lower than 1.0 indicate association with less canted phrases. P-values from Fisher's exact test are given in parentheses; values for women are significantly different while for men there is no association.

Discourse particle                          Women           Men
Shared understanding: I mean, you know      1.58 (p=0.03)   1.30 (p=0.29)
Hesitation/floor: um, uh, like              0.83 (p=0.01)   0.99 (p=1.00)
Across all 9,038 phrases in the Lab data, Fisher's exact test (Fisher 1922) shows (Table 1) that women are significantly more likely to use you know and I mean in phrases with high head cant, and less likely to use um, uh, and like in those phrases. Like, in particular, is strongly associated with phrases with low head cant. Andersen (1998) compiles an extensive review of research on like, finding that overall it acts as a 'loose talk marker' from a relevance-theoretic perspective – that is, the speaker is opting to signal a pragmatic 'discrepancy between the propositional form of the utterance and the thought it represents.'
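To make the calculation behind this kind of odds ratio concrete, the sketch below shows one way it could be computed in Python. This is an illustrative reconstruction, not the authors' released code: the phrase records are hypothetical stand-ins, while the 70th-percentile split and the particle groupings follow the description above.

```python
# Illustrative sketch (not the authors' code): odds ratio and Fisher's exact
# test for a discourse-particle category against a binary "high cant" split.
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical phrase records: (mean head cant for the phrase, transcript text)
phrases = [
    (0.91, "you know how like when"),
    (0.78, "I mean resume up with"),
    (0.40, "you know what I mean"),
    (0.85, "um and then yeah"),
    (0.15, "um so I was talking"),
    (0.05, "uh yeah and then"),
    # ... one tuple per phrase in a real dataset
]

cants = np.array([c for c, _ in phrases])
threshold = np.percentile(cants, 70)          # top 30% of phrases = "high cant"
high_cant = cants >= threshold

shared = ("you know", "i mean")               # shared-understanding particles
has_particle = np.array(
    [any(p in text.lower() for p in shared) for _, text in phrases]
)

# 2x2 contingency table: particle presence x high/low cant
table = [
    [np.sum(has_particle & high_cant), np.sum(has_particle & ~high_cant)],
    [np.sum(~has_particle & high_cant), np.sum(~has_particle & ~high_cant)],
]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

The same contingency-table construction would apply to the conversational acknowledgements (mm, mmhm, yeah) analysis reported in section 5.3.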
To look at a particular example, the following section of speech occurs during a discussion of finding one's 'true passions,' where the speaker is expressing her surprise at finding that a set of activities in high school she originally participated in to pad her resume turned into a more sincere passion. During this extended turn (Example 3), the speaker starts with a relatively low cant, initiating a slight cant at the first 'actually'; however, she moves to a large head cant directly upon the phrase containing you know, spoken with a somewhat heightened pitch. This phrase marks the beginning of a conversational aside that does not constitute the content of her own story, but rather serves as an attempt to gain 'meta-knowledge,' in Schiffrin's terms, that her interlocutor is also familiar with the background against which she was making her decisions.
Example 3

1  A: I kinda felt like I was doing it for the resume
2     like in high school to be honest
3     like, but then like I actually really liked it and then like –
4     like you know how like when –
5     when you wanna like fill your transcript up with like –
6     I mean resume up with like a bunch of like activities.
7     Like you get to choose what activities you want
8  B: mmhm
9  A: and all the activities I chose were
Both the speaker and her interlocutor are students at an elite undergraduate institution: the speaker's use of you know helps to signal that she has made the very reasonable assumption that her interlocutor, too, knows about needing to bolster one's resume in high school. Her head cant pointedly marks the sentence and a half that follow as an almost redundant aside, helping to put this decision-making context into a frame of shared understanding that will allow her interlocutor to empathize with the experiences to follow (Figure 7). The speaker continues to cant her head back and forth lightly during those phrases, and as she finishes saying 'choose what activities you want' (line 7) her interlocutor responds to the frame by smiling, nodding her head up and down, and backchanneling 'mmhm.' Precisely as the speaker returns to talking about her own experiences (line 9) her head cant returns to a neutral upright position, suggesting the bracketing function of the cant has come to an end.
5.3 Conversational acknowledgements
In the preceding section, we found a statistical relation confirming the link between head cant and other-oriented discourse markers such as you know and I mean, but these results held only for women in our dataset. One crucial difference across genders in our prosodic findings from section 4 was that men's increased head cant is associated with increased loudness, precisely the opposite relationship from that found in women.

In our analysis of phrases in the data matching the statistical trend – in this case, louder phrases spoken by men with high head cant – we found that this may accompany men's backchannels, acknowledgements, and affirmative responses.
Figure 7: A speaker canting as she makes an aside about a shared experience of resume-padding in Example 3. [Video stills labelled: 'actually really liked it' / 'you know how like when' / 'resume up with']
Table 2: Odds ratios for conversational acknowledgements appearing in the top 30 percent of phrases with the highest head cant as compared to the bottom 70 percent. Values higher than 1.0 indicate a positive association with canted phrases, while those lower than 1.0 indicate association with less canted phrases. P-values from Fisher's exact test are given in parentheses; values for men are significantly different while for women there is no association.

Conversational acknowledgements    Women           Men
mm, mmhm, yeah                     0.88 (p=0.37)   2.19 (p<0.01)
Recall from section 3 that men in our dataset used more head cant while not speaking than did women; in investigating the data qualitatively we repeatedly encountered cases where men were listening for extended periods while holding a head cant posture, and then responded to statements from the interlocutor while maintaining that cant, often speaking more loudly. It may be that the move out of head cant towards a neutral posture accompanied by a (potentially louder) affirmation sets into motion a footing shift that brings an interactant out of a 'listener' role and into the 'speaker' role by catching on to moments of shared understanding. In this case, floor management merges with shared understanding.

This hypothesis recalls previous findings about conversational acknowledgements and backchannels such as mm, mmhm, and yeah that signal conversational engagement and which can also function as floor-grabbers (Jefferson 1984; Lambertz 2011). We again used Fisher's exact test on all 9,038 phrases in the Lab data to calculate the odds ratios for the occurrence of these words in phrases with high head cant as opposed to those without. We found that for the men in our dataset, conversational acknowledgements were highly associated with phrases with high head cant, while for women there was no relationship (Table 2).
Example 4 illustrates this phenomenon with an interaction between two men in the dataset, discussing what they would do with unlimited money. Both are undergraduates studying computer science, and clearly share a disdain for 'start-up accelerators' – companies that provide small amounts of early funding to start-ups in exchange for a substantial portion of equity in the company. They're discussing a particular such accelerator, anonymized here as 'Company X,' in overtly positive words but with a tone of dripping sarcasm that they know they both share.
Example 4

1  A: pay people to to think of ideas for start-ups for you
2  B: yeah that's th th that's –
3  A: yeah
4  B: that's exactly what um the Company X does
5  A: no, I yeah I know {laughter}
6     it's really funny.. oh man...
7     my teacher was like yeah Company X's like supported all this stuff {laughter}
8  B: supported {laughter}
9  A: supported, yeah it's like we like your idea here's some money
10    and now we get a...
11    like chunk of your company {laughter}
The sarcastic aside begins with speaker A's marked head cant on 'for you' (line 1) after the preceding part of the phrase was spoken with no cant and his eyes half-closed, looking downwards (Figure 8). He simultaneously raises his gaze to meet eyes with his interlocutor as he cants his head, explicitly handing off the turn as his head cant helps frame the statement within their shared understanding as sarcastic. As they continue the sarcastic aside they trade turns repeatedly starting with 'yeah' or repetitions of their interlocutor's words ('supported') and ending with brief moments of laughter, throughout canting their heads to various degrees.

In this section we have explored several instances where head cant participates in multimodal Gestalts having to do with a frame of shared understanding. This frame is broad and can contain numerous related overtones such as obviousness, embarrassment, and sarcasm. Coupled with quantitative evidence from the distribution of particular words in the transcripts, we identified a potential gender difference in how this frame is expressed: for women it more commonly takes shape in interactional particles involving meta-knowledge and checks on the interlocutor's understanding, while for men it surfaces in conversational acknowledgements and affirmative responses. Nonetheless, the big question remains for future work: why are certain patterns of using cant more prevalent among men than women and vice versa? For instance, in the case of conversational acknowledgements, women do also show this pattern, simply less commonly, so the question is which women and when? In other words, what is the social significance of this kind of interactional move, and how does it enter into the construction of gender and other aspects of identity?
6. DISCUSSION

In this study, we have presented evidence that head cant is a robust interactive resource. It was more prevalent in our face-to-face Laboratory data than in YouTube monologues, suggesting that it plays an important signaling role to one's interlocutor. This seems to be related both to floor management and to footing in relation to conversational content, at times serving to bracket off particular frames.4
Figure 8: Two speakers trade cants and conversational acknowledgements in an extended moment of sarcastic aside (Example 4). [Video stills labelled: 'to think of ideas' / 'for you' / 'yeah that's' / 'yeah… no, I' / 'supported {laughter}' / 'supported, yeah']
We also found that, in the dialogue but not the monologue context, head cant was more prevalent during times when the interactant was not speaking, suggesting an association with listening. This could be a simple signal that one is listening, yielding the floor, or it could communicate the listener's orientation in relation to the content. We found high-level statistical correlations between elements of engaged prosody and head cant. There was an overall positive relationship between increased cant in a phrase and increased pitch, and a complex relation between cant and loudness.

All of these correlations showed important gender effects. Men canted while listening more than women, suggesting that the traditional gendered
associations that link head cant to hegemonic femininity are likely not telling the whole story. Increased cant correlated robustly with higher pitch among women, but appeared only as a trend among men. Finally, while men's loudness increased with cant, women's decreased, particularly in the Vlog setting. The latter points to a qualitative gender difference, in which cant appears to be playing a more important role in floor management for men than for women.

This appears to be supported by the relation between cant and discourse particles in the Lab data. We found that women's phrases with high head cant were associated with discourse particles having to do with shared understanding like you know and I mean. This did not hold for men. Conversely, for men but not for women, phrases with high head cant were associated with conversational acknowledgements like mmhm and yeah, suggesting more of a floor management function.
The set of gender differences we uncovered at every stage – across speaking contexts, in prosodic correlations, and in particular lexical items – suggests that the distribution of the communicative uses of head cant is gendered to some extent. However, the relation of this feature to gender is neither simple nor direct. We note that binary gender is low hanging fruit, as very little information is required to assign speakers to the male or female category. Our attention to gender in this study emerged initially from the previous literature, but it is possible that equally interesting patterns may emerge with other macro-social categorization schemes, such as class, ethnicity, or age. Ultimately, the meaning of cant is not 'male' or 'female,' but qualities and orientations that differentiate among and between the binary gender categories.
More broadly, we have shown that head cant is an interactional resource, and in this capacity it interacts with both sound and text on the one hand and other body movements on the other, to build higher level structures, or interactional signs. Much work is needed to uncover the nature of gestural signs, and their combinations, a challenge that is shared by current work in variation in speech (e.g. Eckert 2016). Ultimately, this adds an entirely new medium to the study of variation, and challenges us to integrate body movement into our theories of variation.
6.1 Moving forward
Through an extended exploration of head cant, we hope this paper has illustrated the value of taking a computational approach to embodiment. Computational methods facilitate the analysis of larger datasets than are typically employed in research examining the role of the body in interaction. While micro-analyses of interaction have been and continue to be instrumental to understanding the complex orchestration of multimodal interactional resources in communication, large-scale analyses enable researchers to consider other types of questions.
First, beyond simple generalizability, analyses of larger datasets enable us to identify the broader interactional functions for individual embodied resources. For example, we have observed here that, across a relatively large number of interactions, interactants cant their heads to a greater degree when listening. This reveals that, even though head cant can be used to take up a variety of rather different interactional positionings, as detailed in section 5, it simultaneously serves a common function. It remains to be seen whether other embodied resources pattern similarly, but large-scale analyses can help determine the extent to which specific forms of embodiment are interactionally meaningful in and of themselves, irrespective of the particulars of a given interaction.

While it is important to identify the general interactional functions that embodied resources might serve, we emphasize that such general functions only scratch the surface of the meaning potentials for these resources. Any study of head cant would be incomplete without a discussion of how its meaning is mediated by the other features with which it occurs. Quantitative analyses facilitate the identification of collocations between embodied resources (e.g. head cant) and social (e.g. gender) and linguistic factors (e.g. discourse particles, prosody). We can therefore uncover trends like 'women produce higher head cant with higher pitch and lower loudness, particularly when producing discourse markers like you know and I mean.' In addition to the methodological ability to identify collocations, we gain an important theoretical insight: that the meaningful interactional resource (or, put another way, sign object) is not simply head cant, but the 'multimodal Gestalt' of head cant, gender, prosody, and discourse particles.
To take the approach advocated in this paper, sociolinguists must have access to both computer vision methods of the sort used here and audiovisual data. Regarding computational methodologies, we call on the research community to share newly developed methods for analyzing embodiment. So that future researchers can replicate our results and study head cant and other visual features in new datasets moving forward, we release all corresponding code at this url: nlp.stanford.edu/robvoigt/cans_and_cants
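As a pointer to what such computer vision methods can look like in practice, the following is a minimal sketch of one plausible way to estimate per-frame head cant (roll) from the line between the two eyes given by dlib's 68-point facial landmarks (Kazemi and Sullivan 2014; King 2009). It is an assumption-laden illustration rather than the released pipeline: the inter-ocular measure and the video path are stand-ins, and the released code should be consulted for the measure actually used in this study.

```python
# A minimal sketch (not the released code) of estimating per-frame head cant
# as the roll angle of the line between the two eyes, using dlib facial
# landmarks and OpenCV for video decoding. Assumes the standard 68-point
# dlib model file has been downloaded; "interview.mp4" is a hypothetical path.
import math
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def frame_cant_degrees(frame):
    """Return the head cant (roll) in degrees for the largest detected face,
    or None if no face is found. 0 indicates a level head."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    face = max(faces, key=lambda r: r.width() * r.height())
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    left_eye = pts[36:42].mean(axis=0)    # landmarks around the left eye
    right_eye = pts[42:48].mean(axis=0)   # landmarks around the right eye
    dx, dy = right_eye - left_eye
    return math.degrees(math.atan2(dy, dx))

cap = cv2.VideoCapture("interview.mp4")   # hypothetical input video
cants = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cants.append(frame_cant_degrees(frame))
cap.release()

# Per-phrase mean cant could then be computed by aligning frame timestamps
# with time-aligned transcripts before z-scoring by speaker.
```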
Although sociolinguistic work is carried out predominantly on audio corpora, audiovisual data greatly expand the kinds of considerations that analysts can take into account. Lab data like those used in this study afford perhaps the largest variety of explorations, given the angle of video capture and the high quality of audio recordings. Yet, field recordings can, in principle, provide many of the same opportunities, given the right setup; though audio quality would surely suffer, the ecological validity would likely be improved.

We also underscore the power of web data, particularly video blog data. The majority of videos on YouTube are posted publicly, which allows researchers to share datasets and increase replicability. In such cases, since users choose to post the videos publicly, privacy concerns are less of an issue than with experimental participants. However, users are also free to remove videos, so the organized archiving of large-scale datasets of this type
presents an important opportunity and challenge for the research community moving forward, as some researchers have begun to explore (Cieri 2014).

While incorporating computational approaches to embodiment surely introduces a number of methodological challenges, we hope this paper has shown these challenges are surmountable and that the payoff – the ability to attend to the physical world in quantitative analyses of interaction – makes it worth confronting them. As Sharma (2016: 335) notes in the introduction to this Series, 'sociolinguists focus on linguistic form but have always known that interaction does more than bring voices into contact. It creates momentary alliances of bodies, strategies, geographies, and various other signals and positionings.' By gaining insight into interaction through the movement and orientation of the body instead of the voices issued from it, we both arrive at a better understanding of interaction and have more solid footing on which to make claims about the ways linguistic behavior varies as a function of interaction.
NOTES

1. Many thanks to Martin Kay for insightful comments on an early version of this paper, the members of the Interactional Sociophonetics Lab for help with data collection and preprocessing, and Robert Xu and Ciyang Qing for editing of the Chinese abstract. The first author gratefully acknowledges the support of the Michelle and Kevin Douglas Stanford Interdisciplinary Graduate Fellowship. This research was supported in part by the NSF via Award #IIS-1159679 and by the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462. Data collection and annotation were supported by a grant from the Roberta Bowman Denning Initiative in the Digital Humanities, awarded to the last author. We also thank two anonymous reviewers for their useful feedback. Any remaining errors are our own.
2. https://www.youtube.com/yt/press/en-GB/statistics.html
3. In spite of this robust statistical evidence at a high level, these trends are not necessarily universally generalizable to every speaker. To check for individual variability, we tried building separate models for each speaker in the dataset; indeed, we found that for each result a small minority of speakers appeared to buck the trend (for instance, a few speakers showed small negative coefficients for pitch, suggesting lower pitch in phrases with higher head cant). Nevertheless, these effects were non-significant; in no case did we find any speaker with a statistically significant effect opposing the results presented here, so without more data we cannot be certain if this possible variability is due to inherent noise in the data or if these speakers are true outliers.
4. We note that the functions of cant discussed throughout this paper are not meant to be taken as exhaustive; indeed, head cant can likely serve a number of other interactional functions such as conveying skepticism, as pointed out by an anonymous reviewer.
REFERENCES

Andersen, Gisle. 1998. The pragmatic marker like from a relevance-theoretic perspective. In Andreas H. Jucker and Yael Ziv (eds.) Discourse Markers: Descriptions and Theory. Amsterdam, The Netherlands: John Benjamins. 147–170.
Androutsopoulos, Jannis. 2010. Localizing the global on the participatory web. In Nikolas Coupland (ed.) The Handbook of Language and Globalization. Chichester, U.K.: John Wiley and Sons. 203–231.
Barsalou, Lawrence W. 2008. Grounded cognition. Annual Review of Psychology 59: 617–645.
Bates, Douglas, Martin Maechler, Ben Bolker and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67: 1–48.
Bee, Nikolaus, Stefan Franke and Elisabeth André. 2009. Relations between facial display, eye gaze and head tilt: Dominance perception variations of virtual agents. Paper presented at the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, IEEE, 10–12 September, De Rode Hoed, Amsterdam, The Netherlands.
Biel, Joan-Isaac and Daniel Gatica-Perez. 2013. The YouTube lens: Crowd-sourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia 15: 41–55.
Birdwhistell, Ray L. 1952. Introduction to Kinesics: An Annotation System for Analysis of Body Motion and Gesture. Louisville, Kentucky: University of Louisville.
Birdwhistell, Ray L. 1970. Kinesics and Context. Philadelphia, Pennsylvania: University of Pennsylvania Press.
Boersma, Paul and David Weenink. 2015. Praat: Doing phonetics by computer [computer program]. Version 6.0.08. Available at http://www.praat.org/
Burgess, Jean and Joshua Green. 2013. YouTube: Online Video and Participatory Culture. Chichester, U.K.: John Wiley & Sons.
Cieri, Christopher. 2014. Challenges and opportunities in sociolinguistic data and metadata sharing. Language and Linguistics Compass 8: 472–485.
Clark, Herbert H. and Susan E. Brennan. 1991. Grounding in communication. Perspectives on Socially Shared Cognition 13: 127–149.
Clark, Herbert H. and Jean E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition 84: 73–111.
Costa, Marco, Marzia Menzani and Pio Enrico Ricci Bitti. 2001. Head canting in paintings: An historical study. Journal of Nonverbal Behavior 25: 63–73.
Cvejic, Erin, Jeesun Kim and Chris Davis. 2010. Prosody off the top of the head: Prosodic contrasts can be discriminated by head motion. Speech Communication 52: 555–564.
Dhall, Abhinav, Roland Goecke, Jyoti Joshi, Karan Sikka and Tom Gedeon. 2014. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In Proceedings of the 16th International Conference on Multimodal Interaction. New York: Association for Computing Machinery. 461–466.
D'Onofrio, Annette, Katherine Hilton and Teresa Pratt. 2013. Creaky voice across discourse contexts: Identifying the locus of style for creak. Paper presented at New Ways of Analyzing Variation 42, 10–14 October, Carnegie Mellon University, Pittsburgh, Pennsylvania.
Duman, Steve and Miriam A. Locher. 2008. 'So let's talk. Let's chat. Let's start a dialog': An analysis of the conversation metaphor employed in Clinton's and Obama's YouTube campaign clips. Multilingua – Journal of Cross-Cultural and Interlanguage Communication 27: 193–230.
Eckert, Penelope. 2016. Variation, meaning and social change. In Nikolas Coupland (ed.) Sociolinguistics: Theoretical Debates. Cambridge, U.K.: Cambridge University Press. 68–85.
Fisher, Ronald A. 1922. On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society 85: 87–94.
Fox Tree, Jean E. 2007. Folk notions of um and uh, you know, and like. Text & Talk – An Interdisciplinary Journal of Language, Discourse Communication Studies 27: 297–314.
Frobenius, Maximiliane. 2014. Audience design in monologues: How vloggers involve their viewers. Journal of Pragmatics 72: 59–72.
Girshick, Ross, Jeff Donahue, Trevor Darrell and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Computer Society Conference Publishing Services. 580–587.
Glenberg, Arthur M. and Michael P. Kaschak. 2003. The body's contribution to language. Psychology of Learning and Motivation 43: 93–126.
Goffman, Erving. 1974. Frame Analysis: An Essay on the Organization of Experience. Cambridge, Massachusetts: Harvard University Press.
Goffman, Erving. 1979. Gender Advertisements. New York: Harper & Row.
Goffman, Erving. 1981. Forms of Talk. Philadelphia, Pennsylvania: University of Pennsylvania Press.
Grammer, Karl. 1990. Strangers meet: Laughter and nonverbal signs of interest in opposite-sex encounters. Journal of Nonverbal Behavior 14: 209–236.
Griffith, Maggie and Zizi Papacharissi. 2009. Looking for you: An analysis of video blogs. First Monday 15.
Hadar, Uri, T. J. Steiner, E. C. Grant and F. Clifford Rose. 1983. Kinematics of head movements accompanying speech during conversation. Human Movement Science 2: 35–46.
Harley, Dave and Geraldine Fitzpatrick. 2009. Creating a conversational context through video blogging: A case study of Geriatric1927. Computers in Human Behavior 25: 679–689.
Jefferson, Gail. 1984. Notes on a systematic deployment of the acknowledgement tokens 'yeah'; and 'mm hm'. Papers in Linguistics 17: 197–216.
Jeon, Je Hun, Rui Xia and Yang Liu. 2010. Level of interest sensing in spoken dialog using multi-level fusion of acoustic and lexical evidence. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010). Makuhari, Japan: International Speech Communication Association. 2806–2809.
Kang, Mee-Eun. 1997. The portrayal of women's images in magazine advertisements: Goffman's gender analysis revisited. Sex Roles 37: 979–996.
Kazemi, Vahid and Josephine Sullivan. 2014. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Kendon, Adam. 1994. Do gestures communicate? A review. Research on Language and Social Interaction 27: 175–200.
Kendon, Adam. 1995. Gestures as illocutionary and discourse structure markers in Southern Italian conversation. Journal of Pragmatics 23: 247–279.
Kendon, Adam. 2002. Some uses of the head shake. Gesture 2: 147–182.
Kendon, Adam. 2004. Gesture: Visible Action as Utterance. Cambridge, U.K.: Cambridge University Press.
Kim, Minyoung, Sanjiv Kumar, Vladimir Pavlovic and Henry Rowley. 2008. Face tracking and recognition with visual constraints in real-world videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008). New York: IEEE Conference Publications. 1–8.
King, Davis E. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10: 1755–1758.
Kollock, Peter and Marc Smith (eds.). 2002. Communities in Cyberspace. London/New York: Routledge.
Krahmer, Emiel and Marc Swerts. 2005. How children and adults produce and perceive uncertainty in audiovisual speech. Language and Speech 48: 29–53.
Kress, Gunther R. and Theo Van Leeuwen. 2001. Multimodal Discourse: The Modes and Media of Contemporary Communication. New York: Oxford University Press.
Krizhevsky, Alex, Ilya Sutskever and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25: 1106–1114.
Kuhnke, Elizabeth. 2012. Body Language for Dummies. Chichester, U.K.: John Wiley & Sons.
Kuznetsova, Alexandra, Per Bruun Brockhoff and Rune Haubo Bojesen Christensen. 2013. lmerTest: Tests for random and fixed effects for linear mixed effect models (lmer objects of lme4 package). R package version 2(6).
Labov, William. 2001. Principles of Linguistic Change, II: Social Factors. Malden, Massachusetts: Blackwell.
Lambertz, Kathrin. 2011. Back-channelling: The use of yeah and mm to portray engaged listenership. Griffith Working Papers in Pragmatics and Intercultural Communication 4: 11–18.
Lausberg, Hedda and Han Sloetjes. 2009. Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods 41: 841–849.
Lee, Sinae. 2015. Creaky voice as a phonational device marking parenthetical segments in talk. Journal of Sociolinguistics 19: 275–302.
Loehr, Daniel P. 2012. Temporal, structural, and pragmatic synchrony between intonation and gesture. Laboratory Phonology 3: 71–89.
Matlock, Teenie, Michael Ramscar and Lera Boroditsky. 2003. The experiential basis of meaning. In Richard Alterman and David Kirsh (eds.) Proceedings of the Twenty-fifth Annual Conference of the Cognitive Science Society. Mahwah, New Jersey: Lawrence Erlbaum. 792–797.
McClave, Evelyn Z. 2000. Linguistic functions of head movements in the context of speech. Journal of Pragmatics 32: 855–878.
McFarland, Daniel A., Dan Jurafsky and Craig Rawlings. 2013. Making the connection: Social bonding in courtship situations. American Journal of Sociology 118: 1596–1649.
McKenna, Stephen J., Sumer Jabri, Zoran Duric, Azriel Rosenfeld and Harry Wechsler. 2000. Tracking groups of people. Computer Vision and Image Understanding 80: 42–56.
McNeill, David. 1992. Hand and Mind: What Gestures Reveal about Thought. Chicago, Illinois: University of Chicago Press.
McNeill, David. 2008. Gesture and Thought. Chicago, Illinois: University of Chicago Press.
Mendoza-Denton, Norma and Stefanie Jannedy. 2011. Semiotic layering through gesture and intonation: A case study of complementary and supplementary multimodality in political speech. Journal of English Linguistics 39: 265–299.
Mignault, Alain and Avi Chaudhuri. 2003. The many faces of a neutral face: Head tilt and perception of dominance and emotion. Journal of Nonverbal Behavior 27: 111–132.
Mills, Janet. 1984. Self-posed behaviors of females and males in photographs. Sex Roles 10: 633–637.
Mondada, Lorenza. 2014. Bodies in action: Multimodal analysis of walking and talking. Language and Dialogue 4: 357–403.
Mondada, Lorenza. 2016. Challenges of multimodality: Language and the body in social interaction. Journal of Sociolinguistics 20: 336–366.
Murphy-Chutorian, Erik and Mohan Manubhai Trivedi. 2009. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 31: 607–626.
Nevile, Maurice. 2015. The embodied turn in research on language and social interaction. Research on Language and Social Interaction 48: 121–151.
Ochs, Elinor. 1992. Indexing gender. In Alessandro Duranti and Charles Goodwin (eds.) Rethinking Context: Language as an Interactive Phenomenon. Cambridge, U.K.: Cambridge University Press. 335–358.
Pellegrini, Stefano, Andreas Ess and Luc Van Gool. 2010. Improving data association by joint modeling of pedestrian trajectories and groupings. In European Conference on Computer Vision. Heidelberg, Germany: Springer Berlin Heidelberg. 452–465.
Poria, Soujanya, Erik Cambria, Newton Howard, Guang-Bin Huang and Amir Hussain. 2016. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing 174: 50–59.
Rautaray, Siddharth S. and Anupam Agrawal. 2015. Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review 43: 1–54.
Rosenfelder, Ingrid, Joe Fruehwald, Keelan Evanini and Jiahong Yuan. 2011. FAVE (Forced Alignment and Vowel Extraction) Program Suite. Available at http://fave.ling.upenn.edu
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma and Zhiheng Huang. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115: 211–252.
Scherer, Klaus R. 2003. Vocal communication of emotion: A review of research paradigms. Speech Communication 40: 227–256.
Schiffrin, Deborah. 1987. Discourse Markers. Cambridge, U.K.: Cambridge University Press.
Schuller, Björn, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian A. Müller and Shrikanth S. Narayanan. 2010. The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010). Makuhari, Japan: International Speech Communication Association. 2795–2798.
Shan, Caifeng. 2012. Smile detection by boosting pixel differences. IEEE Transactions on Image Processing 21: 431–436.
Sharma, Devyani. 2016. Series introduction. Journal of Sociolinguistics 20: 335.
Shen, Jie, Stefanos Zafeiriou, Grigorios G. Chrysos, Jean Kossaifi, Georgios Tzimiropoulos and Maja Pantic. 2015. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1003–1011.
Suarez, Jesus and Robin R. Murphy. 2012. Hand gesture recognition with depth images: A review. In 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers. 411–417.
Tang, Siyu, Mykhaylo Andriluka and Bernt Schiele. 2014. Detection and tracking of occluded people. International Journal of Computer Vision 110: 58–69.
Tannen, Deborah and Cynthia Wallat. 1987. Interactive frames and knowledge schemas in interaction: Examples from a medical examination/interview. Social Psychology Quarterly 1: 205–216.
Trouvain, Jürgen and William J. Barry. 2000. The prosody of excitement in horse race commentaries. In R. Cowie, E. Douglas-Cowie and M. Schröder (eds.) Proceedings of the International Speech Communication Association Workshop on Speech and Emotion. Belfast, Ireland: Textflow. 86–91.
Viola, Paul and Michael Jones. 2001. Rapid object detection using a boosted cascade of simple features. In CVPR 2001: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers. 511–518.
Voigt, Rob, Robert J. Podesva and Dan Jurafsky. 2014. Speaker movement correlates with prosodic indicators of engagement. Speech Prosody 7.
Wang, William Yang and Julia Hirschberg. 2011. Detecting levels of interest from spoken dialog with multistream prediction feedback and similarity based hierarchical fusion learning. In Proceedings of the SIGDIAL 2011 Conference. Stroudsburg, Pennsylvania: Association for Computational Linguistics. 152–161.
Wöllmer, Martin, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28: 46–53.
Zimman, Lal. 2015. Creak as disengagement: Gender, affect, and the iconization of voice quality. Paper presented at New Ways of Analyzing Variation 44, 22–25 October, Toronto, Canada.
Address correspondence to:

Rob Voigt
Stanford University – Linguistics Department
Margaret Jacks Hall
Building 460
Stanford, CA 94305
U.S.A.
robvoigt@stanford.edu
APPENDIX: Cant and prosody regression results

Full regression results table for the mixed-effects model described in section 4. Head cant, pitch, and loudness are phrase-level means of values z-scored by speaker. Reference levels are a gender of female and a Laboratory interactional context.
Head cant angle (z-scale)

Fixed effect                       B       CI            p
(Intercept)                        0.73    0.69 – 0.78   <.001
Pitch mean (z-scale)               0.06    0.04 – 0.08   <.001
Loudness mean (z-scale)            0.01    0.04 – 0.02   .584
Gender (M)                         0.08    0.15 – 0.00   .049
Context (Vlog)                     0.01    0.04 – 0.06   .674
Log duration                       0.01    0.03 – 0.00   .057
Pitch mean * gender (M)            0.04    0.08 – 0.00   .032
Pitch mean * context (Vlog)        0.02    0.01 – 0.05   .169
Pitch mean * log duration          0.03    0.02 – 0.05   <.001
Loudness mean * gender (M)         0.08    0.03 – 0.14   .003
Loudness mean * context (Vlog)     0.11    0.17 – 0.05   <.001
Loudness mean * log duration       0.02    0.00 – 0.03   .107

Random effect
σ2                                 0.317
τ00, case                          0.003
Ncase                              67
ICCcase                            0.010
Observations                       17,533
R2 / Ω02                           .019 / .019
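For readers who want to fit a model of this general shape to their own data, the sketch below estimates a comparable mixed-effects regression (a by-speaker random intercept plus the fixed effects and interactions listed above) on synthetic stand-in data using Python's statsmodels. The paper cites lme4 (Bates et al. 2015), so this is an analogue rather than the authors' original specification; the column names and simulated values are placeholders, not the study's data.

```python
# A rough Python analogue (not the authors' R code) of a mixed-effects model
# with the fixed effects listed in the appendix table and a random intercept
# by speaker, fit on synthetic placeholder data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "cant": rng.normal(size=n),              # phrase-level mean head cant (z)
    "pitch": rng.normal(size=n),             # phrase-level mean pitch (z)
    "loudness": rng.normal(size=n),          # phrase-level mean loudness (z)
    "log_dur": rng.normal(size=n),           # log phrase duration
    "gender": rng.choice(["F", "M"], size=n),
    "context": rng.choice(["Lab", "Vlog"], size=n),
    "speaker": rng.integers(0, 67, size=n),  # 67 speakers, as in the appendix
})

model = smf.mixedlm(
    "cant ~ pitch * gender + pitch * context + pitch * log_dur"
    " + loudness * gender + loudness * context + loudness * log_dur",
    data=df,
    groups=df["speaker"],                    # random intercept by speaker
)
print(model.fit().summary())
```

With string-coded factors, the default reference levels here ('F' and 'Lab') match the female and Laboratory baselines noted above.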