Towards Context-Aware Automatic Haptic Effect Generation For Home Theatre Environments

ABSTRACT
The application of haptic technology in entertainment systems, such as Virtual Reality and 4D cinema, enables novel experiences for users and drives the demand for efficient haptic authoring systems. Here, we propose an automatic multimodal vibrotactile content creation pipeline that substantially improves the overall hapto-audiovisual (HAV) experience based on contextual audio and visual content from movies. Our algorithm is implemented on a low-cost system with nine actuators attached to a viewing chair and extracts significant features from video files to generate corresponding haptic stimuli. We implemented this pipeline and used the resulting system in a user study (n = 16), quantifying user experience according to the sense of immersion, preference, harmony, and discomfort. The results indicate that the haptic patterns generated by our algorithm complement the movie content and provide an immersive and enjoyable HAV user experience. This further suggests that the pipeline can facilitate the efficient creation of 4D effects and could therefore be applied to improve the viewing experience in home theatre environments.

CCS CONCEPTS
• Human-centered computing → Haptic devices; Mixed / augmented reality; User studies; Virtual reality.

KEYWORDS
Haptics, 4D effect generation, automatic haptic effect authoring, home theatre, immersive experience

ACM Reference Format:
Yaxuan Li, Yongjae Yoo, Antoine Weill-Duflos, and Jeremy R. Cooperstock. 2021. Towards Context-aware Automatic Haptic Effect Generation for Home Theatre Environments. In 27th ACM Symposium on Virtual Reality Software and Technology (VRST '21), December 8–10, 2021, Osaka, Japan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3489849.3489887

∗ Both authors contributed equally to this research.

1 INTRODUCTION
The advent of four-dimensional (4D) theatre technology has had a profound effect on viewer experience. Vestibular (motion) feedback, along with bursts of air, mist effects, thermal stimuli, and vibrotactile effects, are now commonly employed in immersive 4D movie-viewing environments. Such components enrich the movie watching experience through enhanced realism, increased enjoyment, and greater immersion [17]. There is a growing need for content that can take advantage of the 4D capabilities of movie theatres with such components, without necessitating undue production costs.
This demand has likely grown as a result of the COVID-19 pandemic, which has further pushed consumption of cinema content to the home theatre environment, supported by a variety of streaming services such as Netflix, Amazon Prime, and Disney+. For home theatre setups, an inexpensive, but highly limited, option is the use of haptic actuators installed in chairs to directly convert or filter the audio track to generate vibrotactile feedback. Although semi-automatic generation of richer haptic effects for 4D cinema is being explored in both academia and industry, this process generally remains mostly manual, dependent on skilled industry experts, time-consuming, and costly [38]. The typical process takes three designers approximately 16 days to produce the effects for a feature-length 4D film. This cost obviously impacts the creation and distribution of 4D movies, and motivates exploration of more efficient authoring methods.
Figure 1: Overview of the haptic effects generation algorithm. With movie content as input, the fusion of psychoacoustic
measurements indicating human perception of sound determines vibration intensity. In parallel, the saliency map suggests
the event location in the scene, and assigns weights to actuators for the generation of the spatiotemporal vibrotactile effects.
The contribution of this paper is our automatic haptic effect generation algorithm, designed primarily for the needs of the 4D home theatre content experience, along with a user study (n = 16) we conducted to evaluate the performance of the proposed algorithm on six movie clips from different genres, including sci-fi, action, cartoon, family-comedy, and horror/thriller. Our algorithm, illustrated in Figure 1, generates haptic effects from an analysis of the audiovisual stream. The stream conveys information about the high-level context of the scene such as tonal properties (a composite of the mood, narrative, and specific genre features); location of events and characters; dynamics of the atmosphere; and intensity of action. We used metrics that reflect salient perceptual characteristics of the audio and visual modalities. These include four psychoacoustic parameters, used in the audio analysis to determine the intensity of vibration (Figure 1(b)), and estimates of visual saliency for every frame (Figure 1(c)), which are used to determine the spatiotemporal distribution of vibrotactile effects (Figure 1(d)). The resulting haptic effects are delivered through a vibrotactile chair with nine actuators (Figure 1(e)).

2 RELATED WORK
2.1 Haptics for Enhancing the Viewing Experience
Inclusion of multisensory information has been demonstrated to greatly enhance user experience in terms of immersion, flow, absorption, and engagement in a range of entertainment activities including virtual and augmented reality and gaming [12, 24, 35, 57, 77, 79]. There is growing interest in applying similar effects in 4D film, enhancing the cinematic viewing experience beyond stereoscopic (3D) cinema.
Today, numerous companies, including D-Box Technologies, MediaMation (MX4D), and CJ 4D Plex (4DX), manufacture 4D movie platforms for theatres and theme parks. Their hardware provides haptic effects including leg tickling, vestibular (motion) effects, air pressure, and thermal stimulation. With respect to the inclusion of haptic effects, Danieau et al. presented a perceptually organized review of HAV devices [17], describing the three types of haptic feedback: 1) tactile feedback with temperature stimuli [20, 51], vibration [2, 33, 36, 40, 51, 63, 67], and pressure [3, 4, 25, 60]; 2) kinesthetic feedback resulting from limb position movement [22, 34] and force [7, 14, 15, 34, 50]; and 3) proprioception or vestibular feedback arising from body motion [15, 56]. Gaw et al. and Delazio et al. demonstrated the feasibility of using force feedback to increase viewers' ability to understand the presented media content [18, 22], and Dionisio et al. highlighted the effectiveness of thermal stimuli in VR [20].

2.2 Authoring of Vibrotactile Stimuli
To achieve a high-quality overall experience, it is essential to create haptic effects that match the audiovisual content of the scene. At present, this relies predominantly on the intuition and experience of expert designers, working manually with effects authoring platforms; for example, Macaron [55], a web-based vibrotactile effects editor, Vibrotactile Score [39], based on the composition of patterns of musical scores, H-Studio [13], Immersion's Haptic Studio¹, HFX Studio [16], posVibEditor [53], and SMURF [32].
As noted previously, use of such haptic authoring tools tends to be costly and time-consuming. To address this challenge, automatic haptic effects generation methods have been developed, which typically produce haptics from audio properties. An early effort by Chafe tried to combine the vibrotactile sensation with musical instruments [8]. Similarly, Chang and O'Sullivan used low-frequency components, extracted from audio files, to provide vibration on a mobile phone using a multi-function transducer as an actuator [9]. For the creation of vibrotactile effects in games, Chi et al. found that simple mapping of the audio according to frequency bands resulted in excessive ill-timed vibrations, causing feelings of haptic numbness and annoyance [11]. To avoid this problem, they established key moments when vibrations should occur by identifying target sounds in real time using an acoustic similarity measurement. Contrary to other researchers, who employed frequency bands and filters to extract suitable audio features for haptic conversion [29, 47], Lee and Choi [37] translated auditory signals to vibrotactile feedback by bridging the perceptual characteristics of loudness and roughness.

¹ https://www.immersion.com/technology/#hapticdesign-tools
Another direction for haptic authoring employs cues from the visual stream to drive the generation of haptic effects. For example, Rehman et al. took the graphical display of a soccer game, divided the field into five areas, and mapped each to an area-based vibration pattern to indicate the position of the ball [63]. Similarly, Lee et al. mapped the trace of a soccer ball to vibration effects on an array of 7 × 10 actuators mounted on the forearm, enabling a spatiotemporal vibrotactile mapping of the location of the ball in the game [36]. Moreover, the use of computer vision and physical motion modeling techniques has been investigated to extract key elements of the scene or estimate camera motion from first-person videos. These cues are applied to create corresponding haptic effects, including vestibular (motion) [38], force [15], and vibrotactile feedback [30, 31, 56]. More recent research has demonstrated impressive performance using Generative Adversarial Networks to produce tactile signals from images of different materials [42, 43, 64, 65].

3 AUTOMATIC HAPTIC EFFECT AUTHORING ALGORITHM
While the research described in the previous section endeavors to generate automatic vibrotactile effects based on a mapping from audio or visual contents, we propose instead a multimodal algorithm that integrates the contextual information from both audio and visual streams. We also take into account the perceptual psychology of the expected audience to generate reasonable patterns that align automatically with the movie contents.
Our pipeline employs four acoustic measures (Section 3.1) and the visual saliency map (Section 3.2). For the former, we use the psychoacoustic parameters of sharpness, booming, low-frequency energy, and loudness to quantify the perceptual quality of the movie's audio track. The timing and amplitude of the vibrotactile actuation are determined from these auditory features. For example, gunshot sounds in a fight scene or booming sounds in racing or piloting scenes create an atmosphere of intense emotions and, ideally, increase the immersion of the audience. For the latter, we estimate the saliency map from the visual stimuli for each frame, which identifies the region of expected visual interest [48]. As shown in a previous study [30], the map can provide effective location and direction information for vibrotactile rendering; however, it cannot automatically determine the magnitude of the haptic parameters. Finally, we integrate the information from both audio and visual features to generate the vibrotactile effects. These are presented through an array of nine vibration motors installed in a chair. Although designed for spatiotemporal vibrotactile feedback in this study, our pipeline for automated haptic effects authoring could likewise be applied to other haptic modalities such as temperature, airflow, or impact.

The importance of sound cues to human attention motivated the work of Evangelopoulos et al., who formulated measures of audiovisual saliency by modelling perceptual and computational attention to such cues [21]. Huang and Elhilali further demonstrated that auditory salience is also context dependent [26], since the salient components vary according to specific characteristics of the scene. Accordingly, to achieve an effective multimedia user experience, haptic effects created from the auditory modality should consider characteristics of the source content, such as the genre of material [27, 56]. The generation of haptic effects that are congruent to the given audiovisual contents has been studied for decades. A common approach is the extraction of bass and treble components from the low-frequency component of a music signal [9] to map to different frequencies [27]. Such techniques based on audio attributes have been employed to create meaningful and enjoyable haptic effects in many apps and games, using tools such as Apple Core Haptics and Lofelt Studio.
In this study, we emphasized achieving a high-level match between haptic effects and movie events to improve the experience of media content consumption. We attempted to achieve this through the use of four psychoacoustic measures: low-frequency energy, loudness, sharpness, and booming.
In terms of vibrotactile perception, humans are most sensitive between 160 and 250 Hz [66]. Following the approach of Hwang et al. [27], we consider frequencies up to 215 Hz as the borderline bass-band, as this corresponds approximately to the maximum frequency of the motor we use. We calculate the low-frequency energy value as the integral of the amplitudes of the frequency components up to 215 Hz.
Loudness is a well-known perceptual measure, indicating the perceived intensity of sound pressure. It is calculated on the decibel scale as a weighted integration of the spectral loudness N(f) for frequency f,

    L = \int_{0.3\,\mathrm{Bark}}^{24\,\mathrm{Bark}} N(f)\,df, \qquad N(f) = 20 \log_{10}\!\left(\frac{1}{W_k} I(f)\right).    (1)

W_k is a weighting factor, extracted from the equal-loudness contour of the ISO 532 standard on loudness [1]. I(f) is the intensity of the sound component at frequency f.
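To make the audio analysis step concrete, the following NumPy sketch computes the low-frequency energy and a loudness-style weighted log-spectrum for a single audio frame. It is a minimal sketch under stated assumptions: the flat placeholder for the ISO 532 weights W_k, the Hann window, and the clipping of negative N(f) contributions are our own simplifications rather than the paper's implementation; sharpness and booming would follow the same pattern, with high- or low-frequency-emphasizing weights and a division by L.

```python
import numpy as np

def audio_frame_measures(frame, sample_rate, weights=None, bass_cutoff_hz=215):
    """Per-frame audio measures: low-frequency energy up to 215 Hz and a
    loudness-style weighted sum of the log-spectrum, in the spirit of Eq. (1)."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))                  # |I(f)|
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Low-frequency energy: integral of spectral amplitudes up to the cutoff.
    bass = freqs <= bass_cutoff_hz
    low_freq_energy = np.trapz(spectrum[bass], freqs[bass])

    # Loudness-style measure: N(f) = 20 log10(I(f) / W_k), integrated over f.
    if weights is None:
        weights = np.ones_like(spectrum)                      # placeholder for W_k
    n_f = 20.0 * np.log10(spectrum / weights + 1e-12)
    loudness = np.trapz(np.clip(n_f, 0.0, None), freqs)       # ignore negative N(f)

    return low_freq_energy, loudness
```

These per-frame values are the quantities that the intensity mapping described later in the pipeline consumes.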
Sharpness and booming are also weighted summations of spectral energy in the frequency domain, emphasizing high- and low-frequency components, respectively. For example, a glass-breaking sound would exhibit high sharpness and low booming, while the engine sound of a sports car would be the opposite. These values can be derived as the weighted integral of the spectral loudness N(f) divided by the loudness L. Sharpness S is calculated as follows:
… with actuators A5, A6, and A9 is shown in Figure 6. Next, we assigned a weight w_{\triangle_i}^{A_{n_k}} to A_{n_k} by

    w_{\triangle_i}^{A_{n_k}} = \frac{\alpha_{\triangle_i}^{A_{n_k}}}{\alpha_{\triangle_i}^{A_{n_1}} + \alpha_{\triangle_i}^{A_{n_2}} + \alpha_{\triangle_i}^{A_{n_3}}}, \quad k \in \{1, 2, 3\},    (4)

where A_{n_k}, k \in \{1, 2, 3\}, are the three adjacent vertices of triangle \triangle_i. Then, for each actuator A_n, we summed the weights w_{\triangle_i}^{A_n} contributed by its adjacent triangles to obtain W_{A_n}. As shown in Figure 7, four weights contribute to W_{A_6} of actuator A6: w_{\triangle_5}^{A_6}, w_{\triangle_6}^{A_6}, w_{\triangle_{12}}^{A_6}, and w_{\triangle_{13}}^{A_6}, each obtained from Equation 4. We then normalized the weights of each actuator to generate the rendering weights \beta_n, n \in \{1, 2, ..., 9\}.
Finally, we obtained the overall vibration intensity for the current frame by linearly mapping the psychoacoustic measurement values described in Section 3.1 to the actuators' amplitudes, multiplied by the rendering weight \beta_n of each actuator. As a result, the location of vibration follows the salient region: when the user's visual attention rests on the left side of the movie scene, the corresponding vibrotactile effects, with intensity derived from the audio analysis and generated using the above mapping, are presented on the left side of the body, as seen in Figure 5.
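The weighting and mixing step can be sketched as follows. This is a minimal illustration rather than the study's code: the triangle list over the nine actuators and the per-vertex saliency scores α (both defined in Section 3.2, which is omitted here) are treated as given inputs, and the hypothetical frame_amplitudes helper simply applies the linear scaling described above.

```python
import numpy as np

def rendering_weights(triangles, alphas, n_actuators=9):
    """Apply Eq. (4) within each triangle, accumulate the contributions of the
    triangle's three vertex actuators, then normalize over all nine actuators
    to obtain the rendering weights beta_1..beta_9.

    triangles: list of 3-tuples of actuator indices (the triangle vertices).
    alphas:    matching list of the three alpha scores for each triangle.
    """
    W = np.zeros(n_actuators)
    for (n1, n2, n3), alpha in zip(triangles, alphas):
        alpha = np.asarray(alpha, dtype=float)
        if alpha.sum() > 0:
            W[[n1, n2, n3]] += alpha / alpha.sum()   # Eq. (4)
    total = W.sum()
    return W / total if total > 0 else W             # beta_n

def frame_amplitudes(audio_intensity, triangles, alphas):
    """Per-frame actuator amplitudes: the psychoacoustic intensity from the
    audio analysis (Section 3.1), scaled by the spatial rendering weights."""
    return audio_intensity * rendering_weights(triangles, alphas)
```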
We built the hardware system with nine eccentric rotating mass (ERM) vibrotactile actuators (Seeed Studio RB-See-403, ϕ = 10 mm) controlled by a Teensy 3.2 microcontroller connected to the media player computer. The actuators are embedded in the chair as illustrated in Figure 8. Vibration amplitudes range from 1.77 to 5.31 g.

Figure 8: Hardware implementation and experiment setup.
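The link between the media player computer and the microcontroller can be as simple as streaming one small packet of target amplitudes per video frame. The sketch below is an assumed design for illustration only: the serial port name, the 0xFF sync framing, and the nine-byte payload are not taken from the paper, and pyserial on the host is likewise an assumption.

```python
import serial  # pyserial

PORT, BAUD, SYNC = "/dev/ttyACM0", 115200, 0xFF   # assumed port and framing

def send_frame(link, amplitudes):
    """Send one video frame's actuator levels (nine values in 0..1) as a sync
    byte followed by nine PWM bytes in 0..254, keeping 0xFF unique as sync."""
    payload = bytes(min(254, max(0, round(a * 254))) for a in amplitudes)
    link.write(bytes([SYNC]) + payload)

if __name__ == "__main__":
    with serial.Serial(PORT, BAUD, timeout=1) as link:
        send_frame(link, [0.0] * 9)   # e.g., switch all motors off
```

On the microcontroller side, matching firmware would read the ten bytes and write each amplitude to a PWM output driving one ERM motor.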
5 USER STUDY
This section describes the user study for evaluating the effectiveness of our haptic effects generation algorithm, employing the hardware implementation described in Section 4. In the user study, we compared participants' subjective ratings of the haptic effects generated under different conditions with several movie clips.

5.1 Participants
We recruited 16 participants for the experiment (20-34 years old, 9 male, 7 female). No participants reported any known sensory disorders that could affect their auditory, visual, or haptic perception. They were also asked to wear thin T-shirts to allow for better perception of vibrotactile stimuli on their torso. Each participant was compensated approximately 8 USD for their participation.

5.2 Stimuli
We tested four different haptic rendering conditions, described in Table 1. Six short (approximately one-minute) movie clips were selected, including scenes with fighting, shooting, chasing, or exploding. They were trimmed from four different movies: Alita: Battle Angel (2019), The Boss Baby (2017), I Am Legend (2007), and Escape Room (2019). Table 2 summarizes the movie clips, their lengths, and the components in each scene. In total, 24 combinations of the six movie clips and four rendering conditions were presented to each participant. Each of the six movie clips was treated as a block, and within each block, the presentation order of the haptic effects was randomized.

Table 1: Four conditions of haptic effect generation.

Condition  | Description
Random     | Vibration effects were generated randomly and presented throughout the video clips.
Audio      | Vibration effects were generated from the audio parameters described in Section 3.1 only, and presented through actuators A1, A2, and A5.
Visual     | Vibration effects were presented in the salient area as described in Section 3.2, without considering the audio variables. All vibrotactile actuators corresponding to the salient area of the screen vibrated.
Multimodal | The full pipeline described in Sections 3 and 4 was used to generate the vibration effects.

Table 2: Six movie clips used in the user study, with their lengths and the components in each scene.

# | Movie clip                          | Length | Components in the scene
1 | Alita: Battle Angel (2019) [Clip A] | 58 s   | Grappling, Melee fighting
2 | Alita: Battle Angel (2019) [Clip B] | 62 s   | Brawling, Crashing, Tasering
3 | The Boss Baby (2017)                | 60 s   | Chasing, Fighting, Laughing, Giggling, Crunching
4 | I Am Legend (2007) [Clip A]         | 62 s   | Gunshots, Slamming, Crashing, Screaming
5 | I Am Legend (2007) [Clip B]         | 73 s   | Gunshots, Barking, Snarling, Growling
6 | Escape Room (2019)                  | 74 s   | Whooshing, Exploding, Crawling

5.3 Methods and Procedure
The user study was conducted in a recording studio at the authors' institution, depicted in Figure 8. Participants sat approximately 45-55 cm in front of a 21-inch monitor and wore headphones (Sony WH-H900N) to hear the movie clip sounds and to mask the faint noises produced by the vibration motors. The participants were first asked to read through the consent form and listen to the explanation of the experiment. All the participants agreed and signed the consent form, and were then asked to sit on the vibrotactile chair. Next, the experimenter confirmed that the participant could easily detect vibrations from each of the actuators. The experimenter also confirmed that the intensity of the vibration was not excessive, as this might cause discomfort throughout the experiment.
A short training session was then given to the participants. They experienced a 20-second movie clip cropped from Alita: Battle Angel. The clip was different from the movie clips used in the main session, introduced in Table 2. The four haptic conditions described in Table 1 were presented along with the movie.
In every trial of the main session, the participants watched the movie while perceiving the haptic effects. They were then given a questionnaire (Table 3) to evaluate their experience with the haptic rendering in terms of Immersion, Preference, Harmony, and Discomfort, using Likert scales ranging from 0 (strongly disagree) to 10 (strongly agree). After completing the questionnaire, the participants took a short break of at least one minute to avoid fatigue and adaptation.

Table 3: Questionnaire for assessing the haptic-audio-visual experience.

Immersion  | The haptic effects make me more immersed in the movie.
Preference | I liked the experience of vibration sensations while watching this movie clip.
Harmony    | The haptic effects are consistent with the content of the scene.
Discomfort | The haptic effects are uncomfortable for me.

The entire procedure took approximately one hour per participant, and was carried out under approval of the McGill University Research Ethics Board (REB # 21-02-023), in compliance with regional COVID-19 restrictions. This included paying attention to sanitary measures, wiping down the apparatus after each use, and participants and the experimenter wearing face masks while maintaining a 2 m distance.

5.4 Data Analysis
We initially averaged the 16 participants' ratings for each of the 24 combinations for a simple comparison by plots (Figure 9). For the statistical analysis of the results, we conducted a two-way ANOVA, one-way ANOVAs, and Tukey's HSD post-hoc tests for each of the four subjective ratings to assess the significance of the effects.
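As an illustration, this analysis could be reproduced from a long-format table of the ratings with statsmodels, as sketched below; the file name and column names (participant, rendering, movie, score) are assumptions, and the same calls would be repeated for each of the four subjective ratings.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One row per participant x rendering condition x movie clip (16 x 4 x 6 rows).
ratings = pd.read_csv("ratings.csv")  # columns: participant, rendering, movie, score

# Two-way repeated-measures ANOVA with rendering and movie as within factors.
anova = AnovaRM(ratings, depvar="score", subject="participant",
                within=["rendering", "movie"]).fit()
print(anova)

# Tukey's HSD post-hoc comparison of the four rendering conditions.
print(pairwise_tukeyhsd(ratings["score"], ratings["rendering"]))
```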
6 RESULTS
6.1 ANOVA Analysis
We performed a two-way analysis of variance (ANOVA) for immersion, preference, harmony, and discomfort to understand the effects of the haptic rendering method and the movie content. Table 4 summarizes the statistical analyses. For all four variables, the haptic rendering method significantly affected the users' evaluations, with large effect sizes (η² near 0.5) for immersion, preference, and harmony. The effects of the movie clip's content were also significant for the immersion, preference, and harmony scores, despite having small effect sizes. Interestingly, for the discomfort rating, the effect size of the haptic rendering factor drastically decreased, and the movie's effect turned out to be insignificant. The interaction terms were not significant, meaning the effects of the different haptic rendering methods were not affected by the different movie contents.

Table 4: Two-way ANOVA results for the four subjective ratings.

Measure    | Factor      | Statistics                   | Effect size (η²)
Immersion  | Rendering   | F(3, 45) = 138.52, p < .001  | 0.512
Immersion  | Movie       | F(5, 75) = 3.47, p = .004    | 0.021
Immersion  | Interaction | F(15, 225) = 1.25, p = .232  | 0.023
Preference | Rendering   | F(3, 45) = 120.51, p < .001  | 0.480
Preference | Movie       | F(5, 75) = 2.74, p = .018    | 0.018
Preference | Interaction | F(15, 225) = 1.21, p = .258  | 0.024
Harmony    | Rendering   | F(3, 45) = 158.97, p < .001  | 0.556
Harmony    | Movie       | F(5, 75) = 2.32, p = .043    | 0.013
Harmony    | Interaction | F(15, 225) = 0.61, p = .864  | 0.011
Discomfort | Rendering   | F(3, 45) = 16.25, p < .001   | 0.113
Discomfort | Movie       | F(5, 75) = 2.20, p = .054    | 0.025
Discomfort | Interaction | F(15, 225) = 0.75, p = .735  | 0.026
6.2 In-depth Analysis
To understand the results better, we conducted a more in-depth analysis of the rendering methods for each of the movie clips. Figure 9 shows the average scores of the four subjective ratings. Note that the scale is inverted for the discomfort ratings, i.e., a lower score indicates better performance (less discomfort). We performed a one-way analysis of variance (ANOVA) to assess the performance of the four haptic conditions for each movie clip. In terms of the subjective evaluation metrics immersion, preference, and harmony, the haptic rendering condition had a statistically significant effect (p < 0.01 or p < 0.05) on user experience for all six movie clips. However, in terms of discomfort, the haptic rendering condition only had a statistically significant effect for Alita: Battle Angel [Clip A], The Boss Baby, and Escape Room.

Figure 9: Results of the experiment with six movie clips, evaluating the users' haptic-audio-visual experience via a questionnaire in terms of immersion, preference, harmony, and discomfort. The error bars represent standard errors. Conditions sharing the same letter above the bars are not significantly different.

The detailed statistical results are presented in Table 5. For each one-way ANOVA, we conducted a post-hoc analysis with Tukey's HSD test to determine which differences are statistically meaningful. Post-hoc grouping labels are presented above the bars in Figure 9; different letters indicate a significant difference between conditions.

Table 5: Results of the one-way ANOVAs for each movie clip in terms of immersion, preference, harmony, and discomfort. Each cell reports the F(3, 45) value and the corresponding p-value; p < .05 indicates a significant effect.

Movie                        | Immersion        | Preference       | Harmony          | Discomfort
Alita: Battle Angel [Clip A] | 29.495, p < .001 | 33.021, p < .001 | 27.962, p < .001 | 3.141, p = .032
Alita: Battle Angel [Clip B] | 23.932, p < .001 | 20.126, p < .001 | 37.443, p < .001 | 2.586, p = .061
The Boss Baby                | 26.263, p < .001 | 21.293, p < .001 | 21.36, p < .001  | 6.400, p = .001
I Am Legend [Clip A]         | 25.405, p < .001 | 17.010, p < .001 | 30.233, p < .001 | 1.762, p = .164
I Am Legend [Clip B]         | 18.370, p < .001 | 18.115, p < .001 | 24.377, p < .001 | 1.152, p = .336
Escape Room                  | 21.986, p < .001 | 20.562, p < .001 | 22.922, p < .001 | 5.072, p = .003

The main observations are summarized as follows:
• The haptic rendering method has a clear, significant effect on all the subjective ratings, with the multimodal rendering condition performing best.
• A general trend of Random < Visual < Audio < Multimodal can be observed throughout the plots. Although there is only a slight difference between the audio and visual rendering scores, we can infer that time-synchronicity is more important than presenting the location of the event.
• For the discomfort rating, the content of the movie seems to affect the user's evaluation. For example, the discomfort ratings show a significant difference among conditions for the Alita [Clip A], Boss Baby, and Escape Room clips, but not for the others.

7 DISCUSSION
We can observe that the multimodal haptic effect condition delivers the best user experience with regard to immersion, preference, and harmony, with the lowest level of discomfort. The results clearly demonstrate the effectiveness of our algorithm. Furthermore, using both visual and audio features in the haptic effects generation increased the subjective rating scores.
In all the experimental conditions, multimodal rendering received ratings over 8 out of 10. We can conclude that the rendering algorithm effectively communicated the context of the scene.
The two features, location and amplitude, are well mapped, easily distinguishable by the participants, and helped them in their understanding of the scene. Moreover, this was achieved without the use of high-performance actuators such as voice coils.
A point for further discussion is user discomfort, which includes the effects of fatigue caused by an immoderate number of vibrations, or excessive cognitive load due to rapidly varying stimuli. As illustrated in Figure 9, we observed that multimodal rendering achieves the lowest level of discomfort. Furthermore, the large standard deviation indicates relatively substantial individual variance in the "comfortable range" of vibration perception.
We found that the score of the psychoacoustic-parameter-based rendering was slightly better than that of the saliency-based rendering; however, the difference was not statistically significant. This implies that time synchronicity is more crucial than location matching for vibrotactile rendering. However, we picked mostly dynamic scenes for this study, which are accompanied by clear, distinctive sound effects. Scenes that are not dynamic and violent (e.g., slow camera motion, speaking, romance scenes) need further investigation.

7.1 Extensibility of the Algorithm
The algorithm's flexibility lends itself to scaling and expansion to other applications such as Virtual Reality (VR), Augmented Reality (AR), and games. The pipeline also allows for retroactively introducing 4D effects to movies made without 4D in mind. We provide a UI for our program that allows users to adjust thresholds for each of the psychoacoustic parameters of sharpness, booming, low-frequency energy, and loudness. Once a video is selected to be processed, users can choose and tune the psychoacoustic measurements and their thresholds. The program then generates a text file for driving the actuators and a video for haptic effects visualization. Haptics practitioners can use our tool to create vibrotactile effects for their videos, whether for VR, gaming, or other content.
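A sketch of the thresholding and export step behind such a tool is shown below. It is illustrative only: the default threshold values, the feature names, and the output format (one line of nine amplitudes per frame) are assumptions rather than the tool's actual file format.

```python
# Assumed per-frame records produced by the pipeline: the four psychoacoustic
# measures plus the nine per-actuator amplitudes (values in 0..1).
DEFAULT_THRESHOLDS = {"loudness": 40.0, "low_freq_energy": 10.0,
                      "sharpness": 1.5, "booming": 1.5}      # assumed units

def export_actuator_commands(frames, path, thresholds=DEFAULT_THRESHOLDS):
    """Write one line of nine amplitudes per video frame; a frame's effect is
    kept only if at least one enabled measure exceeds its threshold."""
    with open(path, "w") as out:
        for frame in frames:
            active = any(frame[name] >= limit for name, limit in thresholds.items())
            amps = frame["amplitudes"] if active else [0.0] * 9
            out.write(" ".join(f"{a:.3f}" for a in amps) + "\n")
```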
Moreover, since the feature extraction stage is detached from the calculation of the rendering parameters, we can separately add or remove audiovisual parameters in that stage. For example, color map extraction [71] and camera motion estimation [61, 74] could easily be integrated with the current system. Higher-level contextual information could be obtained by using machine learning techniques such as violent scene detection [10, 19, 46, 70] or semantic segmentation [49, 72] to make the haptics fit the storyline more comprehensively.
In terms of haptic actuation mechanisms, different modalities such as thermal or airflow sensations could easily be added. For example, a related experimental multi-haptic armrest design [52] provides two different types of vibration, along with thermal, airflow, and poking mechanisms. These mechanisms could be mapped to different audiovisual features extracted from the scene and would be expected to further improve the watching experience. The challenge for such a conversion is to determine the appropriate mapping between the movie content or detected events and the associated type of mechanism or actuator to drive in response, along with the selection and tuning of its parameters. We are currently investigating these issues. Regarding application scenarios, we initially considered the home theatre watching experience, where the algorithm could easily be attached to streaming services to provide 4D haptic effects. The commercialization of this and similar work could have an impact on the future of at-home content consumption.

7.2 Limitations
Despite offering automatic haptic authoring and promising application possibilities, our pipeline suffers from a number of limitations. In Section 3.2, we presented our visual saliency detection algorithm for distributing vibrotactile effects to produce an immersive experience. However, we observed a small number of scenes with strong sound but no visual objects on screen, such as ambush attacks from a ghost or monster in the dark, an electricity blackout, or heavy wind blowing out a fire in the scene. In such cases, where sound components exist in the absence of corresponding visual objects, we fixed the saliency area in the middle of the frame and assigned A5 to render the vibrotactile actuation. Furthermore, movies contain many blurred frames to guarantee the fluidity of characters' actions, which poses a challenge for the saliency detection model and results in imprecise edge detection of objects.
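The centre fallback described above can be expressed compactly; in this sketch the saliency map is assumed to be a 2-D array normalized to [0, 1], and the peak threshold used to decide that no salient object is present is an assumed value.

```python
import numpy as np

CENTER_ACTUATOR = 4   # index of A5 in a 0-based list of the nine actuators

def saliency_target(saliency, min_peak=0.1):
    """Return the (x, y) centroid of the saliency map, or the frame centre
    (handled by actuator A5) when no salient object is detected."""
    h, w = saliency.shape
    if saliency.max() < min_peak:                 # strong sound, no visual object
        return (w / 2.0, h / 2.0), CENTER_ACTUATOR
    ys, xs = np.mgrid[0:h, 0:w]
    total = saliency.sum()
    centroid = (float((xs * saliency).sum() / total),
                float((ys * saliency).sum() / total))
    return centroid, None
```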
These limitations may be overcome by more powerful, accurate, or movie-specific detection or segmentation machine learning models, incorporating, for example, methods of stereo sound localization [62], attention [73], and depth extraction [78], and equally by taking advantage of large-scale movie datasets such as IMDb and MovieLens. We emphasize that the proposed technique does not aim to serve as a replacement for manual design by haptics experts. The quality of the haptic patterns from our pipeline is inferior to those created by laborious manual authoring. However, we believe that our pipeline provides useful insights and offers a preliminary version of high-quality effects that can be applied at an early stage of content authoring.

8 CONCLUSION
In this study, we proposed an automatic multimodal haptic rendering algorithm for movie content, which extracts audiovisual features to infer basic contextual information about the scene and renders corresponding haptic effects. We especially targeted ways to improve the home theatre experience. Using a haptic chair equipped with an array of vibration motors, we conducted a user study to evaluate the effectiveness of the algorithm in terms of user experience, measuring subjective ratings of immersion, preference, harmony, and discomfort. The results demonstrate that the proposed multimodal rendering noticeably improved the viewing experience. Our future work will expand the algorithm and include additional types of haptic modalities.

ACKNOWLEDGMENTS
We thank Sri Gannavarapu, David Marino, Linnéa Kirby, and David Ireland for their valuable feedback. We are also grateful to all of the study participants, and to the reviewers for their comments.
REFERENCES
[43] Huaping Liu, Di Guo, Xinyu Zhang, Wenlin Zhu, Bin Fang, and Fuchun Sun. 2020. Toward Image-to-tactile Cross-modal Perception for Visually Impaired People. IEEE Transactions on Automation Science and Engineering 18, 2 (2020), 521–529.
[44] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. 2020. Deep Learning for Generic Object Detection: A Survey. International Journal of Computer Vision 128, 2 (2020), 261–318.
[45] Shervin Minaee, Yuri Y. Boykov, Fatih Porikli, Antonio J. Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. 2021. Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[46] Guankun Mu, Haibing Cao, and Qin Jin. 2016. Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features. In Chinese Conference on Pattern Recognition. Springer, 451–463.
[47] Suranga Nanayakkara, Elizabeth Taylor, Lonce Wyse, and S. H. Ong. 2009. An Enhanced Musical Experience for the Deaf: Design and Evaluation of a Music Display and a Haptic Chair. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Boston, MA, USA) (CHI '09). Association for Computing Machinery, New York, NY, USA, 337–346. https://doi.org/10.1145/1518701.1518756
[48] Ernst Niebur. 2007. Saliency Map. Scholarpedia 2, 8 (2007), 2675.
[49] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 1520–1528.
[50] S. O'Modhrain and I. Oakley. 2004. Adding Interactivity: Active Touch in Broadcast Media. In Proceedings of the 12th International Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems (HAPTICS '04). 293–294. https://doi.org/10.1109/HAPTIC.2004.1287211
[51] Saurabh Palan, Ruoyao Wang, Nathaniel Naukam, Li Edward, and Katherine J. Kuchenbecker. 2010. Tactile Gaming Vest (TGV).
[52] Nathan J. A. Pollet, Emanuel Uzan, Patricia Batista Ruivo, Tal Abravanel, Aishwari Talhan, Yongjae Yoo, and Jeremy R. Cooperstock. 2021. Multimodal Haptic Armrest for Immersive 4D Experiences. In IEEE World Haptics Conference, Work in Progress.
[53] Jonghyun Ryu and Seungmoon Choi. 2008. posVibEditor: Graphical Authoring Tool of Vibrotactile Patterns. In 2008 IEEE International Workshop on Haptic Audio Visual Environments and Games (HAVE '08). 120–125. https://doi.org/10.1109/HAVE.2008.4685310
[54] Oliver S. Schneider, Ali Israr, and Karon E. MacLean. 2015. Tactile Animation by Direct Manipulation of Grid Displays. In Proceedings of the 28th Annual ACM Symposium on User Interface Software and Technology (Charlotte, NC, USA) (UIST '15). Association for Computing Machinery, New York, NY, USA, 21–30. https://doi.org/10.1145/2807442.2807470
[55] Oliver S. Schneider and Karon E. MacLean. 2016. Studying Design Process and Example Use with Macaron, a Web-based Vibrotactile Effect Editor. In 2016 IEEE Haptics Symposium (HAPTICS). IEEE, 52–58.
[56] Jongman Seo, Sunung Mun, Jaebong Lee, and Seungmoon Choi. 2018. Substituting Motion Effects with Vibrotactile Effects for 4D Experiences. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI '18). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3173574.3174002
[57] Donghee Shin. 2019. How Does Immersion Work in Augmented Reality Games? A User-centric View of Immersion and Engagement. Information, Communication & Society 22, 9 (2019), 1212–1229.
[58] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
[59] Farhana Sultana, Abu Sufian, and Paramartha Dutta. 2020. Evolution of Image Segmentation Using Deep Convolutional Neural Network: A Survey. Knowledge-Based Systems 201 (2020), 106062.
[60] Y. Suzuki and M. Kobayashi. 2005. Air Jet Driven Force Feedback in Virtual Reality. IEEE Computer Graphics and Applications 25, 1 (2005), 44–47. https://doi.org/10.1109/MCG.2005.1
[61] Jiexiong Tang, John Folkesson, and Patric Jensfelt. 2018. Geometric Correspondence Network for Camera Motion Estimation. IEEE Robotics and Automation Letters 3, 2 (2018), 1010–1017.
[62] Antigoni Tsiami, Petros Koutras, and Petros Maragos. 2020. STAViS: Spatio-temporal Audiovisual Saliency Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4766–4776.
[63] S. u. Rehman, J. Sun, L. Liu, and H. Li. 2008. Turn Your Mobile Into the Ball: Rendering Live Football Game Using Vibration. IEEE Transactions on Multimedia 10, 6 (2008), 1022–1033. https://doi.org/10.1109/TMM.2008.2001352
[64] Yusuke Ujitoko and Yuki Ban. 2018. Vibrotactile Signal Generation from Texture Images or Attributes Using Generative Adversarial Network. In International Conference on Human Haptic Sensing and Touch Enabled Computer Applications. Springer, 25–36.
[65] Yusuke Ujitoko, Yuki Ban, and Koichi Hirota. 2020. GAN-based Fine-tuning of Vibrotactile Signals to Render Material Surfaces. IEEE Access 8 (2020), 16656–16661.
[66] Ronald T. Verrillo. 1966. Vibrotactile Sensitivity and the Frequency Response of the Pacinian Corpuscle. Psychonomic Science 4, 1 (1966), 135–136.
[67] Markus Waltl. 2010. Enriching Multimedia with Sensory Effects: Annotation and Simulation Tools for the Representation of Sensory Effects. VDM Verlag.
[68] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. 2017. Learning to Detect Salient Objects with Image-level Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 136–145.
[69] Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, and Sergio Escalera. 2018. RGB-D-based Human Motion Recognition with Deep Learning: A Survey. Computer Vision and Image Understanding 171 (2018), 118–139.
[70] Jing Yu, Wei Song, Guozhu Zhou, and Jian-jun Hou. 2019. Violent Scene Detection Algorithm Based on Kernel Extreme Learning Machine and Three-dimensional Histograms of Gradient Orientation. Multimedia Tools and Applications 78, 7 (2019), 8497–8512.
[71] Lin-Ping Yuan, Wei Zeng, Siwei Fu, Zhiliang Zeng, Haotian Li, Chi-Wing Fu, and Huamin Qu. 2021. Deep Colormap Extraction from Visualizations. arXiv preprint arXiv:2103.00741 (2021).
[72] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. 2018. Context Encoding for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7151–7160.
[73] Jing Zhang, Xin Yu, Aixuan Li, Peipei Song, Bowen Liu, and Yuchao Dai. 2020. Weakly-supervised Salient Object Detection via Scribble Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12546–12555.
[74] Tong Zhang and Carlo Tomasi. 1999. Fast, Robust, and Consistent Camera Motion Estimation. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. PR00149), Vol. 1. IEEE, 164–170.
[75] Ting Zhao and Xiangqian Wu. 2019. Pyramid Feature Selective Network for Saliency Detection. CoRR abs/1903.00179 (2019). arXiv:1903.00179 http://arxiv.org/abs/1903.00179
[76] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019. Object Detection with Deep Learning: A Review. IEEE Transactions on Neural Networks and Learning Systems 30, 11 (2019), 3212–3232.
[77] Zhiying Zhou, A. D. Cheok, Wei Liu, Xiangdong Chen, Farzam Farbiz, Xubo Yang, and M. Haller. 2004. Multisensory Musical Entertainment Systems. IEEE MultiMedia 11, 3 (2004), 88–101. https://doi.org/10.1109/MMUL.2004.13
[78] Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. 2021. RGB-D Salient Object Detection: A Survey. Computational Visual Media (2021), 1–33.
[79] Longhao Zou, Irina Tal, Alexandra Covaci, Eva Ibarrola, Gheorghita Ghinea, and Gabriel-Miro Muntean. 2017. Can Multisensorial Media Improve Learner Experience? In Proceedings of the 8th ACM on Multimedia Systems Conference (Taipei, Taiwan) (MMSys '17). Association for Computing Machinery, New York, NY, USA, 315–320. https://doi.org/10.1145/3083187.3084014
[80] Eberhard Zwicker and Hugo Fastl. 2013. Psychoacoustics: Facts and Models. Vol. 22. Springer Science & Business Media.