Eura English Transcription Guidelines 2024 - ADAP QF
NOTE: All information provided in this document is confidential. Any publication,
provision, or dissemination of this content is strictly prohibited. Do not share or
post the contents on the internet.
Project goal: The goal of this project is to transcribe audio files that will ultimately help our client build
state-of-the-art automatic speech recognition models.
The transcription box contains a pre-transcription. In this project, you will need to correct the transcription and
add tags as needed, according to the following guidelines. The aim of this project is to accurately transcribe (i.e.
type out or represent with pre-filled tags) the speech presented to you in audio files. You will be using our online
transcription platform called "ADAP Quality Flow". A separate guide is provided for using ADAP Quality Flow.
Please read these guidelines in full and keep them handy when you start transcription. There are a lot of
things to remember, but you will find it gets easier once you have done a few transcriptions.
Please use these guidelines alongside the more specific Speaker Diarization guidelines.
General information
The purpose of this project is to:
- correct pre-filled transcriptions or transcribe from scratch
- tag non-speech sounds which occur at the same time as speech
- timestamp audio to capture continuous speech (i.e. speech with pauses of
less than 0.5 seconds) and track and identify speakers by adding
timestamps at the start and end of each speaker turn.
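To illustrate the 0.5-second rule above, here is a minimal Python sketch that groups words into stretches of continuous speech. The word timings, the `Word` tuple and the `group_continuous` function are assumptions made for this example only; they are not part of the transcription platform.

```python
from typing import List, Tuple

# Hypothetical word record: (text, start_seconds, end_seconds)
Word = Tuple[str, float, float]


def group_continuous(words: List[Word], max_pause: float = 0.5) -> List[List[Word]]:
    """Group words into segments; a pause of max_pause or more starts a new segment."""
    segments: List[List[Word]] = []
    for word in words:
        # Pause = start of this word minus end of the previous word in the segment.
        if segments and word[1] - segments[-1][-1][2] < max_pause:
            segments[-1].append(word)
        else:
            segments.append([word])
    return segments


example_words = [("hi", 0.0, 0.3), ("there", 0.4, 0.8), ("okay", 1.5, 1.8)]
example_segments = group_continuous(example_words)
```

In this example the pause between "there" and "okay" is longer than 0.5 seconds, so the words fall into two segments.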
Background speech
Speech which is clearly quieter than the foreground volume (i.e. quiet speech
coming from the background that can be understood with some effort) should be
transcribed following all guidelines in this document, using the same conventions
as foreground speech and noises, and additionally tagged as background
using the <bg></bg> span tag. For background speech that can’t be understood,
use .
Units
A unit is a single unit of transcription. Each unit has its own text input box and it
does not need to be saved (i.e. the transcription is automatically saved) to move
on to the next unit. The breaks between units can generally be ignored: they are
only intended to break up the audio into easily transcribable sections.
Unit Group
A unit group of transcription work is a single, continuous audio file which is further
divided into pages and units.
Source
For some unit groups, you will see additional information about the source of the
audio in the Description section.
Please use this information to guide your spelling decisions for proper nouns
such as person names, location names, products and brand names.
Note: this additional information may contain spelling errors. Please use your
judgment when using this information for proper nouns. If you’re not certain,
please do an online search to confirm the spelling.
To help you be more productive with transcription, you can utilize the keyboard
shortcuts for event tags and span tags. To use the shortcuts, make sure the
cursor is active in the text box and then press the keyboard combinations.
Transcribing speech
Use standard contractions ("I'm", "could've", "let's" but not "tryna" or "'em") if this
is how a word is pronounced in the audio. Also use possessive apostrophes
where necessary, e.g. "Mike's job", "both kids' toys".
Spelling Example
Speaker says '24' – use a hyphen
● 24 ==> TRANSCRIPTION: twenty-four
Speaker says '20' followed by '4' – do NOT use a hyphen
If a pronunciation is only one sound different from its conventional spelling, please
use the conventional spelling. If the spoken form differs by more than one sound,
and there is a commonly used English spelling, please use that spelling.
Example
One sound different
Capital letters
Use English capitalization rules with one exception: do not use a capital letter if
the only reason to do so is that the word is at the start of a sentence.
Most person names ("Barack Obama"), location names ("Golden Gate Bridge",
"Russia"), products, and brand names ("Five Guys", "YouTube") should be
capitalized.
Speaker Identification and Speaker Changes
We’d like to identify at what point the speakers change at a unit group level using
timestamps. This means you will identify the following points in the audio with a
timestamp:
- [Speaker]_start: This is used when there is a new speaker in the audio, or
a changed speaker.
- [Speaker]_end: This is used when the speaker finishes speaking, either
when the unit group is complete or before another person starts speaking.
To try to be as precise as possible, please place the timestamps within 0.1
seconds of the event happening. Do not put the timestamp in the middle of a
word (or you will cut the word).
For more details on timestamps and speaker diarization, please refer to the
Speaker Diarization guidelines.
In this project, you'll encounter overlapping speech in the audio. Transcribe all
speech from each speaker, including timestamps and speaker identifiers.
Timestamps will indicate whether speech from the speakers overlaps.
unintelligible
Enter the unintelligible tag in place of the speech which cannot be understood after
three attempts at listening.
If there is more than one unintelligible word in sequence, use a single tag. If the
entire sentence or unit cannot be understood, use a single unintelligible tag.
Example
A speaker says a word you don't understand
TRANSCRIPTION: go tomorrow
background
Even if a speaker is in the background, a speaker identifier tag is still needed for
speaker diarization.
Example
A speaker asks a question but someone in the background responds.
TRANSCRIPTION: [0.004] <spk_1> does anyone have a question? </spk_1> [0.534]
[0.550] <spk_2> <bg> yes, over here. </bg> </spk_2> [1.230]
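The layout of a speaker turn in the example above (start timestamp, speaker identifier, optional background span, end timestamp) can be sketched in Python as follows. The `format_turn` helper is hypothetical and only mirrors the pattern visible in the example; please refer to the Speaker Diarization guidelines for the authoritative format.

```python
def format_turn(speaker: int, start: float, end: float, text: str,
                background: bool = False) -> str:
    """Build one speaker turn in the same layout as the example transcription."""
    # Background speech is wrapped in the <bg></bg> span inside the speaker tags.
    body = f"<bg> {text} </bg>" if background else text
    # Timestamps are printed in seconds with three decimals, as in the example.
    return f"[{start:.3f}] <spk_{speaker}> {body} </spk_{speaker}> [{end:.3f}]"


foreground_turn = format_turn(1, 0.004, 0.534, "does anyone have a question?")
background_turn = format_turn(2, 0.550, 1.230, "yes, over here.", background=True)
```

Note that the speaker tags stay on the outside and the background span on the inside, so the background speaker still has a speaker identifier.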
Singing
The definition of singing for the purpose of this project is: making musical and/or
rhythmic sounds with your voice. Note that we have two different tags for singing.
The <singing/> tag is reserved for live singing, as described in this section, by a
foreground or background speaker.
⚠ Sung lyrics in recorded audio are not tagged as <singing/> for the purpose of
this project, but should rather use the tag <lyrics/>. See the section below for
details on how to apply the <lyrics/> tag.
Note that these categories are considered speech (not singing) if they are not
pronounced in the manner described above.
There are two ways to treat singing in this project: a <singing/> event tag to replace
each word or group of words that is sung that you do not know or cannot
understand, and a <singing></singing> span tag to surround words that
are sung and that you can write down.
1. Use the <singing/> event tag for a sung word or a group of sung words that
you cannot understand (e.g. unintelligible singing, mumbling...) or words sung in a
foreign language (even if you can understand it). Also use the <singing/> event tag
for scatting/nonsense singing.
Example:
A speaker starts a sentence in English and then says a word in German
but in a sing-song manner “kaaaartoffeeeeellll!”.
TRANSCRIPTION:
If there is more than one word sung in a sequence, please use a single singing
tag for all unintelligible words sung.
Example:
Someone starts rapping but you cannot understand the words. You
believe you can hear at least 5 words.
TRANSCRIPTION:
Example:
Someone recites a poem in French and then the speaker changes.
TRANSCRIPTION:
2. If you can understand the sung words (and they are in English), write them
down and surround them with the span tag <singing>word</singing>.
Example:
Someone is reciting a poem.
Multiple people sing the Happy Birthday song together at the same time
and you cannot recognize the singers:
TRANSCRIPTION:
<group> <singing>happy birthday dear Rajesh, happy birthday to you!</singing>
</group>
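Span tags such as <group>, <singing>, <bg> and the speaker tags must always be opened and closed in matching pairs. As a minimal sketch of that idea, the hypothetical Python helper below (`unbalanced_tags`, not a platform feature) reports span tags that are left open or closed out of order, and skips event tags such as <singing/>.

```python
import re
from typing import List

# Matches <name>, </name> and <name/>; tag names are assumed to be lowercase
# letters, digits and underscores, as in the examples in this document.
TAG = re.compile(r"<(/?)([a-z_0-9]+)(/?)>")


def unbalanced_tags(text: str) -> List[str]:
    """Return a list of nesting problems; an empty list means the tags are balanced."""
    stack: List[str] = []
    problems: List[str] = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # event tag, nothing to close
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems


good = unbalanced_tags("<group> <singing>happy birthday to you!</singing> </group>")
bad = unbalanced_tags("<spk_2> <bg> yes, over here. </spk_2>")
```

Here `good` is an empty list, while `bad` reports the missing </bg>.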
Multiple people sing the song Frère Jacques in a round (starting a few
seconds apart from each other)
TRANSCRIPTION:
TRANSCRIPTION:
/!\ Tips:
● Use the <singing/> event tag for sung words that you cannot
understand.
● Use the <singing/> event tag for each foreign sung word or group of foreign sung
words, even if you can understand it.
● Use the <singing/> event tag only for a speaker singing, and not for
lyrics in recorded music that appears in the unit. For lyrics in recorded
music, use the <lyrics/> tag instead.
● Use the timestamps and speaker identifiers for singing like you
do for spoken speech: when the speaker or singer changes.
● Use punctuation in places where it falls naturally in songs, singsong
words, poems or sermons.
● If multiple people sing different words at the same time (i.e. different
songs, out of sync, in a round), transcribe both speakers.
● If singing and spoken speech occur at the same time and at a similar
volume, transcribe both speakers.
Lyrics
Lyrics in recorded, produced musical audio that appears in the unit should be
transcribed similarly to singing, but using the <lyrics/> tags instead of the
<singing/> tags.
1. Use the <lyrics/> event tag for a word or group of words in the lyrics that you
cannot understand (e.g. unintelligible singing, mumbling...) or for a word or group
of words in a foreign language in the lyrics (even if you can understand it). Use
the <lyrics/> event tag for scatting/nonsense singing.
For a sequence of lyrics with more than one word, please use a single lyrics tag
to represent all unintelligible lyrics.
2. If you can understand the words of the lyrics (and they are in English), write
them down and surround them with the span tag <lyrics>word</lyrics>.
Use the foreign tag for speech in a language other than English which would not
be understood by US English speakers.
Example:
A speaker says a foreign word after “does” and you cannot identify the
foreign word
If there is more than one unintelligible foreign word in sequence, use a single tag.
If the entire sentence or unit cannot be understood and is in a foreign language,
use a single <foreign/> tag.
Example:
A speaker says “denken Sie an die Kinder” in the middle of a sentence
but you do not understand
If you can understand the foreign language, please write the words down and
surround them with the foreign span tag.
Example:
A speaker says “denken Sie an die Kinder” in the middle of a sentence
and you understand the words
/!\ Tips:
● Remember that loanwords are words borrowed from other languages that
are widely known and understood by English speakers. They are not
considered foreign words for the purposes of this project and should not
receive a foreign tag.
● Foreign names (people’s names, places, etc.) are not considered foreign
words and should be transcribed.
● If you can understand and transcribe what is said but it is not in English
and not a loanword, please surround the words with the foreign span tag.
Numbers should be spelled out as full words in the way they were said.
Example
The number '2012' may be said in many different ways
● this item costs $12.99. ==> TRANSCRIPTION: this item costs twelve
dollars ninety-nine.
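As a small illustration of spelling numbers the way they were said, including the hyphen rule for "twenty-four" versus "twenty four" described earlier, here is a minimal Python sketch. It only covers 0 to 99, and the function names `spell_number` and `spell_sequence` are hypothetical. You still need to listen to how the speaker actually said the number.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell one number from 0 to 99 as a single spoken number (hyphenated)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_sequence(numbers) -> str:
    """Spell numbers that were said one after another (no hyphen between them)."""
    return " ".join(spell_number(n) for n in numbers)


said_as_one = spell_number(24)          # speaker says '24'
said_as_two = spell_sequence([20, 4])   # speaker says '20' followed by '4'
price = f"{spell_number(12)} dollars {spell_number(99)}"
```

The `price` value reproduces the "$12.99" example above.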
Digits (e.g. 1 2 3 4 5 ...) can be used in the transcription ONLY when they are
joined to a letter as part of a name without a space.
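The digit rule above can be sketched as a simple check: a token that contains a digit is only acceptable if it also contains a letter. The Python helper below is a hypothetical illustration that assumes timestamps have already been removed from the text; "MP3" is an example chosen for this sketch and does not come from the guidelines.

```python
import re
from typing import List


def disallowed_digit_tokens(text: str) -> List[str]:
    """Return tokens that contain digits but no letter (these should be spelled out)."""
    tokens_with_digits = re.findall(r"\S*\d\S*", text)
    return [tok for tok in tokens_with_digits if not re.search(r"[A-Za-z]", tok)]


needs_spelling = disallowed_digit_tokens("this item costs $12.99.")
acceptable = disallowed_digit_tokens("I bought an MP3 player")
```

The first call flags "$12.99.", which should be written as "twelve dollars ninety-nine", while the second call flags nothing.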
Example
However
Acronyms and initialisms are words made up of the first letters of words. They
may be pronounced as a word, or each letter may be pronounced separately.
Acronyms and initialisms are spelled using uppercase letters with no space or
period in between.
Spelled out words
When a speaker spells a word out, letter by letter, please transcribe uppercase
letters with a space in between.
Example
● TRANSCRIPTION: spelling sequences are transcribed as isolated
uppercase letters. if I spell my name to you, I would say J O H N.
● TRANSCRIPTION: M A N H A T T A N. M A N H A double T A N.
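The difference between spelled out words (uppercase letters with spaces) and acronyms or initialisms (uppercase letters with no space or period) can be sketched in Python as below. Both helper names are hypothetical, and "n.a.s.a." is only an illustrative input.

```python
def spell_out(word: str) -> str:
    """Format a word that the speaker spelled letter by letter: J O H N."""
    return " ".join(word.upper())


def acronym(letters: str) -> str:
    """Format an acronym or initialism: uppercase, no spaces, no periods."""
    return "".join(ch for ch in letters.upper() if ch.isalpha())


spelled_name = spell_out("john")
spelled_place = spell_out("manhattan")
joined = acronym("n.a.s.a.")
```

If the speaker says something like "double T" while spelling, write it as it was said, as in the example above.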
Inappropriate language
All inappropriate language should be transcribed. If you feel uncomfortable typing
a particular word, use the unintelligible tag (see unintelligible tag) in its place.
Hesitations and interjections
Transcribe hesitations and other disfluencies like uh-huh and hm, using the table
below.
List of Hesitations/Interjections
Meaning                 Acceptable Spelling
Agreement               hm, mm
Disagreement            huh, ah, oh, uh, uh-uh
Surprise                wow, oh, ah
Seeking Confirmation    eh, mhm, ehm
Disgust                 bah, bleah, ugh, yuck, eww
Delight                 eh, wow, ah
Calling Someone         hei, eh, oh
Emphasizing             eh, wa, oh, ah, uh
Example
Example
Speaker's
Transcription Full Form
colloquial Pronunciation
rez reservation
Please see the US spelling standardization and word list for more examples!
mispronunciation
Use this to surround any words that were accidentally mispronounced. Spell
the word in the normal (correct) way, then surround it. There is no need to
use this if someone has an accent: it should only be used when the
person accidentally said something the wrong way. When in doubt ask
yourself "would this person pronounce the word differently if I asked them to
repeat themselves?" If they would, it can be classified as a
mispronunciation.
Example
You hear “what time are you leabing?”
best guess
If you hear a word in the audio but you are not entirely sure how to spell it or
you are not entirely confident you are hearing the word correctly, surround
the word using the best guess tag.
This tag might be needed if the speaker uses a proper name you are
unfamiliar with.
Example:
● You hear "he told me to go to Wolengi" but you are not sure what
Wolengi is or how to spell it; spell as best guess and use the tag:
<best_guess>Wolengi</best_guess>
Please do not use the best guess tag if the speech is unintelligible because
the audio quality is poor, the speaker mumbles, etc. For these cases please
use the unintelligible tag instead.
Do NOT use this tag for words you can easily spell correctly by doing a quick
online search.
Examples:
● You are unsure of the name of an artist, "Emir Kusturica"; you should
search online with an approximate spelling + keywords (e.g. you
heard "movie" in the unit group) to find the correct spelling.
● You are unsure of the spelling of "necessary"; you should look it up
online or in a dictionary and use the correct spelling.
/!\ Remember:
If you hear something in English but cannot make out the word at all = use the
unintelligible tag.
no speech
If an entire utterance does not contain any speech, it should be transcribed with
one tag ONLY: the no speech tag. Even if it contains other sounds, you must
ignore them if there is no speech at all.
Example
The whole utterance contains someone crying, loud noises or
instrumental music:
TRANSCRIPTION:
You must ignore all sounds if there is no speech in the entire utterance.
These are listed in order of how often they are likely to be used. The more common tags are listed at the top of
the table.
spk
Use for all sounds made by a foreground human which are not speech (e.g. any sounds from
the mouth or nose, such as breath, cough, lipsmack, and laughing).
Example
Someone laughs in the middle of their sentence
TRANSCRIPTION: seriously that’s ridiculous!
Someone is speaking and someone else coughs between the speaker’s words.
TRANSCRIPTION:
and after that I went to Forever twenty-one to buy some socks.
You hear some coughing and then some speech. The coughing is ignored and not
tagged because it’s not occurring during a speaker turn.
TRANSCRIPTION :
music
Use for music (without lyrics) that does not overlap with foreground speech. Singing by a
speaker should be tagged as singing, not as music, while recorded music with lyrics should
be tagged as lyrics.
Example
A loud jingle is heard during a short break in a TV announcement:
You hear some music and then some speech. Music is ignored and not tagged
because it’s not occurring during a speaker turn.
TRANSCRIPTION :
noise
Use for any foreground noise that is not spk or music (see above).
Use for any non-speaker noise that occurs at the same volume as foreground speech.
Do not tag background noise that is at a lower volume than speech.
Example
Someone is knocking at the door:
TRANSCRIPTION :
truncation
Use when a word gets cut off at the start or end of a unit because the computer has not cut
up the audio correctly. This is different from a fragment (where the person stops talking part
way through a word). In a truncation, the recording has cut someone off while they were
saying a word. Therefore, truncations only occur at the start or end of a unit.
When you hear a truncation at the end of a unit and you can transcribe the word with
certainty, write out the truncated word in full followed by the truncation tag. When you
hear a truncation at the start of a unit, insert the truncation tag only.
Example
The word 'probably' is split with "prob-" at the end of the first unit and "-ably" at the
beginning of the second unit.
If you are unable to tell what the truncated word is at the end of a unit, simply insert the
unintelligible tag.
Example
A word is truncated between two units and you can hear some fragments of it in both
units but you cannot tell what the word is for certain. Replace the truncated word
with the unintelligible tag.
unit 1: we bought a
Punctuation
A sentence is a grammatically complete unit. A sentence will usually, but not always,
contain a subject (e.g. "the cat") and a verb (e.g. "sat"). Examples of grammatically
complete sentences which do not have a subject and verb include answers to questions
(e.g. "yes." and "no.") and exclamations ("what!" and "really?").
Example:
● TRANSCRIPTION: running smoothly now. could I do more? yes, maybe.
At the end of each sentence, use a period (.) for statements, a question mark (?) for
questions, or an exclamation mark (!) for exclamations. Do not use punctuation
combinations ("?!", "!!!", "..."). Do not use hyphens or quotation marks to indicate quoted or
mentioned speech. No other punctuation (such as : ;) should be used.
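The punctuation restrictions above can be sketched as a quick self-check in Python. The helper below is hypothetical and only looks for punctuation combinations, colons, semicolons and double quotation marks; it does not judge whether a sentence ends in the right place.

```python
import re
from typing import List


def punctuation_problems(text: str) -> List[str]:
    """List the punctuation rules from this section that the text appears to break."""
    problems: List[str] = []
    if re.search(r"[.?!]{2,}", text):   # "?!", "!!!", "..."
        problems.append("punctuation combination")
    if re.search(r"[:;]", text):
        problems.append("colon or semicolon")
    if '"' in text:
        problems.append("quotation mark")
    return problems


clean = punctuation_problems("running smoothly now. could I do more? yes, maybe.")
flagged = punctuation_problems("what?!")
```

Hyphens are not checked here because they are still needed inside words such as "twenty-four" and "uh-uh".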
Punctuation
Only place punctuation at the end of a unit if the end of the unit is also the end of a
sentence. If the speaker continues the same sentence into the next unit, put the
punctuation wherever it naturally falls in the speech. See the description of a unit.
Examples:
● TRANSCRIPTION:
Unit 1: win this year! what do you think
Unit 2: about the Knicks? they seem to have finally
See the "incomplete" tag section below for instructions about sentence fragments which
are not grammatically complete.
incomplete
Insert the incomplete tag when a speaker begins a sentence and is either (a) interrupted by
a new speaker, or (b) begins a new sentence before the first grammatically complete
sentence is finished.
The tag should not be used to indicate that a sentence is continuing into a second unit.
Examples
Speaker 1 says ‘I like having’ in Utterance 1. In the second utterance, they resume
their sentence and say ‘9 hours of sleep a day’.
The sentence continued on, therefore the incomplete tag is not needed here.
A speaker starts a sentence but switches to a new sentence before finishing the
first one.
TRANSCRIPTION:
You do not need to use the incomplete tag when the speaker restarts or repeats a single
word.
Commas
You may use commas (,) to increase readability, following standard rules of
comma usage, for instance:
● For lists of items ("I ate two apples, three oranges, and a banana.") and sequences
of adjectives ("he was a big, red-haired, evil man.")
● For introductory phrases ("so I was thinking, how do you do it?", "at the end of the
day, what matters is your health.").
Please follow standard rules of comma usage. When unsure whether to use a comma, err
on the side of NOT using one.
Resources
● English Punctuation Rules
● Capitalization in English
● Forester - Spelling Standardization and wordlist