Session 14 - Computational Linguistics

Applied Linguistics II

SEYED MOHAMMAD HOSSEINI


DEPT. OF ENGLISH
ARAK UNIVERSITY
Session 14
◦ Computational Linguistics
◦ Corpus Linguistics
Computational Linguistics
Definition
Goals
Methods
Topics in computational linguistics
Computational linguistics: Definition
“A branch of linguistics in which computational techniques
and concepts are applied to the elucidation of linguistic and
phonetic problems. Several research areas have developed,
including natural language processing, speech synthesis,
speech recognition, automatic translation, the making of
concordances, the testing of grammars, and the many areas
where statistical counts and analyses are required (e.g. in
literary textual studies).”
Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.
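One of the areas named in this definition, the making of concordances, can be illustrated with a minimal sketch. The function name `concordance`, the `width` parameter, and the sample sentence are assumptions made for illustration; real concordancers handle tokenization, sorting, and alignment far more carefully.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context (KWIC) lines with up to `width` words on each side."""
    # Very rough tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, not taken from any real corpus.
sample = "The corpus is large. A corpus may vary from a few sentences to a set of texts."
for line in concordance(sample, "corpus"):
    print(line)
```

Each printed line shows one occurrence of the keyword in square brackets with its immediate context, which is the basic display a researcher scans when studying how a word is used.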
Computational linguistics: Definition
◦ “Computational linguistics is the scientific and engineering discipline
concerned with understanding written and spoken language from a
computational perspective, and building artifacts that usefully process and
produce language, either in bulk or in a dialogue setting.
◦ To the extent that language is a mirror of mind, a computational
understanding of language also provides insight into thinking and intelligence.
◦ And since language is our most natural and most versatile means of
communication, linguistically competent computers would greatly facilitate
our interaction with machines and software of all sorts, and put at our
fingertips, in ways that truly meet our needs, the vast textual and other
resources of the internet.”
Theoretical goals of computational linguistics

The theoretical goals of computational linguistics include:

◦ formulation of grammatical and semantic frameworks for characterizing languages;
◦ the discovery of processing techniques and learning principles;
◦ the development of cognitively and neuroscientifically plausible computational models of how language processing and learning might occur in the brain.
Practical goals of computational linguistics
The practical goals of the field are broad and varied. Some of the most prominent
are:
◦ efficient text retrieval on some desired topic;
◦ effective machine translation (MT);
◦ question answering (QA);
◦ text summarization;
◦ analysis of texts or spoken language for topic, sentiment, or other psychological
attributes;
◦ dialogue agents for accomplishing particular tasks (purchases, technical
troubleshooting, trip planning, schedule maintenance, medical advising, etc.);
◦ creation of computational systems with human-like competency in dialogue, in
acquiring language, and in gaining knowledge from text.
Methods
The methods employed in theoretical and practical research
in computational linguistics have often drawn upon theories
and findings in
◦ theoretical linguistics
◦ philosophical logic
◦ cognitive science (especially psycholinguistics)
◦ computer science.
Topics in Computational Linguistics:
Syntax and parsing
◦ The structural hierarchy
◦ Syntax
◦ Parsing
◦ Coping with syntactic ambiguity
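Syntactic ambiguity can be made concrete with a small chart parser. The sketch below uses the CKY algorithm to count the distinct parse trees a sentence receives. The grammar, the lexicon, and the names `BINARY`, `LEXICON`, and `count_parses` are hypothetical toy material chosen for illustration, not a description of any real parser.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (left child, right child) -> parents.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the PP modifies the verb phrase
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # the PP modifies the noun phrase
    ("P", "NP"): ["PP"],
}
# Hypothetical toy lexicon: word -> possible categories.
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}

def count_parses(words, start="S"):
    """Count the parse trees rooted in `start` that span the whole word list (CKY)."""
    n = len(words)
    # chart[i][j][X] = number of distinct trees with root X covering words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in LEXICON.get(word.lower(), []):
            chart[i][i + 1][tag] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, left_count in chart[i][k].items():
                    for right, right_count in chart[k][j].items():
                        for parent in BINARY.get((left, right), []):
                            chart[i][j][parent] += left_count * right_count
    return chart[0][n][start]

print(count_parses("I saw the man with the telescope".split()))
```

Under this toy grammar the sentence receives two analyses, because "with the telescope" can attach either to the verb phrase (the seeing was done with a telescope) or to the noun phrase (the man has the telescope). Choosing between such analyses is what coping with syntactic ambiguity involves.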
Topics in Computational Linguistics:
Semantic representation
◦ Relating language to logic
◦ Thematic/case roles
◦ Expressivity issues
◦ Mapping syntactic trees to logical forms
◦ Coping with semantic ambiguity
Topics in Computational Linguistics:
Making sense of text
◦ Making sense of text
◦ Dealing with reference and various forms of “missing material”
◦ Making connections
◦ Dealing with figurative language
◦ Making sense of, and engaging in, dialogue
Topics in Computational Linguistics:
Acquiring knowledge for language
◦ Acquiring knowledge for language
◦ Knowledge extraction from text
◦ Crowdsourcing (soliciting verbally expressed information, or
annotations of such information, from large numbers of web
users)
Applications
◦ Machine translation
◦ Document retrieval and clustering applications
◦ Knowledge extraction and summarization
◦ Sentiment analysis
◦ Chatbots and companionable dialogue agents
◦ Virtual worlds, games, and interactive fiction
◦ Natural language user interfaces
◦ Text-based question answering
◦ Inferential (knowledge-based) question answering
◦ Voice-based web services and assistants
◦ Collaborative problem solvers and intelligent tutors
◦ Language-enabled robots
Source
https://plato.stanford.edu/entries/computational-linguistics/#DatFroEnd
Corpus (pl. corpora or corpuses)
a collection of naturally occurring samples of language which have
been collected and collated for easy access by researchers and
materials developers who want to know how words and other
linguistic items are actually used. A corpus may vary from a few
sentences to a set of written texts or recordings. In language analysis
corpuses usually consist of a relatively large, planned collection of
texts or parts of texts, stored and accessed by computer.
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of Language
Teaching and Applied Linguistics. Longman.
Corpus
A collection of linguistic data, either written texts or a transcription of recorded
speech, which can be used as a starting point of linguistic description or as a
means of verifying hypotheses about a language.
Corpora provide the basis for one kind of computational linguistics. A computer
corpus is a large body of machine-readable texts. Increasingly large corpora
(especially of English) have been compiled since the 1980s, and are used both in
the development of natural language processing software and in such
applications as lexicography, speech recognition, and machine translation.
◦ Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.
Types of Corpora
A corpus is designed to represent different types of language use, e.g. casual conversation,
business letters, ESP texts. A number of different types of corpuses may be distinguished, for
example:
1 specialized corpus: a corpus of texts of a particular type, such as academic articles, student
writing, etc.
2 general corpus or reference corpus: a large collection of many different types of texts, often
used to produce reference materials for language learning (e.g. dictionaries) or used as a
baseline for comparison with specialized corpora
3 comparable corpora: two or more corpora in different languages or language varieties
containing the same kinds and amounts of texts, to enable differences or equivalences to be
compared
4 learner corpus: a collection of texts or language samples produced by language learners
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of Language Teaching and Applied Linguistics. Longman.
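Using a reference corpus as a baseline for a specialized corpus usually means comparing normalized frequencies, since the two corpora differ in size. The following is a minimal sketch under that assumption; the function names `per_million` and `frequency_ratio` and both sample texts are invented for illustration, and real studies would use much larger corpora and a proper keyness statistic.

```python
import re
from collections import Counter

def per_million(text):
    """Map each word to its frequency per million tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in Counter(tokens).items()}

def frequency_ratio(word, specialized, reference):
    """Normalized frequency in the specialized text divided by that in the reference text.

    Returns None when the word does not occur in the reference text.
    """
    spec = per_million(specialized).get(word, 0.0)
    ref = per_million(reference).get(word, 0.0)
    return spec / ref if ref else None

# Hypothetical miniature "corpora", far too small for real conclusions.
specialized = "the data support the hypothesis and the data are robust"
reference = "the dog chased the cat and the data were lost on the way to the park"
print(frequency_ratio("data", specialized, reference))
```

A ratio well above 1 suggests the word is characteristic of the specialized texts relative to the baseline; a ratio near 1 suggests it is equally common in both.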
Corpus Linguistics
“an approach to investigating language structure and use
through the analysis of large databases of real language
examples stored on computer.”
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of
Language Teaching and Applied Linguistics. Longman.
How can corpora help?
Issues amenable to corpus linguistics include
◦ the meanings of words across registers,
◦ the distribution and function of grammatical forms and categories,
◦ the investigation of lexico-grammatical associations (associations
of specific words with particular grammatical constructions),
◦ the study of discourse characteristics, register variation, and
◦ (when learner corpora are available) issues in language acquisition
and development.
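Lexico-grammatical associations of the kind listed above can be explored by counting what immediately follows a given word. The sketch below is a rough illustration only: the function name `right_neighbours` and the sample sentences are assumptions, and the simple tokenizer ignores sentence boundaries, which a real study would respect.

```python
import re
from collections import Counter

def right_neighbours(text, node):
    """Count the words that occur immediately after `node` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)

# Hypothetical sample sentences, invented for illustration.
sample = (
    "She is interested in syntax. He was interested in corpora. "
    "They are interested to learn more."
)
for word, count in right_neighbours(sample, "interested").most_common():
    print(word, count)
```

In this invented sample, "interested" is followed by "in" twice and by "to" once, hinting at the competing patterns "interested in" plus noun and "interested to" plus verb. On a real corpus the same count, taken over thousands of hits, shows which construction a word typically favours.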
Why corpora?
• Objective verification of results
• Corpora show how people really use the language rather than offering imaginary, idealized
examples
• Quantitative data show what occurs frequently and what occurs rarely in the language
• Thanks to information technology, we can conduct fast, complex studies and process more
material than is possible by hand
Criticism
Linguistic descriptions which are ‘corpus restricted’ have been the subject of
criticism, especially by generative grammarians, who point to the limitations of
corpora (e.g. that they are samples of performance only, and that one still needs
a means of projecting beyond the corpus to the language as a whole).
In fieldwork on a new language, or in historical study, it may be very difficult to
get beyond one’s corpus (i.e. it is a ‘closed’ as opposed to an ‘extendable’
corpus), but in languages where linguists have regular access to native-speakers
(and may be native-speakers themselves) their approach will invariably be
‘corpus-based’, rather than corpus-restricted.
◦ Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.
Some English Corpora
Corpus of Contemporary American English (COCA): The corpus is composed of more than 560 million
words from 220,225 texts, including 20 million words from each of the years 1990 through 2017.
https://www.english-corpora.org/coca/
The Corpus of Historical American English (COHA) is the largest structured corpus of historical
English. COHA contains more than 475 million words of text from the 1820s-2010s and the corpus
is balanced by genre decade by decade.
https://www.english-corpora.org/coha/
British National Corpus (BNC): Originally created by Oxford University Press in the 1980s and
early 1990s, the BNC contains 100 million words of text from a wide range of genres (e.g. spoken,
fiction, magazines, newspapers, and academic).
https://www.english-corpora.org/bnc/
Some Persian Corpora
Persian Text Corpus (Dr. Bijankhan)
https://www.peykaregan.ir/dataset/%D9%BE%DB%8C%DA%A9%D8%B1%D9%87-%D9%85%D8%AA%D9%86%DB%8C-%D8%B2%D8%A8%D8%A7%D9%86-%D9%81%D8%A7%D8%B1%D8%B3%DB%8C
Persian Language Database (Dr. Mostafa Assi)
http://pldb.ihcs.ac.ir/

The Academy (Farhangestan)
https://dadegan.apll.ir/
