Applied Linguistics II
SEYED MOHAMMAD HOSSEINI
DEPT. OF ENGLISH
ARAK UNIVERSITY

Session 14
◦ Computational Linguistics
◦ Corpus Linguistics

Computational Linguistics
◦ Definition
◦ Goals
◦ Methods
◦ Topics in computational linguistics

Computational linguistics: Definition
“A branch of linguistics in which computational techniques and concepts are applied to the elucidation of linguistic and phonetic problems. Several research areas have developed, including natural language processing, speech synthesis, speech recognition, automatic translation, the making of concordances, the testing of grammars, and the many areas where statistical counts and analyses are required (e.g. in literary textual studies).”
◦ Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.

Computational linguistics: Definition
◦ “Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective, and building artifacts that usefully process and produce language, either in bulk or in a dialogue setting.
◦ To the extent that language is a mirror of mind, a computational understanding of language also provides insight into thinking and intelligence.
◦ And since language is our most natural and most versatile means of communication, linguistically competent computers would greatly facilitate our interaction with machines and software of all sorts, and put at our fingertips, in ways that truly meet our needs, the vast textual and other resources of the internet.”

Theoretical goals of computational linguistics
The theoretical goals of computational linguistics include:
◦ formulation of grammatical and semantic frameworks for characterizing languages;
◦ the discovery of processing techniques and learning principles;
◦ the development of cognitively and neuroscientifically plausible computational models of how language processing and learning might occur in the brain.

Practical goals of computational linguistics
The practical goals of the field are broad and varied. Some of the most prominent are:
◦ efficient text retrieval on some desired topic;
◦ effective machine translation (MT);
◦ question answering (QA);
◦ text summarization;
◦ analysis of texts or spoken language for topic, sentiment, or other psychological attributes;
◦ dialogue agents for accomplishing particular tasks (purchases, technical troubleshooting, trip planning, schedule maintenance, medical advising, etc.);
◦ creation of computational systems with human-like competency in dialogue, in acquiring language, and in gaining knowledge from text.

Methods
The methods employed in theoretical and practical research in computational linguistics have often drawn upon theories and findings in
◦ theoretical linguistics
◦ philosophical logic
◦ cognitive science (especially psycholinguistics)
◦ computer science.
Topics in Computational Linguistics: Syntax and parsing
◦ The structural hierarchy
◦ Syntax
◦ Parsing
◦ Coping with syntactic ambiguity

Topics in Computational Linguistics: Semantic representation
◦ Relating language to logic
◦ Thematic/case roles
◦ Expressivity issues
◦ Mapping syntactic trees to logical forms
◦ Coping with semantic ambiguity

Topics in Computational Linguistics: Making sense of text
◦ Making sense of text
◦ Dealing with reference and various forms of “missing material”
◦ Making connections
◦ Dealing with figurative language
◦ Making sense of, and engaging in, dialogue

Topics in Computational Linguistics: Acquiring knowledge for language
◦ Acquiring knowledge for language
◦ Knowledge extraction from text
◦ Crowdsourcing (soliciting verbally expressed information, or annotations of such information, from large numbers of web users)

Applications
◦ Machine translation
◦ Document retrieval and clustering applications
◦ Knowledge extraction and summarization
◦ Sentiment analysis
◦ Chatbots and companionable dialogue agents
◦ Virtual worlds, games, and interactive fiction
◦ Natural language user interfaces
◦ Text-based question answering
◦ Inferential (knowledge-based) question answering
◦ Voice-based web services and assistants
◦ Collaborative problem solvers and intelligent tutors
◦ Language-enabled robots

Source: https://plato.stanford.edu/entries/computational-linguistics/#DatFroEnd

Corpus (pl. corpora or corpuses)
a collection of naturally occurring samples of language which have been collected and collated for easy access by researchers and materials developers who want to know how words and other linguistic items are actually used. A corpus may vary from a few sentences to a set of written texts or recordings. In language analysis corpuses usually consist of a relatively large, planned collection of texts or parts of texts, stored and accessed by computer.
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of Language Teaching and Applied Linguistics. Longman.

Corpus
A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language. Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition, and machine translation.
◦ Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.

Types of Corpora
A corpus is designed to represent different types of language use, e.g. casual conversation, business letters, ESP texts. A number of different types of corpuses may be distinguished, for example:
1 specialized corpus: a corpus of texts of a particular type, such as academic articles, student writing, etc.
2 general corpus or reference corpus: a large collection of many different types of texts, often used to produce reference materials for language learning (e.g. dictionaries) or used as a baseline for comparison with specialized corpora
3 comparable corpora: two or more corpora in different languages or language varieties containing the same kinds and amounts of texts, to enable differences or equivalences to be compared
4 learner corpus: a collection of texts or language samples produced by language learners
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of Language Teaching and Applied Linguistics. Longman.

Corpus Linguistics
“an approach to investigating language structure and use through the analysis of large databases of real language examples stored on computer.”
◦ Richards, J. C. and R. Schmidt, 2010. Longman Dictionary of Language Teaching and Applied Linguistics. Longman.

How can corpora help?
“Issues amenable to corpus linguistics include
◦ the meanings of words across registers,
◦ the distribution and function of grammatical forms and categories,
◦ the investigation of lexico-grammatical associations (associations of specific words with particular grammatical constructions),
◦ the study of discourse characteristics, register variation, and
◦ (when learner corpora are available) issues in language acquisition and development.”

Why corpora?
• Objective verification of results
• Corpora show how people really use the language. They do not provide imaginary, idealized examples
• Quantitative data show what occurs frequently and what occurs rarely in the language
• Thanks to information technology we can conduct fast, complex studies and process more material than is possible by hand

Criticism
Linguistic descriptions which are ‘corpus restricted’ have been the subject of criticism, especially by generative grammarians, who point to the limitations of corpora (e.g. that they are samples of performance only, and that one still needs a means of projecting beyond the corpus to the language as a whole). In fieldwork on a new language, or in historical study, it may be very difficult to get beyond one’s corpus (i.e. it is a ‘closed’ as opposed to an ‘extendable’ corpus), but in languages where linguists have regular access to native-speakers (and may be native-speakers themselves) their approach will invariably be ‘corpus-based’, rather than corpus-restricted.
◦ Crystal, D. 2008. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.

Some English Corpora
Corpus of Contemporary American English (COCA): The corpus is composed of more than 1 billion words from 220,225 texts, including 20 million words from each of the years 1990 through 2017.
https://www.english-corpora.org/coca/
Corpus of Historical American English (COHA): COHA is the largest structured corpus of historical English. It contains more than 475 million words of text from the 1820s-2010s, and the corpus is balanced by genre decade by decade.
https://www.english-corpora.org/coha/
British National Corpus (BNC): The BNC was originally created by Oxford University Press in the 1980s to early 1990s, and it contains 100 million words of texts from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic).
https://www.english-corpora.org/bnc/

Some Persian Corpora
Dr. Bijankhan: Text Corpus of the Persian Language
https://www.peykaregan.ir/dataset/%D9%BE%DB%8C%DA%A9%D8%B1%D9%87-%D9%85%D8%AA%D9%86%DB%8C-%D8%B2%D8%A8%D8%A7%D9%86-%D9%81%D8%A7%D8%B1%D8%B3%DB%8C
Dr. Mostafa Assi: Persian Language Database
http://pldb.ihcs.ac.ir/
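
Example: frequency counts and a concordance
Two ideas from this session, quantitative data on what occurs frequently or rarely and the making of concordances, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-sentence toy corpus, the function names (tokenize, word_frequencies, concordance) and the crude regex tokenizer are all hypothetical and are not taken from any of the corpora or tools listed above, which handle tokenization, annotation and search far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical toy corpus, invented for this example.
toy_corpus = [
    "The corpus shows how people really use the language.",
    "A learner corpus contains texts produced by language learners.",
]

freqs = word_frequencies(toy_corpus)
print(freqs.most_common(3))
for line in concordance(toy_corpus, "corpus"):
    print(line)
```

With a real corpus the same two operations are usually run over millions of words, and raw counts are normalized (for example per million words) so that corpora of different sizes can be compared.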