Corpus Use in Teaching Language Arts
Introduction to Corpus Linguistics
for Advanced Structure of American English
Introduction
Textbooks on English grammar attempt to describe the language as a system of rules that might explain how the child comes to know the language so quickly. This knowledge of language is referred to as grammatical competence. However, the use of language in everyday situations, known as grammatical performance, often affects competence since it provides the data that the child hears. Corpus linguistics aims to look at the actual use of language, written and spoken. The tasks you will do below are designed to make you familiar with this approach and to appreciate some of its possibilities for your own research and teaching.
Shifts in Inflectional Forms
Most aspects of language remain fairly constant over time, but irregular inflections tend to regularize unless the irregular form is common, as in drink/drank or sink/sank. For example, what form would you fill in to complete the following sentence?
I _____________ the lamp last night.
Using the Virtual Language Centre (VLC) Web Concordancer, compare the two forms of the past tense of light that we have just considered. To do so, click on "Simple search" under "English". In the VLC Web Concordancer (English), type each form in the second box after "Search string:". Then go to "Select corpus:" and, on the pull-down menu, select first "Brown Corpus" and get your results, and then "Sherlock Holmes stories" and get your results.
The Brown Corpus samples American English published in 1961; the Sherlock Holmes stories were first published from the late 1880s onward.
Which form occurs more frequently in the Brown Corpus? ___________________
Which occurs more frequently in the Sherlock Holmes stories? _________________
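The comparison above can also be tried offline on any plain text you have. The following is a minimal sketch in Python, not the VLC tool itself; the function name count_forms and the sample sentences are invented for illustration, and the sketch assumes the two competing forms are lit and lighted.

```python
import re
from collections import Counter

def count_forms(text, forms):
    """Count whole-word, case-insensitive occurrences of each form in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {form: counts[form.lower()] for form in forms}

# Hypothetical sample text, invented for illustration only.
sample = (
    "She lit the lamp. He lighted a cigar and lit the fire. "
    "The room was dimly lit."
)
result = count_forms(sample, ["lit", "lighted"])
print(result)
```

Note that a raw count does not separate the past tense from the past participle or adjective (as in dimly lit), so the concordance lines still need to be read.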
Collocations
Powerful tea and cooked coffee?
Certain words commonly occur together: coffee is brewed, not cooked, lights are turned off, not closed. These collocational patterns are language-specific and, as such, are often mind-boggling to the language learner. They are learned by constant exposure. A corpus can provide this exposure quickly.
The following sentences are ESL student productions in which the underlined word is not a standard collocate of the following word(s). Using the VLC concordancer, find a better word for each of the underlined words below.
A powerful dollar overseas hurts European markets. (Search Business and Economy and Sort left.)
This was his single chance for success. (Search Brown and Sort left.)
I like being with people, knowing new people, etc. (Search Times for 'new people', and Sort left.)
Stand on line or in line?
Should ESL students be taught to 'stand on line' or 'stand in line', or does it matter? You can check the use of the prepositions at the Collins CoBuild website. In the "Type in your query" box, type stand+1line. The +1 allows a single word to come between stand and line. How many times does stand on line occur? _________ stand in line? __________ Do Americans use on or in more often? _________
Part of Speech Identification
Many grammar texts give form criteria by which to judge the part of speech of a word. For example, a word is classified as a noun if it can occur with a plural or possessive ending, or if it has a noun-making morpheme like –ness or –tion, while it is classified as a verb if it can occur with the tense and participial endings, or if it has a verb-making morpheme like –ize or –ate. However, many words in English have the same form as nouns and verbs, e.g., house, button, garden, progress, permit, record. How is part of speech determined in such cases?
To answer this question, look at the usage of the word permit as it occurs in the Brown corpus. (Sort left.) The word permit can be a noun or a verb in English and, in its base form, there is no formal, morphological way of telling whether it is a noun or a verb.
Find the first three occurrences of permit as a verb. How do you know permit is being used as a verb here?
If you're not sure about assigning parts of speech, you might want to check out the online Part of Speech Tagger at the University of Colorado.
Testing textbook claims about POS
Klammer, Schulz and Della Volpe's Analyzing English Grammar says that completely, absolutely, totally, extremely, and excessively are adverbs even though they fit the qualifier frame:
The handsome man seems ______ handsome.
These words do pass the adverb form test (they end in -ly), but they fail the function tests: they cannot modify verbs, and they cannot move within the sentence. What is going on? Looking at how one of the -ly degree words is used in MICASE, the Michigan Corpus of Academic Spoken English, will help answer the question.
If we classify adverbs as words that (1) modify verbs and (2) can be moved within a sentence, find an example in the data above of totally used as an adverb.
If we classify qualifiers as words that (1) modify adjectives or adverbs and (2) can fit the slot in the frame sentence The handsome man is ________________ handsome, find an example in the data above of totally used as a qualifier.
This exercise shows that the part of speech class of a word depends in part on
a. the morphological form of the word
b. the context in which the word occurs
c. the grammatical function of the word
Exemplifying Standard and Non-Standard Forms
Adverbs or Adjectives in intransitive sentences?
Many grammar texts claim that there is a usage issue relating to the use of adverbs vs. adjectives following intransitive verbs, as in doing well vs. doing good, with the latter considered informal.
- Find all instances in MICASE of doing good and doing well. Which form is more common?
- MICASE classifies its data by speech event. (The speech events are listed to the left of the data.) Is there any correlation between the informal nature of the speech event and doing good, as opposed to the formal nature of the speech event and doing well?
Speech Event: ADV – advising session; COL – colloquia; DIS – discussion section; LAB – lab sections; LEL – large lecture; LES – small lecture; MTG – meetings; OFC – office hours; SEM – seminars; SGR – study groups; TUT – tutorials.
- How about the age and/or status of the speakers? (Speaker characteristics are listed to the right of the data.)
Syntactic Constructions
The passive with by
The often politically motivated claim "Mistakes were made" has recently been dubbed the 'past exonerative'. How often does the passive occur without by? To see how often mistakes occurs with made, with and without the by, in Collins' 56 million word database, type the following in the Collins Cobuild site's query box --
mistakes+2made
-- and click "Show Concs"
The +2 allows for a maximum of two words to intervene between mistakes and made. (If you want results for mistake and mistakes, type mistake*.)
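For readers who want to see the logic of such a proximity query, here is a minimal sketch in Python. It is not the Collins Cobuild implementation; the names word_matches and proximity_matches are hypothetical, the sample sentences are invented, and the sketch assumes that the number is the maximum count of intervening words and that a trailing * matches any ending.

```python
import re

def word_matches(token, pattern):
    """Match a lowercased token against a pattern; a trailing * is a wildcard."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern

def proximity_matches(text, first, second, max_gap):
    """Find `first` followed by `second` with at most max_gap words between."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, tok in enumerate(lowered):
        if not word_matches(tok, first):
            continue
        # The second word may sit 1 to max_gap + 1 positions after the first.
        for j in range(i + 1, min(i + max_gap + 2, len(lowered))):
            if word_matches(lowered[j], second):
                hits.append(" ".join(tokens[i:j + 1]))
                break
    return hits

# Hypothetical sample text, invented for illustration only.
sample = (
    "Mistakes were made. Mistakes of this kind are rarely made twice. "
    "One mistake was clearly made."
)
exact_hits = proximity_matches(sample, "mistakes", "made", max_gap=2)
wildcard_hits = proximity_matches(sample, "mistake*", "made", max_gap=2)
print(exact_hits)
print(wildcard_hits)
```

The sketch ignores sentence boundaries, so a match can span two sentences; a real concordancer may or may not behave the same way.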
Searching for passives is tricky because the past participle used in the passive voice often has the same form as the past tense. For example: I made a mistake (past tense); A mistake was made (past participle).
The Collins Cobuild site distinguishes past tense forms, which it labels VBD, from past participle forms, which it labels VBN, and verbs can be searched for as, for example, made/VBN. (Be aware that the VBN tag will give you all past participles, i.e., both the perfect has/have made and the passive am/is/are/was/were/be made.) So you can try the search again as follows --
mistake*+2made/VBN
Of the 40 samples that you see, how many have a by phrase? _________
You can see how often the passive form of a specific verb occurs with a by phrase in the entire 56 million word database by asking for the T-score under "Collocation Sampler". What is the joint frequency of the following items?
mistake*+2made and by? __________ (Type the first term; the by will appear in the table.)
mistake*/NOUN and made? __________
What percentage of the time does by appear with mistakes were made? _____________
Subject and object selection
According to the Collins Cobuild data, what subjects and objects can the verb prove take? Animate, inanimate, abstract, concrete, mass, count?
Verb complementation
Can you pretend something, i.e. a noun or a noun phrase? Type
pretend/VERB+NOUN
into the Collins Cobuild box to find out. (Note the occasional errors in the POS tagging of pretend.) What somethings can one pretend? _________________________________________
Can pretend be followed by "-ing" forms, infinitives or anything else? Type pretend* in the box to find out.
______________________________________________________________________________
______________________________________________________________________________
Know Your Tools
A concordance, in its simplest form, is an alphabetical listing of the words in a text, given together with the contexts in which they appear. The most common form of concordance today is the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field (e.g., 80 characters). The example given below was produced by Conc 1.70 (Macintosh), from a plain ASCII text version of the first book of Dickens' A Tale of Two Cities. Note that the line numbers are as calculated by Conc.
Figure 3.1.1: Concordance of poor in Tale of Two Cities, Book 1
1320   taste it is that such poor cattle always have in their mouths
 948          of sparing the poor child the inheritance of any part of
 778    small property of my poor father, whom I never saw--so long
1870    desolate, while your poor heart pined away, weep for it
 947            Miss, if the poor lady had suffered so intensely
1884          the love of my poor mother hid his torture from me
1615  stockings, and all his poor tatters of clothes, had, in a long
1577       faded away into a poor weak stain. So sunken and
1001      on your way to the poor wronged gentleman, and, with a
1036     detachment from the poor young lady, by laying a brawny hand
A concordancer is a software tool that produces such a list.
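To make the idea concrete, a very small concordancer can be sketched in a few lines of Python. This is an illustration only, not how Conc or any of the web concordancers is written; the function name kwic and the width parameter are hypothetical, and the sample text simply joins two fragments from the figure above.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so every keyword starts at `width`.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

# Sample text built from two fragments of the figure, for illustration only.
sample = (
    "the love of my poor mother hid his torture from me. "
    "the small property of my poor father, whom I never saw."
)
lines = kwic(sample, "poor", width=20)
for line in lines:
    print(line)
```

Because the left context is padded to a fixed width, the keyword lines up in a single column, which is what makes patterns on either side easy to scan.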
A collocate list is a list of words that occur in the neighborhood of the keyword. For example, a search for the keyword so in the Hong Kong Web Concordancer, with a request for words that occur at a distance of two words from so, returns the following words as the top collocates of so that occur to its right:
Right collocates for 'so'
The 132
that 127
as 110
to 77
in 57
a 49
and 47
it 46
He 40
of 35
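A list like the one above can be approximated with a simple count. The sketch below is in Python and is not the Hong Kong Web Concordancer's own method; the function name right_collocates and the sample text are invented, and it assumes that the collocates of interest are all words within a window of up to two words to the right of the keyword. Unlike the list above, it lowercases every token before counting.

```python
import re
from collections import Counter

def right_collocates(text, keyword, span=2):
    """Count words occurring within `span` words to the right of keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Hypothetical sample text, invented for illustration only.
sample = (
    "It was so cold that we left. She was so tired that she slept. "
    "So that is why."
)
collocates = right_collocates(sample, "so", span=2)
for word, freq in collocates.most_common():
    print(word, freq)
```

Raw counts of this kind favor very frequent words such as the and that, which is one reason concordancers also offer statistical measures of association.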
A part-of-speech tagger automatically tags each word in a text with its part of speech. Current taggers are about 97% accurate (as are human experts). The Collins CoBuild Concordancer allows you to search for part of speech strings rather than strings of words.
Searching, in the context of corpus work, means looking in the online text for a specific keyword, phrase, part-of-speech tag, etc.
Browsing means reading through the documents in the corpus. This is a useful activity only if the documents have been classified. For example, the MICASE corpus is categorized by speech event (lab, lecture, office hour, etc.) and by speaker (professor, undergraduate, grad student, native speaker, non-native speaker, male, female, etc.). This classification allows you to get a sense of the differences between one speech event or speaker type and another.
Sorting means listing words in alphabetical order. The Hong Kong Web Concordancer allows you to sort the collocates immediately to the right or to the left of the keyword.
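Sorting left or right can be pictured as ordering the concordance lines by the word that immediately precedes or follows the keyword. Here is a minimal sketch in Python under that assumption; the function name sort_concordance and the tuple layout (left context, keyword, right context) are hypothetical, and the sample lines are fragments from the Dickens figure.

```python
def sort_concordance(hits, side="right"):
    """Sort (left, keyword, right) tuples by the word adjacent to the keyword."""
    def adjacent_word(hit):
        left, _, right = hit
        words = (right if side == "right" else left).lower().split()
        if not words:
            return ""
        # Right sort uses the first word after the keyword; left sort uses
        # the last word before it. Punctuation is stripped before comparing.
        word = words[0] if side == "right" else words[-1]
        return word.strip(".,;:!?\"'")
    return sorted(hits, key=adjacent_word)

# Fragments from the figure, for illustration only.
hits = [
    ("the love of my", "poor", "mother hid his torture"),
    ("of sparing the", "poor", "child the inheritance"),
    ("desolate, while your", "poor", "heart pined away"),
]
by_right = sort_concordance(hits, side="right")
by_left = sort_concordance(hits, side="left")
print([h[2].split()[0] for h in by_right])
print([h[0].split()[-1] for h in by_left])
```

A right sort groups the nouns that poor modifies, while a left sort groups the determiners and possessives that precede it, which is why the exercises above ask for one or the other.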
Websites to get you started.
Corpus Linguistics at the Hong Kong Polytechnic University. Contains a tutorial on corpus linguistics as well as a corpus linguistics course outline with student assignments, etc.
The Internet Grammar of English is an online course in English grammar written primarily for university undergraduates. IGE does not assume any prior knowledge of grammar. It includes interactive exercises.
English Grammar on the Web is a resource designed to support ESL/EFL teachers, but it has valuable lists of links to other web resources on English grammar. Particularly helpful is its Lists of Grammar Lists.
The Hong Kong Web Concordancer is a concordance program that allows you to search several million words of English sampled from many sources.
Alternate URL for The Hong Kong Web Concordancer
The Collins CoBuild Concordancer allows you to search 56 million words of contemporary written and spoken documents. It also allows you to search these documents by part-of-speech tag.
A Ten-step Introduction to Concordancing through the Collins Cobuild Corpus Concordance Sampler. Instructions for using Collins Cobuild effectively.
MICASE, the Michigan Corpus of Academic Spoken English, allows you to search 1,848,364 words of English transcribed from lectures, conversations, service encounters, etc. recorded at the University of Michigan.
POS Tagger, a statistically-based part-of-speech tagger from the University of Colorado; it returns your sentence with Penn Treebank part-of-speech tags assigned.
Bookmarks for Corpus Linguists is a list of web resources on using corpora and links to software and corpus collections.
The Web Concordances and Workbooks from the University of Dundee English Department. This site is devoted to the study of literature using literary computer concordancing, a form of analysing text. This document will attempt to help students understand what is meant by literary concordancing.
WordNet, a lexical database for the English language.
Concordances and Corpora. A tutorial by Catherine Ball on the design of corpora, the use of concordances, and available concordancing software.
The Montclair Electronic Language Database (MELD). A collection of ESL student essays and background information on L1, native country, age, gender, and other languages spoken.