Corpus Resources

  1. On-line corpora with query engines. There are three large clusters of part-of-speech-tagged corpora. Each uses a different tag set and corpus query language, but each provides some help with its query language.
    • INTELLITEXT, Centre for Translation Studies, University of Leeds (Serge Sharoff et al.). Offers online access to marked-up corpora in 12 languages and the ability to create a tagged corpus on the site by uploading a text. POS tags are the Penn Treebank set and the query language is CQP. Corpora for English include:
      • UKWAC: UK-based web pages (from a few years back; 2 billion words)
      • WIKI-EN: Wikipedia entries (943 million words)
      • BLOGS-EN: Political blogs (500 million words)
    • INTELLITEXT-ICE. Here the interface accesses the available one-million-word ICE corpora of spoken and written global Englishes.
    • INTELLITEXT-ACADEMIC presents the results of BootCaT collections of web pages using keywords of various disciplines. About 70 million tokens in nine academic "genres"/domains.
    • CORPUS EYE, University of Southern Denmark (Eckhard Bick). Also multiple languages. POS tags are English Constraint Grammar (and include grammatical relations) and the query language is CQP. The English portion (333.6 million words in all) contains corpora of (among other things):
      • European Parliament debate (25.7 million words)
      • Wikipedia (115.2 million words in three parts)
      • Chat corpus (23.5 million words)
      • UCLA Communications Studies Archive of Television News (24.4 million words in two parts: 2005-2009 and 2010-2012). And see Bick.
      • Supreme Court Dialogues (2 million words)
      • Enron emails (82.5 million words in 3 parts)
      • Beauty blog (304,000 words)
    • THE CORPORA AT BYU (Mark Davies). Corpora in this cluster are tagged with CLAWS and use the "BYU Interface":

      Corpus                                          # words      language/dialect  time period
      Wikipedia Corpus (with virtual corpora)         1.9 billion  English           -2014
      Global Web-Based English (GloWbE)               1.9 billion  20 countries      2012-2013
      Corpus of Contemporary American English (COCA)  450 million  American          1990-2012
      Corpus of Historical American English (COHA)    400 million  American          1810-2009
      TIME Magazine Corpus                            100 million  American          1923-2006
      Corpus of American Soap Operas                  100 million  American          2001-2012
      British National Corpus (BYU-BNC)*              100 million  British           1980s-1993
      Strathy Corpus (Canada)                         50 million   Canadian          1970s-2000s

    • RDUES at Birmingham City University has both a live Web search interface (WebCorp Live, which can be filtered nicely) and a set of large, POS-tagged corpora (WebCorp LSE).
  2. Searchable corpora of transcribed speech
    • MICASE—Michigan Corpus of Academic Spoken English lets you browse and search for any word in a highly stratified corpus and to download the lecture or speech or conversation transcripts that contain the word. Currently has 152 transcripts totalling 1.8 million tokens.
    • BASE (the British Academic Spoken English corpus) is made up of 160 lectures and 39 seminars in various disciplines, totalling about 1.64 million tokens, recorded between 2000 and 2005. Transcripts are available from the Warwick Centre for Applied Linguistics and from the Oxford Text Archive, with and without pauses indicated. There is a free search interface in The Sketch Engine.
    • VOICE: Vienna Oxford International Corpus of English. International students speaking English as a lingua franca (ELF) (no L1 speakers participating). One million words. Searchable online with free registration.
    • HKCSE: Hong Kong Corpus of Spoken English. English as spoken in Hong Kong. 907,000 words. The site also has corpora of texts in various technical and professional contexts.
    • TV Broadcast:
      • (see Corpuseye UCLA TV News archive)
      • (see COCA Spoken subcorpus—95.7 million words)
  3. Free corpora for download
    1. BAWE (British Academic Written English) is the counterpart to BASE and is open for free access at The Sketch Engine. The corpus consists of writing by British university students and can be sorted by genre and discipline. The full corpus (6.7 million words) is available at the Oxford Text Archive.
    2. The ACL Anthology Reference Corpus of 50 million words (through 2005; POS-tagged) is searchable online from The Sketch Engine.
  4. Subscription corpora and tools
    1. Newspapers online: General corpora such as BNC and ANC have a newspaper component, and many newspapers have searchable online archives, but the mother of all newspaper corpora is LexisNexis Academic, which is available to anyone with an account at most university libraries. It allows a lot of tuning: which newspapers, which parts of the paper, presentation of results, and so on. It responds somewhat to "literals" (" ") around the search term, but lemmatizes slightly, lumping together both present-tense forms of verbs and the singular and plural forms of nouns. Its 'best fit' algorithm will occasionally scramble the order of words, and certain words (be and forms of be, and) flood it with undesired hits. It will also count the search term more than once when the same article appears in different newspapers. Moral: eyeball the results of the search and adjust them.
    2. Sketch Engine (SkE) gives you preloaded corpora in several languages, the WebBootCaT web-corpus builder (available for free as BootCaT), Corpus Builder to upload and install your own corpora, and the BASE Plus interface to BASE. 30 days free, then €55.25/ann.
    3. The 550M-token Collins WordBank is available via a Sketch interface for a 30-day trial; after that, a jillion pounds sterling/ann. It is a mix of sources from the Inner Circle English-speaking countries, with most texts dating from 2000-2006.
    4. WMatrix "provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains." 30 days free, then £50/ann.


  5. Free Viewers, Concordancers and Other Tools
    1. kfNgram makes n-gram indices of any text(s) you give it. Like the Cluster function in WordSmith Tools, but free. Works on Windows.
    2. Google Books Ngram Viewer allows you to search the vast and historically stratified bodies of texts of many languages for occurrences of one or more terms. Since it keeps American and British English texts separate, it is possible to gather comparative data on, for example, the spelled forms of words.
    3. CONE (COllocational Network Explorer) is a Java jar file that must be downloaded along with a second zipped app. It takes a corpus as input and graphs the highest-ranking collocates of any word you give it. Here is an example of the displayed collocates of "living" in a sample corpus.
    4. Laurence Anthony's AntConc is light, cross-platform, and well documented. It handles Asian as well as European languages. The most useful of the free concordancers. Good Help file.
    5. The UCREL Tools page includes some free tools and some for hire; the LL (Log-Likelihood) Calculator is free and handy.
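The comparison that a log-likelihood calculator performs can be sketched as follows. This is a minimal sketch assuming the usual two-corpus formulation, in which a word's observed frequency in each corpus is compared with the frequency expected if the word were spread evenly across both. The function name `log_likelihood` and its parameters are illustrative and are not taken from the UCREL tool.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one word's frequency in two corpora.

    freq1, freq2: observed frequency of the word in corpus 1 and corpus 2.
    size1, size2: total number of tokens in corpus 1 and corpus 2.
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if the word were equally likely in both corpora.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        # A zero observed frequency contributes nothing to the sum.
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll
```

For example, `log_likelihood(10, 1000, 20, 2000)` is 0, since the word has the same relative frequency in both corpora. Read against a chi-square distribution with one degree of freedom, values above 3.84 are conventionally taken as significant at p < 0.05.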
  6. DIY Webcorpus Tools
    1. BootCaT front end. This installer will set up BootCaT, a Java package of Perl scripts that takes a few seed terms and then downloads first a list of web URLs, then the web pages, containing those seed terms. It then cleans the files of duplicates and some other cruft and concatenates them into a corpus defined by the seeds. The front end uses the Windows Azure search API and requires the user to obtain a password by registration. The package is platform-independent but requires a working Perl installation (which the front end will take care of).
    2. Jaguar is an on-line package of Perl scripts that enable you to collect web pages from a seed you enter. It recommends using more than one word on the seed line and will only search that one set of seeds. The result is a similar corpus but housed on line. (This capacity is also available for Intellitext, but is not always implemented. Intellitext also automatically does POS tagging.) Jaguar can do association searches between pairs of terms, which means co-occurrence anywhere within a set window.
    3. These web-crawling, collecting, tagging, and archiving functions are also available through Sketch Engine to subscribers.
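The seed-to-corpus steps described for these tools can be sketched roughly as below. This is not BootCaT's own code: it assumes that seed terms are combined into small tuples to form search queries, it leaves out the search and download steps (which need a search API account), and it treats "duplicates" as pages whose text is identical once whitespace is normalized. The names `seed_tuples` and `build_corpus` and their parameters are hypothetical.

```python
import hashlib
import itertools
import random


def seed_tuples(seeds, tuple_size=3, how_many=10, rng=None):
    """Combine seed terms into search queries (one query per tuple)."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:how_many]]


def build_corpus(pages):
    """Drop empty and duplicate pages, then join the rest into one text."""
    seen = set()
    kept = []
    for text in pages:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if normalized and digest not in seen:
            seen.add(digest)
            kept.append(normalized)
    return "\n\n".join(kept)
```

In use, each string returned by `seed_tuples` would be sent to a search engine, the resulting pages downloaded and stripped of markup, and the page texts passed to `build_corpus`.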

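An association search of the kind described for Jaguar, co-occurrence of two terms anywhere within a set window, can be illustrated with a short sketch. It assumes whitespace tokenization and a window measured in tokens on either side; the function name `cooccurrences` and the default window of 5 are illustrative choices, not Jaguar's actual settings.

```python
def cooccurrences(tokens, term_a, term_b, window=5):
    """Count pairs where term_b falls within `window` tokens of term_a."""
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    count = 0
    for i, tok in enumerate(tokens):
        if tok == term_a:
            # Count every occurrence of term_b close enough to this term_a.
            count += sum(1 for j in positions_b if j != i and abs(i - j) <= window)
    return count


text = "the living room and the cost of living in a living wage city"
tokens = text.lower().split()
print(cooccurrences(tokens, "living", "wage", window=2))
```

Widening the window raises the count, so the window size should be reported along with any association figures.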


This is one of four sites of (on-line) Resources for English language study maintained by George Dillon, University of Washington. The others are:

Phonetics Resources, English Syntax Resources, Semantics Resources


Last modified: May 2015