- On-line corpora with query engines
There are three large clusters, each holding multiple part-of-speech-tagged corpora. The clusters differ in their tag sets and query interfaces, but each provides some help with its query language.
- INTELLITEXT, Centre for Translation Studies, University of Leeds (Serge Sharoff et al.). Offers online access to marked-up corpora in 12 languages and the ability to create a tagged corpus on their site by uploading a text. POS tags are the Penn Treebank set and the query language is CQP. Corpora for English include:
- UKWAC: UK-based web pages (from a few years back; 2 billion words)
- WIKI-EN: Wikipedia entries—943 million words
- BLOGS-EN: Political blogs—500 million words
- INTELLITEXT-ICE. Here the interface accesses the available one million word ICE corpora of spoken and written global Englishes.
- INTELLITEXT-ACADEMIC presents the results of BootCaT collections of web pages using keywords of various disciplines. About 70 million tokens in nine academic "genres"/domains.
- CORPUS EYE, University of Southern Denmark (Eckhard Bick). Also multiple languages. POS tags are English Constraint Grammar (and include grammatical relations) and the query language is CQP. The English portion (333.6 million words in all) contains corpora of (among other things):
- European Parliament debate (25.7 million words)
- Wikipedia (115.2 million words in three parts)
- Chat corpus (23.5 million words)
- UCLA Communications Studies Archive of Television News (24.4 million words in two parts: 2005-2009 and 2010-2012) And see Bick
- Supreme Court Dialogues (2 million words)
- Enron emails (82.5 million words in 3 parts)
- beauty blog (304,000 words)
- THE CORPORA AT BYU (Mark Davies). Corpora in this cluster are tagged with CLAWS and use the "BYU Interface".
- RDUES at Birmingham City University has both a live Web search interface (WebCorp Live, whose results can be filtered nicely) and a set of large, POS-tagged corpora (WebCorp LSE).
- Searchable corpora of transcribed speech
- Corpus of Academic Spoken English lets you browse and search for any word in a highly stratified corpus and download the lecture, speech, or conversation transcripts that contain the word. It currently has 152 transcripts totalling 1.8 million tokens.
- BASE (the British Academic Spoken English corpus) is made up of 160 lectures and 39 seminars in various disciplines, totalling about 1.64 million tokens, recorded between 2000 and 2005. Transcripts are available from the Warwick Centre for Applied Linguistics and from the Oxford Text Archive, with and without pauses indicated. There is a free search interface in The Sketch Engine.
- VOICE: Vienna Oxford International Corpus of English. International students speaking English as a lingua franca (ELF) (no L1 speakers participating). One million words. Searchable online with free registration.
- HKCSE: Hong Kong Corpus of Spoken English. English as spoken in Hong Kong. 907 K words. Also corpora of texts in various technical and professional contexts.
- TV Broadcast:
- (see Corpuseye UCLA TV News archive)
- (see COCA Spoken subcorpus—95.7 million words)
- Free corpora for download
- British Academic Written English (BAWE) is the counterpart to BASE and is open for free access at The Sketch Engine. The corpus consists of writing by British university students and can be sorted by genre and discipline. The full corpus (6.7 million words) is available at the Oxford Text Archive.
- The ACL Anthology Reference Corpus of 50 million words (through 2005; POS tagged) is searchable online from The Sketch Engine.
- Subscription corpora and tools
- Newspapers online: General corpora such as BNC and ANC have a newspaper component, and many newspapers have searchable online archives, but the mother of all newspaper corpora is LexisNexis Academic, which is available to folks with an account at most university libraries. It allows a lot of tuning: which newspapers, which parts of the paper, presentation of results, and so on. It responds somewhat to "literals" (" ") around the search term, but lemmatizes slightly, lumping together both present-tense forms of verbs and singular/plural forms of nouns. Its 'best fit' algorithm will occasionally scramble the order of words, and certain words ("be" and its forms, "and") flood it with undesired hits. It will count the search term more than once when the same article appears in different newspapers. Moral: eyeball the results of the search and adjust them.
- Sketch Engine (SkE) gives you preloaded corpora in several languages, the WebBootCaT web-corpus builder (available for free as BootCaT), Corpus Builder to upload and install your own corpora, and the BASE Plus interface to BASE. 30 days free, then €55.25/ann.
- The 550M token Collins WordBank is available via a Sketch interface for a 30-day trial; after that, a jillion pounds sterling/ann. It is a mix of sources from the Inner Circle English-speaking countries, with most texts dating from 2000-2006.
- WMatrix "provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains." 30 days free, then £50/ann.
- Free Viewers, Concordancers and Other Tools
- Kfngram makes ngram indices of any text(s) you give it. Like WSTools' Cluster function, but free. Works on Windows.
- Google Books Ngram Viewer allows you to search the vast and historically stratified bodies of texts of many languages for occurrences of one or more terms. Since it keeps American and British English texts separate, it is possible to gather comparative data on, for example, the spelled forms of words.
- CONE (COllocational Network Explorer) is a Java jar file that must be downloaded along with a second zipped app that will take a corpus as an input and graph the highest ranking collocations of any word you give it. Here is an example of the displayed collocates of living in a sample corpus.
- Laurence Anthony's AntConc is light, cross-platform, and well-documented. Handles Asian as well as European languages. The most useful of the free concordancers. Good Help file.
- The UCREL Tools page includes some free tools and some for hire. The LL (Log-Likelihood) Calculator is free and handy.
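For anyone curious what a log-likelihood comparison of this kind computes, here is a minimal sketch in Python. It assumes the usual two-corpus form of the statistic (observed frequency of a word in each corpus compared against the frequency expected from the corpus sizes); whether the online calculator uses exactly this form is an assumption, and the function name `log_likelihood` is made up for illustration.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood for one word (illustrative sketch).

    freq1, freq2: observed counts of the word in corpus 1 and corpus 2
    size1, size2: total token counts of corpus 1 and corpus 2
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Counts we would expect if the word were equally common in both corpora
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        # A zero count contributes nothing (0 * ln 0 is treated as 0)
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


if __name__ == "__main__":
    # Hypothetical counts: 20 hits in 1,000 tokens vs. 10 hits in 1,000 tokens
    print(round(log_likelihood(20, 1000, 10, 1000), 2))
```

The higher the value, the less likely it is that the difference in frequency is an accident of sampling; identical relative frequencies give a value of zero.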
- DIY Webcorpus Tools
- BootCaT front_end: This installer will set up BootCaT, a Java package of Perl scripts that takes a few seed terms, collects web URLs containing those seed terms, and then downloads the webpages. It then cleans the files of duplicates and some other cruft and concatenates them into a corpus selected by the seeds. The front end uses the Windows Azure search API and requires the user to obtain a password by registration. The package is platform independent but requires a working Perl library (which the front end will take care of).
- Jaguar is an on-line package of Perl scripts that enables you to collect web pages from a seed you enter. It recommends using more than one word on the seed line and will only search that one set of seeds. The result is a similar corpus, but housed on line. (This capacity is also available for Intellitext, but is not always implemented. Intellitext also automatically does POS tagging.) Jaguar can do association searches between pairs of terms, which means co-occurrence anywhere within a set window.
- These web-crawling, collecting, tagging, and archiving functions are also available through Sketch Engine to subscribers.
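The idea of an association search (two terms co-occurring anywhere within a set window) can be sketched in a few lines of Python. This is only an illustration of the general technique, not Jaguar's own code: the tokenizer, the symmetric window, and the names `tokenize` and `cooccurrences` are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a rough stand-in)."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrences(tokens, term_a, term_b, window=5):
    """Count occurrences of term_a that have term_b within `window` tokens on either side."""
    count = 0
    for i, token in enumerate(tokens):
        if token != term_a:
            continue
        # Tokens to the left and right of position i, excluding the node word itself
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        if term_b in left or term_b in right:
            count += 1
    return count


if __name__ == "__main__":
    sample = "The living room was small but the cost of living was high"
    tokens = tokenize(sample)
    print(cooccurrences(tokens, "living", "room", window=2))
```

Widening the window picks up looser associations at the cost of more accidental ones, which is one reason to eyeball the hits rather than trust the count alone.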
This is one of four sites of (on-line) Resources for English language study maintained by George Dillon, University of Washington. The others are:
Phonetics Resources, English Syntax Resources, Semantics Resources
modified: May 2015