Due: 02/15/12
Length: description and analysis: 2-3 pages; with data: 4-5 pages
Purpose: To use the POS taggers to identify problematic spots in your passage and to decide on the correct tagging.
Materials: A prose passage of about 100 words in a plain ASCII .txt file.
Programs: VISL, LingSoft, SVMTool, CLAWS, CogCompGrp
Select your passage and paste it into the textarea window of one of the rule-based taggers (VISL Flat Structure or LingSoft). (These use the rule-based English Constraint Grammar procedure and tag set, but with a few differences.) The machine will return a tagged version of your passage in a few seconds. Print the page with the data on it from the browser to the printer.
Closely examine the results for correctness of the analysis. You might expect to find at least a couple of errors.
Run these sentences through one statistical tagger (SVMTool, CCG at UIUC, or CLAWS) to see whether it comes closer to being right in those places. Note: These statistical taggers use different tag sets (PennTree POS for SVMTool, CogCompGrp, and TreeTagger, and a 60-tag set for CLAWS5, which is used in the British National Corpus), but the table of tags should help you compare them. BTW, CLAWS is actually a hybrid tagger with a rule-based add-on to correct errors of the main statistical engine. The Corpus of Contemporary American English (COCA) uses the larger CLAWS7 tag set (160 tags). You can select this tag set if you want. Also, you could use the SS or Stanford Parser taggers packaged in Antelope (see Syllabus). Assume that the PennTree taggers are targeting the 34-page Manual.
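If you would like to line up the output of two taggers mechanically before checking it by hand, a short script can do the first pass. The sketch below is only an illustration, not part of the assignment: the function name `disagreements`, the coarse category labels, and the two sample taggings are made up for this example (they are not real tagger output), and the mapping tables cover only a handful of Penn Treebank and CLAWS5 tags. It also assumes that both taggers split the text into the same tokens, which is not always the case.

```python
# Illustrative subsets only: extend these from the table of tags as needed.
# The coarse labels (NOUN, VERB, ...) are invented for this sketch.
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV",
    "IN": "PREP",  # Penn IN also covers subordinating conjunctions
    "DT": "DET",
}

CLAWS5_TO_COARSE = {
    "NN0": "NOUN", "NN1": "NOUN", "NN2": "NOUN", "NP0": "NOUN",
    "VVB": "VERB", "VVD": "VERB", "VVG": "VERB",
    "VVI": "VERB", "VVN": "VERB", "VVZ": "VERB",
    "AJ0": "ADJ", "AJC": "ADJ", "AJS": "ADJ",
    "AV0": "ADV",
    "PRP": "PREP",  # in CLAWS5, PRP is a preposition, not a pronoun
    "AT0": "DET", "DT0": "DET",
}


def disagreements(penn_tagged, claws_tagged):
    """Return (word, penn_tag, claws_tag) for every token to check by hand.

    A token is reported when the two coarse categories differ, or when
    either tag is missing from the tables above.
    """
    if len(penn_tagged) != len(claws_tagged):
        raise ValueError("token counts differ; align the tokens by hand first")
    flagged = []
    for (word, penn_tag), (_, claws_tag) in zip(penn_tagged, claws_tagged):
        penn_class = PENN_TO_COARSE.get(penn_tag)
        claws_class = CLAWS5_TO_COARSE.get(claws_tag)
        if penn_class is None or claws_class is None or penn_class != claws_class:
            flagged.append((word, penn_tag, claws_tag))
    return flagged


# Hand-written sample taggings (hypothetical, not produced by any tagger).
penn_sample = [("The", "DT"), ("old", "JJ"), ("man", "NN"),
               ("the", "DT"), ("boats", "NNS")]
claws_sample = [("The", "AT0"), ("old", "AJ0"), ("man", "VVB"),
                ("the", "AT0"), ("boats", "NN2")]

for word, penn_tag, claws_tag in disagreements(penn_sample, claws_sample):
    print(word, penn_tag, claws_tag)
```

A script like this only tells you where the taggers disagree at a coarse level. Deciding which tag is correct, and noticing places where both taggers are wrong, is still your job.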
Note: This time, do not supply trees and do not concern yourself with phrases or grammatical functions. Use your time on POS.
Include a copy of your text.