Ling/CSE 472: Assignment 1: Regular expressions
Due April 8th, by 6:00 pm
1. Elizalike
This part of the assignment asks you to create a program that behaves like
Weizenbaum's ELIZA (see pp. 25-26 of the text). We have provided a skeleton of a
script that handles input and output, and provides an example of the Python
syntax for using regular expressions to modify strings.
Each student should develop their own program, although you are welcome to ask
each other questions (in person, over email, or on our GoPost bulletin board).
You will need to find a partner for this project, as one of the tasks is to
test each other's programs (see below).
Specifications:
The basic approach is to read in a string of input from the user, modifying it
successively (sometimes subtly, sometimes drastically, depending on the input
string), and print out the result. To maintain the illusion of AI, it is
crucial that elizalike print out grammatical strings. (You may assume that it
is given grammatical input.) Furthermore, elizalike should be able to handle
person deixis, referring to itself in the first person and to the user in the
second person.
Before you start, look at the list of items to turn in
below, so you know what you'll need to save.
Your tasks:
- Develop a list of sentences that you will use to test your program to make sure
it handles the person deixis correctly. This list must illustrate all ways in which
1st and 2nd person are marked in English (full pronoun paradigms, and
subject-verb agreement with the verb be), and all possible forms each of those
elements can appear in (including variation in capitalization).
The thoroughness of the coverage of these sentences will be a significant part
of the grade for this assignment.
- Modify the Python script to implement the handling of
person deixis. The basic strategy is to first replace any second person
reference in the input with some string that's unlikely to show up otherwise.
(The sample expression in the script we've given you
replaces it with third-person reference to Eliza). Then replace
any first person reference in the input with second person reference. Finally,
replace your otherwise unlikely string (from the first step) with first person
reference. Each of these steps will take several lines as you handle pronouns,
verbs, and upper and lower case letters (e.g., if the user types "My friend...",
Elizalike's output should be "Your friend..." and not "your friend...").
Be sure to read all of the comments in the file (lines starting with #, which
are for human consumption and ignored by Python). You should probably test each
line as you add it, by running the program again and using an appropriate
sentence from your test file. Note that before you make any changes, the
program runs, just in a boring way: It repeats whatever the user types in,
except that it changes all occurrences of "you are" to "---Eliza-is---".
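The three-step strategy described above can be sketched as follows. This is a minimal illustration, not the skeleton script itself: the function name `swap_person` and the placeholder strings are assumptions made up for this example, and it covers only the possessive pronouns "my" and "your". The remaining pronoun forms and the forms of "be" are left for you to work out.

```python
import re

def swap_person(text):
    """Swap possessive 'my' and 'your' only; deliberately incomplete."""
    # Step 1: hide second-person forms behind strings that are unlikely
    # to occur in real input (hypothetical placeholders).
    text = re.sub(r"\bYour\b", "---ELIZA-POSS-CAP---", text)
    text = re.sub(r"\byour\b", "---ELIZA-POSS---", text)
    # Step 2: turn first-person forms into second-person forms,
    # keeping the capitalization of the original word.
    text = re.sub(r"\bMy\b", "Your", text)
    text = re.sub(r"\bmy\b", "your", text)
    # Step 3: turn the placeholders into first-person forms.
    text = text.replace("---ELIZA-POSS-CAP---", "My")
    text = text.replace("---ELIZA-POSS---", "my")
    return text
```

Without the placeholder step, the "my" to "your" replacement and the "your" to "my" replacement would undo each other, which is why the second-person forms are set aside first.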
Instructions on using Python
- Add at least two statements that find one keyword in the input and change the
whole string to something different. (See the third and fourth examples on page
26 of the textbook for a model, but don't copy them exactly!)
- Add at least two statements that find some keyword in the input, and return a
significantly changed output that nonetheless contains some part of the input that
may vary from time to time. (See the first and second examples on page 26,
but feel free to get fancier than that!)
- Find a partner and exchange programs. Looking at the code for your partner's
program, try to find at least 2 interestingly different inputs that cause their
program to produce ungrammatical output. (Keep your inputs grammatical!) We're
pretty sure you'll be able to find these, but if your partner's program is too
perfect, you can get full credit for this part of the assignment by turning in
an explanation of 5 pitfalls you looked for and how they were avoided.
- Modify your program to avoid the ungrammatical outputs your partner found (if
any). It is preferable to keep the original functionality of your program and
fix the bugs, but if that's impossible, you can replace the problematic
statement(s) with simpler ones with different behavior.
- In 2-4 paragraphs, discuss why English morphology and syntax make this program
relatively straightforward, and how it would be more complicated in some other
specific language.
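For the two keyword tasks in the list above, the sketch below shows one rule of each kind. The function name `respond`, the keywords, and the reply wording are all invented for illustration and are not taken from the skeleton script or the textbook; your own rules should differ.

```python
import re

def respond(text):
    """Return a reply based on two hypothetical keyword rules."""
    # Kind 1: the keyword alone determines the entire reply.
    if re.search(r"\bweather\b", text, flags=re.IGNORECASE):
        return "Do you often think about the weather?"
    # Kind 2: the reply reuses the variable part of the input that
    # follows the keyword phrase, minus any trailing punctuation.
    match = re.search(r"\bI want to (.+?)[.!?]*$", text)
    if match:
        return "What would it mean to you to " + match.group(1) + "?"
    # No keyword found: return the input unchanged.
    return text
```

Note that the captured text in the second rule would still need the person deixis handling: an input such as "I want to see my friend." yields a reply that contains "my friend" where "your friend" is needed.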
Turn in the following via our CollectIt dropbox. To facilitate grading, please submit these files, with these names:
| sentences.txt | Your list of test sentences |
| elizalike1.py | The first version of your program (that you gave to your partner) |
| partner.txt | The name of your partner and the problems you found with their program, or an explanation of how they avoided 5 pitfalls you thought up. |
| elizalike2.py | The second version of your program |
| eliza_discussion.txt | Your discussion of English and other language morphology and syntax (see the last task above). |
Note: We will be executing your code, so make sure it runs.
2. Tokenizer
This part of the assignment asks you to write a Python script that will take an
ordinary text file, and return a file with the same content, reformatted to be
one sentence per line.
Each student should develop their own program, although you are welcome to ask
each other questions (in person, over email, or on our GoPost bulletin board).
Once again, we will supply a skeleton Python script which
handles input and output (this time reading in a file and writing out to a
file). We will also supply a test file
that you will use to develop the script.
The basic algorithm is the following:
- Read the input file in one line at a time, and modify the lines as follows,
before printing them to the output file:
  - Remove all existing newlines.
  - Replace all periods that do not indicate the end of a sentence with a special
    string.
  - Do the same for other typical sentence-final punctuation marks (using a
    different special string for each one).
  - Put in a newline after every remaining sentence-final punctuation mark.
  - Replace the special strings you put in with the punctuation marks they
    correspond to.
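The steps above can be sketched as follows. This is only an illustration under stated assumptions: it works on a single string rather than reading a file line by line, the placeholder `PERIOD_MARK` and the short `ABBREVIATIONS` list are hypothetical, and it protects periods only (not the other punctuation marks) and ignores trailing quotation marks.

```python
import re

# Hypothetical placeholder and abbreviation list, for illustration only.
PERIOD_MARK = "<<PERIOD>>"
ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms"]

def split_sentences(text):
    """Return text reformatted to one sentence per line (partial sketch)."""
    # Remove all existing newlines, along with the whitespace around them.
    text = re.sub(r"\s*\n\s*", " ", text).strip()
    # Replace periods that follow a known abbreviation with the placeholder.
    abbrev = "|".join(ABBREVIATIONS)
    text = re.sub(r"\b(" + abbrev + r")\.", r"\1" + PERIOD_MARK, text)
    # Put a newline after every remaining sentence-final mark that is
    # followed by whitespace.
    text = re.sub(r"([.?!:])\s+", r"\1\n", text)
    # Restore the protected periods.
    return text.replace(PERIOD_MARK, ".")
```

The placeholder matters for the same reason as in the Elizalike task: once a period is hidden, the later newline rule cannot see it, so the rules do not interfere with one another.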
Specifications:
Treat the four marks . ? ! : as sentence-ending punctuation. Quotation marks after a
sentence-final element should be on the same line as that element. Don't worry
if your script breaks a single quote that contains several sentences into different lines.
Your tasks:
- Download the skeleton script, the
test file, and a tokenized version of the test file for comparison
(the gold standard) to your local machine.
- Modify the Python script as specified above so that it produces an output file
that matches the gold standard. You call this script with an argument
designating the input file:
    python tokenizer.py inputfile
- Find some other text file to run it over, such as a news article from the web.
Identify at least 3 cases your script doesn't yet handle properly, 2 where it
overgenerates (splitting where it shouldn't) and 1 where it undergenerates (not
splitting where it should). (If you don't find them in the first file you try,
run more text through.)
- Modify your script to handle those 3 cases properly.
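One way to check whether your output matches the gold standard is to diff the two line by line. The helper below is a sketch using Python's standard `difflib` module; the function name `diff_against_gold` is an assumption for this example and is not part of the skeleton script.

```python
import difflib

def diff_against_gold(output_lines, gold_lines):
    """Return unified-diff lines; an empty list means the two match."""
    return list(difflib.unified_diff(
        gold_lines, output_lines,
        fromfile="gold", tofile="output", lineterm=""))
```

To use it, read each file into a list of lines (for example with `open(name).read().splitlines()`, where `name` is whatever your output and gold standard files are called) and print the returned lines. Lines starting with "-" appear only in the gold standard and lines starting with "+" appear only in your output.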
Submit your answers to our CollectIt dropbox. To facilitate grading, please submit these files, with these names:
| tokenizer1.py | The first version of your script |
| misses.txt | A brief description of the cases you didn't handle properly |
| tokenizer2.py | The second version of your script |
Again, make sure we can run your scripts.