Assignment 2: Markov Mashups (50 Pts)

Buddy assignment

Chris Tralie

Due Monday 10/5/2020

Overview / Logistics

The purpose of this assignment is to give you practice with classes and HashMaps in a fun, creative application. It will also give you some exposure to natural language processing, an extremely important application area in computer science, and to Markov chains, which are incredibly powerful tools for representing sequences mathematically.

Learning Objectives

  • Write methods to fit a specification
  • Use appropriate types in a HashMap data structure
  • Populate HashMaps with data on the fly
  • Use instance methods and variables appropriately to create modular, reusable code in a class structure
  • Load text files line by line in Java
  • Write a complete program with very little code in the main function

What to submit

There is no skeleton code for this lab. Instead, you should make a new project in NetBeans, and when you are finished, you should zip it up and submit it on Canvas. Finally, please submit a README document with the following items:

  1. Your poetic statement, a title to go along with it, and the name/pseudonym you would like to use (results will be displayed on the class web site).
  2. Your ethics reflection.
  3. Your reflection on struggle
  4. Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge whether the assignments are too easy or too hard)
  5. Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  6. Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more clearly you articulate the problem, the more partial credit you will receive (it's fine to leave this blank)

Suggested Timeline

This is a longer assignment, and you'll need to pace yourself. I'd suggest you loosely follow the timeline below:

  • By Friday 9/25: Have your class and data structures set up, and be in the process of attempting the method to add new text to the model
  • By Monday 9/28: Finish debugging your method to add text, and start working on synthesizing text
  • By Friday 10/2: Finish debugging your method to synthesize text, and test it on a number of examples
  • By Monday 10/5: Finish the poetic statement, ethics reflection, and reflection on struggle.

Background: Markov Chains for Text Representation

Human languages, often referred to as "natural languages," are rich and full of meaning. Compared to computer languages, they are much harder to automatically interpret and process. However, as you will see in this assignment, we can make some progress at representing natural languages using a few simple statistical ideas.

In this assignment, our focus will be on creating an algorithm that captures the style of some given text from a set of examples and synthesizes new text in that style (for extra credit, students can also extend this concept to classify text by style). The simplest possible thing we could do is try to match the frequency of occurrence of the characters in the examples when synthesizing new text. For example, if we use a collection of Spongebob quotes, we find that we have the following character counts across all quotes:

{' ': 869, '!': 35, ''': 62, ',': 79, '-': 4, '.': 90, '0': 3, '1': 2, '2': 4, '4': 1, '5': 1, '?': 21, 'A': 8, 'B': 4, 'C': 8, 'D': 8, 'E': 6, 'F': 2, 'G': 5, 'H': 8, 'I': 55, 'K': 3, 'L': 3, 'M': 3, 'N': 7, 'O': 5, 'P': 11, 'R': 5, 'S': 22, 'T': 13, 'U': 2, 'V': 1, 'W': 14, 'Y': 9, 'a': 305, 'b': 91, 'c': 83, 'd': 133, 'e': 453, 'f': 70, 'g': 85, 'h': 163, 'i': 239, 'j': 6, 'k': 50, 'l': 170, 'm': 126, 'n': 241, 'o': 350, 'p': 63, 'q': 7, 'r': 233, 's': 242, 't': 314, 'u': 157, 'v': 36, 'w': 89, 'x': 4, 'y': 139, 'z': 5}

We can then try drawing characters according to these counts. For example, we'd be more likely to draw an 'i' than we would to draw a 'z', and we're more likely to get a space ' ' than any other character. If we draw 100 characters independently with these probabilities, we get text that looks like this:

  • onc 5donps fwyce agoti'tm ne edoi a e Iueogd ei IralcrltItmo.g mimyheat tgesue nwierayekra fP he
  • l rOxttsu Iogrmetucafo ewa khtois!e bttcatnht,r Cyfhr Pngkswhnwl oiet lyoatrl atumr e lenriadb Ie
  • Gi dyuh b .di Po mmceooet'd'nne'n gdo dkimeo aanti is0o i 'uttj'Sstopfsasep!. mosltayaaSso?lraV l

Interestingly, the size of the "words" looks about right since we are more likely to choose a space than any other character, but they are total gibberish. This is because the process has no memory; each character is chosen completely independently from the character preceding it. As it turns out, if we have just a very simple memory, and we only remember a few characters before, we can do a much better job. In particular, we will consider all possible sequences of characters of a particular length, and we will count the frequency of occurrence of each character that follows this sequence. A sequence of length k is called a k-gram. For example, here are all of the counts of the 4-gram 'you ' in the Spongebob text:

New Sentence | Character Counts for 'you ' After Sentence
Gary, go away, can't you see I'm trying to forget you? | {s=1}
Why are you mad? Because I can't see my forehead! | {s=1, m=1}
Can you take the crust off my Krabby Patty? | {s=1, t=1, m=1}
Did you see my underwear? | {s=2, t=1, m=1}
Well, it makes you look like a girl! | {s=2, t=1, l=1, m=1}
East, oh I thought you said Weest! | {s=3, t=1, l=1, m=1}
That's it, mister, you just lost your brain priveleges! | {s=3, t=1, j=1, l=1, m=1}
I wumbo, you wumbo, he she we, wumbo, wumboing, wumbology, the study of wumbo? | {s=3, t=1, w=1, j=1, l=1, m=1}
It's not you that's got me... it's me that's got me! | {s=3, t=2, w=1, j=1, l=1, m=1}
Why don't you ask CowBob RanchPants and his friend Sir Eats-a-lot? | {a=1, s=3, t=2, w=1, j=1, l=1, m=1}
Krusty Krab Pizza, it's the pizza for you and meeeee! | {a=2, s=3, t=2, w=1, j=1, l=1, m=1}
If you believe in yourself, with a tiny pinch of magic all your dreams can come true! | {a=2, b=1, s=3, t=2, w=1, j=1, l=1, m=1}
Goodbye everyone, I'll remember you all in therapy. | {a=3, b=1, s=3, t=2, w=1, j=1, l=1, m=1}
Don't you have to be stupid somewhere else? | {a=3, b=1, s=3, t=2, w=1, h=1, j=1, l=1, m=1}
Squidward, you can't eat all those patties at one time! | {a=3, b=1, s=3, c=1, t=2, w=1, h=1, j=1, l=1, m=1}
I'll have you know, I stubbed my toe last week, while watering my spice garden, and I only cried for 20 minutes. | {a=3, b=1, s=3, c=1, t=2, w=1, h=1, j=1, k=1, l=1, m=1}
Squidward, you and your nose will definitely not fit in. | {a=4, b=1, s=3, c=1, t=2, w=1, h=1, j=1, k=1, l=1, m=1}
Who you callin' pinhead? | {a=4, b=1, s=3, c=2, t=2, w=1, h=1, j=1, k=1, l=1, m=1}
Gee Patrick, I didn't know you could speak bird. | {a=4, b=1, s=3, c=3, t=2, w=1, h=1, j=1, k=1, l=1, m=1}
Any particular reason you took your pants off. | {a=4, b=1, s=3, c=3, t=3, w=1, h=1, j=1, k=1, l=1, m=1}
Let me get this straight, you two ordered a giant screen television just so you could play in the box? | {a=4, b=1, s=3, c=4, t=4, w=1, h=1, j=1, k=1, l=1, m=1}
I'd hate you even if I didn't hate you. | {a=4, b=1, s=3, c=4, t=4, e=1, w=1, h=1, j=1, k=1, l=1, m=1}
You're a man now, Spongebob, and it's time you started acting like one. | {a=4, b=1, s=4, c=4, t=4, e=1, w=1, h=1, j=1, k=1, l=1, m=1}
Can you give Spongebob his brain back, I had to borrow it for a week. | {a=4, b=1, c=4, e=1, g=1, h=1, j=1, k=1, l=1, m=1, s=4, t=4, w=1}
Now, let's say we do the following steps, starting with the gram 'you ':
  1. We randomly choose one of the characters that can follow 'you ', weighted by the counts above, and we happen to draw a 'c'.
  2. We then slide over by one character to move onto the next gram, which is 'ou c'. We see the character counts {a=2, o=2} for that gram.
  3. We make another random choice at this point, and we draw the character 'a'. We then slide onto the gram 'u ca', and we see the counts {l=1, n=1} for that gram.
  4. We now make a random choice and draw the character 'n'. We then slide over to the gram ' can', and we see the counts {' '=3, '''=4}.
  5. We now make a random choice and draw a space, so we slide over to the gram 'can ', and we see the counts {c=1, h=1, I=1}.
  6. We now make a random choice of an 'h', moving us to the gram 'an h', and so on and so forth.
So far in this example, we have synthesized the text "you can h", and we could keep going. Here are a few different outcomes if we keep following this process:
  • you can have facial hair!Now than 24, 25.You don't need a new I stupid
  • you can have to die right.You know the true!If your secrets is hot burns down
  • you can have feet?Since who can't you.I'd hate you and I'm absorbing
  • you can have to be alright.If I wish is nothere ther.No, Patty?

As you can see, this text is starting to make a lot more sense than choosing each character independently, even with a very short memory.

Your task in this assignment will be to replicate this process of creating counts for grams of different sizes for a collection of text, and then synthesizing new text according to the counts.

Background: File Loading with Exception Handling

One of the methods that you create will be responsible for loading text line by line. The code below shows how to accomplish this in Java. First, you will need to make sure you do the following imports at the top of your code
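For instance, assuming the java.io approach sketched below (your exact imports may differ depending on how you load the file):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;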

Then, assuming you have the filename stored in a string called filename, here's a loop that grabs each line of the file as a string
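Below is a minimal sketch that uses the imports above; here each line is simply printed out, but your method will incorporate each line into the model instead:

    try {
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        String line = reader.readLine();
        while (line != null) { // readLine() returns null at the end of the file
            System.out.println(line); // replace this with whatever you do with each line
            line = reader.readLine();
        }
        reader.close();
    } catch (IOException e) {
        // Print information about where the error occurred
        e.printStackTrace();
    }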

You may notice some strange syntax we haven't talked about yet: a try and a catch around the file loading. This is because it's possible that the file doesn't exist, or that there is a disk read error as that code is trying to execute. So we try to do it, but if it doesn't work, we execute the code in the catch block. Right now, the catch block simply prints out information about where the error occurred, but you could do something fancier, like inform the user that the file didn't load and return an empty string (for instance). We will talk about exception handling more as we go through the course.

Background: Drawing Random Characters

Here's some code we went over in class to show how to draw characters that respect different frequency distributions, or counts. The key idea is to convert the choice of a random integer into an appropriate choice of random character. The code below accomplishes this using a series of if/else statements, but the code you write to draw random characters after grams should be more flexible and use loops to do something similar.
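In lieu of the exact class code, here is a sketch along the same lines; the counts {a=30, b=10, c=5, e=55} are made up for illustration, chosen so that 'e' is the most likely character and 'c' the least likely:

    import java.util.Random;

    public class CharacterDrawing {
        public static void main(String[] args) {
            // Hypothetical counts: {a=30, b=10, c=5, e=55}, which sum to 100
            Random rand = new Random();
            String result = "";
            for (int i = 0; i < 100; i++) {
                int draw = rand.nextInt(100); // random integer in [0, 99]
                if (draw < 30) {              // 30 out of 100 chances
                    result += "a";
                } else if (draw < 40) {       // 10 out of 100 chances
                    result += "b";
                } else if (draw < 45) {       // 5 out of 100 chances
                    result += "c";
                } else {                      // 55 out of 100 chances
                    result += "e";
                }
            }
            System.out.println(result);
        }
    }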

An example run of the above code gives the following string

aaeeeeceaeaaaeeeebeaaceeeeeceaaeeaaeeeeeceeeeaeeaabeecabeebeeeaeeaeaeeeeeeebabeeaaabeebaaabaaeececee

Notice how most of the characters are e's, and there are hardly any c's.

Programming Tasks

As you write your code below, you should be mindful of the style guide. Good class organization is of paramount importance here. Points may be deducted for sloppy style.

Setting up Data Structures (5 Pts)

You should create a class that encapsulates all of the information for a particular model. The class constructor should have one parameter: the length of a gram. You should also set up and initialize the appropriate HashMap structures to represent your text model. You will need to store a count of all of the grams. Also, for each gram, you will need to store the possible characters that follow that gram and how often they occur.

Once you set up your class this way, it will be possible to make different objects for different models. For instance, you could make a model to represent spam texts using a gram length of 6, or another model of spam text using a gram length of 10. You could also make a completely separate model representing Darth Vader quotes. These models shouldn't mix data at all, which is why we want to encapsulate the gram length, gram counts, and gram character counts as instance variables in an object.
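For concreteness, here is one possible skeleton; the class and variable names are only suggestions, not requirements:

    import java.util.HashMap;

    public class MarkovModel {
        private int k; // the gram length
        private HashMap<String, Integer> gramCounts; // how many times each gram occurs
        private HashMap<String, HashMap<Character, Integer>> charCounts; // for each gram, counts of the characters that follow it

        public MarkovModel(int k) {
            this.k = k;
            gramCounts = new HashMap<String, Integer>();
            charCounts = new HashMap<String, HashMap<Character, Integer>>();
        }
    }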

Adding Text (10 Points)

You should create an instance method that takes a string as a parameter and incorporates that string into your model. You should loop through all possible grams in the string and update the counts of the grams (or add a new gram if this is the first time you're seeing it). You should also update the counts of the characters that follow each gram. Two things to watch out for here:

  • To prevent your model from running into dead ends when you go to synthesize new text, you should loop around once you run out of room to make a gram at the end. For example, if you had the text

    jane is having a great day

    and you chose a gram length of 5, you should have all of these grams

    	jane 
    	ane i
    	ne is
    	e is 
    	 is h
    	is ha
    	s hav
    	 havi
    	havin
    	aving
    	ving 
    	ing a
    	ng a 
    	g a g
    	 a gr
    	a gre
    	 grea
    	great
    	reat 
    	eat d
    	at da
    	t day
    	 dayj
    	dayja
    	ayjan
    	yjane
    

    Note in particular these last four grams, which wrap around past the end of the string to its beginning:

    	 dayj
    	dayja
    	ayjan
    	yjane
    

    The easiest way to accomplish this is simply to add the string to itself so you can keep going once you get to the end.

  • Do not add strings whose length is less than the chosen gram length. Simply do nothing if such a string is passed.
As an example to check that this is working properly: if you add all of the lines from the Spongebob file using 5-grams, you should have 4033 unique grams, and the character counts following the gram " you " should be as follows:

{a=4, b=1, c=4, e=1, g=1, h=1, j=1, k=1, l=1, m=1, s=4, t=4, w=1}
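Putting the pieces above together, here is a rough sketch of what this method might look like, assuming the instance variables from the suggested skeleton (you are free to structure your code differently):

    public void addText(String text) {
        if (text.length() < k) {
            return; // do nothing for strings shorter than the gram length
        }
        String looped = text + text; // add the string to itself to loop around at the end
        for (int i = 0; i < text.length(); i++) {
            String gram = looped.substring(i, i + k);
            char next = looped.charAt(i + k); // the character that follows this gram
            // Update the count of this gram
            gramCounts.put(gram, gramCounts.getOrDefault(gram, 0) + 1);
            // Update the count of the character that follows this gram
            if (!charCounts.containsKey(gram)) {
                charCounts.put(gram, new HashMap<Character, Integer>());
            }
            HashMap<Character, Integer> counts = charCounts.get(gram);
            counts.put(next, counts.getOrDefault(next, 0) + 1);
        }
    }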

Adding Text From A File (5 Pts)

You should create an instance method that takes in a String parameter for a filename and adds each line in this file to the model by calling the add-text method you wrote above.
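This can be a short method that combines the file-loading pattern from the background section with the add-text method; a sketch, again with placeholder names:

    public void addFile(String filename) {
        try {
            BufferedReader reader = new BufferedReader(new FileReader(filename));
            String line = reader.readLine();
            while (line != null) {
                addText(line); // incorporate this line into the model
                line = reader.readLine();
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }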

Synthesizing Text (15 Pts)

Finally, you should create an instance method that synthesizes new text based on the grams and the counts of the possible characters that follow them. You should synthesize the text one character at a time, and you should update the current gram you're on to be the last k characters in the string you're synthesizing. The String methods substring and charAt will come in handy here.

Your method should take in two parameters: the first is the gram to start on, and the second is the total number of characters to synthesize. If an empty string is specified for the starting gram, or the length of the specified gram does not match the gram length of the current object, then you should randomly choose a starting gram from the set of available grams in your map.
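Here is one possible outline, assuming the placeholder names from the earlier sketches (and java.util.Random imported at the top of the file). The weighted random draw uses the same integer-conversion idea as the character-drawing example above, but with a loop over the map instead of hard-coded if/else branches:

    public String synthesizeText(String startGram, int numChars) {
        Random rand = new Random();
        String gram = startGram;
        if (gram.length() != k) {
            // Choose a random starting gram from the available grams
            int index = rand.nextInt(gramCounts.size());
            for (String g : gramCounts.keySet()) {
                if (index == 0) {
                    gram = g;
                    break;
                }
                index--;
            }
        }
        String result = gram; // design choice: count the starting gram toward the total
        while (result.length() < numChars) {
            // Assumes the current gram occurs in the model (true if text wraps around)
            HashMap<Character, Integer> counts = charCounts.get(gram);
            // Total count of all characters that follow this gram
            int total = 0;
            for (int count : counts.values()) {
                total += count;
            }
            // Convert a random integer in [0, total) into a character choice
            int draw = rand.nextInt(total);
            for (char c : counts.keySet()) {
                draw -= counts.get(c);
                if (draw < 0) {
                    result += c;
                    break;
                }
            }
            gram = result.substring(result.length() - k); // slide over to the last k characters
        }
        return result;
    }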

Testing with Text Data

Once you've written the above methods, you should test them on a few simple sentences of your own. But to really see their power, you should also test them on longer collections of text and make sure you get reasonable outputs. To get you started, there are a variety of text sources you can draw from. Click here to download a .zip file with a few collections I've created for you to try. For the poetic statement, you should also come up with your own text files.

Poetic Statement (5 Pts)

You just made a very powerful text engine, so use it! Come up with your own collection of text and train a model. Put on some good music and find around 100 lines of text in a particular style, and place them all in a text file (for instance, you may try to collect 100 notable Harry Potter quotes). Then, play around with the gram size and see what kinds of outputs you get. Once you are satisfied, please submit three or more examples that you believe capture the spirit of your synthesizer.

One thing you might try to make this really fun is something I'll call Markov Hybrids. The idea is to load in two different styles in the same model and to have the synthesizer mash them up. For example, if you load both the Sith quotes and Ursinus tweets in, you may get some results like this:

  • "the cpd has yet another tip for undergraduates to explore the dark side"
  • "check out this scholarship alert: before your eyes. i can drive you mad with fear, shred your sanity, and leave you a raving lunatic"
  • "thomas j. watson fellowship, a yearlong grant that anyone who knows the words and scientific committee of the force"
  • "kill the spring! 78 days until opening day!"
  • "vanessa wilson-marshall '02 recalls the words and actions of significance is the result of conquest"
If you choose to do this, at least one of the styles should be different from the examples I've given.

Ethics Reflection (5 Pts)

With great power comes great responsibility. Even this relatively simple model can be used for nefarious purposes, and we should take care with how such code is deployed. Reflect on this, and see if you can come up with at least three such examples. Beyond just synthesis, and even if you didn't complete the extra credit, you may also think about what could be problematic about using this model to identify/classify text automatically. We will discuss this in class and think about how a society might institute technological safeguards.

Reflection on Struggle (5 Pts)

This is the hardest assignment I've given so far, so I want you to briefly reflect on the process of struggle during this assignment. Productive struggle has been shown to be one of the best ways to move toward mastering a skill, so we want to learn how to work our way through it. It may be helpful to journal as you progress through the assignment. Here are a few things you might consider:

  • When did you notice yourself struggling during this assignment?
  • How did it feel to struggle? What thoughts did you notice around that? For example, how did you feel about yourself? How did you feel towards me? How did you feel about yourself in relation to others? And if you're comfortable describing this, where did you feel struggle in your body? (We are often unaware of the latter as we move through difficult emotions)
  • What strategies did you use to get yourself through the struggle? Which of these strategies seemed helpful and productive? Which of these strategies struck you as less productive or possibly toxic?
  • How do you feel now that the assignment is completed?
Since this can be very personal, please only share what you're comfortable sharing, and do not feel pressured to answer all of the above questions. But I want to hear particularly about where you got stuck and what your strategies were at a level of description that you feel is appropriate.

Extra Credit: Markov Classification (up to +5 Pts)

The probability of a sequence

In addition to synthesizing text, there is a way to compute a probability that a particular sequence came from a model you created. We've made a simplifying assumption of independence that will help make this easier to compute. In particular, the next character is chosen only based on the current gram, and none of the previous grams influence this decision.

Independent events are nice to work with because the probability of independent events occurring is a simple multiplication. For example, it's reasonable to assume that the type of weather in Beijing on a particular day is independent of the weather in Collegeville. So if the probability it rains in Collegeville is 0.4 and the probability it rains in Beijing is 0.6, then the joint probability of both occurring is 0.4 x 0.6 = 0.24

To compute the probability of a particular sequence of characters, we must first compute the probability that each character was drawn under our model, and then we may compute the joint probability of the entire sequence by multiplying them together. The probability of a particular character c preceded by the gram p is given as \[ p(c) = \frac{N(p.c) + 1}{N(p)+S} \] where N(p) is the number of times the gram occurs in the model, N(p.c) is the number of times the character c follows gram p, and S is the size of the alphabet in the model (i.e. the number of unique characters across all grams). The +1 in the numerator and the S in the denominator are a smoothing trick (often called plus-one or Laplace smoothing) that keeps the probability nonzero even for characters that never followed the gram in the training text. So the joint probability of all characters is obtained by multiplying a bunch of these together.

There is a slight numerical issue with the above scheme, however. Since there are many characters in most sequences of interest, and since the probabilities of each character are small, we can run into arithmetic underflow, where the multiplication of many small numbers ends up just bottoming out at zero numerically. To get around this, we can instead compute the "log probability". Since \[ \log(AB) = \log(A) + \log(B) \] we can compute the log of the product of probabilities as the sum of the logs of each probability. So simply modify the formula as \[ \log \left( p(c) \right) = \log \left( \frac{N(p.c) + 1}{N(p)+S} \right) \] and then sum all of these up for each character to get the final log probability.
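To make the formula concrete, here is a sketch of a method computing the log probability, assuming the placeholder names from the earlier sketches plus a hypothetical instance variable alphabetSize holding S; whether to wrap the sequence around as in the add-text method is a design choice:

    public double getLogProbability(String text) {
        if (text.length() < k) {
            return 0.0; // too short to contain any grams (a design choice)
        }
        double logProb = 0.0;
        String looped = text + text; // mirror the wrap-around used when adding text
        for (int i = 0; i < text.length(); i++) {
            String gram = looped.substring(i, i + k);
            char next = looped.charAt(i + k);
            int np = gramCounts.getOrDefault(gram, 0); // N(p)
            int npc = 0;                               // N(p.c)
            if (charCounts.containsKey(gram)) {
                npc = charCounts.get(gram).getOrDefault(next, 0);
            }
            logProb += Math.log((npc + 1.0) / (np + alphabetSize));
        }
        return logProb;
    }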

Classification Problems

Once you're able to compute the probability for a model, you can train two (or more) different models on different types of text, and you can see which model gives a higher log probability. We'll say the model with the highest probability is most likely to be related to the text in question. For example, here are a few things you can try with the data I've given:

  • Create a model from the first 900 positive movie reviews and another model from the first 900 negative movie reviews (add an entire file at once, not just one line at a time). Then, compute probabilities for the last 100 positive and negative reviews. If the probability is greater under the positive model, guess that it's a positive review. Otherwise, guess that it's a negative review. Report how many of the 100 positive and 100 negative reviews were guessed correctly
  • Create a model from the lines in the "spam_train.txt" file and another model from the lines in the "ham_train.txt" file. Then, test both models on the lines in "ham_test.txt" and "spam_test.txt," and report how many of them you guess correctly
  • Train a model on Bush debate text and a model on Kerry debate text. Since there aren't many examples, you should assess the model with a "leave-one-out" test; that is, train on all of the Kerry debate files and 29/30 of the Bush debate files, and test on the remaining Bush file to see whether the Bush or Kerry model has a higher probability. Retrain the models on all possible subsets leaving one out, and report how many of the left-out files were classified correctly.
To get the full extra credit, you should perform at least two of the experiments above each over a range of gram lengths and report which one gives the best results. And if you finish that and you're still bored, you can make up your own example!