Lecture 27: Generative models for text#
STATS 60 / STATS 160 / PSYCH 10
Concepts and Learning Goals:
Next word prediction
Markov text generators
Similarities/differences to LLMs
Text generation#
So far in our machine learning unit, we have focused on prediction: regression and classification.
A somewhat different goal is text generation, in which we want the computer to produce text and converse with us.
The Turing Test#

Alan Turing. Image from Wikipedia.
In 1950, Alan Turing was trying to define machine intelligence:
Defining what “thinking” or “intelligence” means, or even what counts as a “machine,” is dicey
Proposed the following test for machine intelligence, called the imitation game:
If a machine can converse with a person and pass for human, it is effectively intelligent.
In his paper, Turing first describes the game with only human players, and then substitutes a machine for one of them. I guess at the time this was rhetorically necessary, because people found the idea of conversing with a machine so strange.
Initial attempts#
Most early chatbots were based on hand-built models of communication, and were not very good.
But some people found them surprisingly compelling.
Let’s test drive the 1960s ELIZA therapy chatbot.
Generation from prediction#
Let us recast text generation as a prediction task:
Next word prediction: given some text \(x\), predict \(y\), the word most likely to come next.
Our goal is now to create a model \(f\) for next word prediction:
Strategy for generating text:
We are given a prompt which is a string of words, \(X = X_1, X_2,\ldots,X_m\)
We predict the next word, \(f(X) = \hat{y}\).
Append: set \(X_{m+1} = \hat{y}\) to create a new string of text, \(X' = X_1,\ldots,X_m,\hat{y}\)
We predict the next word, \(f(X') = \hat{y}'\).
Append again to create a new string of text, \(X'' = X_1,\ldots,X_{m},\hat{y}, \hat{y}'\).
Etc…
Each \(\hat{y}\) is a word.
The text we generate with this process is the phrase \(\hat{y},\hat{y}',\ldots\)
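Here is a minimal sketch of this loop in Python. The function names are my own, and `predict_next_word` is just a hypothetical placeholder for the model \(f\); any next-word predictor could be plugged in.

```python
# A minimal sketch of the generate-by-prediction loop (names are my own).
# `predict_next_word` is a hypothetical stand-in for the model f.

def predict_next_word(words):
    """Placeholder for f: return a guess at the word that comes next."""
    return "word"  # a real model would look at `words` and predict here

def generate(prompt_words, num_words=20):
    """Repeatedly predict the next word and append it to the text."""
    words = list(prompt_words)               # X_1, ..., X_m
    generated = []
    for _ in range(num_words):
        y_hat = predict_next_word(words)     # f(X) = y-hat
        words.append(y_hat)                  # X_{m+1} = y-hat
        generated.append(y_hat)
    return " ".join(generated)               # the phrase y-hat, y-hat', ...

# Example: generate(["I", "came", "I"])
```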
How to predict the next word?#
Models we know?#
Question: do you think linear regression is a good idea? Why or why not?
Question: do you think nearest neighbors is a good idea? Why or why not?
Question: any other ideas?
Markov Text Generator#
Idea: treat language as a random sampling process.
Speech is a random sequence of words \(X_1,X_2,X_3,\ldots\)
The words are not independent; \(X_i\) depends on the words that came before, \(X_{i-1},X_{i-2},\ldots\)
A Markov Text Generator is a simple next-word-prediction model that is based on this principle.
The model: for each pair of words \(a,b\) in the dictionary, learn the probability that \(b\) follows \(a\): \(\Pr[X_{i+1} = b \mid X_i = a]\)
For example, if \(a = \text{hula}\), we expect that the next word is more likely to be “hoop” than “statistics”: \(\Pr[X_{i+1} = \text{hoop} \mid X_{i} = \text{hula}] > \Pr[X_{i+1} = \text{statistics} \mid X_{i} = \text{hula}]\)
Assume that the exact position \(i\) in the string of text doesn’t matter; all that matters is the chance that the word \(b\) follows the word \(a\)
Training:
Our training data is a corpus of text \(T\)
For each pair of words \(a,b\) in \(T\):
count how many times \(a\) appears in \(T\); let this be \(n_a\)
count how many times \(b\) appears right after \(a\) in \(T\); let this be \(n_{a,b}\)
set our estimate of the probability that \(b\) follows \(a\) to be \(\hat{P}[X_{i+1}=b \mid X_i = a] = \frac{n_{a,b}}{n_{a}}\).
Generation: For input word \(X_i\), sample \(X_{i+1}\) randomly from the distribution we came up with in training, \(\hat{P}\)
What if we reach a word \(a\) that we have only ever seen at the end of a sentence?
We have several options. One is to start over at a random word. Another is to just end the phrase.
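Here is a minimal sketch of the training and generation steps in Python (function names are my own). In this sketch, \(n_a\) is computed as the number of times \(a\) is followed by some word, which equals the count of \(a\) except for the final word of the corpus; and when we reach a word with no recorded followers, the generator simply ends the phrase (the second option above).

```python
import random
from collections import Counter, defaultdict

def train_markov(corpus_text):
    """Estimate P-hat[X_{i+1} = b | X_i = a] by counting word pairs in T."""
    words = corpus_text.split()
    pair_counts = defaultdict(Counter)           # pair_counts[a][b] = n_{a,b}
    for a, b in zip(words, words[1:]):
        pair_counts[a][b] += 1
    probs = {}
    for a, followers in pair_counts.items():
        n_a = sum(followers.values())            # times a is followed by some word
        probs[a] = {b: n_ab / n_a for b, n_ab in followers.items()}
    return probs

def generate_markov(probs, start_word, max_words=30):
    """Sample X_{i+1} from P-hat[ . | X_i ], starting from start_word."""
    word = start_word
    output = [word]
    for _ in range(max_words):
        if word not in probs:                    # only ever seen at the end: stop
            break
        followers = probs[word]
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)

# Example:
# probs = train_markov("I came I saw I conquered")
# print(generate_markov(probs, "I"))
```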
How to train your generator#
Training:
For each pair of words \(a,b\) in the training text \(T\):
count how many times \(a\) appears in \(T\); let this be \(n_a\)
count how many times \(b\) appears right after \(a\) in \(T\); let this be \(n_{a,b}\)
set our estimate of the probability that \(b\) follows \(a\) to be \(\hat{P}[X_{i+1}=b \mid X_i = a] = \frac{n_{a,b}}{n_{a}}\).
Practice: train a Markov Text Generator on the example text:
“I came, I saw, I conquered”
Ignoring punctuation, there are only four relevant dictionary words here: “I”, “came”, “saw”, “conquered”.
Let’s fill out this table with the probabilities that word \(b\) comes after word \(a\):
The row corresponds to \(a\), the column corresponds to \(b\).
|  | I | came | saw | conquered |
|---|---|---|---|---|
| I |  | \(\Pr[X_{i+1} = \text{came} \mid X_i = \text{I}]\) |  |  |
| came |  |  |  |  |
| saw |  |  |  |  |
| conquered |  |  |  |  |

Filled in:

|  | I | came | saw | conquered |
|---|---|---|---|---|
| I | 0 | \(\frac{1}{3}\) | \(\frac{1}{3}\) | \(\frac{1}{3}\) |
| came | 1 | 0 | 0 | 0 |
| saw | 1 | 0 | 0 | 0 |
| conquered | 0 | 0 | 0 | 0 |
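To check one entry: “I” appears \(n_{\text{I}} = 3\) times in the training text, and exactly one of those occurrences is followed by “saw”, so \(n_{\text{I},\text{saw}} = 1\) and \(\hat{P}[X_{i+1}=\text{saw} \mid X_i = \text{I}] = \frac{1}{3}\). “conquered” is never followed by another word, so its row is all zeros.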
Taking it for a spin#
I generated 5 random strings of text using our Markov text generator, on input word “I”:
I saw I saw I conquered
I saw I saw I saw I saw I saw I saw I conquered
I conquered
I saw I came I conquered
I came I saw I came I came I came I came I conquered
Other training sets#
The Gettysburg address#
Training data:
Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing …
Output:
Whether that all men are met on this nation, conceived and proper that we should do this. But, in a great civil war, testing whether that we can not consecrate -- that field, as a portion of devotion to the living, rather, to the people, by the living, rather, to dedicate -- and that we say here, have thus far above our fathers brought forth on a new nation, conceived in Liberty, and proper that this continent, a new birth of freedom -- and dead, who struggled here, have a great task remaining before us -- that war.
Wikipedia: Statistics#
Output:
Consider now available. Examples of a positive feedback runaway effect of mean and instrumental variables, and covers descriptive statistic was introduced the least some degree of questions to specifying an important that is the use of type of data, or population): central or not belong to being true (statistical study typically uses a "false negative"). Multiple problems are yet-to-be-observed random vector of the set is also heavily criticized today for future of modern use the discipline of these errors in society.
Reddit: ELI5#
Output:
My body got better at targeting the throat completely independent of mucus off it is constantly passing through it? bacteria infects the soft tissues from inhaling virus or bacteria causes pain allergies can be very painful sore outside of post-nasal drip; this can people get them making them susceptible to deal with a bit? Almost like my colds always started with the immune system has cracked the combination meds for a very irritating and water is constantly passing through it? bacteria causes inflammation I noticed that sore throat the throat, this causes inflammation I don't understand.
A bit fancier#
To make the model a bit fancier, we can pick \(X_{i+1}\) based on the previous \(k\) words:
for each sequence of \(k+1\) words \(a_1,\ldots,a_k,a_{k+1}\), learn \(\Pr[X_{i+1} = a_{k+1} \mid X_{i-k+1} = a_1,\ldots,X_{i} = a_k]\)
Call \(k\) the context size.
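A minimal sketch of the same idea with context size \(k\) (names are my own, and the prompt is assumed to contain at least \(k\) words): the only change from before is that the count table is keyed by the tuple of the previous \(k\) words rather than by a single word.

```python
import random
from collections import Counter, defaultdict

def train_markov_k(corpus_text, k=2):
    """Estimate P-hat[next word | previous k words] from (k+1)-word runs in T."""
    words = corpus_text.split()
    counts = defaultdict(Counter)
    for i in range(len(words) - k):
        context = tuple(words[i:i + k])          # a_1, ..., a_k
        counts[context][words[i + k]] += 1       # the word a_{k+1} that follows
    return {c: {b: n / sum(f.values()) for b, n in f.items()}
            for c, f in counts.items()}

def generate_markov_k(probs, prompt_words, k=2, max_words=50):
    """Sample each next word conditioned on the last k words of the text so far."""
    words = list(prompt_words)                   # must contain at least k words
    for _ in range(max_words):
        context = tuple(words[-k:])
        if context not in probs:                 # unseen context: end the phrase
            break
        followers = probs[context]
        words.append(random.choices(list(followers),
                                    weights=list(followers.values()))[0])
    return " ".join(words)
```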
Context size contrast#
As the context size grows, the generated text makes more sense.
From the Gettysburg Address:
\(k = 1\): It is for those who fought here highly resolve that all men are engaged in a portion of that government of that field, as a great civil war, testing whether that nation, conceived in Liberty, and that government of the proposition that nation, or detract.
\(k = 2\): It can never forget what they did here. It is for us to be dedicated here to the great task remaining before us – that this nation, under God, shall have a new nation, conceived in Liberty, and dedicated to the unfinished work which they gave the last full measure of devotion.
\(k = 3\): It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.
Question: Do you see a downside to making the context size bigger?
Training and generation both take more time as \(k\) increases.
It is less random, more like copying sentences from \(T\)
Memorizing?#
Sometimes people say of a language model,
“it just memorized the training data.”
What they mean is that there is a copy of this output somewhere in \(T\).
When \(k=3\), our Markov text generator effectively memorized parts of the Gettysburg Address.
The following sentence was generated by the model, but also appears verbatim in the address:
“Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.”
Influence of training data#
Notice that the character of the output is extremely sensitive to the training data. With context size \(k = 3\):
Wikipedia:
Populations can be diverse groups of people or objects such as “all people living in a country” or “every atom composing a crystal”.
ELI5:
As far as why these affect the back of the throat, it is he primary entry point for what we call upper respiratory infections, usually from inhaling virus or bacteria particles.
Gettysburg:
Dedicate – we can not dedicate – we can not hallow – this ground.
Modern Large Language Models#
Modern language models share similarities with Markov text generators:
They try to estimate \(\Pr[X_{i+1} = a_{i+1} \mid X_{i-k+1} = a_{i-k+1},\ldots,X_{i} = a_i]\)
They also have a context size \(k\)
Instead of words, they use “tokens” or fragments of words
The probability that \(a_{i+1}\) follows the context \(a_{i-k+1},\ldots,a_i\) is estimated by training on a corpus of example text \(T\)
The probability is not estimated by counting.
Instead, predicting \(y = a_{i+1}\) from \(x = a_{i-k+1},\ldots,a_i\) is treated as a regression problem (a toy sketch of this idea appears after this list)
The way this regression problem is solved is loosely inspired by brain function
Has some similarity to linear regression
Combines a lot of linear regressions together in a sequential way
There is no closed formula for the best model parameters
Actually, computing optimal model parameters efficiently is probably impossible!
Because of this, methodology is heavily influenced by efficiency and engineering decisions
The choice of training data dramatically influences how the model behaves
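To give a rough sense of what “combining a lot of linear regressions together in a sequential way” can look like, here is a toy forward pass in Python. Every size, name, and weight below is made up for illustration; a real LLM learns its parameters from \(T\) and uses a far more elaborate architecture, but the basic shape is the same: the last \(k\) tokens go in, and a probability for every token in the vocabulary comes out.

```python
import numpy as np

# Toy forward pass: map the last k token ids to a probability distribution
# over the next token by composing linear maps with a simple nonlinearity.
# All sizes and weights here are made up for illustration only.

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, k = 1000, 32, 64, 4

E = rng.normal(size=(vocab_size, embed_dim))       # token embeddings
W1 = rng.normal(size=(k * embed_dim, hidden_dim))  # one "linear regression"
W2 = rng.normal(size=(hidden_dim, vocab_size))     # another, stacked on top

def next_token_probs(context_ids):
    """Return a probability for every token in the vocabulary."""
    x = E[context_ids].reshape(-1)     # stack the k token embeddings
    h = np.maximum(0, x @ W1)          # linear map + nonlinearity
    logits = h @ W2                    # linear map: one score per token
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()             # softmax: scores -> probabilities

probs = next_token_probs([12, 7, 104, 3])  # four arbitrary token ids
print(probs.shape, round(probs.sum(), 6))  # (1000,) 1.0
```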
Recap#
Next word prediction
Markov text generators
context size
memorization
training data dramatically influences output
LLMs