Thursday, October 5, 2023

How Big Do Language Models Need to Be to Speak Coherent English?

The scaling law of the best model size versus the total number of training FLOPs.

The figure above (Figure 23 in the paper) shows the relationship between model size and the amount of compute used for training on the TinyStories dataset, on a log-log scale. Specifically:

  • The x-axis shows the total compute budget measured in FLOPs (floating point operations). This captures both model size and number of training steps.
  • The y-axis shows the size of the best performing model for that FLOPs budget. Model size is measured by number of parameters.
  • Multiple points are plotted corresponding to different compute budgets. For each budget, the optimal model size is chosen.
  • The plot shows model size grows faster than linearly with increasing compute budget.
  • Fitting a power law curve shows an exponent of around 1.3, indicating the relationship is polynomial.

This demonstrates a scaling law: for optimal results, a larger compute budget calls for a correspondingly bigger model. Because the fitted exponent is greater than 1, the optimal model size grows slightly faster than the compute budget itself.
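To make the fit concrete, here is a minimal sketch (not from the paper) of how such an exponent can be estimated: a power law is a straight line in log-log space, so fitting a line to log-compute versus log-model-size gives the exponent as the slope. The data points below are made-up placeholders chosen only to illustrate the procedure.

```python
import numpy as np

# Placeholder (compute budget in FLOPs, optimal model size in parameters) pairs.
# These are illustrative values only, not numbers from the paper.
flops = np.array([1e15, 1e16, 1e17, 1e18])
optimal_params = np.array([1e6, 2e7, 4e8, 8e9])

# A power law N_opt = a * C^k is a straight line in log-log space:
# log N_opt = log a + k * log C, so the slope of a linear fit is the exponent k.
slope, intercept = np.polyfit(np.log10(flops), np.log10(optimal_params), deg=1)

print(f"fitted exponent k = {slope:.2f}")      # ~1.3 for these placeholder points
print(f"prefactor a = {10 ** intercept:.3g}")
```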

The key conclusions are:

  • There is a scaling law between optimal model size and available compute even for small models like those trained on TinyStories.
  • This suggests certain scaling phenomena are fundamental properties of language modeling rather than just behaviors of massive models.
  • For a fixed compute budget, the plot provides guidance on choosing the best model size for maximizing performance.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).

In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, with scores for different capabilities such as grammar, creativity and consistency.

We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.07759 [cs.CL]
  (or arXiv:2305.07759v2 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2305.07759

Submission history

From: Yuanzhi Li [view email]
[v1] Fri, 12 May 2023 20:56:48 UTC (26,425 KB)
[v2] Wed, 24 May 2023 23:30:43 UTC (25,561 KB)

Key points

  • The paper introduces TinyStories, a new synthetic dataset of short, simple stories intended to mimic the vocabulary and knowledge of young children aged 3-4.
  • TinyStories is used to train small language models (SLMs) with under 10 million parameters that can generate multi-paragraph stories with good grammar, reasoning, and coherence. This shows basic language capabilities can emerge in much smaller models trained on a simpler, refined dataset.
  • The paper proposes a new evaluation paradigm using GPT-4 to "grade" generated stories on dimensions like grammar, creativity, and consistency. This gives more nuanced scores than standard benchmarks.
  • Experiments show that basic knowledge and reasoning emerge quickly as model size grows, while creativity plateaus. Model depth helps more with consistency and coherence than with grammar.
  • Despite their small scale, SLMs trained on TinyStories exhibit scaling-law trends and interpretability similar to large models. Attention heads and neurons show noticeable specialization.
  • Analysis suggests the SLMs generate novel, diverse content not copied from training data. Out-of-distribution experiments also demonstrate generalization ability.
  • Overall, TinyStories provides a testbed for studying model architectures, emergence of capabilities, and interpretability in a low-resource setting. It decouples language coherence from world knowledge needed for large corpora.

SLM Properties

Scaling laws:

  • Scaling laws refer to the relationship between model performance and the amount of compute used for training.
  • Prior work on large language models found that the compute-optimal model size scales polynomially with the compute budget: as more compute becomes available, it pays to train a bigger model.
  • The paper shows a similar polynomial scaling law holds even for the much smaller models trained on TinyStories. Doubling the compute budget suggests training a correspondingly bigger model.
  • This implies the scaling relationship may be an intrinsic property of language modeling, not just a trait of huge models. Similar phenomena emerge even at small scale.

Attention head specialization:

  • The paper analyzes attention heads in shallow models trained on TinyStories.
  • Some heads focus on positional information like distance between words. Others attend to semantic connections.
  • This suggests attention heads take on specialized linguistic roles even in very small transformer models.
  • The specialization emerges without explicit training objectives, guided just by the language modeling task.
  • It provides interpretability into how attention helps generate coherent text in transformers.
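As a rough illustration of this kind of analysis (a sketch, not the paper's own code), one can load a small trained checkpoint with the Hugging Face transformers library and inspect which positions each attention head attends to. The checkpoint name roneneldan/TinyStories-1M is assumed here to refer to one of the small models released alongside the paper; any small causal LM would work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: one of the small TinyStories models released with the paper.
model_name = "roneneldan/TinyStories-1M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Once upon a time, there was a little dog named Max."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer, shape (batch, heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
first_layer = outputs.attentions[0][0]  # (heads, seq_len, seq_len)

# For each head, report which position the final token attends to most strongly.
for head_idx, head in enumerate(first_layer):
    target = head[-1].argmax().item()
    print(f"head {head_idx}: last token attends most to {tokens[target]!r} (position {target})")
```

Heads whose strongest attention tracks a fixed offset suggest positional roles, while heads that lock onto semantically related tokens suggest the semantic specialization described above.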

MLP neuron specialization:

  • The activation patterns of individual neurons in the MLP are analyzed.
  • Some neurons activate consistently on specific token types like nouns or verbs that play a common role.
  • This implies neurons specialize to detect particular syntactic and semantic features, even in tiny models.
  • It lends interpretability into how linguistic properties are encoded in the neural activations.
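A simple way to probe this (again a sketch under the same assumptions as above, not the paper's code) is to register a forward hook on one transformer block's MLP and see which tokens in a sentence drive a chosen neuron hardest. The attribute path transformer.h[0].mlp assumes a GPT-Neo-style module layout.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "roneneldan/TinyStories-1M"  # assumed checkpoint, as above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = {}

def save_mlp_output(module, module_inputs, module_output):
    # Output of the block's MLP: shape (batch, seq_len, hidden_dim).
    # Each coordinate of this vector is treated as one "neuron".
    activations["mlp"] = module_output.detach()

# Attribute path assumes a GPT-Neo-style layout: transformer.h is the list of blocks.
hook = model.transformer.h[0].mlp.register_forward_hook(save_mlp_output)

text = "Tom and Lily ran to the park and played with a big red ball."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**inputs)
hook.remove()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
neuron_idx = 7  # arbitrary neuron chosen for illustration
acts = activations["mlp"][0, :, neuron_idx]

# Show the tokens on which this neuron fires most strongly.
for tok, value in sorted(zip(tokens, acts.tolist()), key=lambda pair: -pair[1])[:5]:
    print(f"{tok!r}: {value:.3f}")
```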

So in summary, similar scaling laws and emergent specialization are observable even at the small scale of TinyStories models, lending insight into model properties.

New Multidimensional Evaluation Paradigm

Hidden size   Layer   Head   Eval loss   Grammar   Creativity   Consistency
768           2       2      1.38        7.77      6.50         7.78
768           2       4      1.34        8.05      6.57         8.16
768           2       8      1.33        8.25      6.53         8.16
768           1       2      1.58        7.13      5.83         6.38
768           1       4      1.56        7.43      5.90         6.75
768           1       8      1.54        7.45      6.28         7.02
Figure 24: Model performance with different numbers of attention heads

The key ideas behind the new way to evaluate text generated by language models trained on TinyStories are:

  • They manually create a set of ~50 story prompts with incomplete sentences. They use the model to generate completions.
  • They provide the original prompt + model completion to GPT-4. GPT-4 is instructed to "grade" the completion on dimensions like grammar, creativity, consistency with the prompt, etc.
  • GPT-4 first gives a qualitative assessment of the completion. Then it provides numerical scores out of 10 for each dimension.
  • It also guesses the age of the "student" that wrote the story based on the language sophistication.
  • They average the scores over multiple (10) completions to evaluate the model.
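A minimal sketch of what such a grading loop could look like, assuming the OpenAI Python client; the grading instructions below are paraphrased for illustration and are not the exact prompt used in the paper, and parsing the numerical scores out of the reply is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrased grading instructions for illustration; not the exact prompt from the paper.
GRADER_INSTRUCTIONS = (
    "You are a teacher grading a student's story. The student was given the beginning "
    "of a story and wrote a completion. First give a short qualitative assessment, "
    "then give scores out of 10 for grammar, creativity, and consistency with the "
    "beginning, and guess the age of the student."
)

def grade_completion(prompt: str, completion: str, grader_model: str = "gpt-4") -> str:
    """Ask the grader model for a qualitative assessment plus numerical scores."""
    response = client.chat.completions.create(
        model=grader_model,
        messages=[
            {"role": "system", "content": GRADER_INSTRUCTIONS},
            {"role": "user",
             "content": f"Story beginning:\n{prompt}\n\nStudent's completion:\n{completion}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# In practice, each prompt would be completed several times by the small model and
# the parsed scores averaged; score parsing is omitted here for brevity.
```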

The main advantages are:

  • It better captures coherence, reasoning, and language quality compared to benchmarks requiring one word answers.
  • The qualitative assessment gives feedback on issues not captured by scores.
  • The multidimensional scores allow analyzing different capabilities of the model.
  • It mimics how humans evaluate language quality more closely.

The paradigm does not directly measure accuracy or factual correctness of the generated text. But the consistency score indicates if the story flows logically from the prompt. The qualitative feedback also highlights factual issues.

Advantages v. Standard Benchmarks

The authors claim the new evaluation paradigm proposed in the paper has several advantages over standard benchmarks for evaluating generative language models:

More naturalistic: Standard cloze-style benchmarks often require models to produce just a single word or phrase. This doesn't reflect the creativity and diversity of natural language generation. The proposed method has models generate full story completions, which is a more natural language task.

Multidimensional: Standard benchmarks produce a single accuracy metric. This lacks nuance and doesn't provide insights into the model's specific strengths/weaknesses. The proposed method has both qualitative assessments and scores for grammar, creativity, consistency etc., giving a more detailed evaluation.

Overcomes data limitations: Many benchmarks are limited by the size or quality of evaluation data. The proposed method relies on GPT-4's strong language capabilities instead, and can generate new test cases.

Mimics human evaluation: Having an advanced LM like GPT-4 grade the generative LM's outputs acts as a stand-in for human evaluation. This matches how language quality would be judged in the real world.

Better for coherence: Benchmarks like TriviaQA focus on factual correctness. The proposed consistency score better evaluates coherence and logical flow.

Enables error analysis: The qualitative feedback highlights problems not captured in scores, and helps identify areas for improvement.

Overall, the proposed approach leads to a more comprehensive and meaningful assessment of an LM's language generation abilities compared to evaluating single word factual accuracy alone. The multidimensional analysis and qualitative feedback are especially valuable.

Grading Results with an LLM

The paper does not provide full details on how GPT-4 was trained to grade the generated stories, but here is what I can infer:

  • GPT-4 is an advanced large language model developed by OpenAI; its parameter count has not been publicly disclosed, but it has strong natural language capabilities.
  • It was trained on a massive diverse dataset of text from the internet to capture a broad range of linguistic patterns and world knowledge.
  • The researchers likely prompted GPT-4 to take on the role of a teacher grading a student's story writing assignment.
  • No special training was likely needed - the prompts just prime GPT-4 to tap into its already learned knowledge about language, writing, reasoning, and providing feedback.
  • GPT-4 relies on its internal representation of coherence, grammar, creativity etc. to provide assessments, like a human grader would.
  • The numerical scores are produced by GPT-4 itself as part of its response, following the grading rubric the researchers specify in the prompt, alongside its qualitative feedback.
  • The age-group estimates work the same way: GPT-4 judges the sophistication of the language based on patterns it has absorbed from its broad training.
  • No ground truth data was used to directly train GPT-4 for this task - it is zero/few-shot, relying on capabilities from pre-training.

Apparently, GPT-4 wasn't specifically trained to grade stories, but rather leverages its foundation of linguistic knowledge and reasoning ability derived from pre-training to provide assessments as prompted. Its capabilities allow it to perform this evaluation role without requiring additional specialized training.

As a large generative model, GPT-4 has a tendency to generate plausible-sounding but incorrect or nonsensical responses at times. This could negatively impact its ability to reliably grade the stories generated by the models trained on TinyStories.

Some ways the researchers could mitigate this issue:

  • Use a version of GPT-4 fine-tuned to be more grounded and accurate in its responses. Models tuned this way are less prone to hallucination.
  • Provide high-quality demonstrations upfront of grading stories properly before having it evaluate the TinyStories model outputs. This primes better behavior.
  • Have multiple independent GPT-4 runs grade each story to get an averaged assessment less influenced by individual errors.
  • Manually check a subset of GPT-4's grades for accuracy to get a sense of its performance.
  • For numerical scores, use the lower end of any range GPT-4 provides to control for exaggeration.
  • Discount or eliminate qualitative feedback that seems clearly nonsensical or fabricated.
  • Cross-check a subset of grades with a second model such as GPT-3.5 and flag assessments where the two graders disagree.
  • Provide feedback to GPT-4 when errors are caught to continue steering it towards more grounded responses.

The key would be properly priming GPT-4 to tap into its valid linguistic knowledge rather than its propensity to generate text freely. With the right prompting approach and post-processing, it can likely provide sufficiently accurate assessments. But the authors should have taken care to validate its grading ability.

The paper does not explicitly mention any techniques to prime or validate GPT-4 to improve the accuracy of its grading. The authors just state that they provided the prompts to GPT-4 and had it grade the completions.

Some kind of validation or priming would indeed be helpful to ensure GPT-4 produces grounded assessments. A few checks that could have been added:

  • Show examples upfront of good and bad story completions along with accurate grades to calibrate GPT-4's response.
  • Spot check a subset of GPT-4's grades against human judgement to validate its performance.
  • Provide feedback when GPT-4's grades seem inaccurate and have it re-assess.
  • Test GPT-4's grading on held-out stories with known scores to quantify its accuracy.
  • Average scores from multiple differently prompted GPT-4 models to reduce variability.
  • Compare GPT-4's grades against those of a second pretrained model such as GPT-3.5 to flag inconsistent assessments.
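As an illustration of the first two checks, the agreement between GPT-4's grades and human spot-check grades could be quantified with a rank correlation, and the spread of repeated GPT-4 gradings of the same completion gives a rough estimate of grading noise. The numbers below are placeholders, not data from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder grammar scores (0-10) for the same eight completions; not data from the paper.
gpt4_scores = np.array([8.0, 7.5, 6.0, 9.0, 7.0, 5.5, 8.5, 6.5])
human_scores = np.array([8.5, 7.0, 6.5, 9.0, 6.5, 5.0, 8.0, 7.0])

# Rank correlation checks whether GPT-4 orders the completions the way a human does,
# even if its absolute scores are shifted or exaggerated.
rho, p_value = spearmanr(gpt4_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# Repeated GPT-4 gradings of a single completion: the spread estimates grading noise.
repeat_scores = np.array([7.5, 8.0, 7.0, 8.0, 7.5])
print(f"mean = {repeat_scores.mean():.2f}, std = {repeat_scores.std(ddof=1):.2f}")
```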

Without these kinds of checks, it's hard to confirm if GPT-4's evaluations are grounded in reality or if it could be hallucinating. The risk of imaginative responses from such a large generative model is real.

Since the paper does not mention any validation procedures, it seems to take GPT-4's assessments at face value. More rigor would be needed in future work to ensure the grades accurately reflect the model's capabilities. The conclusions drawn are dependent on GPT-4 providing sound evaluations.

Assessing model quality

The multidimensional performance scores provided by the GPT-4 evaluation can help analyze the strengths and weaknesses of different models trained on TinyStories. Some ways this can guide assessing model quality:

  • Look at the overall profile - models that score high on most dimensions are generally better. But focus on dimensions most important for the use case.
  • Compare models on their weakest dimension - selecting the model with highest score in its poorest area avoids lopsided performance.
  • Evaluate sophistication by the creativity or age group rating - higher creativity and age imply more advanced language skills.
  • Check trade-offs between scores - boosting one score like grammar at the expense of others may not be beneficial. Models with balanced scores are better.
  • Favor small differences in key dimensions over larger gaps in unimportant ones - a 0.1 gain in consistency may matter more than 0.5 in grammar.
  • Ensure no dimension is critically low - identify unacceptable thresholds for metrics like consistency where the score must meet a minimum bar.
  • Compare error profiles - models making similar mistakes are likely deficient in the same area. Unique errors signal unique capabilities.
  • Track correlation between dimensions - if grammar and consistency scores align, focus on just one. If no correlation, analyze both.
  • Evaluate by use case - a model with higher creativity may be better for open-ended tasks, while one with higher consistency suits constrained applications.

So in essence, the multidimensional scores give a detailed picture of model strengths, which allows selecting a model tailored to the context and use case rather than relying on a single skill-agnostic metric.
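For instance, one simple way to act on these ideas is to rank candidate models by a use-case-specific weighted score with a hard floor on critical dimensions. The scores below are taken from the Figure 24 table above; the model names, weights, and thresholds are hypothetical and chosen only for illustration.

```python
# Scores taken from the Figure 24 table above; model names, weights, and thresholds
# are hypothetical and chosen only for illustration.
models = {
    "1-layer, 8 heads": {"grammar": 7.45, "creativity": 6.28, "consistency": 7.02},
    "2-layer, 8 heads": {"grammar": 8.25, "creativity": 6.53, "consistency": 8.16},
}

weights = {"grammar": 0.3, "creativity": 0.2, "consistency": 0.5}  # use-case priorities
minimums = {"consistency": 7.0}  # hard floor: reject any model below this bar

def weighted_score(scores: dict) -> float:
    return sum(weights[dim] * scores[dim] for dim in weights)

eligible = {
    name: scores for name, scores in models.items()
    if all(scores[dim] >= bar for dim, bar in minimums.items())
}
best = max(eligible, key=lambda name: weighted_score(eligible[name]))
print(f"selected model: {best} (weighted score {weighted_score(eligible[best]):.2f})")
```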

 Tiny Language Models Thrive With GPT-4 as a Teacher | Quanta Magazine

Some researchers have opted to train smaller models on smaller data sets and then study their behavior. “It’s like sequencing the Drosophila genome versus sequencing the human genome,” said Ellie Pavlick, a language model researcher at Brown University.

Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories.

Machine learning researchers have embraced this lesson. GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.

The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.

“I found this paper very informative,” said Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. “The concept itself is super interesting.”

 
