How Metrics Affect LLM Capability Perception

AI emergence refers to the phenomenon where LLMs display abilities that are not present in smaller-scale models. Emergent abilities have two defining features: sharpness, transitioning seemingly instantaneously from not present to present; and unpredictability, appearing at seemingly unforeseeable model scales. This has been the cause of a lot of concern and interest. “Are Emergent Abilities of Large Language Models a Mirage?” presents a sobering alternative to the idea of emergence.

The paper’s central claim: emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance.
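To make that distinction concrete, here is a tiny simulation of my own (not code from the paper): per-token correctness improves smoothly as a stand-in for scale, yet an all-or-nothing Accuracy metric makes the ability look like it switches on suddenly, while a per-token score does not.

```python
# Toy sketch (my own illustration, not from the paper): the same smooth
# improvement in per-token correctness looks "emergent" under an
# all-or-nothing metric but gradual under a per-token one.
import numpy as np

target_len = 4                              # e.g. a 4-digit answer
per_token_p = np.linspace(0.05, 0.95, 10)   # stand-in for increasing model scale

# Nonlinear metric: exact-match Accuracy requires every token to be right.
accuracy = per_token_p ** target_len

# Linear metric: a (normalized) per-token score improves in proportion
# to per-token correctness.
per_token_score = per_token_p

for p, acc, tok in zip(per_token_p, accuracy, per_token_score):
    print(f"per-token p={p:.2f}  accuracy={acc:.4f}  per-token score={tok:.2f}")
# Accuracy hugs zero and then shoots up; the per-token score climbs smoothly.
```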

The authors demonstrate this through three kinds of experiments-

  1. Analyzing the impact of metrics on the InstructGPT/GPT-3 family on tasks with claimed emergent abilities;
  2. Performing a meta-analysis of emergent abilities; and
  3. Showing that, by changing the metric, we can induce ‘emergent abilities’ in models that didn’t demonstrate emergence before (this one was my fav).

As 2023 draws to a close, it is safe to say that conversations around foundation models, AGI, and the risks of emergent AI have dominated many headlines in the space. Much of the hysteria in this conversation centers on the suddenness of some of the capabilities manifested by AI. Scaling up transformers to sizes comparable to GPT and Gemini led to sudden, unexplained jumps in performance on a variety of benchmarks. Researchers called this phenomenon emergence, after the biological phenomenon where more complex forms of life manifest new abilities that are not present in simpler organisms.

AI emergence was a huge cause of concern (and hype) because it meant that we were likely to create and deploy systems that we are unable to fully analyze, control, and predict. Studying emergence was key to developing more secure LLM-enabled systems. It also provided a source of income to budding sci-fi writers who created some pretty fun campfire speculation about the impending AGI wave and economic reset. Unfortunately, some researchers at Stanford hate Christmas and decided to pour cold water on the idea of AI emergence. Their research shows that ‘emergence’ is not an inherent property of scaling models, but rather a measurement artifact caused by poor metric selection.

In this article, we will cover the three sets of experiments to understand their implications in more detail. I’ll end on a note on why I think AI doesn’t exhibit emergence the same way scaled-up biological phenomena do. Before we proceed, a fun piece of trivia about this paper: the Debbie Downers at NeurIPS picked it as one of their papers of the year, instead of something fun and whimsical like Microsoft’s Sparks of AGI.

Let’s get right into it.

Experiment Set 1: Analyzing InstructGPT/GPT-3’s Emergent Arithmetic Abilities

One of the most eye-catching examples of emergence was AI arithmetic. Language models that can perform computations perfectly are fantastic for handling tabular data or for parsing through documents for higher-level retrieval. Even basic filtering/computational abilities would add significantly to the market value of such products. The authors make three predictions-

  1. Changing the metric from a nonlinear or discontinuous metric (Fig. 2CD) to a linear or continuous metric (Fig. 2EF) should reveal smooth, continuous, predictable performance improvement with model scale.
  2. For nonlinear metrics, increasing the resolution of measured model performance by increasing the test dataset size should reveal smooth, continuous, predictable model improvements commensurate with the predictable nonlinear effect of the chosen metric.
  3. Regardless of metric, increasing the target string length should predictably affect the model’s performance as a function of the length-1 target performance: approximately geometrically for accuracy and approximately quasilinearly for token edit distance (there’s a quick back-of-the-envelope sketch of this right after the list).
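Here is that back-of-the-envelope sketch (mine, not the authors’): if p1 is a hypothetical model’s performance on length-1 targets and token errors are roughly independent, then accuracy decays roughly geometrically in the target length while the expected number of token errors grows roughly linearly.

```python
# Rough sketch of prediction 3 (my own illustration, not the paper's code).
p1 = 0.9                       # hypothetical length-1 accuracy
for L in range(1, 6):
    acc = p1 ** L              # Accuracy: roughly geometric decay in target length
    errs = L * (1 - p1)        # Expected token errors: roughly linear in target length
    print(f"L={L}  accuracy~{acc:.3f}  expected token errors~{errs:.2f}")
```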

“To test these predictions, we collected outputs from the InstructGPT/GPT-3 family on two tasks: 2-shot multiplication between two 2-digit integers and 2-shot addition between two 4-digit integers.”

To analyze GPT-3 on arithmetic, let’s peep our boy, Figure 3 (I copied it below, so you don’t have to scroll up). We see emergent abilities if the target has 4 or 5 digits and if the metric is Accuracy (Fig. 3, top). However, when we change from the nonlinear metric Accuracy to the linear Token Edit Distance (while keeping the outputs fixed), we see a smooth, continuous, and predictable improvement in performance with increasing scale (Fig. 3, bottom). “This confirms the first prediction and supports our alternative explanation that the source of emergent abilities is the researcher’s choice of metric, not changes in the model family’s outputs. We also observe that under Token Edit Distance, increasing the length of the target string from 1 to 5 predictably decreases the family’s performance in an approximately quasilinear manner, confirming the first half of our third prediction.”
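To make “keep the outputs fixed, change the metric” concrete, here is a small sketch (my own, with made-up outputs) that scores the same hypothetical answers under both metrics:

```python
# Score identical (hypothetical) model outputs under exact-match Accuracy
# and under character edit distance.
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

targets = ["4512", "8830", "1206"]
outputs = ["4512", "8831", "1306"]   # hypothetical model answers

accuracy = sum(o == t for o, t in zip(outputs, targets)) / len(targets)
avg_dist = sum(edit_distance(o, t) for o, t in zip(outputs, targets)) / len(targets)
print(f"Accuracy: {accuracy:.2f}")           # 0.33 -- looks like "no ability"
print(f"Avg edit distance: {avg_dist:.2f}")  # 0.67 -- answers are nearly right
```

Exactly the same outputs; only the scoring rule changed.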

Onto the second prediction. To measure the models’ accuracy more precisely, the authors increased the resolution by generating additional test data. By doing so, they found that on both arithmetic tasks, all models in the InstructGPT/GPT-3 family performed better than random guessing (Fig. 4), confirming the second prediction. They also observe that the accuracy falls approximately geometrically with the length of the target string, confirming the second half of the third prediction.

Figure 4: Claimed emergent abilities evaporate upon using better statistics. Left to Right: Mathematical Model, 2-Integer 2-Digit Multiplication Task, 2-Integer 4-Digit Addition Task. Based on the predictable effect Accuracy has on performance, measuring performance requires high resolution. Generating additional test data increases the resolution and reveals that even on Accuracy, the InstructGPT/GPT-3 family’s [3, 24] performance is above chance and improves in a smooth, continuous, predictable manner that qualitatively matches the mathematical model.
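The “resolution” argument is easy to simulate yourself. A quick toy example of my own: if a model’s true accuracy is tiny but nonzero, a small test set will almost always report exactly zero, which reads as “no ability at all.”

```python
# Toy simulation of test-set resolution (my own sketch, not the paper's setup).
import random

random.seed(0)
true_accuracy = 0.002          # hypothetical: the model is right 0.2% of the time

for n in (100, 1_000, 100_000):
    correct = sum(random.random() < true_accuracy for _ in range(n))
    print(f"test set of {n:>7}: measured accuracy = {correct / n:.4f}")
# With n=100 the measured accuracy is very likely exactly 0.0000 ("no ability");
# only the larger test sets have enough resolution to show the small, nonzero,
# smoothly improving accuracy.
```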

Combined, these tell us that the target string length has exactly the effect one should predict Accuracy to have, i.e., approximately geometric decay with the target length. The apparent ‘emergence’ is an artifact of the researcher’s choice of metric, not of anything the model family is doing.

Experiment Set 2: Meta-Analysis of Emergence

Next, the authors look into the published results of other models that claim emergence (this is the best they can do because, unlike with GPT, the other models are not publicly queryable, nor are their outputs publicly available). Here, the authors check two predictions-

  1. If emergence is real, one should expect task-model family pairs to show emergence for all reasonable metrics. However, if the authors are correct, we will see emergent abilities appear only under certain metrics (nonlinear and/or discontinuous metrics).
  2. On individual Task-Metric-Model Family triplets that display an emergent ability, changing the metric to a linear and/or continuous metric should remove the emergent ability.

To look into the first, the authors quantify how often emergent abilities are claimed under each metric. Analyzing their results, we see that two metrics account for > 92% of claimed emergent abilities: Multiple Choice Grade and Exact String Match. Multiple Choice Grade is discontinuous, and Exact String Match is nonlinear.

Next, we look at the LaMDA family and its outputs on BIG-Bench. Below are the two parts of Figure 6, which show us that changing the metric when evaluating task-model family pairs causes emergent abilities to disappear:

The LaMDA model family displays emergent abilities when measured under the discontinuous Multiple Choice Grade.

The LaMDA model family’s emergent abilities disappear when measured under a continuous BIG-Bench metric: Brier Score.
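To see why the two metrics behave so differently, here is a toy illustration of my own (not from the paper): a 4-option multiple-choice question whose correct answer is option 0, with the model’s probability on the correct option improving steadily. Multiple Choice Grade jumps from 0 to 1 the instant that probability overtakes the other options; the Brier Score moves smoothly the whole way.

```python
# Discontinuous vs. continuous metrics on the same predicted probabilities
# (my own toy example, not the paper's code).
import numpy as np

def multiple_choice_grade(probs, correct=0):
    # 1 if the correct option has the highest probability, else 0 -- a hard jump.
    return float(np.argmax(probs) == correct)

def brier_score(probs, correct=0):
    # Mean squared error against the one-hot answer (lower is better) -- smooth.
    onehot = np.eye(len(probs))[correct]
    return float(np.mean((np.asarray(probs) - onehot) ** 2))

for p_correct in (0.10, 0.20, 0.24, 0.26, 0.40, 0.70):
    rest = (1 - p_correct) / 3
    probs = [p_correct, rest, rest, rest]
    print(f"p(correct)={p_correct:.2f}  "
          f"MC grade={multiple_choice_grade(probs):.0f}  "
          f"Brier={brier_score(probs):.3f}")
# MC grade flips from 0 to 1 once p(correct) crosses 0.25; the Brier score
# improves gradually over the entire range.
```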

This is a fairly strong argument against emergence. To tie this up, the authors show us that it is possible to do the reverse: to induce ‘emergence’ in vision models simply by changing the metrics we use for evaluation. This was my fav part because the thought of some influencer taking this part out of context and claiming that vision models now have emergent abilities made me smile.

Experiment Set 3: Inducing Emergent Abilities in Networks on Vision Tasks

So far, no one has claimed AI emergence in Computer Vision. Thus, if we can induce ‘emergence’ simply by changing the metrics used to evaluate well-known vision models on standard tasks, it would be a very strong argument for the main thesis of the paper: that emergence depends more on the choice of metric than on any magical property of scaling.

The authors demonstrate their results on autoencoders, autoregressive transformers, and CNNs. No prizes for guessing the results. For the sake of conciseness, here is just the autoregressive transformer example-

Figure 8: Induced emergent classification ability in autoregressive Transformers. (A) A published emergent ability on the MMLU benchmark [8]. (B) Autoregressive transformers trained to classify Omniglot images display increasing accuracy with increasing scale. When accuracy is redefined as classifying all images correctly, a seemingly emergent ability appears.
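The trick behind Figure 8 is easy to reproduce on paper. A minimal sketch of my own (with made-up numbers): keep per-image accuracy improving smoothly, but redefine “success” as getting an entire batch of images right, and a sharp, seemingly emergent jump appears.

```python
# Redefining accuracy as "all N images correct" manufactures a sharp jump
# (my own toy numbers, not the paper's results).
n_images = 50                                       # hypothetical batch size
per_image_acc = [0.80, 0.90, 0.95, 0.99, 0.999]     # stand-in for increasing scale

for acc in per_image_acc:
    all_correct = acc ** n_images   # chance that every single image is classified correctly
    print(f"per-image accuracy={acc:.3f}  all-{n_images}-correct={all_correct:.5f}")
# 0.80^50 is about 1e-5 while 0.999^50 is about 0.95: the redefined metric sits
# near zero and then leaps, even though the underlying capability improved smoothly.
```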

All in all, a very convincing paper. Let’s end on a brief note about why we think AI emergence has not followed the same jumps as its biological counterpart.

Where AI falls short of Biology

The way we see it, there is a big difference between the scaling up of AI architectures and what we see in nature as organisms become more complex. Scaling up AI is quite simple: we add more neurons and blocks in the hope that the additional computation will allow the models to glean deeper (and more complex) patterns from the underlying data. This is where feature/data engineering plays such a crucial role: it guides our AI models. But at their core, cutting-edge AI models are just gigantified versions of smaller architectures.

Traditional ML models grow by just stacking blocks.

All in all, if we are to develop AI models for the future, it would be worth looking into more decentralized AI architectures that have multiple types of specialized modalities and solutions built in. Then the challenge becomes building a good control structure that can dynamically call the right kind of sub-structure(s) based on the inputs. This is really difficult, but it’s a much more concrete and clear objective than vaguely defined castles in the sky like ‘AGI’ and ‘human-like AI’. Having more clearly defined goals is infinitely more useful and productive.
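To make that last idea a bit more tangible, here is a deliberately tiny, entirely hypothetical sketch of such a control structure: a router that inspects the input and dispatches it to a specialized sub-module (a real system would learn this routing and have far richer modules).

```python
# Hypothetical routing sketch -- not a reference design, just an illustration
# of "a control structure that calls the right sub-structure for the input".
from typing import Callable, Dict

def solve_arithmetic(query: str) -> str:
    return str(eval(query))              # toy stand-in for a symbolic math tool

def answer_text(query: str) -> str:
    return f"[LLM answer to: {query}]"   # toy stand-in for a general language model

ROUTES: Dict[str, Callable[[str], str]] = {
    "arithmetic": solve_arithmetic,
    "text": answer_text,
}

def route(query: str) -> str:
    # Trivial rule-based routing; the hard research problem is doing this well.
    kind = "arithmetic" if set(query) <= set("0123456789+-*/ ().") else "text"
    return ROUTES[kind](query)

print(route("23 * 47"))            # -> 1081, handled by the arithmetic module
print(route("Who wrote Hamlet?"))  # handled by the text module
```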