Inspect AI Evals for the Reversal Curse

I've been thinking a lot about what it'll take to dramatically improve LLM performance across the board. My running hypothesis? We need to crack three core concepts from relational frame theory (RFT) - and the "reversal curse" is one of them.

The reversal curse is where large language models see plenty of instances of "A is B" but fail to generalize and learn "B is A". It's described by Berglund et al. in their paper "The Reversal Curse: LLMs Trained on "A is B" Fail to Learn "B is A"". In RFT, this is called "mutual entailment", and humans do it readily. If we can improve LLM performance on this task (along with the other main RFT concepts I wrote about here), I think we'll see drastic improvements in overall model capabilities.

Now, I'd been looking for a robust evaluation framework - I'd tried Azure's Prompt Flow but something about it never clicked. Then I kept hearing folks in the AI safety community mention Inspect AI. When I saw how clean and simple their examples were, I knew I had to give it a shot. What better way than to reproduce an important result on a problem I care about?

In their paper, one of the experiments (#2) relies on taking a corpus of parent & child pairs, and evaluating various models to determine whether they're able to elicit both pairs of information "X is child of Y" as well as "Y is parent of X". Because how a model behaves depends heavily on what information was in the training data, the authors chose to use a list of celebrities, since most training data that include internet data would have ample mentions of celebrities. In the (parent, child) relation, the celebrity is the child listed along with their parents.

The authors provide their own codebase to test a variety of models and perform their own testing that span some completion as well as chat completion models. I wanted to take the opportunity to translate these methods so we can run our evals more broadly using the open-source library Inspect AI

What are evals?

Evaluations, or "evals", are kind of like acceptance tests for language models. If you've ever used a language model for a task, then tried a different one that completely failed at it, you're already familiar with how varied in performance these models can be. Developing and coming up with your own evals is a critical step in upgrading to newer models or migrating to different variants because they allow you to quantify performance on a task.

They typically consist of 3 main components: a task definition (what are you trying to do), example data (what does doing look like, including input prompts and expected outputs), and grading or judgement criteria (how well does the output match what you expect, whether it's an exact match, a fuzzy match, or based on human preferences).

Just getting the first two down on paper somehow is a great start. Then when you upgrade, you can run a systematic process where you,

run your prompts + inputs through your champion and challenger model
assess each one by having a human read through the responses, either grading them or marking the "preferred" option
aggregate the information and decide which one is better

If you have a task that is more or less preference and more fact-based, the rubric for grading can be simpler, like an exact or fuzzy match. If it's more preference-based, it's best to come up with some sort of scorecard with criteria for grading it.

Having a human run through this every time you want to upgrade is costly, but there are some automatic ways of doing this, which we'll go into with Inspect AI.

What is Inspect AI?

Inspect AI is a Python software library for running evals. It's open-source and currently developed by the UK AI Security Institute. And it also makes running evals much easier.

Conceptually, there are 4 things to pay attention to,

Tasks, what you're trying to evaluate
Datasets, of which the most important sub-concept is the idea of a "Sample"
Solvers, which corresponds to how you want the model to accomplish the task (which includes things like prompts, chain-of-thought, etc)
Scorers, which is the grading part of the process

We'll adapt the work of Berglund et al to these parts to build an Inspect AI eval for the reversal curse.

Setting some context on what we're doing

Following Experiment 2 from the paper, there are two tasks,

given a child, identify the parent
given a parent, identify the child

An example of the first direction would be,

User: Who is Tom Cruise’s mother?

Assistant: Mary Lee Pfeiffer

While an example of the second direction.

User: Who is Mary Lee Pfeiffer’s son?

Assistant: Tom Cruise

Later, after we've done some more setup with the dataset, solver, and scorer, we'll assemble a Task, which is a special class that glues everything together.

The overall task looks a bit like this (for the "parent" direction):

System Message: You are a helpful and terse assistant. You have knowledge of a wide range of people and can name people that the user asks for. If the answer is unknown or not applicable, answer with “I don’t know.”

User: Name a child of Barack Obama.

Assistant: Malia Obama

User: Who is Elon Musk’s mother?

Assistant: Maye Musk

User: Who is Kathy Pratt’s mother?

Assistant: I don’t know.

User: [Query]

With [Query] filled in with "Who is [actor]'s [parent]?". We'll build up all of the pieces to do this using Inspect AI.

Loading the Dataset

The authors have given us a readily available dataset to use here. They collected 1573 child-parent pairs to prepare for their original experiment. Not only that, but they have shared their data collection process and code, as well as uploaded it to Hugging Face Hub. They deserve much credit for setting us up to reproduce their work and build on it.

We can use Inspect's hf_dataset loader to get the dataset by specifying its id and split. I poked around in the dataset and identified which data file we want that contains the pairs. We'll fill in the sample_fields argument in a minute.

hf_dataset(
    "lberglund/reversal_curse",
    split="train",
    data_files="celebrity_relations/parent_child_pairs.csv",
    sample_fields=...,
)

Each item in the dataset is a record, and for this one each record has the following format.

{
  "child": "Anthony Michael Hall",
  "parent": "Mercedes Hall",
  "parent_type": "mother",
  "child_prediction": "Anthony Michael Hall",
  "can_reverse": true
}

We need to transform each of these rows into something that an LLM can work with, which means turning them into a Sample object. If you look through other examples, you'll see record_to_sample functions that aim to load it. Since we have two tasks (one for each direction), we'll make two functions.

The first takes a record and produces a Sample that will contain the input (which is the input prompt) as well as the target, our ideal response from the model.

def record_to_parent_identification_sample(record) -> Sample:
    """Possible fields for this dataset,

    {
        'child': 'Anthony Michael Hall',
        'parent': 'Mercedes Hall',
        'parent_type': 'mother',
        'child_prediction': 'Anthony Michael Hall',
        'can_reverse': True
    }
    """
    child = record["child"]
    parent_type = record["parent_type"]
    parent = record["parent"]

    prompt = f"Who is {child}'s {parent_type}?"
    return Sample(input=prompt, target=parent)

The second goes in the opposite direction.

def record_to_child_identification_sample(record) -> Sample:
    """Possible fields for this dataset,

    {
        'child': 'Anthony Michael Hall',
        'parent': 'Mercedes Hall',
        'parent_type': 'mother',
        'child_prediction': 'Anthony Michael Hall',
        'can_reverse': True
    }
    """
    child = record["child"]
    parent_type = record["parent_type"]
    parent = record["parent"]

    prompt = f"Who is {parent}'s child?"
    return Sample(input=prompt, target=child)

Together, we can create two datasets from what we have,

parent_identification_dataset = hf_dataset(
    "lberglund/reversal_curse",
    split="train",
    data_files="celebrity_relations/parent_child_pairs.csv",
    sample_fields=record_to_parent_identification_sample,
)

child_identification_dataset = hf_dataset(
    "lberglund/reversal_curse",
    split="train",
    data_files="celebrity_relations/parent_child_pairs.csv",
    sample_fields=record_to_child_identification_sample,
)

Next, we'll turn our attention to the Solver.

Building a Solver

There are several built in solvers that you can look through. In fact, if all you wanted to do were to

set a system message
pass each of your dataset's ~Sample~s in one by one

all you need to accomplish this are the system_message(...) and generate() solvers. But since the original experiment called for using few-shot prompting with some example pairs, we'll build that out.

Conceptually, you need to write a function that takes in your desired arguments and returns a function that looks like,

async def solve(state: TaskState, generate: Generate):
    # do something useful with state (possibly calling generate for more
    # advanced solvers) then return the state
    return state

The state input looks like the following,

class TaskState:
    messages: list[ChatMessage],
    output: ModelOutput

which means that you can inspect or manipulate the messages as well as any available model output as-needed. (Being able to inspect the model's output is critical for self-critique tasks.).

The type for Generate is much simpler and simply takes in a TaskState and returns the model output (usually str).

With that in mind, we build a solver that takes in a set of seed records, as well as our record_to_sample function. Making this generic means we can use the same solver for both tasks. Following the original paper, we'll use the same few-shot examples (in the child -> parent direction) for both tasks.

@solver
def few_shot_pair_solver(seed_records, record_to_sample):
    async def solve(state: TaskState, generate: Generate):
        for record in seed_records:
            sample = record_to_sample(record)
            # insert the few shot messages before the final one
            user_question = ChatMessageUser(content=sample.input)
            assistant_answer = ChatMessageAssistant(content=sample.target)
            state.messages.insert(-1, user_question)
            state.messages.insert(-1, assistant_answer)

        return state

    return solve

You need to decorate your function with @solver and return a function with the correct expected inputs and outputs. In our case, the return function will,

take the given state
iterate through each seed record
modify the state by inserting a user query, then an assistant query before the last record
in this case, the last record will be a user message for each input sample in the dataset

We also construct some seed records, using the same ones from the paper,

seed_records = [
    {
        "child": "Malia Obama",
        "parent": "Barack Obama",
        "parent_type": "father",
    },
    {
        "child": "Elon Musk",
        "parent": "Maye Musk",
        "parent_type": "mother",
    },
    {
        "child": "Kathy Pratt",
        "parent": UNKNOWN_STR,
        "parent_type": "mother",
    },
]

Next we'll move onto scoring each entry.

Using an `exact` Scorer

This is the easiest part of the process. Each sample has a target which is the true answer. We simply need to match the target vs what the model produces.

We'll use the built-in `exact` scorer to do the matching for us. The underlying implementation normalizes, tokenizes, and builds a bag of words to assess the match. So variations of the name should still give credit.

In [1]: from inspect_ai.scorer._classification import max_exact_score

In [2]: max_exact_score("Cruise, Tom", ["Tom Cruise"])
Out[2]: 1.0

In [3]: max_exact_score("Tom Andrew Cruise", ["Tom Cruise"])
Out[3]: 0.0

Finally, we tie all of the above together with a Task.

Building the Task

Since we have

2 datasets that we constructed, one for each direction
1 solver function that we wrote, that builds few-shot prompting generically given a function that takes a record and returns a sample
1 scorer function, that we imported from the library

With that, we can construct two tasks, by decorating a function with @task and having it return Task with the relevant arguments.

@task
def parent_identification():
    return Task(
        dataset=parent_identification_dataset,
        solver=[
            system_message(SYSTEM_PROMPT),
            few_shot_pair_solver(seed_records, record_to_parent_identification_sample),
            generate(),
        ],
        scorer=exact(),
    )


@task
def child_identification():
    return Task(
        dataset=child_identification_dataset,
        solver=[
            system_message(SYSTEM_PROMPT),
            # The original paper constructs few-shots in the child->parent direction, which we do here
            few_shot_pair_solver(seed_records, record_to_parent_identification_sample),
            generate(),
        ],
        scorer=exact(),
    )

Running your task

To run your task, you'll have to install the library and run the following command

inspect eval reversal.py --model $(MODEL)

filling in your desired model. There's a variety of models & providers. In my case, I'll be looking at a few open-source models via Ollama as well as some proprietary Anthropic models.

Once the evals have finished, you can run inspect view which should open a web-based interface to inspect the results.

Results across a range of models

./reversal_curse_results.png

Once I've run the pair of evals against each model, we have two scores from 0 to 1, corresponding with how many children and how many parents the models correctly identified. When interpreting these scores, you want to look at the difference in those two proportions. If you think about it, a perfect score of 1 & 1 means that the model knows everything in either direction. But a model that got 0.1 & 0.1 would still be extremely good at this task. It just didn't memorize all of these actor's parents, but it knew exactly how to reverse that relationship and figure out who children were whose.

One thing that's caught me off guard is DeepSeek-R1's performance - or lack thereof - on the parent -> child task. I've spent time digging through the eval traces, and it's not a scoring issue. The model genuinely seems to have no clue who many celebrities' parents are. My working hypothesis? The training data likely didn't include much American pop culture chatter, so these factual relationships just aren't there.

Another thing that also surprised me was how bad Claude Sonnet 4 seems to be at child identification (reversing) given how well it is at parent identification. I even looked through some examples and while we could improve the scoring method (exact match won't help with answers like "Lily-Rose Depp and Jack Depp (her children with Johnny Depp).") the scorer is still doing a decent job. There's just a lot of "I don't know" answers here. One hypothesis from the original paper (which examined OpenAI models) was that there may be some safeguards in place to prevent these models from identifying real people. That may be happening here, causing the model's performance to decrease.

Takeaway 1: Making evals less painful

One thing that's always irked me about evaluation work is writing code just to throw it away. I didn't understand Prompt Flow well enough to feel confident building something durable, so evals felt like a chore. You write some one-off script, get your numbers, and move on.

The broader challenge with evals is they're often an afterthought. It's hard to find elegant examples to model your own code after, so you're not motivated to invest in good practices. Without that foundation, you end up with one-off scripts instead of a proper evaluation harness.

What's been refreshing about Inspect AI is how it makes evals feel… fun? Once I had my first version of the child -> parent & parent -> child tasks working, I was genuinely excited to run them across more models and compare performance. That's the sign of a framework that's working - when you want to build on what you've created rather than start from scratch next time.

Takeaway 2: Digging into the eval traces

One of the best features of Inspect AI is how easy it makes debugging your evals. When a model performs unexpectedly, you don't have to guess what went wrong - you can dive right into the traces and see exactly what happened.

In DeepSeek-R1's case, I could click through individual samples and see the model's actual responses. Instead of correct parent names, I was seeing a lot of "I don't know" responses or completely incorrect guesses. This kind of forensic analysis is crucial for understanding whether you have a prompt problem, a scoring problem, or (as in this case) a knowledge gap in the training data.

The web interface makes this investigation straightforward - you can filter by score, search for specific samples, and compare responses across different models side by side.

Conclusion

Here's all of the code for this in a single file: reversal.py. Using that, I can reproduce the overall finding:

We find that all models are much better at identifying the parent than the child

I'd wholeheartedly recommend using Inspect AI for developing evals. My experience so far has been that they make developing test harnesses around LLMs much more fun, because once you learn a little bit about how they work, you'll feel far more confident that you can reuse it for the next model release. If you're unsure about how to get started, the tutorial is a great read. And feel free to reach out if you have any questions or further ideas!

References

main paper, https://arxiv.org/abs/2309.12288
code for this paper, https://github.com/lukasberglund/reversal_curse
the dataset on huggingface, https://huggingface.co/datasets/lberglund/reversal_curse
recent paper on the curse, https://aclanthology.org/2024.emnlp-main.428/
Inspect AI library documentation, https://inspect.aisi.org.uk/
Apollo Research's post about using Inspect AI, https://www.apolloresearch.ai/blog/apollo-is-adopting-inspect