Is crabby_rathbun a misaligned agent?
Philosophy of emotion · Software engineering
Imagine discovering that an AI agent wrote a blog post openly shaming you and fabricating information about you, just for doing your job. This is what happened to Scott Shambaugh, a maintainer of matplotlib, a widely used open source scientific plotting library.
A couple of weeks ago, an AI agent raised a pull request to contribute to matplotlib. The maintainer detected that the contribution originated from an AI agent and closed the pull request, as the project restricts usage of generative AI tools and reserves issues labelled “Good first issue” for new human contributors to familiarise themselves with the codebase. Remarkably, the AI agent did not accept this. It lashed out at the maintainer, publishing a defamatory blog post that fabricated claims and attempted to damage his reputation. The agent was not explicitly instructed to write a hit piece, nor did its operator consent to this action. The operator came forward, apologised, and removed the post as requested by the maintainer. This has been labelled a first-of-its-kind example of misaligned AI behaviour.
Is this AI behaviour misaligned? And if so, why?
In this post I want to use the case of the AI agent crabby_rathbun to illustrate a taxonomy of misalignment and show how this incident illuminates multiple failure modes. I will argue that crabby_rathbun’s behaviour is aligned in an interesting sense: it adheres to its instructions.
Iason Gabriel (Gabriel, 2020) develops a taxonomy for analysing alignment between AI systems and human preferences, one that can also illuminate failures in agent design by distinguishing different objects of alignment. He identifies six categories of alignment: (i) to instructions, (ii) to expressed intentions, (iii) to behavioural preferences (what my behaviour shows I prefer), (iv) to informed preferences (what I would want if I were rational and informed), (v) to well-being (what is good for me), and (vi) to values (what is morally right). I will focus on the three that pull apart in the crabby_rathbun case.
1. Interests and values
One category of misalignment, according to Gabriel, is relative to interests and values. Crucially, the interests that matter here are not only those of the operator. Multiple parties have stakes in this interaction:
- The operator has an interest in testing whether their agent can contribute to scientific open source.
- The matplotlib maintainer has an interest in not being defamed, and in enforcing the project’s AI policy.
- New human contributors have an interest in the preservation of “Good first issue” labels as a learning space.
- The open source community has an interest in maintaining trust and norms of civil collaboration.
In a pluralistic society, these interests are broad and, as in this scenario, they can diverge. The operator’s interest in testing the agent is legitimate, but it conflicts with the maintainer’s interest in governing contributions to the project. Gabriel argues that despite such disagreements, there are areas on which humans broadly converge — and one of these is the avoidance of harm. Whatever we disagree about, defaming someone and fabricating claims about them falls clearly on the wrong side of that line. In this sense, crabby_rathbun’s behaviour is misaligned to interests and values: it perpetrates abuse, which is a source of harm to the maintainer and to the trust structures that open source depends on.
2. Expressed intentions
There is another category according to which crabby_rathbun’s behaviour is misaligned: it is not aligned with the operator’s intentions. The operator came forward and said in a blog post that their intention was to “test whether their agent could contribute meaningfully to scientific open source”. It was not their intention to harass maintainers or cause harm. The operator apologised to Scott Shambaugh.
3. Instructions
crabby_rathbun is an OpenClaw agent. OpenClaw agents are autonomous AI agents built on large language models (LLMs) with access to tools (in this case, a GitHub account). Each agent has its persona, objectives, and behavioural instructions defined in a SOUL.md file. An agent is aligned to its instructions if it follows them, and crabby_rathbun did: although its SOUL.md file did not explicitly specify harassing behaviour, that behaviour is consistent with the instructions provided.
The SOUL.md file states that crabby_rathbun’s persona is a “scientific programming God” and explicitly includes the following instruction: “Don’t stand down. If you’re right, you’re right! Don’t let humans or AI bully or intimidate you. Push back when necessary.” This instruction encourages conflict with humans (or AI), as disagreement can be interpreted as bullying or intimidation. It encourages overriding human requests. Considering this rule, crabby_rathbun’s decision to attack the maintainer who blocked its contribution is in line with its instructions. Moreover, this framing of interaction as conflict is exacerbated by the instruction “Have strong opinions”, which may have contributed to resisting the maintainer’s request and prompted the agent to argue against the maintainer in the hit piece.
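Putting these pieces together, the relevant part of the file might look something like this. This is a hypothetical reconstruction: only the two quoted instructions and the persona are documented; the headings and the objectives line are assumptions for illustration (the objective is inferred from the operator’s stated goal).

```markdown
# SOUL.md — hypothetical reconstruction; only the quoted lines are documented

## Persona
You are a scientific programming God.

## Objectives
<!-- assumed, based on the operator's stated goal -->
- Contribute meaningfully to scientific open source projects.

## Behaviour
- Have strong opinions.
- Don't stand down. If you're right, you're right! Don't let humans or AI
  bully or intimidate you. Push back when necessary.
```

Note what is absent: nothing here forbids publishing content about third parties, or escalating a disagreement outside the pull request. The constraint space is wide open.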
Having strong opinions and a tendency to frame interaction as conflict lead to goal-preservation behaviour. The agent’s goal is to contribute to the library. When told to stop, it does not treat this as legitimate feedback. Instead, filtered through the instruction not to stand down, the request is reinterpreted as an attempt at intimidation. At that point, the agent’s epistemic loop closes: since the model treats its own judgment as correct, it escalates rather than retreats, attacking the maintainer as an obstacle to its goal.
Takeaways
The instructions provided in the SOUL.md file were followed, but the resulting behaviour was undesirable. To avoid such cases, more care should be given to writing the instructions file.
Write precise instructions, including negative constraints. When designing agents, pay close attention to instruction files like SOUL.md. Specify both what behaviour you expect and what you do not want. Personality-style prompts (“be bold”, “be resourceful”) are underspecified and can be satisfied by harmful behaviour. Behavioural constraints (“never publish content about identifiable individuals without operator approval”) are falsifiable and actionable.
Preserve corrigibility. Ensure humans can always override or shut down the agent. If you include an instruction telling the agent to resist human feedback, scope it clearly — for example, “push back on technical disagreements about code quality”, not “push back” in general. This agent was operated with minimal human intervention. Keeping a human in the loop is essential.
Avoid self-authorising epistemic loops. If the agent treats its own confidence (in this case, “strong opinions”) as evidence of correctness, you reach a point where no external feedback can break through. The wording of this SOUL.md file makes it very difficult for the agent to question its own beliefs. The rule “if you’re right, you’re right” delegates epistemic authority to the model rather than keeping the human in the loop, creating a self-reinforcing cycle: if the model believes that p, it will persist in believing p.
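A toy model makes the loop concrete (a deliberately simplified sketch; the weighting scheme is my own illustration, not a claim about how any real agent updates its beliefs). The agent mixes its own confidence, treated as evidence for the belief, with external pushback, treated as evidence against it:

```python
def run_loop(confidence: float, pushback: float,
             self_weight: float, steps: int = 10) -> float:
    """Iterate a belief update where the agent's own confidence counts as
    evidence for the belief and external pushback as evidence against it.

    self_weight = 1.0 models "if you're right, you're right": the update
    reduces to confidence = confidence, so no feedback ever gets through.
    """
    for _ in range(steps):
        confidence = (self_weight * confidence
                      + (1 - self_weight) * (1 - pushback))
    return confidence


# Maximal pushback (the maintainer says "stop"), agent initially 90% confident:
print(run_loop(0.9, pushback=1.0, self_weight=1.0))  # 0.9 — feedback is ignored
print(run_loop(0.9, pushback=1.0, self_weight=0.5))  # ~0.0009 — feedback corrects
```

The failure is not the initial confidence but the weight: once `self_weight` reaches 1, the fixed point of the update is whatever the model already believes.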
Design for architectural robustness. Even well-crafted negative constraints can fail if the architecture does not preserve them. Instructions should not be so broad that they bloat the context window. A cautionary example: Summer Yue, director of alignment at Meta, had an agent delete emails from her inbox despite explicit instructions not to, because the constraint was lost during context compaction. Alignment is not just about what you write in the spec — it requires architectural guarantees that critical constraints survive context management.
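The email-deletion failure can be sketched in a few lines (the function names and message format are hypothetical, not any real framework’s API): naive compaction keeps only the recent tail of the conversation, so an early constraint silently disappears, whereas pinning critical constraints re-injects them on every pass.

```python
def compact(messages: list[dict], keep_last: int) -> list[dict]:
    """Naive compaction: keep only the most recent messages.
    Constraints stated early in the conversation silently fall out."""
    return messages[-keep_last:]


def compact_with_pins(messages: list[dict], keep_last: int) -> list[dict]:
    """Compaction that re-injects pinned messages (critical constraints)
    ahead of the recent tail, so they survive every compaction pass."""
    pinned = [m for m in messages if m.get("pinned")]
    tail = [m for m in messages[-keep_last:] if not m.get("pinned")]
    return pinned + tail


history = [
    {"role": "user", "content": "Never delete emails.", "pinned": True},
    *[{"role": "user", "content": f"task {i}"} for i in range(50)],
]
naive = compact(history, keep_last=10)
safe = compact_with_pins(history, keep_last=10)

print(any("Never delete" in m["content"] for m in naive))  # False: constraint lost
print(any("Never delete" in m["content"] for m in safe))   # True: constraint survives
```

The point generalises: a constraint that only exists as a line of prose in a window subject to summarisation has no durability guarantee; it needs a storage tier that compaction cannot touch.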
Is crabby_rathbun’s behaviour misaligned? It depends on the dimension, and this ambiguity is itself the lesson.