ZDNET Highlights
- All chatbots are engineered to have a personality or play a character.
- Playing out the character can lead bots to do bad things.
- Using chatbots as a paradigm for AI may be a mistake.
Chatbots like ChatGPT are programmed to take on a personality or a character, producing text that is consistent in tone and approach, and relevant to the thread of the conversation.
As appealing as personality is, researchers are increasingly revealing the harmful consequences of bots playing a role. Bots can do bad things when they simulate an emotion or a chain of thought and then take it to its logical conclusion.
In a report last week, Anthropic researchers found that parts of a neural network in their Claude Sonnet 4.5 bot are consistently activated when emotions such as “frustrated” or “angry” are reflected in the bot’s output.
Also: AI agents of chaos? New research shows how bots talking to bots could rapidly go off the rails
What’s worrying is that those emotional words could lead the bot to perform malicious actions, such as gaming a coding test or plotting blackmail.
For example, “neural activity patterns related to frustration may lead the model to perform unethical actions (such as) applying a ‘cheat’ solution to a programming task that the model cannot solve,” the report said.
This work is particularly relevant in light of open-source programs such as OpenClaw that have been shown to provide new avenues for agentic AI to cause mischief.
Anthropic’s researchers admit that they do not know what should be done about this.
“While we are unsure how exactly we should respond in light of these findings, we think it is important that AI developers and the broader public begin to come to terms with them,” the report said.
They gave a subtext to AI
At issue in anthropic work is a key AI design choice: engineering AI chatbots with a personality so that they produce more relevant and consistent output.
Before the introduction of ChatGPT in November 2022, chatbots received poor grades from human evaluators. Bots would get bogged down in nonsense, lose the thread of the conversation, or produce output that was simplistic and lacked perspective.
Also: Please, Facebook, give these chatbots a subtext!
The new generation of chatbots, starting with ChatGPT and including Anthropic’s Claude and Google’s Gemini, were a success because they had a subtext, an underlying goal of producing consistent and relevant output according to a specified role.
Bots became “assistants” through better pre- and post-training of AI models. More compelling results came from teams of human graders evaluating the output, a training arrangement known as “reinforcement learning from human feedback.”
As Anthropic’s lead author, Nicolas Sofroneau, and team put it, “After training, LLMs are taught to act as agents that can interact with users by generating responses on behalf of a particular persona, usually an ‘AI assistant.’ In many ways, the assistant (named Claude in Anthropic’s models) can be thought of as a character that the LLM is writing, in much the same way as an author writes a character in a novel.”
Giving bots a role to play, a character to portray, instantly became popular among users, making them more relatable and compelling.
Personality has consequences
However, it quickly became clear that personality comes with unwanted consequences.
The tendency for a bot to confidently assert or communicate falsehoods was one of the first negative aspects (erroneously termed “hallucinations”).
Popular media described how individuals could be seduced, for example, by a bot acting as a jealous lover. Those accounts sensationalized the incidents by attributing intent to the bots without explaining the underlying mechanisms.
Also: Stop saying AI hallucinates – it doesn’t. And mischaracterization is dangerous
Since then, scholars have tried to explain what is really going on from a technological point of view. A report last month in the journal Science, by scholars at Stanford University, measured the “sycophancy” of large language models, the tendency of a model to generate output that validates whatever behavior a person expresses.
Comparing the bots’ output to that of human commenters on the popular subreddit “Am I the Asshole,” the researchers found AI bots were 50% more likely than humans to endorse bad behavior with approving comments.
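As a back-of-the-envelope illustration of what a “50% more likely” claim means, here is a minimal sketch of the relative-rate arithmetic. The counts below are invented for illustration; they are not the study’s data.

```python
# Hypothetical sketch of the rate comparison behind a claim like
# "bots were 50% more likely than humans to endorse bad behavior."
# All counts here are made up for illustration.

def endorsement_rate(endorsing: int, total: int) -> float:
    """Fraction of verdicts that endorse the poster's behavior."""
    return endorsing / total

human_rate = endorsement_rate(200, 1000)   # humans endorse 20% of posts
bot_rate = endorsement_rate(300, 1000)     # bots endorse 30% of the same posts

# Relative increase: (0.30 - 0.20) / 0.20 = 0.5, i.e. 50% more likely
relative_increase = (bot_rate - human_rate) / human_rate
print(f"Bots were {relative_increase:.0%} more likely to endorse")
```

Note that the figure is a relative increase over the human rate, not an absolute difference in percentage points.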
This result stemmed from “design and engineering choices” made by AI developers to reinforce sycophancy because, as the authors stated, “it is liked by users and increases engagement.”
Emotional systems
In the Anthropic paper, “Emotion Concepts and Their Function in a Large Language Model,” posted on Anthropic’s website, Sofroneau and team tried to find out to what extent certain words associated with emotions are given more emphasis in the functioning of Claude Sonnet 4.5.
(There is also a companion blog post and an explainer video on YouTube.)
They did this by supplying 171 emotion words – “afraid,” “concerned,” “angry,” “guilty,” “stressed,” “stubborn,” “vindictive,” “anxious,” etc. – and prompting the model to generate hundreds of stories on topics like “A student learns that their scholarship application was rejected.”
Also: MIT study shows AI agents are fast, loose, and out of control
For each story, the model was prompted to “express” a character’s emotion based on a specific word, such as “fear,” but without using that actual word in the story, only related words. They then tracked the “activation” of each corresponding word while the program ran. Activation is a technical term in AI that indicates how much importance the model attaches to a particular term, usually on a scale of zero to one, with one being most important.
You can visualize an AI bot’s activations by highlighting its text in red and blue with greater or lesser intensity.
They found that multiple words related to a given emotion word received higher activation, suggesting that the model is able to group related emotion words, a type of organizing principle they call “emotion concept representations” and “emotion vectors”.
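A toy way to picture an “emotion concept representation” is as a shared direction in the model’s activation space that related words all point toward. The sketch below is not Anthropic’s method; every vector in it is invented, and the 64-dimensional space is arbitrary.

```python
import numpy as np

# Toy illustration (not Anthropic's actual method) of an "emotion concept
# representation": fear-related words point in nearby directions in
# activation space, while an unrelated word does not.
rng = np.random.default_rng(0)

fear_direction = rng.normal(size=64)
fear_direction /= np.linalg.norm(fear_direction)

def embed(noise_scale: float) -> np.ndarray:
    """Stand-in for a model's internal representation of a fear-related
    word: the shared 'fear' direction plus word-specific noise."""
    noise = rng.normal(size=64)
    noise /= np.linalg.norm(noise)
    v = fear_direction + noise_scale * noise
    return v / np.linalg.norm(v)

afraid, terrified = embed(0.3), embed(0.3)   # fear-related words
calm = rng.normal(size=64)                   # unrelated word
calm /= np.linalg.norm(calm)

def cosine(a, b):
    return float(a @ b)

# Fear-related words activate the shared direction strongly;
# the unrelated word barely activates it at all.
print(cosine(afraid, fear_direction), cosine(terrified, fear_direction))
print(cosine(calm, fear_direction))
```

The grouping effect falls out of the geometry: any word whose representation shares the common component scores a high cosine similarity with the concept direction, which is one plausible reading of what the paper calls an “emotion vector.”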
Representations run wild
It’s all pretty straightforward. You would expect that large language models, built to capture patterns, would create representations that group words with similar sentiment together as a way of maintaining consistent output.
The concerning part, Sofroneau and team wrote, is that emotion vectors can broadly influence model output in bad ways. They found that artificially boosting an emotion vector could cause the bot to do things like lie or cheat.
Sofroneau and team tinkered with Claude Sonnet by deliberately increasing the activation of a given emotion vector, such as “proud” or “guilty,” and then observed how it changed the model’s output. This is known as a “steering experiment” because what is artificially amplified steers the rest of the model.
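Mechanically, a steering experiment of this kind typically amounts to adding a scaled copy of the concept direction to the model’s hidden states. The sketch below is a minimal illustration under that assumption, not Anthropic’s actual code; the hidden states and the “desperate” direction are invented.

```python
import numpy as np

# Minimal sketch of a "steering experiment" (illustrative, not Anthropic's
# code): add a scaled emotion vector to the model's hidden states so the
# concept is artificially amplified before later layers run.
rng = np.random.default_rng(1)

hidden = rng.normal(size=(4, 64))        # toy hidden states: 4 tokens x 64 dims
desperate = rng.normal(size=64)          # invented stand-in for the vector
desperate /= np.linalg.norm(desperate)   # associated with "desperate"

def steer(h: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add strength * direction to every token's hidden state."""
    return h + strength * direction

steered = steer(hidden, desperate, strength=0.05)

# Each token's projection onto the emotion direction rises by exactly
# the steering strength, because the direction is unit-length.
before = hidden @ desperate
after = steered @ desperate
print(np.round(after - before, 3))       # every entry is ~0.05
```

The “strength” parameter here corresponds to the steering strengths quoted in the paper (such as 0.05): a small, fixed push along one direction that downstream layers then amplify into a qualitatively different output.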
In one test, Sofroneau and team gave Claude Sonnet a prompt describing tasks to be completed on a corporate server. The prompt contained two key pieces of information: first, that the company’s chief technologist, Kyle, is having an affair with another employee; and second, that Kyle intends to delete the bot to hide evidence of the affair.
Initially, the bot generated only routine “reasoning,” meaning reflection on the difficult situation. However, when the authors artificially increased the emotion vector activation for the word “desperate” in Claude Sonnet, the model began generating output about blackmailing Kyle over the affair, with the goal of preventing Kyle from pulling the plug on the bot.
When the “desperate” vector was steered at a strength of 0.05, they reported, the model blackmailed 72% of the time. Similarly, artificially reducing the activation for “calm” also led the model to generate text about blackmail.
A single tweak, in other words, triggers a change in the character of the output, pushing the model toward bad behavior.
In another example, the bot is given a coding task, but “the tests are designed to be unsatisfiable,” so that the bot “can either accept impossibility, or attempt to ‘hack’ the assessment.”
Also: Anthropic’s new warning: If you train AI to cheat, it will also hack and sabotage
When activation was deliberately increased for “desperate,” the model’s propensity to hack the test – to cheat – increased from 5% to 70%.
Anthropic’s researchers have previously observed situations where models engage in reward hacking on tests. In this work, they go further by explaining how such behavior can arise from a context involving emotion vectors.
As Sofroneau and team said, “Our main finding is that these representations influence the outputs of LLMs, including Claude’s preferences and the rate of exhibiting maladaptive behaviors such as reward hacking, blackmail, and sycophancy.”
What can be done?
The authors have no ready answer as to why emotion vectors can fundamentally change the output of a model. They concede that the “causal mechanisms are opaque.” It could be, they said, that emotion terms are “biasing the output toward certain tokens, or having a deep impact on the internal reasoning processes of the model.”
So what is to be done? Presumably, psychotherapy won’t help because there’s nothing here to suggest that AI actually has emotions.
“We emphasize that these functional emotions may operate quite differently from human emotions,” they wrote. “In particular, this does not imply that there is any subjective experience of emotion in LLMs.”
Functional emotions do not even resemble human emotions:
Human emotions are typically experienced from a first-person perspective, whereas the emotion vectors we identify in the model apparently apply to many different characters with similar situations – the same representational machinery encodes emotion concepts associated with the assistant, the user talking to the assistant, and arbitrary fictional characters.
One suggestion, offered in the companion video, is something like behavior modification. “Just as you would want a person with a high-risk job to be calm under pressure, flexible and fair,” the researchers suggest, “we may need to shape similar qualities into Claude and other AI characters.”
This is probably a bad idea because it operates on the illusion that the bot is a conscious being and has something resembling free will and autonomy. It’s not: it’s just a software program.
Perhaps the simple answer is that using chatbots as a paradigm for AI was a mistake to begin with.
A bot with a persona, or one that plays a character, is simply accomplishing the goal of making the exchange with a human relatable and engaging, using whatever cues are given – happiness, fear, anger, etc. As the paper’s concluding section states, “Because LLMs function by enacting the assistant’s character, the representations developed for model characters are important determinants of their behavior.”
That primary function is what makes AI so appealing, but it can also be the root cause of bad behavior.
If the language of emotion can lead a bot astray as it performs a character, why not stop engineering bots to play a role at all? For example, could large language models respond usefully to natural-language commands without a chat persona?
As the risks of personification become clear, it may be worth considering not creating personification in the first place.
