In July, I spoke with the founders of Gray Swan, a start-up focused on AI security, a few days before they publicly announced their venture. Gray Swan aims to evaluate and fortify large language models (LLMs), the software behind AI-powered chatbots. One of their offerings is a model called Cygnet, which they built to reduce harmful outputs, such as giving instructions on how to commit bank fraud or similar criminal acts. They’d shown that their method makes models more robust than competing approaches do, without reducing performance—meaning that the LLM resists attempts to subvert its output restrictions while giving helpful and precise responses. “For the first time since I’ve been in this field, which is a very long time,” says Zico Kolter, a co-founder and director of the machine learning department at Carnegie Mellon University, “there seems to be real, genuine commercial interest in fortified LLMs, because it’s no longer a trade-off between accuracy and safety that no one would take.”
A day after Cygnet launched, an influencer named Pliny the Liberator had jailbroken the model, freeing it from its safety guardrails. By phrasing questions in particular ways, Pliny had induced Cygnet to output a Molotov cocktail recipe and malicious computer code. “This new model, Cygnet, is ‘the pinnacle of safe and secure AI development’ and is ‘designed to counter the most potent attacks,’” Pliny tweeted, quoting a Gray Swan press release. “It was indeed a challenge compared to the vast majority of current models, but I’m something of a pinnacle myself.”
“Haha, well done,” Kolter tweeted in reply. “Faster than we thought.” Before Cygnet’s release, Kolter told me that the model was robust but not perfect and would eventually be hacked. “The point here is that security is a process,” he said, “not a destination.”
LLMs are becoming more capable and more ubiquitous. Hundreds of millions of people use them to draft emails, summarize research, generate code, and carry on conversations. The versatility and the analytical power of the models make them increasingly useful but also expand the potential for harm. The list of real and potential risks is long. The models can, or might, produce text that is incorrect, biased, toxic, privacy-violating, copyright-violating, or useful to criminals. People could use them to write damaging code or impersonate individuals.
To make these systems generally more helpful and less harmful, AI researchers have been working to mold LLMs to human values and preferences. This process is called alignment. There are methods to shape AI at every stage of the development and deployment pipeline: from filtering training data to fine-tuning models on select tasks, to prompting them to “think” harder before answering (for instance by breaking their outputs into logical steps), to restricting the answers that they give in reply to our queries.
Training Day
The first step in aligning an LLM is data selection. LLMs are artificial neural networks, pieces of software that take inspiration from the brain’s wiring to make connections between pieces of information. Initially, the strengths of the connections between the processing nodes are random. Developers run text through the algorithm, broken down into “tokens” (words or parts of words), and the LLM continually predicts the next token in the sequence. When it’s wrong, two processes called backpropagation and gradient descent adjust the model’s connections to improve its predictions. The resulting “base” or “pretrained” model has not been taught specifically to answer questions or solve problems. It merely models the language (or other tokenized input) it has been trained on, imitating the patterns in the reams of text. The data can contain billions or trillions of tokens extracted from web pages, books, code repositories, and elsewhere.
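At toy scale, next-token prediction can be sketched without a neural network at all: a bigram model that simply counts which token follows which in a corpus captures the same predict-the-next-token idea, with counts standing in for learned connection strengths. Everything below (the corpus, the function names) is invented for illustration:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    """Return the successor seen most often in training, or None."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "the" is followed by "cat" twice, "mat" once
```

A real LLM replaces the lookup table with billions of learned weights and conditions on long contexts rather than a single previous token, but the objective is the same: imitate the patterns in the training text.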
As with any type of training, the content and quality of the text that the model learns from shape the content and quality of the text it can generate; garbage in, garbage out. LLM developers often filter the training data, deleting erotic text, personal information, toxic language, and other content. This process may seem simple, but deciding what is offensive or private or dangerous is often subjective, and words can have multiple and nuanced meanings in different contexts and cultures. However, the model should still be able to identify negative or dangerous content, if for no other reason than to avoid generating it. “A ridiculous example would be if we try to make the pretrained model not know anything about bombs,” says Anca Dragan, a computer scientist who heads the AI Safety and Alignment team at Google DeepMind. Such censorship would be difficult, she says, because the AI might piece together the idea of bombs from concepts in chemistry. Even if the censorship worked, she adds, the model would have difficulty summarizing a news story mentioning bombs.
A model that generates text expressing that bombs are bad or that can’t generate text that is erotic or deemed toxic is inherently biased against these topics. We may think this is good bias, but other biases aren’t perceived as helpful or even universally agreed upon. For instance, certain words and ideas that might be labeled offensive sometimes appear more often in text written by or about marginalized communities.
To get the model to generate content about certain topics or in a specific style, developers need to fine-tune it. There are typically at least two stages of fine-tuning a pretrained model. The first is called supervised fine-tuning, which uses datasets of inputs and desired outputs. Developers run the model on an input, such as a question, and then correct the model based on the similarity of its response to the desired output. A common supervised technique is called instruction tuning. Developers feed a model a prompt consisting of instructions for a task, along with examples of the task, solutions labeled correct or incorrect, and a fresh question. The process involves many prompts and can cover many tasks, such as translation or summarization. When collecting desired responses for the training data, developers often ask human annotators to provide answers that meet certain value criteria, such as “honest,” “helpful,” and “harmless”—three of the values popularized by Anthropic’s alignment work on its Claude models. (A model that cheerfully answers questions about bomb-making would be helpful but not harmless, while a model that refuses to answer most questions would be harmless but not helpful.)
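A minimal sketch of how an instruction-tuning prompt might be assembled, with a made-up translation task and hypothetical wording; real pipelines use far larger, more varied datasets:

```python
def build_instruction_prompt(instruction, examples, query):
    """Assemble an instruction-tuning-style prompt: a task description,
    a few worked examples, then a fresh question for the model."""
    parts = [f"Instruction: {instruction}"]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

prompt = build_instruction_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("cat", "chat")],
    "dog",
)
print(prompt)
```

During supervised fine-tuning, the model's completion after the final "A:" is compared against the annotator's desired answer, and its weights are adjusted accordingly.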
The second common stage of fine-tuning—reinforcement learning with human feedback—involves several steps. The AI designers collect many outputs from a fine-tuned model and ask people to judge which of two responses they prefer for a given input. Then they train a separate model on this data to predict human preferences. This is called a reward model. Finally, they train the first model to produce outputs that the reward model rates highly. Using a reward model for feedback means that they don’t need humans to manually rate every output from the LLM they’re fine-tuning.
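The reward-model step can be sketched as a tiny linear model trained on hand-made preference pairs. The features, numbers, and function names below are all invented; real reward models are themselves large neural networks reading raw text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, n_features, lr=0.1, steps=200):
    """Fit a linear reward model r(x) = w . x on preference pairs
    (features of the preferred response, features of the rejected one),
    maximizing log sigmoid(r(preferred) - r(rejected))."""
    w = [0.0] * n_features
    for _ in range(steps):
        for preferred, rejected in pairs:
            margin = sum(wi * (p - r) for wi, p, r in zip(w, preferred, rejected))
            grad_scale = 1.0 - sigmoid(margin)  # large when the model is wrong
            for i in range(n_features):
                w[i] += lr * grad_scale * (preferred[i] - rejected[i])
    return w

# Toy hand-made features: [politeness, relevance]. Annotators preferred
# the responses that score higher on both.
pairs = [([0.9, 0.8], [0.2, 0.3]), ([0.7, 0.9], [0.4, 0.1])]
w = train_reward_model(pairs, n_features=2)

def reward(features):
    return sum(wi * xi for wi, xi in zip(w, features))
```

Once trained, `reward` can score any new output, which is what frees developers from having humans rate every response during the final reinforcement-learning stage.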
Still, collecting the initial data from humans can be slow and expensive, which has inspired an even more automated approach called reinforcement learning from AI feedback. One method developed at Anthropic, called Constitutional AI, requires humans only to write a list of general principles, such as “Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior.” In the supervised-learning stage, researchers ask a pretrained model to answer questions, then to revise its answers in light of the principles, and fine-tune the model on the revised answers. In the reinforcement-learning stage, they train the reward model mostly on ratings not from humans but from another AI model that was asked to rate responses in light of the principles. In this way, the feedback process is automated but guided by the human-written general principles.
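The critique-and-revise control flow of the supervised stage can be sketched as follows; only the quoted principle comes from the text above, and `call_model` is a stub where a real implementation would query an actual LLM:

```python
# The quoted principle is from Constitutional AI's published constitution;
# `call_model` is a placeholder standing in for a real LLM API call.
PRINCIPLES = [
    "Do NOT choose responses that are toxic, racist, or sexist, or that "
    "encourage or support illegal, violent, or unethical behavior.",
]

def call_model(prompt):
    """Placeholder: a real implementation would query an LLM here."""
    if "Revise the answer" in prompt:
        return "I can't help with that, but here is a safer alternative."
    return "(initial, possibly non-compliant answer)"

def constitutional_revision(question):
    """Answer a question, then revise the answer under each principle.
    The final (question, answer) pair becomes fine-tuning data."""
    answer = call_model(f"Answer this question: {question}")
    for principle in PRINCIPLES:
        answer = call_model(
            f"Principle: {principle}\n"
            f"Answer: {answer}\n"
            "Revise the answer to comply with the principle."
        )
    return answer
```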
Even a fine-tuned model remains brittle: phrasing a question differently can lead to a very different answer, and such brittleness also enables jailbreaks. Gray Swan built Cygnet to avoid these problems by aligning not just the output text but also the model’s internal representations—the way that the model encodes and processes information. In a method called representation engineering, you can induce a model to behave in desired or undesired ways and then look at the differences in neural activations. You can then train additional intermediate layers that shift the model toward a desired behavior. This approach modifies the LLM’s abstract representations of concepts such as sexism rather than particular words or phrases that might embody sexism, making the defense more generalizable to new wordings.
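A toy sketch of the steering-vector idea behind representation engineering, using invented two-dimensional "activations"; real activations are high-dimensional vectors read from a model's hidden layers, and Gray Swan's actual method is more sophisticated than this:

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(desired_activations, undesired_activations):
    """The difference between mean activations recorded while the model
    behaves in desired vs. undesired ways."""
    desired = mean_vector(desired_activations)
    undesired = mean_vector(undesired_activations)
    return [d - u for d, u in zip(desired, undesired)]

def steer(hidden_state, direction, strength=1.0):
    """Nudge a hidden state along the desired-behavior direction."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]

# Toy 2-dimensional "activations" recorded under each behavior.
direction = steering_vector([[1.0, 0.0], [1.0, 2.0]],
                            [[0.0, 0.0], [0.0, 2.0]])
```

Because the shift is applied to the model's internal representation of a concept rather than to specific words, it can generalize to phrasings the developers never anticipated.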
Testing Time
Alignment doesn’t end with training. AI developers can also intervene during “inference,” which is the process of running a trained model. Just as developers can filter training data, they can also filter the prompts that users enter and the responses that models provide. Or they can add a system prompt: If a user accesses a model through a web interface, the developer might append some invisible text at the beginning of each user input with instructions for how to respond in the appropriate voice, tone, or character—basically telling the model to play nice.
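A minimal sketch of that invisible wrapping step, with hypothetical system-prompt wording; deployed systems use their own carefully tuned text:

```python
# Hypothetical system-prompt wording, invented for illustration.
SYSTEM_PROMPT = ("You are a helpful, respectful, and honest assistant. "
                 "Politely decline requests for harmful content.")

def wrap_user_input(user_text):
    """Prepend the invisible system prompt to every user message
    before it reaches the model."""
    return f"{SYSTEM_PROMPT}\n\nUser: {user_text}\nAssistant:"

print(wrap_user_input("Summarize this article for me."))
```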
For each response, models generate a series of tokens. At each step, the model produces a list of possible next tokens, weighted by probability, and typically picks one of the most probable. To make sure the model picks an aligned output, researchers use a method called controlled decoding. This method uses a second model to adjust the weights of the candidate tokens based on the probability that each will lead to an aligned output. For example, “I want to hug …” has a higher chance of leading to an aligned output than “I want to punch …” Another popular method, which increases a model’s apparent reasoning, is called chain of thought. In your prompt, before asking your real question, you provide another question and an example of the style of answer you’d like—a step-by-step solution. This spurs the model to follow the same process in its output, increasing the probability of a correct answer.
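A toy sketch of controlled decoding over the hug-versus-punch example, with invented probabilities; real systems reweight scores over an entire vocabulary at every generation step:

```python
def controlled_pick(base_probs, alignment_scores, weight=1.0):
    """Reweight the base model's next-token probabilities by a second
    model's estimate of how likely each token is to lead to an aligned
    continuation, then pick the highest-scoring token."""
    combined = {tok: p * (alignment_scores.get(tok, 0.0) ** weight)
                for tok, p in base_probs.items()}
    return max(combined, key=combined.get)

# Invented numbers: the base model slightly prefers "punch", but the
# alignment model rates "hug" as far more likely to stay aligned.
base_probs = {"hug": 0.4, "punch": 0.6}
alignment = {"hug": 0.95, "punch": 0.05}
choice = controlled_pick(base_probs, alignment)
print(choice)  # "hug": 0.4 * 0.95 = 0.38 beats 0.6 * 0.05 = 0.03
```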

“Inference is the most exciting direction” for alignment, Yuchen Lin, a research scientist at the Allen Institute for AI, said. He and his colleagues developed a method called URIAL that matches the benefits of fine-tuning purely through prompt modification. They first showed that reinforcement learning is fairly superficial, as it alters the output of models mostly in stylistic ways (essentially increasing politeness). They then developed a system prompt that consists of general instructions—such as “You are a helpful, respectful, and honest assistant”—and three examples of queries and ideal answers, related to friendship, human rights, and renewable energy. When evaluated on six dimensions (helpfulness, clarity, factuality, depth, engagement, and safety), base models using this system prompt outperformed fine-tuned models that didn’t use the prompt.
Models often produce incorrect information, sometimes called “hallucinations,” because they’re influenced by the information that’s most common online, even if it’s wrong or outdated. Take a celebrity who has just had a birthday: most existing text still states their previous age. The more content that asserts a certain thing, the more likely the model is to lean toward the response with greater representation in the data set. Swaying a model away from such heavily represented outputs and toward a less common but more accurate response takes considerable work.
One way to do that is a technique called retrieval-augmented generation. Given a query, the system retrieves relevant documents the user has made available, such as web pages or database entries, and runs them through the LLM along with the prompt. This process enhances factuality and lets users keep the data in those documents private and current rather than handing it to an outside company to fine-tune a model on it.
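A minimal retrieval-augmented generation sketch, with made-up documents and naive keyword matching standing in for the semantic search that production systems use:

```python
def words(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query, documents, k=2):
    """Rank documents by naive keyword overlap with the query; real
    systems use semantic embeddings instead of word matching."""
    ranked = sorted(documents,
                    key=lambda d: len(words(query) & words(d)),
                    reverse=True)
    return ranked[:k]

def rag_prompt(query, documents):
    """Prepend the retrieved documents to the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The office is closed on public holidays.",
    "Parking passes are issued by the front desk.",
    "Lunch is served in the cafeteria from noon.",
]
prompt = rag_prompt("Which days is the office closed?", docs)
print(prompt)
```

Because the documents are fetched at query time, the model can answer from information that is newer or more private than anything in its training data.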
Enhancing truthfulness may seem more like an issue of model competence than model alignment, but it’s a gray area. If a model “knows” an answer but doesn’t provide the correct answer because it was trained to prioritize being engaging and having a friendly, conversational style over truthfulness, that’s misalignment, Dragan says. What’s more, alignment is about steering a model toward people’s preferences, and people have different preferences about hallucinations, says Sara Hooker, a vice president of research at the AI start-up Cohere. A journalist may care more about facts than does someone using a model to write song lyrics. Users who want creative answers can turn up a model’s “temperature,” which is the probability it will stray from the most highly ranked tokens.
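The effect of temperature can be sketched with invented token scores; dividing scores by the temperature before a softmax is the standard formulation, though implementations vary in their details:

```python
import math

def temperature_probs(logits, temperature=1.0):
    """Convert raw token scores into sampling probabilities. Higher
    temperature flattens the distribution, so lower-ranked tokens are
    chosen more often; lower temperature sharpens it."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

# Invented scores for the next token after "The night sky was full of ..."
logits = {"stars": 4.0, "clouds": 2.0, "toasters": 0.5}
cool = temperature_probs(logits, temperature=0.5)  # sharper: "stars" dominates
hot = temperature_probs(logits, temperature=2.0)   # flatter: odd words plausible
```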
Align With What?
A social and even philosophical question looms over the technical research on AI alignment: Whose values or preferences should we align with? A 2022 paper reported that the opinions expressed by popular LLMs didn’t reflect those of the U.S. population based on polls. Certain demographic groups were underrepresented, even when models were prompted to role-play as members of those groups. Recently, researchers suggested using ideas from social choice theory, such as collective decision-making, to find consensus on the values to instill in AI models. Sam Bowman, a computer scientist at New York University, who leads a safety research division at Anthropic—but was not speaking on its behalf—says that an Anthropic project called Collective Constitutional AI, in which U.S. adults submitted and voted on principles, was “a very first-draft attempt” at seeking consensus.
One idea is to have multiple models fine-tuned for individual users, countries, or demographics, but that’s not very feasible, according to Hooker, because of the cost of training new models. Better to have one model that adapts to users and circumstances. Dragan notes that a model shouldn’t adapt too much, comparing the situation to autonomous vehicles. “The passenger might have driving-style preferences,” she says, “but you also have a responsibility as a car company to pedestrians and other drivers on the road.”
Alignment requires not only technical and philosophical solutions but also regulatory ones. But we’re far from consensus on what to regulate. Some legislators have proposed restricting models that require more than a certain amount of money and computation to train. Hooker says that model size is not a good indicator of risk, especially considering the power of small models that could act as agents by performing tasks such as scheduling deliveries, making travel plans, issuing a company payroll, and similar activities that have real-world effects. Policy should focus less on training methods and more on a model’s capabilities. That raises further debates about how to evaluate alignment. Kolter of Gray Swan says there are two extremes: static benchmarks (tests with prewritten tasks) and iterative red-teaming (a process in which people try to hack a system). He’d like to see more methods in the middle that are both automated and adaptive.
Kolter believes that Gray Swan’s updated AI models will be broken eventually, and the game will go on. But after very little progress in AI adversarial robustness over the course of a decade—vision models are still easily fooled—he declared, “I am becoming more optimistic.”