Constraining LLM Output
How can we make sure large language models output what we want?
Structuring LLM output has been growing in importance. Whether it’s making sure agents respond in JSON or making sure generated code snippets compile, there is still a critical issue of aligning the output to our expectations. I wanted to take a look at the current landscape of techniques for making sure an LLM outputs what you want. There are three big areas of activity: Output Constraints, Vocabulary Constraints, and Steering Methods.
Output Constraints
Output constraints focus on fixing whatever output the LLM provides. The simplest example is passing the output to another LLM to check whether it is correct. A good example is LangChain’s SQL query checking, which feeds the generated SQL to another LLM that checks it for bugs. The hope is that after enough iterations of this, you will have correct SQL. While this chaining does improve quality, it doesn’t actually constrain the output. The checking LLM may not catch errors, nor will it necessarily remove unrelated text. If you ask for a simple SQL query, the model can still output “helpful LLM text” before the query, produce an incorrect query, or return it in the wrong format (e.g. prose, a list, JSON, or the wrong SQL dialect).
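As a rough sketch of what this chaining looks like (not LangChain’s actual chain; the model name and prompts are just placeholders for an OpenAI-style client):

```python
from openai import OpenAI  # assumes the openai client is installed and configured

client = OpenAI()

CHECK_PROMPT = """You are a SQL reviewer. Check the query below for mistakes
(wrong columns, bad joins, syntax errors). Reply with a corrected query only.

{query}"""

def generate_sql(question: str) -> str:
    """Ask the model for a SQL query answering the question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": f"Write a SQL query to answer: {question}"}],
    )
    return resp.choices[0].message.content

def check_sql(query: str) -> str:
    """Feed the generated SQL to a second LLM call that reviews it."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CHECK_PROMPT.format(query=query)}],
    )
    return resp.choices[0].message.content

draft = generate_sql("Which customers placed more than 5 orders last month?")
checked = check_sql(draft)  # still no guarantee it is valid SQL, or only SQL
```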
Another common approach is to use a data validation library like Pydantic to constrain the output. This not only lets you define more complicated objects, but also creates clear programmatic rules around acceptable output. You can then write handlers for what to do if the LLM didn’t produce the correct output, whether that’s correcting the output, re-asking the LLM, or just returning nothing. This makes LLM results much easier to work with programmatically. Good examples are Guardrails and Instructor, both of which use Pydantic to help constrain the output and place it within a ready-to-go class. Here is the workflow of Guardrails:

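To make that concrete, here is a minimal sketch in the Instructor style, assuming its documented response_model interface; the model name and fields are illustrative:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class SQLAnswer(BaseModel):
    """The only shape of response we will accept."""
    query: str = Field(description="A single valid SQL SELECT statement")
    explanation: str

# Wrap the client so responses are parsed and validated against the Pydantic model.
client = instructor.from_openai(OpenAI())

answer = client.chat.completions.create(
    model="gpt-4o-mini",       # illustrative model name
    response_model=SQLAnswer,  # the Pydantic class acts as the contract
    max_retries=2,             # re-ask the LLM if validation fails
    messages=[{"role": "user", "content": "Write a query counting orders per customer."}],
)
print(answer.query)  # a plain string field, no surrounding chat filler
```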
The advantage of these methods is that they are dead simple to implement and good enough for most use cases. They can also drop into any existing model without touching the model internals, and the resulting classes make programming against model output much easier. It’s a nice separation of responsibility. The downside is that we still aren’t assured the output is correct. While the formatting is much better, the LLM is still free to generate incorrect values. There also aren’t many good ways to handle failure besides re-asking, since at scale it’s difficult to enumerate all the possible incorrect states.
Vocabulary Constraints
Output constraints leave the LLM in place and check the text afterward, which means the LLM can still be wrong, just corrected after the fact. What if the mistake could never be generated in the first place? That’s the goal of vocabulary constraint methods: restrict the set of tokens the LLM can choose from during generation so that only valid output can be produced.
A simple programmatic example of this is a regular expression (regex). A regex is a sequence of characters that specifies a match pattern in text. For example, if you want to ensure a number is a valid float, you can use the regular expression ([0-9]+)?\.[0-9]+. A quick breakdown: [0-9] means a digit is valid at that position, the plus sign means one or more of them, and the question mark makes that group optional (i.e. .03 is fine). While regex has a reputation for being hard to read, it’s excellent at ensuring the output is correct. You can think of a regex parser as a graph that moves from one state to the next. In our float pattern we have four states: the empty string, optional digits, a period, and then more digits. If a string doesn’t follow this flow, it’s invalid.
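In Python, checking that pattern looks roughly like this:

```python
import re

FLOAT_RE = re.compile(r"([0-9]+)?\.[0-9]+")

for s in ["3.14", ".03", "42", "abc"]:
    print(s, bool(FLOAT_RE.fullmatch(s)))
# 3.14 True, .03 True, 42 False (no decimal point), abc False
```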
The project Outlines by .txt leverages regex parsers in an interesting way. A regex parser doesn’t just tell you whether a value is correct; it can also be used to construct a valid value. If we sample only from the characters allowed in each state, we will eventually produce a valid value. This is basically how next-token prediction in LLMs works! We just have to constrain the allowed vocabulary to the tokens that keep the regex satisfiable. Here is a demonstration from the blog post:
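To get a feel for the mechanics, here is a toy sketch of the idea using the third-party regex package’s partial matching. This is only an illustration; Outlines itself compiles the pattern into a finite-state machine and precomputes which tokenizer tokens are allowed in each state.

```python
import random
import regex  # third-party 'regex' module, which supports partial matching

FLOAT_PATTERN = r"([0-9]+)?\.[0-9]+"
# A toy "vocabulary" standing in for the LLM's token set.
VOCAB = list("0123456789.") + ["<eos>"]

def allowed_tokens(prefix: str) -> list[str]:
    """Keep only tokens that leave the text a valid (partial) match of the regex."""
    allowed = []
    for tok in VOCAB:
        if tok == "<eos>":
            # Only allow stopping if the prefix is already a complete match.
            if regex.fullmatch(FLOAT_PATTERN, prefix):
                allowed.append(tok)
        elif regex.fullmatch(FLOAT_PATTERN, prefix + tok, partial=True):
            allowed.append(tok)
    return allowed

text = ""
while True:
    tok = random.choice(allowed_tokens(text))  # a real model would sample by probability
    if tok == "<eos>":
        break
    text += tok

print(text)  # always a valid float, e.g. "7.305" or ".9"
```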
Another project doing something similar is the Language Model Query Language (LMQL). There are many parts to that project, but I want to focus on the where clauses it uses to constrain LLM output. It does something similar to the above, but instead of using regex as the data format it builds what are called FollowMaps: preconstructed constraints on what the next token can be given the current state. These are, somehow, more verbose than regex, but result in a more programming-like syntax. Here is an example of what the constraints look like:
This essentially says that we are finished when the tokens extending the text “Steph” result in either “en Hawking”, which is correct, or anything else, which means ending in an incorrect state. This state modeling allows for correct outputs. You can see the syntax of LMQL below; the where clause is where the constraints mainly live:
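Setting LMQL’s actual syntax aside, the follow-map idea itself can be sketched as a simple lookup from the text decoded so far to the continuations that keep the constraint satisfiable (a toy of my own, not LMQL’s machinery):

```python
# Toy follow map: given the text decoded so far, which continuations keep the
# constraint "the answer must be 'Stephen Hawking'" satisfiable?
FOLLOW_MAP = {
    "": ["Steph"],
    "Steph": ["en"],
    "Stephen": [" Haw"],
    "Stephen Haw": ["king"],
    "Stephen Hawking": ["<eos>"],  # the only valid place to stop
}

def next_tokens(text_so_far: str) -> list[str]:
    """Return the continuations allowed in the current state,
    or [] if we've already wandered into an unsatisfiable state."""
    return FOLLOW_MAP.get(text_so_far, [])

print(next_tokens("Steph"))        # ['en']
print(next_tokens("Stephen Hal"))  # [] -> incorrect state, generation is pruned
```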
These methods constrain the allowed tokens at each step of generation to create provably correct outputs, which gives them a major advantage for systems that must be correct. The negative is that they generally require deeper integration with the model, which makes them harder to adopt. Even further, they currently require rewrites of existing code, while output constraints are a drop-in addition. Lastly, it can be difficult to write parsers for more complex types like classes or code structures, so I’m unclear on their usefulness in that area. Data pipeline DAGs, for example, may work for some data and not others.
Steering Methods
With steering methods we once again move up in how much we tweak the model. While vocabulary constraints focused on the vocabulary of tokens available, these methods reach into the model itself and adjust the activations and weights it computes with. This is a much broader field, wrapped up in a much wider discussion about model behavior, but I’ll highlight a few examples of what I mean. I personally like the term activation engineering for this kind of thinking.
The most popular one in the news was Anthropic’s Golden Gate Claude. They were able to group neurons into a set of features, then associate those features with particular ideas. For example, they found a set of neurons associated with the Golden Gate Bridge. They then kept that feature activated no matter what the input was, and thus the output was always about the Golden Gate Bridge:
The point for our purposes is that by activating sets of neurons, you can steer the output heavily in one direction. This can be used to constrain the output, though it can also result in the model having an existential crisis.
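A bare-bones version of this kind of activation steering can be sketched with a PyTorch forward hook that nudges one layer’s hidden states along a fixed direction. The layer index and steering vector below are placeholders; real work derives the direction from interpretability analysis rather than random noise.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder steering vector; in practice this would be a direction found to
# correspond to a concept (e.g. via sparse autoencoder features or activation diffs).
steer = torch.randn(model.config.hidden_size) * 0.1

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0]
    return (hidden + steer.to(hidden.dtype),) + output[1:]

# Hook a middle transformer block so every forward pass is nudged along `steer`.
layer = model.transformer.h[6]
handle = layer.register_forward_hook(add_steering)

ids = tok("The most interesting thing about", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```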
Interestingly, a lot of discussion on this topic is actually about managing model size. Since these models generally only work at scale, they tend to be large, and serving multiple specialized models can be impractical. There has therefore been work on steering a single model in different directions at inference time. SteerLM from NVIDIA is an example, training the model on attribute annotations that can be dialed up or down at inference. Another is Apple’s announced LoRA adapters, which let different apps share the same on-device foundation model without adding much extra space.
While the methods above have mainly been about the format of the output, steering methods help avoid much larger issues, such as generating spam emails or promoting undesirable behavior. In this way they do less to constrain the format of the output and more to constrain its content. Output constraints and vocabulary constraints ensure it’s an email; steering methods ensure it’s not spam. This is great for dealing with broader model behavior, but likely won’t work for more functional issues such as output formatting for chaining tasks.
Conclusion
These technologies have gotten so much better, but constraining them is still hard. It remains a difficult thing to take a stochastic process and make it just a little bit less stochastic. Luckily, we learn more about the way these models work every day, and every time we do, we learn more about how to constrain their outputs to ones that are desirable. The methods above showcase the different steps people are taking to make these technologies useful, safe, and effective.