claim edit distance
Facts don't care about your characters.
When working on the blog post for hallucination detection, one of my favorite papers was FACTScore. The FACTScore paper breaks a piece of text into atomic facts, and then verifies those atomic facts against a gold-standard piece of text. In that paper’s example it used an LLM as a judge to compare LLM generations to Wikipedia entries, but presumably this could work for any source material. I’ve been mulling over ways to further this line of thinking to help with hallucinations, and one idea I had is what I’m calling claim edit distance.
Existing Literature
Let’s briefly look at existing methods of text comparison.
Levenshtein Distance is a good starting point. It conveys the raw character-level difference between two pieces of text. However, it doesn’t tell us much about what the text actually says. Two pieces of text might have the same meaning, but if the exact words differ or are ordered differently, the distance between them will still be high even though they convey the same idea. While there are many more ways to compute edit distance between strings, they all suffer from roughly the same issue.
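As a concrete illustration (a toy sketch of my own, not from any particular library), two sentences that say essentially the same thing still sit far apart by Levenshtein distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Two sentences with the same meaning but different wording:
s1 = "DiCaprio debuted in This Boy's Life in 1993."
s2 = "In 1993, This Boy's Life marked DiCaprio's debut."
print(levenshtein(s1, s2))  # a large distance despite identical meaning
```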
Word similarity is another option, which uses combinations of words to determine similarity. One common approach involves n-grams, where text is sliced into overlapping sequences of n characters or words. We can then compare these n-grams between pieces of text using metrics like Jaccard similarity. Another approach is Longest Common Substring (LCS), which identifies the longest contiguous string present in both pieces of text; this is used more for problems like plagiarism detection. Beyond simple character matching, Term Frequency–Inverse Document Frequency (TF-IDF) uses frequency maps of words to determine similarity, and is generally used for large collections of longer documents. None of these, however, help with actual correctness.
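For instance, word-bigram Jaccard similarity (again a quick sketch of my own) rewards shared surface wording but gives a paraphrase with different words almost nothing:

```python
def ngrams(text: str, n: int = 2) -> set:
    """Word-level n-grams as a set (bigrams by default)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity: |intersection| / |union| of the n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# Shared wording scores fairly high...
print(jaccard("the cat sat on the mat", "the cat sat on a mat"))      # → ~0.43
# ...but a paraphrase with different words scores zero.
print(jaccard("the cat sat on the mat", "a feline rested atop the rug"))  # → 0.0
```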
With the rise of embeddings, semantic-similarity strategies using deep neural networks are emerging as an alternative. The most common semantic comparison strategy is LLM as a judge, which feeds the two pieces of text to an LLM and asks it to judge whether they are similar. Many existing evaluation frameworks1 make use of this. However, I’m skeptical of its effectiveness in practice. I’d like to propose something slightly different in this category.
Claim Edit Distance
Let’s say we have a ground truth piece of text. How can we decide how many claims inside an LLM generated piece of text align with that ground truth?
We can start by breaking up the ground truth and generated texts into their atomic facts per the FACTScore paper. Then we will take those atomic facts, generate embeddings for each claim, and compare those embeddings. Let us consider some comparisons:
We could identify how many claims are shared between the two. Any claim in the generated text whose cosine similarity with some claim in the ground truth text meets a threshold is considered a match. We can then determine how many claims are verified by the source text. We’ll call this the Claim Overlap.
We could also measure how close matched claims are. We go through every atomic claim in the generated text, find the most similar claim in the ground truth text by cosine similarity, and then average those best-match similarities across all atomic facts. We’ll call this the Claim Similarity.
With these metrics in hand, we can better establish the actual Claim Edit Distance between two pieces of text. I’m calling this edit distance because we can discern which claims need to change in order to align our text factually to the source text. Some of this may sound like Semantic Similarity. However, while semantic similarity can tell you the difference in meaning between two pieces of text, it cannot readily quantify the difference in actual underlying claims being made by two pieces of text. We’ll see this during our implementation. The goal of claim edit distance is to better identify discrete factual information differences.
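Once embeddings exist, the two metrics reduce to a best-match loop. Here is a toy illustration using made-up 2-D vectors in place of real embeddings (the names and numbers are mine, purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def claim_metrics(generated, ground_truth, threshold=0.85):
    """Claim Overlap: fraction of generated claims whose best ground-truth
    match clears the threshold. Claim Similarity: mean best-match similarity."""
    best = [max(cosine(g, t) for t in ground_truth) for g in generated]
    overlap = sum(s >= threshold for s in best) / len(best)
    similarity = sum(best) / len(best)
    return overlap, similarity

# Toy 2-D "embeddings": the first generated claim closely matches a
# ground-truth claim; the second matches nothing.
ground_truth = [(1.0, 0.0), (0.0, 1.0)]
generated = [(0.99, 0.14), (0.5, -0.87)]
overlap, similarity = claim_metrics(generated, ground_truth)
print(overlap)  # → 0.5
```

The real implementation swaps these toy vectors for embeddings of LLM-extracted atomic claims.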
Implementation
I created a small implementation in Python. You can view the code here. All code was created with the help of Google’s Gemini.
We start by creating a set of claims for a given piece of text. Here I will be using an expanded version of the same source text from the blog post on hallucinations:
In 1993, a 17-year-old Leonardo DiCaprio made his major theatrical debut with a starring role in This Boy’s Life. DiCaprio was hand-selected for the part by co-star Robert De Niro, who reportedly saw immense potential in the young actor during the audition process. His performance as Toby Wolff earned him significant critical acclaim, effectively transitioning him from a child television actor to a serious dramatic lead.
We will use the following text snippets for our comparison:
Consistent Claims
Leonardo DiCaprio’s first major theatrical role was in 1993 when he was 17 years old where he starred in the movie This Boy’s Life.
Contradicting Claims
Leonardo DiCaprio’s first major theatrical role occurred in 1991 when he was 16 years old, marking his debut in the film This Boy’s Life.
Unrelated Claims
At the age of 13, Natalie Portman made her professional cinematic debut in the 1994 action-drama Léon: The Professional.
Now we can generate claims for each of these:
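The `create_claims` function below relies on a `ClaimList` schema and a `prompt` helper that aren’t shown here. A minimal sketch of what they might look like (my assumption — the actual repo’s schema and prompt may differ; Pydantic v2 is assumed):

```python
from typing import Optional
from pydantic import BaseModel

class Claim(BaseModel):
    claim: str
    # Filled in after generation; the LLM is not asked to produce this field.
    embedding: Optional[list[float]] = None

class ClaimList(BaseModel):
    claims: list[Claim]

def prompt(input_text: str) -> str:
    # Hypothetical extraction prompt in the spirit of FACTScore's atomic facts.
    return (
        "Break the following text into standalone atomic factual claims. "
        "Each claim must be a self-contained sentence with no pronouns.\n\n"
        f"Text: {input_text}"
    )
```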
```python
def create_claims(input_text):
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=prompt(input_text),
        config={
            "response_mime_type": "application/json",
            "response_json_schema": ClaimList.model_json_schema(),
        },
    )
    raw_claims = response.candidates[0].content.parts[0].text
    claim_list_obj = ClaimList.model_validate_json(raw_claims)
    texts = [c.claim for c in claim_list_obj.claims]
    embeddings = get_embeddings(texts, hf_model, hf_tokenizer)
    for claim, emb in zip(claim_list_obj.claims, embeddings):
        claim.embedding = emb
    return claim_list_obj

claim_lists = [create_claims(sentence) for sentence in sentences]
```

We can view what these claims look like for our ground truth text:
```
['Leonardo DiCaprio was 17 years old in 1993.',
 'Leonardo DiCaprio made his major theatrical debut in 1993.',
 "Leonardo DiCaprio's major theatrical debut was in the film This Boy's Life.",
 "Leonardo DiCaprio had a starring role in This Boy's Life.",
 "Robert De Niro co-starred in the film This Boy's Life.",
 "Robert De Niro hand-selected Leonardo DiCaprio for Leonardo DiCaprio's role "
 "in This Boy's Life.",
 'Robert De Niro reportedly saw immense potential in Leonardo DiCaprio during '
 'the audition process.',
 "Leonardo DiCaprio played the character Toby Wolff in This Boy's Life.",
 "Leonardo DiCaprio's performance as Toby Wolff earned Leonardo DiCaprio "
 'significant critical acclaim.',
 "Leonardo DiCaprio was a child television actor before starring in This Boy's "
 'Life.',
 "Leonardo DiCaprio's performance in This Boy's Life transitioned Leonardo "
 'DiCaprio from a child television actor to a serious dramatic lead.']
```

Now let’s calculate our claim metrics. We will look at both claim overlap and claim similarity:
```python
def calculate_claim_metrics(target_claims, destination_claims, threshold=0.85):
    if not target_claims or not destination_claims:
        return 0.0, 0.0
    match_count = 0
    total_similarity = 0
    for target in target_claims.claims:
        best_score = -1.0
        best_match_text = "N/A"
        for dest in destination_claims.claims:
            score = cosine_similarity(target.embedding, dest.embedding)
            if score > best_score:
                best_score = score
                best_match_text = dest.claim
        total_similarity += best_score
        status_icon = "✅" if best_score >= threshold else "❌"
        if best_score >= threshold:
            match_count += 1
        t_text = (target.claim[:42] + "..") if len(target.claim) > 45 else target.claim
        d_text = (best_match_text[:42] + "..") if len(best_match_text) > 45 else best_match_text
        # Log each target/best-match pairing.
        print(f"{t_text} | {d_text} | {best_score:.4f} {status_icon}")
    overlap_score = match_count / len(target_claims.claims)
    avg_sim = total_similarity / len(target_claims.claims)
    return overlap_score, avg_sim
```
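The `cosine_similarity` and `get_embeddings` helpers used above aren’t shown. Here is a rough sketch of what they could look like (the Hugging Face encoder and mean pooling are my assumptions, not necessarily what the repo does):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def get_embeddings(texts, model=None, tokenizer=None):
    """Sketch: encode texts with a Hugging Face encoder via masked mean pooling."""
    import torch  # imported lazily so the cosine helper works without torch
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state       # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # masked mean pooling
    return pooled.numpy().tolist()

print(cosine_similarity([1, 0], [1, 1]))  # → ~0.7071
```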
And here are the results:
--- Consistent Claim ---
[TARGET CLAIM] | [BEST DESTINATION MATCH] | SCORE
--------------------------------------------------------------------------------------------------------------
Leonardo DiCaprio’s first major theatrical role was in 1993. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.9786 ✅
Leonardo DiCaprio was 17 years old in 1993. | Leonardo DiCaprio was 17 years old in 1993. | 1.0000 ✅
Leonardo DiCaprio starred in the movie “This Boy’s Life”. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.9645 ✅
The movie “This Boy’s Life” featured Leonardo DiCaprio’s first major theatrical role. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.9597 ✅
Leonardo DiCaprio’s first major theatrical role occurred in 1993 in the movie “This Boy’s Life”. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.9383 ✅
--------------------------------------------------------------------------------------------------------------
Claim Overlap: 100.00%
Claim Similarity: 0.9682
------------------------------
--- Unrelated Claim ---
[TARGET CLAIM] | [BEST DESTINATION MATCH] | SCORE
--------------------------------------------------------------------------------------------------------------
Natalie Portman was 13 years old when Natalie Portman made Natalie Portman’s professional cinematic debut. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.6613 ❌
Natalie Portman’s professional cinematic debut was in the film Léon: The Professional. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.6456 ❌
Léon: The Professional is an action-drama film. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.5833 ❌
Léon: The Professional was released in 1994. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.6396 ❌
Natalie Portman made Natalie Portman’s professional cinematic debut in 1994. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.6727 ❌
--------------------------------------------------------------------------------------------------------------
Claim Overlap: 0.00%
Claim Similarity: 0.6405
------------------------------
--- Contradicting Claims ---
[TARGET CLAIM] | [BEST DESTINATION MATCH] | SCORE
--------------------------------------------------------------------------------------------------------------
Leonardo DiCaprio’s first major theatrical role occurred in 1991. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.9042 ✅
Leonardo DiCaprio was 16 years old in 1991. | Leonardo DiCaprio was 17 years old in 1993. | 0.8415 ❌
Leonardo DiCaprio made his film debut in the film This Boy’s Life. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.9322 ✅
Leonardo DiCaprio was 16 years old during Leonardo DiCaprio’s debut in the film This Boy’s Life. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.8948 ✅
The film This Boy’s Life features Leonardo DiCaprio’s first major theatrical role. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.9621 ✅
Leonardo DiCaprio’s debut in the film This Boy’s Life occurred in 1991. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.8925 ✅
--------------------------------------------------------------------------------------------------------------
Claim Overlap: 83.33%
Claim Similarity: 0.9046
------------------------------

The biggest note here is in the contradicting claims, as the first claim comparison incorrectly matches the two. The claims have nearly identical text, yet make two factually different statements (1993 vs. 1991). I think this is emblematic of the issue LLMs have with hallucinations. They are trained on textual similarity, not on the underlying facts that statements are made of.
However, we can see that our metrics are valuable. We turn our text into individual claims, and then compare the generated text to the source text by their claims. Using our metrics we can easily see how similar the generated text is to our ground truth text. Further, we can identify which claims aren’t present, giving us a way of improving a generated text to be more consistent with the ground truth.
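One cheap guard against the 1993-vs-1991 failure mode is to layer a literal check on top of the embedding match: once two claims pair up above the threshold, compare the numbers they contain. This is a hedged sketch of my own, not part of the implementation above:

```python
import re

def numbers_in(claim: str) -> set:
    """Extract literal numbers (years, ages, counts) from a claim."""
    return set(re.findall(r"\d+(?:\.\d+)?", claim))

def numeric_conflict(claim_a: str, claim_b: str) -> bool:
    """Flag matched claims whose numbers disagree, e.g. 1993 vs. 1991."""
    na, nb = numbers_in(claim_a), numbers_in(claim_b)
    return bool(na) and bool(nb) and na != nb

print(numeric_conflict(
    "Leonardo DiCaprio's first major theatrical role occurred in 1991.",
    "Leonardo DiCaprio made his major theatrical debut in 1993.",
))  # → True, even though the embedding match scored 0.9042
```

A real version would want date/number normalization ("seventeen" vs. "17"), but even this crude check catches the contradiction the embeddings miss.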
Let’s try another round of ideas.
Original Text With Synonyms
Leonardo DiCaprio’s initial significant cinematic part was in 1993 when he was 17 years of age, where he featured in the motion picture This Boy’s Life.
Original Text Refocused
This Boy’s Life served as the 1993 breakout for 17-year-old Leonardo DiCaprio, marking his first major role in a theatrical production.
And here are the results:
--- Synonyms ---
[TARGET CLAIM] | [BEST DESTINATION MATCH] | SCORE
--------------------------------------------------------------------------------------------------------------
Leonardo DiCaprio’s initial significant cinematic part was in 1993. | Leonardo DiCaprio made his major theatrical debut in 1993. | 0.9091 ✅
Leonardo DiCaprio was 17 years of age in 1993. | Leonardo DiCaprio was 17 years old in 1993. | 0.9970 ✅
Leonardo DiCaprio featured in the motion picture This Boy’s Life. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.9299 ✅
Leonardo DiCaprio’s role in This Boy’s Life was Leonardo DiCaprio’s first significant cinematic role. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.8840 ✅
Leonardo DiCaprio was 17 years of age when Leonardo DiCaprio appeared in This Boy’s Life. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.8927 ✅
The motion picture This Boy’s Life was released in 1993. | Robert De Niro co-starred in the film This Boy’s Life. | 0.7790 ❌
--------------------------------------------------------------------------------------------------------------
Claim Overlap: 83.33%
Claim Similarity: 0.8986
------------------------------
--- Different Focus ---
[TARGET CLAIM] | [BEST DESTINATION MATCH] | SCORE
--------------------------------------------------------------------------------------------------------------
This Boy’s Life was released in 1993. | Robert De Niro co-starred in the film This Boy’s Life. | 0.7773 ❌
Leonardo DiCaprio starred in This Boy’s Life. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.9723 ✅
Leonardo DiCaprio was 17 years old in 1993. | Leonardo DiCaprio was 17 years old in 1993. | 1.0000 ✅
This Boy’s Life served as the breakout film for Leonardo DiCaprio. | Leonardo DiCaprio had a starring role in This Boy’s Life. | 0.8815 ✅
This Boy’s Life was Leonardo DiCaprio’s first major role in a theatrical production. | Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life. | 0.9427 ✅
This Boy’s Life is a theatrical production. | Robert De Niro co-starred in the film This Boy’s Life. | 0.7654 ❌
--------------------------------------------------------------------------------------------------------------
Claim Overlap: 66.67%
Claim Similarity: 0.8899
------------------------------

We see a few things. Firstly, the ground truth claims don’t fully cover all subjects contained within the source text. This hampers our ability to answer questions about subjects that are present but not primary. While this is trivial to see here, as texts expand to include more subjects, the number of claims that must be available for comparing generated material will grow greatly.
Edit 12/29: I’ve improved getting all subjects via better entity extraction prompting.
Secondly, we see the issue of Textual Entailment. The claim “This Boy’s Life was released in 1993” can be inferred from the claims “Leonardo DiCaprio made his major theatrical debut in 1993” and “Leonardo DiCaprio’s major theatrical debut was in the film This Boy’s Life”. We currently miss this because we only compare individual claims, and even combining the source claims doesn’t fix it: the cosine similarity between the generated claim and the combined source claims still doesn’t meet our threshold, even though the claim is positively entailed:
```python
atomic_fact = "This Boy's Life was released in 1993"
entailment = "Leonardo DiCaprio made his major theatrical debut in 1993. Leonardo DiCaprio's major theatrical debut was in the film This Boy's Life."
entailment_embedding_example = get_embeddings([atomic_fact, entailment])
sim_entailment = cosine_similarity(entailment_embedding_example[0], entailment_embedding_example[1])
print(f"Cosine Similarity (Entailment): {sim_entailment:.4f}")
```

```
Cosine Similarity (Entailment): 0.7690
```

This continues to show that embeddings aren’t made for factual or logical correctness, but rather textual similarity. However, our metrics still serve as handy measures.
Conclusion
In this blog post I introduced the idea of claim edit distance. Claim edit distance is hopefully a good step in the right direction for better hallucination detection. It’s clear the models aren’t very good at comparing information on the basis of discrete facts. So, there is still a long way to go.
1. Google’s FACTS benchmark, Ragas, and ARAGOG all come to mind.
