JSON Enforcement
If you've ever used custom LLMs to evaluate your test cases in deepeval, you may have encountered the following error:
_ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model_
deepeval metrics allow you to use any custom LLM for evaluation, from LangChain modules to Hugging Face's Transformers models. Most of these metrics utilize custom LLMs to generate reasons, verdicts, statements, and other types of LLM-generated responses, which serve as criteria for calculating the final metric score for each test case.
However, each custom LLM is prompted to return a JSON object, and these objects are often malformed, causing deepeval to raise the above error! These deformities come in a variety of forms, from missing brackets, to incomplete strings, to mismatched keys.
For example:
{
"reaso: "The actual output does directly not address the input"
# Issues:
# 1. Missing closing bracket at the end.
# 2. The key "reaso" is misspelled and the string is not closed properly.
When using SOTA models like GPT-4 or GPT-4o, this is almost never a problem. However, for smaller and less powerful LLMs, prompt engineering alone is not sufficient to enforce JSON outputs. As a result, it's vital to find a workaround, since this error stops the entire evaluation process.
This guide will demonstrate various methods to confine your LLM's output by leveraging Pydantic models for validation. These models ensure that the outputted JSON objects adhere to predefined schemas, which helps prevent issues like missing brackets or incomplete strings.
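As a quick illustration of why a schema helps, here is a minimal sketch (the Reason model below is hypothetical, not deepeval's internal schema) showing how Pydantic surfaces the malformed output above as an explicit error instead of letting it propagate downstream:
import json
from pydantic import BaseModel, ValidationError

# Hypothetical schema used only for illustration
class Reason(BaseModel):
    reason: str

raw = '{"reaso: "The actual output does not directly address the input"'
try:
    Reason(**json.loads(raw))
except (json.JSONDecodeError, ValidationError) as error:
    print(f"Caught malformed LLM output: {error}")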
JSON Enforcement Libraries
The lm-format-enforcer Library
The LM-Format-Enforcer is a versatile library designed to standardize the output formats of language models. It supports Python-based language models across various platforms, including popular frameworks such as Transformers, LangChain, LlamaIndex, llama.cpp, vLLM, Haystack, NVIDIA TensorRT-LLM, and ExLlamaV2. For comprehensive details about the package and advanced usage instructions, please visit the LM-format-enforcer GitHub page.
The LM-Format-Enforcer combines a character-level parser with a tokenizer prefix tree. Unlike libraries that rigidly constrain generation, it only filters out tokens that would violate the output format at each step, giving the LLM freedom over the content while still guaranteeing the format, thereby preserving the quality of the output.
The instructor Library
Instructor is a user-friendly Python library built on top of Pydantic. It enables straightforward confinement of your LLM's output by wrapping your LLM client with an Instructor method. It simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4 Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with models not covered here, please consult the documentation.
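For example, the core pattern looks roughly like this, shown here with OpenAI's client (the model name and the UserInfo schema are illustrative only):
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):  # illustrative schema
    name: str
    age: int

# Wrapping the client lets you pass a response_model to confine the output
client = instructor.from_openai(OpenAI())
user = client.chat.completions.create(
    model="gpt-4o",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user.name, user.age)  # a validated UserInfo instance, not raw JSON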
Tutorials
To enforce JSON output in a custom DeepEvalBaseLLM, your custom LLM's generate and a_generate methods must accept a schema argument of type Pydantic BaseModel.
class Mistral7B(DeepEvalBaseLLM):
    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
Your generate and a_generate functions must always output an object of type BaseModel.
LLM JSON confinement is possible across a range of LLM models, including:
- Hugging Face models (Mistral-7b-v0.2, Llama-3-70b-Instruct, etc)
- LangChain, LlamaIndex, Haystack models
- llama.cpp models
- OpenAI models (GPT-4o, GPT-3.5, etc)
- Anthropic models (Claude-3 Opus, etc)
- Gemini models
In the following set of tutorials, we'll go through setting up Pydantic enforcement using the libraries from the above section for:
- Mistral-7B v0.3 (through HF)
- Llama-3 8B Instruct (through HF)
- Gemini 1.5 Flash (through Google AI)
- Claude 3 Opus (through Anthropic)
1.) Mistral-7B v0.3
1. Install lm-format-enforcer
Begin by installing the lm-format-enforcer package via pip:
pip install lm-format-enforcer
2. Create your custom LLM
Create your custom Mistral-7B v0.3 LLM class using the DeepEvalBaseLLM base class. Define the schema parameter in your generate and a_generate method signatures.
from pydantic import BaseModel
from deepeval.models import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Mistral-7B v0.3"

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
3. Write the generate method
Write the generate method for your custom LLM by utilizing JsonSchemaParser and build_transformers_prefix_allowed_tokens_fn from the lmformatenforcer library to ensure that the model's outputs strictly adhere to the defined JSON schema.
The lmformatenforcer ensures the language model outputs a JSON string that follows the Pydantic schema, not an actual Pydantic object. Therefore, we must convert this output back into an object of type schema for evaluation.
import json

from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
from transformers import pipeline

class Mistral7B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        hf_pipeline = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Only allow tokens that keep the output valid against the JSON schema
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(hf_pipeline.tokenizer, parser)
        output_dict = hf_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]['generated_text'][len(prompt):]
        # The pipeline returns a JSON string, so parse it and build the schema object
        json_result = json.loads(output)
        return schema(**json_result)
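The Hugging Face pipeline above is synchronous, so a minimal sketch of a_generate can simply delegate to generate (assuming you don't need truly asynchronous generation):
class Mistral7B(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # The HF pipeline call blocks, so reuse the synchronous generate method
        return self.generate(prompt, schema)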
4. Instantiating your model
Load your models from Hugging Face's transformers library. Optionally, you can pass in a quantization_config parameter if your compute resources are limited.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

mistral_custom = Mistral7B(model_4bit, tokenizer)
5. Running evaluations
Finally, evaluate your test cases using your desired metric on the custom Mistral-7B v0.3 model. You'll find that some of your LLM test cases, which previously failed to evaluate due to an invalid JSON error, will now run successfully after you have defined the schema parameter.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(...)
metric = AnswerRelevancyMetric(threshold=0.5, model=mistral_custom, verbose_mode=True)
metric.measure(test_case)
print(metric.reason)
2.) Llama-3 8B
1. Install lm-format-enforcer
Begin by installing the lm-format-enforcer package via pip:
pip install lm-format-enforcer
2. Create your custom LLM
Create your custom Llama-3 8B LLM class using the DeepEvalBaseLLM base class. Define the schema parameter in your generate and a_generate method signatures.
from pydantic import BaseModel
from deepeval.models import DeepEvalBaseLLM

class Llama8B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Llama-3 8B"

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
3. Write the generate method
Write the generate method for your custom LLM by utilizing JsonSchemaParser and build_transformers_prefix_allowed_tokens_fn from the lmformatenforcer library to ensure that the model's outputs strictly adhere to the defined JSON schema.
The lmformatenforcer ensures the language model outputs a JSON string that follows the Pydantic schema, not an actual Pydantic object. Therefore, we must convert this output back into an object of type schema for evaluation.
import json

from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
from transformers import pipeline

class Llama8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        hf_pipeline = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Only allow tokens that keep the output valid against the JSON schema
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(hf_pipeline.tokenizer, parser)
        output_dict = hf_pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]['generated_text'][len(prompt):]
        # The pipeline returns a JSON string, so parse it and build the schema object
        json_result = json.loads(output)
        return schema(**json_result)
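As with the Mistral example, the pipeline is synchronous, so a minimal a_generate sketch can simply delegate to generate:
class Llama8B(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Reuse the synchronous generate method
        return self.generate(prompt, schema)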
4. Instantiating your model
Load your models from Hugging Face's transformers library. Optionally, you can pass in a quantization_config parameter if your compute resources are limited.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

llama_custom = Llama8B(model_4bit, tokenizer)
5. Running evaluations
Finally, evaluate your test cases using your desired metric on the custom Llama-3 8B model. You'll find that some of your LLM test cases, which previously failed to evaluate due to an invalid JSON error, will now run successfully after you have defined the schema parameter.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(...)
metric = AnswerRelevancyMetric(threshold=0.5, model=llama_custom, verbose_mode=True)
metric.measure(test_case)
print(metric.reason)
3.) Gemini 1.5 Flash
1. Install instructor
Begin by installing the instructor package via pip:
pip install -U instructor
2. Create your custom LLM
Create your custom Gemini 1.5 Flash LLM class using the DeepEvalBaseLLM base class. Define the schema parameter in your generate and a_generate method signatures.
from pydantic import BaseModel
import google.generativeai as genai
from deepeval.models import DeepEvalBaseLLM

class GeminiFlash(DeepEvalBaseLLM):
    def __init__(self):
        self.model = genai.GenerativeModel(
            model_name="models/gemini-1.5-flash-latest"
        )

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Gemini 1.5 Flash"

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
3. Write the generate method
The instructor client allows you to create a structured response by defining a response_model parameter, which accepts a schema that inherits from BaseModel. To write your a_generate function, wrap the Instructor client around your async Gemini client. Alternatively, you can call the generate function directly if you prefer not to write an async method.
import instructor

class GeminiFlash(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_gemini(
            client=client,
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = instructor_client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
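If you prefer not to wrap an async Gemini client, a minimal a_generate sketch can simply call the synchronous generate method, as noted above:
class GeminiFlash(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Minimal option: reuse the synchronous generate method
        return self.generate(prompt, schema)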
4. Running evaluations
Finally, evaluate your test cases using your desired metric on the custom Gemini 1.5 Flash model. You'll find that some of your LLM test cases, which previously failed to evaluate due to an invalid JSON error, will now run successfully after you have defined the schema parameter.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(...)
gemini_custom = GeminiFlash()
metric = AnswerRelevancyMetric(threshold=0.5, model=gemini_custom, verbose_mode=True)
metric.measure(test_case)
print(metric.reason)
4.) Claude 3 Opus
1. Install instructor
Begin by installing the instructor package via pip:
pip install -U instructor
2. Create your custom LLM
Create your custom Claude 3 Opus LLM class using the DeepEvalBaseLLM base class. Define the schema parameter in your generate and a_generate method signatures.
from pydantic import BaseModel
from anthropic import Anthropic
from deepeval.models import DeepEvalBaseLLM

class ClaudeOpus(DeepEvalBaseLLM):
    def __init__(self):
        self.model = Anthropic()

    def load_model(self):
        return self.model

    def get_model_name(self):
        return "Claude 3 Opus LLM"

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
3. Write the generate method
The instructor client allows you to create a structured response by defining a response_model parameter, which accepts a schema that inherits from BaseModel. To write your a_generate function, wrap the Instructor client around your async Anthropic client. Alternatively, you can call the generate function directly if you prefer not to write an async method.
import instructor

class ClaudeOpus(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_anthropic(client)
        # response_model confines the output to the provided Pydantic schema
        resp = instructor_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        ...
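For a_generate, one option is to wrap Instructor around Anthropic's async client, as sketched below (this assumes anthropic's AsyncAnthropic client; alternatively, simply delegate to generate as noted above):
import instructor
from anthropic import AsyncAnthropic
from pydantic import BaseModel

class ClaudeOpus(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Wrap the async Anthropic client so response_model confines the output
        async_client = instructor.from_anthropic(AsyncAnthropic())
        resp = await async_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
            response_model=schema,
        )
        return resp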
4. Running evaluations
Finally, evaluate your test cases using your desired metric on the custom Claude 3 Opus model. You'll find that some of your LLM test cases, which previously failed to evaluate due to an invalid JSON error, will now run successfully after you have defined the schema parameter.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(...)
claude_custom = ClaudeOpus()
metric = AnswerRelevancyMetric(threshold=0.5, model=claude_custom, verbose_mode=True)
metric.measure(test_case)
print(metric.reason)
How All of This Fits into Improving Evaluations
Deepeval metrics will automatically check for the schema parameter in custom LLMs. When provided, they utilize the associated pydantic model for the task.
Ensure that the schema field is always of type BaseModel and that the generate functions return an object of type BaseModel.
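Before running a full evaluation, you can sanity-check your custom LLM against this contract with a quick sketch like the one below (the Statements schema here is hypothetical and used only for illustration; mistral_custom is the model instantiated earlier):
from typing import List
from pydantic import BaseModel

class Statements(BaseModel):  # hypothetical schema for illustration
    statements: List[str]

result = mistral_custom.generate(
    prompt="List the statements made in: 'The sky is blue and grass is green.' Return JSON.",
    schema=Statements,
)
assert isinstance(result, Statements)  # generate must return a schema instance
print(result.statements)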