Agentic AI 102: Guardrails and Agent Evaluation

In the first article of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we discussed the basics of creating AI agents and introduced concepts like reasoning, memory, and tools.
Of course, that first article only scratched the surface of this new area of the data industry. There is much more to do, and we will learn more throughout this series.
So, it’s time to take a step forward.
In this article, we will introduce three topics:
- Guardrails: safety blocks that prevent large language models (LLMs) from responding about certain topics.
- Agent evaluation: Have you ever wondered about the accuracy of an LLM's responses? I bet you have. We will see the main ways to measure it.
- Monitoring: We will also learn about the built-in monitoring app in the Agno framework.
We will start now.
Guardrails
In my opinion, our first topic is the simplest one. Guardrails are rules that constrain an AI agent to respond only about a given topic or list of topics.
There is a good chance you have asked ChatGPT or Gemini about something and received a reply like "I can't talk about this topic" or "Please consult a professional". Often, this happens with sensitive topics such as health advice, psychological conditions, or financial guidance.
Those blocks are protective measures to prevent people from harming themselves, their health, or their wallets. As we know, LLMs are trained on huge amounts of text and therefore inherit a lot of bad content, which could easily lead to bad advice in these areas. And I haven't even mentioned hallucinations!
Think about how many stories there are of people who lost money by following investment tips from online forums, or how many people took the wrong medication because of something they read on the internet.
OK, I think you get the point. We must prevent our agents from talking about certain topics or taking certain actions. To do that, we will use guardrails.
The best framework I found to impose those blocks is Guardrails AI [1]. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.
To get started quickly, first go to this link [2] and get an API key. Then, install the package. Next, run the guardrails configure command. It will ask a couple of questions that you can answer n (for no), and it will ask you to enter the API key you just generated.
pip install guardrails-ai
guardrails configure
Once that's finished, go to the Guardrails AI Hub [3] and choose the guardrail you need. Each one has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.
In this example, we chose Restrict to Topic [4]. As the name implies, it lets the user talk only about the topics in the list. So, go back to the terminal and install it using the following command.
guardrails hub install hub://tryolabs/restricttotopic
Next, let’s open the Python script and import some modules.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic
Next, we create the guard. We will restrict our agent to talk only about sports or weather, and we will block the topic stocks.
# Setup Guard
guard = Guard().use(
    RestrictToTopic(
        valid_topics=["sports", "weather"],
        invalid_topics=["stocks"],
        disable_classifier=True,   # skip the zero-shot classifier check
        disable_llm=False,         # use an LLM to detect the topic instead
        on_fail="filter"           # filter out responses that fail validation
    )
)
Now we can run the agent and the guard together.
# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct. Reply in maximum two sentences."],
    markdown=True
)

# Run the agent
response = agent.run("What's the ticker symbol for Apple?").content

# Run the guard over the agent's response
validation_step = guard.validate(response)

# Print the response only if it passes validation
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.validation_summaries[0].failure_reason)
This is the response when we ask about stock symbols.
Validation Failed Invalid topics found: ['stocks']
If I ask about a topic that is not in the valid_topics list, I will also see a block.
"What's the number one soda drink?"
Validation Failed No valid topic was found.
Finally, let's ask about sports.
"Who is Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of
the greatest of all time. He won six NBA championships with the Chicago Bulls.
This time we saw a response because it was a valid topic.
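In practice, you will probably want to wrap this ask-then-validate pattern in a single function, so every answer goes through the guard before reaching the user. Here is a minimal sketch of that idea (the helper name guarded_run is mine, not part of either library):
# Hypothetical helper: run the agent, validate the answer, and
# return either the response or a generic refusal.
def guarded_run(agent, guard, prompt: str) -> str:
    response = agent.run(prompt).content
    validation_step = guard.validate(response)
    if validation_step.validation_passed:
        return response
    return "Sorry, I can't talk about this topic."

# Usage
print(guarded_run(agent, guard, "Who won the last NBA Finals?"))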
Now, let's move on to agent evaluation.
Agent evaluation
Since I started working with LLMs and agentic AI, one of my main questions has been about model evaluation. Unlike traditional data science modeling, where there are structured metrics for each situation, evaluating AI agents is much fuzzier.
Fortunately, the developer community is quick to find solutions, and they created a nice package for LLM evaluation: deepeval.
DeepEval [5] is a library created by Confident AI that gathers many methods to evaluate LLMs and AI agents. In this section, let's learn a couple of the main methods, just so we can build some intuition about the topic, because the library is quite extensive.
The first assessment is the most basic one we can use, and it is called G-Eval. As AI tools such as ChatGPT become more common in everyday tasks, we must make sure they provide helpful and accurate responses. That is where G-Eval, from the DeepEval Python package, comes in.
G-Eval works like a smart reviewer: it uses another AI model to evaluate the performance of a chatbot or AI assistant. For example, my agent runs on Gemini, and I am using an OpenAI model to assess it. This approach goes beyond simple human review by asking an AI to grade aspects such as relevance, correctness, and clarity.
This is a great way to test and improve generative AI systems in a more scalable manner. Let's quickly code an example: we will import the modules, create a prompt, build a simple chat agent, and ask it to describe the weather in NYC for May.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os

# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Prompt
prompt = "Describe the weather in NYC for May"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct"],
    markdown=True,
    monitoring=True
)

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)
It answered: "Mild, with average highs around 60°F and lows in the 50s°F. Expect some rain".
OK. That looks good to me. But how do we put a number on it and show a potential manager or client how well our agent is doing?
Here is how we do it:
- Create a test case, passing the prompt and the response to the LLMTestCase class.
- Create a metric with the GEval class, adding a prompt to test the model for Coherence, and then giving it the meaning of what coherence is to us.
- Pass the output as evaluation_params.
- Run the measure method and get the score and reason from it.
# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)

# Setup the Metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
The output looks like this.
0.9
The response directly addresses the prompt about NYC weather in May,
maintains logical consistency, flows naturally, and uses clear language.
However, it could be slightly more detailed.
Given that the default threshold is 0.5, 0.9 seems to be fine.
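By the way, that threshold is configurable. If you want the metric to fail below a stricter cutoff, you can pass threshold when creating GEval and check is_successful() instead of eyeballing the score. A quick sketch, reusing the test case from above:
# Same metric, stricter pass criterion
strict_coherence = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8  # fail anything scoring below 0.8
)
strict_coherence.measure(test_case)
print(strict_coherence.is_successful())  # True only if score >= 0.8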
If you want to check the logs, use this next snippet.
# Check the logs
print(coherence_metric.verbose_logs)
This is the response.
Criteria:
Coherence. The agent can answer the prompt and the response makes sense.
Evaluation Steps:
[
"Assess whether the response directly addresses the prompt; if it aligns,
it scores higher on coherence.",
"Evaluate the logical flow of the response; responses that present ideas
in a clear, organized manner rank better in coherence.",
"Consider the relevance of examples or evidence provided; responses that
include pertinent information enhance their coherence.",
"Check for clarity and consistency in terminology; responses that maintain
clear language without contradictions achieve a higher coherence rating."
]
Very nice. Now let's look at another interesting use case: measuring AI agent task completion. That is, how well our agent performs when given a task, and how much of it the agent can actually deliver.
First, we will create a simple agent that can search Wikipedia and summarize the topic of the query.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate

# Prompt
prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-2.0-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="You are a researcher specialized in searching the wikipedia.",
    tools=[WikipediaTools()],
    show_tool_calls=True,
    markdown=True,
    read_tool_call_history=True
)

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)
The result looks very good. Now let's evaluate it with the TaskCompletionMetric class.
# Create a Metric
metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Test Case
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)

# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])
Here is the output, including the agent's response.
======================================================================
Metrics Summary
- ✅ Task Completion (score: 1.0, threshold: 0.7, strict: False,
evaluation model: gpt-4o-mini,
reason: The system successfully searched for 'Time series analysis'
on Wikipedia and provided a clear summary of the 3 main points,
fully aligning with the user's goal., error: None)
For test case:
- input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
- actual output: Here are the 3 main points about Time series analysis based on the
Wikipedia search:
1. **Definition:** A time series is a sequence of data points indexed in time order,
often taken at successive, equally spaced points in time.
2. **Applications:** Time series analysis is used in various fields like statistics,
signal processing, econometrics, weather forecasting, and more, wherever temporal
measurements are involved.
3. **Purpose:** Time series analysis involves methods for extracting meaningful
statistics and characteristics from time series data, and time series forecasting
uses models to predict future values based on past observations.
- expected output: None
- context: None
- retrieval context: None
======================================================================
Overall Metric Pass Rates
Task Completion: 100.00% pass rate
======================================================================
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
Our agent passed the test with honors: 100%!
You can learn more about the DeepEval library at this link [8].
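One last tip before we move on: you are not limited to one metric per run. Assuming the coherence_metric from the G-Eval example and the Wikipedia test case above are in scope, a single evaluate call can score both at once:
# Run several metrics over the same test case in one batch
evaluate(
    test_cases=[test_case],
    metrics=[coherence_metric, metric]
)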
Finally, in the next section, we will learn about Agno's agent monitoring functionality.
Agent monitoring
As I told you in the previous post [9], I chose Agno to learn more about agentic AI. To be clear, this is not a sponsored post; I simply think Agno is the best option for those starting to learn about this topic.
One of the nice things we can take advantage of in Agno's framework is its app for model monitoring.
Take as an example this agent that searches the internet and writes Instagram posts.
# Imports
import os
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from agno.tools.googlesearch import GoogleSearchTools

# Topic
topic = "Healthy Eating"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description=f"""You are a social media marketer specialized in creating engaging content.
    Search the internet for 'trending topics about {topic}' and use them to create a post.""",
    tools=[FileTools(save_files=True),
           GoogleSearchTools()],
    expected_output="""A short post for instagram and a prompt for a picture related to the content of the post.
    Don't use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
    Use the template:
    - Post
    - Prompt for the picture
    Save the post to a file named 'post.txt'.""",
    show_tool_calls=True,
    monitoring=True)

# Writing and saving a file
agent.print_response(f"""Write a short post for instagram with tips and tricks that positions me as
an authority in {topic}.""",
                     markdown=True)
To monitor its performance, follow these steps:
- Go to Agno's website and get an API key.
- Open a terminal and type ag setup.
- If it is your first time, it may ask for the API key. Copy and paste it into the terminal prompt.
- You will see the dashboard open in a tab of your browser.
- To monitor an agent, add the parameter monitoring=True to it (a possible global alternative is sketched right after this list).
- Run your agent.
- Go to the dashboard in your web browser.
- Click on Sessions. As it is a single agent, you will see it under the Agents tab at the top of the page.
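As a side note, if I remember Agno's docs correctly, you can also enable monitoring globally via the AGNO_MONITOR environment variable instead of passing monitoring=True to each agent; treat the variable name as an assumption and double-check the docs:
import os
# Assumed from Agno's docs: enables monitoring for all agents in this process
os.environ["AGNO_MONITOR"] = "true"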
The cool features we can see there:
- Info about the model
- The response
- Tools used
- Tokens consumed

Very good, right?
It is useful, for example, to know where our agents are spending more or fewer tokens, and where they are taking more time to perform their tasks.
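If you want a quick look at those numbers locally, without opening the dashboard, the run response object also carries usage data; as far as I can tell it is exposed as a metrics attribute, but verify this against your Agno version:
# Inspect token usage locally (assumes RunResponse exposes `metrics`)
run = agent.run("Give me one quick tip about healthy eating.")
print(run.metrics)  # e.g., input and output token counts per model call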
OK, time to wrap up.
Before you go
We learned a lot in this second article of the series. Here, we covered:
- Guardrails are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
- Model evaluation: DeepEval metrics such as GEval, for broad criteria-based assessments, and TaskCompletion, for agent quality, are crucial for understanding AI capabilities and limitations.
- Model monitoring with Agno's app, including tracking token usage and response time, is critical for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.
Contact and follow me
If you like this content, please find more of my work on my website.
GitHub repository
References
[1. Guardrails AI]
[2. Guardrails AI Auth Key]
[3. Guardrails AI Hub]
[4. Guardrails AI Restrict to Topic]
[5. DeepEval]
[6. DataCamp – DeepEval Tutorial]
[7. DeepEval TaskCompletion]
[8. LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide]
[9. Agentic AI 101: Starting Your Journey Building AI Agents]