
Create an AI agent to explore your data catalog in natural language

Every data-driven application, product, or dashboard has a key component: a database. These systems have long been the backbone for storing, managing, and querying structured data, whether relational, time-series, or distributed across cloud platforms.

To interact with these systems, we rely on SQL (Structured Query Language), a standardized and incredibly powerful way to retrieve, manipulate, and analyze data. SQL is expressive, precise, and optimized for performance. However, for many users, especially data novices, SQL can be daunting. Remembering syntax, understanding joins, and navigating complex schemas can be a barrier to productivity.

However, the idea of querying databases in natural language is nothing new! In fact, research on natural language interfaces to databases (NLIDB) dates back to the 1970s. Systems like LUNAR and PRECISE explored how users could ask questions in plain English and receive structured answers powered by SQL. Despite great academic interest, these early systems struggled with generalization, ambiguity, and scalability. More recently, Power BI gave us early glimpses of natural language data querying back in 2019. Although its Q&A feature was promising, it struggled with complex queries, required precise wording, and relied heavily on how clean the data model was. Finally, it lacked the reasoning and flexibility users expect from a real assistant!

But what about 2025? Do we now have the technology to pull it off?

Can LLMs do things we couldn't do before?

Based on what we know about LLMs and their capabilities, we also understand that they, together with the concept of AI agents, are uniquely positioned to bridge the gap between technical SQL and natural human questions. They are good at interpreting vague questions, generating syntactically correct SQL, and adapting to different user intents. This makes them ideal for conversational data interfaces. However, LLMs are not deterministic. They rely heavily on probabilistic inference, which can lead to hallucinations, false assumptions, or invalid queries.

This is where AI agents come in. By wrapping an LLM in a structured system (with memory, tools, validation layers, and defined purposes), we can reduce the drawbacks of probabilistic output. The agent becomes more than a text generator: it becomes a collaborator that understands the environment it runs in. By combining strategies like proper grounding, schema checking, and user-intent detection, agents let us build systems that are far more reliable than a quick prompt setup.

That is the basis of this short tutorial: how to build your first AI agent assistant to query your data catalog!

A step-by-step guide to creating a Databricks catalog assistant

First, we need to choose our tech stack. We need a model provider, a tool to help us enforce structure in the agent flow, a database connector, and a simple UI to power the chat experience!

  • OpenAI (GPT-4o): Best-in-class natural language understanding, reasoning, and SQL generation.
  • Pydantic AI: Adds structure to LLM responses. No hallucinated or fuzzy answers, just clean, schema-validated output.
  • Streamlit: Quickly build a responsive chat interface with built-in LLM and feedback components.
  • Databricks SQL Connector: Access your Databricks workspace's catalogs, schemas, and query results.

Well, don’t forget: this is just a small and simple project. If you plan to deploy it in production, across multiple users and multiple databases, you will definitely need to consider other concerns: scalability, access control, identity management, use-case design, user experience, data privacy… the list goes on.

1. Environment setup

Before we dive into coding, let’s prepare our development environment. This step ensures that all required packages are installed and isolated in a clean virtual environment. This avoids version conflicts and keeps our project organized.

conda create -n sql-agent python=3.12
conda activate sql-agent

pip install pydantic-ai openai streamlit databricks-sql-connector

2. Create tools and logic to access Databricks data catalog information

While building a conversational SQL agent might seem like an LLM problem, it is a data problem first. You need metadata, column-level context, constraints, and ideally a profiling layer to understand which queries are safe and how to interpret the results. This is part of what we call the data-centric AI stack (it sounds very 2021, but I promise it is still super relevant!), which puts profiling, quality, and schema validation before prompt engineering.

In this case, since the agent needs context to reason about your data, this step involves setting up a connection to the Databricks workspace and programmatically extracting the structure of the data catalog. This metadata will serve as the basis for generating accurate SQL queries.

from databricks import sql


def set_connection(server_hostname: str, http_path: str, access_token: str):
    # Open a connection to the Databricks SQL warehouse
    connection = sql.connect(
        server_hostname=server_hostname,
        http_path=http_path,
        access_token=access_token
    )
    return connection

The complete code for the metadata connector can be found here.
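As a sketch of what that metadata extraction could look like, the hypothetical helpers below query the catalog's information_schema through the connection created above and format the result for use in a prompt (fetch_catalog_metadata and format_metadata_context are illustrative names, not part of the linked code):

```python
def fetch_catalog_metadata(connection, catalog: str) -> list[dict]:
    """Query the catalog's information_schema for table and column
    metadata, which becomes the agent's grounding context."""
    query = (
        f"SELECT table_schema, table_name, column_name, data_type "
        f"FROM {catalog}.information_schema.columns "
        f"ORDER BY table_schema, table_name, ordinal_position"
    )
    with connection.cursor() as cursor:
        cursor.execute(query)
        rows = cursor.fetchall()
    return [
        {"schema": row.table_schema, "table": row.table_name,
         "column": row.column_name, "type": row.data_type}
        for row in rows
    ]


def format_metadata_context(rows: list[dict]) -> str:
    """Render the metadata as one line per column, ready to drop into a prompt."""
    return "\n".join(
        f"{row['schema']}.{row['table']}.{row['column']}: {row['type']}"
        for row in rows
    )
```

Keeping the formatting separate from the fetching makes it easy to cache the metadata and refresh it on a schedule rather than on every user question.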

3. Building SQL Agents with Pydantic AI

This is where we define our AI agent. We use pydantic-ai to enforce structured output; in this case, we want to make sure we always receive a clean SQL query from the LLM. This makes the agent safe to use in applications and reduces the chance of ambiguous or, worse, invalid code.

To define the agent, we first specify the output schema with Pydantic, in this case a single field code representing the SQL query. Then we use the Agent class to tie together the system prompt, model name, and output type.

from pydantic import BaseModel
from pydantic_ai.agent import Agent
from pydantic_ai.messages import ModelResponse, TextPart

# ==== Output schema ====
class CatalogQuery(BaseModel):
    code: str

# ==== Agent Factory ====
def catalog_metadata_agent(system_prompt: str, model: str="openai:gpt-4o") -> Agent:
    return Agent(
        model=model,
        system_prompt=system_prompt,
        output_type=CatalogQuery,
        instrument=True
    )

# ==== Response Adapter ====
def to_model_response(output: CatalogQuery, timestamp: str) -> ModelResponse:
    return ModelResponse(
        parts=[TextPart(f"```sql\n{output.code}\n```")],
        timestamp=timestamp
    )

The system prompt provides guidance and examples to steer the LLM's behavior, and instrument=True enables tracing and observability, useful for debugging and evaluation.

The system prompt itself is designed to guide the agent's behavior. It clearly states the assistant's goal (writing SQL queries for Unity Catalog), includes the metadata context to ground its reasoning, and provides concrete examples to illustrate the expected output format. This structure helps the LLM stay focused, reduces ambiguity, and returns predictable, valid responses.
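A minimal sketch of such a system prompt might look like the following; the build_system_prompt helper and the exact wording are assumptions for illustration, not the prompt from the project:

```python
def build_system_prompt(metadata_context: str) -> str:
    """Assemble a system prompt that states the goal, grounds the agent
    in the catalog metadata, and shows an example of the expected output."""
    return f"""You are an assistant that writes SQL queries for Databricks Unity Catalog.

Only use the tables and columns listed below. If a question cannot be
answered from this metadata, say so instead of guessing.

Catalog metadata:
{metadata_context}

Example:
Question: How many tables are in the samples catalog?
SQL: SELECT COUNT(*) FROM samples.information_schema.tables
"""
```

The resulting string is what gets passed as system_prompt to the agent factory shown above.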

4. Build a Streamlit chat interface

Now that we have the foundation of our SQL agent in place, it's time to make it interactive. With Streamlit, we will create a simple front end where we can ask natural language questions and receive generated SQL queries in real time.

Fortunately, Streamlit already gives us powerful building blocks for creating an LLM-powered chat experience. If you are curious, here is a great tutorial that walks through the whole process in detail.
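A minimal version of that chat loop might look like the sketch below. It assumes a recent pydantic-ai version where run_sync results expose the structured output via .output; run_chat_app is an illustrative name, and Streamlit is imported inside the function only so the module stays importable without it:

```python
def run_chat_app(agent):
    """Minimal Streamlit chat loop around the SQL agent."""
    import streamlit as st  # imported here so the module loads without Streamlit installed

    st.title("Databricks Catalog Assistant")

    # Keep the conversation history across Streamlit reruns
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Replay earlier turns
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # Handle a new user question
    if prompt := st.chat_input("Ask a question about your data catalog"):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        # The agent returns a validated CatalogQuery, so .code is always SQL text
        result = agent.run_sync(prompt)
        answer = f"```sql\n{result.output.code}\n```"

        st.session_state.messages.append({"role": "assistant", "content": answer})
        with st.chat_message("assistant"):
            st.markdown(answer)
```

Running `streamlit run app.py` with this function called at module level would render the chat UI.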

Screenshot by the author – Databricks SQL agent with OpenAI and Streamlit

You can find the full code for this tutorial here, and you can try out the app on Streamlit Community Cloud.

Final thoughts

In this tutorial, you walked through the initial mechanics of building a simple AI agent. The goal was to create a lightweight prototype to help you understand how to structure agent flows and experiment with modern AI tooling.

However, if you plan to take this further into production, consider the following:

  • Hallucinations are real, and you cannot be sure the returned SQL is correct. Use static SQL analysis to validate outputs and implement retry mechanisms, ideally with more deterministic checks;
  • Use schema-aware tools to validate table and column names.
  • Add fallback flows for when a query fails, for example, “Did you mean this table?”
  • Make it stateful so the assistant remembers earlier turns in the conversation.
  • Think about infrastructure, identity management, and how the system is operated.
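As a rough illustration of the static-check idea in the first two bullets, the sketch below extracts table names with a regular expression and flags any that are not in the known catalog; a production system should use a real SQL parser (such as sqlglot) rather than regex, and the helper names here are hypothetical:

```python
import re


def extract_tables(query: str) -> set[str]:
    """Very rough static check: pull table names that follow FROM/JOIN.
    A real system should use a proper SQL parser instead of regex."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    return {match.lower() for match in pattern.findall(query)}


def validate_tables(query: str, known_tables: set[str]) -> list[str]:
    """Return the referenced tables that are not present in the catalog metadata."""
    return sorted(t for t in extract_tables(query) if t not in known_tables)
```

If validate_tables returns a non-empty list, the agent can be re-prompted with the error instead of executing a query that is guaranteed to fail.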

Ultimately, what makes these systems work is not just the model, it is the data underneath. Clean metadata, well-defined schemas, and quality checks are what make the agent reliable.
