Generate a Data Dictionary for Excel Files Using OpenPyXL and an AI Agent

Every company I have worked for to this day has one thing in common: the resilient MS Excel.
Excel was first released in 1985 and has remained strong to this day. It survived the rise of relational databases, the development of many programming languages, the boom of countless online applications, and, finally, it is surviving the AI era as well.
Phew!
Any doubts about Excel's resilience? I have none.
I think the reason for that is the practicality of opening and manipulating a file. Consider this situation: we are at work, in a meeting, and suddenly the manager shares a CSV file and asks for a quick calculation or a couple of summary numbers. Now, the options are:
1. Open an IDE (or a notebook) and start coding like crazy to generate a simple Matplotlib chart;
2. Open Power BI, import the data, and start creating a report with dynamic charts;
3. Open the CSV in Excel, write a couple of formulas, and create a chart.
I can't speak for you, but many times I would go with option 3. Especially because Excel files are compatible with everything, easy to share, and beginner-friendly.
I say all of this to introduce my point: even with the fast development of AI, I don't think Excel files will disappear anytime soon. Many people will love that; many others will hate it.
So my move here is to use AI to better document Excel files. One of data teams' main complaints about Excel is the lack of best practices and reproducibility: columns can have any name and any data type, and documentation is usually zero.
So I created an AI agent that reads an Excel file and generates this minimal documentation. Here is how it works:
- The Excel file is converted to CSV and fed to a large language model (LLM).
- The AI agent reads the columns and generates a data dictionary with each column's name, data type, and description (a sample of the dictionary is shown right after this list).
- The data dictionary is added as comments to the header of the Excel file.
- The commented output file is saved for download.
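For illustration, the generated data dictionary could look something like this (a made-up example, not from a real run):
{
  "0": {"ColName": "order_id", "DataType": "int", "Description": "Unique identifier of the order."},
  "1": {"ColName": "order_date", "DataType": "date", "Description": "Date when the order was placed."},
  "2": {"ColName": "amount", "DataType": "float", "Description": "Total order value."}
}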
Okay, let's get started. This is what we will build in this tutorial.
Code
We will first set up the virtual environment.
Create a venv
Use the tool of your choice, such as Poetry, Python venv, Anaconda, or uv. Personally, I like uv best because, in my opinion, it is the fastest and simplest. If you have uv installed [5], open a terminal and create your venv:
uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai
Now let's import the necessary modules. The project was created with Python 3.12.1, but I believe Python 3.9 or later should work too. We will use:
- agno: for AI agent management
- openpyxl: to manipulate Excel files
- streamlit: for the front-end interface
- pandas, os, json, dedent, and google-genai as supporting modules
# Imports
import os
import json
import streamlit as st
from textwrap import dedent
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from openpyxl import load_workbook
from openpyxl.comments import Comment
import pandas as pd
Great. The next step is to create the functions we need to process the Excel file and create the AI agent.
Notice that all the functions have detailed docstrings. This is intentional, because LLMs use docstrings to understand what a given function does and to decide whether to use it as a tool.
So, if you are using Python functions as tools for an AI agent, make sure to write detailed docstrings. Nowadays, with a free coding copilot such as Windsurf [6], creating them is even easier.
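Just to illustrate the idea, here is a minimal sketch of a tool-friendly docstring (the function itself is a hypothetical example, not part of this project):
def count_csv_rows(file_path: str) -> int:
    """
    Use this tool to count the number of data rows in a CSV file.
    * file_path: Path to the CSV file to be counted.
    Returns the row count, excluding the header.
    """
    import pandas as pd
    return len(pd.read_csv(file_path))
The first docstring line tells the LLM when to use the tool; the parameter lines tell it how to call it.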
Convert the file to CSV
This function will:
- Take the Excel file and read only its first 10 rows. That is enough to send to the LLM, and it also prevents us from sending too many input tokens and making the agent too expensive.
- Save those rows as a CSV file to serve as input for the AI agent. The CSV format is easier for the model to consume, since it is just a bunch of comma-separated text. And we know LLMs shine when working with text.
This is the function.
def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    # Load the file, keeping only the first 10 rows
    df = pd.read_excel(file_path).head(10)
    # Convert to CSV and save it as temp.csv
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
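A quick usage example (the file name here is just an illustration):
convert_to_csv("sales_report.xlsx")  # writes the first 10 rows to temp.csv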
Let’s keep moving forward.
Create the agent
The next function creates the AI agent. I am using Agno [1] because it is very versatile and easy to use. The model I chose is Gemini 2.0 Flash, which showed the best performance for generating the data documentation during my tests. To use it, you will need a Google API key; don't forget to grab one here [7].
The function:
- Receives the CSV output from the previous function.
- Runs it through the AI agent, which generates the data dictionary with each column's name, data type, and description.
- Notice that the description argument is the agent's prompt. Make it detailed and precise.
- The data dictionary is saved as a JSON file using the FileTools tool, which can read and write files.
- I set retries=2, so any error on the first attempt can be resolved on a retry.
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description=dedent("""
            You are an agent that reads the temp.csv dataset presented to you and,
            based on the name and data type of each column header, determines the following information:
            - The data type of each column
            - The description of each column
            - The first column number is 0
            Using the FileTools provided, create a data dictionary in JSON format that includes the information below:
            {<column number>: {ColName: <column name>, DataType: <data type>, Description: <description>}}
            If you are unable to determine the data type or description of a column, return 'N/A' for that column for the missing values.
            """),
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
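If you want to test the agent outside of Streamlit, a minimal sketch could look like this (replace the placeholder with your own key, and make sure temp.csv exists):
agent = create_agent("YOUR_GOOGLE_API_KEY")  # placeholder key
agent.print_response("Read temp.csv and create the data dictionary, then save it to 'data_dict.json'.")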
Okay. Now we need one more function, to add the data dictionary to the Excel file.
Add the data dictionary to the file's header
This is the last function to be created. It will:
- Take the data dictionary JSON from the previous step and the original Excel file.
- Add the data dictionary to the file's header cells as comments.
- Save the output file.
- After the file is saved, display a download button so the user can get the modified file.
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    """
    Use this tool to add the data dictionary {data_dict.json} as comments to the header of an Excel file and save the output file.
    The function takes the Excel file path as argument and adds the {data_dict.json} as comments to each cell
    in the first row of the Excel file, counting columns from 0, using the following format:
    * Column Number:
    * Column Name:
    * Data Type:
    * Description:

    Parameters
    ----------
    * file_path : str
        The path to the Excel file to be processed
    * data_dict : str
        The path to the JSON data dictionary containing the column number, column name, data type, and description
    """
    # Load the data dictionary
    data_dict = json.load(open(data_dict))
    # Load the workbook
    wb = load_workbook(file_path)
    # Get the active worksheet
    ws = wb.active
    # Iterate over each column in the first row (header)
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(dedent(f"""
                ColName: {data_dict[str(n)]['ColName']},
                DataType: {data_dict[str(n)]['DataType']},
                Description: {data_dict[str(n)]['Description']}
                """), 'AI Agent')
    # Save the workbook
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    # Create a download button
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
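To sanity-check the result, you can read one of the comments back with openpyxl (a quick sketch, assuming output.xlsx was just saved):
from openpyxl import load_workbook

wb = load_workbook('output.xlsx')
print(wb.active['A1'].comment.text)  # ColName, DataType, and Description of the first column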
Okay. The next step is to glue all of this together in a Streamlit front end.
Streamlit front end
For this step, I could have created a separate file for the front end and imported the functions into it. But I decided to keep everything in the same file, so we start with the famous:
if __name__ == "__main__":
First, a few lines configure the page and the messages displayed in the web application. The content will be centered on the page, with some information about how the app works.
    # Config page Streamlit
    st.set_page_config(layout="centered",
                       page_title="Data Docs",
                       page_icon=":paperclip:",
                       initial_sidebar_state="expanded")
    # Title
    st.title("Data Docs :paperclip:")
    st.subheader("Generate a data dictionary for your Excel file.")
    st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
    st.caption("2. Run the agent.")
    st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
    st.caption("ColName: | DataType: | Description: ")
    st.divider()
Next, we set up the sidebar, where users can enter their Google API key and upload the .xlsx file to be modified.
There is a button to run the app, another to reset the app state, and a progress bar. Nothing too fancy.
    with st.sidebar:
        # Enter your API key
        st.caption("Enter your API key and the path of the Excel file.")
        api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
        # Upload file
        input_file = st.file_uploader("File upload", type='xlsx')
        # Run the agent
        agent_run = st.button("Run")
        # Progress bar
        progress_bar = st.empty()
        progress_bar.progress(0, text="Initializing...")
        st.divider()
        # Reset session state
        if st.button("Reset Session"):
            st.session_state.clear()
            st.rerun()
Once the Run button is clicked, it triggers the rest of the code to execute the agent. This is the execution order of the steps:
- The first function is called to convert the file to CSV.
- The progress is registered on the progress bar.
- The agent is created.
- The progress bar is updated.
- A prompt is sent to the agent, asking it to read the temp.csv file, create the data dictionary, and save the output to data_dict.json.
- The data dictionary is printed on the screen, so the user can see what was generated while it is saved to the Excel file.
- The Excel file is modified and saved.
    # Create the agent
    if agent_run:
        # Convert Excel file to CSV
        convert_to_csv(input_file)
        # Register progress
        progress_bar.progress(15, text="Processing CSV...")
        # Create the agent
        agent = create_agent(api_key)
        # Start the script
        st.write("Running Agent... :runner:")
        # Register progress
        progress_bar.progress(50, text="AI Agent is running...")
        # Run the agent
        agent.print_response(dedent("""
            1. Use FileTools to read the temp.csv as input to create the data dictionary for the columns in the dataset.
            2. Using the FileTools tool, save the data dictionary to a file named 'data_dict.json'.
            """),
            markdown=True)
        # Print the data dictionary
        st.write("Generating Data Dictionary... :page_facing_up:")
        with open('data_dict.json', 'r') as f:
            data_dict = json.load(f)
        st.json(data_dict, expanded=False)
        # Add comments to header
        add_comments_to_header(input_file, 'data_dict.json')
        # Remove temporary files
        st.write("Removing temporary files... :wastebasket:")
        os.remove('temp.csv')
        os.remove('data_dict.json')
        # If file exists, show success message
        if os.path.exists('output.xlsx'):
            st.success("Done! :white_check_mark:")
            os.remove('output.xlsx')
        # Progress bar end
        progress_bar.progress(100, text="Done!")
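If you keep everything in a single script, say app.py (the file name is my assumption), you can launch the app locally with:
uv run streamlit run app.py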
That's it. Here is a demonstration of the agent in action.

Beautiful result!
Try it
You can try the deployed application here [4]:
Before you go
In my humble opinion, Excel files won't disappear anytime soon. Love them or hate them, we will have to live with them for a while.
Excel files are versatile and easy to manipulate and share, so they remain very useful for routine ad hoc tasks at work.
But now we can leverage AI to help us process these files and make them better. AI is reshaping many parts of our lives, and our work routines and tools are just one more of them.
Let's work alongside AI and get smarter every day!
If you liked this content, you can find more of my work on the websites shared below and on GitHub.
GitHub repository
This is the GitHub repository for the project.
Find me
You can find more information on my work on my website.
References
[1. Agno Docs]
[2. Openpyxl Docs]
[3. Streamlit Docs]
[4. Data-Docs Web App]
[5. Installing UV]
[6. Windsurf Coding Copilot]
[7. Google Gemini API Key]