Generate a Data Dictionary for Excel Files Using OpenPyXL and an AI Agent

Every company I have worked for to this day has one thing in common: the resilient MS Excel.
Excel was first released in 1985 and has remained strong to this day. It survived the rise of relational databases, the development of many programming languages, the boom of countless online applications, and, finally, it is surviving the AI era as well.
Phew!
Any doubts about Excel's resilience? I have none.
I think the reason for that is the practicality of opening and manipulating a file. Consider this situation: we are at work, in a meeting, and suddenly the manager shares a CSV file and asks for a quick calculation or a couple of summary numbers. Now, the options are:
1. Open an IDE (or a notebook) and start coding like crazy to generate a simple Matplotlib chart;
2. Open Power BI, import the data, and start creating a report with dynamic charts;
3. Open the CSV in Excel, write a couple of formulas, and create a chart.
I can't speak for you, but many times I would go with option 3. Especially because Excel files are compatible with everything, easy to share, and beginner-friendly.
I say all of this to introduce my point: even with the fast development of AI, I don't think Excel files will disappear anytime soon. Many people will love that; many others will hate it.
So my move here is to use AI to better document Excel files. One of data teams' main complaints about Excel is the lack of best practices and reproducibility: columns can have any name and any data type, and documentation is usually zero.
So I created an AI agent that reads an Excel file and generates this minimal documentation. Here is how it works:
- The Excel file is converted to CSV and fed to a large language model (LLM).
- The AI agent reads the columns and generates a data dictionary with each column's name, data type, and description (a sample of the dictionary is shown right after this list).
- The data dictionary is added as comments to the header of the Excel file.
- The commented output file is saved for download.
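For illustration, the generated data dictionary could look something like this (a made-up example, not from a real run):
{
  "0": {"ColName": "order_id", "DataType": "int", "Description": "Unique identifier of the order."},
  "1": {"ColName": "order_date", "DataType": "date", "Description": "Date when the order was placed."},
  "2": {"ColName": "amount", "DataType": "float", "Description": "Total order value."}
}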
Okay, let's get started. This is what we will build in this tutorial.
Code
We will first set up the virtual environment.
Create a venv
Use the tool of your choice, such as Poetry, Python venv, Anaconda, or uv. Personally, I like uv best because, in my opinion, it is the fastest and simplest. If you have uv installed [5], open a terminal and create your venv:
uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai
Now let's import the necessary modules. The project was created with Python 3.12.1, but I believe Python 3.9 or later should work too. We will use:
- agno: for AI agent management
- openpyxl: to manipulate Excel files
- streamlit: for the front-end interface
- pandas, os, json, dedent, and google-genai as supporting modules
# Imports
import os
import json
import streamlit as st
from textwrap import dedent
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from openpyxl import load_workbook
from openpyxl.comments import Comment
import pandas as pd
Great. The next step is to create the functions we need to process the Excel file and create the AI agent.
Notice that all the functions have detailed docstrings. This is intentional, because LLMs use docstrings to understand what a given function does and to decide whether to use it as a tool.
So, if you are using Python functions as tools for an AI agent, make sure to write detailed docstrings. Nowadays, with a free coding copilot such as Windsurf [6], creating them is even easier.
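Just to illustrate the idea, here is a minimal sketch of a tool-friendly docstring (the function itself is a hypothetical example, not part of this project):
def count_csv_rows(file_path: str) -> int:
    """
    Use this tool to count the number of data rows in a CSV file.
    * file_path: Path to the CSV file to be counted.
    Returns the row count, excluding the header.
    """
    import pandas as pd
    return len(pd.read_csv(file_path))
The first docstring line tells the LLM when to use the tool; the parameter lines tell it how to call it.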
Convert the file to CSV
This function will:
- Take the Excel file and read only its first 10 rows. That is enough to send to the LLM, and it also prevents us from sending too many input tokens and making the agent too expensive.
- Save those rows as a CSV file to serve as input for the AI agent. The CSV format is easier for the model to consume, since it is just a bunch of comma-separated text. And we know LLMs shine when working with text.
This is the function.
def convert_to_csv(file_path: str):
    """
    Use this tool to convert the excel file to CSV.
    * file_path: Path to the Excel file to be converted
    """
    # Load the file, keeping only the first 10 rows
    df = pd.read_excel(file_path).head(10)
    # Convert to CSV and save it as temp.csv
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
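A quick usage example (the file name here is just an illustration):
convert_to_csv("sales_report.xlsx")  # writes the first 10 rows to temp.csv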
Let’s keep moving forward.
Create the agent
The next function creates the AI agent. I am using Agno [1] because it is very versatile and easy to use. The model I chose is Gemini 2.0 Flash, which showed the best performance for generating the data documentation during my tests. To use it, you will need a Google API key; don't forget to grab one here [7].
The function:
- Receives the CSV output from the previous function.
- Runs it through the AI agent, which generates the data dictionary with each column's name, data type, and description.
- Notice that the description argument is the agent's prompt. Make it detailed and precise.
- The data dictionary is saved as a JSON file using the FileTools tool, which can read and write files.
- I set retries=2, so any error on the first attempt can be resolved on a retry.
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description=dedent("""
            You are an agent that reads the temp.csv dataset presented to you and,
            based on the name and data type of each column header, determines the following information:
            - The data type of each column
            - The description of each column
            - The first column number is 0
            Using the FileTools provided, create a data dictionary in JSON format that includes the information below:
            {<column number>: {ColName: <column name>, DataType: <data type>, Description: <description>}}
            If you are unable to determine the data type or description of a column, return 'N/A' for that column for the missing values.
            """),
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
    )
    return agent
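If you want to test the agent outside of Streamlit, a minimal sketch could look like this (replace the placeholder with your own key, and make sure temp.csv exists):
agent = create_agent("YOUR_GOOGLE_API_KEY")  # placeholder key
agent.print_response("Read temp.csv and create the data dictionary, then save it to 'data_dict.json'.")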
Okay. Now we need one more function, to add the data dictionary to the Excel file.
Add the data dictionary to the file's header
This is the last function to be created. It will:
- Take the data dictionary JSON from the previous step and the original Excel file.
- Add the data dictionary to the file's header cells as comments.
- Save the output file.
- After the file is saved, display a download button so the user can get the modified file.
def add_comments_to_header(file_path: str, data_dict: str = "data_dict.json"):
    """
    Use this tool to add the data dictionary {data_dict.json} as comments to the header of an Excel file and save the output file.
    The function takes the Excel file path as argument and adds the {data_dict.json} as comments to each cell
    in the first row of the Excel file, counting columns from 0, using the following format:
    * Column Number:
    * Column Name:
    * Data Type:
    * Description:

    Parameters
    ----------
    * file_path : str
        The path to the Excel file to be processed
    * data_dict : str
        The path to the JSON data dictionary containing the column number, column name, data type, and description
    """
    # Load the data dictionary
    data_dict = json.load(open(data_dict))
    # Load the workbook
    wb = load_workbook(file_path)
    # Get the active worksheet
    ws = wb.active
    # Iterate over each column in the first row (header)
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(dedent(f"""
                ColName: {data_dict[str(n)]['ColName']},
                DataType: {data_dict[str(n)]['DataType']},
                Description: {data_dict[str(n)]['Description']}
                """), 'AI Agent')
    # Save the workbook
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')
    # Create a download button
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
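To sanity-check the result, you can read one of the comments back with openpyxl (a quick sketch, assuming output.xlsx was just saved):
from openpyxl import load_workbook

wb = load_workbook('output.xlsx')
print(wb.active['A1'].comment.text)  # ColName, DataType, and Description of the first column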
Okay. The next step is to glue all of this together in a Streamlit front end.
Streamlit front end
For this step, I could have created a separate file for the front end and imported the functions into it. But I decided to keep everything in the same file, so we start with the famous:
if __name__ == "__main__":
First, a few lines configure the page and the messages displayed in the web application. The content will be centered on the page, with some information about how the app works.
    # Config page Streamlit
    st.set_page_config(layout="centered",
                       page_title="Data Docs",
                       page_icon=":paperclip:",
                       initial_sidebar_state="expanded")
    # Title
    st.title("Data Docs :paperclip:")
    st.subheader("Generate a data dictionary for your Excel file.")
    st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
    st.caption("2. Run the agent.")
    st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
    st.caption("ColName: | DataType: | Description: ")
    st.divider()
Next, we set up the sidebar, where users can enter their Google API key and upload the .xlsx file to be modified.
There is a button to run the app, another to reset the app state, and a progress bar. Nothing too fancy.
    with st.sidebar:
        # Enter your API key
        st.caption("Enter your API key and the path of the Excel file.")
        api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
        # Upload file
        input_file = st.file_uploader("File upload", type='xlsx')
        # Run the agent
        agent_run = st.button("Run")
        # Progress bar
        progress_bar = st.empty()
        progress_bar.progress(0, text="Initializing...")
        st.divider()
        # Reset session state
        if st.button("Reset Session"):
            st.session_state.clear()
            st.rerun()
Once the Run button is clicked, it triggers the rest of the code to execute the agent. This is the execution order of the steps:
- The first function is called to convert the file to CSV.
- The progress is registered on the progress bar.
- The agent is created.
- The progress bar is updated.
- A prompt is sent to the agent, asking it to read the temp.csv file, create the data dictionary, and save the output to data_dict.json.
- The data dictionary is printed on the screen, so the user can see what was generated while it is saved to the Excel file.
- The Excel file is modified and saved.
    # Create the agent
    if agent_run:
        # Convert Excel file to CSV
        convert_to_csv(input_file)
        # Register progress
        progress_bar.progress(15, text="Processing CSV...")
        # Create the agent
        agent = create_agent(api_key)
        # Start the script
        st.write("Running Agent... :runner:")
        # Register progress
        progress_bar.progress(50, text="AI Agent is running...")
        # Run the agent
        agent.print_response(dedent("""
            1. Use FileTools to read the temp.csv as input to create the data dictionary for the columns in the dataset.
            2. Using the FileTools tool, save the data dictionary to a file named 'data_dict.json'.
            """),
            markdown=True)
        # Print the data dictionary
        st.write("Generating Data Dictionary... :page_facing_up:")
        with open('data_dict.json', 'r') as f:
            data_dict = json.load(f)
        st.json(data_dict, expanded=False)
        # Add comments to header
        add_comments_to_header(input_file, 'data_dict.json')
        # Remove temporary files
        st.write("Removing temporary files... :wastebasket:")
        os.remove('temp.csv')
        os.remove('data_dict.json')
        # If file exists, show success message
        if os.path.exists('output.xlsx'):
            st.success("Done! :white_check_mark:")
            os.remove('output.xlsx')
        # Progress bar end
        progress_bar.progress(100, text="Done!")
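If you keep everything in a single script, say app.py (the file name is my assumption), you can launch the app locally with:
uv run streamlit run app.py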
That's it. Here is a demonstration of the agent in action.

Beautiful result!
Try it
You can try the deployed application here [4]:
Before you go
In my humble opinion, Excel files won't disappear anytime soon. Love them or hate them, we will have to live with them for a while.
Excel files are versatile and easy to manipulate and share, so they remain very useful for routine ad hoc tasks at work.
But now we can leverage AI to help us process these files and make them better. AI is reshaping many parts of our lives, and our work routines and tools are just one more of them.
Let's work alongside AI and get smarter every day!
If you liked this content, you can find more of my work on the websites shared below and on GitHub.
GitHub repository
This is the GitHub repository for the project.
Find me
You can find more information on my work on my website.
References
[1. Agno Docs]
[2. Openpyxl Docs]
[3. Streamlit Docs]
[4. Data-Docs Web App]
[5. Installing UV]
[6. Windsurf Coding Copilot]
[7. Google Gemini API Key]