LLMs + Pandas: How to Generate a Pandas DataFrame Summary with Generative AI

If you work with datasets and are looking for quick insights without much manual grinding, you're in the right place.
In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis nearly impossible. Local large language models can convert your raw DataFrame statistics into polished, readable reports in seconds (at worst, minutes). This approach eliminates the tedious process of analyzing data by hand and writing up reports, especially when the data structure remains unchanged.
Pandas handles the heavy lifting of data extraction, while the LLM converts your technical output into readable reports. You still need to write the function that extracts key statistics from the dataset, but it's a one-time effort.
This guide assumes that you have Ollama installed locally. If you don't, you can still use a third-party LLM vendor, but I won't explain how to connect to its API.
Table of contents:
- Dataset introduction and exploration
- The boring part: Extract summary statistics
- The cool part: Working with LLMs
- What can you improve
Dataset introduction and exploration
For this guide, I'm using Kaggle's MBA Admissions Dataset. If you want to follow along, download it.
The dataset is licensed under Apache 2.0, which means you are free to use it for personal and commercial projects.
First, you need to install a couple of Python libraries: pandas and langchain-ollama (for example, with pip install pandas langchain-ollama).
Once everything is installed, import the necessary libraries in a new script or notebook:
import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal
Dataset loading and preprocessing
First, use pandas to load the dataset. This snippet loads a CSV file, prints basic information about the shape of the dataset, and shows how many missing values are present in each column:
df = pd.read_csv("data/MBA.csv")
# Basic dataset info
print(f"Dataset shape: {df.shape}n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)

Since data cleaning is not the main focus of this article, I'll keep preprocessing to a minimum. There are only a few missing values in the dataset that need attention:
df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
That's it! Let's see how to get from here to a meaningful report.
The boring part: Extract summary statistics
Even with all the advancements in AI capabilities and usability, you may not want to send the entire dataset to an LLM provider. There are a few good reasons:
- Too many tokens translates directly into higher costs.
- Processing large datasets can take a long time, especially when you run the model locally on your own hardware.
- You may be dealing with sensitive data that shouldn't leave your organization.
Some manual work is still the way to go.
This approach requires you to write a function that extracts key elements and statistics from the pandas DataFrame. You'll have to write this function from scratch for each new dataset, but the core ideas transfer easily between projects.
The get_summary_context_message() function takes in the DataFrame and returns a formatted multi-line string containing a detailed summary. It includes:
- Total number of applications and gender distribution
- International and domestic applicant breakdown
- GPA and GMAT score quartile statistics
- Admission rates by academic major (sorted by rate)
- Admission rates by work industry (top 8 industries)
- Work experience analysis with a breakdown by experience range
- Key insights highlighting the best-performing categories
Here is the full source code for the function:
def get_summary_context_message(df: pd.DataFrame) -> str:
"""
Generate a comprehensive summary report of MBA admissions dataset statistics.
This function analyzes MBA application data to provide detailed statistics on
applicant demographics, academic performance, professional backgrounds, and
admission rates across various categories. The summary includes gender and
international status distributions, GPA and GMAT score statistics, admission
rates by academic major and work industry, and work experience impact analysis.
Parameters
----------
df : pd.DataFrame
DataFrame containing MBA admissions data with the following expected columns:
- 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
Returns
-------
str
A formatted multi-line string containing comprehensive MBA admissions
statistics.
"""
# Basic application statistics
total_applications = len(df)
# Gender distribution
gender_counts = df["gender"].value_counts()
male_count = gender_counts.get("Male", 0)
female_count = gender_counts.get("Female", 0)
# International status
international_count = (
df["international"].sum()
if df["international"].dtype == bool
else (df["international"] == True).sum()
)
# GPA statistics
gpa_data = df["gpa"].dropna()
gpa_avg = gpa_data.mean()
gpa_25th = gpa_data.quantile(0.25)
gpa_50th = gpa_data.quantile(0.50)
gpa_75th = gpa_data.quantile(0.75)
# GMAT statistics
gmat_data = df["gmat"].dropna()
gmat_avg = gmat_data.mean()
gmat_25th = gmat_data.quantile(0.25)
gmat_50th = gmat_data.quantile(0.50)
gmat_75th = gmat_data.quantile(0.75)
# Major analysis - admission rates by major
major_stats = []
for major in df["major"].unique():
major_data = df[df["major"] == major]
admitted = len(major_data[major_data["admission"] == "Admit"])
total = len(major_data)
rate = (admitted / total) * 100
major_stats.append((major, admitted, total, rate))
# Sort by admission rate (descending)
major_stats.sort(key=lambda x: x[3], reverse=True)
# Work industry analysis - admission rates by industry
industry_stats = []
for industry in df["work_industry"].unique():
if pd.isna(industry):
continue
industry_data = df[df["work_industry"] == industry]
admitted = len(industry_data[industry_data["admission"] == "Admit"])
total = len(industry_data)
rate = (admitted / total) * 100
industry_stats.append((industry, admitted, total, rate))
# Sort by admission rate (descending)
industry_stats.sort(key=lambda x: x[3], reverse=True)
# Work experience analysis
work_exp_data = df["work_exp"].dropna()
avg_work_exp_all = work_exp_data.mean()
# Work experience for admitted students
admitted_students = df[df["admission"] == "Admit"]
admitted_work_exp = admitted_students["work_exp"].dropna()
avg_work_exp_admitted = admitted_work_exp.mean()
# Work experience ranges analysis
def categorize_work_exp(exp):
    if pd.isna(exp):
        return "Unknown"
    elif exp <= 2:
        return "0-2 years"
    elif exp <= 5:
        return "3-5 years"
    elif exp <= 8:
        return "6-8 years"
    else:
        return "9+ years"
# The exact experience bins are illustrative - adjust them to your dataset.
work_exp_categories = df["work_exp"].apply(categorize_work_exp)
# Admission rates by work experience category
work_exp_category_stats = []
for category in work_exp_categories.unique():
    if category == "Unknown":
        continue
    category_data = df[work_exp_categories == category]
    total = len(category_data)
    if total > 0:
        admitted = len(category_data[category_data["admission"] == "Admit"])
        rate = (admitted / total) * 100
        work_exp_category_stats.append((category, admitted, total, rate))
# Build the summary message
summary = f"""MBA Admissions Dataset Summary (2025)
Total Applications: {total_applications:,} people applied to the MBA program.
Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)
International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)
Academic Performance Statistics:
GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}
GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}
Major Analysis - Admission Rates by Academic Background:"""
for major, admitted, total, rate in major_stats:
    summary += (
        f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
summary += (
    "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
)
# Show top 8 industries by admission rate
for industry, admitted, total, rate in industry_stats[:8]:
    summary += (
        f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
summary += (
    f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
)
summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"
summary += "\n\nAdmission Rates by Work Experience Range:"
for category, admitted, total, rate in work_exp_category_stats:
    summary += (
        f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
# Key insights
best_major = major_stats[0]
best_industry = industry_stats[0]
summary += "\n\nKey Insights:"
summary += (
    f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
)
summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"
if avg_work_exp_admitted > avg_work_exp_all:
    summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
else:
    summary += "\n- Work experience shows minimal difference between admitted and all applicants"
return summary
After defining the function, just call it and print the result:
print(get_summary_context_message(df))

Now let’s keep moving forward.
The cool part: Working with LLMs
This is where things get interesting, and where your manual data extraction work pays off.
Connecting Python to the LLM
If you have decent hardware, I highly recommend using a local LLM for such a simple task. I'm using Ollama with the latest Mistral model for the actual LLM processing (pull it with ollama pull mistral:latest if you haven't already).

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You just need to modify the following function to set the API key and return the appropriate chat model instance from LangChain.
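As a rough illustration of that swap (not from the original article), here is a minimal sketch assuming the langchain-openai package is installed and the OPENAI_API_KEY environment variable is set; the helper name get_llm_openai and the model name are placeholders:
from langchain_openai import ChatOpenAI

def get_llm_openai(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
    # Reads the API key from the OPENAI_API_KEY environment variable by default.
    return ChatOpenAI(model=model_name, temperature=0)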
Whichever option you choose, calling get_llm() with a test message should not return an error:
def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
"""
Create and configure a ChatOllama instance for local LLM inference.
This function initializes a ChatOllama client configured to connect to a
local Ollama server. The client is set up with deterministic output
(temperature=0) for consistent responses across multiple calls with the
same input.
Parameters
----------
model_name : str, optional
The name of the Ollama model to use for chat completions.
Must be a valid model name that is available on the local Ollama
installation. Default is "mistral:latest".
Returns
-------
ChatOllama
A configured ChatOllama instance ready for chat completions.
"""
return ChatOllama(
    model=model_name,
    base_url="http://localhost:11434",  # default local Ollama server address
    temperature=0,
)
print(get_llm().invoke("test").content)

The summarization prompt
This is where you can get creative and write highly specific instructions for the LLM. I decided to keep things light for demonstration purposes, but feel free to experiment here.
There is no right or wrong prompt.
Whatever you do, make sure to include the format parameters in curly brackets – these values will be populated dynamically later:
SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.
You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}
This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:
{context}
Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.
Report:
"""
The summarization function
With the prompt and get_llm() in place, the only thing left is to declare the function that ties them together. The get_report_summary() function takes the parameters that fill in the placeholders in the prompt, then calls the LLM with the formatted prompt to generate the report.
You can choose between Markdown and HTML formats:
def get_report_summary(
dataset: pd.DataFrame,
dataset_name: str,
dataset_source: str,
report_format: Literal["markdown", "html"] = "markdown",
) -> str:
"""
Generate an AI-powered summary report from a pandas DataFrame.
This function analyzes a dataset and generates a comprehensive summary report
using a large language model (LLM). It first extracts statistical context
from the dataset, then uses an LLM to create a human-readable report in the
specified format.
Parameters
----------
dataset : pd.DataFrame
The pandas DataFrame to analyze and summarize.
dataset_name : str
A descriptive name for the dataset that will be included in the
generated report for context and identification.
dataset_source : str
Information about the source or origin of the dataset.
report_format : {"markdown", "html"}, optional
The desired output format for the generated report. Options are:
- "markdown" : Generate report in Markdown format (default)
- "html" : Generate report in HTML format
Returns
-------
str
A formatted summary report.
"""
context_message = get_summary_context_message(df=dataset)
prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
dataset_name=dataset_name,
dataset_source=dataset_source,
context=context_message,
report_format=report_format,
)
return get_llm().invoke(input=prompt).content
Using this function is simple – just pass the dataset, its name and source. The report format defaults to Markdown:
md_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",  # replace with the dataset URL
)
print(md_report)
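For the HTML version, the same call works with report_format set to "html" (the "Kaggle" source string here is again a placeholder for the dataset link):
html_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",  # replace with the dataset URL
    report_format="html",
)
print(html_report)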

The HTML report renders as well, though it could use some styling. Maybe you can ask the LLM to handle that, too!

What can you improve
I could easily turn this into a 30-minute read by optimizing every detail of the pipeline, but I kept things simple for demonstration purposes. You don't have to (and shouldn't) stop here.
Here are things you can improve to make this pipeline even more powerful:
- Write a function that saves the report (Markdown or HTML) directly to disk (see the sketch after this list). This way, you can automate the entire process and generate reports on a schedule without manual intervention.
- In the prompt, require the LLM to add CSS styling to the HTML report so it looks polished when rendered. You can even provide your company's brand colors and fonts to keep all data reports consistent.
- Extend the prompt with more specific instructions. You may want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
- Extend the get_llm() function so it can connect to vendors other than Ollama, such as OpenAI, Anthropic, or Google. This gives you the flexibility to switch between on-premises and cloud-based models, depending on your needs.
- Iterate on the get_summary_context_message() function, since it is the basis for all the context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and the data insights that matter for your specific use case.
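As a starting point for the first item above, here is a minimal sketch of a save function; the name save_report and the example file path are illustrative, not part of the original pipeline:
from pathlib import Path

def save_report(report: str, path: str) -> Path:
    # Write the generated report (Markdown or HTML) to disk, creating folders as needed.
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(report, encoding="utf-8")
    return output_path

# Example: save_report(md_report, "reports/mba_admissions_summary.md")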
I hope this minimal example puts you on the right track to automate your own data reporting workflow.