LLMs + Pandas: How to Generate a Pandas DataFrame Summary with Generative AI

If you work with datasets and are looking for quick insights without much manual grinding, you're in the right place.
In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis nearly impossible. Local large language models can convert your raw DataFrame statistics into polished, readable reports in seconds (at worst, minutes). This approach eliminates the tedious process of analyzing data by hand and writing up reports, especially when the data structure remains unchanged.
Pandas handles the heavy lifting of data extraction, while the LLM converts your technical output into readable reports. You still need to write the function that extracts key statistics from the dataset, but it's a one-time effort.
This guide assumes that you have Ollama installed locally. If you don't, you can still use a third-party LLM vendor, but I won't explain how to connect to its API.
Table of contents:
- Dataset introduction and exploration
- The boring part: Extract summary statistics
- The cool part: Working with LLMs
- What can you improve
Dataset introduction and exploration
For this guide, I'm using Kaggle's MBA Admissions Dataset. If you want to follow along, download it.
The dataset is licensed under Apache 2.0, which means you are free to use it for personal and commercial projects.
First, you need to install a couple of Python libraries: pandas and langchain-ollama (for example, with pip install pandas langchain-ollama).
Once everything is installed, import the necessary libraries in a new script or notebook:
import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal
Dataset loading and preprocessing
First, use pandas to load the dataset. This snippet loads a CSV file, prints basic information about the shape of the dataset, and shows how many missing values are present in each column:
df = pd.read_csv("data/MBA.csv")
# Basic dataset info
print(f"Dataset shape: {df.shape}n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)

Since data cleaning is not the main focus of this article, I'll keep preprocessing to a minimum. There are only a few missing values in the dataset that need attention:
df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
That's it! Let's see how to get from here to a meaningful report.
The boring part: Extract summary statistics
Even with all the advancements in AI capabilities and usability, you may not want to send the entire dataset to an LLM provider. There are a few good reasons:
- Too many tokens translates directly into higher costs.
- Processing large datasets can take a long time, especially when you run the model locally on your own hardware.
- You may be dealing with sensitive data that shouldn't leave your organization.
Some manual work is still the way to go.
This approach requires you to write a function that extracts key elements and statistics from the pandas DataFrame. You'll have to write this function from scratch for each new dataset, but the core ideas transfer easily between projects.
The get_summary_context_message() function takes in the DataFrame and returns a formatted multi-line string containing a detailed summary. It includes:
- Total number of applications and gender distribution
- International and domestic applicant breakdown
- GPA and GMAT score quartile statistics
- Admission rates by academic major (sorted by rate)
- Admission rates by work industry (top 8 industries)
- Work experience analysis with a breakdown by experience range
- Key insights highlighting the best-performing categories
Here is the full source code for the function:
def get_summary_context_message(df: pd.DataFrame) -> str:
"""
Generate a comprehensive summary report of MBA admissions dataset statistics.
This function analyzes MBA application data to provide detailed statistics on
applicant demographics, academic performance, professional backgrounds, and
admission rates across various categories. The summary includes gender and
international status distributions, GPA and GMAT score statistics, admission
rates by academic major and work industry, and work experience impact analysis.
Parameters
----------
df : pd.DataFrame
DataFrame containing MBA admissions data with the following expected columns:
- 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
Returns
-------
str
A formatted multi-line string containing comprehensive MBA admissions
statistics.
"""
# Basic application statistics
total_applications = len(df)
# Gender distribution
gender_counts = df["gender"].value_counts()
male_count = gender_counts.get("Male", 0)
female_count = gender_counts.get("Female", 0)
# International status
international_count = (
df["international"].sum()
if df["international"].dtype == bool
else (df["international"] == True).sum()
)
# GPA statistics
gpa_data = df["gpa"].dropna()
gpa_avg = gpa_data.mean()
gpa_25th = gpa_data.quantile(0.25)
gpa_50th = gpa_data.quantile(0.50)
gpa_75th = gpa_data.quantile(0.75)
# GMAT statistics
gmat_data = df["gmat"].dropna()
gmat_avg = gmat_data.mean()
gmat_25th = gmat_data.quantile(0.25)
gmat_50th = gmat_data.quantile(0.50)
gmat_75th = gmat_data.quantile(0.75)
# Major analysis - admission rates by major
major_stats = []
for major in df["major"].unique():
major_data = df[df["major"] == major]
admitted = len(major_data[major_data["admission"] == "Admit"])
total = len(major_data)
rate = (admitted / total) * 100
major_stats.append((major, admitted, total, rate))
# Sort by admission rate (descending)
major_stats.sort(key=lambda x: x[3], reverse=True)
# Work industry analysis - admission rates by industry
industry_stats = []
for industry in df["work_industry"].unique():
if pd.isna(industry):
continue
industry_data = df[df["work_industry"] == industry]
admitted = len(industry_data[industry_data["admission"] == "Admit"])
total = len(industry_data)
rate = (admitted / total) * 100
industry_stats.append((industry, admitted, total, rate))
# Sort by admission rate (descending)
industry_stats.sort(key=lambda x: x[3], reverse=True)
# Work experience analysis
work_exp_data = df["work_exp"].dropna()
avg_work_exp_all = work_exp_data.mean()
# Work experience for admitted students
admitted_students = df[df["admission"] == "Admit"]
admitted_work_exp = admitted_students["work_exp"].dropna()
avg_work_exp_admitted = admitted_work_exp.mean()
# Work experience ranges analysis
def categorize_work_exp(exp):
    if pd.isna(exp):
        return "Unknown"
    elif exp <= 2:
        return "0-2 years"
    elif exp <= 5:
        return "3-5 years"
    elif exp <= 8:
        return "6-8 years"
    else:
        return "9+ years"
# The exact experience bins are illustrative - adjust them to your dataset.
work_exp_categories = df["work_exp"].apply(categorize_work_exp)
# Admission rates by work experience category
work_exp_category_stats = []
for category in work_exp_categories.unique():
    if category == "Unknown":
        continue
    category_data = df[work_exp_categories == category]
    total = len(category_data)
    if total > 0:
        admitted = len(category_data[category_data["admission"] == "Admit"])
        rate = (admitted / total) * 100
        work_exp_category_stats.append((category, admitted, total, rate))
# Build the summary message
summary = f"""MBA Admissions Dataset Summary (2025)
Total Applications: {total_applications:,} people applied to the MBA program.
Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)
International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)
Academic Performance Statistics:
GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}
GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}
Major Analysis - Admission Rates by Academic Background:"""
for major, admitted, total, rate in major_stats:
    summary += (
        f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
summary += (
    "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
)
# Show top 8 industries by admission rate
for industry, admitted, total, rate in industry_stats[:8]:
    summary += (
        f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
summary += (
    f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
)
summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"
summary += "\n\nAdmission Rates by Work Experience Range:"
for category, admitted, total, rate in work_exp_category_stats:
    summary += (
        f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
    )
# Key insights
best_major = major_stats[0]
best_industry = industry_stats[0]
summary += "\n\nKey Insights:"
summary += (
    f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
)
summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"
if avg_work_exp_admitted > avg_work_exp_all:
    summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
else:
    summary += "\n- Work experience shows minimal difference between admitted and all applicants"
return summary
After defining the function, just call it and print the result:
print(get_summary_context_message(df))

Now let’s keep moving forward.
The cool part: Working with LLMs
This is where things get interesting, and where your manual data extraction work pays off.
Connecting Python to the LLM
If you have decent hardware, I highly recommend using a local LLM for such a simple task. I'm using Ollama with the latest Mistral model for the actual LLM processing (pull it with ollama pull mistral:latest if you haven't already).

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You just need to modify the following function to set the API key and return the appropriate chat model instance from LangChain.
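As a rough illustration of that swap (not from the original article), here is a minimal sketch assuming the langchain-openai package is installed and the OPENAI_API_KEY environment variable is set; the helper name get_llm_openai and the model name are placeholders:
from langchain_openai import ChatOpenAI

def get_llm_openai(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
    # Reads the API key from the OPENAI_API_KEY environment variable by default.
    return ChatOpenAI(model=model_name, temperature=0)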
Whichever option you choose, calling get_llm() with a test message should not return an error:
def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
"""
Create and configure a ChatOllama instance for local LLM inference.
This function initializes a ChatOllama client configured to connect to a
local Ollama server. The client is set up with deterministic output
(temperature=0) for consistent responses across multiple calls with the
same input.
Parameters
----------
model_name : str, optional
The name of the Ollama model to use for chat completions.
Must be a valid model name that is available on the local Ollama
installation. Default is "mistral:latest".
Returns
-------
ChatOllama
A configured ChatOllama instance ready for chat completions.
"""
return ChatOllama(
    model=model_name,
    base_url="http://localhost:11434",  # default local Ollama server address
    temperature=0,
)
print(get_llm().invoke("test").content)

The summarization prompt
This is where you can get creative and write highly specific instructions for the LLM. I decided to keep things light for demonstration purposes, but feel free to experiment here.
There is no right or wrong prompt.
Whatever you do, make sure to include the format parameters in curly brackets – these values will be populated dynamically later:
SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.
You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}
This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:
{context}
Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.
Report:
"""
The summarization function
With the prompt and get_llm() in place, the only thing left is to declare the function that ties them together. The get_report_summary() function takes the parameters that fill in the placeholders in the prompt, then calls the LLM with the formatted prompt to generate the report.
You can choose between Markdown and HTML formats:
def get_report_summary(
dataset: pd.DataFrame,
dataset_name: str,
dataset_source: str,
report_format: Literal["markdown", "html"] = "markdown",
) -> str:
"""
Generate an AI-powered summary report from a pandas DataFrame.
This function analyzes a dataset and generates a comprehensive summary report
using a large language model (LLM). It first extracts statistical context
from the dataset, then uses an LLM to create a human-readable report in the
specified format.
Parameters
----------
dataset : pd.DataFrame
The pandas DataFrame to analyze and summarize.
dataset_name : str
A descriptive name for the dataset that will be included in the
generated report for context and identification.
dataset_source : str
Information about the source or origin of the dataset.
report_format : {"markdown", "html"}, optional
The desired output format for the generated report. Options are:
- "markdown" : Generate report in Markdown format (default)
- "html" : Generate report in HTML format
Returns
-------
str
A formatted summary report.
"""
context_message = get_summary_context_message(df=dataset)
prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
dataset_name=dataset_name,
dataset_source=dataset_source,
context=context_message,
report_format=report_format,
)
return get_llm().invoke(input=prompt).content
Using this function is simple – just pass the dataset, its name and source. The report format defaults to Markdown:
md_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",  # replace with the dataset URL
)
print(md_report)
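For the HTML version, the same call works with report_format set to "html" (the "Kaggle" source string here is again a placeholder for the dataset link):
html_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",  # replace with the dataset URL
    report_format="html",
)
print(html_report)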

The HTML report renders as well, though it could use some styling. Maybe you can ask the LLM to handle that, too!

What can you improve
I could easily turn this into a 30-minute read by optimizing every detail of the pipeline, but I kept things simple for demonstration purposes. You don't have to (and shouldn't) stop here.
Here are things you can improve to make this pipeline even more powerful:
- Write a function that saves the report (Markdown or HTML) directly to disk (see the sketch after this list). This way, you can automate the entire process and generate reports on a schedule without manual intervention.
- In the prompt, require the LLM to add CSS styling to the HTML report so it looks polished when rendered. You can even provide your company's brand colors and fonts to keep all data reports consistent.
- Extend the prompt with more specific instructions. You may want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
- Extend the get_llm() function so it can connect to vendors other than Ollama, such as OpenAI, Anthropic, or Google. This gives you the flexibility to switch between on-premises and cloud-based models, depending on your needs.
- Iterate on the get_summary_context_message() function, since it is the basis for all the context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and the data insights that matter for your specific use case.
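As a starting point for the first item above, here is a minimal sketch of a save function; the name save_report and the example file path are illustrative, not part of the original pipeline:
from pathlib import Path

def save_report(report: str, path: str) -> Path:
    # Write the generated report (Markdown or HTML) to disk, creating folders as needed.
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(report, encoding="utf-8")
    return output_path

# Example: save_report(md_report, "reports/mba_admissions_summary.md")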
I hope this minimal example puts you on the right track to automate your own data reporting workflow.