
AI Agents for Processing Time Series and Large DataFrames

Agents are AI systems powered by LLMs that can reason about their goals and take actions to achieve them. Their purpose is not only to respond to queries, but also to orchestrate a range of operations, including processing data (i.e., dataframes and time series). This capability unlocks many real-world applications that democratize access to data analytics, such as automated reporting, codeless queries, and support for data cleaning and manipulation.

An agent can interact with a dataframe in two different ways:

  • Natural language – the LLM reads the table as a string and tries to understand it based on its knowledge base.
  • Generate and execute code – the agent activates a tool to process the dataset as an object.

Thus, by combining the power of NLP with the precision of code execution, AI agents enable a wider range of users to interact with complex datasets and extract insights.

In this tutorial, I will show you how to process dataframes and time series using AI agents. I’ll present some useful Python code that can be easily applied in other similar cases (just copy, paste, run), and I’ll walk through every line of code with comments so that you can replicate this example (link to the full code at the end of the post).

Setup

Let’s start by setting up Ollama (pip install ollama==0.4.7), a library that allows users to run open-source LLMs locally, without cloud-based services, providing more control over data privacy and performance. Since it runs locally, no conversation data leaves your machine.

First, you need to download Ollama from the website.

Then, from your machine’s command prompt, use the command below to download the selected LLM. I’m going with Alibaba’s Qwen, because it is both smart and light.
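This is the standard Ollama CLI syntax for pulling a model from the registry:

ollama pull qwen2.5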

Once the download is complete, you can proceed to Python and start writing the code.

import ollama
llm = "qwen2.5"

Let’s test the LLM:

stream = ollama.generate(model=llm, prompt='''what time is it?''', stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)

Time series

A time series is a sequence of data points recorded over time, often used for analysis and forecasting. It allows us to see how a variable changes over time and to identify trends and seasonal patterns.

I’ll generate a fake time series dataset for use as an example.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## create data (the generation code was truncated in the original; this is a
## minimal reconstruction: 4 years of monthly sales with trend and noise)
np.random.seed(1)
dates = pd.date_range(start="2020-01-01", periods=48, freq="MS")
sales = 50 + np.arange(48)*1.5 + np.random.normal(scale=5, size=48)
ts = pd.Series(data=sales, index=dates, name="sales")
ts.plot(title="Monthly Sales")
plt.show()

Generally, time series datasets have a very simple structure: the main variable is a column, and the time is the index.

Before converting it to a string, I want to make sure the index is moved into a column, so we don’t lose any information.

dtf = ts.reset_index().rename(columns={"index":"date"})
dtf.head()

Then, I’ll change the data type from dataframe to dictionary.

data = dtf.to_dict(orient='records')
data[0:5]

And finally, from dictionary to string.

str_data = "\n".join([str(row) for row in data])
str_data

Now we have a string that can be included in a prompt, and any language model can handle it. When you paste a dataset into the prompt, the LLM reads the data as plain text, but it can still understand the structure and meaning based on the patterns it saw during training.

prompt = f'''
Analyze this dataset, it contains monthly sales data of an online retail product:
{str_data}
'''

We can easily chat with the LLM. Note that, for now, this is not an agent, because it doesn’t have any tools; we’re just using the plain language model. While it doesn’t crunch numbers like a computer, the LLM can identify column names, time-based patterns, trends, and outliers, especially in smaller datasets. It can simulate analysis and interpret findings, but it does not perform precise calculations independently, because it doesn’t execute code like an agent.

messages = [{"role":"system", "content":prompt}]
while True:
    ## user
    q = input('🙂 >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )

    ## model
    agent_res = ollama.chat(model=llm, messages=messages, tools=[])
    res = agent_res["message"]["content"]

    ## response
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

The LLM recognizes numbers and understands the general context, the same way it might understand a recipe or a line of code. 

As you can see, using LLMs to analyze time series is great for quick and conversational insights.

Agent

LLMs are good for brainstorming and light exploration, while an Agent can run code. Therefore, it can handle more complex tasks like plotting, forecasting, and anomaly detection. So, let’s create the Tools.

Sometimes, it can be more effective to treat the “final answer” as a Tool. For example, if the Agent does multiple actions to generate intermediate results, the final answer can be thought of as the Tool that integrates all of this information into a cohesive response. By designing it this way, you have more customization and control over the results.

def final_answer(text:str) -> str:
    return text

tool_final_answer = {'type':'function', 'function':{
  'name': 'final_answer',
  'description': 'Returns a natural language response to the user',
  'parameters': {'type': 'object',
                'required': ['text'],
                'properties': {'text':{'type':'str', 'description':'Natural Language Response'}}
}}}

final_answer(text="hi")

Then, the coding tool.

import io
import contextlib

def code_exec(code:str) -> str:
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        try:
            exec(code)
        except Exception as e:
            print(f"Error: {e}")
    return output.getvalue()

tool_code_exec = {'type':'function', 'function':{
  'name': 'code_exec',
  'description': 'Execute python code. Use always the function print() to get the output.',
  'parameters': {'type': 'object',
                'required': ['code'],
                'properties': {
                    'code': {'type':'str', 'description':'code to execute'},
}}}}

code_exec("from datetime import datetime; print(datetime.now().strftime('%H:%M'))")

And I’ll add a couple of utility functions, used to invoke the tools and to run the agent.

dic_tools = {"final_answer":final_answer, "code_exec":code_exec}

# Utils
def use_tool(agent_res:dict, dic_tools:dict) -> dict:
    ## use tool
    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                print(t_output)
                ### final res
                res = t_output
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
    ## don't use tool
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
        t_name, t_inputs = '', ''
    return {'res':res, 'tool_used':t_name, 'inputs_used':t_inputs}
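
For reference, here is a minimal sketch of the message shape that use_tool expects. The structure mirrors the ollama-python response format, and the values are purely illustrative:

## illustrative tool-calling response (values are made up)
agent_res = {"message": {"content": "",
                         "tool_calls": [{"function": {"name": "code_exec",
                                                      "arguments": {"code": "print(1+1)"}}}]}}
use_tool(agent_res, dic_tools)  ## -> {'res':'2\n', 'tool_used':'code_exec', 'inputs_used':{'code':'print(1+1)'}}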

When an agent tries to solve a task, I want it to track the tools used, the inputs tried, and the results obtained. Iteration can only stop when the model is ready to give the final answer.

def run_agent(llm, messages, available_tools):
    tool_used, local_memory = '', ''
    while tool_used != 'final_answer':
        ### use tools
        try:
            agent_res = ollama.chat(model=llm, messages=messages,
                                    tools=[v for v in available_tools.values()])
            dic_res = use_tool(agent_res, dic_tools)
            res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
        ### error
        except Exception as e:
            print("⚠️ >", e)
            res = f"I tried to use {tool_used} but didn't work. I will try something else."
            print("👽 >", f"x1b[1;30m{res}x1b[0m")
            messages.append( {"role":"assistant", "content":res} )
        ### update memory
        if tool_used not in ['','final_answer']:
            local_memory += f"\nTool used: {tool_used}.\nInput used: {inputs_used}.\nOutput: {res}"
            messages.append( {"role":"assistant", "content":local_memory} )
            available_tools.pop(tool_used)
            if len(available_tools) == 1:
                messages.append( {"role":"user", "content":"now activate the tool final_answer."} )
        ### tools not used
        if tool_used == '':
            break
    return res

Regarding the coding tool, I noticed that agents tend to recreate the dataframe at every step. So I’ll use a memory reinforcement to remind the model that the dataset already exists. This is a common trick to get the desired behavior; ultimately, memory reinforcements help you obtain more meaningful and effective interactions.

## start a chat
messages = [{"role":"system", "content":prompt}]
memory = '''The dataset already exists and it's called 'dtf', don't create a new one.'''
while True:
    ## user
    q = input('🙂 >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )

    ## memory
    messages.append( {"role":"user", "content":memory} )

    ## model
    available_tools = {"final_answer":tool_final_answer, "code_exec":tool_code_exec}
    res = run_agent(llm, messages, available_tools)

    ## response
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

Creating a plot is something that the LLM alone can’t do. But keep in mind that even if Agents can create images, they can’t see them, because after all, the engine is still a language model. So the user is the only one who visualises the plot.
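
For instance, a plotting request might produce something like the following (a hypothetical example, assuming the date/sales columns of the dataframe built earlier; the agent’s actual output varies from run to run):

## hypothetical agent-generated code for a plotting request
import matplotlib.pyplot as plt
dtf.plot(x="date", y="sales", kind="line", title="Monthly Sales")
plt.show()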

The Agent is using the library statsmodels to train a model and forecast the time series. 
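A minimal sketch of what that generated code could look like, assuming a Holt-Winters exponential smoothing model (the agent may well pick a different statsmodels model):

## hypothetical agent-generated forecasting code
from statsmodels.tsa.api import ExponentialSmoothing

## fit an additive-trend model on the sales series and forecast 12 months
model = ExponentialSmoothing(dtf.set_index("date")["sales"], trend="add").fit()
print(model.forecast(12))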

Large Dataframes

LLMs have limited memory, which restricts how much information they can process at once; even the most advanced models have token limits (a few hundred pages of text). Additionally, LLMs don’t retain memory across sessions unless a retrieval system is integrated. In practice, to work effectively with large dataframes, developers often use strategies like chunking, RAG, vector databases, and summarizing content before feeding it into the model, as sketched below.
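
As an illustration of the last strategy, one can pass a small sample plus summary statistics instead of the full table (a sketch with illustrative variable names, not the article’s own code):

## sketch of "summarize before prompting": send a sample and statistics
## instead of the full dataframe, which would exceed the context window
sample = dtf.head(10).to_string()
summary = dtf.describe().to_string()
context = f"Sample rows:\n{sample}\n\nSummary statistics:\n{summary}"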

Let’s create a big dataset to play with.

import random
import string

length = 1000

dtf = pd.DataFrame(data={
    'Id': [''.join(random.choices(string.ascii_letters, k=5)) for _ in range(length)],
    'age': np.random.randint(low=18, high=80, size=length),
    'score': np.random.uniform(low=50, high=100, size=length),
    'status': np.random.choice(['Active','Inactive','Pending'], size=length)
})

dtf.tail()

I’ll add a web search tool, so that, with the ability to execute Python code and search the internet, our general-purpose agent can access all the available knowledge and make data-driven decisions.

In Python, the easiest way to create a web search tool is with the famous privacy-focused search engine DuckDuckGo (pip install duckduckgo-search==6.3.5). You can use the original library directly or import the LangChain wrapper (pip install langchain-community==0.3.17).

from langchain_community.tools import DuckDuckGoSearchResults

def search_web(query:str) -> str:
  return DuckDuckGoSearchResults(backend="news").run(query)

tool_search_web = {'type':'function', 'function':{
  'name': 'search_web',
  'description': 'Search the web',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the topic or subject to search on the web'},
}}}}

search_web(query="nvidia")

Overall, the agent now has 3 tools.

dic_tools = {'final_answer':final_answer,
             'search_web':search_web,
             'code_exec':code_exec}

Since I can’t fit the full dataframe in the prompt, I’ll only feed in the first 10 rows, so that the LLM can understand the general structure of the dataset. Additionally, I will specify where to find the complete dataset.

str_data = "\n".join([str(row) for row in dtf.head(10).to_dict(orient='records')])

prompt = f'''
You are a Data Analyst, you will be given a task to solve as best you can.
You have access to the following tools:
- tool 'final_answer' to return a text response.
- tool 'code_exec' to execute Python code.
- tool 'search_web' to search for information on the internet.

If you use the 'code_exec' tool, remember to always use the function print() to get the output.
The dataset already exists and it's called 'dtf', don't create a new one.

This dataset contains the credit score of each customer of the bank. Here are the first rows:
{str_data}
'''

Finally, we can run the agent.

messages = [{"role":"system", "content":prompt}]
memory = '''
The dataset already exists and it's called 'dtf', don't create a new one.
'''
while True:
    ## User
    q = input('🙂 >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )

    ## Memory
    messages.append( {"role":"user", "content":memory} )     
   
    ## Model
    available_tools = {"final_answer":tool_final_answer, "code_exec":tool_code_exec, "search_web":tool_search_web}
    res = run_agent(llm, messages, available_tools)
   
    ## Response
    print("👽 >", f"x1b[1;30m{res}x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

In this interaction, the agent uses the coding tool correctly. Now, I want to make it use the other tools as well.

Finally, I ask the agent to wrap up all the information gathered so far in this chat.

Conclusion

This article has been a tutorial on how to build, from scratch, agents that process time series and large dataframes. We covered two ways a model can interact with data: through natural language, where the LLM interprets the table as a string using its knowledge base, and through code generation and execution, where tools process the dataset as an object.

You can find the complete code for this article on GitHub.

Hope you like it! Please feel free to contact me for questions and feedback, or just share your interesting projects.

👉 Let’s connect 👈
