
A coding tutorial on the Model Context Protocol, focusing on semantic chunking, dynamic token management, and context relevance scoring for efficient LLM interactions

Efficiently managing context is a critical challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed the available token window. In this tutorial, we walk through a practical implementation of a ModelContextManager that automatically chunks incoming text, generates semantic embeddings with a sentence-transformer model, and scores each chunk by recency, importance, and semantic relevance. You will learn how to integrate the manager with a Hugging Face sequence-to-sequence model, demonstrated here with Flan-T5, to add, optimize, and retrieve only the most relevant context. Along the way, we cover token counting with the GPT-2 tokenizer, context-window optimization strategies, and an interactive session that lets you query and visualize the dynamic context in real time.
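
To make the token-budget idea concrete before diving into the full class, here is a minimal, standalone sketch of the kind of token counting the manager performs; it assumes the transformers package is available, and the sample sentence is just a placeholder.

from transformers import GPT2Tokenizer

# Count tokens the same way the manager does: encode text and take the length.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
sample = "Context management keeps prompts within the model's token budget."
print(len(tokenizer.encode(sample)), "tokens")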

import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm

We import the core libraries for building the dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules such as time and gc support timestamping and memory cleanup, and tqdm.notebook provides an interactive progress bar for chunk processing in Colab.

@dataclass
class ContextChunk:
    """A chunk of text with metadata for the Model Context Protocol."""
    text: str
    embedding: Optional[torch.Tensor] = None
    importance: float = 1.0
    timestamp: float = 0.0
    metadata: Dict[str, Any] = None
   
    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}
        if self.timestamp == 0.0:
            self.timestamp = time.time()

The ContextChunk dataclass encapsulates a single piece of text along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures each chunk is stamped with the current time at creation and that metadata defaults to an empty dictionary if none is provided.
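
As a quick sanity check of those defaults, instantiating the dataclass with only a text argument (a placeholder string) shows the auto-filled fields:

# Illustrative only: exercise the defaults of the ContextChunk dataclass above.
chunk = ContextChunk(text="MCP keeps only the most relevant context in the window.")
print(chunk.timestamp > 0)   # True: stamped by __post_init__
print(chunk.metadata)        # {}: defaults to an empty dict
print(chunk.importance)      # 1.0: default importance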

class ModelContextManager:
    """
    Manager for implementing Model Context Protocol in LLMs on Google Colab.
    Handles context window optimization, token management, and relevance scoring.
    """
   
    def __init__(
        self,
        max_context_length: int = 8192,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        relevance_threshold: float = 0.7,
        recency_weight: float = 0.3,
        importance_weight: float = 0.3,
        semantic_weight: float = 0.4,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the Model Context Manager.
       
        Args:
            max_context_length: Maximum number of tokens in context window
            embedding_model: Model to use for text embeddings
            relevance_threshold: Threshold for chunk relevance to be included
            recency_weight: Weight for recency in relevance calculation
            importance_weight: Weight for importance in relevance calculation
            semantic_weight: Weight for semantic similarity in relevance calculation
            device: Device to run computations on
        """
        self.max_context_length = max_context_length
        self.device = device
        self.chunks = []
        self.current_token_count = 0
        self.relevance_threshold = relevance_threshold
       
        self.recency_weight = recency_weight
        self.importance_weight = importance_weight
        self.semantic_weight = semantic_weight
       
        try:
            from sentence_transformers import SentenceTransformer
            print(f"Loading embedding model {embedding_model}...")
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
        except ImportError:
            print("Installing sentence-transformers...")
            import subprocess
            subprocess.run(["pip", "install", "sentence-transformers"])
            from sentence_transformers import SentenceTransformer
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
           
        try:
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
   
    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
        """
        Add a new chunk of text to the context manager.
       
        Args:
            text: The text content to add
            importance: Importance score (0-1)
            metadata: Additional metadata for the chunk
        """
        with torch.no_grad():
            embedding = self.embedding_model.encode(text, convert_to_tensor=True)
       
        chunk = ContextChunk(
            text=text,
            embedding=embedding,
            importance=importance,
            timestamp=time.time(),
            metadata=metadata or {}
        )
       
        self.chunks.append(chunk)
        self.current_token_count += len(self.tokenizer.encode(text))
       
        if self.current_token_count > self.max_context_length:
            self.optimize_context()
   
    def optimize_context(self) -> None:
        """Optimize context by removing less relevant chunks to fit within token limit."""
        if not self.chunks:
            return
           
        print("Optimizing context window...")
       
        scores = self.score_chunks()
       
        sorted_indices = np.argsort(scores)[::-1]
       
        new_chunks = []
        new_token_count = 0
       
        for idx in sorted_indices:
            chunk = self.chunks[idx]
            chunk_tokens = len(self.tokenizer.encode(chunk.text))
           
            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            else:
                # Chunk does not fit: keep it only if it is well above the relevance
                # threshold, by swapping out an already-included, lower-scored chunk.
                if scores[idx] > self.relevance_threshold * 1.5:
                    for i, included_chunk in enumerate(new_chunks):
                        included_idx = sorted_indices[i]
                        if scores[included_idx] < self.relevance_threshold:
                            included_tokens = len(self.tokenizer.encode(included_chunk.text))
                            if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                                new_chunks[i] = chunk
                                new_token_count += chunk_tokens - included_tokens
                                break

        removed_count = len(self.chunks) - len(new_chunks)
        self.chunks = new_chunks
        self.current_token_count = new_token_count

        print(f"Context optimized: removed {removed_count} chunks, {new_token_count} tokens in use")
        gc.collect()

    def score_chunks(self, query: str = None) -> np.ndarray:
        """
        Score chunks based on recency, importance, and semantic relevance.
       
        Args:
            query: Optional query to calculate semantic relevance against
           
        Returns:
            Array of scores for each chunk
        """
        if not self.chunks:
            return np.array([])
           
        current_time = time.time()
        max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
        recency_scores = np.array([
            1.0 - ((current_time - chunk.timestamp) / max_age)
            for chunk in self.chunks
        ])
       
        importance_scores = np.array([chunk.importance for chunk in self.chunks])
       
        if query is not None:
            query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
            similarity_scores = np.array([
                torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                for chunk in self.chunks
            ])
           
            similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
        else:
            similarity_scores = np.ones(len(self.chunks))
       
        final_scores = (
            self.recency_weight * recency_scores +
            self.importance_weight * importance_scores +
            self.semantic_weight * similarity_scores
        )
       
        return final_scores
   
    def retrieve_context(self, query: str = None, k: int = None) -> str:
        """
        Retrieve the most relevant context for a given query.
       
        Args:
            query: The query to retrieve context for
            k: The maximum number of chunks to return (None = all relevant chunks)
           
        Returns:
            String containing the combined relevant context
        """
        if not self.chunks:
            return ""
           
        scores = self.score_chunks(query)
       
        relevant_indices = np.where(scores >= self.relevance_threshold)[0]
       
        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]
       
        if k is not None:
            relevant_indices = relevant_indices[:k]
           
        relevant_texts = [self.chunks[i].text for i in relevant_indices]
        return "nn".join(relevant_texts)
   
    def get_stats(self) -> Dict[str, Any]:
        """Get statistics about the current context state."""
        return {
            "chunk_count": len(self.chunks),
            "token_count": self.current_token_count,
            "max_tokens": self.max_context_length,
            "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
            "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
            "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
        }


    def visualize_context(self):
        """Visualize the current context window distribution."""
        try:
            import matplotlib.pyplot as plt
            import pandas as pd
           
            if not self.chunks:
                print("No chunks to visualize")
                return
           
            scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]
           
            df = pd.DataFrame({
                'Size (tokens)': chunk_sizes,
                'Age (seconds)': relative_times,
                'Importance': importance,
                'Score': scores
            })
           
            fig, axs = plt.subplots(2, 2, figsize=(14, 10))
           
            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')
           
            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')
           
            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')
           
            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')
           
            plt.tight_layout()
            plt.show()
           
        except ImportError:
            print("Please install matplotlib and pandas for visualization")
            print('!pip install matplotlib pandas')

The ModelContextManager class coordinates end-to-end handling of the LLM context by chunking input text, generating embeddings, and tracking token usage. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most relevant chunks, and convenient utilities to monitor and visualize context statistics.
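
Before wiring the manager to a model, it can be exercised on its own. The sketch below uses only the methods defined above; the strings are placeholders, and the relevance threshold is lowered so a tiny two-chunk example still returns something (with the default weights, a chunk's final score is 0.3 * recency + 0.3 * importance + 0.4 * similarity).

# Illustrative usage of ModelContextManager on its own.
manager = ModelContextManager(max_context_length=1024, relevance_threshold=0.5)

manager.add_chunk(
    "MCP scores each chunk by recency, importance, and semantic similarity.",
    importance=0.9,
)
manager.add_chunk(
    "Low-priority filler text that may be pruned during optimization.",
    importance=0.2,
)

print(manager.get_stats())                                       # chunk/token counts, usage percentage
print(manager.retrieve_context(query="How are chunks scored?", k=1))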

class MCPColabDemo:
    """Demonstration of Model Context Protocol in Google Colab with a Language Model."""
   
    def __init__(
        self,
        model_name: str = "google/flan-t5-base",
        max_context_length: int = 2048,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the MCP Colab demo with a specified model.
       
        Args:
            model_name: Hugging Face model name
            max_context_length: Maximum context length for the MCP manager
            device: Device to run the model on
        """
        self.device = device
        self.context_manager = ModelContextManager(
            max_context_length=max_context_length,
            device=device
        )
       
        try:
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            print(f"Loading model {model_name}...")
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
   
    def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
        """
        Add a document to the context by chunking it appropriately.
       
        Args:
            text: Document text
            chunk_size: Size of each chunk in characters
            overlap: Overlap between chunks in characters
        """
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
            if len(chunk) > 20:  
                chunks.append(chunk)
       
        print(f"Adding {len(chunks)} chunks to context...")
        for i, chunk in enumerate(tqdm(chunks)):
            pos = i / len(chunks)
            importance = 1.0 - 0.5 * min(pos, 1 - pos)
           
            self.context_manager.add_chunk(
                text=chunk,
                importance=importance,
                metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
            )
   
    def process_query(self, query: str, max_new_tokens: int = 256) -> str:
        """
        Process a query using the context manager and model.
       
        Args:
            query: The query to process
            max_new_tokens: Maximum number of tokens in response
           
        Returns:
            Model response
        """
        self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})
       
        relevant_context = self.context_manager.retrieve_context(query=query)
       
        prompt = f"Context: {relevant_context}nnQuestion: {query}nnAnswer:"
       
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
       
        print("Generating response...")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
       
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
       
        self.context_manager.add_chunk(
            response,
            importance=0.9,
            metadata={"type": "response", "query": query}
        )
       
        return response
   
    def interactive_session(self):
        """Run an interactive session in the notebook."""
        from IPython.display import clear_output
       
        print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []
       
        while True:
            query = input("nYour query: ")
           
            if query.lower() == 'exit':
                break
               
            if query.lower() == 'stats':
                print("nContext Statistics:")
                stats = self.context_manager.get_stats()
                for key, value in stats.items():
                    print(f"{key}: {value}")
                self.context_manager.visualize_context()
                continue
               
            if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
                clear_output(wait=True)
                print("Context cleared!")
                continue
           
            response = self.process_query(query)
            conversation_history.append((query, response))
           
            print("nResponse:")
            print(response)
            print("n" + "-"*50)
           
            stats = self.context_manager.get_stats()
            print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens ({stats['usage_percentage']:.1f}%)")

The MCPColabDemo class wires the context manager to a seq2seq LLM: it loads Flan-T5 (or any specified Hugging Face model) on the selected device, provides an add_document method that chunks and ingests an entire document with overlap, answers user queries against only the most relevant retrieved context, and runs an interactive Colab session with commands for real-time statistics, visualization, and clearing the context.
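
A short sketch of how that class might be driven from a notebook cell follows; the document text is a repeated placeholder, and interactive_session() is optional since it blocks on input().

# Illustrative driver for MCPColabDemo; the document text is a placeholder.
demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)

sample_document = (
    "The Model Context Protocol chunks documents, embeds each chunk, and keeps "
    "only the most relevant pieces inside the token budget. "
) * 20

demo.add_document(sample_document, chunk_size=256, overlap=25)
print(demo.process_query("What does the Model Context Protocol do?"))

# demo.interactive_session()  # optional: starts the blocking input() loop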

def run_mcp_demo():
    """Run a simple demo of the Model Context Protocol."""
    print("Running Model Context Protocol Demo...")
   
    context_manager = ModelContextManager(max_context_length=4096)
   
    print("Adding sample chunks...")
   
    context_manager.add_chunk(
        "The Model Context Protocol (MCP) is a framework for managing context "
        "windows in large language models. It helps optimize token usage and improve relevance.",
        importance=1.0
    )
   
    context_manager.add_chunk(
        "Context management involves techniques like sliding windows, chunking, "
        "and relevance filtering to handle large documents efficiently.",
        importance=0.8
    )
   
    for i in range(10):
        context_manager.add_chunk(
            f"This is test chunk {i} with some filler content to simulate a larger context "
            f"window that needs optimization. This helps demonstrate the MCP functionality "
            f"for context window management in language models on Google Colab.",
            importance=0.5 - (i * 0.02)  
        )
   
    stats = context_manager.get_stats()
    print("nInitial Statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")
       
    query = "How does the Model Context Protocol work?"
    print(f"nRetrieving context for: '{query}'")
    context = context_manager.retrieve_context(query)
    print(f"nRelevant context:n{context}")
   
    print("nVisualizing context:")
    context_manager.visualize_context()
   
    print("nDemo complete!")

The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.

if __name__ == "__main__":
    run_mcp_demo()

Finally, this standard Python entry-point guard ensures that run_mcp_demo() executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol.

All in all, we now have a fully featured MCP system that not only keeps token usage under control but also prioritizes the context snippets that matter most to your queries. The ModelContextManager gives you the tools to balance semantic relevance, temporal recency, and user-assigned importance, while the included MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. With these patterns in place, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating alternative LLM backends to tailor the workflow to your domain. Ultimately, this approach lets you build concise yet highly relevant prompts that yield more accurate and efficient responses from your language model.
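
As a concrete starting point for that kind of experimentation, the constructor parameters shown earlier can simply be overridden; the embedding model named below is one illustrative alternative, not the tutorial's default.

# Example of retuning the manager; values and model name are illustrative choices.
custom_manager = ModelContextManager(
    max_context_length=4096,
    embedding_model="sentence-transformers/all-mpnet-base-v2",  # heavier, higher-quality embeddings
    relevance_threshold=0.6,   # admit more chunks into the retrieved context
    recency_weight=0.2,
    importance_weight=0.3,
    semantic_weight=0.5,       # weights ideally sum to 1.0 so scores stay in [0, 1]
)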

