How to Build a Powerful and Intelligent Question-Answering System Using the Tavily Search API, Chroma, Google Gemini LLM, and the LangChain Framework

In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of the Tavily Search API, Chroma, Google Gemini LLM, and the LangChain framework. The pipeline combines real-time web search via Tavily, semantic document caching with the Chroma vector store, and contextual response generation through Gemini models. These tools are integrated through LangChain's modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. The system goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks cached embeddings before invoking a fresh web search. The retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Key features such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector-store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents.
!pip install -qU langchain-community tavily-python langchain-google-genai streamlit matplotlib pandas tiktoken chromadb langchain_core pydantic langchain
We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval (tavily-python, chromadb), LLM integration (langchain-google-genai, langchain), data handling (pandas, pydantic), visualization (matplotlib, streamlit), and tokenization (tiktoken). These components form the core foundation for building a real-time, context-aware question-answering system.
import os
import getpass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime
We import the Python libraries used throughout the notebook. These include standard modules for environment variables, secure input, time tracking, and data types (os, getpass, time, typing, datetime), core data science tools such as pandas, matplotlib, and numpy for data handling, visualization, and numerical computation, and json for parsing structured data.
if "TAVILY_API_KEY" not in os.environ:
os.environ["TAVILY_API_KEY"] = getpass.getpass("Enter Tavily API key: ")
if "GOOGLE_API_KEY" not in os.environ:
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Google API key: ")
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)
We securely initialize the Tavily and Google Gemini API keys, prompting the user only when they are not already set in the environment, which ensures safe and repeatable access to external services. We also configure standardized logging with Python's logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook.
from langchain_community.retrievers import TavilySearchAPIRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.memory import ConversationBufferMemory
We import the key components from the LangChain ecosystem and its integrations: TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and the Google Generative AI modules for chat and embedding models. Core LangChain modules such as ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and the output parsers enable flexible prompt construction, memory handling, and pipeline execution.
class SearchQueryError(Exception):
    """Exception raised for errors in the search query."""
    pass

def format_docs(docs):
    formatted_content = []
    for i, doc in enumerate(docs):
        metadata = doc.metadata
        source = metadata.get('source', 'Unknown source')
        title = metadata.get('title', 'Untitled')
        score = metadata.get('score', 0)
        formatted_content.append(
            f"Document {i+1} [Score: {score:.2f}]:\n"
            f"Title: {title}\n"
            f"Source: {source}\n"
            f"Content: {doc.page_content}\n"
        )
    return "\n\n".join(formatted_content)
We define two basic components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as the title, source, and relevance score and formatting them into a clean, readable string.
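As a quick, hypothetical sanity check (the sample Document below is made up and not part of the tutorial's search results), the following sketch shows the kind of string format_docs produces:

sample_docs = [
    Document(
        page_content="The Legend of Zelda: Breath of the Wild was released in March 2017.",
        metadata={"source": "https://example.com/zelda", "title": "Zelda Overview", "score": 0.92},
    )
]
print(format_docs(sample_docs))
# Prints something like:
# Document 1 [Score: 0.92]:
# Title: Zelda Overview
# Source: https://example.com/zelda
# Content: The Legend of Zelda: Breath of the Wild was released in March 2017.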
class SearchResultsParser:
    def parse(self, text):
        try:
            if isinstance(text, str):
                import re
                import json
                json_match = re.search(r'{.*}', text, re.DOTALL)
                if json_match:
                    json_str = json_match.group(0)
                    return json.loads(json_str)
                return {"answer": text, "sources": [], "confidence": 0.5}
            elif hasattr(text, 'content'):
                return {"answer": text.content, "sources": [], "confidence": 0.5}
            else:
                return {"answer": str(text), "sources": [], "confidence": 0.5}
        except Exception as e:
            logger.warning(f"Failed to parse JSON: {e}")
            return {"answer": str(text), "sources": [], "confidence": 0.5}
The SearchResultsParser class provides a reliable method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output and falls back to a plain-text response format if parsing fails. It gracefully handles both string outputs and message objects, ensuring consistent downstream processing. When an error occurs, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, which strengthens the system's fault tolerance.
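For illustration, a small usage sketch (with made-up strings rather than real model output) exercises both the JSON path and the plain-text fallback:

parser = SearchResultsParser()
# A response that embeds JSON is parsed into a dict
print(parser.parse('Result: {"answer": "2017", "sources": ["doc1"], "confidence": 0.9}'))
# A response with no JSON falls back to the default structure
print(parser.parse("plain text answer"))
# -> {'answer': 'plain text answer', 'sources': [], 'confidence': 0.5}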
class EnhancedTavilyRetriever:
    def __init__(self, api_key=None, max_results=5, search_depth="advanced", include_domains=None, exclude_domains=None):
        self.api_key = api_key
        self.max_results = max_results
        self.search_depth = search_depth
        self.include_domains = include_domains or []
        self.exclude_domains = exclude_domains or []
        self.retriever = self._create_retriever()
        self.previous_searches = []

    def _create_retriever(self):
        try:
            return TavilySearchAPIRetriever(
                api_key=self.api_key,
                k=self.max_results,
                search_depth=self.search_depth,
                include_domains=self.include_domains,
                exclude_domains=self.exclude_domains
            )
        except Exception as e:
            logger.error(f"Failed to create Tavily retriever: {e}")
            raise

    def invoke(self, query, **kwargs):
        if not query or not query.strip():
            raise SearchQueryError("Empty search query")
        try:
            start_time = time.time()
            results = self.retriever.invoke(query, **kwargs)
            end_time = time.time()
            search_record = {
                "timestamp": datetime.now().isoformat(),
                "query": query,
                "num_results": len(results),
                "response_time": end_time - start_time
            }
            self.previous_searches.append(search_record)
            return results
        except Exception as e:
            logger.error(f"Search failed: {e}")
            raise SearchQueryError(f"Failed to perform search: {str(e)}")

    def get_search_history(self):
        return self.previous_searches
The EnhancedTavilyRetriever class is a custom wrapper around TavilySearchAPIRetriever that adds flexibility, control, and traceability to search operations. It supports advanced features such as configurable search depth, domain include/exclude filters, and an adjustable result count. The invoke method performs a web search and tracks the metadata of each query (timestamp, response time, and result count), storing it for later analysis.
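A brief usage sketch (assuming the TAVILY_API_KEY set earlier and an illustrative query) shows how the wrapper is invoked and how its search history accumulates:

# Usage sketch: api_key defaults to None, so the underlying retriever reads the environment variable
demo_retriever = EnhancedTavilyRetriever(max_results=2)
demo_docs = demo_retriever.invoke("latest developments in retrieval-augmented generation")
for record in demo_retriever.get_search_history():
    print(record["query"], "-", record["num_results"], "results in", f"{record['response_time']:.2f}s")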
class SearchCache:
    def __init__(self):
        self.embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.vector_store = None
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    def add_documents(self, documents):
        if not documents:
            return
        try:
            if self.vector_store is None:
                self.vector_store = Chroma.from_documents(
                    documents=documents,
                    embedding=self.embedding_function
                )
            else:
                self.vector_store.add_documents(documents)
        except Exception as e:
            logger.error(f"Failed to add documents to cache: {e}")

    def search(self, query, k=3):
        if self.vector_store is None:
            return []
        try:
            return self.vector_store.similarity_search(query, k=k)
        except Exception as e:
            logger.error(f"Vector search failed: {e}")
            return []
The SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method retrieves the most relevant cached documents based on semantic similarity. This reduces redundant API calls, improves response times for repeated or related queries, and serves as a lightweight hybrid memory layer in the AI assistant pipeline.
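A small usage sketch (seeded with a hypothetical hand-written Document; the real pipeline feeds it Tavily results instead) illustrates the add-then-search flow:

demo_cache = SearchCache()
demo_cache.add_documents([
    Document(
        page_content="Chroma is an open-source vector database for storing embeddings.",
        metadata={"source": "https://example.com/chroma", "title": "Chroma Notes", "score": 1.0},
    )
])
hits = demo_cache.search("what is a vector database?", k=1)
print(hits[0].page_content if hits else "no cache hit")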
search_cache = SearchCache()
enhanced_retriever = EnhancedTavilyRetriever(max_results=5)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
system_template = """You are a research assistant that provides accurate answers based on the search results provided.
Follow these guidelines:
1. Only use the context provided to answer the question
2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question."
3. Cite your sources by referencing the document numbers
4. Don't make up information
5. Keep the answer concise but complete
Context: {context}
Chat History: {chat_history}
"""
system_message = SystemMessagePromptTemplate.from_template(system_template)
human_template = "Question: {question}"
human_message = HumanMessagePromptTemplate.from_template(human_template)
prompt = ChatPromptTemplate.from_messages([system_message, human_message])
We initialize the core components of the AI assistant: the semantic SearchCache, the EnhancedTavilyRetriever for web-based queries, and a ConversationBufferMemory that preserves chat history across turns. We also define a structured prompt using ChatPromptTemplate that frames the LLM as a research assistant. The prompt enforces strict rules for factual accuracy, contextual grounding, source citation, and concise answers, ensuring reliable and well-grounded responses.
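To see exactly what the model will receive, a quick sketch (with placeholder context and an empty chat history) renders the template into its underlying messages:

preview = prompt.invoke({
    "context": "Document 1 [Score: 0.90]:\nTitle: Example\nSource: example.com\nContent: ...",
    "chat_history": [],
    "question": "What does this pipeline do?"
})
for message in preview.to_messages():
    print(f"[{message.type}] {message.content[:120]}")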
def get_llm(model_name="gemini-2.0-flash-lite", temperature=0.2, response_mode="json"):
    try:
        return ChatGoogleGenerativeAI(
            model=model_name,
            temperature=temperature,
            convert_system_message_to_human=True,
            top_p=0.95,
            top_k=40,
            max_output_tokens=2048
        )
    except Exception as e:
        logger.error(f"Failed to initialize LLM: {e}")
        raise

output_parser = SearchResultsParser()
We define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as the model name, temperature, and decoding settings (top_p, top_k, and max_output_tokens). It includes error handling for failed model initialization to ensure robustness. An instance of SearchResultsParser is also created to standardize and structure the LLM's raw responses, enabling consistent downstream processing of answers and metadata.
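As a minimal usage sketch (assuming your API key has access to the named Gemini model), the helper can be called directly and its raw response normalized by the parser:

demo_llm = get_llm(model_name="gemini-2.0-flash-lite", temperature=0.0)
raw_response = demo_llm.invoke("In one sentence, what is retrieval-augmented generation?")
print(output_parser.parse(raw_response)["answer"])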
def plot_search_metrics(search_history):
    if not search_history:
        print("No search history available")
        return

    df = pd.DataFrame(search_history)
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    plt.plot(range(len(df)), df['response_time'], marker="o")
    plt.title('Search Response Times')
    plt.xlabel('Search Index')
    plt.ylabel('Time (seconds)')
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.bar(range(len(df)), df['num_results'])
    plt.title('Number of Results per Search')
    plt.xlabel('Search Index')
    plt.ylabel('Number of Results')
    plt.grid(True)

    plt.tight_layout()
    plt.show()
The plot_search_metrics function visualizes performance trends across past queries using matplotlib. It converts the search history into a DataFrame and draws two subplots: one showing the response time per search and the other the number of results returned. This helps analyze the system's efficiency and search quality over time, making it easier to fine-tune the retriever or identify bottlenecks in real-world usage.
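The same history records also lend themselves to quick numeric summaries; the sketch below is a hypothetical helper (not part of the original pipeline) that aggregates the fields recorded by EnhancedTavilyRetriever:

def summarize_search_history(search_history):
    # Aggregate latency and result counts from the recorded search metadata
    if not search_history:
        return None
    df = pd.DataFrame(search_history)
    return {
        "searches": len(df),
        "avg_response_time_s": round(df["response_time"].mean(), 2),
        "avg_results": round(df["num_results"].mean(), 1),
    }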
def retrieve_with_fallback(query):
    cached_results = search_cache.search(query)
    if cached_results:
        logger.info(f"Retrieved {len(cached_results)} documents from cache")
        return cached_results

    logger.info("No cache hit, performing web search")
    search_results = enhanced_retriever.invoke(query)
    search_cache.add_documents(search_results)
    return search_results

def summarize_documents(documents, query):
    llm = get_llm(temperature=0)
    summarize_prompt = ChatPromptTemplate.from_template(
        """Create a concise summary of the following documents related to this query: {query}

        {documents}

        Provide a comprehensive summary that addresses the key points relevant to the query.
        """
    )
    chain = (
        {"documents": lambda docs: format_docs(docs), "query": lambda _: query}
        | summarize_prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(documents)
These two functions enhance the assistant's intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if that fails, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents uses the Gemini LLM to generate concise summaries of the retrieved documents, guided by a structured prompt that keeps the output relevant to the query. Together, they enable low-latency, informative, and context-aware responses.
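As an illustration (the query string is just an example), calling the fallback retriever twice with the same query should hit the web on the first call and the Chroma cache on the second; the logger output makes the path taken visible:

docs_first = retrieve_with_fallback("breath of the wild release date")
docs_second = retrieve_with_fallback("breath of the wild release date")
print(len(docs_first), "documents on first call,", len(docs_second), "on second (cached) call")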
def advanced_chain(query_engine="enhanced", model="gemini-1.5-pro", include_history=True):
    llm = get_llm(model_name=model)

    if query_engine == "enhanced":
        retriever = lambda query: retrieve_with_fallback(query)
    else:
        retriever = enhanced_retriever.invoke

    def chain_with_history(input_dict):
        query = input_dict["question"]
        chat_history = memory.load_memory_variables({})["chat_history"] if include_history else []
        docs = retriever(query)
        context = format_docs(docs)
        prompt_value = prompt.invoke({
            "context": context,
            "question": query,
            "chat_history": chat_history
        })
        response = llm.invoke(prompt_value)
        memory.save_context({"input": query}, {"output": response.content})
        return response

    return RunnableLambda(chain_with_history) | StrOutputParser()
The advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries with either cached or real-time search. It initializes the specified Gemini model, selects a retrieval strategy (cache-backed fallback or direct search), constructs a response pipeline that incorporates chat history (if enabled), formats the documents into context, and prompts the LLM with a system-guided template. The chain also records each interaction in memory and parses the final answer into clean text. This design allows flexible experimentation with models and retrieval strategies while maintaining conversational coherence.
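Because the chain builder exposes its configuration as arguments, a variant pipeline can be assembled in one line; the sketch below (using the lighter Gemini model already referenced in get_llm) builds a direct-search chain that skips the cache and chat history:

direct_chain = advanced_chain(query_engine="direct", model="gemini-2.0-flash-lite", include_history=False)
# direct_answer = direct_chain.invoke({"question": "who developed breath of the wild?"})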
qa_chain = advanced_chain()
def analyze_query(query):
    llm = get_llm(temperature=0)
    analysis_prompt = ChatPromptTemplate.from_template(
        """Analyze the following query and provide:
        1. Main topic
        2. Sentiment (positive, negative, neutral)
        3. Key entities mentioned
        4. Query type (factual, opinion, how-to, etc.)

        Query: {query}

        Return the analysis in JSON format with the following structure:
        {{
            "topic": "main topic",
            "sentiment": "sentiment",
            "entities": ["entity1", "entity2"],
            "type": "query type"
        }}
        """
    )
    chain = analysis_prompt | llm | output_parser
    return chain.invoke({"query": query})
print("Advanced Tavily-Gemini Implementation")
print("="*50)
query = "what year was breath of the wild released and what was its reception?"
print(f"Query: {query}")
We initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline, ready to process user queries with retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis of a query using the Gemini model and a structured JSON prompt, extracting the main topic, sentiment, entities, and query type. The example query, about the release year and reception of Breath of the Wild, shows how the assistant is triggered and prepared for full-stack reasoning and semantic interpretation. The printed header marks the start of interactive execution.
try:
    print("\nSearching for answer...")
    answer = qa_chain.invoke({"question": query})
    print("\nAnswer:")
    print(answer)

    print("\nAnalyzing query...")
    try:
        query_analysis = analyze_query(query)
        print("\nQuery Analysis:")
        print(json.dumps(query_analysis, indent=2))
    except Exception as e:
        print(f"Query analysis error (non-critical): {e}")
except Exception as e:
    print(f"Error in search: {e}")

history = enhanced_retriever.get_search_history()
print("\nSearch History:")
for i, h in enumerate(history):
    print(f"{i+1}. Query: {h['query']} - Results: {h['num_results']} - Time: {h['response_time']:.2f}s")
print("nAdvanced search with domain filtering:")
specialized_retriever = EnhancedTavilyRetriever(
max_results=3,
search_depth="advanced",
include_domains=["nintendo.com", "zelda.com"],
exclude_domains=["reddit.com", "twitter.com"]
)
try:
specialized_results = specialized_retriever.invoke("breath of the wild sales")
print(f"Found {len(specialized_results)} specialized results")
summary = summarize_documents(specialized_results, "breath of the wild sales")
print("nSummary of specialized results:")
print(summary)
except Exception as e:
print(f"Error in specialized search: {e}")
print("nSearch Metrics:")
plot_search_metrics(history)
We demonstrate the complete pipeline in action. It uses qa_chain to perform the search, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves and prints each query's search history, response time, and result count. In addition, it runs a domain-filtered search restricted to Nintendo-related sites, summarizes the results, and visualizes search performance with plot_search_metrics, offering a comprehensive view of the assistant's capabilities in real-time use.
In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that blends real-time web intelligence with conversational AI. The Tavily Search API lets users pull fresh, relevant content directly from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain's abstraction layer seamlessly orchestrates memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, fallback retrieval strategies, query analysis (sentiment, topic, and entity extraction), and a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. In addition, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment.