Step-by-step guide to building a fast semantic search and RAG QA engine using Together AI embeddings, FAISS retrieval, and LangChain

In this tutorial, we lean on the growing Together AI ecosystem to show how quickly unstructured text can be turned into a question-answering service that cites its sources. We scrape a handful of live web pages, slice them into coherent chunks, and feed those chunks to the togethercomputer/m2-bert-80M-8k-retrieval embedding model. The resulting vectors land in a FAISS index for millisecond similarity search, after which a lightweight chat model drafts answers that stay grounded in the retrieved passages. Because Together AI handles both embeddings and chat behind a single API key, we avoid juggling multiple providers, quotas, or SDK dialects.
!pip -q install --upgrade langchain-core langchain-community langchain-together \
  faiss-cpu tiktoken beautifulsoup4 html2text
This quiet (-q) pip command upgrades and installs everything the Colab RAG pipeline needs: the core LangChain libraries plus the Together AI integration, FAISS for vector search, tiktoken for token counting, and lightweight HTML parsing via BeautifulSoup4 and html2text, so the notebook runs end to end without any additional setup.
import os, getpass, warnings, textwrap, json
if "TOGETHER_API_KEY" not in os.environ:
os.environ["TOGETHER_API_KEY"] = getpass.getpass("🔑 Enter your Together API key: ")
We check whether the TOGETHER_API_KEY environment variable is already set; if not, the notebook securely prompts for the key with getpass and stores it in os.environ. By capturing the credential once per run, the rest of the notebook can call Together AI's API without hard-coding secrets or accidentally exposing them.
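If you would rather not retype the key on every run, a minimal alternative sketch (our addition, not part of the original walkthrough) reads it from Colab's Secrets panel via google.colab.userdata, falling back to getpass outside Colab; the secret name TOGETHER_API_KEY is our own choice:

import os, getpass

def ensure_together_key():
    # Reuse an existing key if a previous cell or run already set it.
    if "TOGETHER_API_KEY" in os.environ:
        return
    try:
        # Only available inside Colab; expects a saved secret named TOGETHER_API_KEY.
        from google.colab import userdata
        os.environ["TOGETHER_API_KEY"] = userdata.get("TOGETHER_API_KEY")
    except Exception:
        os.environ["TOGETHER_API_KEY"] = getpass.getpass("🔑 Enter your Together API key: ")

ensure_together_key()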
from langchain_community.document_loaders import WebBaseLoader
URLS = [
    "",  # add the documentation or blog pages you want to index here
    "",
    "",
]
raw_docs = WebBaseLoader(URLS).load()
WebBaseLoader fetches each URL, strips boilerplate, and returns LangChain Document objects containing the clean page text plus metadata. By passing a list of links, we immediately collect live documentation and blog content that will later be split and embedded for semantic search.
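As a quick sanity check (not in the original post), you can print each page's source URL and character count to confirm the loader actually recovered text before splitting:

# Confirm every page loaded and see how much text WebBaseLoader recovered.
for d in raw_docs:
    print(d.metadata.get("source"), "→", len(d.page_content), "characters")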
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = splitter.split_documents(raw_docs)
print(f"Loaded {len(raw_docs)} pages → {len(docs)} chunks after splitting.")
RecursiveCharacterTextSplitter cuts each fetched page into ~800-character segments with a 100-character overlap so that context at chunk boundaries is not lost. The resulting docs list holds these bite-sized LangChain Document objects, and the printout shows how many chunks the original pages produced, essential preparation for high-quality embedding.
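To see what the splitter produced, a small illustrative peek (our addition) at the first chunk shows the ~800-character window and confirms that the source URL is carried into each chunk's metadata:

# Inspect one chunk: its origin, a preview of its text, and its length.
sample = docs[0]
print(sample.metadata.get("source"))
print(sample.page_content[:200], "…")
print("chunk length:", len(sample.page_content), "characters")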
from langchain_together.embeddings import TogetherEmbeddings
embeddings = TogetherEmbeddings(
    model="togethercomputer/m2-bert-80M-8k-retrieval"
)
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(docs, embeddings)
Here we instantiate Together AI's 80M-parameter m2-bert-80M-8k-retrieval model as a LangChain embedder, then feed every text chunk into FAISS.from_documents. The resulting vector store supports millisecond similarity search, turning our scraped pages into a searchable semantic database.
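Before wiring up any LLM, it is worth probing the index directly; the following sketch (our addition, with an arbitrary test query) uses similarity_search to confirm that retrieval alone surfaces relevant chunks:

# Query the FAISS index directly, no LLM involved, to sanity-check retrieval.
hits = vector_store.similarity_search("How do I create embeddings with Together AI?", k=2)
for h in hits:
    print(h.metadata.get("source"), "→", h.page_content[:120].replace("\n", " "))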
from langchain_together.chat_models import ChatTogether
llm = ChatTogether(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    temperature=0.2,
    max_tokens=512,
)
ChatTogether wraps Mistral-7B-Instruct-v0.3 as a chat model that can be dropped in anywhere a LangChain LLM is expected. A low temperature of 0.2 keeps the answers grounded and repeatable, while max_tokens=512 leaves room for detailed, multi-paragraph responses without letting costs run out of control.
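As a quick smoke test (not part of the original walkthrough), you can call the chat model on its own to confirm that the API key and model name work before building the chain:

# One-off call to the chat model; .invoke returns an AIMessage whose .content holds the text.
print(llm.invoke("In one sentence, what is retrieval-augmented generation?").content)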
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
RetrievalQA stitches everything together: it takes our FAISS retriever (returning the top 4 most similar chunks) and feeds those fragments to the LLM through the simple "stuff" prompt template. Setting return_source_documents=True means every answer comes back with the exact passages it relied on, giving us instant, cited Q-and-A.
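If you want tighter grounding, one optional variant (an assumption on our part, not shown in the original post) passes a custom "stuff" prompt through chain_type_kwargs so the model declines to answer when the retrieved context is silent:

from langchain.prompts import PromptTemplate

# The "stuff" chain fills {context} with the retrieved chunks and {question} with the query.
grounded_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa_chain_grounded = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": grounded_prompt},
    return_source_documents=True,
)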
QUESTION = "How do I use TogetherEmbeddings inside LangChain, and what model name should I pass?"
result = qa_chain(QUESTION)
print("n🤖 Answer:n", textwrap.fill(result['result'], 100))
print("n📄 Sources:")
for doc in result['source_documents']:
print(" •", doc.metadata['source'])
Finally, we send a natural-language query through qa_chain, which retrieves the four most relevant chunks, feeds them to the Mistral chat model, and returns a concise answer. It then prints the formatted response followed by the list of source URLs, giving us a complete answer and transparent references in one shot.
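Because re-scraping and re-embedding on every run is wasteful, one optional extra (our addition; the directory name is arbitrary) is to persist the FAISS index locally and reload it in a later session:

# Save the index (vectors plus chunk texts) to disk.
vector_store.save_local("together_rag_index")

# In a later session, rebuild the same embedder and reload the index.
reloaded = FAISS.load_local(
    "together_rag_index",
    embeddings,
    allow_dangerous_deserialization=True,  # recent LangChain requires opting in to pickle loading
)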
In short, in about fifty lines of code we build a complete end-to-end RAG pipeline on Together AI: ingest, embed, store, retrieve, and converse. The approach is deliberately modular: swap FAISS for Chroma, trade the 80M-parameter embedder for a larger multilingual model, or insert a re-ranker without touching the rest of the pipeline. What stays constant is the convenience of a unified Together AI backend: fast, affordable embeddings, instruction-tuned chat models one call away, and a generous free tier that makes experimentation painless. Use this template to bootstrap an internal knowledge assistant, a documentation bot for customers, or a personal research aide.
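As a minimal sketch of the Chroma swap mentioned above (assuming the chromadb package is installed; everything downstream of the vector store stays unchanged):

!pip -q install chromadb

from langchain_community.vectorstores import Chroma

# Same documents, same embedder — only the vector store class changes.
chroma_store = Chroma.from_documents(docs, embeddings, persist_directory="chroma_rag")
qa_chain_chroma = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=chroma_store.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)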
Check out the Colab notebook here. Also, feel free to follow us on Twitter, and don't forget to join our 90K+ ML SubReddit.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform noted for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform draws over 2 million views per month, a testament to its popularity with readers.