
Connecting the dots for better movie suggestions

The promise of retrieval-augmented generation (RAG) is that it lets AI systems answer questions using up-to-date or domain-specific information without retraining the model. However, most RAG pipelines still treat documents and information as flat and disconnected: isolated chunks are retrieved based on vector similarity, with no sense of how those chunks relate to one another.

To make up for RAG’s blindness to the connections between documents and chunks, developers have turned to graph RAG approaches, but they often find that the benefits of graph RAG are not worth the added complexity of implementing it.

In our recent article on the open-source Graph RAG project and GraphRetriever, we introduced a new, simpler way to combine your existing vector search with lightweight, metadata-based graph traversal that requires no graph construction or storage. Graph “edges” are defined simply by specifying which document metadata values to use, and these connections are traversed during graph RAG retrieval; graph connections can be defined at runtime (or even at query time).

In this article, we expand on a use case from the Graph RAG project documentation (a demonstration notebook can be found here) with a simple but illustrative example: movie review search. We retrieve movie reviews from a Rotten Tomatoes dataset, automatically connect each review with a subgraph of locally related information, and then summarize query responses using the full context of the relationships among movies, reviews, reviewers, and other data.

Dataset: Rotten Tomatoes reviews and movie metadata

The dataset used in this case study comes from a public Kaggle dataset titled “Massive Rotten Tomatoes Movies & Reviews.” It includes two main CSV files:

  • rotten_tomatoes_movies.csv – structured information about over 200,000 movies, including fields such as title, cast, director, genre, language, release date, runtime, and box office revenue.
  • rotten_tomatoes_movie_reviews.csv – a collection of nearly 2 million movie reviews, including fields such as review text, rating (e.g., 3/5), sentiment classification, review date, and a reference to the associated movie.

Each review is linked to its movie via a shared movie_id, creating a natural relationship between unstructured review content and structured movie metadata. This makes the dataset an ideal candidate for demonstrating GraphRetriever’s ability to traverse document relationships using metadata alone, without manually building or storing a separate graph.

By treating metadata fields such as movie_id, genre, or even shared cast and directors as graph edges, we can build a connected retrieval flow that automatically enriches each query with related context.

Challenge: Putting movie reviews in context

A common goal in AI-driven search and recommendation systems is to let users ask natural, open-ended questions and get meaningful, contextual results. With a large dataset of movie reviews and metadata, we want to support full-text responses to prompts like the following:

  • “What are some good family movies?”
  • “What are some suggestions for exciting action movies?”
  • “What are some classic movies with great cinematography?”

Each of these prompts calls for subjective review content as well as semi-structured attributes such as genre, audience, or visual style. To give a good answer with full context, the system needs to:

  1. Retrieve the most relevant reviews for the user’s query using vector-based semantic similarity
  2. Enrich each review with full movie details (title, release year, genre, director, etc.), so the model can give complete, grounded recommendations
  3. Link this information to other reviews or movies that provide wider context: What do other reviewers say? How do other movies in the genre compare?

A traditional RAG pipeline may handle Step 1 well, pulling in relevant text snippets. But without an understanding of how the retrieved chunks relate to other information in the dataset, the model’s response can lack context, depth, or accuracy.

How graph RAG solves the challenge

Given the user’s query, a plain RAG system might recommend a movie based on a small number of directly semantically related reviews. GraphRetriever, by contrast, can easily bring in related context (e.g., other reviews of the same movie, or other movies in the same genre) to compare and contrast before making a recommendation.

From an implementation point of view, graph RAG offers a clean two-step solution:

Step 1: Build a standard RAG system

First, just as in any RAG system, we embed the document text using an embedding model and store the embeddings in a vector database. Each embedded review includes structured metadata such as reviewed_movie_id, rating, and sentiment; this is the information that will later define the graph relationships. Each embedded movie description includes metadata such as movie_id, genre, release year, director, and so on.

This gives us typical vector-based retrieval: when a user enters a query such as “What are some good family movies?”, we can quickly retrieve reviews from the dataset that relate to family movies. Linking these to a broader context happens in the next step.

Step 2: Add graph traversal with GraphRetriever

Once semantically related reviews have been retrieved in Step 1 via vector search, we can use GraphRetriever to traverse the connections between those reviews and their associated movie records.

Specifically, GraphRetriever:

  • Retrieves relevant reviews via semantic search (standard RAG)
  • Follows metadata-based edges (e.g., reviewed_movie_id) to retrieve additional information directly related to each review, such as movie descriptions and attributes, data about the reviewer, and so on
  • Merges the content into a single context window for the language model to use when generating the answer

One key point: no pre-built knowledge graph is required. The graph is defined entirely in terms of metadata and is traversed dynamically at query time. If you want to extend the connections to include shared cast, genres, or time periods, you simply update the edge definitions in the retriever configuration; no reprocessing or reshaping of the data is needed.
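
For example, extending the traversal to also connect movies that share a genre is just one more entry in the edge list. The genre edge below is purely illustrative and assumes both document types carry a matching genre metadata value:

# Edge definitions are plain metadata-field pairs, adjustable at query time
edges = [
    ("reviewed_movie_id", "movie_id"),  # each review connects to its movie
    ("genre", "genre"),                 # illustrative: movies sharing a genre
]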

So when a user asks for exciting action movies with particular qualities, the system can bring in each movie’s release year, genre, and cast, improving both relevance and readability. When someone asks for a classic movie with great cinematography, the system can draw on reviews of older films and pair them with attributes such as genre or era to produce a reply that is both subjective and grounded in fact.

In short, GraphRetriever bridges the gap between unstructured opinions (subjective text) and structured context (connected metadata), resulting in query responses that are smarter, more trustworthy, and more complete.

GraphRetriever in action

To show how GraphRetriever connects unstructured review content with structured movie metadata, we walk through the basic setup using the Rotten Tomatoes dataset. This involves three main steps: creating the vector store, converting the raw data into LangChain documents, and configuring the graph traversal strategy.

For complete working code, see the example notebook in the Graph RAG project.

Create the vector store and embeddings

We start by embedding and storing the documents, just as we would in any RAG system. Here we use OpenAIEmbeddings and an Astra DB vector store:

from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

COLLECTION = "movie_reviews_rotten_tomatoes"
vectorstore = AstraDBVectorStore(
    embedding=OpenAIEmbeddings(),
    collection_name=COLLECTION,
)

Structure of data and metadata

We store and embed document content as in any RAG system, but we also retain structured metadata for use in graph traversal. Document content is kept minimal (review text, movie title, description), while the rich structured data is stored in the metadata field of each stored document object.
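
The notebook handles this conversion; below is a minimal sketch of how the documents might be built and loaded, assuming hypothetical CSV column names (id, reviewText, originalScore, scoreSentiment) and keeping only a few metadata fields. The real documents carry many more fields, as shown in the example that follows.

import pandas as pd
from langchain_core.documents import Document

movies_df = pd.read_csv("rotten_tomatoes_movies.csv")
reviews_df = pd.read_csv("rotten_tomatoes_movie_reviews.csv")

documents = []

# Movie documents: minimal content, structured fields in metadata
for _, row in movies_df.iterrows():
    documents.append(Document(
        page_content=row["title"],
        metadata={
            "doc_type": "movie_info",
            "movie_id": row["id"],          # hypothetical column name
            "title": row["title"],
            "genre": row["genre"],
            "director": row["director"],
        },
    ))

# Review documents: review text as content, link to the movie in metadata
for _, row in reviews_df.iterrows():
    documents.append(Document(
        page_content=row["reviewText"],     # hypothetical column name
        metadata={
            "doc_type": "movie_review",     # assumed doc_type value for reviews
            "reviewed_movie_id": row["id"], # hypothetical column name
            "rating": row["originalScore"],
            "sentiment": row["scoreSentiment"],
        },
    ))

# Embed and store everything in the vector store created above
vectorstore.add_documents(documents)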

Here is an example of the metadata of a movie document in the vector store:

> pprint(documents[0].metadata)

{'audienceScore': '66',
 'boxOffice': '$111.3M',
 'director': 'Barry Sonnenfeld',
 'distributor': 'Paramount Pictures',
 'doc_type': 'movie_info',
 'genre': 'Comedy',
 'movie_id': 'addams_family',
 'originalLanguage': 'English',
 'rating': '',
 'ratingContents': '',
 'releaseDateStreaming': '2005-08-18',
 'releaseDateTheaters': '1991-11-22',
 'runtimeMinutes': '99',
 'soundMix': 'Surround, Dolby SR',
 'title': 'The Addams Family',
 'tomatoMeter': '67.0',
 'writer': 'Charles Addams,Caroline Thompson,Larry Wilson'}

Note that graph traversal with GraphRetriever uses only the attributes in this metadata field; it requires no dedicated graph DB and makes no LLM calls or other expensive requests.

Configure and run GraphRetriever

GraphRetriever traverses a simple graph defined by metadata connections. In this case, we define an edge from each review to its corresponding movie using the reviewed_movie_id attribute (in the reviews) and the movie_id attribute (in the movie descriptions).

We use the “eager” traversal strategy, one of the simplest traversal strategies. For more details on strategies, see the Graph RAG project documentation.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

retriever = GraphRetriever(
    store=vectorstore,
    edges=[("reviewed_movie_id", "movie_id")],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1),
)

In this configuration:

  • start_k=10: retrieve 10 review documents via semantic search
  • adjacent_k=10: pull in up to 10 adjacent documents at each step of the graph traversal
  • select_k=100: return up to 100 documents in total
  • max_depth=1: traverse the graph only one level deep, from review to movie

Note that because each review is linked to exactly one reviewed movie, the graph traversal in this simple example stops at a depth of 1. For more complex traversals, see the other examples in the Graph RAG project.
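
For a rough idea of how a deeper traversal could be configured, the sketch below adds an illustrative movie-to-movie edge on the director field and raises max_depth to 2, so the traversal can hop from a review to its movie and then to other movies by the same director. The director edge is an assumption for illustration and requires exact matches on that metadata value.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

# Illustrative deeper traversal: review -> movie -> other movies by the same director
deeper_retriever = GraphRetriever(
    store=vectorstore,
    edges=[
        ("reviewed_movie_id", "movie_id"),  # review to its reviewed movie
        ("director", "director"),           # movie to other movies sharing a director
    ],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=2),
)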

Invoke a query

You can now run natural language queries, for example:

INITIAL_PROMPT_TEXT = "What are some good family movies?"

query_results = retriever.invoke(INITIAL_PROMPT_TEXT)

With some sorting and reformatting of the text (see the notebook for details; a rough sketch of this step follows the listing below), we can print a basic list of the retrieved movies and their reviews, for example:

 Movie Title: The Addams Family
 Movie ID: addams_family
 Review: A witty family comedy that has enough sly humour to keep adults chuckling throughout.

 Movie Title: The Addams Family
 Movie ID: the_addams_family_2019
 Review: ...The film's simplistic and episodic plot put a major dampener on what could have been a welcome breath of fresh air for family animation.

 Movie Title: The Addams Family 2
 Movie ID: the_addams_family_2
 Review: This serviceable animated sequel focuses on Wednesday's feelings of alienation and benefits from the family's kid-friendly jokes and road trip adventures.
 Review: The Addams Family 2 repeats what the first movie accomplished by taking the popular family and turning them into one of the most boringly generic kids films in recent years.

 Movie Title: Addams Family Values
 Movie ID: addams_family_values
 Review: The title is apt. Using those morbidly sensual cartoon characters as pawns, the new movie Addams Family Values launches a witty assault on those with fixed ideas about what constitutes a loving family. 
 Review: Addams Family Values has its moments -- rather a lot of them, in fact. You knew that just from the title, which is a nice way of turning Charles Addams' family of ghouls, monsters and vampires loose on Dan Quayle.
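
The exact formatting code is in the notebook; as a minimal sketch, the retrieved documents could be grouped into a listing like the one above as follows. The movie_review doc_type value here is an assumption:

# Group retrieved documents: movies by movie_id, with their reviews attached
movies = {
    doc.metadata["movie_id"]: doc
    for doc in query_results
    if doc.metadata.get("doc_type") == "movie_info"
}

reviews_by_movie = {}
for doc in query_results:
    if doc.metadata.get("doc_type") == "movie_review":  # assumed doc_type value
        reviews_by_movie.setdefault(doc.metadata["reviewed_movie_id"], []).append(doc)

# Build the printable listing and the text passed to the LLM below
lines = []
for movie_id, movie_doc in movies.items():
    lines.append(f" Movie Title: {movie_doc.metadata['title']}")
    lines.append(f" Movie ID: {movie_id}")
    for review in reviews_by_movie.get(movie_id, []):
        lines.append(f" Review: {review.page_content}")
    lines.append("")

formatted_text = "\n".join(lines)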

We can then pass the output above, with the full review text and the linked movie information, to the LLM to generate the final response.

Setting up the final prompt and the LLM call looks like this:

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pprint import pprint

MODEL = ChatOpenAI(model="gpt-4o", temperature=0)

VECTOR_ANSWER_PROMPT = PromptTemplate.from_template("""

A list of Movie Reviews appears below. Please answer the Initial Prompt text
(below) using only the listed Movie Reviews.

Please include all movies that might be helpful to someone looking for movie
recommendations.

Initial Prompt:
{initial_prompt}

Movie Reviews:
{movie_reviews}
""")

formatted_prompt = VECTOR_ANSWER_PROMPT.format(
    initial_prompt=INITIAL_PROMPT_TEXT,
    movie_reviews=formatted_text,
)

result = MODEL.invoke(formatted_prompt)

print(result.content)

And the final response from the graph RAG system might look like this:

Based on the reviews provided, "The Addams Family" and "Addams Family Values" are recommended as good family movies. "The Addams Family" is described as a witty family comedy with enough humor to entertain adults, while "Addams Family Values" is noted for its clever take on family dynamics and its entertaining moments.

Remember that this final response is the result of an initial semantic search for reviews mentioning family movies, with context then added from the documents directly connected to those reviews. By extending the window of relevant context beyond simple semantic search, the LLM and the overall graph RAG system can compose more complete and more useful responses.

Try it yourself

The case study in this article shows how to:

  • Mix unstructured and structured data in a RAG pipeline
  • Use metadata as a dynamic knowledge graph without building or storing one
  • Improve the depth and relevance of AI-generated responses by surfacing connected context

In short, this is graph RAG in action: adding structure and relationships so the LLM doesn’t just retrieve, but builds context and reasons more effectively. If you already store rich metadata alongside your documents, GraphRetriever gives you a practical way to put that metadata to work, with no additional infrastructure.

We hope this inspires you to try GraphRetriever, which is fully open source, on your own data, especially if you already have documents that are implicitly connected through shared attributes, links, or references.

You can explore the full notebook and implementation details here: graph RAG on movie reviews from Rotten Tomatoes.
