
A step-by-step guide to powering applications with LLMs

Is GenAI hype or just noise? I thought it was hype too, and that I could sit back until the dust cleared. Oh boy, was I wrong. GenAI has practical applications, and it generates revenue, so companies are investing heavily in it. Whenever a technology disrupts something, the reaction usually runs through the same stages: denial, anger, and acceptance. The same thing happened when computers were introduced. If you work in software or hardware, you will likely need to use GenAI at some point.

In this article, I cover how to power applications with large language models (LLMs) and discuss the challenges I faced while setting one up. Let’s get started.

1. First, clearly define the use cases

Before we jump to LLMs, we should ask ourselves a few questions:

a. What problem does my LLM solve?
b. Can my application work without an LLM?
c. Do I have enough resources and computing power to develop and deploy this application?

Narrow down your use cases and write them down. In my case, I was working on a data platform as a service. We had a lot of information spread across wikis, Slack, Teams channels, and more. We wanted a chatbot that could read this information and answer questions on our behalf. The chatbot would handle customers’ questions and requests for us, and if a customer was still stuck, the request would be routed to an engineer.

2. Select your model


You have two options: train a model from scratch, or take a pre-trained model and build on top of it. Unless you have a very specific use case, the latter works in most cases. Training a model from scratch requires a lot of computing power, significant engineering effort, and cost. The next question is: which pre-trained model should I choose? You can select a model based on your use case. A 1B-parameter model has basic knowledge and pattern matching, which is enough for something like classifying restaurant reviews. A 10B-parameter model has good world knowledge and can follow instructions, which suits a food-ordering chatbot. A 100B+-parameter model has rich world knowledge and complex reasoning and can serve as a brainstorming partner. There are many models available, such as Llama and ChatGPT.
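For example, here is a minimal sketch of starting from a pre-trained model instead of training from scratch. It assumes the Hugging Face transformers library is installed; google/flan-t5-base is just one example checkpoint, not a recommendation.

# A minimal sketch: load a pre-trained model and run it, rather than training
# from scratch. Assumes `pip install transformers` and an example checkpoint.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

result = generator("Summarize: The restaurant was crowded but the food arrived quickly.")
print(result[0]["generated_text"])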

3. Augment the model based on your data

Once you have a model, the next step is to augment it. An LLM is trained on general-purpose data, but we want it to work with our data, and the model needs more context to give useful answers. Suppose we want to build a restaurant chatbot to answer customer questions. The base model knows nothing about your restaurant’s specifics, so we need to provide the model with some context. There are a number of ways to achieve this. Let’s dig into some of them.

Prompt engineering

Prompt engineering involves adding more context to the input prompt at inference time. You provide the context directly in the prompt. This is the simplest approach and requires no changes to the model, but it has shortcomings: the context window limits how much context you can fit into the prompt, and you cannot expect users to always provide complete context, which may be broad. It is a quick and easy solution, but with several limitations. Here is an example of a few-shot prompt; a short code sketch that sends it to a model follows the example.

“Classify this review:
I love this movie.
Sentiment: Positive

Classify this review:
I hate this movie.
Sentiment: Negative

Classify this review:
The ending was thrilling.
Sentiment:”
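Below is a minimal sketch of sending that few-shot prompt to a model. It assumes the OpenAI Python SDK (v1+) and an API key in the OPENAI_API_KEY environment variable; any chat model could be swapped in.

# A minimal sketch of few-shot prompting via the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_prompt = """Classify this review:
I love this movie.
Sentiment: Positive

Classify this review:
I hate this movie.
Sentiment: Negative

Classify this review:
The ending was thrilling.
Sentiment:"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected completion: "Positive"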

Reinforcement learning from human feedback (RLHF)

RLHF model diagram

RLHF is one of the most commonly used methods to align an LLM for an application. You provide feedback data that the model learns from. The process works like this: the model takes an action from the action space and observes how the state changes as a result. A reward model scores the output, and the model updates its weights to maximize the reward, learning iteratively. In the LLM setting, the action is the next token the LLM generates, the action space is the vocabulary of all possible tokens, the environment is the text context, and the state is the current text in the context window.
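To make that loop concrete, here is a toy, purely illustrative sketch in plain Python. The “policy”, reward model, and update rule below are deliberately simplified stand-ins for a real LLM, a learned reward model, and PPO-style optimization; libraries such as TRL handle the real thing.

# Toy illustration of the RLHF loop: sample an action, score it with a reward
# model, and nudge the policy toward high-reward actions. Not a real trainer.
import random

vocabulary = ["great", "terrible", "okay", "thanks", "sorry"]   # action space
weights = {word: 1.0 for word in vocabulary}                    # "policy" parameters

def reward_model(word: str) -> float:
    # Stand-in for a learned reward model that ranks helpful/polite outputs higher.
    return 1.0 if word in {"great", "thanks"} else -1.0

for _ in range(100):
    # The policy samples the next word (the "action") from the action space.
    total = sum(weights.values())
    action = random.choices(vocabulary, weights=[weights[w] / total for w in vocabulary])[0]
    # The reward model scores the action; the policy is updated so that
    # high-reward actions become more likely (a crude REINFORCE-style update).
    reward = reward_model(action)
    weights[action] = max(0.1, weights[action] + 0.1 * reward)

print(sorted(weights.items(), key=lambda kv: -kv[1]))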

The explanation above is the textbook version; let’s look at a real-life example of augmenting a model with your own data.

Retrieval augmented generation (RAG)

Say you want your chatbot to answer questions about your wiki documentation. You pick a pre-trained model such as ChatGPT, and your wiki becomes your context data. You can use the LangChain library to implement RAG. Here is some sample code in Python:

from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

import os

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key-here"

# Step 1: Load Wikipedia documents
query = "Alan Turing"
wiki_loader = WikipediaLoader(query=query, load_max_docs=3)
wiki_docs = wiki_loader.load()

# Step 2: Split the text into manageable chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = splitter.split_documents(wiki_docs)

# Step 3: Embed the chunks into vectors
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(split_docs, embeddings)

# Step 4: Create a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Step 5: Create a RetrievalQA chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # You can also try "map_reduce" or "refine"
    retriever=retriever,
    return_source_documents=True,
)

# Step 6: Ask a question
question = "What did Alan Turing contribute to computer science?"
response = qa_chain(question)

# Print the answer
print("Answer:", response["result"])
print("n--- Sources ---")
for doc in response["source_documents"]:
    print(doc.metadata)

4. Evaluate your model

You have now added RAG to the model. How do you check whether the model behaves correctly? This is not ordinary code where you pass in some input parameters and can test against a fixed output. Since this is language-based communication, there can be multiple correct answers. What you can determine is whether an answer is incorrect. There are many metrics you can use to test your model.

Manual evaluation

You can evaluate the model manually. For example, we integrated a Slack chatbot augmented with RAG over our wikis and Jira. When we first added the chatbot to the Slack channel, we hid its responses so customers could not see them. Once we gained confidence, we made the chatbot’s responses visible to customers. We evaluated its responses manually, but that is a quick and fuzzy approach, and you cannot build much confidence from manual testing alone. The solution is to evaluate against benchmarks such as ROUGE.

Evaluate with ROUGE scores

ROUGE metrics are used for text summarization. They compare a generated summary against a reference summary, and each variant is evaluated with recall, precision, and F1 scores. Because a poor completion can still score well on one variant, we look at several ROUGE variants together. A unigram is a single word, a bigram is two words, and an n-gram is n words. The formulas, a worked example, and a short code sketch follow.

ROUGE-1 recall = unigram matches / unigrams in reference
ROUGE-1 precision = unigram matches / unigrams in generated output
ROUGE-1 F1 = 2 * (recall * precision) / (recall + precision)
ROUGE-2 recall = bigram matches / bigrams in reference
ROUGE-2 precision = bigram matches / bigrams in generated output
ROUGE-2 F1 = 2 * (recall * precision) / (recall + precision)
ROUGE-L recall = longest common subsequence length / unigrams in reference
ROUGE-L precision = longest common subsequence length / unigrams in generated output
ROUGE-L F1 = 2 * (recall * precision) / (recall + precision)

For example,

Reference: “It is cold outside.”
Generated output: “It is very cold outside.”

ROUGE-1 recall = 4/4 = 1.0
ROUGE-1 precision = 4/5 = 0.8
ROUGE-1 F1 = 2 * (1.0 * 0.8) / (1.0 + 0.8) = 0.89
ROUGE-2 recall = 2/3 = 0.67
ROUGE-2 precision = 2/4 = 0.5
ROUGE-2 F1 = 2 * (0.67 * 0.5) / (0.67 + 0.5) = 0.57
ROUGE-L recall = 2/4 = 0.5
ROUGE-L precision = 2/5 = 0.4
ROUGE-L F1 = 2 * (0.5 * 0.4) / (0.5 + 0.4) = 0.44
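If you don’t want to compute these by hand, here is a quick sketch using Google’s rouge_score package (pip install rouge-score). Exact values can differ slightly from the hand calculation above because of tokenization and stemming choices.

# Compute ROUGE-1, ROUGE-2, and ROUGE-L for the example above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("It is cold outside.", "It is very cold outside.")  # (reference, generated)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")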

Evaluate with external benchmarks

ROUGE scores help you understand how model evaluation works, and other metrics such as the BLEU score exist as well. In practice, however, we often cannot build our own dataset to evaluate the model, so we can rely on external benchmarks instead. The most commonly used are the GLUE and SuperGLUE benchmarks.
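Here is a minimal sketch of pulling a benchmark task instead of building your own evaluation set. It assumes the Hugging Face datasets and evaluate packages; SST-2 is just one of the GLUE tasks, and in practice you would feed in your model’s predictions rather than the gold labels.

# Load a GLUE task and score predictions against it.
from datasets import load_dataset
import evaluate

sst2 = load_dataset("glue", "sst2", split="validation")
metric = evaluate.load("glue", "sst2")

# Replace `predictions` with your model's outputs; using the gold labels here
# only demonstrates the metric call.
predictions = sst2["label"]
print(metric.compute(predictions=predictions, references=sst2["label"]))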

5. Optimize and deploy the model

This step may not be strictly necessary, but it is always good to reduce compute cost and get faster results. Once the model is ready, you can optimize it to improve performance and reduce memory requirements. The concepts below require additional engineering work, knowledge, time, and cost, but they will familiarize you with some useful techniques.

Quantize the weights

A model’s parameters are the internal variables learned from data during training, and their values determine the model’s predictions. During training, one parameter typically requires about 24 bytes of GPU memory (the weight plus gradients and optimizer state), so a 1B-parameter model needs roughly 24 GB just to train. Quantization converts model weights from higher-precision to lower-precision numbers for more efficient storage. Changing the storage precision significantly affects the number of bytes needed per weight: FP32 takes 4 bytes, FP16 and BFLOAT16 take 2 bytes, and INT8 takes 1 byte.
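As a back-of-the-envelope sketch, here is the same arithmetic in code: bytes per weight at common precisions multiplied by the parameter count. (Training needs considerably more memory for gradients and optimizer state.)

# Rough inference-time memory needed just to store the weights of a 1B model.
bytes_per_weight = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}
num_parameters = 1_000_000_000  # a 1B-parameter model

for precision, nbytes in bytes_per_weight.items():
    gigabytes = num_parameters * nbytes / 1e9
    print(f"{precision}: ~{gigabytes:.0f} GB to store the weights")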

Pruning

Pruning removes weights that are less important and have little impact, such as weights equal or close to zero. Some pruning approaches are listed below, with a small code sketch after the list:
a. Full model retraining
b. PEFT methods such as LoRA
c. Post-training pruning
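Here is a small sketch of magnitude-based pruning using PyTorch’s built-in torch.nn.utils.prune utilities; the layer size and sparsity level are arbitrary choices for illustration.

# Zero out the 30% of weights with the smallest absolute value in one layer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Sparsity after pruning: {sparsity:.0%}")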

Conclusion

All in all, you can pick a pre-trained model such as ChatGPT or FLAN-T5 and build on top of it, since building a model from scratch requires expertise, resources, time, and budget. If you want, you can fine-tune it for your use case. You can then power your application with the LLM using techniques such as RAG, customize it for your use case, evaluate the model against benchmarks to confirm it behaves correctly, and finally deploy it.
