
How to not write an MCP server

I recently had the opportunity to create an MCP server for our observability application, to provide dynamic code analysis capabilities to AI agents. Because of its potential to change how we work with applications, MCP is a technology I'm more excited about than I initially was about GenAI itself. I wrote more about that in my previous post, and touched on it briefly in some more general posts.

Although the initial POC showed that MCP had huge potential to be a force multiplier for the value of our product, it took several iterations and a few missteps along the way to fulfill that promise. In this post, I will try to capture some of the lessons learned, as I think they can benefit other MCP server developers.

My stack

  • I used Cursor and VS Code as the main MCP clients at the time
  • To develop the MCP server itself, I used the .NET MCP SDK, since I decided to host the server as part of another service written in .NET

Lesson 1: Don’t dump all of the data on the agent

One of the tools in my application returns aggregate information about errors and exceptions. The underlying API is very detailed, since it serves a complex UI view, and it emits a lot of deeply linked data:

  • Error frames
  • Affected endpoints
  • Stack traces
  • Priorities and trends
  • Histograms

My first instinct was simply to expose the existing API as-is as an MCP tool. After all, the agent should be able to make more sense of the data than any UI view and spot interesting details or connections between events. I had several scenarios in mind where I expected this data to be useful: the agent could automatically suggest fixes for the latest exceptions recorded in the production or test environment, flag bugs that stand out, or help me get to the bottom of systemic problems that are the root cause of an issue.
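Concretely, that first version was little more than a thin wrapper around the verbose API. Here is a minimal sketch of what it looked like with the .NET MCP SDK; the tool name, parameters, and the backend call are illustrative stand-ins rather than the actual implementation:

// Simplified sketch, not the real implementation: the tool forwards the
// full, deeply nested error payload produced by the existing API.
[McpServerTool,
Description(
@"Returns a summary of errors and exceptions recorded in the given environment,
including error frames, affected endpoints, stack traces, priorities,
trends and histograms.")]
public static async Task<string> GetErrors(IMcpService client,
    [Description("The environment id to check for errors")]
    string environmentId,
    [Description("The time range to check for errors")]
    string timeRange)
{
    // Hypothetical backend call standing in for the real API; it is assumed
    // to return the raw JSON payload, which is passed back to the agent as-is.
    return await client.GetErrorsAsync(environmentId, timeRange);
}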

The basic premise, then, was to let the agent work its “magic”, with more data meaning more hooks for it to latch onto during an investigation. I quickly wired the API up to the MCP endpoint and started with some basic prompts to see whether everything worked:

Image by author

We can see that the agent was smart enough to understand that it needed to call another tool to get the environment ID of the “test” environment I mentioned. With that out of the way, and after discovering there were no exceptions in the last 24 hours, it followed up of its own accord with a scan over a longer time period, which is when things got a little weird:

Image by author

What a strange response. The agent checked the exceptions for the last seven days, and this time retrieved some tangible results, yet it carried on as if the data had been ignored entirely. It kept trying to use the tool in different ways and with different parameter combinations, obviously fumbling around, until I realized it was behaving as if the data were simply invisible to it. Even though errors were being sent back in the response, the agent claimed there were none. What was going on?

Image by author

After some investigation, the problem turned out to be that we had simply hit an upper limit on the amount of data the agent could process in its response.

I was using an extremely verbose existing API, which I had initially thought was an advantage. The end result, however, was that I somehow managed to drown the model. Overall, the response JSON contained around 360k characters and 16k words, including call stacks, error frames, and references. Judging purely by the context window limits of the model I was using, this should have been supported (Claude 3.7 Sonnet supports up to 200k tokens, and at the common heuristic of roughly four characters per token, 360k characters comes out to something like 90k tokens), but despite that, the big data dump completely confused the agent.

One strategy would be to switch to a model that supports a larger context window. I swapped in the Gemini 2.5 Pro model just to test that theory, as it has a massive limit of one million tokens. Sure enough, the same query now produced a much smarter response:

Image by author

This was great! The agent was able to parse the errors and, through some basic reasoning, identify a systemic cause behind many of them. However, we can't rely on users sticking to one specific model, and this output came from a relatively low-traffic test environment. What if the dataset were much larger?
To solve this problem, I made some fundamental changes to the structure of the API:

  • Nested data hierarchy: Keep the initial response focused on high-level details and aggregations, and create a separate API that retrieves the call stack for a specific frame on demand.
  • Enhanced queryability: Every query the agent had made against the data so far used a very small page size (ten results). If we want the agent to be able to reach more of the relevant data while fitting within its context limits, we need to provide more APIs for querying errors along different dimensions, such as affected methods, priority, and impact (both changes are sketched below).
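Here is a rough sketch of what that split looks like in practice, following the same .NET MCP SDK style used elsewhere in this post; the tool names and parameters are illustrative, not the real API:

// Illustrative sketch, not the actual implementation.
// 1. The main tool now returns only high-level, aggregated error entries
//    (score, type, affected endpoint, counts) in small pages, optionally
//    filtered along additional dimensions.
[McpServerTool,
Description("Returns a paged, aggregated summary of errors. Use GetErrorStackTrace to drill into a specific error.")]
public static async Task<string> GetErrorsSummary(IMcpService client,
    [Description("The environment id to check for errors")] string environmentId,
    [Description("The time range to check for errors")] string timeRange,
    [Description("Optional filter: only errors affecting this method")] string affectedMethod,
    [Description("Page number; each page holds 10 results")] int page)

// 2. The deeply nested data moves behind a separate tool that the agent
//    calls only when it needs the details of one specific error.
[McpServerTool,
Description("Returns the full stack trace and error frames for a single error.")]
public static async Task<string> GetErrorStackTrace(IMcpService client,
    [Description("The environment id")] string environmentId,
    [Description("The id of an error returned by GetErrorsSummary")] string errorId)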

With the new changes in place, the tool consistently analyzed important new exceptions and proposed fixes. But there was another small detail I had glossed over that needed sorting out before it was truly usable.

Lesson 2: What time is it?

Images generated by the author and Midjourney

Attentive readers may have noticed that in the previous example, to retrieve errors within a specific time range, the agent used the ISO 8601 duration format rather than actual dates and times. So instead of the standard ‘from’ and ‘to’ parameters with DateTime values, the AI sent a duration value, such as seven days or ‘P7D’, to indicate that it wanted to check for errors from the past week.

The reason for this is a little strange: the agent may not know the current date and time! You can verify that yourself by asking your agent that simple question. The response below would have made sense only if I had not, in fact, been typing that prompt at around noon on May 4.

Image by author

Using duration values turned out to be a good solution that the agents handle well. However, don’t forget to document the expected values and sample syntax in the tool parameter description!
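For example, the parameter description can spell out the expected format together with sample values; this is a small sketch, and the parameter name is illustrative:

[Description(
@"The time range to check, expressed as an ISO 8601 duration relative to now.
Examples: 'P1D' for the last day, 'P7D' for the last week, 'PT6H' for the last six hours.")]
string timeRange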

Lesson 3: When the agent makes a mistake, show it how to do better

In the first example, I was actually surprised by how well the agent interpreted the dependencies between the different tool calls in order to provide the correct environment identifier. Studying the MCP contract, it figured out that it first had to call another tool to get the list of environment IDs.

For other requests, however, the agent would sometimes take the environment name mentioned verbatim in the prompt. For example, I noticed this in the answer to the question: “Is there a significant difference when comparing slow traces of this method between the test and production environments?” Depending on the context, the agent would sometimes use the environment names mentioned in the request and send the strings “test” and “production” as the environment IDs.

In my original implementation, the MCP server would fail silently in this case and return an empty response. After receiving no data, or a generic error, the agent would simply give up and try to satisfy the request using another strategy. To counter that behavior, I quickly changed the implementation so that if an incorrect value is provided, the JSON response describes exactly what is wrong, and even provides the list of possible values to save the agent another tool call.
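Here is a sketch of that idea; the helper call (GetEnvironmentsAsync) and the response shape are hypothetical and only illustrate the approach of returning an actionable error from inside the tool:

// Inside the tool implementation: validate the environment id and, on
// failure, return an explanation plus the valid values instead of nothing.
// (Assumes System.Text.Json and System.Linq are available.)
var environments = await client.GetEnvironmentsAsync();
if (!environments.Any(e => e.Id == environmentId))
{
    return JsonSerializer.Serialize(new
    {
        error = $"'{environmentId}' is not a valid environment id. " +
                "Note that environment names are not environment ids.",
        validEnvironmentIds = environments.Select(e => e.Id)
    });
}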

Image by author

This was enough for the agent to learn from its mistake: it repeated the call with the correct value and, somehow, also avoided making the same mistake later on.

Lesson 4: Focus on user intent rather than functionality

While it is easy to describe what an API does in plain terms, generic terminology does not always help the agent recognize the kinds of requests the capability might apply to.

Let’s take a simple example: my MCP server has a tool that, for each method, endpoint, or code location, can indicate how it is being used at runtime. Specifically, it uses tracing data to show which application flows reach the specific function or method.

The original documentation simply described the functionality:

[McpServerTool,
Description(
@"For this method, see which runtime flows in the application
(including other microservices and code not in this project)
use this function or method.
This data is based on analyzing distributed tracing.")]
public static async Task GetUsagesForMethod(IMcpService client,
    [Description("The environment id to check for usages")]
    string environmentId,
    [Description("The name of the class. Provide only the class name without the namespace prefix.")]
    string codeClass,
    [Description("The name of the method to check, must specify a specific method to check")]
    string codeMethod)
The above is an accurate description of what the tool does, but it does not necessarily make clear what kinds of tasks it might be relevant for. After seeing that the agent did not select this tool for various prompts for which I thought it would be extremely useful, I decided to rewrite the tool description, this time highlighting the use cases:

[McpServerTool,
Description(
@"Find out how a specific code location is being used and by
which other services/code.
Useful in order to detect possible breaking changes, to check whether
the generated code will fit the current usages,
to generate tests based on the runtime usage of this method,
or to check for related issues on the endpoints triggering this code
after any change to ensure it didn't impact them.")]

Updating the text helped the agent realize why the information was useful. For example, before making this change, the agent would not even trigger the tool in response to a prompt similar to the one below. Now, it has become completely seamless, without the user having to directly mention that this tool should be used:

Image by author

Lesson 5: Document your JSON responses

The JSON standard, at least officially, does not support comments. That means that if the JSON is all the agent has to go on, it might be missing some clues about the context of the data you’re returning. For example, in my aggregated error response, I returned the following score object:

"Score": {"Score":21,
"ScoreParams":{ "Occurrences":1,
"Trend":0,
"Recent":20,
"Unhandled":0,
"Unexpected":0}}

Without proper documentation, any non-clairvoyant agent would be hard pressed to make sense of what these numbers mean. Thankfully, it is easy to add a comment element at the beginning of the JSON file with additional information about the data provided:

"_comment": "Each error contains a link to the error trace,
which can be retrieved using the GetTrace tool,
information about the affected endpoints the code and the
relevant stacktrace.
Each error in the list represents numerous instances
of the same error and is given a score after its been
prioritized.
The score reflects the criticality of the error.
The number is between 0 and 100 and is comprised of several
parameters, each can contribute to the error criticality,
all are normalized in relation to the system
and the other methods.
The score parameters value represents its contributation to the
overall score, they include:

1. 'Occurrences', representing the number of instances of this error
compared to others.
2. 'Trend' whether this error is escalating in its
frequency.
3. 'Unhandled' represents whether this error is caught
internally or poropagates all the way
out of the endpoint scope
4. 'Unexpected' are errors that are in high probability
bugs, for example NullPointerExcetion or
KeyNotFound",
"EnvironmentErrors":[]

This allows the agent to explain to the user what the score means if it is requested, and also provides this explanation to its own reasoning and suggestions.

Choosing the right architecture: SSE vs. STDIO

There are two architectures you can use when developing an MCP server. The more common and widely supported approach is to ship your server as a command triggered by the MCP client. This can be any CLI-triggered command; npx, Docker, and Python are some common examples. In this configuration, all communication is done over the process STDIO, and the process itself runs on the client machine. The client is responsible for instantiating the MCP server and maintaining its life cycle.
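For illustration, this is roughly what such a STDIO server looks like in a typical MCP client configuration; the server package name below is a placeholder, not a real package:

{
  "mcpServers": {
    "my-observability-tools": {
      "command": "npx",
      "args": ["-y", "some-mcp-server-package"]
    }
  }
}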

Image by author

From my point of view, this client-side architecture has one major drawback: since the MCP server implementation runs on the client's local machine, it is much harder to roll out updates or new features. Even if that problem were solved, the tight coupling between the MCP server and the backend APIs it depends on in our application would further complicate this model in terms of versioning and forward/backward compatibility.

For these reasons, I chose the second type of MCP server: an SSE server hosted as part of our application services. This removes any friction from running CLI commands on the client machine and allows me to update and version the MCP server code along with the application code it consumes. In this scenario, the client is given the URL of the SSE endpoint to interact with. While not all clients currently support this option, there is an excellent stdio-to-SSE bridge called Supergateway that can be used as a proxy in front of an SSE server implementation. This means users can still add the more widely supported STDIO variant and still consume the functionality hosted on the SSE backend.
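For clients that only speak STDIO, the client configuration can launch Supergateway and point it at the SSE endpoint; the URL below is a placeholder, and the exact flags may vary between Supergateway versions:

{
  "mcpServers": {
    "observability-sse": {
      "command": "npx",
      "args": ["-y", "supergateway", "--sse", "https://your-app.example.com/mcp/sse"]
    }
  }
}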

Image by author

MCP is still new

There are many more lessons and nuances to using this seemingly simple technology. I found that there is a big gap between implementing an MCP that works and one that can actually serve user needs and usage patterns, even ones you would not expect. Hopefully, as the technology matures, we will see more posts about best practices.

Want to connect? You can reach me on Twitter at @doppleware or via LinkedIn.
Follow my work on dynamic code analysis for observability using MCP.
