Reduce token usage with MCP Optimizer
Overview
The MCP Optimizer acts as an intelligent intermediary between AI clients and MCP servers. It provides tool discovery, unified access to multiple MCP servers through a single endpoint, and intelligent routing of requests to appropriate MCP tools.
The optimizer is now integrated into Virtual MCP Server (vMCP), which provides the same tool filtering and token reduction at the team level. You can deploy it in Kubernetes today; a local experience is coming soon.
In this tutorial, you deploy the optimizer on Kubernetes using vMCP and an EmbeddingServer for semantic tool search.
What you'll learn
- How to create an MCPGroup with multiple backend MCP servers
- How to deploy an EmbeddingServer for semantic search
- How to create a VirtualMCPServer with the optimizer enabled
- How to connect your AI client to the optimized endpoint and verify it exposes only `find_tool` and `call_tool`
About MCP Optimizer
Instead of exposing every backend tool to the model, the optimizer introduces two lightweight primitives: `find_tool` for semantic search and `call_tool` for routing. This keeps context small and improves tool selection accuracy. For the full parameter reference and configuration options, see Optimize tool discovery.
How it works
The workflow is as follows:
- You send a prompt that requires tool assistance (for example, fetching a web page)
- The assistant calls `find_tool` with relevant keywords extracted from the prompt (see the sketch after this list)
- The optimizer returns the most relevant tools (up to 8 by default; the limit is configurable)
- Only those tools and their descriptions are included in the context sent to the model
- The assistant uses `call_tool` to execute the task with the selected tool
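The sketch below shows roughly what the `find_tool` step looks like at the protocol level, using the port-forwarded endpoint you set up in Step 4. It is illustrative only: the argument name `query` is a hypothetical placeholder, real MCP sessions begin with an `initialize` handshake, and streamable-http responses may arrive as an event stream.

```bash
# Illustrative sketch of a find_tool call over streamable HTTP.
# "query" is a hypothetical argument name; a real session also needs
# an initialize handshake before tools/call is accepted.
curl -s http://localhost:4483/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
          "name": "find_tool",
          "arguments": { "query": "fetch a web page" }
        }
      }'
```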
Prerequisites
Before starting this tutorial, make sure you have:
- A Kubernetes cluster with the ToolHive operator installed (see Quickstart: Kubernetes Operator)
- `kubectl` configured to communicate with your cluster
- The ToolHive CLI installed on your local machine (used in Step 4 to register the endpoint with your MCP clients)
- An MCP client (Visual Studio Code with GitHub Copilot is used in this tutorial)
The default text embeddings inference (TEI) images depend on Intel MKL, which is x86_64-only. Native ARM64 support has been merged upstream but is not yet included in a published release. If you are using Apple Silicon or any other ARM64 nodes (including kind on macOS), you can run the amd64 image under emulation as a workaround. See the EmbeddingServer resource section for the required steps, including a Docker Desktop configuration change.
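Not sure which architecture your nodes use? You can check before deploying the embedding server:

```bash
# Print each node's name and CPU architecture. Nodes reporting "arm64"
# need the emulation workaround above; "amd64" nodes do not.
kubectl get nodes -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture'
```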
Step 1: Create an MCPGroup and deploy backend MCP servers
Create an MCPGroup to organize the backend MCP servers that the optimizer will index and route to:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPGroup
metadata:
  name: optimizer-demo
  namespace: toolhive-system
spec:
  description: Backend servers for the optimizer tutorial
```
Apply the resource:
```bash
kubectl apply -f mcpgroup.yaml
```
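You can confirm the group exists before moving on (the lowercase resource name follows the usual CRD convention):

```bash
kubectl get mcpgroup optimizer-demo -n toolhive-system
```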
Next, deploy two MCP servers in the group. Both reference `optimizer-demo` in the `groupRef` field:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: fetch
  namespace: toolhive-system
spec:
  image: ghcr.io/stackloklabs/gofetch/server
  transport: streamable-http
  proxyPort: 8080
  mcpPort: 8080
  groupRef:
    name: optimizer-demo
  resources:
    limits:
      cpu: '100m'
      memory: '128Mi'
    requests:
      cpu: '50m'
      memory: '64Mi'
---
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: osv
  namespace: toolhive-system
spec:
  image: ghcr.io/stackloklabs/osv-mcp/server
  transport: streamable-http
  proxyPort: 8080
  mcpPort: 8080
  groupRef:
    name: optimizer-demo
  resources:
    limits:
      cpu: '100m'
      memory: '128Mi'
    requests:
      cpu: '50m'
      memory: '64Mi'
```
Apply the resources and wait for both servers to be ready:
```bash
kubectl apply -f mcpservers.yaml
kubectl get mcpservers -n toolhive-system -w
```
You should see both servers with Ready status before continuing.
If you still have an MCPServer left over from the Kubernetes Operator quickstart, delete it first to avoid confusion:

```bash
kubectl delete mcpserver fetch -n toolhive-system
```

Then apply the YAML above, which creates a new `fetch` server with the correct `groupRef`.
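If either server stays in a non-Ready phase, describing the resource usually surfaces the reason in its status conditions and events:

```bash
# Inspect status conditions and events for a stuck server
kubectl describe mcpserver fetch -n toolhive-system

# List the pods the operator created in the namespace
kubectl get pods -n toolhive-system
```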
Step 2: Deploy an EmbeddingServer
The optimizer uses semantic search to find relevant tools. This requires an EmbeddingServer, which runs a text embeddings inference (TEI) server.
Create an EmbeddingServer with default settings. This deploys the `BAAI/bge-small-en-v1.5` model:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: optimizer-embedding
  namespace: toolhive-system
spec: {}
```
Apply the resource:
```bash
kubectl apply -f embedding-server.yaml
```
Wait for the EmbeddingServer to reach the `Ready` phase before proceeding. The first startup may take a few minutes while the model downloads:

```bash
kubectl get embeddingserver optimizer-embedding -n toolhive-system -w
```
The EmbeddingServer deploys a TEI container that generates vector embeddings from text. The optimizer uses these embeddings to perform semantic search across all backend tools, finding the most relevant tools for a given query even when the exact keywords don't match.
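If you want to sanity-check the embedding service directly, you can port-forward it and call TEI's `/embed` endpoint. The service name and port below are assumptions based on this tutorial's resource name; check `kubectl get svc -n toolhive-system` for the actual values:

```bash
# Assumed service name and port; verify with:
#   kubectl get svc -n toolhive-system
kubectl port-forward service/optimizer-embedding -n toolhive-system 8081:80 &

# TEI returns a vector of floats for each input string
curl -s http://localhost:8081/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "fetch a web page"}'
```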
Step 3: Create a VirtualMCPServer with the optimizer
Create a VirtualMCPServer that aggregates the backend servers and enables the optimizer. Adding `embeddingServerRef` is the only change needed to enable the optimizer; sensible defaults are applied automatically:
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: optimizer-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: optimizer-embedding
  incomingAuth:
    type: anonymous
  serviceType: ClusterIP
  config:
    groupRef:
      name: optimizer-demo
    aggregation:
      conflictResolution: prefix
      conflictResolutionConfig:
        prefixFormat: '{workload}_'
```
Apply the resource:
```bash
kubectl apply -f virtualmcpserver.yaml
```
Check the status:
```bash
kubectl get virtualmcpservers -n toolhive-system
```
After about 30 seconds, you should see output similar to:
```
NAME             PHASE   URL                                                                 BACKENDS   AGE   READY
optimizer-vmcp   Ready   http://vmcp-optimizer-vmcp.toolhive-system.svc.cluster.local:4483   2          30s   True
```
Setting `embeddingServerRef` tells the operator to enable the optimizer on this VirtualMCPServer. Instead of exposing all backend tools directly, the optimizer builds a semantic index of tools and exposes only `find_tool` and `call_tool` to clients. This dramatically reduces the number of tools (and tokens) sent to the model.
Step 4: Connect your AI client
The vMCP service runs inside Kubernetes and is not directly reachable by desktop AI clients. This tutorial uses `kubectl port-forward` because it works with any cluster, but in production you would typically expose the service through an Ingress, Gateway API, or LoadBalancer. See Expose the service for the available options.
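As a sketch of the production path, the same service could be exposed imperatively through an Ingress; the hostname and ingress class here are placeholders, not values from this tutorial:

```bash
# Placeholder hostname and class: adjust for your cluster's ingress setup
kubectl create ingress optimizer-vmcp \
  --class=nginx \
  --rule="mcp.example.com/*=vmcp-optimizer-vmcp:4483" \
  -n toolhive-system
```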
In a separate terminal, port-forward the vMCP service to your local machine:
```bash
kubectl port-forward service/vmcp-optimizer-vmcp -n toolhive-system 4483:4483
```
Test the health endpoint:
```bash
curl http://localhost:4483/health
```

You should see `{"status":"ok"}`.
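You can also ask the endpoint for its tool list directly. This is a sketch rather than a guaranteed one-liner: depending on the server's session handling you may need a full MCP `initialize` handshake first, and the response may arrive as an event stream rather than plain JSON:

```bash
# Sketch: list the tools vMCP exposes; expect only find_tool and call_tool.
# A stateful server may reject this without a prior initialize handshake.
curl -s http://localhost:4483/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'
```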
The ToolHive CLI bridges the remaining gap: it registers the port-forwarded endpoint as a local workload and automatically updates your MCP client configuration to point at it.
Register the port-forwarded vMCP endpoint as a ToolHive-managed workload:
```bash
thv run http://localhost:4483/mcp --name optimizer-vmcp
```
If you haven't set up client configuration yet, run `thv client setup` to register your MCP clients. See Client configuration for more details.
Open your AI client and check its MCP configuration. You should see only two tools available: `find_tool` and `call_tool`. This confirms the optimizer is working.
Step 5: Test the optimizer
Try these sample prompts to verify the optimizer is routing requests correctly across both backend MCP servers:
- "Fetch the contents of https://docs.stacklok.com and summarize the page"
- "Check if the Go package github.com/stacklok/toolhive has any known vulnerabilities"
Watch how the optimizer uses `find_tool` to locate the relevant tool across all backends, then `call_tool` to execute it - all through a single endpoint.
To check your token savings, send this prompt to your AI client:
- "How many tokens did I save using MCP Optimizer?"
Clean up
Remove the local workload and delete the Kubernetes resources when you're done:
```bash
thv rm optimizer-vmcp
kubectl delete virtualmcpserver optimizer-vmcp -n toolhive-system
kubectl delete embeddingserver optimizer-embedding -n toolhive-system
kubectl delete mcpserver fetch osv -n toolhive-system
kubectl delete mcpgroup optimizer-demo -n toolhive-system
```
To tear down the entire kind cluster from the K8s Quickstart:
```bash
kind delete cluster --name toolhive
```
Next steps
- Tune the optimizer to adjust search parameters for your workload
- Configure authentication for production deployments
- Monitor vMCP activity with OpenTelemetry tracing and metrics
- Configure failure handling for circuit breakers and partial failure modes
- Provide feedback on your experience on the Stacklok Discord community
Related information
- Optimize tool discovery - full parameter reference, high availability, and ARM64 workaround details
- Optimizing LLM context - background on tool filtering and context pollution
- Virtual MCP Server overview - conceptual overview of vMCP
- MCP Optimizer UI guide - standalone desktop approach without Kubernetes (legacy, being replaced by the vMCP path)
- Quickstart: Kubernetes Operator - prerequisite tutorial