How We Built Our MCP Server
We shipped a Model Context Protocol server before we shipped half the integrations on our roadmap. This is the engineering story of how it works, why we made the architectural calls we made, and what we got wrong on the first few attempts: 27 tools, hybrid vector + BM25 search, a strict PAT-scoping model, and a per-call audit row that an admin can replay.
TL;DR
What we shipped: A first-party MCP server at /api/mcp exposing 27 typed tools over the Model Context Protocol. Auth is a bcrypt-hashed PAT scoped to a single new permission (MCP_AGENT). Every call writes an AiOperation row with actorType="mcp_agent".
What's actually interesting: Not the tool count, the search tool. search_org_knowledge runs a hybrid vector + BM25 query across projects, tasks, risks, goals, comments, and wiki pages in a single call. Most PM-tool MCPs expose CRUD; this answers "find me everything related to X" in one round-trip.
Why we built it first: Agentic clients (Claude Desktop, Cursor, ChatGPT custom connectors) are now the fastest-growing source of new active users on the platform. Treating MCP as "yet another integration" would have meant shipping it after Slack, after Jira, after Microsoft Teams — by which point the discovery window for being one of the canonical PM-tool MCPs would have closed.
Why we shipped MCP before another integration
The roadmap question we were asked most often in early 2026 was a variant of "when do you ship Slack / Teams / Jira / [name an integration]?" That was the right question for a product team three years ago. The question for 2026 is subtly different: "where will most of your new user sessions actually originate a year from now?"
The honest answer for us, and we think for most B2B SaaS, is "agentic clients." A growing share of users won't open the Onplana web app at all on a given workday. They'll ask Claude or Cursor or ChatGPT or a custom in-house agent to "create a task for the bug report John filed" or "summarize the active risks across the migration projects" or "tell me which engineer is overallocated next sprint and propose a rebalance." That session has Onplana data flowing through it, but the user never logs in to onplana.com.
If we'd queued MCP behind Slack, by the time we shipped it the agentic ecosystem would have settled on canonical PM-tool MCP servers. We'd be a late-arriving sixth option behind whoever shipped first. There's a real first-mover dynamic in the MCP-registry world: the agent's prompt-time decision about which MCP to query is heavily weighted by which servers are already known and trusted.
So we treated MCP not as an integration but as a primary product surface, peer-class to the web app and the REST API. We allocated a senior engineer full-time for six weeks. We held the bar deliberately high on three axes: tool design quality, security model, and the search surface. The thesis was that the right MCP server is closer to a database than to a Zapier connector, and the design decisions should reflect that.
Architecture: the request flow end-to-end
Onplana's MCP server is an Express route at /api/mcp on the same backend that serves the web app. It speaks Model Context Protocol over the Streamable HTTP transport (the current MCP standard for remote servers, which replaced the earlier HTTP+SSE transport; stdio remains the local option). We deliberately chose not to run it as a separate service. Co-locating with the main backend meant we could share the existing middleware stack, plan-gate dispatcher, audit-log writer, and Prisma client without re-implementing any of them. The MCP layer is a thin adapter on top of an already-mature service surface.
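Concretely, the mount point looks roughly like the following. This is a minimal sketch assuming the official MCP TypeScript SDK in stateless mode; registerTools is a placeholder for the registry wiring covered below, not our actual code.

```typescript
// Minimal sketch of the /api/mcp mount, using the official MCP TypeScript SDK.
// Stateless mode: each POST gets a fresh transport, which keeps the route
// horizontally scalable behind the same load balancer as the web app.
import express from "express";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

const app = express();
app.use(express.json());

app.post("/api/mcp", async (req, res) => {
  const server = new McpServer({ name: "onplana", version: "1.0.0" });
  // registerTools(server, req); // illustrative: wires the shared tool registry

  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined, // stateless: no session tracking
  });
  res.on("close", () => transport.close());
  await server.connect(transport);
  await transport.handleRequest(req, res, req.body);
});
```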
A typical tool call traverses six layers, in order:
1. MCP transport. The client (Claude Desktop, Cursor, etc.) POSTs a JSON-RPC message to /api/mcp. The Streamable HTTP transport returns either a regular JSON response or an SSE-framed response for long-running tools like analyze_project_risks.
2. Authenticate. Bearer PAT, bcrypt-verified against ApiToken. The token must carry the MCP_AGENT scope; tokens without it are rejected pre-dispatch.
3. Resolve org context. The PAT is bound to an org at creation time, so there's no X-Organization-Id header spoofing risk. Org plan + role + permission policy are loaded in the same single DB round-trip the rest of the backend uses.
4. Tools/list. The MCP tools/list response is filtered against the org's plan. A FREE-plan agent literally cannot see analyze_project_risks in the tool catalog — it doesn't appear in the list, which means the upstream LLM can't even attempt to call it. This was a deliberate trade-off: hiding tools (rather than returning "upgrade required") makes the agent's planning more predictable, at the cost of one round-trip to discover the surface.
5. Dispatch. The tool name resolves through the same aiTools registry that powers in-app chat function-calling. We deliberately did not write a parallel registry for MCP. One catalog, two transports.
6. Audit. Before the response goes out, an AiOperation row is created with the full tool name, input, output (truncated), durationMs, actorType="mcp_agent", and the originating PAT's id. Admins query this from the AI Operations panel filtered by actor type. The same audit shape covers in-app AI, scheduled background AI workers, and MCP — one query gets you the entire AI footprint of your tenant.
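For concreteness, here are layers 2 through 6 collapsed into one function. The helpers (verifyPat, loadOrgContext, toolCatalogForPlan, writeAiOperation) are illustrative stand-ins for the shared middleware described above, not our real internals.

```typescript
// Illustrative helper signatures, standing in for shared middleware:
declare function verifyPat(header?: string): Promise<{ id: string; orgId: string; scopes: string[] }>;
declare function loadOrgContext(orgId: string): Promise<{ plan: string }>;
declare function toolCatalogForPlan(plan: string): Map<string, { execute(args: unknown, ctx: unknown): Promise<unknown> }>;
declare function writeAiOperation(row: Record<string, unknown>): Promise<void>;

async function handleToolCall(authHeader: string | undefined, call: { name: string; arguments: unknown }) {
  // 2. Authenticate: bcrypt-verify the bearer PAT, require the MCP_AGENT scope.
  const token = await verifyPat(authHeader);
  if (!token.scopes.includes("MCP_AGENT")) throw new Error("401: missing MCP_AGENT scope");

  // 3. Org context comes from the token itself, never from a client header.
  const org = await loadOrgContext(token.orgId);

  // 4. Plan gate: a tool outside the plan's catalog simply doesn't exist here.
  const tool = toolCatalogForPlan(org.plan).get(call.name);
  if (!tool) throw new Error(`unknown tool: ${call.name}`);

  // 5. Dispatch through the same registry that powers in-app chat.
  const started = Date.now();
  const output = await tool.execute(call.arguments, { org, actor: token });

  // 6. Audit before the response leaves the process.
  await writeAiOperation({
    toolName: call.name,
    input: call.arguments,
    output, // the writer truncates this
    durationMs: Date.now() - started,
    actorType: "mcp_agent",
    apiTokenId: token.id,
  });
  return output;
}
```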
The single most important decision in this list is #5 — reuse the existing aiTools registry. We considered building a parallel MCP-specific tools layer with its own schema, validation, and dispatch logic. That would have shipped faster (a week, maybe). It would have rotted within three months as the two registries drifted. Future-us would be the one paying for that, and future-us has been bitten enough times to push back hard on speed-shaped technical debt during the build.
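To make "one catalog, two transports" concrete, here's the shape of the idea, sketched with assumed names; our actual entry shape differs in detail. The point is that one zod schema is the single source of truth for both surfaces.

```typescript
// Sketch: one tool definition, projected into both transports.
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const getTask = {
  name: "get_task",
  description: "Fetch a single task by id, including status and assignee.",
  inputSchema: z.object({ taskId: z.string() }),
  execute: async (_input: { taskId: string }) => {
    /* ...Prisma query elided... */
  },
};

// MCP surface: the tools/list entry.
const mcpTool = {
  name: getTask.name,
  description: getTask.description,
  inputSchema: zodToJsonSchema(getTask.inputSchema),
};

// In-app chat surface: an OpenAI-style function declaration, same source of truth.
const chatFunction = {
  type: "function",
  function: {
    name: getTask.name,
    description: getTask.description,
    parameters: zodToJsonSchema(getTask.inputSchema),
  },
};
```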
The 27-tool surface, and how we chose it
We started by enumerating "everything an agent might want to do with PM data" and ended up with a list of 60+ candidate tools. We trimmed aggressively. The 27 that shipped fit three groups:
- Read & search (12 tools): list_projects, get_project, list_tasks, get_task, list_my_tasks, list_overdue, list_org_members, list_team_members, list_risks, find_similar_projects, summarize_project, search_org_knowledge.
- Mutations (11 tools): create / update for projects, tasks (including assign_task and move_task_to_sprint), milestones, comments, project membership, link_dependency, and submit_timesheet.
- Bulk + agentic (4 tools): bulk_update_tasks, create_sprint_with_tasks, analyze_project_risks, generate_status_report.
Three principles drove the trimming:
Bias toward verbs the agent will compose into useful workflows, not verbs that match REST endpoints. Our REST surface has 280+ endpoints. We didn't expose 280 tools. The MCP surface should be a curated palette an LLM can pick from without combinatorial confusion. When in doubt about a tool, we asked "would a senior PM ever phrase a request that way?" If the answer was no, the tool got cut.
One canonical way to do each thing. We could have shipped four variants of list_tasks (by project, by assignee, by status, by tag). We shipped one with optional filter parameters. The argument for variants is "clearer prompt-time tool choice"; the argument against is that variant proliferation is the leading cause of tool-catalog rot in MCP servers we've audited. One tool with optional filters is the right trade.
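For illustration, the collapsed tool's input schema looks roughly like this; the field names are representative, not our exact schema.

```typescript
// One list_tasks with optional filters, instead of four per-filter variants.
import { z } from "zod";

const listTasksInput = z.object({
  projectId: z.string().optional(),  // scope to one project
  assigneeId: z.string().optional(), // "what's John working on"
  status: z.enum(["todo", "in_progress", "blocked", "done"]).optional(),
  tag: z.string().optional(),
  limit: z.number().int().min(1).max(100).default(25),
});
```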
Mutations are explicit, with idempotency. update_task takes an optional idempotency key in the input schema. Agents that retry on transient failure (which is a lot of them) won't double-write. The audit row records the key. We don't trust agents to retry correctly; the server makes retry-correctness unnecessary.
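A sketch of how the server makes retries safe. The idempotencyKey column on the audit model is an assumption (the post above says the audit row records the key but doesn't name the field); the lookup-then-replay shape is the point.

```typescript
// Idempotency sketch: if we've seen this key for this token, replay the
// recorded result instead of writing again. Model and field names assumed.
import { PrismaClient } from "@prisma/client";
const prisma = new PrismaClient();

async function updateTaskIdempotent(
  tokenId: string,
  input: { taskId: string; idempotencyKey?: string; patch: Record<string, unknown> },
) {
  if (input.idempotencyKey) {
    const prior = await prisma.aiOperation.findFirst({
      where: { apiTokenId: tokenId, idempotencyKey: input.idempotencyKey },
    });
    if (prior) return prior.output; // replay: no second write
  }
  // First time we've seen this key: perform the write, then audit it
  // (the audit row records the key, which is what makes the replay work).
  return prisma.task.update({ where: { id: input.taskId }, data: input.patch });
}
```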
The hybrid search engine (the differentiator)
The single tool we spent the most engineering time on is search_org_knowledge. When you watch an agent actually try to do useful work with a PM-tool MCP, the failure mode that shows up most often is "the agent can't find the right context." Project X has 200 tasks; the user asked about "the database thing"; the agent calls list_tasks; the response is too big to reason over.
We invested in a single tool that resolves "find me everything related to X" in one round-trip, returning a ranked, citation-ready list. Under the hood:
- Per-content-type embeddings. Every project description, task title + description, risk, goal, comment, and wiki-page paragraph is embedded at write-time with a Matryoshka-friendly model. Embeddings live in Postgres alongside the source rows; no separate vector DB. The bet is that pgvector plus a couple of the right indexes is good enough for the search recall we need on PM-shaped data, and that the operational cost of a separate vector service exceeds its performance benefit at our current scale.
- BM25 lexical pass. Run in parallel with the vector pass via Postgres full-text search. Vectors catch synonyms ("DB migration" finds "database refactor"). BM25 catches exact-match ID-shape strings ("PROJ-447") that embeddings often fumble.
- Reciprocal rank fusion. The two result sets are fused with an RRF rank rather than a weighted-score blend (see the sketch after this list). We tried weighted blending first and found it sensitive to score-scale drift across content types. RRF is rank-only and gives much more stable behavior across queries.
- Org-scoped, role-aware. Every retrieval row is filtered by org and re-checked against the caller's project membership. An agent authenticated to org A can never receive a row from org B, even if the embedding similarity says they're close. We made this filter run in SQL, not in application code, so it can't be bypassed by a future refactor.
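Here's the fused retrieval as a single round-trip, sketched against a hypothetical knowledge_items view that unions the per-type tables. The embedding and search_tsv columns, and the conventional RRF constant k = 60, are assumptions; the org filter on both branches and the server-side top-k of 20 match what we described above.

```typescript
// Hybrid vector + BM25-style retrieval fused with RRF, in one SQL round-trip.
import { PrismaClient, Prisma } from "@prisma/client";
const prisma = new PrismaClient();

async function searchOrgKnowledge(orgId: string, query: string, queryEmbedding: number[]) {
  const vec = `[${queryEmbedding.join(",")}]`; // pgvector literal
  return prisma.$queryRaw(Prisma.sql`
    WITH vector_hits AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> ${vec}::vector) AS r
      FROM knowledge_items
      WHERE org_id = ${orgId}          -- tenant filter in SQL, not app code
      ORDER BY r LIMIT 50
    ),
    lexical_hits AS (
      SELECT id, ROW_NUMBER() OVER (
        ORDER BY ts_rank_cd(search_tsv, websearch_to_tsquery('english', ${query})) DESC
      ) AS r
      FROM knowledge_items
      WHERE org_id = ${orgId}
        AND search_tsv @@ websearch_to_tsquery('english', ${query})
      ORDER BY r LIMIT 50
    )
    SELECT id, SUM(1.0 / (60 + r)) AS rrf_score    -- reciprocal rank fusion
    FROM (SELECT * FROM vector_hits UNION ALL SELECT * FROM lexical_hits) fused
    GROUP BY id
    ORDER BY rrf_score DESC
    LIMIT 20;                                      -- server-side top-k = 20
  `);
}
```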
We considered exposing the lexical and vector tools separately so agents could choose. We decided against. The fused tool is what agents actually want; the choice between vector and BM25 is implementation detail, not surface area.
Result quality, measured on an internal eval of 200 PMO-shaped queries with ground-truth labels: F1 of about 0.74 on the fused tool vs 0.61 on vector alone and 0.58 on BM25 alone. The fusion clearly wins. The remaining ~0.26 F1 gap is mostly queries where the right answer is in a closed comment thread or a wiki page we haven't indexed yet. Both are works in progress.
Security: PAT scopes, plan gates, audit trail
The single highest-risk design decision in an MCP server is auth shape. Get it wrong and you ship a PAT that lets any agent do anything in any org. We ran the auth layer through a security review before any tool was wired up. Three layers:
1. Personal access tokens, scoped. We added a new permission scope MCP_AGENT that's distinct from every other PAT scope. A token with only REPORTS_READ cannot call MCP tools, even though both routes accept Bearer tokens. Customers can mint MCP-only tokens and rotate them independently from their REST tokens.
2. Plan + role gates, enforced at dispatch. The same requireFeature(key) middleware that gates REST routes runs on every tool call. The plan map is shared with the in-app catalog. An MCP agent on a FREE plan cannot call analyze_project_risks any more than a FREE-plan user can open the in-app risk view.
3. Preview mode by default on free tiers. Mutations on FREE and STARTER plans return a preview response by default — the agent sees the operation that would be performed, including the diff, without writing to the database. To commit, the agent passes { commit: true } in the input (a sketch follows this list). This was our concession to the "AI accidentally deleted my project" failure-mode reports we'd watched competitors absorb. Free users opt into the destructive behaviour rather than accidentally falling into it.
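A sketch of the preview gate, with illustrative names; the real dispatcher threads this through every mutation tool.

```typescript
// Preview-by-default on free tiers: mutations return a diff unless the
// agent explicitly passes commit: true. Names and shapes are illustrative.
type MutationInput = { commit?: boolean } & Record<string, unknown>;

async function gatedMutation(
  org: { plan: "FREE" | "STARTER" | "PRO" },
  input: MutationInput,
  run: () => Promise<unknown>,     // performs the write
  preview: () => Promise<unknown>, // computes the would-be diff, no write
) {
  const previewByDefault = org.plan === "FREE" || org.plan === "STARTER";
  if (previewByDefault && input.commit !== true) {
    // Show the agent what would happen, including the diff, without writing.
    return { preview: true, wouldApply: await preview() };
  }
  return run(); // explicit opt-in (or paid plan): perform the write
}
```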
On top of those three, every tool call writes an AiOperation row with input, output (truncated to 4 KB), duration, and the originating PAT's id. The server enforces a 120 req/min per-token rate limit. Cost cap enforcement is shared with in-app AI — an org that's blown its monthly cap can't bypass it via MCP.
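Both enforcement details reduce to a few lines each. This sketch uses an in-memory window for brevity (a shared store is needed across nodes) and character-level truncation where a production writer would be byte-accurate.

```typescript
// Per-token fixed-window rate limit (120 req/min) and 4 KB audit truncation.
const WINDOW_MS = 60_000;
const LIMIT = 120;
const windows = new Map<string, { start: number; count: number }>();

function checkRateLimit(tokenId: string): boolean {
  const now = Date.now();
  const w = windows.get(tokenId);
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(tokenId, { start: now, count: 1 }); // new window
    return true;
  }
  return ++w.count <= LIMIT;
}

function truncateForAudit(output: unknown, maxBytes = 4096): string {
  const s = JSON.stringify(output) ?? "";
  // Character slice as an approximation; byte-accurate slicing elided.
  return Buffer.byteLength(s, "utf8") <= maxBytes ? s : s.slice(0, maxBytes);
}
```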
The piece we're least confident about, looking back: we currently scope tokens at the org level but not at the project level. A team member with MCP_AGENT scope can effectively read every project they have access to. We think this is the right default — PATs already work this way — but a future iteration may add per-project subscopes for customers with tight compliance requirements. The audit trail mitigates most of the practical concern.
The hard parts we got wrong first
Honest accounting of the three biggest mistakes we made on the way to the shipped version:
We started with too many tools. The first internal cut had 52 tools, including granular variants like list_tasks_by_assignee, list_tasks_by_project, list_tasks_by_status. We watched a Claude Desktop session try to pick between them on a query like "what's John working on" and the model picked the wrong variant about 30% of the time. Collapsing to one filterable list_tasks tool with optional params dropped wrong-tool errors close to zero. We re-tested every variant we considered cutting and kept the rule that variant proliferation is the enemy of agent reliability.
We exposed embeddings as a raw tool first. The first cut of search_org_knowledge exposed a similarity_threshold parameter and let the caller pick the embedding model. Agents would set the threshold to 0.3 and pull back hundreds of weakly-related rows, then truncate to fit the context window. Useless. We replaced the parameter with a sensible server-side default (RRF top-k = 20) and stopped exposing the model choice entirely. Tool surface area should be the language a senior PM speaks, not the language a vector-search engineer speaks.
We initially logged tool inputs in plain text. The PAT-token string itself never showed up in logs (we redact that at the auth-middleware level), but the first audit-log shape included full tool inputs including any embedded URLs, email addresses, or task-description content. After a security review pointed out that this could surface PII into audit-log queries used by ops teams without org-scope filtering, we changed the audit-log writer to truncate inputs to 4 KB and to require the same org-scope check on /admin/ai-usage queries that the rest of the admin surface uses. The lesson: audit logs are not a free observability primitive; they're an in-scope security surface and need the same access controls as the data they record.
One thing we didn't get wrong but were nervous about: response streaming. MCP supports SSE, but the JSON-RPC framing on top is brittle if you stop and restart a stream. We held the line on full-buffer responses for tools that finish in under 5 seconds, and only stream for the two long-running tools (analyze_project_risks and summarize_project with large projects). The trade-off favors simpler client integration over partial-token UX.
What's next
Three threads we're actively pulling on:
- Per-project subscopes for compliance customers. The current org-level scoping is right for most teams. Regulated PMOs with strict need-to-know separation between project teams will get the option to mint MCP tokens that can only see specific projects.
- Wiki + comment-thread indexing. The retrieval F1 gap is mostly closed-comment threads and wiki pages we haven't indexed yet. The indexing pipeline is straightforward; the harder part is deciding which wiki content should be agent-visible. Default-deny with an opt-in flag at the page level is where we're headed.
- Tool-eval CI. We have an internal eval harness that runs ~200 PM-shaped agent prompts through the MCP and scores correctness. We're wiring that into CI so any tool-surface change has to clear the eval before it ships. This is where most MCP-server regressions hide.
If you're building agents that need to reach into PM data — assignment, status, risk, dependencies, the whole picture — start with our MCP docs and the AI Agents tab in your org settings. The broader AI architecture context is on the AI project management feature page; pricing tiers and plan-gated tool surface details are on our pricing page.
If you've got feedback on the tool shape or a workflow we're missing, tell us. We read every one.
Connect your agent to Onplana
Free-plan org, free PAT, full read access to your data. Mint a token from Settings → Developer, drop the URL into your MCP client config, and you're done.
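The exact config shape depends on your client; for clients that take a Cursor-style mcp.json, it looks roughly like this (the URL and token are placeholders, not literal values):

```json
{
  "mcpServers": {
    "onplana": {
      "url": "https://onplana.com/api/mcp",
      "headers": {
        "Authorization": "Bearer <your-mcp-scoped-pat>"
      }
    }
  }
}
```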