Opened 42 hours ago

Last modified 19 hours ago

#19477 assigned enhancement

Improvements to the mcp_server bundle that allows the Claude AI agent to control ChimeraX

Reported by: Tom Goddard
Owned by: Tom Goddard
Priority: moderate
Milestone:
Component: UI
Version:
Keywords:
Cc: a.rohou@…, Zach Pearson
Blocked By:
Blocking:
Notify when closed:
Platform: all
Project: ChimeraX

Description

This ticket is to make miscellaneous usability improvements to the ChimeraX MCP server that allows Claude AI to send commands to ChimeraX.

We have an MCP (Model Context Protocol) bundle, mcp_server, that allows the Claude AI agent to execute commands in ChimeraX. Alexis Rohou and Zach Pearson made initial versions. Currently the code includes the mcp start/stop/info ChimeraX commands that Zach added (written by Claude Code) and a chimerax_mcp_bridge.py file, mostly created by Alexis Rohou (also, I think, generated by Claude Code), that implements the MCP server Claude starts in order to send commands to ChimeraX.

Here are some problems with the current mcp_server.

1) Add an mcp setup command to create the Claude MCP configuration file. Currently the mcp info command suggests that the user paste some JSON into a Claude configuration file without saying where that file is or how to merge the JSON into a possibly already existing configuration file. The mcp info command is also giving the wrong configuration paths on Mac and on Windows. An example of the configuration format is shown below.
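
For reference, the Claude Desktop configuration file is claude_desktop_config.json, found under ~/Library/Application Support/Claude/ on Mac and %APPDATA%\Claude\ on Windows. It registers MCP servers with entries of roughly the form below; the server name and paths are placeholders for illustration, not the exact values an mcp setup command would write.

    {
      "mcpServers": {
        "chimerax": {
          "command": "/path/to/python",
          "args": ["/path/to/chimerax_mcp_bridge.py"]
        }
      }
    }

An mcp setup command would need to merge an entry like this into any existing mcpServers section rather than overwriting the whole file.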

2) Simplify and fix bugs in the bridge code. The current chimerax_mcp_bridge.py behaves worse than the early October version in several ways. It calls the save_image() method with absolute paths on the Claude server instead of on the local machine. It tries to look up the ChimeraX documentation for commands, but the bridge code fails to find the docs on Mac or Windows due to incorrect code in get_docs_path(). The run_command() method gives an error when asked to run show or hide commands, trying to force Claude to use show_hide_objects(), but Claude often keeps using run_command() anyway, spewing lots of errors into the Claude output. The extensive comments in the bridge code's methods seem to be causing Claude to try all kinds of misguided commands that it did not try in the October versions. While the comments are intended to help Claude, and in some cases probably do, the current bridge code has been working much worse on my simple test prompts, for example, "Show a conotoxin structure in ChimeraX." A sketch of one way to keep image saving on the local file system follows.
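
One hedged way to avoid the image-saving confusion would be for the bridge to ignore agent-supplied absolute paths and always write to a temporary directory on the machine running ChimeraX. The helper below is only a sketch; run_chimerax_command stands in for whatever mechanism the bridge uses to execute a ChimeraX command.

    import os
    import tempfile

    def save_image_locally(run_chimerax_command, filename="chimerax_image.png"):
        # Sketch only: always save under the local temp directory so that an
        # agent-supplied absolute path, which may describe a different file
        # system, is never trusted.
        path = os.path.join(tempfile.gettempdir(), os.path.basename(filename))
        run_chimerax_command('save "%s"' % path)  # ChimeraX save command infers image format from the .png suffix
        return path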

Change History (8)

comment:1 by Tom Goddard, 42 hours ago

Cc: a.rohou@… Zach Pearson added

I added an "mcp setup" command to write the Claude Desktop configuration file for the ChimeraX MCP server on Mac and Windows. There is no official Claude Desktop app on Linux. I also replaced the vague instructions given by the "mcp info" command about pasting into the Claude config file with instructions to run "mcp setup".

comment:2 by Tom Goddard, 39 hours ago

I made several changes to the chimerax_mcp_bridge.py script to get it working on Windows and with fewer failures on the simple prompt "Show me a conotoxin structure in ChimeraX." Here is the list.

  • Fixed MCP start_chimerax() on Windows.
  • Fixed MCP find_chimerax_executable() and get_docs_path() on Windows (the general shape of the platform-dependent lookup is sketched after this list).
  • Removed text from the MCP list_models() doc string that said "Use this regularly to find out what models are loaded and check which ones are set to be visible" because it seemed to cause Claude to do excessive listing of models (as requested) when not needed, cluttering the Claude output.
  • Commented out the MCP show_hide_objects() method, since the very extensive doc string for that method seems to cause Claude to engage in a lot of pointless hiding and showing.
  • Commented out the MCP save_image() method, since Claude usually gives it absolute paths on Anthropic's Claude servers instead of on the local machine and gets very confused.
  • Removed the MCP bridge code that made run_command() give an error for show/hide commands.
  • Fixed the MCP bridge get_docs_path() on Mac; it was failing to find the docs.
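
For anyone reviewing the executable and docs lookup changes, the sketch below shows the general shape of a platform-dependent search; the candidate install locations are illustrative guesses, not necessarily the paths the bridge now uses.

    import glob
    import platform

    def find_chimerax_executable():
        # Sketch only: check a few plausible install locations per platform
        # and return the first match, or None if nothing is found.
        system = platform.system()
        if system == "Darwin":
            candidates = glob.glob("/Applications/ChimeraX*.app/Contents/MacOS/ChimeraX")
        elif system == "Windows":
            candidates = glob.glob(r"C:\Program Files\ChimeraX*\bin\ChimeraX.exe")
        else:
            candidates = glob.glob("/usr/bin/chimerax") + glob.glob("/usr/local/bin/chimerax")
        return candidates[0] if candidates else None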

comment:3 by a.rohou@…, 39 hours ago

Hi Tom,

I will not have time to test this for at least a few days, but I have a few
thoughts to offer (sorry I'm doing this by email - I don't think I can
comment directly in the ticket system).

A general point first: when reporting that something did or didn't work, we
should endeavour to specify which LLM the MCP client (agent) was using. In
your case, which LLM were you using? For reference, I've been using Sonnet
4.5 for the most part so far.

(1) Commenting out the show-hide objects method may seem like it saves
time for your particular prompt by avoiding what look like unnecessary
hiding and showing operations, but with the test prompts that are in the
repo, in my hands, that method was necessary to get good performance
(specifically: it was needed to get the agent to correctly show or hide
specific targets... a common failure mode, for example, was that it would
show <spec> target ribbon but not show the model at all, so nothing would be visible).

Did you check your version with that method commented out with the test
prompts? Did it perform well? If so, then great!

(2) Similarly with list_models - what may seem excessive in some contexts
may be very helpful in others. I am less sure that this one is necessary,
though, to be honest - I don't remember testing this as extensively. But
again the same question: did you check that your version completes the test
prompts correctly?

(3) I'm excited to see what we can do with fixed get_docs_path... are you
getting better behavior in terms of the agent leveraging documentation? If
so perhaps that's why you're seeing better behavior without my special
show-hide method.

Another general point: I think it's OK to get the agent to perform more
commands than an experienced human might perform, even if it seems
wasteful, so long as it achieves robustness and reliability.

Cheers,
Alexis

Last edited 37 hours ago by Tom Goddard (previous) (diff)

comment:4 by Tom Goddard, 36 hours ago

Hi Alexis,

When you reply to the email from chimerax-bugs@…, your reply automatically gets attached to this public ticket. Be careful not to include any private information in replies to the bug tracking email.

All my testing on this ticket has been with Claude Sonnet 4.5 using Claude Desktop version 1.0.1405.

Adding new MCP methods ("mcp tools") will improve the answers on some queries and harm the answers on others. This makes it difficult to figure out which methods help overall. Every change I made was separately tested with and without the change, but only on one or two different queries. Obviously that is not going to give a reliable indicator of whether a method is an overall improvement or not. Still, I removed the methods and doc strings that degraded performance on the couple of queries I tested.

My motivation is that I observed significantly worse answers on the queries I tested with yesterday's MCP bridge code. By worse, I don't mean that it simply introduced superfluous commands; it produced many failing commands, and too often it went berserk for 5 minutes at a time, for example trying to save images on the wrong file system.

The behavior might have been better than with the earlier bridge code if I had chosen different queries. So why did I start removing stuff? For starters, too much of the bridge code I looked at while trying to debug failures was simply wrong. The code to find the docs did not work on Mac, starting ChimeraX did not work on Windows, and image saving was a catastrophe, with Claude failing to understand that ChimeraX runs on a different file system than Anthropic's server.

In order to let ChimeraX users try this without harming the trust users have in our software, the quality has to be higher. When I tried debugging why Claude went on expeditions that seemed irrelevant to my prompts, I came upon various "hints, tips, orders, suggestions" in the tool doc strings that seemed like they might be sending Claude astray. Removing those and testing on the same prompts improved the answers most of the time. My sense is that the more instructions Claude is given in the MCP bridge code, the more ways it has to run amok.

I also found that more bridge methods make it harder to debug. I spent all day trying to track down specific causes of bad answers that did not occur with the earlier bridge code.

This is all speculative. My impression is that improving the answers would require a lot of testing of many possible methods with many queries, something we don't have the resources for. Lacking that wide testing, I think less code is more likely to achieve a good result than more minimally tested code. I know the methods you added were all meant to improve specific behaviors. But since we have little understanding of how Claude will use the methods and hints, it feels like adding code whose effects we don't understand, which traditionally makes software worse. On the flip side, we often make verbal suggestions to human colleagues and expect that to improve their results. I guess I feel Claude is closer to a piece of software than to a human, and so is more likely to get worse given lots of general hints, whereas we expect a human to get better with more hints.

Tom

comment:5 by a.rohou@…, 25 hours ago

Hi Tom,

Thanks for the context.

It sounds to me like you are making many changes. Some of them fix broken
stuff (starting on Windows, not fetching docs correctly) which I had
never tested, others fix bugs that only emerge when using Claude Desktop
(e.g. image saving) and which I also never found, but others undo things
I spent many hours working on in order to make your previously-suggested
test prompts work. These prompts are found under test_prompts/ and they
are called MuOpioid.txt and Ntca.txt. (I also added Giredestrant.txt.)

I think at a minimum, newly introduced changes should not break previously
working features.

After all your changes, are the already-defined test prompts, three of
them, still handled well and do they give good results? I don't think it
takes all that long to run those three prompts. If they do, then great! If
not, then I would suggest
(1) committing and pushing the test prompts you have been using recently so
that I can test them as well when I next get a chance and so that we have a
more robust set of tests for future development
(2) treating as a bug the failure of the code I had pushed to handle the
new test prompts; the fix to the bug may well involve disabling the
show_hide_objects tool, but if that means that the other test prompts then
stop working, then that is a regression which should be fixed before the
bug is considered fixed. Perhaps I would be able to find another fix that
does not introduce any regressions, who knows...

At least, that's how I suggest we could proceed. If you think we should
work in some other way to avoid regressions, I'm all ears and happy to
contribute to any "way of working" that moves this forward.

Cheers,
Alexis

Last edited 25 hours ago by Tom Goddard (previous) (diff)

comment:6 by Tom Goddard, 24 hours ago

Hi Alexis,

I don't think this concept of not making changes that are regressions can work in this situation. The trouble is that any MCP change has far-reaching consequences. If I have 10 test prompts and try to avoid degrading performance, it will be impossible, because any MCP change is likely to bump some prompts up to better answers and cause others to generate worse ones. Another factor is that I get different answers when running the same prompt on the same day in a new chat with no MCP changes. It would probably be necessary to run a prompt 3 or 5 times to see whether on average it does better. The answers can differ greatly in length (some taking 1 minute, some taking 3 minutes) and in the final result that appears in ChimeraX, and assessing which answer is "better" is subjective and difficult.

I think the route forward is to gain better general knowledge about how to improve MCP results by studying what others have done. MCP was released in November 2024. Sadly, I have not researched what is known about improving the behavior of an MCP application.

The fundamental problem I see with MCP, and AI chat in general, is that it does not appear to learn. It makes the same mistakes over and over in each new chat. This is disastrous behavior compared to humans, who usually don't repeat the same mistakes when using software.

Adding new MCP "tools", i.e. methods in chimerax_mcp_bridge.py, is obviously one way to tweak the behavior, maybe the only way. But I think that can easily harm the overall performance. As an example, your method show_hide_objects() attempts to fix the common syntax errors Claude generates when using the ChimeraX show and hide commands. But the show and hide commands are very extensively documented, with many examples in different online sources (reference pages and tutorials). By comparison, your show_hide_objects() has much less documentation. So why would you expect Claude to figure out when and how to use show_hide_objects() more effectively than the show and hide commands? I understand why you added it. It was an obvious way to change the show/hide behavior. Maybe there are no other effective ways to improve show/hide, but that is not clear to me. My first inclination would be to give Claude a list of example commands with correct show/hide syntax and with wrong syntax, including the wrong syntaxes it likes to use, and try to get it to apply those. That way Claude still gets to use the extensive show/hide documentation online, only that documentation is now supplemented specifically through MCP. I saw that your MCP code attempts to do this through the tool doc strings.

Possibly a general mechanism to supplement Claude's knowledge would be a single tool that gives examples of correct and wrong syntax for any specified command, or for all commands. The list of wrong syntaxes could be dynamically extended each time Claude executes a command that gives a syntax error, so that in the next session it will have "learned" not to make the same mistake. A rough sketch of that idea follows.
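
To make that concrete, here is a hypothetical sketch, not code in the current bridge; the file name and function names are made up for illustration. One function records a wrong command line whenever run_command() reports a syntax error, and another returns the recorded mistakes for a command so they can be handed back to the agent.

    import json
    import os
    from collections import defaultdict

    # Hypothetical file holding wrong syntaxes observed in earlier sessions.
    MISTAKES_FILE = os.path.expanduser("~/.chimerax_mcp_command_mistakes.json")

    def _load_mistakes():
        if os.path.exists(MISTAKES_FILE):
            with open(MISTAKES_FILE) as f:
                return defaultdict(list, json.load(f))
        return defaultdict(list)

    def record_syntax_error(command_line, error_text):
        # Remember a command line that produced a syntax error so future
        # chats can be warned about it.
        mistakes = _load_mistakes()
        name = command_line.split()[0] if command_line.split() else ""
        entry = {"wrong": command_line, "error": error_text}
        if entry not in mistakes[name]:
            mistakes[name].append(entry)
        with open(MISTAKES_FILE, "w") as f:
            json.dump(mistakes, f, indent=2)

    def command_syntax_examples(command_name):
        # Body of a hypothetical MCP tool: return the known wrong syntaxes
        # for a command so the agent can avoid repeating them.
        return {"command": command_name,
                "known_wrong_syntax": _load_mistakes().get(command_name, [])}

Hand-written correct-syntax examples for each command could be kept in the same file, so the tool would serve both good and bad examples.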

These ideas are based on very little. I think we need to understand how to program MCP more generally to produce better results, learning from the experience of others. Adding more tools and following traditional programming notions, where more code fixes specific defects and regression tests prevent adverse side effects, seems misguided.

If you have found useful MCP programming advice online I'd love to see that.

Tom

comment:7 by a.rohou@…, 23 hours ago

Hi Tom,

Agree with all your general points, and I like the way you are thinking
about how we could better get the agents to use show/hide and other
commands. Agree 100% that the way I did it is suboptimal in many ways.

However, I still would really like you to commit and push example prompts
that reveal bad behavior as you discover them. Would you mind doing that?
That way when I next get time to work on this, I can observe for myself
what you saw, and perhaps I can try to come up with a solution that works
both for the first set of prompts as well as your new prompts. I am OK with
trying to run each prompt 3-5 times and trying to develop an
understanding of how the agent is failing in order to then try to come up
with solutions - it's quite an instructive process in my experience so far.

I also still think we should aim to not break things that were already
working when pushing things that others will use, but I'm OK with agreeing
to disagree on that point for now!

Cheers,
Alexis

Last edited 19 hours ago by Tom Goddard (previous) (diff)

comment:8 by Tom Goddard, 19 hours ago

Sure. The prompt I used 30 times yesterday with different MCP bridge variants was "Show me a conotoxin structure in ChimeraX." I got about 20 different answers in different chats, many with exactly the same MCP bridge code, ranging from 2 to 5 pages long.

I noticed I can now copy and paste from Claude Desktop, which was previously not possible, but it does not copy the commands and return values, so that is not too useful. Also, the new Claude Desktop no longer shows the failed ChimeraX commands in its log; instead it shows an exclamation mark and a popup that reports the error but not the command that caused it, which is a disaster for debugging. I can look at the ChimeraX Log and see the failed commands. As we discussed before, it is difficult to record the output after a Claude prompt, and without a record of the output I'm not sure how you envision gauging regressions. In yesterday's debugging I think the ChimeraX Log output was more useful than the Claude Desktop log.

I am fine with trying to record the prompts I try. That is easy, but I feel it is of rather low value. Most of the prompts I've used are not meaningful tests. So far I have just been trying to get my bearings, finding out what Claude is capable of and what its failure modes are, so I have most often used many different prompts.

I think you and I have somewhat different objectives. You said you have certain specific commonly used tasks you want to automate. I am interested in how Claude MCP will fare when faced with a diverse set of queries from the whole ChimeraX user base covering all ChimeraX capabilities. Optimizing Claude MCP for special cases is a different task from optimizing it for diverse cases, and that probably explains the different approaches we have been discussing.
