# GoatLLM — Local AI coding assistant for VS Code

> Maintainer: Brandon Charleson · Latest version: 1.0.1 · Site: https://goatllm.ai · License: free to use; source not currently public

GoatLLM is a VS Code extension that lets you chat, edit, and run agentic coding tasks against open-weight large language models running **entirely on your own machine**. No accounts. No cloud. No telemetry. No rate limits.

The extension binary is **free to download and use**. The source is currently closed while the project stabilizes; opening it is on the roadmap. The runtimes and models GoatLLM connects to (MLX, Ollama, LM Studio, llama.cpp, etc.) are independently open source.

---

## What GoatLLM is

A VS Code extension (`goatllm-vscode`) that exposes a chat sidebar, an inline code-action menu, and a status-bar throughput indicator. Under the hood it speaks the OpenAI-compatible HTTP API — every request goes to a server you've configured under `goatllm.endpoints`, which can be on `localhost`, on a Mac connected via Thunderbolt, on a GPU box on your LAN, on an `exo` cluster, or anywhere else that exposes `/v1/chat/completions`. Nothing ever leaves the machines you control.
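For a concrete sense of the wire format, here is the shape of request any configured endpoint must accept — the host, port, and model id below are placeholders for whatever your own server exposes, not GoatLLM internals:

```bash
# Illustrative only: the OpenAI-compatible request shape GoatLLM depends on.
# Substitute the host/port and model id your server actually serves.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "stream": true
  }'
```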

GoatLLM is positioned as a **local-first, privacy-respecting alternative** to Cursor and GitHub Copilot. It is intended for engineers who want the productivity of an AI coding assistant without the per-token costs, data egress, or vendor lock-in of cloud-only tools.

---

## Three operating modes

Pick a mode from the dropdown at the top of the chat panel.

### Chat
Pure conversation. No tools. Use for Q&A, explanations, brainstorming, and pasted-code review. The model cannot read your filesystem or run anything.

### Agent
Native tool calling. The model has access to:
- `read_file(path)` — read a file's contents. Auto-approved; side-effect-free.
- `list_directory(path)` — list directory contents. Auto-approved; side-effect-free.
- `write_file(path, content)` — write a file. **Prompts for approval** with a diff preview before applying.
- `run_command(command)` — execute a shell command in the workspace cwd. **Prompts for approval** with the full command shown.

Use for reviewed edits and debugging where you want to remain in the loop.

### Agent (full access)
Same four tools as Agent mode, but **writes and commands auto-approve**. The built-in command deny list is still enforced. Use for hands-off refactors, scaffolding, and migrations where babysitting every step would defeat the point.

Agent modes use OpenAI-style `tool_choice: auto`, so your local model must support tool calling. **Verified to work:** Qwen 2.5-Coder, Llama 3.1+, Gemma 2+, DeepSeek-Coder-V2, Mistral 0.3+, Phi-3.5, and fine-tunes of these.
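For reference, tool-capable requests use the standard OpenAI `tools` schema. A minimal sketch follows — the exact JSON schemas GoatLLM sends for its four tools are internal, so treat the shapes here as illustrative:

```bash
# Sketch of an OpenAI-style tool-calling request (schemas are illustrative).
# A tool-capable model replies with a `tool_calls` array instead of text.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [{"role": "user", "content": "What is in src/?"}],
    "tool_choice": "auto",
    "tools": [{
      "type": "function",
      "function": {
        "name": "list_directory",
        "description": "List the contents of a directory",
        "parameters": {
          "type": "object",
          "properties": { "path": { "type": "string" } },
          "required": ["path"]
        }
      }
    }]
  }'
```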

---

## Installation

Two install paths are documented on the site. Path 1 is recommended until Marketplace publication completes.

### Path 1 — drag-and-drop the `.vsix` (recommended)
1. Download the latest `.vsix` from <https://goatllm.ai/downloads/goatllm-vscode-1.0.1.vsix>.
2. Open VS Code, click the Extensions icon in the activity bar (⇧⌘X on macOS, Ctrl+Shift+X elsewhere).
3. Drag the `goatllm-vscode-1.0.1.vsix` file onto the Extensions panel.
4. Reload VS Code when prompted.

**CLI alternative:** `code --install-extension goatllm-vscode-1.0.1.vsix`

### Path 2 — VS Code Marketplace (pending publish)
1. Open the Extensions panel in VS Code.
2. Search for `GoatLLM`.
3. Click Install on the entry published by `goatllm`.

Marketplace publication is in review. Once it lands, the site will update with a direct install link and users will receive auto-updates.

---

## Setup: running a model locally

GoatLLM speaks the OpenAI-compatible HTTP API. Anything exposing `/v1/chat/completions` will work. The four most common runtimes are documented inline on <https://goatllm.ai/#setup>.

### Ollama (cross-platform, simplest)

```bash
# 1. Install
brew install ollama                            # macOS
curl -fsSL https://ollama.com/install.sh | sh  # Linux

# 2. Pull a coding model
ollama pull qwen2.5-coder:32b

# 3. Start the server (usually auto-starts)
ollama serve

# 4. In VS Code: open the GoatLLM sidebar, click "Detect local servers".
```

Default endpoint: `http://localhost:11434/v1`
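If detection doesn't find the server, you can verify it by hand — the same check works for every runtime in this section, with the port swapped:

```bash
# Quick sanity check: any OpenAI-compatible server lists its models here.
curl -s http://localhost:11434/v1/models
```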

### LM Studio (GUI)

1. Download LM Studio from <https://lmstudio.ai> and install.
2. In the Discover tab, search for a coding model (e.g. `Qwen2.5 Coder 32B Instruct`), download it.
3. Open the Developer tab → Start Server.
4. In VS Code, open the GoatLLM sidebar and click Detect local servers.

Default endpoint: `http://localhost:1234/v1`

### MLX with Hugging Face models (native Apple Silicon)

```bash
# 1. Install MLX (Apple Silicon only)
pip install -U "git+https://github.com/ml-explore/mlx-lm.git"

# 2. Start the server (model auto-downloads from huggingface.co)
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
  --port 8013 --host localhost

# 3. In VS Code: GoatLLM sidebar → Detect local servers.
```

Default endpoint: `http://localhost:8013/v1`. Browse compatible weights at <https://huggingface.co/mlx-community>.

### llama.cpp (bare-metal, GGUF quantizations)

```bash
# 1. Build llama.cpp with CMake (or grab a release binary)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# 2. Download a GGUF from Hugging Face, then:
./build/bin/llama-server -m ./models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --port 8080 --host 127.0.0.1

# 3. In VS Code: GoatLLM sidebar → Detect local servers.
```

Default endpoint: `http://localhost:8080/v1`

---

## Configuration reference

Everything is under `goatllm.*` in VS Code settings. API keys live in VS Code's `SecretStorage` and are never written to `settings.json`.

| Setting | Description | Default |
|---|---|---|
| `goatllm.endpoints` | Array of `{name, baseUrl, apiKey?}` — every server you've connected. | auto-populated by Detect |
| `goatllm.activeEndpoint` | Name of the currently-selected endpoint. | first detected |
| `goatllm.defaultModel` | Default model id. Falls back to first entry from `/v1/models`. | unset |
| `goatllm.temperature` | Sampling temperature. 0 = deterministic, 2 = creative. | `0.4` |
| `goatllm.maxTokens` | Cap on tokens generated per response. | `4096` |
| `goatllm.systemPrompt.chat` | Override the system prompt for Chat mode. | built-in |
| `goatllm.systemPrompt.agent` | Override the system prompt for Agent mode. | built-in |
| `goatllm.systemPrompt.agentFull` | Override the system prompt for Agent (full access). | built-in |
| `goatllm.commandDenyList` | Extra substring patterns to block in Agent modes. | `[]` |
| `goatllm.allowSudo` | Allow `sudo` in Agent modes. Off by default. | `false` |

### Connecting a remote endpoint

```jsonc
// In VS Code settings.json
{
  "goatllm.endpoints": [
    { "name": "Studio (MLX)",   "baseUrl": "http://10.0.0.20:8013/v1" },
    { "name": "GPU box (vLLM)", "baseUrl": "http://10.0.0.30:8000/v1" },
    { "name": "Local Ollama",   "baseUrl": "http://localhost:11434/v1" }
  ],
  "goatllm.activeEndpoint": "Local Ollama"
}
```

Click the GoatLLM status bar item to flip between endpoints without leaving your editor.

---

## Security model

### Blocked unconditionally (even in Agent full access)

- `rm -rf /` and variants targeting `/`, `/*`, `~`, `$HOME`
- `mkfs`, `dd` writes to block devices
- Fork bombs (`:(){ :|:& };:` and friends)
- `sudo` — unless `goatllm.allowSudo` is explicitly set to `true`
- Any substring you add to `goatllm.commandDenyList`
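Deny-list matching is by substring (see `goatllm.commandDenyList` in the configuration reference), so a pattern like `git push --force` blocks any agent command that contains that exact string.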

### What requires approval (Agent mode only)

- `write_file` — every write, with a diff preview
- `run_command` — every shell call, with the full command shown

Reads (`read_file`, `list_directory`) auto-approve in both Agent modes because they're side-effect-free.

### Network surface

GoatLLM only makes HTTP requests to endpoints you've configured under `goatllm.endpoints`. There is **no telemetry, no analytics, no error reporting, no login, no account system, no rate limiting, no auto-update ping**. Updates flow through VS Code itself.

### Secrets handling

If you connect to a server that requires an API key:
- Keys are stored in VS Code's `SecretStorage` (macOS Keychain, Linux libsecret, Windows DPAPI).
- Keys are never written to `settings.json` or any workspace file.
- Audit and clear keys via the `GoatLLM: Manage Endpoint Keys` command.

---

## FAQ

**How do I switch models mid-conversation?**
GoatLLM polls `GET /v1/models` from your active server. Use the model picker at the top of the sidebar — switching is instant and per-conversation, so you can prototype with a small model and finalize with a bigger one.

**What hardware do I actually need?**
For a 7B coding model in Q4: any modern laptop with 8 GB RAM. For 32B at decent quality: 32 GB+ unified memory (M1/M2/M3 Pro/Max) or a 24 GB GPU. The extension itself is <1 MB and uses negligible resources — the heavy lifting is in your runtime of choice.

**Why is my model not tool-calling?**
Two common reasons: (1) the model wasn't fine-tuned for tools — try Qwen 2.5-Coder or Llama 3.1+ instead; (2) the runtime doesn't pass the `tools` field through. Ollama, LM Studio, MLX, and llama.cpp all support tool calling on recent versions.

**Does it work with cloud APIs like OpenAI or Anthropic?**
The OpenAI endpoint works as a fallback (`baseUrl: "https://api.openai.com/v1"` + API key), but GoatLLM is designed for local use — that's the entire point. For cloud, Copilot and Cursor are perfectly good.

**How do I report a bug or request a feature?**
The source repo is private. Email <b.charleson1@gmail.com> with a description and (if relevant) the contents of the GoatLLM output channel (View → Output → GoatLLM). Public issue tracking opens with Marketplace publication.

**Where do logs live?**
Open VS Code's Output panel (⇧⌘U on macOS, Ctrl+Shift+U elsewhere) and pick `GoatLLM` from the dropdown.

**What's the throughput indicator measuring?**
Tokens per second on the assistant's response, computed once the stream completes. GoatLLM uses a tuned 3.7 chars/token fallback when the runtime doesn't report token counts directly, with a 150 ms minimum-sample guard so very short replies don't post inflated numbers.
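As a worked example: a 740-character reply with no reported token counts, streamed over 2.0 s, is estimated at 740 / 3.7 ≈ 200 tokens, or about 100 tok/s.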

**Can I use it airgapped?**
Yes. Install the `.vsix` on a machine with no network. Point GoatLLM at `http://localhost:<port>/v1`. It runs forever without internet — no license server, no activation, no phone-home.

---

## Changelog

### v1.0.1 — UI & metrics polish

- Code blocks render with a header row: language label, **Copy** button, and **Insert at Cursor** button.
- Streaming is flicker-free — switched from full-bubble re-renders to incremental block-aware appends.
- GFM tables render with proper borders, header tinting, row hovers, and inline `code` styling inside cells.
- Throughput uses a tuned **3.7 chars/token** fallback (instead of the ~4 chars/token rule of thumb for English prose) for more honest tok/s on mixed English + code.
- 150 ms minimum-sample guard kills inflated readings on single-chunk replies.
- Status bar now uses the 🐐 goat emoji as the brand mark.

### v1.0.0 — Initial release

- Chat panel in the Activity Bar with three modes: Chat, Agent, Agent (full access).
- Auto-detection of local OpenAI-compatible servers (MLX, Ollama, LM Studio, llama.cpp, exo, vLLM).
- Live model list from `/v1/models` with one-click switching.
- Streaming responses with live tokens/sec in the status bar.
- Native tool calling: `read_file`, `list_directory`, `write_file`, `run_command`.
- Approval gates in Agent mode; full autonomy in Agent (full access).
- Built-in command deny list with user-extendable patterns.
- API keys stored in `SecretStorage`.
- Editor context-menu actions: Explain Selection, Generate Code.
- Original side-profile robot-goat brand mark.

---

## License & contact

- **Distribution:** free to download and use.
- **Source code:** currently closed. The repository is private while the project stabilizes. Open-sourcing is on the roadmap but not committed to a date.
- **Author:** Brandon Charleson.
- **Contact:** <b.charleson1@gmail.com>
- **Site:** <https://goatllm.ai>
