Looking for a lighter local LLM inference server than Ollama? Shimmy is written in Rust, ships as a single binary, offers 100% OpenAI API compatibility, and runs with zero configuration. This post covers what it is, how it compares to Ollama, and what happened when I tested it on WSL2 with Intel integrated graphics.

This is Part 1 of a “Local LLM Inference Environment” series. Part 2: Why Intel GPU Can’t Run LLM Inference in WSL2 →

What is Shimmy

One line: a Rust drop-in replacement for Ollama — lighter, faster, zero dependencies.

| Dimension | Shimmy | Ollama |
|---|---|---|
| Language | Rust | Go |
| Dependencies | Single binary, none | Requires llama.cpp dynamic libs |
| License | MIT | MIT |
| GPU support | CUDA / Vulkan / OpenCL / MLX | CUDA / Metal |
| Model formats | GGUF + SafeTensors | GGUF |
| API compatibility | OpenAI 100% | OpenAI + custom API |
| Port allocation | Auto (default 11435) | Fixed (11434) |
| Response caching | LRU + TTL, 20-40% speedup | None |

Project: https://github.com/Michael-A-Kuykendall/shimmy

Quick Start

# Linux x86_64
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (with MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# Windows x64 (with CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Start
./shimmy serve --gpu-backend auto

Zero config: Shimmy auto-discovers models in the Hugging Face cache and in Ollama's local model directories, and it can hot-swap models without a restart.

Once running, it exposes an OpenAI-compatible /v1/chat/completions endpoint — existing OpenAI tools work with zero code changes:

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'
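The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server listens on port 11435 as in the curl example and returns the usual OpenAI chat-completion response shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are made up for illustration and are not part of Shimmy.

```python
import json
import urllib.request

# Assumption: port 11435 as in the curl example; change it if Shimmy chose another port.
BASE_URL = "http://localhost:11435/v1"


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, base_url=BASE_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("your-model", "hello")` should return the reply text. Tools built on an OpenAI client library typically only need their base URL pointed at the same address.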

Architecture

  • Rust + Tokio: Memory-safe, async high performance
  • llama.cpp backend: Industry-standard GGUF inference
  • GPU backend priority: CUDA → Vulkan → OpenCL → MLX → CPU (fallback)
  • Dynamic port management: Zero conflicts, auto-assign
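To make the backend priority concrete, here is a small sketch of how an "auto" selection along that order could work. It is not Shimmy's actual code: the function name `pick_backend` is hypothetical, and falling back to CPU when an explicitly requested backend is missing is my assumption.

```python
# Priority order as listed above; CPU is the last resort.
PRIORITY = ["cuda", "vulkan", "opencl", "mlx", "cpu"]


def pick_backend(available, requested="auto"):
    """Return the backend to use, given the set of backends that initialized."""
    usable = set(available) | {"cpu"}  # CPU is always usable
    if requested != "auto":
        # Assumption: an unavailable explicit request falls back to CPU.
        return requested if requested in usable else "cpu"
    # "auto": take the first backend in priority order that is usable.
    return next(backend for backend in PRIORITY if backend in usable)
```

For example, `pick_backend({"vulkan", "opencl"})` picks Vulkan, and `pick_backend(set())` ends up on the CPU.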

Real-World Test: WSL2 + Intel GPU

Test Environment

| Item | Value |
|---|---|
| CPU | Intel Core Ultra 7 255H (12 cores) |
| RAM | 19 GB |
| GPU | Intel Arc 140T (Arrow Lake H) |
| Environment | WSL2 (Linux 6.6.87) |
| Vulkan | ✅ Library installed (libvulkan_intel.so) |
| Shimmy | v1.9.0 (built from source with the llama-vulkan feature) |

Test 1: Qwen3.5-0.8B GGUF — ❌ Failed

llama_model_load: error loading model: unknown model architecture: 'qwen35'

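Shimmy v1.9.0 bundles a llama.cpp version that predates Qwen3.5 support, so the loader does not recognize the 'qwen35' architecture string. Qwen3.5 uses a hybrid architecture that mixes recurrent-style layers with standard Transformer attention, and that requires a newer llama.cpp.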
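The architecture string in that error comes from the GGUF file's own metadata, under the key `general.architecture`, so it can be checked before handing a model to any server. The sketch below reads that key directly from the header. It assumes a little-endian GGUF version 2 or 3 file (version 1 used shorter length fields), and the function names are mine.

```python
import io
import struct

# Byte widths of GGUF scalar value types (type id -> size).
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    """GGUF string: uint64 length followed by UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _skip_value(f, vtype):
    """Advance past one metadata value without decoding it."""
    if vtype in _SCALAR_SIZES:
        f.seek(_SCALAR_SIZES[vtype], io.SEEK_CUR)
    elif vtype == _STRING:
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(length, io.SEEK_CUR)
    elif vtype == _ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f):
    """Return the general.architecture string from an open binary file, or None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    for _ in range(kv_count):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if key == "general.architecture" and vtype == _STRING:
            return _read_string(f)
        _skip_value(f, vtype)
    return None
```

Opening a model with `open(path, "rb")` and passing the handle to `read_architecture` gives the string that llama.cpp will look up. For the file in this test it would be `qwen35`, which the bundled llama.cpp did not know.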

Test 2: Qwen2.5 SafeTensors — ⚠️ Loads, no inference

SafeTensors models load successfully, but Shimmy reports “Full transformer inference coming soon!” — inference support is still in development.

GPU Backend Results

| Backend | Status | Notes |
|---|---|---|
| CUDA | ❌ No NVIDIA GPU | N/A |
| Vulkan | ⚠️ Initializes, but no real GPU | WSL2 only exposes the llvmpipe software renderer |
| OpenCL | ❌ Not compiled | Enable with --features llama-opencl |
| MLX | ❌ macOS only | N/A |

Key finding: llvmpipe is Mesa's CPU software renderer. It implements the Vulkan API entirely on the CPU using multiple threads, so --gpu-backend vulkan still runs every matrix operation on the CPU. That can be slower than plain CPU mode because of the added Vulkan overhead.

Device: llvmpipe (LLVM 20.1.2, 256 bits)
Type: CPU  ← Not a real GPU!
Vendor ID: 0x10005 (Mesa software renderer)
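The same check can be scripted before starting a server. The sketch below parses the text printed by `vulkaninfo --summary` (from the vulkan-tools package) and reports whether any device is a real GPU rather than a CPU implementation. The sample text is a trimmed illustration built from the values above, not a full capture, and the function names are hypothetical.

```python
import re

# Device types that indicate actual GPU hardware (or a virtualized GPU).
_GPU_TYPES = {
    "PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU",
    "PHYSICAL_DEVICE_TYPE_DISCRETE_GPU",
    "PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU",
}

# Illustrative, trimmed sample of the "Devices" section.
SAMPLE = """\
Devices:
========
GPU0:
    vendorID           = 0x10005
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 20.1.2, 256 bits)
"""


def parse_vulkan_devices(summary_text):
    """Return one dict of key/value properties per 'GPUn:' block."""
    devices, current = [], None
    for line in summary_text.splitlines():
        if re.match(r"\s*GPU\d+:", line):
            current = {}
            devices.append(current)
            continue
        prop = re.match(r"\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if prop and current is not None:
            current[prop.group(1)] = prop.group(2)
    return devices


def has_hardware_gpu(devices):
    """True if at least one device reports a GPU device type."""
    return any(d.get("deviceType") in _GPU_TYPES for d in devices)
```

On a live system the text could come from `subprocess.run(["vulkaninfo", "--summary"], capture_output=True, text=True).stdout`. If `has_hardware_gpu` returns False, as it does for the sample, a Vulkan backend will only add overhead on top of CPU inference.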

Root Cause

The failure isn't Shimmy's fault. It comes from limitations of the WSL2 + Intel GPU environment:

  1. The WSL2 DXG kernel module is incompatible with the installed Intel GPU driver: dmesg shows persistent ioctl failures.
  2. The Intel GPU driver is outdated: installed 32.0.101.6554, latest 32.0.101.8801 (2026-05-15).
  3. Windows is on the Insider Preview Canary channel: build 26200 is pre-release, with unstable compatibility.

Next Steps

| Priority | Action |
|---|---|
| 1 | Update Intel GPU driver to latest |
| 2 | wsl --update |
| 3 | Exit Insider Preview, return to stable |
| 4 | Try Level Zero backend to bypass OpenCL DXG issues |

Who Should Use Shimmy?

  • CUDA environment (NVIDIA GPU): Works out of the box, best performance
  • macOS Apple Silicon: MLX backend native support
  • Existing toolchains needing OpenAI API compatibility
  • ⚠️ WSL2 + Intel/AMD integrated GPU: Vulkan backend limited, wait for driver updates
  • Pure CPU inference where maximum performance matters: Ollama may be more mature

References