27 Jun 12:42

mudler

d11b202

v4.5.5 Latest

Latest

What's Changed

Other Changes

fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) by @localai-bot in #10547
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10544
fix(backends): whisper darwin run.sh loads whichever fallback lib exists (.so/.dylib) by @localai-bot in #10553

Full Changelog: v4.5.4...v4.5.5

Contributors

localai-bot

Assets 9

27 Jun 00:06

mudler

v4.5.4

14b29eb

v4.5.4

What's Changed

Other Changes

fix(backends): derive darwin RUN_BINARY from the exec line only by @localai-bot in #10541

Full Changelog: v4.5.3...v4.5.4

Contributors

localai-bot

Assets 9

26 Jun 23:45

mudler

v4.5.3

f0d0bff

v4.5.3

What's Changed

Other Changes

feat(macos): sign and notarize the DMG, app, and server binary by @localai-bot in #10510
fix(backends): set rpath on the piper darwin binary so it can load its bundled libs by @localai-bot in #10525
fix(backends): darwin packaging for silero-vad (last Linux-only Go backend) by @localai-bot in #10528
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10526
fix(nodes): show a node's existing labels on the detail view by @localai-bot in #10529
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10531
chore: ⬆️ Update mudler/parakeet.cpp to f469a57270a1cc4554acb15febf60e56619673b9 by @localai-bot in #10530
fix(gpu-libs): bundle transitive deps of GPU runtime libs (#10537) by @localai-bot in #10539
fix(distributed): broadcast admin model-config changes across replicas by @localai-bot in #10540
fix(llama-cpp): stop reinterpreting plain-string message content as JSON (#10524) by @localai-bot in #10538

Full Changelog: v4.5.2...v4.5.3

Contributors

localai-bot

Assets 9

26 Jun 09:20

mudler

v4.5.2

6afe127

v4.5.2

What's Changed

Other Changes

fix(backends): make the opus backend build and package on macOS/Darwin by @localai-bot in #10523

Full Changelog: v4.5.1...v4.5.2

Contributors

localai-bot

Assets 9

26 Jun 08:39

mudler

v4.5.1

f58dcef

v4.5.1

What's Changed

Other Changes

chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10472
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 7ccf1d209588962b96eacca325b37e9b3e8faf5e by @localai-bot in #10456
chore: ⬆️ Update CrispStrobe/CrispASR to 96b2a6ee31d30389fed8a7ef1a54239b75231ddc by @localai-bot in #10465
chore: ⬆️ Update ggml-org/llama.cpp to be4a6a63eb2b848e19c277bdcf2bd399e8af76d9 by @localai-bot in #10467
chore: ⬆️ Update ggml-org/whisper.cpp to 43d78af5be58f41d6ffbc227d608f104577741ea by @localai-bot in #10466
chore: ⬆️ Update mudler/parakeet.cpp to 89f5e2977b4d8bccd45e7bcc6f2ef7c4ed49e89a by @localai-bot in #10468
fix(agents): URL-decode collection/agent name path params (#10443) by @localai-bot in #10471
fix(distributed): track in-flight for SoundDetection requests by @localai-bot in #10475
refactor(distributed): make in-flight tracking coverage a compile-time contract by @localai-bot in #10476
fix(pii): load default detectors at startup + add LOCALAI_PII_DEFAULT_DETECTORS by @richiejp in #10474
i18n(id): update and complete Indonesian translations by @dedyf5 in #10480
fix(realtime): resolve model aliases for pipeline sub-models by @localai-bot in #10484
fix(backends): darwin/metal support for supertonic by @localai-bot in #10488
feat(backends): add darwin/metal build for liquid-audio by @localai-bot in #10486
chore(model-gallery): ⬆️ update checksum by @localai-bot in #10495
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10491
feat(ui): usage & UX enhancements (last-used model, polling, starter models, usage cost, a11y) by @localai-bot in #10496
fix(config): per-device VRAM headroom for Blackwell defaults (#10485) by @localai-bot in #10494
feat(ui): data-driven hardware model recommendations + gallery surfacing by @localai-bot in #10500
chore: ⬆️ Update ikawrakow/ik_llama.cpp to d5507e33ae7ee2b7b41475f08044d3bde3b839ee by @localai-bot in #10498
chore: ⬆️ Update ServeurpersoCom/omnivoice.cpp to 0f37401bebe9b20c0160a888e592108fc1d17607 by @localai-bot in #10492
fix(backends): darwin/metal support across purego Go backends by @localai-bot in #10481
feat(backends): add darwin/metal (MPS) build for trl by @localai-bot in #10487
feat(llama-cpp): cpu_moe/n_cpu_moe options + generic upstream-flag passthrough by @localai-bot in #10490
chore: ⬆️ Update ServeurpersoCom/qwentts.cpp to 9dbe7ea26a01b30fccb117ae5e86807c1dc23d42 by @localai-bot in #10499
fix: correct scheme/host on self-referential URLs behind an HTTPS reverse proxy (#10482) by @localai-bot in #10504
chore: ⬆️ Update ggml-org/llama.cpp to 8be759e6f70d629638a7eb70db3824cbdcea370b by @localai-bot in #10501
chore: ⬆️ Update leejet/stable-diffusion.cpp to 8caa3f908ae6d4a4bef531e73b9a969f266a3d1f by @localai-bot in #10493
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10505
feat(vllm): macOS/Metal support via vllm-metal (MLX) by @localai-bot in #10489
feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) by @localai-bot in #10497
fix(config): gate parallel-slot default on per-device VRAM too (#10485) by @localai-bot in #10507
fix(auth): make advisory locks dialect-aware and harden SQLite DSN by @localai-bot in #10509
feat(backends): darwin/Metal builds for vision C++/ggml backends (depth-anything, locate-anything, rfdetr-cpp, sam3-cpp) by @localai-bot in #10511
feat(backends): darwin build for the localvqe backend (acoustic echo cancellation) by @localai-bot in #10512
docs(backends): make OS coverage explicit + require darwin support for new backends by @localai-bot in #10516
chore: ⬆️ Update ikawrakow/ik_llama.cpp to b84902d2ad27c34f989f23947200c4b91b1568fd by @localai-bot in #10515
chore: bump localrecall for postgres per-connection timeouts by @localai-bot in #10517
chore: pin localrecall to tagged v0.6.3 by @localai-bot in #10518
fix(backends): quote $CURDIR in run.sh (fixes backends in paths with spaces) by @localai-bot in #10519
chore: ⬆️ Update CrispStrobe/CrispASR to 8f1218141b792b8868861c1af17ba1e361b05dc0 by @localai-bot in #10502
chore: ⬆️ Update ggml-org/llama.cpp to 9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 by @localai-bot in #10514
feat(backends): darwin/Metal build for the privacy-filter backend by @localai-bot in #10513
feat(backends): make PreferDevelopmentBackends install the development image as primary by @localai-bot in #10520
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10521
fix(backends): ship the package/ dir for darwin go backend images by @localai-bot in #10522

Full Changelog: v4.5.0...v4.5.1

Contributors

richiejp, dedyf5, and localai-bot

Assets 9

23 Jun 21:31

localai-bot

v4.5.0

deb430f

v4.5.0

🎉 LocalAI 4.5.0 Release! 🚀

LocalAI 4.5.0 is out!

This release widens what LocalAI can perceive, sharpens the realtime voice API, and makes multi-user serving fast with zero configuration. Four new backends land, the React UI redesign ships in full, and distributed mode gets a robustness pass.

Highlights:

👁️ See depth - new depth-anything backend (Depth Anything 3): monocular metric depth + camera pose, with a typed Depth RPC and POST /v1/depth.
🔊 Hear events - new ced backend tags 527 AudioSet sound classes (baby cry, glass breaking, alarms) over REST and a VAD-decoupled realtime stream.
🗣️ Speak on-device - new supertonic ONNX TTS backend: multilingual, espeak-free, fast cold start.
🛡️ Filter PII with NER - new privacy-filter.cpp engine adds named-entity token classification alongside a regex secret detector.
🎙️ Smarter realtime - sessions become speaker-aware (identity surfaced to the client and the LLM) and stay cheap on long calls through summarize-then-drop compaction.
⚙️ Concurrent by default - prefix caching, Blackwell-tuned batch sizes, and VRAM-scaled concurrency turn continuous batching on without any config.
🖼️ A redesigned UI - the UX overhaul lands end to end, while we keep improving user experience release after release.

Plus model aliases, word-level ASR timestamps, self-contained Vulkan backends, ds4 SSD streaming for 128 GB-class models, hardened distributed staging, and a broad set of fixes.

The redesigned Home: console with a built-in assistant and chat.

📌 TL;DR

Area	Summary
👁️ Depth perception	New `depth-anything` C++/ggml backend (Depth Anything 3) - metric depth + camera pose, typed `Depth` RPC + `POST /v1/depth`, 8 GGUFs. Plus Depth Anything V2 gallery models.
🔊 Sound-event tagging	New `ced` backend (CED AudioSet tagger, 527 classes) - `POST /v1/audio/classification` + VAD-decoupled realtime sound detection.
🗣️ On-device TTS	New `supertonic` ONNX backend - multilingual, no espeak/G2P, 10 voices, fast cold start (CPU).
🛡️ PII gets a NER tier	New `privacy-filter.cpp` backend - encoder/NER token classification scanning whole conversations, alongside a restricted-regex secret detector; NER-centric PII editor in the UI.
🎙️ Smarter realtime	Speaker-aware conversations (identity → client and LLM), conversation compaction (summarize-then-drop), and OpenAI `item.delete` / `item.truncate` / `input_audio_buffer.clear`.
⚙️ Multi-user serving by default	Prefix caching on by default, Blackwell batch (2048), VRAM-scaled `n_parallel` (continuous batching on out of the box) - concurrent throughput with no KV blow-up.
🔀 Model aliases	Redirect/rename a model name to another configured model, swappable live, no client reconfig.
⏱️ Word-level ASR timestamps	NeMo + CrispASR word timestamps, plumbed through the gRPC transcription path.
🖼️ The UI, redesigned	A calmer, sharper interface lands end to end: new design language, shell/nav, ops/admin data-viz, sortable/mobile tables, unsaved-changes guards, restructured Cluster Nodes.
🛰️ Distributed staging hardened	Cold-load staging detached from the request context (large models actually finish), staging progress broadcast across replicas, resumable downloader.

🚀 New Features & Major Enhancements

👁️ Depth Perception: `depth-anything`

A new native Go gRPC backend (#10352) dlopens depth-anything.cpp (a ggml port of Depth Anything 3) via purego - no Python at inference - for monocular metric depth + camera pose estimation on CPU. Depth has no native OpenAI endpoint, so the model is exposed three ways:

A typed Depth gRPC RPC + POST /v1/depth that returns the full output surface (depth map, stats, camera extrinsics 3×4 / intrinsics 3×3).
GenerateImage(src, dst) writes a min-max-normalized grayscale depth PNG.
Predict returns the depth + pose JSON blob.

Eight Depth Anything 3 GGUFs ship at mudler/depth-anything.cpp-gguf (base/small/large/giant + a monocular mono-large, q4_k/q8_0/f16/f32), with per-CPU-variant self-contained .so builds and the full hardware matrix (cpu, cuda12/13, intel-sycl, vulkan, l4t-arm64). This cycle also adds Depth Anything V2 gallery models (#10413, native version bump) and metric-large + nested metric entries (#10363).

🔗 PRs: #10352, #10413, #10363.

🔊 Sound-Event Classification: `ced`

A new backend (#10425) backed by ced.cpp - a C++/ggml port of CED (Xiaomi), a 527-class AudioSet tagger (baby cry, footsteps, glass breaking, alarms, dog bark...) with full PyTorch parity (f32 e2e 1.7e-7) and Apache-2.0 weights. CPU perf: f16 is ~1.55× faster than the PyTorch reference (~100× realtime), q8_0 uses 6.5× less memory.

REST: POST /v1/audio/classification (fully capability-registered: swagger, /api/instructions, auth feature, React capabilities.js, docs).
Realtime: opt-in pipeline.sound_detection emits conversation.item.sound_detection events, decoupled from VAD (a sound-only session runs with turn_detection: none, activating on sounds not speech), with client-driven or server-side windowing.
Gallery: 8 entries (ced-{base,tiny,mini,small}-{f16,q8}, 6 MB → 86 MB) at mudler/ced-gguf.

🔗 PR: #10425.

🗣️ On-Device TTS: `supertonic`

A new native Go gRPC TTS backend (#10342) runs Supertone's supertonic-3 flow-matching model (4 ONNX graphs) via ONNX Runtime - no Python, no espeak-ng / G2P (text preprocessing is NFKD + a Unicode-codepoint→token-id lookup). Upstream's MIT Go pipeline is vendored at a pinned commit and driven from a LocalAI gRPC server, mirroring sherpa-onnx's ONNX-runtime bundling - small image, fast cold start. Ships a supertonic-3 gallery entry (4 ONNX + 10 voice styles F1-F5/M1-M5, SHA256-pinned), with voice / language request mapping and steps/speed/silence knobs. CPU-only in this release; CUDA wiring is scaffolded for a follow-up.

🔗 PR: #10342.

🛡️ PII Filtering Gets a NER Tier: `privacy-filter.cpp`

PII filtering moves off the patched llama.cpp TokenClassify path onto a new standalone GGML backend, privacy-filter.cpp (#10360), serving OpenAI Privacy Filter NER token-classification models (CPU/CUDA/Vulkan). The filter is reworked to be NER-centric - an encoder/NER detection tier scans whole conversations as a single document - alongside a bounded restricted-regex secret-matching detector tier. Detections are labelled by source (ner vs pattern) with backend trace / confidence / debug observability, analyze/redact exposed as a synchronous API, and request filtering extended to completions, embeddings, edits and Ollama. The React UI gains a NER-centric PII editor, detector-models table, and middleware default-policy controls; the gallery gets a privacy-filter-multilingual token-classify model + an /import-model auto-detect importer. A post-merge pass (#10401) added live NER e2e coverage and review fixes.

🔗 PRs: #10360, #10401.

🎙️ Realtime Voice: Speaker-Aware and Self-Compacting

Speaker-aware conversations (#10424). The realtime voice-recognition gate now surfaces the recognized speaker to the client (a new conversation.item.speaker event - a non-breaking LocalAI extension) and feeds identity to the LLM for personalized replies (per-message OpenAI name field and/or a The current speaker is <Name>. system note). New pipeline.voice_recognition keys decouple surfacing from authorization: enforce: false resolves and surfaces a speaker without ever dropping a turn, while the gate still fails closed when enforcing. Multi-speaker histories stay correctly attributed (each user item carries its own speaker).

Conversation compaction - summarize-then-drop (#10446). Long realtime sessions used to either feed the whole growing buffer to the LLM (expensive on CPU as it grows) or silently forget old turns. Now the server can fold aged-out turns into a rolling summary instead, via an async, post-turn snapshot → summarize → commit compactor that never holds the conversation lock across the summarizer call and never evicts items without a summary replacing them. Plus the OpenAI-parity history events that were missing: conversation.item.delete, conversation.item.truncate, input_audio_buffer.clear.

pipeline:
  max_history_items: 6          # live window - recent turns kept verbatim
  compaction:
    enabled: true
    trigger_items: 12           # high-water mark; summarize overflow back down
    summary_model: ""           # optional small/cheap CPU model; default = pipeline LLM
    max_summary_tokens: 512

Also: configurable pipeline.max_history_items (#10331) and a WebRTC data-channel max-message-size raise + keep-alive fix (#10407).

🔗 PRs: #10424, #10446, #10331, #10407.

⚙️ Multi-User Serving, On by Default

Two related, config-only (no kernel) changes make concurrent serving fast without any tuning. Both only fill values the user left unset - explicit config always wins.

**Hardware-tune...

Contributors

vjsai, SuperMarioYL, and 2 other contributors

Assets 9

2 Join discussion

13 Jun 23:27

mudler

v4.4.3

4d3d54d

v4.4.3

What's Changed

Other Changes

chore: ⬆️ Update CrispStrobe/CrispASR to d745bda4386ae0f9d1d2f23fff8ec95d76428221 by @localai-bot in #10260
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10259
chore: ⬆️ Update antirez/ds4 to d881f2a05e8ff6bec001315a36b794b4aa310173 by @localai-bot in #10262
chore: ⬆️ Update mudler/parakeet.cpp to 9db92be63179a27201d3b88d5d40c545b2ac48ae by @localai-bot in #10263
feat(react-ui): add Indonesian language support by @dedyf5 in #10266
chore: ⬆️ Update ggml-org/llama.cpp to 4c6595503fe45d5a39f88d194e270f64c7424677 by @localai-bot in #10261
feat(backend): locate-anything-cpp (open-vocabulary object detection via ggml) by @localai-bot in #10264
fix(router): production-ready request router + auto-size batch for embedding/rerank by @richiejp in #10104
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10270
feat(parakeet-cpp): enable GGML_CUDA_GRAPHS in the cublas build by @localai-bot in #10273
fix(darwin): publish sherpa-onnx and speaker-recognition images for darwin/arm64 by @localai-bot in #10275
fix(crispasr): write piper TTS WAV at the model's native sample rate by @localai-bot in #10277
feat(crispasr): bundle espeak-ng and add piper TTS voices to the gallery by @localai-bot in #10283
chore: ⬆️ Update mudler/parakeet.cpp to b8012f11e5269126eddb7f4fd02f891a2ccc29b0 by @localai-bot in #10281
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10279
fix(mlx): route vision-language models to the mlx-vlm backend by @localai-bot in #10274
fix(darwin): fix vibevoice-cpp build linkage + fail-safe go backend packaging by @localai-bot in #10276
fix(agents): emit chat event timestamps in milliseconds (#9867) by @aniruddh909 in #10243
fix(realtime): keep transcription model on a language-only session.update by @localai-bot in #10295
chore: ⬆️ Update mudler/locate-anything.cpp to 92c1682da792c1e8a5dec91acc2be4b02c742ded by @localai-bot in #10282
fix(config): backend-gate the top_k=40 sampler default (#6632) by @localai-bot in #10285
feat(gallery): add 60 piper TTS voices across 42 languages (Phase 2) by @localai-bot in #10296
fix(deps): bump cogito to fix MCP image-result panic (#10101) by @localai-bot in #10294
fix(neutts): pin torchaudio to match torch (fixes undefined symbol) (#9798) by @localai-bot in #10292
fix(gallery): make opus a meta backend for platform auto-selection (#9813) by @localai-bot in #10291
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10298
fix(gallery): correct meta-backend definitions for platform auto-selection by @localai-bot in #10299
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10302
ci(darwin): build the ds4 backend for darwin/arm64 (metal) by @localai-bot in #10303
fix(react-ui): stop Talk pipeline overflow and center collapsed-rail avatar by @localai-bot in #10305
chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10304
fix(react-ui): make agent chat timestamps format-agnostic (#9867) by @localai-bot in #10290
model: fix case-insensitive suffix matching and skip .bak files in ListFilesInModelPath by @pos-ei-don in #10306
fix(xsysinfo): container-aware total RAM detection (cgroup/lxcfs) (#8059) by @localai-bot in #10288
feat(distributed): declarative per-model scheduling via env/args by @localai-bot in #10308
feat(sherpa-onnx): add Kokoro TTS + multilingual Piper voices by @localai-bot in #10309
feat(omnivoice-cpp): add OmniVoice TTS backend (file + streaming, voice cloning + voice design) by @localai-bot in #10310
feat(i18n): add Korean (ko) translation by @moduvoice in #10312
feat(qwen3-tts-cpp): migrate to ServeurpersoCom/qwentts.cpp (streaming, speakers, voice design) by @localai-bot in #10316
feat(realtime): gate realtime pipeline voice models behind voice recognition by @localai-bot in #10319
chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.23.0 by @localai-bot in #10314
test(e2e): live-server voice-recognition gate test by @localai-bot in #10324

New Contributors

@aniruddh909 made their first contribution in #10243
@moduvoice made their first contribution in #10312

Full Changelog: v4.4.2...v4.4.3

Contributors

richiejp, pos-ei-don, and 4 other contributors

Assets 9

11 Jun 22:22

mudler

v4.4.2

58cdc05

v4.4.2

What's Changed

Other Changes

chore: ⬆️ Update ggml-org/llama.cpp to ac4cddeb0dbd778f650bf568f6f08344a06abe3a by @localai-bot in #10239
chore: ⬆️ Update CrispStrobe/CrispASR to 4b27392ffd0991a857594652cbb8b57e585bcd7b by @localai-bot in #10241
fix(vllm): parse tool_call function arguments before applying the chat template by @pos-ei-don in #10256
fix(cuda): install cuda-nvrtc-dev alongside the other CUDA dev packages by @pos-ei-don in #10257

Full Changelog: v4.4.1...v4.4.2

Contributors

pos-ei-don and localai-bot

Assets 9

11 Jun 16:33

mudler

v4.4.1

f618636

v4.4.1

What's Changed

Other Changes

docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10245
chore: ⬆️ Update antirez/ds4 to 8384adf0f9fa0f3bb342dd925372de778b95b263 by @localai-bot in #10242
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) by @pos-ei-don in #10252
feat(realtime): stream the LLM / TTS / transcription pipeline stages by @localai-bot in #10176
docs: fix broken relref to realtime page by @localai-bot in #10255

New Contributors

@pos-ei-don made their first contribution in #10252

Full Changelog: v4.4.0...v4.4.1

Contributors

pos-ei-don and localai-bot

Assets 9

10 Jun 20:10

mudler

v4.4.0

fba8c9c

v4.4.0

🎉 LocalAI 4.4.0 Release! 🚀

LocalAI 4.4.0 is out!

This is a big, multimodal-and-distributed release. Two brand-new audio backends land - parakeet.cpp (NVIDIA NeMo Parakeet ASR) and CrispASR (a multi-architecture ASR and TTS engine) - alongside native object detection + segmentation (rfdetr-cpp), video understanding in llama-cpp, and LTX-2 video generation in stablediffusion-ggml. Distributed mode grows up: prefix-cache-aware routing is on by default, and file transfers become resumable. There's a new intelligent middleware layer for request routing, PII filtering and cloud-model proxying, a security hardening pass that closes a credential-leak class across every outbound HTTP client, an interactive local-ai chat CLI, RAG source citations for agents, and a long run of reasoning / tool-call streaming fixes.

📌 TL;DR

Area	Summary
🎙️ Two new ASR backends	`parakeet-cpp` (NeMo FastConformer TDT/CTC/RNNT, streaming, word/segment timestamps) and `crispasr` (many ASR architectures + TTS in one binary).
🧭 Intelligent Middleware	Capability-based model routing, PII detection/redaction, cloud-model proxies + a MITM proxy for subscription-auth Claude Code / Codex.
🛰️ Distributed v4	Prefix-cache-aware routing (on by default), NATS JWT auth + TLS/mTLS, worker registration-token enforcement, resumable HTTP file transfers, boot-time model prefetch, ds4 layer-split inference.
🎥 Video, both ways	Video input (understanding) in `llama-cpp` via mtmd, and video generation via LTX-2 in `stablediffusion-ggml`.
👁️ Detection + Segmentation	New native `rfdetr-cpp` backend (RF-DETR), 32 prebuilt GGUFs, bbox + per-detection PNG masks.
🔐 Outbound HTTP hardening	`pkg/httpclient` refuses cross-host credential-leaking redirects across every outbound client (GHSA-3mj3-57v2-4636).
🗣️ TTS per-request control	`instructions` + a generic `params` map plumbed end to end (Qwen3-TTS VoiceDesign / CustomVoice, Chatterbox).
💻 `local-ai chat`	Interactive terminal chat against a running server, with `/models`, `/model`, `/clear`.
📚 RAG citations	Agent answers now append a clickable `Sources:` block from the Knowledge Base.
🧠 Models	Gemma 4 QAT family + QAT-matched MTP speculative-decoding bundles, Ideogram4, LTX-2.3 22B GGUFs.

🚀 New Features & Major Enhancements

🎙️ Audio Gets Serious: Two New ASR Backends

This release doubles down on speech-to-text with two independent, cgo-less Go backends (purego, CGO_ENABLED=0), each shipping a full CI matrix, gallery importer and docs.

parakeet-cpp - NVIDIA NeMo Parakeet (#10084). Wraps parakeet.cpp, a C++/ggml port of NeMo Parakeet (FastConformer TDT/CTC/RNNT/hybrid) that matches the upstream PyTorch models on CPU. Text transcription, OpenAI-compatible word timestamps, and cache-aware streaming (16 kHz PCM chunks, <EOU>/<EOB> utterance boundaries). GGUFs for all 10 Parakeet models × 5 quants ship in mudler/parakeet-cpp-gguf. Follow-ups in this cycle made it production-grade:

Dynamic batching (#10112) - concurrent transcription requests are batched for throughput.
Real, NeMo-faithful segment timestamps (#10207) - words are grouped into segments exactly like NeMo's get_segment_offsets (sentence-punctuation boundaries by default, opt-in segment_gap_threshold silence splitting in encoder frames). Streaming FinalResult segments now carry start/end when the library exposes the ABI v4 JSON entry points.
nemotron-3.5-asr multilingual streaming (#10199) + per-request language selection.

crispasr - many architectures + TTS in one backend (#10099). Wraps CrispASR (a whisper.cpp/ggml fork, MIT) through its session C-ABI. One backend serves ASR or TTS depending on the loaded model, with the architecture auto-detected from the GGUF (or forced via backend:). The gallery gains 36 -crispasr entries (32 ASR + 4 TTS):

ASR (e2e-verified across Whisper / Parakeet / Moonshine): parakeet, canary, cohere, qwen3, voxtral, granite, fastconformer-ctc, wav2vec2, hubert, data2vec, glm-asr, kyutai-stt, firered-asr, moonshine, mimo-asr, and more.
TTS (all four e2e-verified to valid 24 kHz mono WAV): vibevoice, chatterbox, qwen3-tts CustomVoice, orpheus - via backend: / codec: / speaker: / voice: model options.

🧭 Intelligent Middleware: Routing, PII Filtering & Cloud Proxies

A new middleware layer (#9802) analyzes, routes, filters and transforms chat requests before they hit a model.

Capability-based routing. Requests are classified (e.g. via an ArchRouter-style model) and scored across the capabilities they may require, then routed to the smallest model that satisfies them - easy requests go to small specialized models, hard or uncertain ones to larger general-purpose models. Classified embeddings are reused via cosine similarity so similar requests skip re-classification.
PII filtering. Private information is detected per-pattern and can be redacted, rerouted, or blocked, with a streaming PII filter that preserves a buffered-emit invariant on /v1/chat/completions, Anthropic /v1/messages, and /v1/completions. A per-model PII pattern editor lives in the model config UI.
Cloud model proxies + MITM. Cloud models and a MITM proxy can take part in routing/filtering - send easy requests to local models and hard ones to the cloud, and use Claude Code / Codex subscriptions (OAuth) through the PII filter via the MITM proxy (subject to provider ToS). Emits proxy_connect + proxy_traffic audit events and restores its listener from runtime_settings.json on restart.

Usage stats are recorded end to end and surfaced in REST, the UI, and MCP. Outbound clients used by this path were also the trigger for the security pass below.

🛰️ Distributed Mode v4

Distributed mode keeps maturing across routing, security and resilience.

Prefix-cache-aware routing, on by default (#10071). Routing now biases toward the replica that already holds the relevant KV/prefix cache, as a load-guarded hint that never routes worse than today's round-robin. A generic prefix tree (pkg/radixtree) maps cumulative prompt-prefix hashes to nodes; core/services/nodes/prefixcache turns the rendered prompt into a deterministic xxhash chain and makes a filter-then-score decision (narrow to load-eligible replicas, then prefer the longest-prefix match), feeding a preferredNodeID into the existing atomic SELECT ... FOR UPDATE pick. Observations sync across frontends over NATS. Round-robin is the floor; disable with --distributed-prefix-cache=false.

NATS JWT auth + TLS/mTLS (#10159). Previously anyone with access to the NATS port could publish backend-install messages or agent jobs (an SSRF / accidental-exposure risk). This adds JWT authentication and TLS/mTLS options, with workers acquiring and auto-refreshing their NATS credentials. Complemented by worker file-transfer registration-token enforcement (#10183).

Resumable file transfers (#10109). Large model GGUFs over flaky/throttled links no longer restart from byte 0. The worker's PUT /v1/files/<key> honors Content-Range (308/416 resume semantics, X-Content-SHA256 binding, final-hash verification) and the master-side stager HEAD-probes for the last accepted offset and resumes, switching to an outer time budget (LOCALAI_FILE_TRANSFER_BUDGET, default 1h) with exponential backoff.

ds4 layer-split distributed inference (#10098). Manual layer-split inference for the ds4 backend: a coordinator owns layers 0:K and listens; workers dial in and own higher ranges, each loading only its slice of the GGUF (a new dependency-free ds4-worker binary, driven via local-ai worker ds4-distributed). Fully back-compatible when ds4_role is absent.

Operational glue. Boot-time gallery prefetch via LOCALAI_PREFETCH_MODELS (#10108); a gated X-LocalAI-Node response header for attribution (#9976); plus fixes: self-heal stale "model not loaded" routing (#10181), stage directory-based models to remote nodes (#10175), in-flight tracking for non-LLM methods - VAD, diarize, voice (#10238), reconciler survives frontend restarts (#9981), cross-replica OpCache sync (#9983), and the reinstall/upgrade UI no longer sticks on "reinstalling" (#10214).

🎥 Video, Both Directions

Video input / understanding in llama-cpp (#10216). Video-capable multimodal models (e.g. SmolVLM2-Video) can now be sent a video in a chat request, mirroring the existing image and audio paths. Tracks the upstream mtmd video landing (ggml-org/llama.cpp#24269); grpc-server.cpp forwards request->videos() into the mtmd files vector on both the template and non-template paths, and the React chat UI accepts video/*, renders an inline <video controls> player, and emits video_url content parts. allow_video is auto-gated by whether the loaded mmproj supports it. ffmpeg/ffprobe (already in the runtime image) extract frames.

Video generation via LTX-2 (#9980). stablediffusion-ggml wires audio_vae_path and embeddings_connectors_path through to the upstream LTX-2 fields, with a new gallery/ltx-ggml.yaml template (T2V / I2V / FLF2V recipes) and six LTX-2.3 22B GGUF gallery entries (dev + distilled, UD-Q4_K_M / Q4_K_M / Q8_0), each bundling the text encoder + video VAE + audio VAE + embeddings connectors. Follow-up fixes wi...

Contributors

richiejp, mudler, and 9 other contributors

Assets 9

Uh oh!

Releases: mudler/LocalAI

v4.5.5

What's Changed

Other Changes

Contributors

Uh oh!

v4.5.4

What's Changed

Other Changes

Contributors

Uh oh!

v4.5.3

What's Changed

Other Changes

Contributors

Uh oh!

v4.5.2

What's Changed

Other Changes

Contributors

Uh oh!

v4.5.1

What's Changed

Other Changes

Contributors

Uh oh!

v4.5.0

🎉 LocalAI 4.5.0 Release! 🚀

📌 TL;DR

🚀 New Features & Major Enhancements

👁️ Depth Perception: depth-anything

🔊 Sound-Event Classification: ced

🗣️ On-Device TTS: supertonic

🛡️ PII Filtering Gets a NER Tier: privacy-filter.cpp

🎙️ Realtime Voice: Speaker-Aware and Self-Compacting

⚙️ Multi-User Serving, On by Default

Contributors

Uh oh!

v4.4.3

What's Changed

Other Changes

New Contributors

Contributors

Uh oh!

v4.4.2

What's Changed

Other Changes

Contributors

Uh oh!

v4.4.1

What's Changed

Other Changes

New Contributors

Contributors

Uh oh!

v4.4.0

🎉 LocalAI 4.4.0 Release! 🚀

📌 TL;DR

🚀 New Features & Major Enhancements

🎙️ Audio Gets Serious: Two New ASR Backends

🧭 Intelligent Middleware: Routing, PII Filtering & Cloud Proxies

🛰️ Distributed Mode v4

🎥 Video, Both Directions

Contributors

Uh oh!

👁️ Depth Perception: `depth-anything`

🔊 Sound-Event Classification: `ced`

🗣️ On-Device TTS: `supertonic`

🛡️ PII Filtering Gets a NER Tier: `privacy-filter.cpp`