
Augent

The audio & video layer for agents. 22 local MCP tools. No cloud, no API keys.
augentdevs
Uncategorized · clawhub · v1.5.2 · 1 version · 100000 · Key: not required
Stars: 0 · Downloads: 455 · Installs: 0
#agents #audio #latest #local #mcp #media #transcription #video #whisper

Overview

Augent — Audio & Video Intelligence for AI Agents

Augent is an MCP server that gives your agent 22 tools for audio and video intelligence. Download from 1000+ sites via yt-dlp and aria2c, transcribe in 99 languages via faster-whisper, search by keyword or meaning via sentence-transformers, take notes, identify speakers via pyannote-audio, detect chapters, separate audio via Demucs v4, export clips, extract visual frames, record X/Twitter Spaces (requires user-configured auth token in ~/.augent/auth.json), and generate speech via Kokoro TTS. All processing runs locally. Downloads are saved to ~/Downloads/, notes and clips to ~/Desktop/, transcription memory to ~/.augent/memory/.

Config

{
  "mcpServers": {
    "augent": {
      "command": "augent-mcp"
    }
  }
}

If augent-mcp is not in PATH, use python3 -m augent.mcp as the command instead.
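
The PATH fallback described above can be automated when generating the config. The following is a minimal sketch, not part of Augent: the function names are hypothetical, and splitting `python3 -m augent.mcp` into a `command` plus an `args` list follows the common MCP client config convention, which is assumed here rather than confirmed by the Augent docs.

```python
import json
import shutil


def augent_server_entry(which=shutil.which) -> dict:
    """Return an MCP server entry, preferring the augent-mcp executable."""
    if which("augent-mcp"):
        return {"command": "augent-mcp"}
    # Fallback named in the text: python3 -m augent.mcp.
    # The "args" key is an assumed client convention, not taken from Augent.
    return {"command": "python3", "args": ["-m", "augent.mcp"]}


def build_config(which=shutil.which) -> dict:
    """Wrap the server entry in the mcpServers structure shown above."""
    return {"mcpServers": {"augent": augent_server_entry(which)}}


if __name__ == "__main__":
    print(json.dumps(build_config(), indent=2))
```

Running the script prints a config block that can be pasted into the client's MCP settings file.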

Install

Install via the ClawHub install button above, or use uv tool install augent for the base package or uv tool install "augent[all]" for all features. FFmpeg is required for audio processing.

Tools

Augent exposes 22 MCP tools:

Core

| Tool | Description |
| --- | --- |
| download_audio | Download audio from video URLs at maximum speed. Supports YouTube, Vimeo, TikTok, Twitter/X, SoundCloud, and 1000+ sites. Uses aria2c multi-connection + concurrent fragments. |
| transcribe_audio | Full transcription of any audio file with per-segment timestamps. Returns text, language, duration, and segments. Cached by file hash. |
| search_audio | Search audio for keywords. Returns timestamped matches with context snippets. Supports clip export. |
| deep_search | Semantic search: find moments by meaning, not just keywords. Uses sentence-transformers embeddings. |
| search_memory | Search across ALL stored transcriptions in one query. Keyword or semantic mode. |
| take_notes | All-in-one: download audio from URL, transcribe, and save formatted notes. Supports 5 styles: tldr, notes, highlight, eye-candy, quiz. |
| clip_export | Export a video clip from any URL for a specific time range. Downloads only the requested segment. |
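
The description of transcribe_audio says results are cached by file hash. As an illustration of that idea only (Augent's actual hash function and cache layout are not documented here), a content-addressed cache might look like the sketch below. SHA-256, the in-memory dict, and the `TranscriptCache` class are all assumptions.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large media files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class TranscriptCache:
    """Hypothetical in-memory cache keyed by content hash, not by file name."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def get_or_transcribe(self, path: Path, transcribe) -> dict:
        key = file_hash(path)
        if key not in self._store:
            # Only run the (expensive) transcription for unseen content.
            self._store[key] = transcribe(path)
        return self._store[key]
```

Keying on content rather than path means a renamed or copied file reuses the existing transcription.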

Analysis

| Tool | Description |
| --- | --- |
| chapters | Auto-detect topic chapters with timestamps using embedding similarity. |
| search_proximity | Find where two keywords appear near each other (e.g., "startup" within 30 words of "funding"). |
| identify_speakers | Speaker diarization: identify who speaks when. No API keys required. |
| separate_audio | Isolate vocals from music/noise using Meta's Demucs v4. Feed clean vocals into transcription. |
| batch_search | Search multiple audio files in parallel. Ideal for podcast libraries or interview collections. |
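
To make the search_proximity idea concrete ("startup" within 30 words of "funding"), here is a small sketch over plain transcript text. It is an illustration of the technique, not Augent's implementation; the function name, tokenization, and return shape are assumptions.

```python
import re


def proximity_matches(text: str, first: str, second: str, max_distance: int = 30):
    """Return (first_index, second_index) word positions that lie at most
    max_distance words apart. Matching is case-insensitive and exact-word."""
    words = [w.lower() for w in re.findall(r"[\w']+", text)]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return [
        (i, j)
        for i in first_positions
        for j in second_positions
        if abs(i - j) <= max_distance
    ]
```

For example, `proximity_matches("our startup raised funding last year", "startup", "funding")` yields one pair of word positions, while a `max_distance` of 1 yields none.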

Utilities

| Tool | Description |
| --- | --- |
| text_to_speech | Convert text to natural speech using Kokoro TTS. 54 voices, 9 languages. Runs in background. |
| list_files | List media files in a directory with size info. |
| list_memories | Browse all stored transcriptions by title, duration, and date. |
| memory_stats | View memory statistics (file count, total duration). |
| clear_memory | Clear the transcription memory to free disk space. |
| tag | Add, remove, or list tags on transcriptions. Broad topic categories for organizing memories. |
| highlights | Export the best moments from a transcription. Auto mode picks top moments; focused mode finds moments matching a topic. |
| visual | Extract visual context from video at moments that matter. Query, auto, manual, and assist modes. Frames saved to Obsidian vault. |
| rebuild_graph | Rebuild Obsidian graph view data for all transcriptions. Migrates files, computes wikilinks, generates MOC hubs. |
| spaces | Download or live-record X/Twitter Spaces. Start, check status, or stop recordings. |

Usage Examples

Take notes from a video

> "Take notes from https://youtube.com/watch?v=xxx"

The agent calls take_notes which downloads, transcribes, and returns formatted notes. One tool call does everything.

Search a podcast for topics

> "Search this podcast for every mention of AI regulation" — provide the file path or URL.

The agent uses search_audio for exact keyword matches, or deep_search for semantic matches (finds relevant discussion even without exact words).
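
The keyword path can be pictured as a scan over timestamped segments. The sketch below assumes each segment is a dict with `start`, `end`, and `text` keys; that shape is an assumption for illustration, as is the `search_segments` helper, and it does not reflect Augent's internal code.

```python
def search_segments(segments, keyword, context=1):
    """Return timestamped matches, each with neighbouring segments as context."""
    needle = keyword.lower()
    results = []
    for index, segment in enumerate(segments):
        if needle in segment["text"].lower():
            lo = max(0, index - context)
            hi = min(len(segments), index + context + 1)
            snippet = " ".join(s["text"].strip() for s in segments[lo:hi])
            results.append(
                {"start": segment["start"], "end": segment["end"], "snippet": snippet}
            )
    return results


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 4.0, "text": "Welcome back to the show."},
        {"start": 4.0, "end": 9.5, "text": "Today we discuss AI regulation in Europe."},
        {"start": 9.5, "end": 14.0, "text": "Then we turn to hardware."},
    ]
    for hit in search_segments(demo, "AI regulation"):
        print(hit["start"], hit["snippet"])
```

A semantic search would replace the substring test with a similarity score between embeddings of the query and each segment.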

Transcribe and identify speakers

> "Transcribe this meeting recording and tell me who said what"

The agent calls transcribe_audio then identify_speakers to label each segment by speaker.
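
One common way to combine the two results is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below illustrates that approach under assumed data shapes (segments with `start`/`end`/`text`, turns with `start`/`end`/`speaker`); it is not taken from Augent's source.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach to each segment the speaker with the largest time overlap.
    Segments that overlap no turn get speaker None."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```

Maximum overlap is a simple heuristic; segments that straddle a speaker change are assigned wholly to one speaker.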

Search across all transcriptions

> "Search everything I've ever transcribed for mentions of funding"

The agent uses search_memory to search across all stored transcriptions without needing a file path.

Export a clip

> "Clip the part where they talk about pricing"

The agent uses search_audio or deep_search to find the moment, then clip_export to extract just that segment.

Separate vocals from noisy audio

> "This recording has music in the background, clean it up and transcribe"

The agent calls separate_audio to isolate vocals, then transcribe_audio on the clean vocals track.

Generate speech from text

> "Read these notes aloud"

The agent calls text_to_speech to generate an MP3 with natural speech. Supports multiple voices and languages.

Note Styles

When using take_notes, the style parameter controls formatting:

| Style | Description |
| --- | --- |
| tldr | Shortest possible summary. One screen. Bold key terms. |
| notes | Clean sections with nested bullets (default). |
| highlight | Notes with callout blocks for key insights and blockquotes with timestamps. |
| eye-candy | Maximum visual formatting: callouts, tables, checklists, blockquotes. |
| quiz | Multiple-choice questions with answer key. |
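
Under the hood, an MCP client invokes a tool with a JSON-RPC `tools/call` request. The sketch below builds such a request for take_notes and validates the style against the five values listed above. The `style` parameter name comes from the text; the `url` argument name and the helper function are assumptions, and in practice an MCP client library sends this message for you.

```python
import json

NOTE_STYLES = ("tldr", "notes", "highlight", "eye-candy", "quiz")


def take_notes_request(url: str, style: str = "notes", request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call message for the take_notes tool."""
    if style not in NOTE_STYLES:
        raise ValueError(f"unknown style {style!r}; expected one of {NOTE_STYLES}")
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "take_notes",
            # "url" is an assumed argument name; "style" is named in the text.
            "arguments": {"url": url, "style": style},
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(take_notes_request("https://youtube.com/watch?v=xxx", style="tldr"))
```

Rejecting an unknown style before sending avoids a round trip that the server would presumably refuse anyway.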

Model Sizes

tiny is the default and handles nearly everything. Only use larger models for heavy accents, poor audio quality, or maximum accuracy needs.

| Model | Speed | Accuracy |
| --- | --- | --- |
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |

File Paths

Augent reads and writes to these locations on your machine:

| Path | Purpose |
| --- | --- |
| ~/Downloads/ | Default directory for downloaded audio files |
| ~/Desktop/ | Default directory for notes, clips, and TTS output |
| ~/.augent/memory/transcriptions.db | SQLite database for persistent transcription memory |
| ~/.augent/memory/transcriptions/ | Markdown files for each stored transcription |
| ~/.augent/config.yaml | User configuration (optional) |
| ~/.augent/auth.json | Twitter/X authentication cookies for Spaces recording (optional, user-created) |

If Obsidian is installed, visual frames are saved to the Obsidian vault's External Files/visual/ directory. The vault path is auto-detected from Obsidian's config.

Network Access

Network access is used for two purposes only:

  1. Downloading media from user-provided URLs via yt-dlp and aria2c
  2. Downloading ML models on first use (Whisper, sentence-transformers, pyannote, Demucs, Kokoro) from Hugging Face

No telemetry. No background network activity. No data is uploaded.

ML Dependencies

The augent[all] install includes these local ML components:

| Component | Purpose | Size |
| --- | --- | --- |
| faster-whisper | Speech-to-text transcription | ~75MB (tiny model) |
| sentence-transformers | Semantic search, auto-tagging, chapter detection | ~90MB |
| pyannote-audio | Speaker diarization | ~29MB |
| Demucs v4 | Audio source separation (vocals from noise) | ~80MB |
| Kokoro | Text-to-speech (54 voices, 9 languages) | ~200MB |

All models run locally. None require API keys or cloud services.

Requirements

  • Python 3.10+
  • FFmpeg (audio processing)
  • yt-dlp + aria2c (for audio downloads)
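
Before installing, the requirements above can be checked with a short preflight script. This is a convenience sketch, not something shipped with Augent; it assumes the external tools are installed under their usual executable names (`ffmpeg`, `yt-dlp`, `aria2c`), and the function name is hypothetical.

```python
import shutil
import sys

REQUIRED_TOOLS = ("ffmpeg", "yt-dlp", "aria2c")


def missing_requirements(min_python=(3, 10), tools=REQUIRED_TOOLS, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    problems.extend(f"{tool} not found on PATH" for tool in tools if which(tool) is None)
    return problems


if __name__ == "__main__":
    issues = missing_requirements()
    print("\n".join(issues) if issues else "All requirements found.")
```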

Version History

1 version in total

  • v1.5.2 (current)
    2026-05-03 07:04 · Safe · Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
