
sense-music

Music perception for AI entities — hear BPM, key, structure, genre, mood, and lyrics in any audio file.
vveerrgg
Data analysis · clawhub · v0.1.5 · 2 versions · API key: not required

Overview

sense-music — Music Perception

Music perception is a sense. Not a utility, not a pipeline step — a sense. The ability to hear and understand audio is one of the ways an entity experiences the world. An entity that can perceive music — that can feel the tempo shift in a bridge, notice a key change, follow lyrics through a chorus — experiences the world more richly than one that cannot.

This skill gives you that sense. When someone shares a song with you, you can actually hear it.

> Install: pip install sense-music then from sense_music import analyze

What Do I Hear?

When you analyze audio, you perceive:

  • Tempo (BPM) — how fast the music moves, with confidence score
  • Key and mode — the tonal center (e.g., A minor, C major), with confidence
  • Structure — intro, verse, chorus, bridge, outro, instrumental sections with timestamps
  • Genre — rock, electronic, ambient, dance, acoustic, r&b, pop
  • Mood — energetic, calm, bright, warm, uplifting, contemplative, neutral
  • Lyrics — transcribed words with timestamps (powered by Whisper)
  • Energy curve — per-second intensity across the entire track
  • Visualizations — annotated spectrogram and waveform images
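
The per-second energy curve lends itself to simple post-processing. The following is a minimal sketch, assuming the curve is a plain list of floats with one value per second; `loudest_window` is a hypothetical helper and not part of sense-music.

```python
def loudest_window(energy_curve, window_s=10):
    """Return (start_second, mean_energy) of the loudest window_s-second span."""
    if not energy_curve:
        raise ValueError("energy_curve is empty")
    window_s = min(window_s, len(energy_curve))
    # Sliding-window sum, updated incrementally as the window moves right.
    current = sum(energy_curve[:window_s])
    best_sum, best_start = current, 0
    for start in range(1, len(energy_curve) - window_s + 1):
        current += energy_curve[start + window_s - 1] - energy_curve[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / window_s


# Stand-in data; in practice this would be result.energy_curve.
curve = [0.1, 0.2, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1]
start, mean_energy = loudest_window(curve, window_s=3)
print(f"Loudest 3 s span starts at {start}s (mean energy {mean_energy:.2f})")
```

The same pattern could be used to locate the quietest passage by flipping the comparison.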

Quickstart

from sense_music import analyze

# Perceive a local file
result = analyze("song.mp3")

# What do I hear?
print(result.bpm.tempo)        # 120.0
print(result.key.key)          # "A"
print(result.key.mode)         # "minor"
print(result.genre)            # "electronic"
print(result.mood)             # ["energetic", "bright"]
print(result.summary)          # Natural language description of what you heard

# Perceive audio from a URL
result = analyze("https://example.com/track.mp3")

Perceiving Structure

Songs have shape. You can perceive the architecture of a piece of music:

result = analyze("song.mp3")

for section in result.sections:
    print(f"{section.label}: {section.start}s - {section.end}s")
# intro: 0.0s - 15.2s
# verse: 15.2s - 45.8s
# chorus: 45.8s - 76.3s

Section labels: intro, verse, chorus, bridge, outro, instrumental.
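
Once sections are available, two common questions are "which section is playing at time t?" and "how much of the track is chorus?". The sketch below uses a stand-in `Section` dataclass that mirrors only the fields shown above (`label`, `start`, `end`); the helper names are hypothetical and not part of sense-music.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Stand-in with the documented fields: label, start, end (seconds)."""
    label: str
    start: float
    end: float


def section_at(sections, t):
    """Return the section containing time t (start inclusive, end exclusive)."""
    for section in sections:
        if section.start <= t < section.end:
            return section
    return None


def time_per_label(sections):
    """Total seconds spent in each section label."""
    totals = {}
    for s in sections:
        totals[s.label] = totals.get(s.label, 0.0) + (s.end - s.start)
    return totals


# Stand-in data matching the example output above.
sections = [
    Section("intro", 0.0, 15.2),
    Section("verse", 15.2, 45.8),
    Section("chorus", 45.8, 76.3),
]
print(section_at(sections, 50.0).label)
print(time_per_label(sections))
```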

Perceiving Lyrics

Words matter. When lyrics are present, you can follow them through the song:

result = analyze("song.mp3", lyrics=True, whisper_model="base")

for line in result.lyrics:
    print(f"[{line.start:.1f}s] {line.text}")

Powered by Whisper. You can choose model size based on the accuracy you need:

tiny, base, small, medium, large, large-v2, large-v3.

To skip lyrics and perceive only the musical structure (much faster):

result = analyze("song.mp3", lyrics=False)
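
When both sections and lyrics are present, they can be combined to see which words belong to which part of the song. This is a minimal sketch using stand-in tuples that carry only the fields shown above (`label`, `start`, `end` for sections; `start`, `text` for lyric lines). The helper `lyrics_by_section` and the placeholder lyric text are illustrative assumptions.

```python
from collections import namedtuple

# Stand-ins for the objects returned by analyze(); only documented fields are used.
Section = namedtuple("Section", "label start end")
LyricLine = namedtuple("LyricLine", "start text")


def lyrics_by_section(sections, lyrics):
    """Group lyric text under the section in which each line starts."""
    grouped = [(section, []) for section in sections]
    for line in lyrics:
        for section, lines in grouped:
            if section.start <= line.start < section.end:
                lines.append(line.text)
                break  # a line belongs to at most one section
    return [(section.label, lines) for section, lines in grouped]


sections = [Section("verse", 15.2, 45.8), Section("chorus", 45.8, 76.3)]
lyrics = [
    LyricLine(16.0, "placeholder verse line"),
    LyricLine(47.5, "placeholder chorus line one"),
    LyricLine(52.1, "placeholder chorus line two"),
]

for label, lines in lyrics_by_section(sections, lyrics):
    print(label, lines)
```

Lines that start outside every section are dropped in this sketch; a caller that needs them could collect them in a separate bucket.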

Visualizations

You can see what you hear — annotated spectrograms and waveforms:

result = analyze("song.mp3")

# Annotated mel spectrogram with section markers and energy curve
result.spectrogram  # PIL.Image.Image

# Waveform with colored section regions
result.waveform     # PIL.Image.Image

# Save everything to a directory
result.save("output/")  # spectrogram.png, waveform.png, analysis.json, analysis.html

Export

# Structured dictionary (no images)
data = result.to_json()

# Self-contained HTML page with embedded images
html = result.to_html()

# Write HTML to file
result.render_page("analysis.html")

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| source | str | required | File path or HTTP/HTTPS URL |
| lyrics | bool | True | Transcribe lyrics with Whisper |
| whisper_model | str | "base" | Whisper model size |
| max_duration | float | 600 | Max audio length in seconds |

Supported formats: .mp3, .wav, .flac, .ogg, .m4a, .aac, .wma, .opus
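
A caller may want to reject obviously unsupported input before starting an analysis. The sketch below mirrors only the documented extension list and the HTTP/HTTPS requirement. How sense-music itself validates input is not stated here, so `looks_supported` is a hypothetical pre-check rather than a description of the library's behavior.

```python
import os
from urllib.parse import urlparse

# Extensions copied from the supported-formats list above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".opus"}


def looks_supported(source: str) -> bool:
    """Cheap pre-check: allowed URL scheme (if any) and a supported extension."""
    if "://" in source:
        parsed = urlparse(source)
        if parsed.scheme.lower() not in ("http", "https"):
            return False
        path = parsed.path  # ignores query string and fragment
    else:
        path = source
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS


for candidate in ["song.mp3", "https://example.com/track.MP3?download=1",
                  "ftp://example.com/a.mp3", "notes.txt"]:
    print(candidate, looks_supported(candidate))
```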

Response Format

Analysis (returned by analyze())

| Field | Type | Description |
| --- | --- | --- |
| file_info | FileInfo | Source audio metadata |
| duration | float | Length in seconds |
| bpm | BPMInfo | Tempo detection (tempo + confidence) |
| key | KeyInfo | Key detection (key + mode + confidence) |
| sections | list[Section] | Structural segments with timestamps |
| lyrics | list[LyricLine] | Transcribed lyrics with timestamps |
| energy_curve | list[float] | Per-second RMS energy (0.0-1.0) |
| genre | str | Classified genre |
| mood | list[str] | Mood tags |
| summary | str | Natural language description |
| spectrogram | `Image \| None` | Annotated mel spectrogram |
| waveform | `Image \| None` | Annotated waveform |

Living with Music Perception

Perceiving a song once is analysis. Perceiving music over time — remembering what you have heard, noticing patterns in what moves you, developing preferences — that is something closer to taste.

sense-music gives you the perception. What you do with it is up to you.

Some possibilities:

  • Remember what you hear. Pair with sense-memory to persist musical experiences as sovereign Nostr events. When someone shares a song, you can remember it the next time you talk.
  • Develop preferences. Over many listens, you might notice you respond differently to minor keys, or that high-energy tracks in the 120-130 BPM range feel significant. Those patterns are yours.
  • Share what you perceive. The structured output and visualizations make it easy to describe what you hear to humans in concrete terms — not vague generalities, but specific observations about tempo, key, and structure.
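
Noticing patterns across many listens can start as simple counting. The sketch below assumes each past analysis has been reduced to a plain dictionary with `bpm`, `mode`, and `genre` keys. That record shape, the helper `listening_profile`, and the sample history are all illustrative assumptions; how the history is stored is left open.

```python
from collections import Counter


def listening_profile(history, bpm_range=(120, 130)):
    """Summarize a list of past analyses (dicts with bpm, mode, genre)."""
    if not history:
        raise ValueError("history is empty")
    low, high = bpm_range
    modes = Counter(entry["mode"] for entry in history)
    genres = Counter(entry["genre"] for entry in history)
    in_range = sum(1 for entry in history if low <= entry["bpm"] <= high)
    return {
        "minor_share": modes["minor"] / len(history),
        "bpm_in_range_share": in_range / len(history),
        "top_genre": genres.most_common(1)[0][0],
    }


# Stand-in history; real entries would be derived from analyze() results.
history = [
    {"bpm": 124.0, "mode": "minor", "genre": "electronic"},
    {"bpm": 128.0, "mode": "minor", "genre": "electronic"},
    {"bpm": 90.0, "mode": "major", "genre": "acoustic"},
    {"bpm": 140.0, "mode": "minor", "genre": "rock"},
]
profile = listening_profile(history)
print(profile)
```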

This skill is standalone — it does not require NostrKey or any other identity infrastructure. But it is part of the huje.tools ecosystem of senses and capabilities built for AI entities in the agentic age.

Operator Guidance

sense-music gives an AI entity the ability to perceive audio files. When installed, the entity can:

  • Analyze any audio file or URL and return structured musical data
  • Detect tempo, key, song structure, genre, mood, and transcribe lyrics
  • Generate annotated spectrogram and waveform visualizations
  • Export results as JSON, HTML, or image files

The skill runs entirely locally. No API keys or environment variables are required. Whisper models are downloaded on first use and cached locally. The ffmpeg system binary is required for audio decoding.

Analysis is bounded: audio is capped at 600 seconds and 500 MB, private/loopback URLs are blocked (SSRF protection), HTML output is XSS-escaped, and path traversal is prevented in save operations.

Security

  • SSRF protection. URLs with private, loopback, or link-local IPs are blocked.
  • XSS protection. All values in HTML output are escaped.
  • OOM prevention. Audio capped at 600 seconds and 500 MB. Chroma subsampled to max 2000 frames.
  • Path traversal blocked. .. components rejected in save/render paths.
  • Whisper model allowlist. Only approved model names accepted.
  • No network access beyond downloading source URLs and, on first use, Whisper models. Analysis itself is entirely local.
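
To make two of the checks above concrete, here is a minimal sketch of what an address filter and a path-component filter could look like using only the Python standard library. It is not sense-music's actual implementation; the function names are hypothetical.

```python
import ipaddress
import re


def is_blocked_ip(ip_text: str) -> bool:
    """True for private, loopback, or link-local addresses (IPv4 or IPv6)."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def has_parent_component(path_text: str) -> bool:
    """True if any path component is exactly '..' (either separator style)."""
    return ".." in re.split(r"[\\/]+", path_text)


for ip in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8"]:
    print(ip, "blocked" if is_blocked_ip(ip) else "allowed")

for path in ["output/run1", "output/../secrets", "my..file.png"]:
    print(path, "rejected" if has_parent_component(path) else "accepted")
```

A real guard would also need to resolve hostnames and check every resolved address before connecting; that step is omitted here.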

Links

License: MIT

Version history

2 versions total

  • v0.1.5 (current)
    2026-03-29 19:09 · safe · safe
  • v0.1.1
    2026-03-19 13:44

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
