by modelscope
Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.
# Add to your Claude Code skills
git clone https://github.com/modelscope/FunASR
Last scanned: 5/25/2026
pip install funasr
from funasr import AutoModel
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav")
Output — structured text with speaker labels, timestamps, and punctuation:
[00:00.4 → 00:03.8] Speaker 0: Let's discuss the Q3 plan.
[00:04.2 → 00:07.1] Speaker 1: Sounds good. I have three points.
[00:07.5 → 00:12.3] Speaker 0: Go ahead. We have 30 minutes.
That's it. One model, one call — VAD segmentation, speech recognition, punctuation, speaker diarization all happen automatically.
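The raw return value of `model.generate` is not shown above. As a minimal sketch, assume each recognized segment is a dict with `start` and `end` in milliseconds, a `spk` index, and `text` (for example, entries of `result[0]["sentence_info"]`; this layout is an assumption and should be checked against your FunASR version). The helper names `fmt_ts` and `format_segments` are hypothetical. Under those assumptions, the transcript lines above could be rendered like this:

```python
def fmt_ts(ms):
    """Format milliseconds as MM:SS.t (tenths of a second)."""
    tenths = (int(ms) + 50) // 100        # round to the nearest tenth of a second
    minutes, rest = divmod(tenths, 600)   # 600 tenths per minute
    return f"{minutes:02d}:{rest // 10:02d}.{rest % 10}"


def format_segments(segments):
    """Render one transcript line per segment."""
    return [
        f"[{fmt_ts(s['start'])} → {fmt_ts(s['end'])}] Speaker {s['spk']}: {s['text']}"
        for s in segments
    ]


# Hypothetical segments shaped like the assumed sentence_info entries.
segments = [
    {"start": 400, "end": 3800, "spk": 0, "text": "Let's discuss the Q3 plan."},
    {"start": 4200, "end": 7100, "spk": 1, "text": "Sounds good. I have three points."},
]

for line in format_segments(segments):
    print(line)
```

Keeping the formatting separate from recognition means the same segments can also be written to JSON or subtitles without calling the model again.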
Deploy as API server:
funasr-server --device cuda
→ OpenAI-compatible endpoint at localhost:8000

Use with AI agents: MCP Server for Claude/Cursor · OpenAI API for LangChain/Dify/AutoGen
| | FunASR | Whisper | Cloud APIs |
|---|---|---|---|
| Speed | 170x realtime | 13x realtime | ~1x realtime |
| Speaker ID | ✅ Built-in | ❌ Needs pyannote | ✅ Extra cost |
| Emotion | ✅ Happy/Sad/Angry | ❌ | ❌ |
| Languages | 50+ | 57 | Varies |
| Streaming | ✅ WebSocket | ❌ | ✅ |
| vLLM Acceleration | ✅ 2-3x faster | ❌ | N/A |
| Self-hosted | ✅ MIT license | ✅ MIT license | ❌ Cloud only |
| Cost | Free | Free | $0.006/min+ |
| CPU viable | ✅ 17x realtime | ❌ Too slow | N/A |
Planning a switch from Whisper or a cloud ASR provider? Use the migration guide and benchmark example to test representative audio, map features, and roll out safely.
Benchmark set: 184 long-form audio files (192 min total). Full report →
| Model | GPU Speed | CPU Speed | vs Whisper-large-v3 |
|-------|-----------|-----------|-------------------|
| SenseVoice-Small | 170x realtime | 17x realtime | 🚀 13x faster |
| Paraformer-Large | 120x realtime | 15x realtime | 🚀 9x faster |
| Whisper-large-v3-turbo | 46x realtime | ❌ | 3.4x faster |
| Fun-ASR-Nano | 17x realtime | 3.6x realtime | 1.3x faster |
| Whisper-large-v3 | 13x realtime | ❌ | baseline |
Key takeaway: SenseVoice-Small and Paraformer-Large run faster on CPU (17x and 15x realtime) than Whisper-large-v3 runs on GPU (13x realtime).
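To make the realtime factors concrete, here is a small worked calculation. It assumes "Nx realtime" means N seconds of audio processed per second of wall-clock time, and it uses the 192-minute benchmark duration and the GPU figures from the table. The function name `processing_seconds` is illustrative, not part of FunASR.

```python
def processing_seconds(audio_minutes, realtime_factor):
    """Wall-clock seconds needed to transcribe audio at a given realtime factor."""
    return audio_minutes * 60 / realtime_factor


BENCHMARK_MINUTES = 192  # total duration of the benchmark set

# GPU realtime factors copied from the benchmark table.
gpu_speeds = {
    "SenseVoice-Small": 170,
    "Paraformer-Large": 120,
    "Whisper-large-v3-turbo": 46,
    "Fun-ASR-Nano": 17,
    "Whisper-large-v3": 13,
}

for name, factor in gpu_speeds.items():
    seconds = processing_seconds(BENCHMARK_MINUTES, factor)
    print(f"{name}: {seconds:.0f} s")
```

By this estimate the full set takes roughly a minute at 170x and about fifteen minutes at 13x. Real throughput also depends on batching, audio length, and hardware.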
funasr-server CLI, OpenAI-compatible API, MCP Server for AI agents.
pip install --upgrade funasr
pip install funasr
git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip install -e ./
Requirements: Python ≥ 3.8, PyTorch ≥ 1.13, torchaudio
| Model | Task | Languages | Params | Links |
|-------|------|-----------|--------|-------|
| Fun-ASR-Nano | ASR + timestamps | 31 languages | 800M | ⭐ 🤗 |
| SenseVoiceSmall | ASR + emotion + events | zh/en/ja/ko/yue | 234M | ⭐ 🤗 |
| Paraformer-zh | ASR + timestamps | zh/en | 220M | ⭐ 🤗 |
| Paraformer-zh-streaming | Streaming ASR | zh/en | 220M | ⭐ 🤗 |
| Qwen3-ASR | ASR, 52 languages | multilingual | 1.7B | usage |
| GLM-ASR-Nano | ASR, 17 languages | multilingual | 1.5B | usage |
| Whisper-large-v3 | ASR + translation | multilingual | 1550M | usage |
| Whisper-large-v3-turbo | ASR + translation | multilingual | 809M | usage |
| ct-punc | Punctuation | zh/en | 290M | ⭐ 🤗 |
| fsmn-vad | VAD | zh/en | 0.4M | ⭐ 🤗 |
| cam++ | Speaker diarization | — | 7.2M | ⭐ 🤗 |
| emotion2vec+large | Emotion recognition | — | 300M | ⭐ 🤗 |
Full examples with parameter docs: Tutorial →
from funasr import AutoModel
# Chinese production (VAD + ASR + punctuation + speaker)
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav", hotword="关键词 20")
# 31 languages with timestamps
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", hub="hf", trust_remote_code=True,
vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, device="cuda")
result = model.generate(input="audio.wav", batch_size=1)
# Streaming real-time (reuse the same cache dict for every chunk of one stream)
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
cache = {}
result = model.generate(input="chunk.wav", cache=cache, chunk_size=[0, 10, 5])
# Emotion recognition
model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")
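The streaming snippet above shows a single call. In practice the audio is fed chunk by chunk, with one cache shared across calls. The sketch below only shows the chunking step. It assumes 16 kHz mono samples and that each unit of `chunk_size` corresponds to a 60 ms frame (960 samples), so `[0, 10, 5]` gives 600 ms chunks. Those figures are not stated in this document and should be treated as assumptions, and `iter_chunks` is a hypothetical helper, not a FunASR function.

```python
def iter_chunks(samples, chunk_size=(0, 10, 5), frame_samples=960):
    """Yield (chunk, is_final) pairs covering `samples` in order.

    frame_samples=960 assumes 60 ms frames at 16 kHz (an assumption).
    """
    stride = chunk_size[1] * frame_samples        # samples per chunk
    total = max(1, -(-len(samples) // stride))    # ceiling division, at least one chunk
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


# One second of silence stands in for real audio here.
speech = [0.0] * 16000
for chunk, is_final in iter_chunks(speech):
    print(len(chunk), is_final)

# Assumed usage with the streaming model (not executed here):
# cache = {}
# for chunk, is_final in iter_chunks(speech):
#     result = model.generate(input=chunk, cache=cache,
#                             is_final=is_final, chunk_size=[0, 10, 5])
```

Flagging the last chunk lets the recognizer flush any buffered audio at the end of the stream. Whether `generate` accepts an `is_final` argument in your version should be confirmed in the FunASR tutorial.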
# OpenAI-compatible API (recommended)
pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
# → POST /v1/audio/transcriptions at localhost:8000
Verify it with a public sample:
curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@sample.wav \
-F model=sensevoice \
-F response_format=verbose_json
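The same request can be made from Python. The sketch below mirrors the curl call using only the standard library. The endpoint path and the `file`, `model`, and `response_format` fields come from the command above. The helper names `build_transcription_request` and `transcribe` are hypothetical, and the response is assumed to be JSON because `verbose_json` is requested.

```python
import json
import urllib.request
import uuid


def build_transcription_request(audio_bytes, filename="sample.wav", model="sensevoice",
                                response_format="verbose_json",
                                base_url="http://localhost:8000"):
    """Build a multipart/form-data POST equivalent to the curl command."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields.
    for name, value in (("model", model), ("response_format", response_format)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The audio file field.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, **kwargs):
    """Send the file at `path` to a running server and return the parsed JSON."""
    with open(path, "rb") as f:
        request = build_transcription_request(f.read(), filename=path, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `transcribe("sample.wav")` would return the parsed response. Since the endpoint is described as OpenAI-compatible, an OpenAI client library pointed at `http://localhost:8000/v1` is likely a simpler option where one is already in use.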
# Docker streaming service
docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.12
OpenAI API example → · Client recipes → · Workflow recipes → · Postman collection → · OpenAPI spec → · Deployment matrix → · Deployment docs → · Agent integration →
| | | |---|---| | 📖 Documentation | 🐛 [Issue