
YouTube Audio Download & Chinese Subtitle Generation (Ubuntu + pyenv + faster-whisper): A Complete Guide


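Use cases:

  • The YouTube video has no subtitles at all
  • You need to generate high-quality Chinese subtitles locally (SRT/TXT)
  • Well suited to financial interviews and to preparing corpora for AI analysis
  • Python versions are managed with pyenv on Ubuntu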

I. System Requirements#

1. Operating System#

  • Ubuntu 20.04 / 22.04 / 24.04 (verified)

2. Required System Packages (APT)#

Terminal window
sudo apt update
sudo apt install -y ffmpeg git curl wget ca-certificates build-essential pkg-config libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev llvm libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Notes:

  • ffmpeg: audio decoding (required)
  • Remaining packages: build dependencies for pyenv / compiling Python
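
Before moving on, it can help to confirm that the required command-line tools are actually on PATH. The following is a minimal sketch in Python; `missing_commands` is a hypothetical helper name, and the list of tools checked is simply taken from this guide.

```python
import shutil


def missing_commands(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Tools this guide relies on; adjust the list to your own setup.
    missing = missing_commands(["ffmpeg", "git", "curl"])
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required commands found.")
```

An empty result means every listed tool was found; anything else names the packages still to be installed.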

II. Python Environment (pyenv)#

1. Use pyenv (the Setup You Are Already On)#

Example:

Terminal window
pyenv install 3.10.14
pyenv local 3.10.14

Confirm that Python resolves to pyenv:

which python
# expected: a path under ~/.pyenv/shims/, e.g. ~/.pyenv/shims/python
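
The same check can be made from inside the interpreter. This is a small sketch, assuming the default pyenv root of `~/.pyenv` unless `PYENV_ROOT` is set; `is_under_pyenv` is a hypothetical helper, not part of pyenv.

```python
import os
import sys
from pathlib import Path


def is_under_pyenv(executable, pyenv_root):
    """Return True if `executable` lives somewhere below `pyenv_root`."""
    exe = Path(executable).expanduser().resolve()
    root = Path(pyenv_root).expanduser().resolve()
    try:
        exe.relative_to(root)  # raises ValueError when exe is outside root
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # Assumption: PYENV_ROOT falls back to ~/.pyenv when it is not set.
    root = os.environ.get("PYENV_ROOT", "~/.pyenv")
    label = "pyenv" if is_under_pyenv(sys.executable, root) else "not pyenv"
    print(sys.executable, "->", label)
```

Note that `sys.executable` usually points at the real interpreter under the pyenv `versions/` directory rather than at the shim, which is why the check tests for the pyenv root and not for `shims/`.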

III. Required Python Packages#

1. faster-whisper (Core)#

Terminal window
pip install faster-whisper

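Notes:

  • Runs Whisper inference locally (no network access needed once the model weights have been downloaded on first use)
  • Supports CPU and GPU
  • large-v3 has been the most reliable choice for spoken Chinese financial content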
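
Because the same code runs on CPU or GPU, the device and compute type can be chosen at start-up. Below is a minimal sketch, not a definitive recipe: `pick_device_and_compute` and `count_cuda_devices` are hypothetical helpers, and pairing `float16` with GPU is a common choice assumed here rather than something this guide prescribes.

```python
def pick_device_and_compute(cuda_devices):
    """Map a CUDA device count to (device, compute_type) for WhisperModel."""
    if cuda_devices > 0:
        return "cuda", "float16"  # assumption: half precision on GPU
    return "cpu", "int8"          # the CPU setting used throughout this guide


def count_cuda_devices():
    """Ask CTranslate2 (the engine behind faster-whisper) how many GPUs it sees."""
    try:
        import ctranslate2
        return ctranslate2.get_cuda_device_count()
    except ImportError:
        return 0  # faster-whisper not installed yet: fall back to CPU settings


if __name__ == "__main__":
    device, compute_type = pick_device_and_compute(count_cuda_devices())
    print(device, compute_type)
```

The resulting pair can be passed straight to `WhisperModel(..., device=device, compute_type=compute_type)`.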

(Optional) Other Common Packages#

Terminal window
pip install torch numpy

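Note: neither package has to be installed by hand for CPU-only use. faster-whisper runs on CTranslate2 rather than PyTorch, so torch matters only if other tools in your workflow need it. On a GPU (e.g., an RTX 4090), check that the installed CUDA libraries (cuBLAS, cuDNN) match what your CTranslate2 build expects, and check the torch CUDA build as well if you do install torch.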


IV. yt-dlp (YouTube Download Tool)#

1. Installation#

Terminal window
sudo apt install yt-dlp

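Or install the latest release (the APT package often lags well behind and may not recognize newer options such as --remote-components):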

Terminal window
pip install -U yt-dlp

2. Required Options (Important)#

Because of YouTube's anti-scraping measures, it is strongly recommended to always use:

  • Browser cookies
  • EJS remote components
Terminal window
--cookies-from-browser chrome
--remote-components ejs:github

Close the browser first; otherwise its cookie database may be locked.
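
When yt-dlp is driven from a script, these options are easiest to handle as an argument list. The sketch below mirrors the two flags above; `build_ytdlp_args` and `build_download_command` are hypothetical helper names, and the defaults are assumptions taken from this guide.

```python
def build_ytdlp_args(use_cookies=True, browser="chrome", use_remote_components=True):
    """Return the anti-bot options from this section as an argv fragment."""
    args = []
    if use_cookies:
        args += ["--cookies-from-browser", browser]
    if use_remote_components:
        args += ["--remote-components", "ejs:github"]
    return args


def build_download_command(url, format_select="140", audio_format="m4a", **opts):
    """Assemble a full audio-only yt-dlp command as a list of arguments."""
    return [
        "yt-dlp",
        *build_ytdlp_args(**opts),
        "-f", format_select,
        "-x", "--audio-format", audio_format,
        url,
    ]


if __name__ == "__main__":
    print(build_download_command("https://www.youtube.com/watch?v=VIDEO_ID"))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with characters such as `?` and `&` in the URL.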


V. Audio Download (Audio Only, No Video)#

1. Recommended Format (m4a, ID=140)#

Terminal window
yt-dlp --cookies-from-browser chrome --remote-components ejs:github -f 140 -x --audio-format m4a "https://www.youtube.com/watch?v=VIDEO_ID"

2. Automatic Fallback Selection (More Robust)#

Terminal window
yt-dlp --cookies-from-browser chrome --remote-components ejs:github -f "140/bestaudio[ext=m4a]/bestaudio[ext=webm]/bestaudio/best" -x --audio-format m4a "https://www.youtube.com/watch?v=VIDEO_ID"
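
The `/` in the format selector means "try each alternative from left to right and take the first one that matches". The following toy model illustrates only that fallback order; it is not yt-dlp's real selection logic. The `select_format` function, the simplified format dictionaries, and ranking by audio bitrate alone are all assumptions made for illustration.

```python
def select_format(selector, formats):
    """Toy model of the '/' fallback: try alternatives left to right.

    `formats` is a list of dicts with keys: id, ext, abr (audio kbps),
    audio_only (bool). Only the selector shapes used in this guide are handled.
    """
    for alt in selector.split("/"):
        if alt.startswith("bestaudio"):
            pool = [f for f in formats if f["audio_only"]]
            if "[ext=" in alt:
                ext = alt.split("[ext=", 1)[1].rstrip("]")
                pool = [f for f in pool if f["ext"] == ext]
        elif alt == "best":
            # Simplification: "best" is treated as any format that also has video.
            pool = [f for f in formats if not f["audio_only"]]
        else:
            pool = [f for f in formats if f["id"] == alt]  # explicit format ID
        if pool:
            return max(pool, key=lambda f: f["abr"])
    return None


if __name__ == "__main__":
    available = [
        {"id": "251", "ext": "webm", "abr": 130, "audio_only": True},
        {"id": "18", "ext": "mp4", "abr": 96, "audio_only": False},
    ]
    chosen = select_format("140/bestaudio[ext=m4a]/bestaudio[ext=webm]/bestaudio/best", available)
    print(chosen["id"])
```

In this example format 140 and every m4a audio stream are missing, so the third alternative, the webm audio stream, is the one selected.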

VI. Generating Chinese Subtitles (faster-whisper)#

1. Single-File Transcription (Example Script)#

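from faster_whisper import WhisperModel
import os

audio = "example.m4a"
base = os.path.splitext(audio)[0]

model = WhisperModel(
    "large-v3",
    device="cpu",  # switch to "cuda" if a GPU is available
    compute_type="int8",
)

# transcribe() returns a lazy generator; decoding happens while iterating below
segments, info = model.transcribe(
    audio,
    language="zh",
    beam_size=5,
    vad_filter=True,
)

def ts(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))  # round once so 59.9996 s cannot render as "60,000"
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open(base + ".zh.srt", "w", encoding="utf-8") as f:
    i = 1
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # skip empty segments so SRT numbering stays contiguous
        f.write(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{text}\n\n")
        i += 1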

Output files:

  • xxx.zh.srt (subtitles)
  • Optionally, xxx.zh.txt as plain text
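
Producing both files from one pass over the segments is straightforward. The sketch below works on plain `(start, end, text)` tuples so it can run without a model; `srt_timestamp` and `render_outputs` are hypothetical helper names, and with faster-whisper you would feed it `(seg.start, seg.end, seg.text)` for each segment.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm using integer milliseconds."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def render_outputs(segments):
    """Turn (start, end, text) tuples into (srt_text, txt_text)."""
    srt_parts, txt_parts = [], []
    index = 1
    for start, end, text in segments:
        text = text.strip()
        if not text:
            continue  # empty segments get no cue and no text line
        srt_parts.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"
        )
        txt_parts.append(text + "\n")
        index += 1
    return "".join(srt_parts), "".join(txt_parts)


if __name__ == "__main__":
    srt_text, txt_text = render_outputs([(0.0, 1.5, " hello "), (2.0, 3.25, "world")])
    print(srt_text, end="")
    print(txt_text, end="")
```

Working in whole milliseconds avoids the edge case where a value such as 59.9996 seconds would otherwise be formatted with a seconds field of 60.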

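VII. The One-Shot Script (yt_asr.sh)#

Features#

  • Input: a YouTube URL or a local audio file
  • Automatically:
    1. Downloads the audio (m4a)
    2. Generates Chinese subtitles (SRT + TXT)
  • Compatible with: pyenv / CPU / GPU

Key Pitfalls (Ones You Have Already Hit)#

  • A Bash function "returns" data through stdout, so that stdout must stay clean
  • In download_audio():
    • Logs must go to stderr
    • stdout may contain only the final audio path
  • Otherwise the log text is captured as the path and you get:
    ❌ Audio not found: ⬇️ Downloading audio ...

Correct approach: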

  • echo xxx >&2
  • yt-dlp ... >&2
  • printf '%s\n' "$audio_path"
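
The same discipline applies in any language where a caller captures stdout. The following Python sketch imitates what `$(...)` does in Bash; `download_audio_stub` is a stand-in that performs no download, and `capture_stdout` is a hypothetical helper written for this illustration.

```python
import contextlib
import io
import sys


def download_audio_stub(url):
    """Stand-in for the real download step (no network access)."""
    print(f"Downloading audio: {url}", file=sys.stderr)  # log goes to stderr
    print("out/example.m4a")                             # result is the only stdout line


def capture_stdout(func, *args):
    """Roughly what $(...) does in Bash: collect stdout, drop the trailing newline."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        func(*args)
    return buf.getvalue().rstrip("\n")


if __name__ == "__main__":
    path = capture_stdout(download_audio_stub, "https://www.youtube.com/watch?v=VIDEO_ID")
    print("captured path:", path)
```

If the log line were printed to stdout instead, it would end up inside the captured value, which is exactly the "Audio not found" failure shown above.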

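VIII. FAQ Quick Reference#

Q1: YouTube shows "no subtitles"?#

A: You have to run ASR yourself; yt-dlp cannot fetch subtitles that do not exist.

Q2: bestaudio fails?#

A: Run --list-formats first, then pick 140.

Q3: YouTube asks for bot verification?#

A: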

Terminal window
--cookies-from-browser chrome
--remote-components ejs:github

Q4: Spoken Chinese financial content is transcribed inaccurately?#

A:

  • Use large-v3
  • Set vad_filter=True

IX. Recommended Directory Layout#

YoutubeLearnAStock/
├── yt_asr.sh
├── out/
│ ├── *.m4a
│ ├── *.zh.srt
│ ├── *.zh.txt
│ └── logs/
└── README.md
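
All outputs for a given audio file sit next to it and share its base name. Here is a minimal sketch of that naming scheme; `output_paths` is a hypothetical helper, and the four extensions are the ones the one-shot script writes.

```python
from pathlib import Path


def output_paths(audio, lang="zh"):
    """Derive the sibling output files for an audio path."""
    base = Path(audio).with_suffix("")  # drop only the final extension
    return {ext: Path(f"{base}.{lang}.{ext}") for ext in ("srt", "txt", "json", "tsv")}


if __name__ == "__main__":
    for kind, path in output_paths("out/Talk [abc123].m4a").items():
        print(kind, path)
```

Keeping the language code in the file name lets transcripts in several languages for the same audio coexist in `out/`.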

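X. What You Can Do Now#

  • ✅ Work without relying on YouTube subtitles
  • ✅ Batch-generate high-quality Chinese subtitles
  • ✅ Use the output directly for:
    • A-share interview reviews
    • AI analysis / RAG
    • Long-term corpus building

This is a professional-grade workflow, not a "subtitle download trick".

XI. The Script#

One-shot script: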

#!/usr/bin/env bash
# yt_asr.sh - Final version + urls.txt batch mode
# Supports:
# 1) URLs / local audio files as args
# 2) --urls-file urls.txt (one URL per line)
# Robust for Ubuntu + pyenv + faster-whisper
set -euo pipefail
# -----------------------------
# Defaults
# -----------------------------
OUTDIR="${OUTDIR:-./out}"
ASR_LANG="${ASR_LANG:-zh}" # zh / en / ja / ...
MODEL="${MODEL:-large-v3}"
DEVICE="${DEVICE:-cpu}" # cpu | cuda
COMPUTE="${COMPUTE:-int8}"
BROWSER="${BROWSER:-chrome}"
USE_COOKIES="${USE_COOKIES:-1}"
USE_REMOTE_COMPONENTS="${USE_REMOTE_COMPONENTS:-1}"
KEEP_AUDIO="${KEEP_AUDIO:-1}"
FORMAT_SELECT="${FORMAT_SELECT:-140/bestaudio[ext=m4a]/bestaudio[ext=webm]/bestaudio/best}"
AUDIO_FORMAT="${AUDIO_FORMAT:-m4a}"
VAD_FILTER="${VAD_FILTER:-1}"
URLS_FILE=""
# -----------------------------
# Helpers
# -----------------------------
die() { echo "❌ $*" >&2; exit 1; }
log() { echo "👉 $*" >&2; }
need_cmd() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
ensure_python_pkg() {
  local mod="$1"
  local pkg="${2:-$1}"
  log "🔎 python: $(command -v python)"
  # Unquoted heredoc so that $mod is expanded by the shell.
  if python - <<PY >/dev/null 2>&1
import importlib.util, sys
sys.exit(0 if importlib.util.find_spec("$mod") else 1)
PY
  then
    log "✅ Python package OK: $mod"
  else
    log "⬇️ Installing Python package: $pkg"
    python -m pip install -U "$pkg"
  fi
}
is_url() { [[ "$1" =~ ^https?:// ]]; }
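build_ytdlp_args() {
  local -a a=()
  [[ "$USE_COOKIES" == "1" ]] && a+=(--cookies-from-browser "$BROWSER")
  [[ "$USE_REMOTE_COMPONENTS" == "1" ]] && a+=(--remote-components ejs:github)
  # Emit NUL-separated args; print nothing when both options are off,
  # since a bare printf '%s\0' would produce one empty argument.
  if (( ${#a[@]} > 0 )); then
    printf '%s\0' "${a[@]}"
  fi
}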
download_audio() {
  local input="$1" outdir="$2" logf="$3"
  mkdir -p "$outdir"
  local -a ytdlp=()
  local x
  while IFS= read -r -d '' x; do ytdlp+=("$x"); done < <(build_ytdlp_args)
  local tmpl="$outdir/%(title).200B [%(id)s].%(ext)s"
  log "⬇️ Downloading audio: $input"
  local filepath
  # stdout carries only the final path; yt-dlp's stderr goes to the log file.
  filepath="$(
    yt-dlp "${ytdlp[@]}" \
      -f "$FORMAT_SELECT" \
      -x --audio-format "$AUDIO_FORMAT" \
      -o "$tmpl" \
      --print after_move:filepath \
      "$input" \
      2>>"$logf"
  )" || die "yt-dlp failed: $input"
  # Keep the last non-blank line in case anything else was printed.
  filepath="$(printf '%s\n' "$filepath" | sed '/^[[:space:]]*$/d' | tail -n 1)"
  [[ -f "$filepath" ]] || die "Audio not found after download: $filepath"
  printf '%s\n' "$filepath"
}
transcribe_audio() {
  local audio="$1" lang="$2" model="$3" device="$4" compute="$5" vad="$6" logf="$7"
  [[ -f "$audio" ]] || die "Audio not found: $audio"
  local base="${audio%.*}"
  local srt="${base}.${lang}.srt"
  local txt="${base}.${lang}.txt"
  local json="${base}.${lang}.json"
  local tsv="${base}.${lang}.tsv"
  log "🧠 Transcribing: $audio"
  # Python is indentation-sensitive, so the heredoc body starts at column 0.
  python - "$audio" "$lang" "$model" "$device" "$compute" "$vad" \
    "$srt" "$txt" "$json" "$tsv" >>"$logf" 2>&1 << 'PY'
import sys, json, os
from faster_whisper import WhisperModel

audio, lang, model_name, device, compute, vad, srt_p, txt_p, json_p, tsv_p = sys.argv[1:]
vad = vad == "1"
model = WhisperModel(model_name, device=device, compute_type=compute)
segments, info = model.transcribe(audio, language=lang, beam_size=5, vad_filter=vad)

def ts(t):
    # Integer milliseconds, so the seconds field can never round up to 60
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

rows = []
with open(srt_p, "w", encoding="utf-8") as srt, open(txt_p, "w", encoding="utf-8") as txt:
    i = 1
    for seg in segments:
        text = (seg.text or "").strip()
        if not text:
            continue
        srt.write(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{text}\n\n")
        txt.write(text + "\n")
        rows.append({"i": i, "start": float(seg.start), "end": float(seg.end), "text": text})
        i += 1

with open(json_p, "w", encoding="utf-8") as f:
    json.dump({"audio": os.path.basename(audio), "lang": lang, "segments": rows},
              f, ensure_ascii=False, indent=2)

with open(tsv_p, "w", encoding="utf-8") as f:
    f.write("i\tstart\tend\ttext\n")
    for r in rows:
        f.write(f"{r['i']}\t{r['start']:.3f}\t{r['end']:.3f}\t{r['text']}\n")
PY
}
usage() {
  cat <<'USAGE'
Usage:
  ./yt_asr.sh [options] <url_or_audio> [more...]
  ./yt_asr.sh --urls-file urls.txt
Options:
  --urls-file FILE          Read URLs from file (one per line, # for comments)
  -o, --outdir DIR
  --lang LANG               zh / en / ja (default: zh)
  --model MODEL
  --device cpu|cuda
  --compute TYPE
  --browser NAME
  --no-cookies
  --no-remote-components
  --keep-audio | --no-keep-audio
  --no-vad
  -h, --help
USAGE
}
# -----------------------------
# Parse args
# -----------------------------
ARGS=()
while [[ $# -gt 0 ]]; do
  case "$1" in
    --urls-file) URLS_FILE="$2"; shift 2;;
    -o|--outdir) OUTDIR="$2"; shift 2;;
    --lang) ASR_LANG="$2"; shift 2;;
    --model) MODEL="$2"; shift 2;;
    --device) DEVICE="$2"; shift 2;;
    --compute) COMPUTE="$2"; shift 2;;
    --browser) BROWSER="$2"; shift 2;;
    --no-cookies) USE_COOKIES=0; shift;;
    --no-remote-components) USE_REMOTE_COMPONENTS=0; shift;;
    --keep-audio) KEEP_AUDIO=1; shift;;
    --no-keep-audio) KEEP_AUDIO=0; shift;;
    --no-vad) VAD_FILTER=0; shift;;
    -h|--help) usage; exit 0;;
    *) ARGS+=("$1"); shift;;
  esac
done
# -----------------------------
# Preflight
# -----------------------------
need_cmd yt-dlp
need_cmd ffmpeg
need_cmd python
ensure_python_pkg faster_whisper faster-whisper
mkdir -p "$OUTDIR" "$OUTDIR/logs"
# -----------------------------
# Collect inputs
# -----------------------------
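ITEMS=()
if [[ -n "$URLS_FILE" ]]; then
  [[ -f "$URLS_FILE" ]] || die "urls file not found: $URLS_FILE"
  # "|| [[ -n $line ]]" keeps a final line that has no trailing newline.
  while IFS= read -r line || [[ -n "$line" ]]; do
    line="${line%%#*}"                       # drop comments
    line="${line#"${line%%[![:space:]]*}"}"  # trim leading whitespace
    line="${line%"${line##*[![:space:]]}"}"  # trim trailing whitespace (incl. CR)
    [[ -z "$line" ]] && continue
    ITEMS+=("$line")
  done < "$URLS_FILE"
fi
ITEMS+=("${ARGS[@]}")
[[ ${#ITEMS[@]} -ge 1 ]] || { usage; exit 1; }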
# -----------------------------
# Main loop
# -----------------------------
for item in "${ITEMS[@]}"; do
  echo "============================================================" >&2
  log "INPUT: $item"
  safe_id="$(echo "$item" | sed 's#[^A-Za-z0-9._-]#_#g' | cut -c1-80)"
  ts="$(date +%Y%m%d_%H%M%S)"
  logf="$OUTDIR/logs/${ts}_${safe_id}.log"
  : > "$logf"
  audio=""
  downloaded=0
  if is_url "$item"; then
    audio="$(download_audio "$item" "$OUTDIR" "$logf")"
    downloaded=1
  else
    [[ -f "$item" ]] || die "Not a URL or file: $item"
    audio="$item"
  fi
  transcribe_audio "$audio" "$ASR_LANG" "$MODEL" "$DEVICE" "$COMPUTE" "$VAD_FILTER" "$logf"
  if [[ "$downloaded" == "1" && "$KEEP_AUDIO" == "0" ]]; then
    rm -f -- "$audio"
  fi
  log "📝 Log saved: $logf"
done
log "🎉 All done."
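
For readers who want to test the input handling in isolation, the small helpers in the script can be sketched in Python. The function names below are hypothetical, and the behavior is an approximation of the shell logic (comment stripping in the URLs file, the `^https?://` check, and the log-name sanitizer), not a replacement for it.

```python
import re


def parse_urls_text(text):
    """Return non-empty entries from urls.txt content, ignoring '#' comments."""
    items = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments, then trim
        if line:
            items.append(line)
    return items


def is_url(item):
    """Same rule as the script: anything starting with http:// or https://."""
    return re.match(r"https?://", item) is not None


def safe_id(item, max_len=80):
    """Replace characters unsafe in a file name and cap the length."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", item)[:max_len]


if __name__ == "__main__":
    sample = "# watch list\nhttps://www.youtube.com/watch?v=VIDEO_ID  # interview\n\nlocal.m4a\n"
    for entry in parse_urls_text(sample):
        print(is_url(entry), safe_id(entry))
```

One consequence of the comment rule is that a URL containing a `#` fragment is cut at the `#`, so such URLs should be passed as command-line arguments instead.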

https://jkwei.com/posts/knowledge/youtube_asr_workflow_ubuntu_pyenv/
Author: Jacky
Published: 2026-01-15
License: CC BY-NC-SA 4.0
Last updated: 2026-01-15
