This commit is contained in:
kura 2026-05-04 15:02:40 +08:00
parent 25a8cd077b
commit 77199541cd
5 changed files with 354 additions and 38 deletions

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 kuraa
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

178
README.en.md Normal file

@@ -0,0 +1,178 @@
<picture>
<source media="(prefers-color-scheme: dark)" srcset="readme/screenshot-main.png">
<img src="readme/screenshot-main.png" alt="CrossSubtitle-AI Screenshot" width="100%">
</picture>
<div align="center">
# CrossSubtitle-AI
**AI-Powered, Local-First Subtitle Workbench**
[![GitHub Release](https://img.shields.io/github/v/release/AndySkaura/crosssubtitle-ai?style=flat-square)](https://github.com/AndySkaura/crosssubtitle-ai/releases)
[![GitHub License](https://img.shields.io/github/license/AndySkaura/crosssubtitle-ai?style=flat-square)](https://github.com/AndySkaura/crosssubtitle-ai/blob/main/LICENSE)
[![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Windows-blue?style=flat-square)](#)
**English** · [简体中文](./README.md)
</div>
---
## About
CrossSubtitle-AI is a **local-first** audio/video subtitle processing tool. It uses [Whisper](https://github.com/ggerganov/whisper.cpp) for speech recognition, [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection, and supports OpenAI-compatible APIs for intelligent translation — helping you quickly transcribe and translate media files into bilingual subtitles.
All speech recognition runs locally on your machine. No audio or video files are ever uploaded to any server, ensuring your data privacy.
## Features
- **Speech Recognition** — High-accuracy speech-to-text powered by Whisper, supporting 17 source languages, including Chinese, English, Japanese, Korean, and French
- **Voice Activity Detection** — Silero VAD precisely splits speech segments and automatically filters out silence
- **Smart Translation** — Connect to any OpenAI-compatible API (GLM, DeepSeek, ChatGPT, etc.) to translate transcripts into your target language
- **Audio Extraction** — Built-in FFmpeg automatically extracts audio and converts to 16kHz mono WAV
- **Multiple Export Formats** — Export subtitles in SRT, VTT, and ASS formats
- **Bilingual Export** — Export side-by-side original + translated bilingual subtitles
- **Subtitle Editor** — Built-in editor for modifying both source text and translations line by line
- **Drag & Drop** — Drag and drop files to quickly create tasks
- **Task Queue** — Batch process multiple media files with real-time progress tracking
- **Bilingual UI** — Switch between Chinese and English interface languages
- **Local-First** — Speech recognition runs entirely locally, no data upload required
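The audio-extraction step is implemented in `src-tauri/src/audio.rs`; purely as an illustration (the function name is hypothetical, not the project's actual code), the equivalent FFmpeg argument list for producing a 16kHz mono WAV could be assembled like this:

```python
def extract_audio_cmd(src: str, dst: str) -> list[str]:
    # -vn drops the video stream; -ar/-ac select 16 kHz mono;
    # pcm_s16le is the standard 16-bit PCM WAV codec
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]
```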
## Workflow
1. **Choose Mode** — Select "Source" for transcription only, or "Translate" mode for automatic translation after transcription
2. **Add Task** — Click "Add Task" or drag-and-drop media files onto the window
3. **Wait for Processing** — Tasks go through: Audio Extraction → VAD Segmentation → Speech Recognition → (Optional) Translation
4. **Review & Edit** — View and modify recognition results and translations in the subtitle editor
5. **Export Subtitles** — Export as SRT, VTT, or ASS format
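The real exporter lives in `src-tauri/src/subtitle.rs` (Rust); as a sketch of the SRT format produced in step 5, a cue with the `HH:MM:SS,mmm` timestamp layout can be generated like this (illustrative only, not the project's source):

```python
def srt_timestamp(seconds: float) -> str:
    # SRT uses HH:MM:SS,mmm with a comma before milliseconds
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def srt_block(index: int, start: float, end: float, text: str) -> str:
    # One numbered cue: index, time range, then the subtitle text
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```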
## Screenshots
| Main View | Subtitle Editor |
|:---:|:---:|
| ![Main View](readme/screenshot-main.png) | ![Subtitle Editor](readme/screenshot-editor.png) |
## Installation
Download the installer for your platform from [GitHub Releases](https://github.com/AndySkaura/crosssubtitle-ai/releases):
| Platform | Package |
|:---:|:---:|
| macOS (Apple Silicon) | `.dmg` |
| Windows | `.exe` (NSIS Installer) |
> The Whisper model (~500MB) will be downloaded on first launch. An internet connection is required.
## Usage
### Quick Start
1. Open the app and select a mode from the top toolbar:
- **Source** — Speech recognition only, outputs source language subtitles
- **Translate** — Transcribes then translates via an LLM API
2. Click "Add Task" or drag-and-drop files onto the window
3. Wait for processing to complete
4. Review and edit results in the subtitle editor on the right
5. Click "Export" to save subtitles in your preferred format
### Translation Configuration
Before using the translation feature, configure the LLM API:
- Fill in the LLM API Base, API Key, and Model in "Advanced Settings"
- Works with any OpenAI-compatible service, including:
- **GLM (Zhipu AI)** — GLM-4.7-Flash available for free
- **DeepSeek**
- **ChatGPT**
- **Self-hosted** — Ollama, vLLM, etc.
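Any OpenAI-compatible service exposes `POST {API Base}/chat/completions`. The sketch below shows the rough shape of a translation request body; the prompt wording is an assumption for illustration, not the app's actual prompt:

```python
def translation_request(api_base: str, model: str,
                        segments: list[str], target_lang: str) -> tuple[str, dict]:
    # Every OpenAI-compatible server accepts this endpoint and payload shape
    url = f"{api_base.rstrip('/')}/chat/completions"
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Translate each numbered line into {target_lang}."},
            {"role": "user",
             "content": "\n".join(f"{i + 1}. {s}" for i, s in enumerate(segments))},
        ],
    }
    return url, payload
```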
### Advanced Settings
- **Whisper Model Path** — Path to a local ggml model file
- **VAD Model Path** — Path to a local Silero VAD ONNX model file
- **Batch Size** — Number of segments to translate per batch (10-15)
- **Context Size** — Number of preceding segments to include as context for translation (0-5)
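How Batch Size and Context Size interact can be sketched as follows: each batch carries the preceding `context_size` segments as read-only context for the translator (an illustrative sketch, not the project's implementation):

```python
def translation_batches(segments: list[str],
                        batch_size: int = 10, context_size: int = 2) -> list[dict]:
    # Slide over the transcript in fixed-size batches; prepend up to
    # `context_size` earlier segments so the LLM sees surrounding dialogue
    batches = []
    for start in range(0, len(segments), batch_size):
        context = segments[max(0, start - context_size):start]
        batches.append({"context": context,
                        "to_translate": segments[start:start + batch_size]})
    return batches
```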
## Development
### Prerequisites
- [Rust](https://www.rust-lang.org/) toolchain
- [Node.js](https://nodejs.org/) (18+)
- [FFmpeg](https://ffmpeg.org/) (must be available on the command line)
- [CMake](https://cmake.org/) (required for compiling whisper-rs)
### Local Development
```bash
# Clone the repository
git clone https://github.com/AndySkaura/crosssubtitle-ai.git
cd crosssubtitle-ai
# Install frontend dependencies
npm install
# Start development mode
npm run tauri-dev
```
### Build
```bash
# macOS DMG build
npm run tauri-build-dmg
# Windows NSIS build
npm run tauri-build-windows
```
## Tech Stack
| Layer | Technology |
|:---|:---|
| Desktop Framework | [Tauri v2](https://v2.tauri.app/) |
| Frontend | [Vue 3](https://vuejs.org/) + [TypeScript](https://www.typescriptlang.org/) |
| State Management | [Pinia](https://pinia.vuejs.org/) |
| Styling | [Tailwind CSS](https://tailwindcss.com/) |
| Internationalization | [vue-i18n](https://vue-i18n.intlify.dev/) |
| Speech Recognition | [whisper-rs](https://github.com/tazz4843/whisper-rs) (Whisper) |
| Voice Detection | [ort](https://github.com/pykeio/ort) (Silero VAD ONNX) |
| Audio Processing | FFmpeg |
| LLM Translation | OpenAI-compatible API |
## Project Structure
```
src/ Vue frontend
components/ UI components (TaskQueue, SubtitleEditor)
stores/ Pinia state management
locales/ i18n locale files (zh-CN, en)
lib/ Type definitions
src-tauri/ Rust backend
src/
audio.rs Audio extraction & WAV reading
vad.rs Silero VAD voice activity detection
whisper.rs Whisper speech recognition interface
translate.rs OpenAI-compatible translation interface
subtitle.rs SRT / VTT / ASS export
task.rs Task orchestration & event broadcasting
state.rs Application state
```
## License
This project is licensed under the [MIT](./LICENSE) License.
## Acknowledgements
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — High-performance Whisper inference implementation
- [Silero VAD](https://github.com/snakers4/silero-vad) — High-accuracy voice activity detection
- [Tauri](https://tauri.app/) — Lightweight desktop application framework
- All contributors and users
---
<div align="center">
Made by <a href="https://kuraa.cc">kuraa</a>
</div>

193
README.md

@@ -1,61 +1,178 @@
<picture>
<source media="(prefers-color-scheme: dark)" srcset="readme/screenshot-main.png">
<img src="readme/screenshot-main.png" alt="CrossSubtitle-AI 截图" width="100%">
</picture>
<div align="center">
# CrossSubtitle-AI
**AI 驱动的本地优先字幕工作台**
[![GitHub Release](https://img.shields.io/github/v/release/AndySkaura/crosssubtitle-ai?style=flat-square)](https://github.com/AndySkaura/crosssubtitle-ai/releases)
[![GitHub License](https://img.shields.io/github/license/AndySkaura/crosssubtitle-ai?style=flat-square)](https://github.com/AndySkaura/crosssubtitle-ai/blob/main/LICENSE)
[![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Windows-blue?style=flat-square)](#)
[English](./README.en.md) · **简体中文**
</div>
---
## 简介
CrossSubtitle-AI 是一款**本地优先**的音视频字幕处理工具。它利用 [Whisper](https://github.com/ggerganov/whisper.cpp) 进行语音识别,结合 [Silero VAD](https://github.com/snakers4/silero-vad) 进行语音活动检测,并支持接入 OpenAI 兼容接口进行智能翻译,帮助你将音视频文件快速转录并翻译为双语字幕。
整个过程在本地完成语音识别,无需上传音视频文件到任何服务器,保护你的数据隐私。
## 功能特性
- **语音识别** — 基于 Whisper 的高精度语音转文字,支持中文、英文、日文、韩文、法文等 17 种源语言
- **语音活动检测** — Silero VAD 精准切分语音片段,自动过滤静音区域
- **智能翻译** — 接入 OpenAI 兼容接口(如智谱 GLM、DeepSeek、ChatGPT 等),将原文翻译为目标语言
- **音频抽取** — 内置 FFmpeg 自动抽取音频并转换为 16kHz 单声道 WAV
- **多种导出格式** — 支持 SRT、VTT、ASS 三种字幕格式导出
- **双语导出** — 支持原文 + 译文并排显示的双语字幕导出
- **字幕编辑器** — 内置字幕编辑器,支持逐条修改原文和译文
- **拖拽导入** — 支持拖拽文件快速创建任务
- **任务队列** — 批量处理多个音视频文件,实时查看处理进度
- **双语界面** — 内置中文 / 英文界面切换
- **本地优先** — 语音识别完全在本地运行,无需上传数据
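音频抽取的实际逻辑位于 `src-tauri/src/audio.rs`(Rust);下面用 Python 粗略示意等价的 FFmpeg 参数组装方式(函数名为假设,仅作说明,并非项目源码):

```python
def extract_audio_cmd(src: str, dst: str) -> list[str]:
    # -vn 丢弃视频流;-ar/-ac 设定 16 kHz 单声道;
    # pcm_s16le 为标准 16 位 PCM WAV 编码
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]
```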
## 使用流程
1. **选择模式** — 选择「原文」仅做语音识别,或「翻译」模式在识别后自动翻译
2. **添加任务** — 点击「添加任务」按钮或直接拖拽音视频文件到窗口
3. **等待处理** — 任务将依次经历:音频抽取 → VAD 切分 → 语音识别 →(可选)翻译
4. **编辑校对** — 在字幕编辑器中逐条查看、修改识别结果和译文
5. **导出字幕** — 导出为 SRT、VTT 或 ASS 格式
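实际的导出逻辑在 `src-tauri/src/subtitle.rs` 中以 Rust 实现;这里仅用 Python 草图示意第 5 步中 SRT 的 `HH:MM:SS,mmm` 时间戳格式(非项目源码):

```python
def srt_timestamp(seconds: float) -> str:
    # SRT 时间戳格式为 HH:MM:SS,mmm,毫秒前用逗号分隔
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def srt_block(index: int, start: float, end: float, text: str) -> str:
    # 单条字幕块:序号、时间区间、字幕文本
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```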
## 截图
| 主界面 | 字幕编辑器 |
|:---:|:---:|
| ![主界面](readme/screenshot-main.png) | ![字幕编辑器](readme/screenshot-editor.png) |
## 安装
从 [GitHub Releases](https://github.com/AndySkaura/crosssubtitle-ai/releases) 下载对应平台的安装包:
| 平台 | 安装包 |
|:---:|:---:|
| macOS (Apple Silicon) | `.dmg` |
| Windows | `.exe` (NSIS 安装包) |
> 首次启动时需要下载 Whisper 模型(约 500MB),请确保网络通畅。
## 使用方式
### 快速开始
1. 打开应用,在顶部工具栏选择工作模式:
- **原文** — 仅进行语音识别,输出原文字幕
- **翻译** — 识别后调用 LLM 接口翻译为指定语言
2. 点击「添加任务」或拖拽文件到窗口
3. 等待任务处理完成
4. 在右侧字幕编辑器中查看和修改结果
5. 点击「导出」选择格式保存字幕文件
### 翻译模式配置
使用翻译功能前,需要先配置 LLM API:
- 在「高级设置」中填入 LLM API Base、API Key 和 Model
- 支持任何兼容 OpenAI API 的服务,如:
- **智谱 GLM** — 推荐免费使用 GLM-4.7-Flash
- **DeepSeek**
- **ChatGPT**
- **自建服务** — 如 Ollama、vLLM 等
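任何 OpenAI 兼容服务都暴露 `POST {API Base}/chat/completions` 接口;下面的 Python 草图示意请求体的大致形态(提示词内容为假设,并非应用实际使用的提示词):

```python
def translation_request(api_base: str, model: str,
                        segments: list[str], target_lang: str) -> tuple[str, dict]:
    # 所有 OpenAI 兼容服务都接受此端点与请求体结构
    url = f"{api_base.rstrip('/')}/chat/completions"
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Translate each numbered line into {target_lang}."},
            {"role": "user",
             "content": "\n".join(f"{i + 1}. {s}" for i, s in enumerate(segments))},
        ],
    }
    return url, payload
```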
### 高级设置
- **Whisper 模型路径** — 指定本地 ggml 模型文件路径
- **VAD 模型路径** — 指定本地 Silero VAD ONNX 模型路径
- **批大小 (Batch Size)** — 每批翻译的片段数 (10-15)
- **上下文 (Context Size)** — 翻译时参考的上下文片段数 (0-5)
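「批大小」与「上下文」的配合方式可以这样示意:每批携带前面 `context_size` 个片段作为只读上下文,供翻译时参考(仅为示意草图,非项目实际实现):

```python
def translation_batches(segments: list[str],
                        batch_size: int = 10, context_size: int = 2) -> list[dict]:
    # 按固定批大小滑动切分转录结果;每批前附最多 context_size 个
    # 已处理片段,让 LLM 看到前后文
    batches = []
    for start in range(0, len(segments), batch_size):
        context = segments[max(0, start - context_size):start]
        batches.append({"context": context,
                        "to_translate": segments[start:start + batch_size]})
    return batches
```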
## 开发
### 环境要求
- [Rust](https://www.rust-lang.org/) 工具链
- [Node.js](https://nodejs.org/) (18+)
- [FFmpeg](https://ffmpeg.org/)(需在命令行中可用)
- [CMake](https://cmake.org/)(编译 whisper-rs 需要)
### 本地开发
```bash
# 克隆仓库
git clone https://github.com/AndySkaura/crosssubtitle-ai.git
cd crosssubtitle-ai
# 安装前端依赖
npm install
# 启动开发模式
npm run tauri-dev
```
### 构建
```bash
# macOS DMG 构建
npm run tauri-build-dmg
# Windows NSIS 构建
npm run tauri-build-windows
```
## 技术栈
| 层级 | 技术 |
|:---|:---|
| 桌面框架 | [Tauri v2](https://v2.tauri.app/) |
| 前端框架 | [Vue 3](https://vuejs.org/) + [TypeScript](https://www.typescriptlang.org/) |
| 状态管理 | [Pinia](https://pinia.vuejs.org/) |
| 样式 | [Tailwind CSS](https://tailwindcss.com/) |
| 国际化 | [vue-i18n](https://vue-i18n.intlify.dev/) |
| 语音识别 | [whisper-rs](https://github.com/tazz4843/whisper-rs) (Whisper) |
| 语音检测 | [ort](https://github.com/pykeio/ort) (Silero VAD ONNX) |
| 音频处理 | FFmpeg |
| LLM 翻译 | OpenAI-compatible API |
## 项目结构
```
src/ Vue 前端界面
components/ 组件 (TaskQueue, SubtitleEditor)
stores/ Pinia 状态管理
locales/ 国际化文件 (zh-CN, en)
lib/ 类型定义
src-tauri/ Rust 后端
src/
audio.rs 音频抽取与 WAV 读取
vad.rs Silero VAD 语音活动检测
whisper.rs Whisper 语音识别接口
translate.rs OpenAI 兼容翻译接口
subtitle.rs SRT / VTT / ASS 导出
task.rs 任务编排与事件广播
state.rs 应用状态
```
## 开源协议
本项目基于 [MIT](./LICENSE) 协议开源。
## 致谢
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — 高性能 Whisper 推理实现
- [Silero VAD](https://github.com/snakers4/silero-vad) — 高精度语音活动检测
- [Tauri](https://tauri.app/) — 轻量级桌面应用框架
- 所有贡献者和用户
---
<div align="center">
<a href="https://kuraa.cc">kuraa</a> 制作
</div>

Binary file not shown.

Size: 387 KiB

BIN
readme/screenshot-main.png Normal file

Binary file not shown.

Size: 1.1 MiB