
Smart Subtitle Access Tutorial

Last updated: 2025-12-25 15:32:43

Overview

The smart subtitle feature supports processing offline audio files, video files, and live streams. It can extract subtitles in the source language from videos through automatic speech recognition (ASR) or optical character recognition (OCR) and implement multilingual translation. It supports LLM-based text translation of subtitle files. The feature also allows configuration of hotword and term lexicons to improve the accuracy of speech recognition and LLM-based translation.
Feature: ASR-based subtitle generation
Description: Enables ASR-based conversion of dialogs into subtitle files for LLM-based translation. Supports configuration of hotword and term lexicons to improve the accuracy of speech recognition and LLM-based translation. Supports embedding and rendering subtitles into video images.
Supported input types: Audio file, video file, live stream, and real-time audio stream

Feature: OCR-based subtitle generation
Description: Enables OCR-based extraction of characters from images as subtitles for LLM-based translation.
Supported input types: Video file (with hard subtitles on images)

Feature: Subtitle file translation
Description: Supports LLM-based translation of input subtitles into different languages and generation of new subtitles.
Supported input types: Subtitle file (in WebVTT or SRT format)

Key features

Comprehensive Platform Support: Offers processing capabilities for on-demand files, live streams, and RTC streams. Real-time live captioning supports steady-state and gradient modes, with a low barrier to integration and no modifications required on the playback end.
High Accuracy: Utilizes large-scale models, and supports hotwords and glossary databases, achieving industry-leading accuracy.
Rich Language Variety: Supports hundreds of languages, including various dialects. Capable of recognizing mixed-language speech, such as combinations of Chinese and English.
Customizable Styles: Enables embedding open subtitles into videos, with customizable subtitle styles (font, size, color, background, position, etc.).

Demo

1. Access the Experience Center to navigate to the Intelligent Captions experience page. On the right-hand side, select either an offline video file or a live stream, specify the source language and caption type, and then click One-Click Processing.
2. Once the processing is complete, you can view the results.
Note:
The MPS demo provides only basic functionality so that you can experience the general effect. To test the complete capabilities, access the feature through the API.


Scenario 1: Processing Offline Files

Method 1: Initiating a Zero-Code Task from the Console

Initiating a Task Manually

Log in to the Media Processing Service (MPS) console and click Create Task > Create VOD Processing Task.



1. Specify an input file.
You can choose a video file from a Tencent Cloud Object Storage (COS) bucket or provide a video download URL. The current subtitle generation and translation feature does not support using AWS S3 as an input file source.
2. Process the input file.
Select Create Orchestration and insert the "Smart Subtitle" node.

You can choose a preset template or use custom parameters. For a detailed template configuration guide, see Smart Subtitle Template.

3. Specify an output path.
Specify the storage path of the output file.
4. Initiate a task.
Click Create to initiate a task.

Automatically Triggering a Task Through the Orchestration (Optional)

If you want video files uploaded to a COS bucket to be processed with smart subtitles automatically according to preset parameters, you can:
1. Go to On-demand Orchestration in the menu, click Create VOD Orchestration, select the smart subtitle node in the task configuration, and configure parameters such as the trigger bucket and directory.

2. Go to the On-demand Orchestration list, find the new orchestration, and turn on the switch in the Enable column. Afterward, any new video file added to the trigger directory automatically initiates a task according to the orchestration's preset process and parameters, and the processed files are saved to the output path configured in the orchestration.
Note:
It takes 3-5 minutes for the orchestration to take effect after being enabled.


Method 2: API Call

Option 1: Specifying a Template ID

Call the ProcessMedia API and initiate a task by specifying the Template ID. Example:
{
    "InputInfo": {
        "Type": "URL",
        "UrlInputInfo": {
            "Url": "https://test-1234567.cos.ap-guangzhou.myqcloud.com/video/test.mp4" // Replace it with the URL of the video to be processed.
        }
    },
    "SmartSubtitlesTask": {
        "Definition": 122 // 122 is the ID of the preset "Chinese source video: generate Chinese and English subtitles" template; replace it with the ID of a custom smart subtitle template if needed.
    },
    "OutputStorage": {
        "CosOutputStorage": {
            "Bucket": "test-1234567",
            "Region": "ap-guangzhou"
        },
        "Type": "COS"
    },
    "OutputDir": "/output/",
    "Action": "ProcessMedia",
    "Version": "2019-06-12"
}
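A successful call returns a task ID, which you can later pass to DescribeTaskDetail to query the result (see Querying Task Results below). As a rough sketch, the response body looks like the following, with placeholder values:
{
    "Response": {
        "TaskId": "24000022-WorkflowTask-b20a8exxxxxxx1tt110253",
        "RequestId": "6ca31e3a-6b8e-4b4e-9256-fde2384636e1"
    }
}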

Option 2: Specifying an Orchestration ID

Call the ProcessMedia API and initiate a task by specifying the Orchestration ID. Example:
{
    "InputInfo": {
        "Type": "COS",
        "CosInputInfo": {
            "Bucket": "facedetectioncos-125*****11",
            "Region": "ap-guangzhou",
            "Object": "/video/123.mp4"
        }
    },
    "ScheduleId": 12345, // Replace it with your custom orchestration ID. 12345 is a sample value with no practical significance.
    "Action": "ProcessMedia",
    "Version": "2019-06-12"
}
Note:
If a callback address is set, see the ParseNotification document for the format of the callback response packets.

Subtitle Application to Videos (Optional Capability)

Call the ProcessMedia API to initiate a transcoding task, specify the path of the subtitle VTT file, and configure the subtitle rendering style through the SubtitleTemplate field.
Example:
{
    "MediaProcessTask": {
        "TranscodeTaskSet": [
            {
                "Definition": 100040, // Transcoding template ID. Replace it with the transcoding template you need.
                "OverrideParameter": { // Override parameters used to flexibly override some parameters of the transcoding template.
                    "SubtitleTemplate": { // Subtitle application configuration.
                        "Path": "https://test-1234567.cos.ap-nanjing.myqcloud.com/mps_autotest/subtitle/1.vtt",
                        "StreamIndex": 2,
                        "FontType": "simkai.ttf",
                        "FontSize": "10px",
                        "FontColor": "0xFFFFFF",
                        "FontAlpha": 0.9
                    }
                }
            }
        ]
    },
    "InputInfo": { // Input information.
        "Type": "URL",
        "UrlInputInfo": {
            "Url": "https://test-1234567.cos.ap-nanjing.myqcloud.com/mps_autotest/subtitle/123.mkv"
        }
    },
    "OutputStorage": { // Output bucket.
        "Type": "COS",
        "CosOutputStorage": {
            "Bucket": "test-1234567",
            "Region": "ap-nanjing"
        }
    },
    "OutputDir": "/mps_autotest/output2/", // Output path.
    "Action": "ProcessMedia",
    "Version": "2019-06-12"
}

Querying Task Results

Via the Console

1. Navigate to Offline Task Management in the console; the task list displays the tasks that have just been initiated.

2. When the subtask status is "Successful", click View Result to preview the subtitles.

3. The generated VTT subtitle file can be found in Orchestration > COS Bucket > Output Bucket.



Sample Chinese-English subtitles:




Event Notification Callbacks

When initiating a media processing task with ProcessMedia, you can configure event callbacks through the TaskNotifyConfig parameter. When the task is complete, the result is sent to the configured callback address, and you can parse it with ParseNotification.
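For reference, a URL callback can be configured by adding a TaskNotifyConfig block to the ProcessMedia request, in the same form used in the live stream example later in this document (the callback URL below is a placeholder):
"TaskNotifyConfig": {
    "NotifyType": "URL",
    "NotifyUrl": "http://****.qq.com/callback/?token=*****"
}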

Querying Task Results by Calling an API

Call the DescribeTaskDetail API and pass in the task ID (for example, 24000022-WorkflowTask-b20a8exxxxxxx1tt110253 or 24000022-ScheduleTask-774f101xxxxxxx1tt110253) to query the task result. Example:
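A minimal request body, reusing the sample workflow task ID above, might look like the following:
{
    "TaskId": "24000022-WorkflowTask-b20a8exxxxxxx1tt110253",
    "Action": "DescribeTaskDetail",
    "Version": "2019-06-12"
}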


Scenario 2: Live Streams

There are currently two solutions for using subtitles and translations in live streams: enable the subtitle feature through the Cloud Streaming Services (CSS) console, or use MPS to call back the recognized text and embed it into the live stream yourself. Enabling the subtitle feature through the CSS console is recommended. Both solutions are introduced below:

Method 1: Enabling the Subtitle Feature in the CSS Console

1. Configure the live subtitling feature.
1.1 Enable CSS and MPS.
1.2 Log in to the CSS console, create a subtitle template, and bind the transcoding template.
2. Obtain subtitle streams.
When the transcoded stream is pulled, subtitles are displayed. The transcoded stream address is generated by appending an underscore plus the name of the transcoding template bound to the subtitle template (_<transcoding template name>) to the live stream's StreamName; for example, with sample names, if the StreamName is test and the bound transcoding template is named subtitle720p, the transcoded StreamName is test_subtitle720p. For the detailed rules of splicing playback addresses, see Splicing Playback URLs.
Note:
Currently, subtitles can be displayed in two forms: real-time dynamic subtitles and delayed steady-state subtitles. With real-time dynamic subtitles, the subtitle text is corrected word by word as the speech is recognized, so the output changes in real time. With delayed steady-state subtitles, the live broadcast is displayed with a delay according to the configured time and complete sentences are shown, which provides a better viewing experience.

Method 2: Calling Back Text through MPS

Currently, the MPS console does not support initiating smart subtitle tasks for live streams. You can initiate them through the API.
Below are usage examples. For detailed API documentation, see ProcessLiveStream. For the real-time callback package, see ParseLiveStreamProcessNotification.
Note:
Currently, processing live streams with MPS requires an intelligent identification template, which performs automatic speech recognition or speech translation.
{
    "Url": "http://5000-wenzhen.liveplay.myqcloud.com/live/123.flv",
    "AiRecognitionTask": {
        "Definition": 10101 // 10101 is the preset Chinese subtitle template ID, which can be replaced with the ID of a custom intelligent identification template.
    },
    "OutputStorage": {
        "CosOutputStorage": {
            "Bucket": "6c0f30dfvodgzp*****0800-10****53",
            "Region": "ap-guangzhou"
        },
        "Type": "COS"
    },
    "OutputDir": "/6c0f30dfvodgzp*****0800/0d1409d3456551**********652/",
    "TaskNotifyConfig": {
        "NotifyType": "URL",
        "NotifyUrl": "http://****.qq.com/callback/qtatest/?token=*****"
    },
    "Action": "ProcessLiveStream",
    "Version": "2019-06-12"
}

Scenario 3: Processing Private Audio Streams via WebSocket

In scenarios such as video conferencing and two-way (duplex) voice calls, audio can be transmitted to the recognition and translation service over WebSocket, and the results are returned over the same connection. Recognition only, recognition plus translation, and simultaneous recognition and translation of multiple real-time audio streams are supported. Real-time subtitles can be output in steady-state or gradient mode. For details about the protocol, see WebSocket Protocol for Recognition.
Sample code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import struct
import time
import os
import signal
import sys
import hashlib
import hmac
import random
from urllib.parse import urlencode, urlunsplit, quote
import websockets
import asyncio
import logging
import json

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AudioPacket:
    def __init__(self, format=1, is_end=False, timestamp=0, audio_src_id="123456", ext_data=b'', data=b''):
        self.format = format
        self.is_end = is_end
        self.timestamp = timestamp
        self.audio_src_id = audio_src_id
        self.ext_data = ext_data
        self.data = data

    def marshal(self):
        """Serialize audio packet to binary format"""
        header = struct.pack(
            '>BBQH',
            self.format,
            1 if self.is_end else 0,
            self.timestamp,
            len(self.audio_src_id)
        )
        audio_src_bytes = self.audio_src_id.encode('utf-8')
        ext_len = struct.pack('>H', len(self.ext_data))
        return header + audio_src_bytes + ext_len + self.ext_data + self.data

def sha256hex(s):
    """Calculate SHA256 hex digest"""
    if isinstance(s, str):
        s = s.encode('utf-8')
    return hashlib.sha256(s).hexdigest()

def hmacsha256(s, key):
    """Calculate HMAC-SHA256"""
    if isinstance(s, str):
        s = s.encode('utf-8')
    if isinstance(key, str):
        key = key.encode('utf-8')
    return hmac.new(key, s, hashlib.sha256).digest()

def generate_random_number(digits):
    """Generate random number with specified digits"""
    low = 10 ** (digits - 1)
    high = (10 ** digits) - 1
    return random.randint(low, high)

def generate_url_v3(args):
    """Generate WebSocket URL with TC3-HMAC-SHA256 signature"""
    query_params = {}
    if args.dstLang:
        query_params["transSrc"] = args.lang
        query_params["transDst"] = args.dstLang
    else:
        query_params["asrDst"] = args.lang
    query_params["fragmentNotify"] = "1" if args.frame else "0"
    query_params["timeoutSec"] = str(args.timeout)
    timestamp = int(time.time())
    expire_timestamp = timestamp + 3600
    query_params["timeStamp"] = str(timestamp)
    query_params["expired"] = str(expire_timestamp)
    query_params["secretId"] = args.secretId
    query_params["nonce"] = str(generate_random_number(10))
    # Sort keys and build canonical query string
    sorted_keys = sorted(query_params.keys())
    canonical_query = "&".join(
        ["{}={}".format(k, quote(query_params[k], safe=''))
         for k in sorted_keys]
    )
    # Build canonical request
    path = "/wss/v1/{}".format(args.appid)
    http_method = "post"
    canonical_uri = path
    canonical_headers = "content-type:application/json; charset=utf-8\nhost:{}\n".format(args.addr)
    signed_headers = "content-type;host"
    canonical_request = "{}\n{}\n{}\n{}\n{}\n".format(
        http_method,
        canonical_uri,
        canonical_query,
        canonical_headers,
        signed_headers,
    )
    # Build string to sign
    date = time.strftime("%Y-%m-%d", time.gmtime(timestamp))
    credential_scope = "{}/mps/tc3_request".format(date)
    hashed_canonical = sha256hex(canonical_request)
    algorithm = "TC3-HMAC-SHA256"
    string_to_sign = "{}\n{}\n{}\n{}".format(
        algorithm,
        timestamp,
        credential_scope,
        hashed_canonical
    )
    # Calculate signature
    secret_date = hmacsha256(date, "TC3" + args.secretKey)
    secret_service = hmacsha256("mps", secret_date)
    secret_signing = hmacsha256("tc3_request", secret_service)
    signature = hmac.new(
        secret_signing,
        string_to_sign.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()
    # Add signature to query params
    query_params["signature"] = signature
    # Build final URL
    scheme = "wss" if args.ssl else "ws"
    url = urlunsplit((
        scheme,
        args.addr,
        path,
        urlencode(query_params),
        ""
    ))
    return url

async def receive_messages(websocket, stop_event):
    """Handle incoming WebSocket messages"""
    try:
        while not stop_event.is_set():
            message = await websocket.recv()
            if isinstance(message, bytes):
                try:
                    message = message.decode('utf-8')
                except UnicodeDecodeError:
                    message = str(message)
            logger.info("Received: %s", message)
    except Exception as e:
        logger.info("Connection closed: %s", e)

async def run_client():
    parser = argparse.ArgumentParser()
    parser.add_argument("--addr", default="mps.cloud.tencent.com", help="websocket service address")
    parser.add_argument("--file", default="./wx_voice.pcm", help="pcm file path")
    parser.add_argument("--appid", default="121313131", help="app id")
    parser.add_argument("--lang", default="zh", help="language")
    parser.add_argument("--dstLang", default="", help="destination language")
    parser.add_argument("--frame", action="store_true", help="enable frame notify")
    parser.add_argument("--secretId", default="123456", help="secret id")
    parser.add_argument("--secretKey", default="123456", help="secret key")
    parser.add_argument("--ssl", action="store_true", help="use SSL")
    parser.add_argument("--timeout", type=int, default=10, help="timeout seconds")
    parser.add_argument("--wait", type=int, default=700, help="wait seconds after end")
    args = parser.parse_args()

    url = generate_url_v3(args)
    logger.info("Connecting to %s", url)

    try:
        # Python 3.6 compatible websockets connection
        websocket = await websockets.connect(url, ping_timeout=5)

        # Handle initial response
        initial_msg = await websocket.recv()
        try:
            result = json.loads(initial_msg)
            if result.get("Code", 0) != 0:
                logger.error("Handshake failed: %s", result.get("Message", ""))
                return
            logger.info("TaskId %s handshake success", result.get("TaskId", ""))
        except ValueError:  # json.JSONDecodeError not available in 3.6
            logger.error("Invalid initial message")
            return

        # Setup signal handler
        loop = asyncio.get_event_loop()
        stop_event = asyncio.Event()
        loop.add_signal_handler(signal.SIGINT, stop_event.set)

        # Start receiver
        receiver_task = asyncio.ensure_future(receive_messages(websocket, stop_event))

        # Audio processing
        try:
            with open(args.file, "rb") as fd:
                PCM_DUR_MS = 40
                pcm = bytearray(PCM_DUR_MS * 32)
                pkt = AudioPacket(data=pcm)
                is_end = False
                wait_until = 0

                while not stop_event.is_set():
                    if is_end:
                        if time.time() > wait_until:
                            logger.info("Finish")
                            break
                        await asyncio.sleep(0.1)
                        continue

                    # Read PCM data
                    n = fd.readinto(pkt.data)
                    if n < len(pkt.data):
                        pkt.is_end = True
                        is_end = True
                        wait_until = time.time() + args.wait

                    # Send audio packet
                    await websocket.send(pkt.marshal())
                    logger.info("Sent ts %d", pkt.timestamp)
                    pkt.timestamp += n // 32

                    await asyncio.sleep(PCM_DUR_MS / 1000)

        except IOError:  # FileNotFoundError not available in 3.6
            logger.error("Open file error: %s", args.file)
            return

        # Cleanup
        await asyncio.wait_for(receiver_task, timeout=1)
        await websocket.close()

    except Exception as e:
        logger.error("Connection error: %s", e)
        return

if __name__ == "__main__":
    # Python 3.6 compatible asyncio runner
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(run_client())
    finally:
        loop.close()
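If the script above is saved locally (for example as ws_client.py, a hypothetical file name), it could be started with a command along the lines of: python3 ws_client.py --appid <your appid> --secretId <your SecretId> --secretKey <your SecretKey> --file ./test.pcm --lang zh --dstLang en --ssl. The credential and file values are placeholders, and --dstLang can be omitted if only recognition (without translation) is needed. The recognition and translation results are printed by the message receiver as they arrive.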



FAQs

What Languages Are Supported by Smart Subtitle?

Source and Target Languages Supported by the Processing Type of ASR-based Subtitle Generation
Source and Target Languages Supported by the Processing Type of Subtitle File Translation

Smart Subtitle Billing

Postpaid billing is supported. See Billing Overview for details.