Caption Generator
Generate captions, subtitles, and transcripts for video content through the 1min.AI API. The feature uses speech recognition models to convert the spoken audio in a video into timestamped text, with support for multiple languages and output formats.
Key Features
- High Accuracy: Advanced speech-to-text models for precise transcription
- Multi-language Support: Supports 99+ languages for global content
- Format Flexibility: Generate captions in various subtitle formats
- Timestamp Precision: Accurate timing information for perfect synchronization
- Video Processing: Direct video upload and processing capabilities
- Speaker Detection: Identify different speakers in multi-person content
Supported Models
The Caption Generator API primarily uses specialized speech recognition models:
OpenAI Models:
- whisper-1: Whisper v1 (primary model for caption generation)
Note: This feature uses specialized speech recognition models optimized for audio transcription rather than general-purpose language models.
API Reference
Request Headers
Field | Value |
---|---|
API-KEY | <api-key> |
Content-Type | application/json |
Request Parameters
Field Name | Type | Description | Required |
---|---|---|---|
type | string | Must be "CAPTIONS_GENERATOR" | ✔️ |
model | string | Speech recognition model identifier | ✔️ |
conversationId | string | Must be "CAPTIONS_GENERATOR" | ✔️ |
promptObject.videoUrl | string | Video file URL or asset key | ✔️ |
promptObject.response_format | string | Response format ("verbose_json" is default) | ❌ |
promptObject.timestamp_granularities | array | Granularity levels (["word", "segment"]) | ❌ |
promptObject.language | string | Language code for transcription | ❌ |
Example Request
Endpoint: POST https://api.1min.ai/api/features
cURL command:
curl -X POST "https://api.1min.ai/api/features" \
  -H "API-KEY: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "CAPTIONS_GENERATOR",
    "model": "whisper-1",
    "conversationId": "CAPTIONS_GENERATOR",
    "promptObject": {
      "videoUrl": "your-video-asset-key",
      "language": "en",
      "response_format": "verbose_json",
      "timestamp_granularities": [
        "word",
        "segment"
      ]
    }
  }'
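The same request can be sketched in Python using only the standard library. The function names `build_caption_request` and `send_caption_request` are illustrative and not part of the API; the field names, header names, and endpoint are taken from the table and cURL command above. The shape of the response envelope is not documented here, so the sketch simply returns the decoded JSON.

```python
import json
import urllib.request

API_URL = "https://api.1min.ai/api/features"


def build_caption_request(video_url, language=None, model="whisper-1",
                          response_format="verbose_json",
                          timestamp_granularities=("word", "segment")):
    """Build the JSON body for a CAPTIONS_GENERATOR request."""
    prompt = {
        "videoUrl": video_url,
        "response_format": response_format,
        "timestamp_granularities": list(timestamp_granularities),
    }
    # language is optional, so it is only sent when the caller provides it
    if language is not None:
        prompt["language"] = language
    return {
        "type": "CAPTIONS_GENERATOR",
        "model": model,
        "conversationId": "CAPTIONS_GENERATOR",
        "promptObject": prompt,
    }


def send_caption_request(api_key, payload, timeout=120):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `send_caption_request(api_key, build_caption_request("your-video-asset-key", language="en"))`. Keeping payload construction separate from the network call makes the body easy to inspect or log before sending.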
Response Format
The API returns a JSON response containing the generated captions in the format requested through promptObject.response_format.
Output Formats
SRT Format
1
00:00:00,000 --> 00:00:03,240
Welcome to this video tutorial on AI technology.

2
00:00:03,240 --> 00:00:07,080
Today we'll explore the latest developments in machine learning.
WebVTT Format
WEBVTT

00:00:00.000 --> 00:00:03.240
Welcome to this video tutorial on AI technology.

00:00:03.240 --> 00:00:07.080
Today we'll explore the latest developments in machine learning.
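The two text formats differ mainly in the header line and the millisecond separator, so one can be derived from the other on the client side. The following is a minimal sketch of that conversion; `srt_to_webvtt` is a hypothetical helper and not something the API provides.

```python
import re

# Matches an SRT timestamp such as 00:00:03,240
_SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert SRT caption text to WebVTT text."""
    # WebVTT uses a dot before the milliseconds where SRT uses a comma
    body = _SRT_TIMESTAMP.sub(r"\1.\2", srt_text.strip())
    # A WebVTT file starts with the WEBVTT header followed by a blank line
    return "WEBVTT\n\n" + body + "\n"
```

The numeric cue counters from the SRT input are left in place, since WebVTT permits an optional identifier line before each cue's timing line. The regular expression would also rewrite a timestamp-shaped string that happened to appear inside caption text, which is an accepted limitation of this sketch.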
Verbose JSON Format (Default)
{
  "task": "transcribe",
  "language": "english",
  "duration": 7.08,
  "text": "Welcome to this video tutorial on AI technology. Today we'll explore the latest developments in machine learning.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.24,
      "text": "Welcome to this video tutorial on AI technology.",
      "words": [
        {
          "word": "Welcome",
          "start": 0.0,
          "end": 0.4
        }
      ]
    }
  ]
}
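Because each segment carries start and end times in seconds, a verbose JSON transcript holds everything needed to build subtitle cues locally. The sketch below turns the "segments" list into SRT text. It assumes the input is the verbose JSON object itself, shaped like the sample above; how that object is nested inside the full API response is not specified here, and the helper names are illustrative.

```python
def format_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds to avoid floating-point drift
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(transcript):
    """Build SRT text from the "segments" list of a verbose JSON transcript."""
    cues = []
    for index, segment in enumerate(transcript["segments"], start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}")
    # SRT separates cues with a blank line
    return "\n\n".join(cues) + "\n"
```

Cue numbers are generated from list position rather than the segment "id" field, because SRT counters start at 1 while the sample ids start at 0.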
Supported Languages
The Caption Generator supports 99+ languages including:
Major Languages:
- English, Spanish, French, German, Italian, Portuguese
- Russian, Chinese (Simplified/Traditional), Japanese, Korean
- Arabic, Hindi, Dutch, Polish, Turkish, Swedish
- Norwegian, Danish, Finnish, Hebrew, Thai, Vietnamese
Regional Variants:
- Portuguese (Brazil), Spanish (Latin America)
- Chinese (Simplified/Traditional)
- English (US/UK/AU variants)
Use Cases
- Content Creation: Add captions to YouTube videos, social media content
- Accessibility: Make video content accessible to deaf and hard-of-hearing audiences
- Language Learning: Generate transcripts for educational content
- Documentation: Convert meeting recordings to text transcripts
- SEO Optimization: Create searchable text content from video
- Compliance: Meet accessibility requirements for corporate content
Best Practices
- Audio Quality: Ensure clear audio with minimal background noise
- Language Selection: Specify the language for better accuracy when known
- File Formats: Use common video formats (MP4, MOV, AVI) for best results
- File Size: Keep video files under the 25MB maximum for optimal processing speed
- Speaker Clarity: For multi-speaker detection, ensure speakers are distinct
Technical Requirements
- Supported Video Formats: MP4, MOV, AVI, MKV, WebM
- Supported Audio Formats: MP3, WAV, M4A, FLAC
- Maximum File Size: 25MB per file
- Maximum Duration: 30 minutes per video
- Audio Quality: Minimum 16kHz sample rate recommended
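These limits can be checked on the client before a file is uploaded, which avoids a round trip for inputs that would be rejected. The following is a minimal sketch; `check_media` is a hypothetical helper, and reading "25MB" as 25 * 1024 * 1024 bytes is an assumption, since the exact byte count is not stated.

```python
import os

# Formats and limits taken from the requirements list above
SUPPORTED_EXTENSIONS = {
    ".mp4", ".mov", ".avi", ".mkv", ".webm",  # video
    ".mp3", ".wav", ".m4a", ".flac",          # audio
}
# Assumption: "25MB" is read as 25 * 1024 * 1024 bytes
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024
MAX_DURATION_SECONDS = 30 * 60


def check_media(filename, size_bytes, duration_seconds=None):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 25MB")
    # Duration is only checked when the caller has measured it
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 30 minutes")
    return problems
```

The check looks only at the file extension, not the container contents, so a mislabeled or corrupted file would still pass it and be caught by the API instead.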
Error Handling
Common error scenarios:
- Unsupported file formats or corrupted files
- Files exceeding size or duration limits
- Poor audio quality resulting in low confidence scores
- Network timeouts for large file uploads
- Language detection failures for unclear speech
For detailed error codes and troubleshooting, refer to the main API documentation.
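Of the scenarios listed, network timeouts are the one most likely to succeed on a second attempt, so a retry wrapper with exponential backoff is a common client-side pattern. The sketch below is generic and makes no assumption about the API's error codes; `call_with_retries` and its parameters are illustrative, and the `retry_on` tuple should be adjusted to the exceptions raised by whichever HTTP client is in use.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Re-raise once the last attempt has failed
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts
            sleep(base_delay * (2 ** attempt))
```

Errors such as an unsupported format or an oversized file are not transient, so they are deliberately left out of `retry_on` and propagate on the first failure.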