Caption Generator

Generate captions, subtitles, and transcripts for video content through the 1min.AI API. The feature uses speech recognition models to convert the spoken audio in a video into timestamped text, and it supports multiple languages and output formats.

Key Features

  • High Accuracy: Advanced speech-to-text models for precise transcription
  • Multi-language Support: Supports 99+ languages for global content
  • Format Flexibility: Generate captions in various subtitle formats
  • Timestamp Precision: Accurate timing information for perfect synchronization
  • Video Processing: Direct video upload and processing capabilities
  • Speaker Detection: Identify different speakers in multi-person content

Supported Models

The Caption Generator API primarily uses specialized speech recognition models:

OpenAI Models:

  • whisper-1 - Whisper v1 (Primary model for caption generation)

Note: This feature uses specialized speech recognition models optimized for audio transcription rather than general-purpose language models.

API Reference

Request Headers

Field         Value
API-KEY       <api-key>
Content-Type  application/json

Request Parameters

Field Name                            Type    Description                                  Required
type                                  string  Must be "CAPTIONS_GENERATOR"                 Yes
model                                 string  Speech recognition model identifier          Yes
conversationId                        string  Must be "CAPTIONS_GENERATOR"                 Yes
promptObject.videoUrl                 string  Video file URL or asset key                  Yes
promptObject.response_format          string  Response format ("verbose_json" is default)  No
promptObject.timestamp_granularities  array   Granularity levels (["word", "segment"])     No
promptObject.language                 string  Language code for transcription              No

Example Request

Endpoint: POST https://api.1min.ai/api/features

cURL command:

curl -X POST "https://api.1min.ai/api/features" \
  -H "API-KEY: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "CAPTIONS_GENERATOR",
    "model": "whisper-1",
    "conversationId": "CAPTIONS_GENERATOR",
    "promptObject": {
      "videoUrl": "your-video-asset-key",
      "language": "en",
      "response_format": "verbose_json",
      "timestamp_granularities": [
        "word",
        "segment"
      ]
    }
  }'
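
The same request can be sketched in Python using only the standard library. The helper names `build_caption_payload` and `send_caption_request` and the 60-second timeout are illustrative assumptions, not part of the 1min.AI API. The field names, header names, and endpoint are taken from the request above. The response envelope is not specified here, so the sketch returns the decoded JSON unchanged.

```python
import json
import urllib.request

API_URL = "https://api.1min.ai/api/features"  # endpoint from the cURL example


def build_caption_payload(video_url, language=None,
                          response_format="verbose_json",
                          timestamp_granularities=None):
    """Assemble the request body using the documented field names."""
    prompt = {"videoUrl": video_url, "response_format": response_format}
    # Optional fields are only sent when the caller supplies them.
    if language is not None:
        prompt["language"] = language
    if timestamp_granularities is not None:
        prompt["timestamp_granularities"] = list(timestamp_granularities)
    return {
        "type": "CAPTIONS_GENERATOR",
        "model": "whisper-1",
        "conversationId": "CAPTIONS_GENERATOR",
        "promptObject": prompt,
    }


def send_caption_request(api_key, payload, timeout=60):
    """POST the payload and return the decoded JSON body as-is.

    The timeout value is an assumption; tune it for your files.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `send_caption_request("<your-api-key>", build_caption_payload("your-video-asset-key", language="en", timestamp_granularities=["word", "segment"]))` sends the same body as the cURL command.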

Response Format

The API returns a JSON response with the generated captions in the requested format:

{}

Output Formats

SRT Format

1
00:00:00,000 --> 00:00:03,240
Welcome to this video tutorial on AI technology.

2
00:00:03,240 --> 00:00:07,080
Today we'll explore the latest developments in machine learning.

WebVTT Format

WEBVTT

00:00:00.000 --> 00:00:03.240
Welcome to this video tutorial on AI technology.

00:00:03.240 --> 00:00:07.080
Today we'll explore the latest developments in machine learning.

Verbose JSON Format (Default)

{
  "task": "transcribe",
  "language": "english",
  "duration": 7.08,
  "text": "Welcome to this video tutorial on AI technology. Today we'll explore the latest developments in machine learning.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.24,
      "text": "Welcome to this video tutorial on AI technology.",
      "words": [
        {
          "word": "Welcome",
          "start": 0.0,
          "end": 0.4
        }
      ]
    }
  ]
}
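
If you request verbose JSON but need a subtitle file, the segment timings can be rendered client-side. The following is a minimal sketch, assuming the transcript dictionary has the `segments`, `start`, `end`, and `text` fields shown above. The function names are hypothetical and not part of the API. The only difference between the two timestamp styles is the decimal mark: SRT uses a comma and WebVTT uses a period.

```python
def format_timestamp(seconds, decimal_mark):
    """Render a time in seconds as HH:MM:SS<mark>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{millis:03d}"


def segments_to_srt(transcript):
    """Build SRT text: numbered cues with comma-separated milliseconds."""
    cues = []
    for index, segment in enumerate(transcript["segments"], start=1):
        start = format_timestamp(segment["start"], ",")
        end = format_timestamp(segment["end"], ",")
        cues.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}")
    return "\n\n".join(cues) + "\n"


def segments_to_webvtt(transcript):
    """Build WebVTT text: a WEBVTT header, then unnumbered cues."""
    blocks = ["WEBVTT"]
    for segment in transcript["segments"]:
        start = format_timestamp(segment["start"], ".")
        end = format_timestamp(segment["end"], ".")
        blocks.append(f"{start} --> {end}\n{segment['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids floating-point drift in the rendered timestamps.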

Supported Languages

The Caption Generator supports 99+ languages including:

Major Languages:

  • English, Spanish, French, German, Italian, Portuguese
  • Russian, Chinese (Simplified/Traditional), Japanese, Korean
  • Arabic, Hindi, Dutch, Polish, Turkish, Swedish
  • Norwegian, Danish, Finnish, Hebrew, Thai, Vietnamese

Regional Variants:

  • Portuguese (Brazil), Spanish (Latin America)
  • Chinese (Simplified/Traditional)
  • English (US/UK/AU variants)

Use Cases

  • Content Creation: Add captions to YouTube videos, social media content
  • Accessibility: Make video content accessible to deaf and hard-of-hearing audiences
  • Language Learning: Generate transcripts for educational content
  • Documentation: Convert meeting recordings to text transcripts
  • SEO Optimization: Create searchable text content from video
  • Compliance: Meet accessibility requirements for corporate content

Best Practices

  1. Audio Quality: Ensure clear audio with minimal background noise
  2. Language Selection: Specify the language for better accuracy when known
  3. File Formats: Use common video formats (MP4, MOV, AVI) for best results
  4. File Size: Keep video files under the 25MB maximum (see Technical Requirements)
  5. Speaker Clarity: For multi-speaker detection, ensure speakers are distinct

Technical Requirements

  • Supported Video Formats: MP4, MOV, AVI, MKV, WebM
  • Supported Audio Formats: MP3, WAV, M4A, FLAC
  • Maximum File Size: 25MB per file
  • Maximum Duration: 30 minutes per video
  • Audio Quality: Minimum 16kHz sample rate recommended
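
Checking these limits before sending a request can save a round trip. The sketch below is a hypothetical client-side pre-flight check, not something the API provides. It assumes "25MB" means 25 × 1024 × 1024 bytes and that the caller already knows the media duration, since measuring duration requires a media library.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25MB" read as 25 MiB
MAX_SECONDS = 30 * 60         # 30 minutes


def find_limit_violations(filename, size_bytes, duration_seconds=None):
    """Return a list of human-readable problems; empty means no violation found."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in VIDEO_EXTENSIONS | AUDIO_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    # Duration is only checked when the caller supplies it.
    if duration_seconds is not None and duration_seconds > MAX_SECONDS:
        problems.append(f"duration is {duration_seconds}s, limit is {MAX_SECONDS}s")
    return problems
```

For a local file, `Path(filename).stat().st_size` supplies `size_bytes`.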

Error Handling

Common error scenarios:

  • Unsupported file formats or corrupted files
  • Files exceeding size or duration limits
  • Poor audio quality resulting in low confidence scores
  • Network timeouts for large file uploads
  • Language detection failures for unclear speech

For detailed error codes and troubleshooting, refer to the main API documentation.
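
Of the scenarios above, network timeouts are the one a client can reasonably retry. The following is a minimal sketch of retrying with exponential backoff. It is a general client-side pattern, not documented 1min.AI behavior, and the function name, attempt count, and delays are assumptions. HTTP error responses are re-raised immediately, because resending an unsupported or oversized file would fail the same way.

```python
import time
import urllib.error


def call_with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on timeouts and connection failures.

    Delays double after each failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except urllib.error.HTTPError:
            # The server answered with an error status; do not retry.
            raise
        except (TimeoutError, urllib.error.URLError):
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Any zero-argument callable that performs the request can be passed as `operation`, for example a `lambda` wrapping your HTTP call.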