Experimental Builds
AI Video Pipeline: Autonomous Rendering.
Text-to-video autonomous rendering system using AI voice generation, stock footage orchestration, and Docker-based media processing.
Engineering process.
How it was built.
Working prototype (single format)
- Built text → voice → video prototype
- Tested ffmpeg rendering capabilities
- Validated timing synchronization logic
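The timing-synchronization idea can be sketched in a few lines: each narration segment starts at the cumulative duration of the segments before it. A minimal illustration; the function name and data shape are assumptions, not the pipeline's actual code.

```python
# Sketch of timing synchronization: map per-segment audio durations to
# absolute start offsets on the video timeline. Illustrative only.

def segment_offsets(durations_s):
    """Return the start time of each segment, given segment durations in seconds."""
    offsets, t = [], 0.0
    for d in durations_s:
        offsets.append(round(t, 3))
        t += d
    return offsets

# Three narration segments of 4.2s, 3.1s, and 5.0s
print(segment_offsets([4.2, 3.1, 5.0]))  # [0.0, 4.2, 7.3]
```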
Voice + caption pipeline
- Integrated Google TTS with multiple voice options
- Built caption generation with word-level timing
- Implemented subtitle overlay with custom styling
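Word-level caption generation can be sketched as grouping timed words into short cues and emitting SRT. The `(word, start, end)` tuple shape is an assumption about what the TTS step returns, and the grouping size is illustrative.

```python
# Illustrative word-level caption generation: group timed words into short
# cues and format them as SRT for the subtitle overlay step.

def srt_time(t):
    """Format seconds as an SRT timestamp, e.g. 61.25 -> 00:01:01,250."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int(round((s % 1) * 1000)):03d}"

def words_to_srt(words, max_per_cue=3):
    """Emit SRT cues from (word, start_s, end_s) tuples, a few words per cue."""
    cues = []
    for i in range(0, len(words), max_per_cue):
        group = words[i:i + max_per_cue]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w for w, _, _ in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)

words = [("Automated", 0.0, 0.6), ("video", 0.6, 1.0), ("pipelines", 1.0, 1.7),
         ("save", 1.8, 2.1), ("hours", 2.1, 2.6)]
print(words_to_srt(words))
```

The resulting `.srt` file is what a subtitle filter can then burn into the video with custom styling.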
Automated footage sourcing
- Integrated Pexels API with keyword extraction
- Built video clip selection logic (relevance scoring)
- Implemented clip duration matching to audio segments
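Relevance scoring plus duration matching can be combined into a single score, as in this minimal sketch. The clip fields and weightings are assumptions for illustration, not the Pexels schema or the project's actual formula.

```python
# Sketch of clip selection: score candidates by keyword overlap and by how
# closely clip duration matches the target audio segment. Weights are illustrative.

def score_clip(clip, keywords, target_duration):
    tags = set(clip["tags"])
    overlap = len(tags & set(keywords)) / max(len(keywords), 1)
    # 1.0 when durations match exactly, 0.0 when off by >= the target duration
    duration_fit = 1.0 - min(abs(clip["duration"] - target_duration) / target_duration, 1.0)
    return 0.7 * overlap + 0.3 * duration_fit

candidates = [
    {"id": 1, "tags": ["city", "night", "traffic"], "duration": 12},
    {"id": 2, "tags": ["city", "skyline"], "duration": 6},
]
best = max(candidates, key=lambda c: score_clip(c, ["city", "skyline"], 6.0))
print(best["id"])  # 2: full keyword overlap and exact duration fit
```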
Production rendering pipeline
- Dockerized ffmpeg rendering environment
- Built queue system for batch processing
- Implemented error handling and retry logic
- Optimized rendering with hardware acceleration
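The retry logic can be sketched as exponential backoff around a render job. `flaky_render` is a stand-in for the real ffmpeg invocation, and the backoff values are illustrative assumptions.

```python
# Sketch of retry logic with exponential backoff around a render job.
import time

def run_with_retries(job, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `job`, retrying on RuntimeError with 1s, 2s, ... backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except RuntimeError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Demo: a job that fails once, then succeeds; sleep stubbed out for the demo.
calls = {"n": 0}
def flaky_render():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient ffmpeg failure")
    return "output.mp4"

print(run_with_retries(flaky_render, sleep=lambda s: None))  # output.mp4
```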
Multi-platform export system
- Created platform-specific encoding profiles
- Built aspect ratio conversion (16:9 → 9:16 for vertical)
- Added file size optimization
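The 16:9 → 9:16 conversion can be expressed as an ffmpeg center crop plus scale. This sketch only builds the argument list; the encoder settings (preset, CRF) are assumptions, not the project's actual encoding profiles.

```python
# Illustrative ffmpeg command for 16:9 -> 9:16 vertical conversion:
# center-crop to a 9:16 window of the input height, then scale to target size.

def vertical_ffmpeg_args(src, dst, height=1920):
    width = height * 9 // 16  # 1080 for a 1920-tall vertical frame
    vf = f"crop=ih*9/16:ih,scale={width}:{height}"  # crop is centered by default
    return ["ffmpeg", "-i", src, "-vf", vf,
            "-c:v", "libx264", "-preset", "fast", "-crf", "20",
            "-c:a", "copy", dst]

args = vertical_ffmpeg_args("in_16x9.mp4", "out_9x16.mp4")
print(" ".join(args))
```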
Engineering challenges.
What broke. How we fixed it.
Frame-Accurate Caption Timing
Problem: Captions appeared 0.5–2 seconds off from audio; manual adjustment required per video.
Why: ffmpeg subtitle filters use fixed timestamps, but audio waveforms contain variable silence, so word boundaries don't align with fixed intervals.
Fix: Custom timing engine analyzing TTS output for word boundaries; silence detection to adjust caption placement; 100ms buffer zones before/after transitions.
Result: Timing accuracy 99.8% (within 100ms); manual adjustment eliminated.
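The 100ms buffer-zone idea can be sketched as padding each cue on both sides while clamping so adjacent cues never overlap. The `(start, end)` cue shape is an assumption; the real timing engine also uses silence detection, which is omitted here.

```python
# Sketch of the buffer-zone adjustment: pad each caption cue by 100 ms,
# clamped so cues never overlap their neighbours.

BUFFER = 0.100  # seconds

def apply_buffers(cues):
    """cues: list of (start_s, end_s); returns padded, non-overlapping cues."""
    adjusted = []
    for i, (start, end) in enumerate(cues):
        new_start = max(0.0, start - BUFFER)
        if adjusted:
            new_start = max(new_start, adjusted[-1][1])  # don't overlap previous cue
        new_end = end + BUFFER
        if i + 1 < len(cues):
            new_end = min(new_end, cues[i + 1][0])  # don't cross into next cue
        adjusted.append((round(new_start, 3), round(new_end, 3)))
    return adjusted

print(apply_buffers([(0.2, 1.0), (1.05, 2.0)]))  # [(0.1, 1.05), (1.05, 2.1)]
```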
Video Clip Relevance & Quality
Problem: Pexels API returned generic results for specific keywords; clip quality varied widely.
Constraints: Free tier capped at 200 requests/hour; manual curation impossible at volume; no built-in quality filtering.
Fix: Two-pass search (specific keywords first, broader terms as fallback); quality scoring on resolution, duration, and view count; caching layer to reuse high-quality clips.
Result: Clip relevance 60% → 85%; only 1080p+ clips used; API requests reduced 40% through caching.
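The two-pass search with caching can be sketched as follows. `fake_search` stands in for the real Pexels API call; the field names, cache backend, and 1080p quality floor threshold check are simplifying assumptions.

```python
# Sketch of two-pass footage search with a cache: try specific keywords
# first, fall back to broader terms, filter to 1080p+, cache the result.

CACHE = {}

def find_clips(query, broad_query, search):
    if query in CACHE:
        return CACHE[query]  # reuse a prior result instead of spending a request
    results = search(query) or search(broad_query)  # pass 2 only if pass 1 is empty
    good = [c for c in results if c["height"] >= 1080]  # quality floor: 1080p+
    CACHE[query] = good
    return good

def fake_search(q):
    """Stub for the Pexels search call; returns canned results."""
    catalog = {"server racks": [], "technology": [{"id": 7, "height": 2160}]}
    return catalog.get(q, [])

print(find_clips("server racks", "technology", fake_search))
```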
Docker Resource Management
Problem: Concurrent renders crashed the server; a single render consumed 4–6GB RAM on an 8GB server.
Constraints: Could not upgrade the server mid-project; fast turnaround needed.
Fix: Single-threaded rendering queue; container resource limits (6GB RAM, 2 CPUs); batch jobs overnight, real-time renders during the day.
Result: Server stability 100%; render time <60 seconds; zero crashes in production.
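The container resource caps map directly onto Docker's `--memory` and `--cpus` flags. This sketch only constructs the argument list; the image name and environment variable are placeholder assumptions.

```python
# Illustrative docker invocation with the resource caps described above
# (6 GB RAM, 2 CPUs), so one render can never exhaust the 8 GB host.

def render_container_args(job_id, image="render-worker:latest"):
    return ["docker", "run", "--rm",
            "--memory", "6g", "--cpus", "2",
            "-e", f"JOB_ID={job_id}",
            image]

print(" ".join(render_container_args("job-42")))
```

A single-threaded queue then feeds jobs to this container one at a time, which is what kept the host stable without a hardware upgrade.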
Measured impact.
Results. Numbers only.
Production efficiency
- Production time: 2–4 hours → <60 seconds (99% faster)
- Throughput: 2–3 videos/day → no fixed ceiling (bounded only by the queue)
- Daily capacity: API-rate-limited, not compute-limited
- Quality consistency: variable → standardized 1080p @ 30fps
Technical performance
- Rendering success rate: 98%
- Caption sync accuracy: 99.8%
- Output quality: 1080p @ 30fps
- Processing queue: real-time under normal load
Infrastructure efficiency
- Infrastructure footprint: single VPS node
- Processing overhead: minimal (serverless-comparable per video)
- Maintenance overhead: fully automated, zero manual intervention
- Capacity vs. manual: 100x throughput on the same infrastructure
Related.
Related systems.
Get started
Need similar architecture?
We build systems for operators serious about scale. If you're ready to invest in infrastructure that compounds, let's design your system.