OJCLabs

Experimental Builds

AI Video Pipeline: Autonomous Rendering.

Text-to-video autonomous rendering system using AI voice generation, stock footage orchestration, and Docker-based media processing.

Client: Internal R&D / Content Creators
Timeline: 6 weeks
Team: 1 engineer
Production time: <60 sec/video
Automation coverage: 100%
Media sync accuracy: 99.8%
Render success rate: 98%


System architecture.

How it's built.

| Component | Purpose | Technology | Reasoning |
| --- | --- | --- | --- |
| Input Processing | Accept text input, parse into segments | n8n webhook trigger + text preprocessing | Event-driven architecture, flexible input sources |
| Voice Generation | Convert text to natural-sounding speech | Google Text-to-Speech (Chirp) | High quality, fast generation, reliable API |
| Visual Sourcing | Find relevant video footage based on text content | Pexels API + keyword extraction | Large stock library, free tier, API reliability |
| Synchronization Engine | Match video clips to audio timing | Custom Python logic + ffmpeg | Frame-accurate control, complex timing logic |
| Caption Rendering | Generate and overlay captions synchronized to audio | Custom subtitle engine + ffmpeg overlays | Precise timing control, customizable styling |
| Rendering Pipeline | Stitch audio, video, captions into final output | Docker + ffmpeg | Isolated environment, reproducible builds, resource control |
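The final row of the table is where everything converges. As a minimal sketch (file names and the helper function are illustrative, not the production code), the stitch step can be expressed as an ffmpeg invocation that muxes the voiceover, the concatenated stock footage, and the burned-in captions:

```python
# Hypothetical sketch of the final stitch step: combine the voiceover track,
# concatenated stock clips, and a subtitle file into one output via ffmpeg.

def build_stitch_command(video_in: str, audio_in: str, subs: str, out: str) -> list[str]:
    """Build an ffmpeg argv list that muxes video + audio and burns subtitles."""
    return [
        "ffmpeg", "-y",
        "-i", video_in,                     # concatenated stock footage
        "-i", audio_in,                     # generated voiceover
        "-vf", f"subtitles={subs}",         # burn captions into the video stream
        "-map", "0:v:0", "-map", "1:a:0",   # video from input 0, audio from input 1
        "-c:v", "libx264", "-c:a", "aac",
        "-shortest",                        # stop at the shorter of the two streams
        out,
    ]

cmd = build_stitch_command("clips.mp4", "voice.mp3", "captions.srt", "final.mp4")
print(" ".join(cmd))
```

Building the argv as a list (rather than a shell string) keeps file paths safe to pass to `subprocess.run` without quoting issues.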

Engineering process.

How it was built.

Proof of Concept (Week 1–2)
  • Built text → voice → video prototype
  • Tested ffmpeg rendering capabilities
  • Validated timing synchronization logic

Deliverable: working prototype (single format)
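The timing-synchronization logic validated in this phase boils down to placing each video cut at the cumulative offset of the narration segments before it. A toy version (real durations would come from the TTS output, not hard-coded values):

```python
# Toy validation of the proof-of-concept timing logic: given per-segment audio
# durations, compute each clip's start offset on the shared timeline so video
# cuts line up with the narration.

def cumulative_offsets(durations: list[float]) -> list[float]:
    """Return the start time of each segment on the shared timeline."""
    offsets, t = [], 0.0
    for d in durations:
        offsets.append(round(t, 3))
        t += d
    return offsets

print(cumulative_offsets([3.2, 4.5, 2.8]))  # [0.0, 3.2, 7.7]
```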

Voice & Caption System (Week 3)
  • Integrated Google TTS with multiple voice options
  • Built caption generation with word-level timing
  • Implemented subtitle overlay with custom styling

Deliverable: voice + caption pipeline
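Caption generation with word-level timing can be sketched as grouping timed words into short cues and emitting standard SRT. The word timings below are invented for illustration; in the pipeline they would come from the TTS engine's output metadata:

```python
# Hedged sketch of word-level caption generation: group timed words into short
# caption lines and emit SRT cues.

def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[tuple[str, float, float]], per_line: int = 3) -> str:
    """words: (text, start, end) triples; group `per_line` words per cue."""
    cues = []
    for i in range(0, len(words), per_line):
        group = words[i:i + per_line]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w for w, _, _ in group)
        n = len(cues) + 1
        cues.append(f"{n}\n{to_timestamp(start)} --> {to_timestamp(end)}\n{text}\n")
    return "\n".join(cues)

srt = words_to_srt([("Automated", 0.0, 0.5), ("video", 0.5, 0.9), ("pipeline", 0.9, 1.4)])
print(srt)
```

SRT keeps styling out of the timing data, so the custom overlay styling can be applied separately at the ffmpeg stage.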

Visual Asset Integration (Week 4)
  • Integrated Pexels API with keyword extraction
  • Built video clip selection logic (relevance scoring)
  • Implemented clip duration matching to audio segments

Deliverable: automated footage sourcing
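The clip-duration-matching step above can be illustrated with a simple selection policy (the function name and policy are assumptions, not the production logic): prefer the shortest candidate that still covers the narration segment, and fall back to the longest available otherwise.

```python
# Illustrative clip-to-segment matching: pick the shortest candidate clip that
# fully covers the audio segment; otherwise take the longest available and
# leave looping/padding to a later step.

def match_clip(segment_len: float, clip_lengths: list[float]) -> float:
    """Return the duration of the chosen candidate clip."""
    covering = [c for c in clip_lengths if c >= segment_len]
    if covering:
        return min(covering)   # shortest clip that fully covers the narration
    return max(clip_lengths)   # best partial coverage as a fallback

print(match_clip(4.0, [2.5, 6.0, 4.5]))  # 4.5
```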

Rendering Pipeline (Week 5)
  • Dockerized ffmpeg rendering environment
  • Built queue system for batch processing
  • Implemented error handling and retry logic
  • Optimized rendering with hardware acceleration

Deliverable: production rendering pipeline
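The queue system with error handling and retry logic can be sketched as a FIFO loop that re-queues failed jobs. Here `render` stands in for the Dockerized ffmpeg call, and the retry count is an assumption for illustration:

```python
# Minimal sketch of a single-threaded render queue with retry logic.
from collections import deque

def process_queue(jobs, render, max_retries: int = 2):
    """Process jobs FIFO; retry failed renders up to `max_retries` times."""
    queue = deque((job, 0) for job in jobs)
    done, failed = [], []
    while queue:
        job, attempts = queue.popleft()
        try:
            render(job)
            done.append(job)
        except Exception:
            if attempts < max_retries:
                queue.append((job, attempts + 1))  # re-queue for another try
            else:
                failed.append(job)                 # give up after max_retries
    return done, failed

# Usage with a flaky fake renderer: job "b" fails once, then succeeds.
calls = {"b": 0}
def fake_render(job):
    if job == "b" and calls["b"] == 0:
        calls["b"] += 1
        raise RuntimeError("transient ffmpeg error")

done, failed = process_queue(["a", "b"], fake_render)
print(done, failed)  # ['a', 'b'] []
```

Keeping the queue single-threaded is what makes the memory cap discussed in the challenges section enforceable: at most one render holds RAM at a time.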

Multi-Format Export (Week 6)
  • Created platform-specific encoding profiles
  • Built aspect ratio conversion (16:9 → 9:16 for vertical)
  • Added file size optimization

Deliverable: multi-platform export system
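The 16:9 → 9:16 conversion above amounts to a center crop. A sketch using ffmpeg's standard `crop` filter (the helper name and file paths are placeholders; production profiles would also round the crop width to even pixels and set encoder parameters):

```python
# Sketch of 16:9 -> 9:16 conversion: center-crop the horizontal frame to a
# vertical one using ffmpeg's crop filter (crop=w:h:x:y).

def vertical_crop_command(src: str, dst: str) -> list[str]:
    # Width becomes 9/16 of the height, keeping the center of the frame.
    vf = "crop=ih*9/16:ih:(iw-ih*9/16)/2:0"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst]

cmd = vertical_crop_command("final_16x9.mp4", "final_9x16.mp4")
print(" ".join(cmd))
```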


Engineering challenges.

What broke. How we fixed it.

Frame-Accurate Caption Timing

Problem

Captions appeared 0.5–2 seconds out of sync with the audio, requiring manual adjustment for every video.

Constraint

ffmpeg subtitle filters rely on fixed timestamps, but TTS audio contains variable-length silences, so word boundaries don't fall on predictable intervals.

Solution

A custom timing engine analyzes the TTS output for word boundaries, uses silence detection to adjust caption placement, and adds 100 ms buffer zones before and after transitions.
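The adjustment step can be sketched as follows (a hedged illustration, not the production engine): snap a cue's start out of any overlapping leading silence, then pad both transitions with the 100 ms buffer. Silence detection itself, e.g. via ffmpeg's `silencedetect` filter, is outside this snippet.

```python
# Hedged sketch of caption-timing adjustment: shift cue boundaries off detected
# silences and pad each transition with a 100 ms buffer. Silence intervals are
# supplied as (start, end) pairs in seconds.

BUFFER = 0.1  # 100 ms buffer before/after caption transitions

def adjust_cue(start: float, end: float,
               silences: list[tuple[float, float]]) -> tuple[float, float]:
    """Snap a cue's start to the end of an overlapping leading silence,
    then apply the transition buffers."""
    for s0, s1 in silences:
        if s0 <= start < s1:     # cue starts inside a silence: begin at speech
            start = s1
    return (round(start + BUFFER, 3), round(end - BUFFER, 3))

print(adjust_cue(1.0, 3.0, [(0.8, 1.2)]))  # (1.3, 2.9)
```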

Outcome

Timing accuracy: 99.8% (within 100ms). Manual adjustment: eliminated.

Video Clip Relevance & Quality

Problem

Pexels API returned generic results for specific keywords. Quality varied.

Constraint

Free tier: 200 requests/hour. Cannot manually curate. No built-in quality filtering.

Solution

Two-pass search: specific keywords first, broader terms as fallback. Quality scoring (resolution, duration, view count). Caching layer to reuse high-quality clips.
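The two-pass search with quality scoring and caching can be mirrored in a short sketch. Here `search` stands in for the Pexels API call, and the scoring weights and field names (`height`, `duration`, `views`) are invented for the example:

```python
# Illustrative two-pass clip search with quality scoring and a cache.

CACHE: dict[str, dict] = {}

def score(clip: dict) -> float:
    """Rank candidates by resolution, duration and popularity (assumed fields)."""
    return clip["height"] / 1080 + min(clip["duration"], 15) / 15 + clip["views"] / 1e6

def best_clip(keyword: str, fallback: str, search) -> dict:
    if keyword in CACHE:                           # reuse known-good clips
        return CACHE[keyword]
    results = search(keyword) or search(fallback)  # pass 1, then broader pass 2
    clip = max(results, key=score)
    CACHE[keyword] = clip
    return clip

# Usage with a fake search function standing in for the API:
def fake_search(q):
    return {"city drone": [{"height": 1080, "duration": 12, "views": 50000},
                           {"height": 720, "duration": 20, "views": 90000}]}.get(q, [])

print(best_clip("city drone", "city", fake_search)["height"])  # 1080
```

The cache is also what drives the 40% reduction in API requests noted below: once a keyword has a known-good clip, the rate-limited search is skipped entirely.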

Outcome

Clip relevance: 60% → 85%. Only 1080p+ clips used. API requests reduced 40% through caching.

Docker Resource Management

Problem

Concurrent renders crashed the server: a single render consumed 4–6 GB of RAM on a server with 8 GB total.

Constraint

Cannot upgrade server mid-project. Fast turnaround needed.

Solution

A single-threaded rendering queue with container resource limits (6 GB RAM, 2 CPUs); batch jobs run overnight, real-time requests during the day.
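The container caps can be sketched as a `docker run` invocation built in Python. The image name and mount paths are placeholders; `--memory` and `--cpus` are standard `docker run` resource flags:

```python
# Sketch of launching the render container with the resource caps described
# above (6 GB RAM, 2 CPUs).

def docker_render_command(image: str, workdir: str) -> list[str]:
    return [
        "docker", "run", "--rm",
        "--memory=6g",        # hard RAM cap per render container
        "--cpus=2",           # CPU limit so the host stays responsive
        "-v", f"{workdir}:/work",
        image, "render", "/work/job.json",
    ]

cmd = docker_render_command("renderer:latest", "/srv/jobs/42")
print(" ".join(cmd))
```

Capping the container at 6 GB (rather than the host's 8 GB) deliberately leaves headroom for the OS and the queue process itself.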

Outcome

Server stability: 100%. Render time: <60 seconds. Zero crashes in production.


Measured impact.

Results. Numbers only.

Production efficiency

Production time: 2–4 hours → <60 seconds (99% faster)

Throughput: 2–3 videos/day → effectively unlimited (bounded only by the queue)

Daily capacity: API-rate-limited, not compute-limited

Quality consistency: variable → standardized 1080p @ 30fps

Technical performance

Rendering success rate: 98%

Caption sync accuracy: 99.8%

Output quality: 1080p @ 30fps

Processing queue: real-time under normal load

Infrastructure efficiency

Infrastructure footprint: single VPS node

Processing overhead: minimal (serverless-comparable per video)

Maintenance overhead: automated, zero manual intervention

Capacity vs manual: 100x throughput at same infrastructure



Get started

Need similar architecture?

We build systems for operators serious about scale. If you're ready to invest in infrastructure that compounds, let's design your system.

Start a diagnostic
Explore all systems