
JSON Driver Masterclass

A driver is a JSON file that tells Onset Engine what visual content to assign to each energy tier of your music. Without a driver, the engine uses raw CLIP similarity and motion scores. With a driver, you get precise creative control over what appears at each musical intensity level.

Think of it as a content brief for the AI: “During quiet sections, show landscapes. During medium energy, show dialogue. During the drop, show explosions.”

{
  "meta": {
    "name": "Action Anime Driver",
    "version": "3.0",
    "description": "Maps anime content to energy tiers with subject awareness"
  },
  "global": {
    "min_rating": 3,
    "exclude_tags": ["@Filler", "@Recap"],
    "shot_diversity": true
  },
  "tiers": {
    "1_LOW": {
      "descriptions": [
        "character standing in peaceful landscape",
        "calm sky with clouds",
        "characters talking quietly"
      ],
      "subjects": ["@Goku"],
      "moods": ["serene", "melancholic"],
      "scene_types": ["wide", "medium"]
    },
    "2_MED": {
      "descriptions": [
        "character powering up with glowing aura",
        "intense stare between fighters",
        "character flying through sky"
      ],
      "subjects": ["@Goku", "@Vegeta"],
      "moods": ["tense"],
      "scene_types": ["medium", "close-up"]
    },
    "3_HIGH": {
      "descriptions": [
        "fast martial arts combat with punching",
        "energy beam attack",
        "character dodging rapid attacks"
      ],
      "subjects": ["@Goku"],
      "moods": ["aggressive", "epic"],
      "scene_types": ["close-up", "medium"]
    },
    "4_MAX": {
      "descriptions": [
        "massive energy explosion",
        "character transforming with blinding light",
        "devastating beam clash"
      ],
      "subjects": ["@Goku"],
      "moods": ["epic"],
      "min_rating": 4
    }
  }
}
Block    Description
meta     Display name, version, and description
global   Library-wide filtering rules applied to all tiers
tiers    Energy tier definitions keyed as 1_LOW, 2_MED, 3_HIGH, 4_MAX

Field        Type      Description
name         string    Display name for the driver
version      string    Schema version (use "3.0" for current)
description  string    What this driver is designed for

Field           Type      Default  Description
min_rating      integer   0        Minimum quality rating for all tiers (0–5)
exclude_tags    string[]  []       Exclude clips with these tags from all tiers
shot_diversity  boolean   true     Enable diversity filtering to prevent visual repetition
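As a sanity check, the two global filters can be sketched in a few lines of Python. The clip fields (`rating`, `tags`) and the function itself are illustrative assumptions, not the engine's actual API:

```python
# Hypothetical sketch of the library-wide filters; the field names
# ("rating", "tags") are assumptions, not Onset Engine's real schema.

def passes_global_filters(clip: dict, global_cfg: dict) -> bool:
    """Return True if a clip survives the global driver filters."""
    # min_rating: drop clips rated below the library-wide floor.
    if clip["rating"] < global_cfg.get("min_rating", 0):
        return False
    # exclude_tags: drop clips carrying any blacklisted tag.
    if set(clip["tags"]) & set(global_cfg.get("exclude_tags", [])):
        return False
    return True

cfg = {"min_rating": 3, "exclude_tags": ["@Filler", "@Recap"]}
print(passes_global_filters({"rating": 4, "tags": ["@Goku"]}, cfg))   # True
print(passes_global_filters({"rating": 4, "tags": ["@Recap"]}, cfg))  # False
print(passes_global_filters({"rating": 2, "tags": ["@Goku"]}, cfg))   # False
```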

Each tier is keyed as 1_LOW, 2_MED, 3_HIGH, or 4_MAX:

Field         Type      Description
descriptions  string[]  CLIP text descriptions — the engine computes cosine similarity against clip embeddings
subjects      string[]  Subject tag references using @TagName syntax
moods         string[]  Filter to clips with matching mood classification
scene_types   string[]  Filter to clips with matching scene type
min_rating    integer   Per-tier minimum rating; takes precedence over the global setting

The engine maps musical energy (0.0–1.0) to four tiers:

Tier    Energy Range  Musical Moment
1_LOW   0.00–0.25     Intros, breakdowns, quiet sections
2_MED   0.25–0.50     Verses, building tension
3_HIGH  0.50–0.75     Choruses, buildups
4_MAX   0.75–1.00     Drops, climaxes, peak energy
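In code, the mapping is a simple threshold ladder. The sketch below assumes half-open ranges, so a boundary value like 0.25 falls into the higher tier; the engine's exact boundary handling isn't documented here:

```python
def energy_to_tier(energy: float) -> str:
    """Map a musical energy value in [0.0, 1.0] to a driver tier key.

    Boundary handling (0.25, 0.50, 0.75 rounding up) is an assumption.
    """
    if energy < 0.25:
        return "1_LOW"
    if energy < 0.50:
        return "2_MED"
    if energy < 0.75:
        return "3_HIGH"
    return "4_MAX"

print(energy_to_tier(0.10))  # 1_LOW
print(energy_to_tier(0.80))  # 4_MAX
```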

Each description string is encoded into a 768-dim vector using the CLIP text encoder. The engine computes cosine similarity between the description vector and every clip’s embedding.

Description: "massive energy explosion"
↓ CLIP text encoder
↓ 768-dim vector
↓ cosine similarity vs. all clips
↓ ranked results
Clip #4821: cos_sim = 0.31 ← best match
Clip #1203: cos_sim = 0.28
Clip #0892: cos_sim = 0.24
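The ranking step above can be reproduced with NumPy. Random vectors stand in for real CLIP embeddings here; only the 768-dim size and the cosine-similarity ranking come from the text:

```python
import numpy as np

def rank_clips(text_vec: np.ndarray, clip_embeds: np.ndarray):
    """Rank clip embeddings by cosine similarity to a text embedding."""
    t = text_vec / np.linalg.norm(text_vec)
    c = clip_embeds / np.linalg.norm(clip_embeds, axis=1, keepdims=True)
    sims = c @ t                # cosine similarity per clip
    order = np.argsort(-sims)   # indices of best matches first
    return order, sims[order]

rng = np.random.default_rng(0)
text_vec = rng.normal(size=768)          # stand-in for a CLIP text embedding
clip_embeds = rng.normal(size=(5, 768))  # stand-in clip embeddings
order, sims = rank_clips(text_vec, clip_embeds)
```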

v3 drivers use contrastive scoring — the engine doesn’t just pick clips with the highest absolute similarity to a tier’s descriptions. It measures tier specificity: how much more similar is this clip to the target tier than to all other tiers?

Standard scoring: score = raw_similarity * 0.4
Contrastive: score = raw_similarity * 0.4 + (target_sim - max_other_sim) * 0.6

This prevents the common failure mode where the CLIP model’s highest-confidence clips dominate every tier. Contrastive scoring ensures that calm tiers get genuinely calm content, not just the model’s most confident matches.
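The two formulas can be compared directly. The 0.4 and 0.6 weights come from the formulas above; the example similarity values are made up to show the effect:

```python
def standard_score(target_sim: float) -> float:
    """Standard scoring: raw similarity only."""
    return target_sim * 0.4

def contrastive_score(target_sim: float, other_tier_sims: list[float]) -> float:
    """v3 contrastive scoring: reward tier specificity, not raw confidence."""
    return target_sim * 0.4 + (target_sim - max(other_tier_sims)) * 0.6

# A clip that matches every tier about equally (high confidence, no
# specificity) loses to one that is distinctly similar to the target tier:
generic  = contrastive_score(0.30, [0.30, 0.29, 0.28])  # zero margin
specific = contrastive_score(0.28, [0.15, 0.12, 0.10])  # large margin
print(specific > generic)  # True
```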

Reference tagged clips using the @TagName syntax in the subjects array:

{
  "4_MAX": {
    "descriptions": ["massive energy explosion"],
    "subjects": ["@Goku", "@Vegeta"],
    "moods": ["epic"],
    "min_rating": 4
  }
}

Tags are created via the few-shot propagation system (tag 5 clips → engine finds 800 more).

Mood         Description
epic         Grand, powerful, heroic content
melancholic  Sad, reflective, emotional
tense        Suspenseful, high-stakes
comedic      Light, funny, playful
romantic     Intimate, warm, affectionate
serene       Peaceful, calm, meditative
aggressive   Intense, violent, forceful

Scene Type   Description
close-up     Face or detail shot
medium       Waist-up or small group
wide         Full environment or establishing shot
aerial       Drone or overhead perspective
pov          First-person or subjective camera
slow-motion  Reduced playback-speed content

What footage is in your library? Anime fights? Drone landscapes? Wedding ceremonies? The driver should reflect your actual content, not aspirational queries.

Be specific and visual. CLIP understands natural language:

// ❌ Too vague
"descriptions": ["action"]

// ✅ Specific and visual
"descriptions": [
  "character performing a spinning kick in mid-air",
  "explosion with debris flying toward camera",
  "fast sword slash with motion blur"
]

In Studio Mode, click ✨ Create in the Clip Direction section to open the Driver Wizard — a visual tier builder with live JSON preview. If you’ve entered text descriptions, the wizard pre-populates from those.

Use DJ Mode to preview how the driver selects clips in real-time. Adjust descriptions and filters based on what you see. The console shows per-tier diagnostic logging with similarity breakdowns.

{
  "meta": {
    "name": "Nature Reel",
    "version": "3.0"
  },
  "tiers": {
    "1_LOW": {
      "descriptions": ["calm ocean waves", "forest canopy from above", "sunrise over mountains"],
      "scene_types": ["wide", "aerial"]
    },
    "2_MED": {
      "descriptions": ["flowing river through valley", "birds in flight"],
      "scene_types": ["wide", "medium"]
    },
    "3_HIGH": {
      "descriptions": ["fast drone dive through canyon", "waterfall close-up"],
      "scene_types": ["aerial", "pov"],
      "min_rating": 2
    },
    "4_MAX": {
      "descriptions": ["storm clouds time-lapse", "lightning strike over ocean"],
      "scene_types": ["wide"],
      "min_rating": 3
    }
  }
}
{
  "meta": {
    "name": "Wedding Highlights",
    "version": "3.0"
  },
  "global": {
    "min_rating": 2,
    "shot_diversity": true
  },
  "tiers": {
    "1_LOW": {
      "descriptions": ["wedding venue exterior", "floral decorations", "guests arriving"],
      "moods": ["serene"],
      "scene_types": ["wide"]
    },
    "2_MED": {
      "descriptions": ["bride walking down aisle", "exchanging rings", "emotional guests"],
      "moods": ["romantic", "melancholic"],
      "scene_types": ["medium"]
    },
    "3_HIGH": {
      "descriptions": ["first dance", "wedding party celebration", "champagne toast"],
      "moods": ["romantic", "epic"],
      "scene_types": ["medium", "close-up"]
    },
    "4_MAX": {
      "descriptions": ["crowd dancing at reception", "confetti throw", "sparkler exit"],
      "moods": ["epic"],
      "scene_types": ["wide", "close-up"],
      "min_rating": 3
    }
  }
}
  • Write 3–6 descriptions per tier for best results — more variety means better coverage
  • Use @Tag subjects only after running clip tagging in the GUI
  • The penalty multipliers stack: a clip with wrong mood AND scene can get 0.50 × 0.60 = 0.30× score
  • Per-tier min_rating overrides the global setting for that tier only
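The penalty stacking in the third bullet can be sketched as follows. The 0.50 (mood mismatch) and 0.60 (scene-type mismatch) multipliers come from the bullet; the function shape is an assumption, not the engine's actual code:

```python
def apply_penalties(score: float, mood_match: bool, scene_match: bool) -> float:
    """Multiply a clip's score by stacked penalties for mismatched filters."""
    if not mood_match:
        score *= 0.50   # mood-mismatch penalty (value from the bullet above)
    if not scene_match:
        score *= 0.60   # scene-type-mismatch penalty
    return score

print(apply_penalties(1.0, mood_match=False, scene_match=False))  # 0.3
```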