
JSON Driver Masterclass

A driver is a JSON file that tells Onset Engine what visual content to assign to each energy tier of your music. Without a driver, the engine uses raw CLIP similarity and motion scores. With a driver, you get precise creative control over what appears at each musical intensity level.

Think of it as a content brief for the AI: “During quiet sections, show landscapes. During medium energy, show dialogue. During the drop, show explosions.”

{
  "meta": {
    "name": "Action Anime Driver",
    "version": "3.0",
    "description": "Maps anime content to energy tiers with subject awareness"
  },
  "global": {
    "min_rating": 3,
    "exclude_tags": ["@Filler", "@Recap"],
    "shot_diversity": true
  },
  "tiers": {
    "1_LOW": {
      "descriptions": [
        "character standing in peaceful landscape",
        "calm sky with clouds",
        "characters talking quietly"
      ],
      "subjects": ["@Goku"],
      "moods": ["serene", "melancholic"],
      "scene_types": ["wide", "medium"]
    },
    "2_MED": {
      "descriptions": [
        "character powering up with glowing aura",
        "intense stare between fighters",
        "character flying through sky"
      ],
      "subjects": ["@Goku", "@Vegeta"],
      "moods": ["tense"],
      "scene_types": ["medium", "close-up"]
    },
    "3_HIGH": {
      "descriptions": [
        "fast martial arts combat with punching",
        "energy beam attack",
        "character dodging rapid attacks"
      ],
      "subjects": ["@Goku"],
      "moods": ["aggressive", "epic"],
      "scene_types": ["close-up", "medium"]
    },
    "4_MAX": {
      "descriptions": [
        "massive energy explosion",
        "character transforming with blinding light",
        "devastating beam clash"
      ],
      "subjects": ["@Goku"],
      "moods": ["epic"],
      "min_rating": 4
    }
  }
}
Block    Description
meta     Display name, version, and description
global   Library-wide filtering rules applied to all tiers
tiers    Energy tier definitions keyed as 1_LOW, 2_MED, 3_HIGH, 4_MAX

Field        Type      Description
name         string    Display name for the driver
version      string    Schema version (use "3.0" for current)
description  string    What this driver is designed for

Field           Type      Default  Description
min_rating      integer   0        Minimum quality rating for all tiers (0–5)
exclude_tags    string[]  []       Exclude clips with these tags from all tiers
shot_diversity  boolean   true     Enable diversity filtering to prevent visual repetition
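As a sanity check, the two global filters can be sketched in a few lines of Python. The clip fields (`rating`, `tags`) and the function itself are illustrative assumptions, not the engine's actual API:

```python
# Hypothetical sketch of the library-wide filters; the field names
# ("rating", "tags") are assumptions, not Onset Engine's real schema.

def passes_global_filters(clip: dict, global_cfg: dict) -> bool:
    """Return True if a clip survives the global driver filters."""
    # min_rating: drop clips rated below the library-wide floor.
    if clip["rating"] < global_cfg.get("min_rating", 0):
        return False
    # exclude_tags: drop clips carrying any blacklisted tag.
    if set(clip["tags"]) & set(global_cfg.get("exclude_tags", [])):
        return False
    return True

cfg = {"min_rating": 3, "exclude_tags": ["@Filler", "@Recap"]}
print(passes_global_filters({"rating": 4, "tags": ["@Goku"]}, cfg))   # True
print(passes_global_filters({"rating": 4, "tags": ["@Recap"]}, cfg))  # False
print(passes_global_filters({"rating": 2, "tags": ["@Goku"]}, cfg))   # False
```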

Each tier is keyed as 1_LOW, 2_MED, 3_HIGH, or 4_MAX:

Field         Type      Description
descriptions  string[]  CLIP text descriptions — the engine computes cosine similarity against clip embeddings
subjects      string[]  Subject tag references using @TagName syntax
moods         string[]  Filter to clips with matching mood classification
scene_types   string[]  Filter to clips with matching scene type
min_rating    integer   Per-tier minimum rating; takes precedence over the global setting

The engine maps musical energy (0.0–1.0) to four tiers:

Tier    Energy Range  Musical Moment
1_LOW   0.00–0.25     Intros, breakdowns, quiet sections
2_MED   0.25–0.50     Verses, building tension
3_HIGH  0.50–0.75     Choruses, buildups
4_MAX   0.75–1.00     Drops, climaxes, peak energy
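In code, the mapping is a simple threshold ladder. The sketch below assumes half-open ranges, so a boundary value like 0.25 falls into the higher tier; the engine's exact boundary handling isn't documented here:

```python
def energy_to_tier(energy: float) -> str:
    """Map a musical energy value in [0.0, 1.0] to a driver tier key.

    Boundary handling (0.25, 0.50, 0.75 rounding up) is an assumption.
    """
    if energy < 0.25:
        return "1_LOW"
    if energy < 0.50:
        return "2_MED"
    if energy < 0.75:
        return "3_HIGH"
    return "4_MAX"

print(energy_to_tier(0.10))  # 1_LOW
print(energy_to_tier(0.80))  # 4_MAX
```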

Each description string is encoded into a 768-dim vector using the CLIP text encoder. The engine computes cosine similarity between the description vector and every clip’s embedding.

Description: "massive energy explosion"
↓ CLIP text encoder
↓ 768-dim vector
↓ cosine similarity vs. all clips
↓ ranked results
Clip #4821: cos_sim = 0.31 ← best match
Clip #1203: cos_sim = 0.28
Clip #0892: cos_sim = 0.24
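The ranking step above can be reproduced with NumPy. Random vectors stand in for real CLIP embeddings here; only the 768-dim size and the cosine-similarity ranking come from the text:

```python
import numpy as np

def rank_clips(text_vec: np.ndarray, clip_embeds: np.ndarray):
    """Rank clip embeddings by cosine similarity to a text embedding."""
    t = text_vec / np.linalg.norm(text_vec)
    c = clip_embeds / np.linalg.norm(clip_embeds, axis=1, keepdims=True)
    sims = c @ t                # cosine similarity per clip
    order = np.argsort(-sims)   # indices of best matches first
    return order, sims[order]

rng = np.random.default_rng(0)
text_vec = rng.normal(size=768)          # stand-in for a CLIP text embedding
clip_embeds = rng.normal(size=(5, 768))  # stand-in clip embeddings
order, sims = rank_clips(text_vec, clip_embeds)
```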

v3 drivers use contrastive scoring — the engine doesn’t just pick clips with the highest absolute similarity to a tier’s descriptions. It measures tier specificity: how much more similar is this clip to the target tier than to all other tiers?

Standard scoring: score = raw_similarity * 0.4
Contrastive: score = raw_similarity * 0.4 + (target_sim - max_other_sim) * 0.6

This prevents the common failure mode where the CLIP model’s highest-confidence clips dominate every tier. Contrastive scoring ensures that calm tiers get genuinely calm content, not just the model’s most confident matches.
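The two formulas can be compared directly. The 0.4 and 0.6 weights come from the formulas above; the example similarity values are made up to show the effect:

```python
def standard_score(target_sim: float) -> float:
    """Standard scoring: raw similarity only."""
    return target_sim * 0.4

def contrastive_score(target_sim: float, other_tier_sims: list[float]) -> float:
    """v3 contrastive scoring: reward tier specificity, not raw confidence."""
    return target_sim * 0.4 + (target_sim - max(other_tier_sims)) * 0.6

# A clip that matches every tier about equally (high confidence, no
# specificity) loses to one that is distinctly similar to the target tier:
generic  = contrastive_score(0.30, [0.30, 0.29, 0.28])  # zero margin
specific = contrastive_score(0.28, [0.15, 0.12, 0.10])  # large margin
print(specific > generic)  # True
```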

Reference tagged clips using the @TagName syntax in the subjects array:

{
  "4_MAX": {
    "descriptions": ["massive energy explosion"],
    "subjects": ["@Goku", "@Vegeta"],
    "moods": ["epic"],
    "min_rating": 4
  }
}

Tags are created via the few-shot propagation system (tag 5 clips → engine finds 800 more).

Mood         Description
epic         Grand, powerful, heroic content
melancholic  Sad, reflective, emotional
tense        Suspenseful, high-stakes
comedic      Light, funny, playful
romantic     Intimate, warm, affectionate
serene       Peaceful, calm, meditative
aggressive   Intense, violent, forceful

Scene Type   Description
close-up     Face or detail shot
medium       Waist-up or small group
wide         Full environment or establishing shot
aerial       Drone or overhead perspective
pov          First-person or subjective camera
slow-motion  Reduced playback-speed content

What footage is in your library? Anime fights? Drone landscapes? Wedding ceremonies? The driver should reflect your actual content, not aspirational queries.

Be specific and visual. CLIP understands natural language:

// ❌ Too vague
"descriptions": ["action"]

// ✅ Specific and visual
"descriptions": [
  "character performing a spinning kick in mid-air",
  "explosion with debris flying toward camera",
  "fast sword slash with motion blur"
]

In Studio Mode, click ✨ Create in the Clip Direction section to open the Driver Wizard — a visual tier builder with live JSON preview. If you’ve entered text descriptions, the wizard pre-populates from those.

Use DJ Mode to preview how the driver selects clips in real-time. Adjust descriptions and filters based on what you see. The console shows per-tier diagnostic logging with similarity breakdowns.

{
  "meta": {
    "name": "Nature Reel",
    "version": "3.0"
  },
  "tiers": {
    "1_LOW": {
      "descriptions": ["calm ocean waves", "forest canopy from above", "sunrise over mountains"],
      "scene_types": ["wide", "aerial"]
    },
    "2_MED": {
      "descriptions": ["flowing river through valley", "birds in flight"],
      "scene_types": ["wide", "medium"]
    },
    "3_HIGH": {
      "descriptions": ["fast drone dive through canyon", "waterfall close-up"],
      "scene_types": ["aerial", "pov"],
      "min_rating": 2
    },
    "4_MAX": {
      "descriptions": ["storm clouds time-lapse", "lightning strike over ocean"],
      "scene_types": ["wide"],
      "min_rating": 3
    }
  }
}
{
  "meta": {
    "name": "Wedding Highlights",
    "version": "3.0"
  },
  "global": {
    "min_rating": 2,
    "shot_diversity": true
  },
  "tiers": {
    "1_LOW": {
      "descriptions": ["wedding venue exterior", "floral decorations", "guests arriving"],
      "moods": ["serene"],
      "scene_types": ["wide"]
    },
    "2_MED": {
      "descriptions": ["bride walking down aisle", "exchanging rings", "emotional guests"],
      "moods": ["romantic", "melancholic"],
      "scene_types": ["medium"]
    },
    "3_HIGH": {
      "descriptions": ["first dance", "wedding party celebration", "champagne toast"],
      "moods": ["romantic", "epic"],
      "scene_types": ["medium", "close-up"]
    },
    "4_MAX": {
      "descriptions": ["crowd dancing at reception", "confetti throw", "sparkler exit"],
      "moods": ["epic"],
      "scene_types": ["wide", "close-up"],
      "min_rating": 3
    }
  }
}
  • Write 3–6 descriptions per tier for best results — more variety means better coverage
  • Use @Tag subjects only after running clip tagging in the GUI
  • The penalty multipliers stack: a clip with wrong mood AND scene can get 0.50 × 0.60 = 0.30× score
  • Per-tier min_rating overrides the global setting for that tier only
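The penalty stacking in the third bullet can be sketched as follows. The 0.50 (mood mismatch) and 0.60 (scene-type mismatch) multipliers come from the bullet; the function shape is an assumption, not the engine's actual code:

```python
def apply_penalties(score: float, mood_match: bool, scene_match: bool) -> float:
    """Multiply a clip's score by stacked penalties for mismatched filters."""
    if not mood_match:
        score *= 0.50   # mood-mismatch penalty (value from the bullet above)
    if not scene_match:
        score *= 0.60   # scene-type-mismatch penalty
    return score

print(apply_penalties(1.0, mood_match=False, scene_match=False))  # 0.3
```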