AI Enablement Guide

Nomad Media supports a range of AI-powered enrichment capabilities for images, audio, and video content. All AI processing is optional, individually toggleable, and governed by the Rules Engine — meaning each capability can be enabled for specific folders, file types, years, or content categories independently.

For information on data security, model providers, and how AI-generated metadata is stored, see AI Metadata Overview. For rollout planning, see Phased AI Rollout.

Recommended Rollout Order

If you are enabling AI for the first time, the most successful approach is usually to start with the lowest-cost, highest-utility capabilities and layer richer search on top only after users have validated the value.

Phase	Recommended capability	Why it usually comes first
Day 1	Audio + video transcription	Delivers immediate search value across spoken content at relatively low cost
Next	Image enrichment / image search	Expands natural-language discovery into image libraries very cost-effectively
Then	LLM-enhanced search on transcript content	Makes transcript search much richer and more flexible than keyword-only matching
Later	Deep video visual search	Powerful and demo-friendly, but typically higher-cost and not required for every go-live

Day 1 Recommendation

For most deployments, the safest default is:

Enable transcription for speech-first audio and video content
Let transcript text become searchable
Add VTT subtitles for playback
Roll out richer search and deeper visual analysis in later waves

This delivers immediate value without forcing a full AI spend at go-live.

Audio AI Processors

Processor	What it does
Transcription	Generates a full text transcript of the spoken content
Subtitles / Captions	Generates a subtitle file with word-level or segment-level timecodes (SRT / VTT output)
Sentiment analysis	Detects emotional tone at the segment level (excited, happy, sad, frustrated, etc.)

Dependency: All audio AI processors require an MP3 proxy to be generated first. Ensure audio proxy generation is enabled for any folder where audio AI processing is intended — including folders containing video files, since audio is extracted from video for transcription purposes. See Proxy Generation Overview.

Approximate cost: about $2 per hour of audio for transcription.
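
As a rough planning aid, the per-hour figure above can be turned into a budget estimate. The sketch below is illustrative only: the $2 rate is the approximate figure quoted in this guide, while the function name and the sample library are hypothetical and not part of Nomad Media.

```python
# Approximate rate quoted in this guide; actual pricing may vary by deployment.
TRANSCRIPTION_USD_PER_HOUR = 2.00


def estimate_transcription_cost(durations_seconds, usd_per_hour=TRANSCRIPTION_USD_PER_HOUR):
    """Return a rough transcription cost in USD for a list of media durations."""
    total_hours = sum(durations_seconds) / 3600
    return round(total_hours * usd_per_hour, 2)


# Hypothetical library: 40 podcast episodes of 45 minutes each (30 hours total).
podcast_durations = [45 * 60] * 40
print(estimate_transcription_cost(podcast_durations))  # 60.0
```

Running the same arithmetic against the total duration of a folder before enabling transcription gives an early sense of what a rule will cost once it is scoped to that folder.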

What Day 1 Transcription Actually Gives You

When transcription is enabled for audio or video:

the spoken words are transcribed
the transcript text is indexed for search
a VTT subtitle file is generated for web playback
an SRT subtitle file can optionally be generated when needed for downstream editing or delivery workflows

This is why transcription is usually the first AI feature to enable: it improves both findability and playback usability with a single pipeline.

Subtitle Files vs. Search Indexing

These are related, but not the same thing:

Transcript indexing is what makes spoken content searchable
VTT / SRT files are subtitle outputs used for playback or export workflows
VTT is the standard web subtitle format
SRT is commonly used in editing and broadcast-adjacent tooling

In other words, users do not search the VTT file itself. They search the transcript text after it has been indexed.

Known Limitation: Music Content

Standard AI transcription models are designed for speech. Performance on music content — instrumental tracks, vocals over music, or dense audio mixes — is poor. Results may be incomplete, inaccurate, or unreliable.

If your library contains significant music content, consider:

Scoping audio transcription rules to folders that contain speech-first content (e.g., content/podcasts/, content/interviews/)
Excluding music-heavy folders explicitly using urlExcludes
Contacting Nomad Media support to discuss music-specific AI model options

Image AI Capabilities

Image enrichment is usually the next feature to add after transcription because it is inexpensive and expands natural-language search across still images.

Capability	What it does
Visual description	Generates natural-language descriptions of what appears in the image
Text detection	Extracts visible text from signs, titles, lower thirds, packaging, and other in-image text
Object / concept detection	Identifies objects, scenes, and broad visual concepts
Celebrity recognition	Identifies known public figures when that use case is needed
Multimodal / LLM analysis	Produces richer search-oriented understanding for natural-language discovery

Approximate cost: image enrichment is typically one of the most cost-effective AI options in the platform, often around $1 per 1,000 images for baseline enrichment, depending on the processors enabled.

For new rollouts, start with the capabilities that improve search and discovery first. Add specialized detectors only when there is a clear business need.

Video AI Capabilities

Video AI can work in two different ways:

Audio-driven — transcription and transcript search based on what is spoken
Visual / multimodal — time-coded analysis of what is happening on screen

Time-coded video AI makes it possible to search for a concept and jump into a relevant segment rather than only finding the full asset.
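
The idea of jumping into a segment can be sketched with a small data model. Everything here is hypothetical: the Segment fields, the asset IDs, and the simple keyword match, which only stands in for whatever search the platform actually performs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    asset_id: str
    start_seconds: float
    end_seconds: float
    description: str  # e.g. a time-coded visual description or transcript text


def find_moments(segments, query):
    """Return (asset_id, start_seconds) for segments whose text contains every query word."""
    words = query.lower().split()
    return [
        (seg.asset_id, seg.start_seconds)
        for seg in segments
        if all(word in seg.description.lower() for word in words)
    ]


# Hypothetical time-coded output for two assets.
segments = [
    Segment("game-042", 0.0, 30.0, "Crowd entering the stadium"),
    Segment("game-042", 30.0, 60.0, "Player lifts the trophy on stage"),
    Segment("promo-007", 0.0, 15.0, "Logo animation on a dark background"),
]
print(find_moments(segments, "trophy stage"))  # [('game-042', 30.0)]
```

Because each hit carries a start time, a player can seek straight to the 30-second mark instead of opening the asset from the beginning.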

Capability	What it does
Transcript search	Lets users search what is said in the audio track
Time-coded visual description	Describes visual content at intervals throughout the video
Time-coded text detection	Detects on-screen text with time ranges
Time-coded object / concept detection	Identifies objects and scenes over time
Deep video search	Supports natural-language retrieval of visually relevant video moments

Relative cost: visual and multimodal video analysis is more expensive than transcription, so it is usually a later-phase decision and often needs a targeted rollout by folder, media type, or content priority.

For many organizations, deep video search is the feature that demos best, but it is not always the right Day 1 default.

Specialized Processors vs. Modern Multimodal Search

Some AI capabilities are specialized detectors such as text detection, object detection, or celebrity recognition. Others are multimodal / LLM-oriented capabilities focused on richer natural-language retrieval.

Both remain useful, but they serve different goals:

Specialized processors are best when you need a very specific signal, such as visible text, known public figures, or explicit object categories
Multimodal / LLM search is best when users want to search naturally, using descriptions and concepts rather than rigid keywords

For new deployments, the recommendation is usually:

start with transcription
add image discovery
add richer multimodal search
add specialized processors where they solve a real business need

This avoids over-configuring features that users may not actually need at go-live.

Processor Dependencies

Some AI processors depend on others running first. The most important dependency chain is:

AudioExtraction (MP3 proxy)
  └── Transcribe
        └── downstream transcript-based search features

Video visual analysis also depends on screenshot or frame extraction. See Turning On/Off Asset Processors and Dependencies for the detailed dependency map.
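
One way to reason about these chains is as a dependency graph. The sketch below uses Python's standard graphlib module to derive a valid run order and to flag enabled processors whose prerequisites are switched off. AudioExtraction and Transcribe are taken from the chain above; TranscriptSearch, FrameExtraction, and VisualAnalysis are placeholder names assumed for illustration, and this is not how the platform itself resolves dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Maps each processor to the processors it depends on.
# Only AudioExtraction and Transcribe are named in this guide;
# the remaining names are hypothetical placeholders.
DEPENDENCIES = {
    "AudioExtraction": set(),
    "Transcribe": {"AudioExtraction"},
    "TranscriptSearch": {"Transcribe"},
    "FrameExtraction": set(),
    "VisualAnalysis": {"FrameExtraction"},
}


def run_order(dependencies):
    """Return one valid execution order in which prerequisites come first."""
    return list(TopologicalSorter(dependencies).static_order())


def missing_prerequisites(enabled, dependencies):
    """List (processor, prerequisite) pairs where the prerequisite is not enabled."""
    return sorted(
        (proc, dep)
        for proc in enabled
        for dep in dependencies.get(proc, set())
        if dep not in enabled
    )


order = run_order(DEPENDENCIES)
print(order)
print(missing_prerequisites({"Transcribe", "VisualAnalysis", "FrameExtraction"}, DEPENDENCIES))
# [('Transcribe', 'AudioExtraction')]
```

A check like the second function is a cheap way to catch a rule that enables transcription for a folder where audio proxy generation has not been turned on.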

Scoping and Rollout Controls

AI capabilities are enabled via the system configuration file using the same processorList structure as other Nomad processors. Common controls include:

enabled: true/false
urlMatches for folder prefixes
urlExcludes for exclusions
rule for expression-based conditions such as file type, age, size, or other metadata

This means you can roll AI out gradually by:

folder
territory or department
year
file extension
content type
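
The controls above can be pictured with a small sketch. The field names enabled, urlMatches, and urlExcludes come from this page, but the entry layout, the "processor" key, the folder paths, and the prefix-matching semantics are all assumptions made for illustration. The rule expression is left out because its syntax is not shown here.

```python
# Hypothetical processorList entry modeled as a Python dict; the real JSON
# schema is documented in "Turning On/Off Asset Processors and Dependencies".
transcribe_rule = {
    "processor": "Transcribe",  # assumed key and processor name
    "enabled": True,
    "urlMatches": ["content/podcasts/", "content/interviews/"],
    "urlExcludes": ["content/podcasts/music-beds/"],  # hypothetical folder
}


def applies_to(entry, asset_url):
    """Assumed semantics: enabled, matches an include prefix, matches no exclude prefix."""
    if not entry.get("enabled", False):
        return False
    included = any(asset_url.startswith(p) for p in entry.get("urlMatches", []))
    excluded = any(asset_url.startswith(p) for p in entry.get("urlExcludes", []))
    return included and not excluded


for url in [
    "content/podcasts/ep-12.mp3",             # True: inside an included folder
    "content/podcasts/music-beds/intro.mp3",  # False: explicitly excluded
    "content/music/album-01/track-03.wav",    # False: never included
]:
    print(url, applies_to(transcribe_rule, url))
```

Widening the rollout then amounts to adding another prefix to urlMatches, and pausing it amounts to setting enabled to false, without touching the rest of the entry.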

See Rules Engine Overview and Turning On/Off Asset Processors and Dependencies for the configuration details.

Enabling AI Later Is Safe

You do not have to make every AI decision at once.

You can start with transcription now and add deeper capabilities later
You can scope later capabilities to only selected folders or media types
You can reprocess existing content to add only the outputs that are missing

Retroactive processing: newly enabled AI processors can be applied later using Reprocessing Assets. The system adds missing outputs without re-running processors that have already completed successfully.
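
The "add only what is missing" behavior can be thought of as a per-asset difference between the processors that are enabled and those that have already completed. The sketch below shows that idea only. The function, processor names, and asset records are hypothetical, and the system's actual reprocessing logic is described in Reprocessing Assets.

```python
def processors_to_run(enabled_processors, completed_processors):
    """Return enabled processors that have not yet completed for an asset, in config order."""
    done = set(completed_processors)
    return [p for p in enabled_processors if p not in done]


# Hypothetical state: VisualAnalysis was enabled after the first ingest.
enabled = ["Transcribe", "VisualAnalysis"]
assets = {
    "interview-01.mp4": ["Transcribe"],
    "interview-02.mp4": [],
    "keynote.mp4": ["Transcribe", "VisualAnalysis"],
}

for asset, completed in assets.items():
    print(asset, processors_to_run(enabled, completed))
```

Under these assumptions, the first asset needs only VisualAnalysis, the second needs both processors, and the third needs nothing further.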

The main operational caution is cost: enabling a new processor across a large historical catalog can create a catch-up spike while the back catalog is processed.

Deployment-Specific Providers

Provider details can vary by deployment and by capability. Use the capability guidance on this page to decide what to enable, then consult the related pages below for how your environment is configured:

AI Metadata Overview — data security, model providers, and storage architecture
Search and Discovery — how transcript, keyword, and multimodal search behave
Phased AI Rollout — suggested Day 1 go-live and test-bed strategy
Rules Engine Overview — how to scope processors to specific folders and file types
Turning On/Off Asset Processors and Dependencies — full JSON configuration reference
Proxy Generation Overview — proxy dependencies for AI processing
Reprocessing Assets — applying AI processors retroactively to existing content