AI Enablement Guide

Which AI capabilities are available in Nomad Media, how to phase them in, what they depend on, and which rollout order is usually best.

Nomad Media supports a range of AI-powered enrichment capabilities for images, audio, and video content. All AI processing is optional, individually toggleable, and governed by the Rules Engine — meaning each capability can be enabled for specific folders, file types, years, or content categories independently.

For information on data security, model providers, and how AI-generated metadata is stored, see AI Metadata Overview. For rollout planning, see Phased AI Rollout.


Recommended Rollout Order

If you are enabling AI for the first time, the most successful approach is usually to start with the lowest-cost, highest-utility capabilities and layer richer search on top only after users have validated the value.

| Phase | Recommended capability | Why it usually comes first |
|-------|------------------------|----------------------------|
| Day 1 | Audio + video transcription | Delivers immediate search value across spoken content at relatively low cost |
| Next | Image enrichment / image search | Expands natural-language discovery into image libraries very cost-effectively |
| Then | LLM-enhanced search on transcript content | Makes transcript search much richer and more flexible than keyword-only matching |
| Later | Deep video visual search | Powerful and demo-friendly, but typically higher-cost and not required for every go-live |

Day 1 Recommendation

For most deployments, the safest default is:

  1. Enable transcription for speech-first audio and video content
  2. Let transcript text become searchable
  3. Add VTT subtitles for playback
  4. Roll richer search and deeper visual analysis in later waves

This delivers immediate value without forcing a full AI spend at go-live.


Audio AI Processors

| Processor | What it does |
|-----------|--------------|
| Transcription | Generates a full text transcript of the spoken content |
| Subtitles / Captions | Generates a subtitle file with word-level or segment-level timecodes (SRT / VTT output) |
| Sentiment analysis | Detects emotional tone at the segment level (excited, happy, sad, frustrated, etc.) |

Dependency: All audio AI processors require an MP3 proxy to be generated first. Ensure audio proxy generation is enabled for any folder where audio AI processing is intended — including folders containing video files, since audio is extracted from video for transcription purposes. See Proxy Generation Overview.

Approximate cost: about $2 per hour of audio for transcription.
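
As a rough planning aid, the per-hour figure above can be turned into a budget estimate. The sketch below assumes the approximate $2 per audio hour rate quoted on this page; the function name and the example library size are illustrative only, and actual pricing may differ by deployment.

```python
# Rough transcription budget estimate. The rate is the approximate
# figure quoted on this page, not a contractual price.
TRANSCRIPTION_USD_PER_HOUR = 2.0  # assumption: about $2 per hour of audio


def estimate_transcription_cost(total_minutes: float,
                                usd_per_hour: float = TRANSCRIPTION_USD_PER_HOUR) -> float:
    """Return the approximate transcription cost in USD for the given runtime."""
    if total_minutes < 0:
        raise ValueError("total_minutes must be non-negative")
    return round(total_minutes / 60.0 * usd_per_hour, 2)


if __name__ == "__main__":
    # Hypothetical library: 1,500 hours of podcasts and interviews.
    print(estimate_transcription_cost(1500 * 60))  # 3000.0
```

Because audio is extracted from video for transcription, video runtime counts toward the same total as audio-only files.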

What Day 1 Transcription Actually Gives You

When transcription is enabled for audio or video:

  • the spoken words are transcribed
  • the transcript text is indexed for search
  • a VTT subtitle file is generated for web playback
  • an SRT subtitle file can optionally be generated when needed for downstream editing or delivery workflows

This is why transcription is usually the first AI feature to enable: it improves both findability and playback usability with a single pipeline.

Subtitle Files vs. Search Indexing

These are related, but not the same thing:

  • Transcript indexing is what makes spoken content searchable
  • VTT / SRT files are subtitle outputs used for playback or export workflows
  • VTT is the standard web subtitle format
  • SRT is commonly used in editing and broadcast-adjacent tooling

In other words, users do not search the VTT file itself. They search the transcript text after it has been indexed.

Known Limitation: Music Content

Standard AI transcription models are designed for speech. Performance on music content — instrumental tracks, vocals over music, or dense audio mixes — is poor. Results may be incomplete, inaccurate, or unreliable.

If your library contains significant music content, consider:

  • Scoping audio transcription rules to folders that contain speech-first content (e.g., content/podcasts/, content/interviews/)
  • Excluding music-heavy folders explicitly using urlExcludes
  • Contacting Nomad Media support to discuss music-specific AI model options

Image AI Capabilities

Image enrichment is usually the next feature to add after transcription because it is inexpensive and expands natural-language search across still images.

| Capability | What it does |
|------------|--------------|
| Visual description | Generates natural-language descriptions of what appears in the image |
| Text detection | Extracts visible text from signs, titles, lower thirds, packaging, and other in-image text |
| Object / concept detection | Identifies objects, scenes, and broad visual concepts |
| Celebrity recognition | Identifies known public figures when that use case is needed |
| Multimodal / LLM analysis | Produces richer search-oriented understanding for natural-language discovery |

Approximate cost: image enrichment is typically one of the most cost-effective AI options in the platform, often around $1 per 1,000 images for baseline enrichment, depending on the processors enabled.

For new rollouts, start with the capabilities that improve search and discovery first. Add specialized detectors only when there is a clear business need.


Video AI Capabilities

Video AI can work in two different ways:

  1. Audio-driven — transcription and transcript search based on what is spoken
  2. Visual / multimodal — time-coded analysis of what is happening on screen

Time-coded video AI makes it possible to search for a concept and jump into a relevant segment rather than only finding the full asset.
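
The idea of jumping to a segment can be illustrated with a small data model. In the sketch below, `TimedAnnotation` and `find_moments` are hypothetical names, and plain keyword matching stands in for the platform's natural-language retrieval, which this page does not describe in detail.

```python
from typing import NamedTuple


class TimedAnnotation(NamedTuple):
    start: float       # seconds from the beginning of the video
    end: float
    description: str   # e.g. a visual description or detected on-screen text


def find_moments(query: str, annotations: list[TimedAnnotation]) -> list[TimedAnnotation]:
    """Return annotations whose description contains every query word.

    Keyword matching is only a stand-in here; it is not how the
    platform's natural-language search works.
    """
    words = query.lower().split()
    return [a for a in annotations
            if all(w in a.description.lower() for w in words)]


if __name__ == "__main__":
    timeline = [
        TimedAnnotation(0.0, 30.0, "Presenter standing at a podium"),
        TimedAnnotation(30.0, 60.0, "A red car driving along a coastal road"),
        TimedAnnotation(60.0, 90.0, "Close-up of a car dashboard"),
    ]
    for hit in find_moments("red car", timeline):
        print(f"Jump to {hit.start:.0f}s: {hit.description}")
```

Because each hit carries a start time, a player can seek directly to the matching moment instead of opening the asset at the beginning.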

| Capability | What it does |
|------------|--------------|
| Transcript search | Lets users search what is said in the audio track |
| Time-coded visual description | Describes visual content at intervals throughout the video |
| Time-coded text detection | Detects on-screen text with time ranges |
| Time-coded object / concept detection | Identifies objects and scenes over time |
| Deep video search | Supports natural-language retrieval of visually relevant video moments |

Approximate cost: visual and multimodal video analysis is usually a later-phase decision because it is more expensive than transcription and often needs targeted rollout by folder, media type, or content priority.

For many organizations, deep video search is the feature that demos best, but it is not always the right Day 1 default.


Specialized Processors vs. Modern Multimodal Search

Some AI capabilities are specialized detectors such as text detection, object detection, or celebrity recognition. Others are multimodal / LLM-oriented capabilities focused on richer natural-language retrieval.

Both remain useful, but they serve different goals:

  • Specialized processors are best when you need a very specific signal, such as visible text, known public figures, or explicit object categories
  • Multimodal / LLM search is best when users want to search naturally, using descriptions and concepts rather than rigid keywords

For new deployments, the recommendation is usually:

  1. start with transcription
  2. add image discovery
  3. add richer multimodal search
  4. add specialized processors where they solve a real business need

This avoids over-configuring features that users may not actually need at go-live.


Processor Dependencies

Some AI processors depend on others running first. The most important dependency chain is:

AudioExtraction (MP3 proxy)
  └── Transcribe
        └── downstream transcript-based search features

Video visual analysis also depends on screenshot or frame extraction. See Turning On/Off Asset Processors and Dependencies for the detailed dependency map.
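
A dependency chain like this can be checked before rollout with a topological sort. The sketch below uses Python's standard `graphlib`; `AudioExtraction` and `Transcribe` are taken from the chain above, while `TranscriptSearch`, `FrameExtraction`, and `VideoVisualAnalysis` are placeholder names chosen for illustration and may not match the real processor identifiers.

```python
from graphlib import TopologicalSorter

# Maps each processor to the processors that must finish first.
# Only "AudioExtraction" and "Transcribe" appear on this page; the
# remaining names are hypothetical placeholders.
DEPENDENCIES = {
    "AudioExtraction": set(),
    "Transcribe": {"AudioExtraction"},
    "TranscriptSearch": {"Transcribe"},
    "FrameExtraction": set(),
    "VideoVisualAnalysis": {"FrameExtraction"},
}


def run_order(dependencies):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def missing_prerequisites(enabled, dependencies):
    """Map each enabled processor to its direct prerequisites that are not enabled."""
    problems = {}
    for processor in enabled:
        missing = dependencies.get(processor, set()) - enabled
        if missing:
            problems[processor] = missing
    return problems


if __name__ == "__main__":
    print(run_order(DEPENDENCIES))
    # Transcription enabled without the MP3 proxy step:
    print(missing_prerequisites({"Transcribe"}, DEPENDENCIES))
```

A check of this kind catches the common mistake of enabling transcription in a folder where audio proxy generation is switched off.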


Scoping and Rollout Controls

AI capabilities are enabled via the system configuration file using the same processorList structure as other Nomad processors. Common controls include:

  • enabled: true/false
  • urlMatches for folder prefixes
  • urlExcludes for exclusions
  • rule for expression-based conditions such as file type, age, size, or other metadata

This means you can roll AI out gradually by:

  • folder
  • territory or department
  • year
  • file extension
  • content type

See Rules Engine Overview and Turning On/Off Asset Processors and Dependencies for the configuration details.
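
The way these controls combine can be sketched as a small model. Only the key names `enabled`, `urlMatches`, `urlExcludes`, and `rule` come from this page; the dictionary layout, the prefix-matching semantics, the folder names, and the use of a Python callable in place of a rule expression are all assumptions made for illustration. The actual configuration syntax is described in the pages referenced above.

```python
def processor_applies(entry: dict, asset: dict) -> bool:
    """Decide whether a processor entry would apply to an asset (illustrative model)."""
    if not entry.get("enabled", False):
        return False
    url = asset["url"]
    # Assumption: urlMatches and urlExcludes are treated as folder prefixes.
    matches = entry.get("urlMatches", [])
    if matches and not any(url.startswith(prefix) for prefix in matches):
        return False
    if any(url.startswith(prefix) for prefix in entry.get("urlExcludes", [])):
        return False
    # Assumption: a callable stands in for the real rule expression.
    rule = entry.get("rule")
    return bool(rule(asset)) if rule else True


# Hypothetical entry: transcribe speech-first folders, skip a music subfolder,
# and limit processing to recent MP3/MP4 files.
transcribe_entry = {
    "name": "Transcribe",
    "enabled": True,
    "urlMatches": ["content/podcasts/", "content/interviews/"],
    "urlExcludes": ["content/podcasts/music-beds/"],
    "rule": lambda asset: asset["extension"] in {"mp3", "mp4"} and asset["year"] >= 2020,
}

if __name__ == "__main__":
    asset = {"url": "content/podcasts/ep1.mp3", "extension": "mp3", "year": 2023}
    print(processor_applies(transcribe_entry, asset))
```

Widening a rollout then amounts to adding another prefix or relaxing the rule, rather than switching the processor on globally.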


Enabling AI Later Is Safe

AI does not have to be decided all at once.

  • You can start with transcription now and add deeper capabilities later
  • You can scope later capabilities to only selected folders or media types
  • You can reprocess existing content to add only the outputs that are missing

Retroactive processing: newly enabled AI processors can be applied later using Reprocessing Assets. The system adds missing outputs without re-running processors that have already completed successfully.
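
The "add only what is missing" behavior amounts to a set difference per asset. The sketch below is a conceptual model with hypothetical asset IDs and processor names; it does not represent how Reprocessing Assets is implemented.

```python
def reprocessing_plan(enabled: set[str],
                      completed_by_asset: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each asset to the enabled processors it still lacks.

    Assets that already have every enabled output are left out of the plan.
    """
    plan = {}
    for asset_id, completed in completed_by_asset.items():
        missing = enabled - completed
        if missing:
            plan[asset_id] = missing
    return plan


if __name__ == "__main__":
    enabled = {"Transcribe", "ImageDescription"}  # hypothetical processor names
    history = {
        "asset-001": {"Transcribe"},
        "asset-002": {"Transcribe", "ImageDescription"},
        "asset-003": set(),
    }
    print(reprocessing_plan(enabled, history))
```

In this model, asset-002 is skipped entirely and asset-001 only receives the newly enabled output.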

The main operational caution is cost: enabling a new processor across a large historical catalog can create a catch-up spike while the back catalog is processed.
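
One way to soften a catch-up spike is to reprocess the back catalog in waves under a spending cap. The sketch below is a simple greedy planner; the folder names, cost figures, and cap are hypothetical, and the per-folder costs would need to come from your own estimates.

```python
def plan_waves(folder_costs: dict[str, float], monthly_cap: float) -> list[list[str]]:
    """Group folders into consecutive waves whose estimated cost stays under the cap.

    Folders are taken in the order given (for example, highest priority first).
    A folder that exceeds the cap on its own is placed in a wave by itself.
    """
    if monthly_cap <= 0:
        raise ValueError("monthly_cap must be positive")
    waves: list[list[str]] = []
    current: list[str] = []
    spent = 0.0
    for folder, cost in folder_costs.items():
        if current and spent + cost > monthly_cap:
            waves.append(current)
            current, spent = [], 0.0
        current.append(folder)
        spent += cost
    if current:
        waves.append(current)
    return waves


if __name__ == "__main__":
    estimates = {  # hypothetical estimated costs in USD
        "news/2024/": 400.0,
        "news/2023/": 500.0,
        "news/2022/": 300.0,
        "archive/": 1500.0,
    }
    for number, wave in enumerate(plan_waves(estimates, monthly_cap=1000.0), start=1):
        print(f"Wave {number}: {wave}")
```

Each wave can then be enabled by extending the folder scope of the processor, which keeps spend predictable while the historical catalog catches up.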


Deployment-Specific Providers

Provider details can vary by deployment and by capability. Use the capability guidance on this page to decide what to enable, then consult the provider-specific setup pages for how your environment is configured.


Related Pages