AI Enablement Guide
Which AI capabilities are available in Nomad Media, how to phase them in, what they depend on, and which rollout order is usually best.
Nomad Media supports a range of AI-powered enrichment capabilities for images, audio, and video content. All AI processing is optional, individually toggleable, and governed by the Rules Engine — meaning each capability can be enabled for specific folders, file types, years, or content categories independently.
For information on data security, model providers, and how AI-generated metadata is stored, see AI Metadata Overview. For rollout planning, see Phased AI Rollout.
Recommended Rollout Order
If you are enabling AI for the first time, the most successful approach is usually to start with the lowest-cost, highest-utility capabilities and layer richer search on top only after users have validated the value.
| Phase | Recommended capability | Why it usually comes first |
|---|---|---|
| Day 1 | Audio + video transcription | Delivers immediate search value across spoken content at relatively low cost |
| Next | Image enrichment / image search | Expands natural-language discovery into image libraries very cost-effectively |
| Then | LLM-enhanced search on transcript content | Makes transcript search much richer and more flexible than keyword-only matching |
| Later | Deep video visual search | Powerful and demo-friendly, but typically higher-cost and not required for every go-live |
Day 1 Recommendation
For most deployments, the safest default is:
- Enable transcription for speech-first audio and video content
- Let transcript text become searchable
- Add VTT subtitles for playback
- Roll richer search and deeper visual analysis in later waves
This delivers immediate value without forcing a full AI spend at go-live.
Audio AI Processors
| Processor | What it does |
|---|---|
| Transcription | Generates a full text transcript of the spoken content |
| Subtitles / Captions | Generates a subtitle file with word-level or segment-level timecodes (SRT / VTT output) |
| Sentiment analysis | Detects emotional tone at the segment level (excited, happy, sad, frustrated, etc.) |
Dependency: All audio AI processors require an MP3 proxy to be generated first. Ensure audio proxy generation is enabled for any folder where audio AI processing is intended — including folders containing video files, since audio is extracted from video for transcription purposes. See Proxy Generation Overview.
Approximate cost: about $2 per hour of audio for transcription.
What Day 1 Transcription Actually Gives You
When transcription is enabled for audio or video:
- the spoken words are transcribed
- the transcript text is indexed for search
- a VTT subtitle file is generated for web playback
- an SRT subtitle file can optionally be generated when needed for downstream editing or delivery workflows
This is why transcription is usually the first AI feature to enable: it improves both findability and playback usability with a single pipeline.
Subtitle Files vs. Search Indexing
These are related, but not the same thing:
- Transcript indexing is what makes spoken content searchable
- VTT / SRT files are subtitle outputs used for playback or export workflows
- VTT is the standard web subtitle format
- SRT is commonly used in editing and broadcast-adjacent tooling
In other words, users do not search the VTT file itself. They search the transcript text after it has been indexed.
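The distinction can be sketched in a few lines of Python. This is an illustration only, not Nomad Media's implementation: the `Segment` shape and the function names are hypothetical, and the only things taken as given are the standard WebVTT and SRT cue layouts. The same timed segments feed two different outputs: subtitle files for playback or export, and plain text for the search index.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment; times are in seconds (hypothetical shape)."""
    start: float
    end: float
    text: str


def _timestamp(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_vtt(segments: list[Segment]) -> str:
    """Render segments as WebVTT: a WEBVTT header, then cues using '.' before milliseconds."""
    cues = [
        f"{_timestamp(s.start, '.')} --> {_timestamp(s.end, '.')}\n{s.text}"
        for s in segments
    ]
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as SRT: numbered cues using ',' before milliseconds."""
    cues = [
        f"{i}\n{_timestamp(s.start, ',')} --> {_timestamp(s.end, ',')}\n{s.text}"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n\n".join(cues) + "\n"


def to_index_text(segments: list[Segment]) -> str:
    """Plain transcript text with no timecodes, the form a search index would ingest."""
    return " ".join(s.text for s in segments)


segments = [
    Segment(0.0, 2.5, "Welcome to the show."),
    Segment(2.5, 6.04, "Today we talk about archives."),
]
print(to_vtt(segments))
print(to_index_text(segments))
```

The subtitle renderers keep timecodes because players need them, while the index text drops them because search matches words rather than cues.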
Known Limitation: Music Content
Standard AI transcription models are designed for speech. Performance on music content — instrumental tracks, vocals over music, or dense audio mixes — is poor. Results may be incomplete, inaccurate, or unreliable.
If your library contains significant music content, consider:
- Scoping audio transcription rules to folders that contain speech-first content (e.g., content/podcasts/, content/interviews/)
- Excluding music-heavy folders explicitly using urlExcludes
- Contacting Nomad Media support to discuss music-specific AI model options
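As a rough mental model of how include and exclude scoping might combine, the Python sketch below treats both lists as plain folder prefixes and lets an exclude override an include. Both behaviours are assumptions made for illustration; the actual matching semantics of urlMatches and urlExcludes are defined in the Rules Engine documentation. The `in_scope` helper and the `music-beds` subfolder are hypothetical.

```python
def in_scope(path: str, url_matches: list[str], url_excludes: list[str]) -> bool:
    """Return True when path matches an include prefix and no exclude prefix.

    Assumption: both lists hold plain folder prefixes, and excludes win.
    """
    included = any(path.startswith(prefix) for prefix in url_matches)
    excluded = any(path.startswith(prefix) for prefix in url_excludes)
    return included and not excluded


url_matches = ["content/podcasts/", "content/interviews/"]
url_excludes = ["content/podcasts/music-beds/"]  # hypothetical music-heavy subfolder

for path in [
    "content/podcasts/ep-01.mp3",
    "content/podcasts/music-beds/theme.mp3",
    "content/music/album/track-01.mp3",
]:
    print(path, in_scope(path, url_matches, url_excludes))
```

Under these assumptions only the first path would be transcribed: the second is excluded explicitly, and the third never matched an include prefix.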
Image AI Capabilities
Image enrichment is usually the next feature to add after transcription because it is inexpensive and expands natural-language search across still images.
| Capability | What it does |
|---|---|
| Visual description | Generates natural-language descriptions of what appears in the image |
| Text detection | Extracts visible text from signs, titles, lower thirds, packaging, and other in-image text |
| Object / concept detection | Identifies objects, scenes, and broad visual concepts |
| Celebrity recognition | Identifies known public figures when that use case is needed |
| Multimodal / LLM analysis | Produces richer search-oriented understanding for natural-language discovery |
Approximate cost: image enrichment is typically one of the most cost-effective AI options in the platform, often about $1 per 1,000 images for baseline enrichment, depending on the processors enabled.
For new rollouts, start with the capabilities that improve search and discovery first. Add specialized detectors only when there is a clear business need.
Video AI Capabilities
Video AI can work in two different ways:
- Audio-driven — transcription and transcript search based on what is spoken
- Visual / multimodal — time-coded analysis of what is happening on screen
Time-coded video AI makes it possible to search for a concept and jump into a relevant segment rather than only finding the full asset.
| Capability | What it does |
|---|---|
| Transcript search | Lets users search what is said in the audio track |
| Time-coded visual description | Describes visual content at intervals throughout the video |
| Time-coded text detection | Detects on-screen text with time ranges |
| Time-coded object / concept detection | Identifies objects and scenes over time |
| Deep video search | Supports natural-language retrieval of visually relevant video moments |
Approximate cost: visual and multimodal video analysis is usually a later-phase decision because it is more expensive than transcription and often needs targeted rollout by folder, media type, or content priority.
For many organizations, deep video search is the feature that demos best, but it is not always the right Day 1 default.
Specialized Processors vs. Modern Multimodal Search
Some AI capabilities are specialized detectors such as text detection, object detection, or celebrity recognition. Others are multimodal / LLM-oriented capabilities focused on richer natural-language retrieval.
Both remain useful, but they serve different goals:
- Specialized processors are best when you need a very specific signal, such as visible text, known public figures, or explicit object categories
- Multimodal / LLM search is best when users want to search naturally, using descriptions and concepts rather than rigid keywords
For new deployments, the recommendation is usually:
- start with transcription
- add image discovery
- add richer multimodal search
- add specialized processors where they solve a real business need
This avoids over-configuring features that users may not actually need at go-live.
Processor Dependencies
Some AI processors depend on others running first. The most important dependency chain is:
AudioExtraction (MP3 proxy)
└── Transcribe
    └── downstream transcript-based search features
Video visual analysis also depends on screenshot or frame extraction. See Turning On/Off Asset Processors and Dependencies for the detailed dependency map.
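One way to reason about these chains is as a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to work out which processors a target needs and in what order. `AudioExtraction` and `Transcribe` come from the chain above; `TranscriptSearch`, `FrameExtraction`, and `VideoVisualAnalysis` are placeholder names invented for the example, and the code does not describe how Nomad Media schedules processors internally.

```python
from graphlib import TopologicalSorter

# Map each processor to the processors it depends on.
DEPENDS_ON = {
    "AudioExtraction": set(),
    "Transcribe": {"AudioExtraction"},
    "TranscriptSearch": {"Transcribe"},           # hypothetical name
    "FrameExtraction": set(),                     # hypothetical name
    "VideoVisualAnalysis": {"FrameExtraction"},   # hypothetical name
}


def required_processors(target: str) -> set[str]:
    """Return the target plus everything it transitively depends on."""
    needed: set[str] = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDS_ON[name])
    return needed


def run_order(target: str) -> list[str]:
    """Return an execution order for the target with dependencies first."""
    needed = required_processors(target)
    graph = {name: DEPENDS_ON[name] & needed for name in needed}
    return list(TopologicalSorter(graph).static_order())


print(run_order("TranscriptSearch"))
# → ['AudioExtraction', 'Transcribe', 'TranscriptSearch']
```

The practical takeaway matches the text: enabling a downstream feature without its upstream proxy or extraction step leaves it with nothing to work on.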
Scoping and Rollout Controls
AI capabilities are enabled via the system configuration file using the same processorList structure as other Nomad processors. Common controls include:
- enabled: true/false
- urlMatches for folder prefixes
- urlExcludes for exclusions
- rule for expression-based conditions such as file type, age, size, or other metadata
This means you can roll AI out gradually by:
- folder
- territory or department
- year
- file extension
- content type
See Rules Engine Overview and Turning On/Off Asset Processors and Dependencies for the configuration details.
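To make the idea of a gradual rollout concrete, here is a small Python model of three waves. It is a sketch under stated assumptions, not the real configuration format: `ProcessorScope` is a hypothetical in-memory stand-in for a scoped processor entry, the `rule` is written as a Python predicate rather than in the Rules Engine's expression syntax, and `ImageEnrichment` and `DeepVideoSearch` are placeholder processor names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Asset:
    path: str
    extension: str
    year: int


@dataclass
class ProcessorScope:
    """Hypothetical stand-in for one scoped processor entry."""
    name: str
    enabled: bool
    url_matches: list[str]
    rule: Callable[[Asset], bool]  # stands in for an expression-based rule


def applicable(asset: Asset, scopes: list[ProcessorScope]) -> list[str]:
    """Names of processors whose switch, folder prefix, and rule all accept the asset."""
    return [
        s.name
        for s in scopes
        if s.enabled
        and any(asset.path.startswith(p) for p in s.url_matches)
        and s.rule(asset)
    ]


scopes = [
    # Wave 1: transcription for speech-first folders.
    ProcessorScope("Transcribe", True, ["content/interviews/"],
                   lambda a: a.extension in {"mp4", "mov", "mp3", "wav"}),
    # Wave 2: image enrichment, limited to recent material.
    ProcessorScope("ImageEnrichment", True, ["content/photos/"],
                   lambda a: a.extension in {"jpg", "png"} and a.year >= 2023),
    # Wave 3: defined but switched off until a later phase.
    ProcessorScope("DeepVideoSearch", False, ["content/interviews/"],
                   lambda a: a.extension == "mp4"),
]

print(applicable(Asset("content/interviews/ceo.mp4", "mp4", 2021), scopes))
print(applicable(Asset("content/photos/launch.jpg", "jpg", 2024), scopes))
print(applicable(Asset("content/photos/archive.jpg", "jpg", 2015), scopes))
```

Flipping `enabled` on the third entry is all a later wave would need in this model, which is the property that makes phased rollout cheap to plan.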
Enabling AI Later Is Safe
AI does not have to be decided all at once.
- You can start with transcription now and add deeper capabilities later
- You can scope later capabilities to only selected folders or media types
- You can reprocess existing content to add only the outputs that are missing
Retroactive processing: newly enabled AI processors can be applied later using Reprocessing Assets. The system adds missing outputs without re-running processors that have already completed successfully.
The main operational caution is cost: enabling a new processor across a large historical catalog can create a catch-up spike while the back catalog is processed.
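A back-of-the-envelope estimate can help size that spike before enabling a processor across a back catalog. The sketch below counts only assets that are still missing an output, mirroring the reprocessing behaviour described above, and prices them with the approximate figures quoted on this page (about $2 per audio hour for transcription and about $1 per 1,000 images for baseline enrichment). The `Asset` shape, the `completed` set, and the `ImageEnrichment` name are hypothetical, and real pricing depends on the deployment and the processors enabled.

```python
from dataclasses import dataclass, field

# Approximate figures quoted on this page; actual pricing varies by deployment.
TRANSCRIPTION_USD_PER_AUDIO_HOUR = 2.0
IMAGE_USD_PER_1000 = 1.0


@dataclass
class Asset:
    kind: str                     # "audio", "video", or "image"
    duration_hours: float = 0.0   # left at 0 for images
    completed: set[str] = field(default_factory=set)  # processors already run


def pending(assets: list[Asset], processor: str) -> list[Asset]:
    """Assets that do not yet have output from the given processor."""
    return [a for a in assets if processor not in a.completed]


def catch_up_cost(assets: list[Asset]) -> float:
    """Estimated USD cost of processing only the missing outputs."""
    speech = [a for a in pending(assets, "Transcribe") if a.kind in {"audio", "video"}]
    images = [a for a in pending(assets, "ImageEnrichment") if a.kind == "image"]
    transcription = sum(a.duration_hours for a in speech) * TRANSCRIPTION_USD_PER_AUDIO_HOUR
    enrichment = len(images) / 1000 * IMAGE_USD_PER_1000
    return transcription + enrichment


assets = [
    Asset("audio", 1.5),                    # never transcribed
    Asset("video", 2.0, {"Transcribe"}),    # already transcribed, so skipped
] + [Asset("image") for _ in range(500)]    # 500 unenriched images

print(f"${catch_up_cost(assets):.2f}")
# → $3.50
```

Running the same estimate per folder or per year gives a quick way to decide which slices of the catalog to backfill first.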
Deployment-Specific Providers
Provider details can vary by deployment and by capability. Use the capability guidance on this page to decide what to enable, then consult the provider-specific setup pages for how your environment is configured.
Related Pages
- AI Metadata Overview — data security, model providers, and storage architecture
- Search and Discovery — how transcript, keyword, and multimodal search behave
- Phased AI Rollout — suggested Day 1 go-live and test-bed strategy
- Rules Engine Overview — how to scope processors to specific folders and file types
- Turning On/Off Asset Processors and Dependencies — full JSON configuration reference
- Proxy Generation Overview — proxy dependencies for AI processing
- Reprocessing Assets — applying AI processors retroactively to existing content
