
Azure GPU Hosting Options for Open-Source AI Models

Executive Summary

This document provides comprehensive guidance for hosting open-source AI models (Hugging Face, Whisper, Pyannote, etc.) on Azure, with a focus on Azure Container Apps (ACA) with serverless GPUs. It addresses regional availability, cost optimization, and best practices for production deployments, specifically for long-running audio analysis workloads (1-2 hours) with speaker diarization.


🎯 Key Use Case: Long Audio File Analysis (1-2 Hours)

Your scenario: Processing long audio files (1-2 hours) with custom open-source models using serverless GPU

Critical Cost Comparison

Hosting Option Billing Model Cost When Idle Best For
ACA Serverless GPU Per-second $0 (scale-to-zero) Variable workloads, batch jobs (supported regions)
Azure Batch (Spot VMs) Spot pricing (~90% off) $0 (scale-to-zero) Maximum savings, batch jobs, any region
AI Foundry Managed Compute Per-hour (VM uptime) Continues charging 24/7 consistent workloads
Azure Speech Batch Transcription Per-audio-hour $0 (pay-per-use) Standard Whisper transcription
Azure Speech Fast Transcription Per-audio-hour $0 (pay-per-use) Files <2 hours, <300MB

💰 Cost Analysis for Long Audio Processing

Scenario: Processing 10 audio files of 2 hours each per day

Option Active Processing Idle Hours (22h) Monthly Estimate
ACA Serverless GPU (T4) Pay only for ~20h processing $0 Lower
AI Foundry Managed Compute Pay for 24h VM Still paying ~3-4x higher

Winner for batch/variable workloads: ✅ ACA Serverless GPU


1. Azure Container Apps (ACA) with Serverless GPUs

Overview

Azure Container Apps serverless GPUs provide a middle ground between:
  • Azure AI Foundry serverless APIs (fully managed, pay-per-token)
  • Managed compute (dedicated VMs billed for uptime, even when idle)

Key Benefits for Long Audio Processing

Feature Description Benefit for Audio Workloads
Scale-to-Zero GPUs scale down when not in use Pay $0 between processing jobs
Per-Second Billing Pay only for actual GPU compute time 2-hour job = pay for 2 hours only
Data Governance Data never leaves container boundaries Audio files stay in your container
No Infrastructure Management Serverless - no driver installation Focus on model, not infrastructure
Automatic Scaling Scales based on workload demand Handle bursts of audio files
Jobs Support ACA Jobs for batch processing Perfect for audio batch processing

GPU Types Available

GPU Type Memory Approx. Cost/Hour Best For Recommendation for Audio
NVIDIA T4 16GB VRAM ~$0.32/hr Cost-effective inference, models <10GB ✅ Recommended for Whisper/audio models
NVIDIA A100 80GB HBM2e ~$2.34/hr Large models >15GB, training Overkill for most audio processing

Cost Insight: A100 provides ~4× the performance but costs ~7-8× more than T4. For Whisper + Pyannote (~8-10GB VRAM combined), T4 is the cost-effective choice.

ACA Billing Details

Billing = (vCPU-seconds × rate) + (GiB-seconds × rate) + (GPU-seconds × rate)

✅ When scaled to ZERO replicas = NO charges
✅ Per-second precision billing
✅ GPU-seconds billed only when GPU is allocated

Important: Idle usage charges do NOT apply to serverless GPU apps - they are billed for active usage only.
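As an illustration, the formula above can be turned into a small estimator. The vCPU and memory rates below are placeholders, and the GPU rate is this document's T4 estimate; always confirm against the Azure pricing page:

```python
# Sketch of the ACA consumption billing formula above.
# VCPU_RATE and MEM_RATE are ILLUSTRATIVE placeholders, not official Azure prices.
VCPU_RATE = 0.000024       # $ per vCPU-second (placeholder)
MEM_RATE = 0.000003        # $ per GiB-second (placeholder)
GPU_T4_RATE = 0.32 / 3600  # ~$0.32/hr expressed per GPU-second (estimate from this doc)

def aca_cost(vcpus: float, mem_gib: float, gpus: int, seconds: float) -> float:
    """Cost for one replica running for `seconds` seconds."""
    return seconds * (vcpus * VCPU_RATE + mem_gib * MEM_RATE + gpus * GPU_T4_RATE)

# A 2-hour audio job on 8 vCPU / 32 GiB / 1x T4:
job = aca_cost(vcpus=8, mem_gib=32, gpus=1, seconds=2 * 3600)
# Scaled to zero between jobs -> the remaining 22 hours cost exactly $0.
idle = aca_cost(vcpus=0, mem_gib=0, gpus=0, seconds=22 * 3600)
```

The key property is the last line: with zero replicas allocated, every term in the formula is zero.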

🚨 Regional Availability - CRITICAL LIMITATION

ACA serverless GPUs are only available in specific regions. Many regions (including UK, Germany, and others) are NOT supported. Check the table below for current availability.

Region A100 T4
Australia East ✅ Yes ✅ Yes
Brazil South ✅ Yes ✅ Yes
Canada Central ✅ Yes ✅ Yes
Central India ❌ No ✅ Yes
East US ✅ Yes ✅ Yes
France Central ❌ No ✅ Yes
Italy North ✅ Yes ✅ Yes
Japan East ❌ No ✅ Yes
North Central US ❌ No ✅ Yes
South Central US ❌ No ✅ Yes
South East Asia ❌ No ✅ Yes
South India ❌ No ✅ Yes
Sweden Central ✅ Yes ✅ Yes
West Europe* ❌ No ✅ Yes
West US ✅ Yes ✅ Yes
West US 2 ❌ No ✅ Yes
West US 3 ✅ Yes ✅ Yes
Canada East ❌ Not available ❌ Not available

*West Europe requires creating a new workload profile environment.

Considerations for ACA Serverless GPUs

  1. CUDA Version: Supports latest CUDA version
  2. Single GPU per Container: Only one container in an app can use the GPU at a time
  3. No Fractional GPUs: Multi and fractional GPU replicas aren't supported
  4. Quota Required: Must request GPU quotas via Azure support case (EA/Pay-as-you-go have default quota)
  5. Instance Limits: Max 1× A100 GPU, 24 vCPU, 96 GB RAM per container instance
  6. Network: Consumes one IP address per replica when using VNet integration
  7. Foundry Models (Preview): You can deploy Microsoft Foundry models (MLFLOW type) directly to ACA serverless GPUs using az containerapp up with --model-registry, --model-name, and --model-version parameters
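Item 7 can be sketched with the documented `--model-registry`, `--model-name`, and `--model-version` flags. All resource names below are placeholders, and the model identifiers are left for you to fill in from the Foundry catalog; treat this as an outline of a preview feature, not a verified recipe:

```shell
# Sketch: deploy a Foundry model (MLFLOW type) to ACA serverless GPUs (preview).
# Resource names are placeholders; supply a real model name/version from the catalog.
az containerapp up \
  --name audio-model-app \
  --resource-group my-rg \
  --location westus3 \
  --model-registry azureml \
  --model-name <model-name> \
  --model-version <model-version>
```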

Why NOT Other Azure Services?

Alternative Scale-to-Zero? Why Not Preferred
Azure ML Online Endpoints ❌ No Requires at least 1 GPU node running 24/7 - incurs idle costs
AKS with GPU Node Pools ⚠️ Possible but complex Can configure scale-to-zero but requires Kubernetes expertise, longer cold starts (minutes)
Azure Batch ✅ Yes Good for batch jobs but not real-time serving, more setup
Azure Functions ❌ No GPU support No GPU on Consumption plan
Dedicated GPU VMs ❌ No Manual management, risk of forgetting to shut down

2. Azure AI Foundry Model Hosting Options

About Azure AI Foundry

Azure AI Foundry catalogs 11,000+ models including open-source Hugging Face models. It offers two deployment modes:

  1. Serverless API - Managed, pay-per-token (uses ephemeral containers under the hood)
  2. Managed Compute - Dedicated VMs you provision (billed per VM uptime)

Deployment Options Comparison (With Cost Impact)

Option Billing Model Idle Cost Hugging Face Support All Regions
Serverless API Pay-per-token $0 ❌ Limited (popular models only) ✅ Most
Managed Compute Pay-per-VM-hour 💸 Continues charging ✅ Full ✅ Yes
ACA Serverless GPU Pay-per-second $0 (scale-to-zero) ✅ Full ❌ Limited (17 regions)

⚠️ Managed Compute Cost Warning

AI Foundry Managed Compute charges per VM uptime, NOT per inference.

If your workload is variable (e.g., processing audio files during business hours only), you'll pay for:
  • Active processing time
  • PLUS all idle time the VM is running

This can result in 3-4x higher costs compared to serverless GPU for batch workloads.

When Serverless API Won't Work

Not every model can run on the serverless footprint. Models that exceed these limits require Managed Compute:

  • GPU: More than 1Γ— A100 (multi-GPU inference)
  • CPU: More than 24 vCPUs
  • RAM: More than 96 GB

Large models requiring multi-GPU for inference cannot use the simple serverless deployment.

Hugging Face Models in AI Foundry

Important: Hugging Face models in Azure AI Foundry are NOT available as serverless APIs. They require Managed Compute deployment.

"Managed compute deployment is required for model collections that include: Hugging Face, NVIDIA inference microservices (NIMs), Industry models, Databricks, Custom models"


3. ALL Azure Options for Long Audio Processing (1-2 Hours)

Complete Options Matrix

Option Model Flexibility File Size Limit Billing Scale-to-Zero Custom Models Pyannote Region Availability
ACA Serverless GPU ✅ Any model Unlimited Per-second ✅ Yes ✅ Yes ✅ Yes ⚠️ 17 regions only
ACA Jobs + GPU ✅ Any model Unlimited Per-second ✅ Yes ✅ Yes ✅ Yes ⚠️ 17 regions only
Azure Batch (Spot VMs) ✅ Any model Unlimited Spot (~90% off) ✅ Yes ✅ Yes ✅ Yes ✅ All regions
Azure Speech Batch Transcription Whisper only Up to 1GB Per-audio-hour ✅ Yes ❌ No ❌ MS only ✅ Most regions
Azure Speech Fast Transcription Whisper only <300MB, <2h Per-audio-hour ✅ Yes ❌ No ❌ MS only ✅ Most regions
Azure OpenAI Whisper Whisper only 25MB Per-token ✅ Yes ❌ No ❌ No ✅ Most regions
AI Foundry Managed Compute ✅ Any model Unlimited Per-VM-hour ❌ No ✅ Yes ✅ Yes ✅ All regions
AKS with GPU Nodes ✅ Any model Unlimited Per-node-hour ⚠️ Manual ✅ Yes ✅ Yes ✅ All regions

Option Details

Option 1: ACA Serverless GPU (✅ RECOMMENDED)

Best for: Custom open-source models, variable workloads, long audio files

Architecture:
┌─────────────────┐     ┌─────────────────────────────────────┐
│  Audio Files    │────▶│  Azure Container Apps               │
│  (Blob Storage) │     │  - Serverless GPU (T4)              │
└─────────────────┘     │  - Custom Whisper/Hugging Face      │
                        │  - Scale-to-zero when idle          │
                        │  - Per-second billing               │
                        └─────────────────────────────────────┘

Pros:
  • ✅ Per-second billing (pay only during processing)
  • ✅ Scale-to-zero (no cost when idle)
  • ✅ Use ANY open-source model
  • ✅ Process files of ANY size
  • ✅ Full data governance

Cons:
  • ❌ Limited to 17 supported regions (see availability table)
  • ❌ Requires container image management
  • ❌ Cold start time (mitigated with artifact streaming)
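A minimal deployment sketch for this option. The workload-profile type name below is the T4 consumption profile as documented for ACA serverless GPUs, but verify the exact name and availability for your region; all resource names and the image are placeholders:

```shell
# Sketch: add a serverless T4 GPU workload profile, then deploy with scale-to-zero.
# Resource names and image are placeholders; verify the profile type for your region.
az containerapp env workload-profile add \
  --name my-aca-env --resource-group my-rg \
  --workload-profile-name gpu-t4 \
  --workload-profile-type Consumption-GPU-NC8as-T4

az containerapp create \
  --name whisper-pyannote --resource-group my-rg \
  --environment my-aca-env \
  --workload-profile-name gpu-t4 \
  --image myacr.azurecr.io/whisper-pyannote:latest \
  --min-replicas 0 --max-replicas 3
```

With `--min-replicas 0`, the app scales to zero between jobs, which is what produces the $0 idle cost discussed above.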

Option 2: ACA Jobs with GPU (✅ EXCELLENT FOR BATCH)

Best for: Scheduled batch processing of multiple audio files

Architecture:
┌─────────────────┐     ┌─────────────────────────────────────┐
│  Queue/Schedule │────▶│  Azure Container Apps Job           │
│  (Storage Queue)│     │  - Event/Schedule triggered         │
└─────────────────┘     │  - Serverless GPU (T4)              │
                        │  - Runs, processes, exits           │
                        │  - Pay only during execution        │
                        └─────────────────────────────────────┘

Job Types:
  • Manual Jobs: Trigger on-demand via API
  • Scheduled Jobs: Cron-based (e.g., nightly batch)
  • Event-driven Jobs: Triggered by queue messages

Pros:
  • ✅ Perfect for batch audio processing
  • ✅ Automatic retry on failure
  • ✅ Pay only during job execution
  • ✅ Scales based on queue depth
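An event-driven job of this kind can be sketched as follows. All resource names, the queue name, and the image are placeholders; the generous replica timeout reflects the 1-2 hour processing window:

```shell
# Sketch: event-driven ACA Job that scales on an Azure Storage Queue.
# Resource names, queue name, and image are placeholders.
az containerapp job create \
  --name audio-batch-job --resource-group my-rg \
  --environment my-aca-env \
  --trigger-type Event \
  --replica-timeout 10800 \
  --min-executions 0 --max-executions 10 \
  --scale-rule-name audio-queue \
  --scale-rule-type azure-queue \
  --scale-rule-metadata "queueName=audio-jobs" "queueLength=1" \
  --scale-rule-auth "connection=queue-connection-string" \
  --image myacr.azurecr.io/audio-processor:latest
```

Each queued audio file spins up one execution; with zero messages, zero replicas run and nothing is billed.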

Option 3: Azure Speech Batch Transcription (For Standard Whisper)

Best for: Standard Whisper transcription without custom models

Capabilities and limits:
- Files up to 1GB
- Concurrent processing
- Diarization support (up to 36 speakers)
- Word-level timestamps

Pros:
  • ✅ Fully managed, no infrastructure
  • ✅ Pay-per-audio-hour
  • ✅ Supports files >25MB (unlike Azure OpenAI Whisper)
  • ✅ Available in Canada

Cons:
  • ❌ Whisper model only (no custom models)
  • ❌ No real-time processing (async only)
  • ❌ May take up to 30 min to start at peak hours
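For orientation, this is roughly what a batch transcription request looks like against the Speech to text v3.2 REST API. The region, key, and audio URL are placeholders, and this sketch only builds the request without sending it:

```python
# Sketch: build a Speech batch transcription request with diarization enabled.
# Region, key, and audio URL are placeholders.
import json
import urllib.request

def build_transcription_request(audio_url: str, locale: str = "en-US") -> dict:
    """Request body for the Speech to text v3.2 batch transcription API."""
    return {
        "contentUrls": [audio_url],
        "locale": locale,
        "displayName": "long-audio-batch",
        "properties": {
            "diarizationEnabled": True,          # Microsoft's built-in diarization
            "wordLevelTimestampsEnabled": True,
        },
    }

def submit(region: str, key: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request creating the transcription job."""
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
        method="POST",
    )

body = build_transcription_request("https://example.com/audio.wav")
```

In production you would send the request with `urllib.request.urlopen` (or `requests`) and poll the returned transcription URL for completion.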

Option 4: Azure Speech Fast Transcription API

Best for: Quick turnaround on files <2 hours

Limits:
- Files less than 2 hours
- Files less than 300MB
- Synchronous response (faster than real-time)

Pros:
  • ✅ Predictable, fast latency
  • ✅ Synchronous results
  • ✅ Available in Canada

Cons:
  • ❌ 2-hour file limit
  • ❌ 300MB size limit
  • ❌ Standard Whisper only
  • ❌ Uses Microsoft's diarization (NOT Pyannote)

Option 5: Azure Batch with Spot VMs (💰 LOWEST COST)

Best for: Maximum cost savings, large-scale batch processing, full infrastructure control

Architecture:
┌─────────────────┐     ┌────────────────────────────────────┐
│  Storage Queue  │────▶│  Azure Batch Pool (Spot GPU VMs)   │
│  (Job Triggers) │     │  - NC-series or ND-series GPUs     │
└─────────────────┘     │  - Spot pricing (~90% off)         │
                        │  - Custom container image          │
                        │  - Auto-scale 0 to N nodes         │
                        └────────────────────────────────────┘

Pros:
  • ✅ Lowest cost option - Spot VMs up to 90% cheaper
  • ✅ Available in ALL Azure regions (including regions without ACA GPU support)
  • ✅ Full GPU VM support - Spot pricing applies to NC, NCv3, ND, NDv2, NV series and all GPU SKUs
  • ✅ Scale-to-zero (pools auto-scale based on job queue)
  • ✅ Full flexibility - any Docker container
  • ✅ Parallel processing across multiple nodes

Cons:
  • ⚠️ Spot VMs can be preempted at any time
  • ⚠️ More infrastructure setup required
  • ⚠️ Need to handle checkpointing for preemption recovery
  • ⚠️ Requires user-subscription pool allocation mode (not available in Batch-service mode)

Spot VM Consideration: Jobs should be designed to handle preemption. For 1-2 hour audio files, consider checkpointing progress or breaking the audio into smaller segments.
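The checkpointing idea can be sketched as plain bookkeeping: split the long file into segments, persist which segments are done, and skip them after a restart. The segment-processing function and the checkpoint path are placeholders; in a real pool the checkpoint would live in Blob Storage, not on local disk:

```python
# Sketch: segment-level checkpointing so a preempted Spot node can resume.
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # use durable storage (e.g. Blob) in production

def load_done() -> set:
    """Segments already completed before a preemption, if any."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def process_with_checkpoints(segments, process_fn):
    """Process audio segments, persisting progress after each one.

    `segments` is a list of segment ids (e.g. 10-minute chunks of a 2-hour file);
    `process_fn` stands in for the actual GPU work.
    """
    done = load_done()
    results = {}
    for seg in segments:
        if seg in done:
            continue  # already processed before preemption - skip on resume
        results[seg] = process_fn(seg)
        done.add(seg)
        CHECKPOINT.write_text(json.dumps(sorted(done)))  # durable progress marker
    return results
```

On a fresh run all segments are processed; after a preemption and restart, only the remaining segments are.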


4. Whisper + Pyannote: Speaker Diarization Considerations

⚠️ Important: No Managed SaaS/PaaS for Pyannote

There is NO Azure managed service that offers Pyannote specifically. Pyannote is an open-source library, not a commercial product, so no cloud provider offers it as a SaaS/PaaS solution.

Service Whisper Speaker Diarization Fully Managed Pyannote?
Azure Speech Batch Transcription ✅ Yes ✅ Microsoft's built-in ✅ Yes ❌ No
Azure Speech Fast Transcription ✅ Yes ✅ Microsoft's built-in ✅ Yes ❌ No
Azure OpenAI Whisper ✅ Yes ❌ No diarization ✅ Yes ❌ No
Self-hosted (ACA, Batch, AKS) ✅ Yes ✅ Pyannote ❌ You manage ✅ Yes

🤔 Key Question: Is Pyannote Actually Required?

Before investing in custom GPU hosting, evaluate if Azure Speech's built-in diarization meets your needs:

Capability Azure Speech Diarization Pyannote
Speaker identification ✅ Yes (up to 36 speakers) ✅ Yes
Overlapping speech ⚠️ Limited ✅ Better
Custom fine-tuning ❌ No ✅ Yes
Accuracy on complex audio Good State-of-the-art
Infrastructure required None (fully managed) GPU container hosting
Available in Canada ✅ Yes Depends on hosting

Recommendation:
  • Try Azure Speech Batch Transcription first - it's fully managed, pay-per-use, and may be "good enough"
  • Only use Pyannote if you need higher accuracy on complex scenarios, overlapping speech detection, or custom fine-tuning

What is Pyannote?

Pyannote Audio is an open-source speaker diarization toolkit that provides:
  • Speaker diarization - identifying "who spoke when"
  • Voice activity detection (VAD)
  • Speaker segmentation and clustering
  • Overlapped speech detection

Why Combine Whisper + Pyannote?

Azure Speech Service has built-in diarization, but Pyannote offers:
  • Higher accuracy for complex scenarios
  • Better handling of overlapping speech
  • More customization options
  • Fine-tuning capability on your own data
  • State-of-the-art performance on benchmarks

🔄 Alternative Hugging Face Models for Speaker Diarization

If Pyannote doesn't fit your needs (licensing, performance, or ease of use), consider these alternatives:

Model/Framework Type License VRAM Best For
Pyannote 3.1 Full pipeline MIT ~2-4 GB State-of-the-art diarization, overlapped speech
NVIDIA NeMo Full toolkit Apache 2.0 ~2-6 GB Enterprise-grade, NVIDIA GPU optimized, Riva deployment
SpeechBrain ECAPA-TDNN Speaker embeddings Apache 2.0 ~1-2 GB Speaker verification, embedding extraction
Wav2Vec2 + clustering DIY approach MIT ~2-4 GB Custom pipelines, research flexibility
WhisperX Whisper + diarization BSD ~4-8 GB Combined transcription + diarization in one

Option 1: NVIDIA NeMo

NVIDIA NeMo is a scalable AI framework with built-in speaker diarization:
  • ✅ Apache 2.0 license (enterprise-friendly)
  • ✅ Optimized for NVIDIA GPUs
  • ✅ Can deploy to production with NVIDIA Riva
  • ✅ Active development, 16k+ GitHub stars
  • ✅ Includes ASR, TTS, and speaker diarization in one toolkit

# NeMo speaker diarization example (sketch - ClusteringDiarizer is driven by a config file)
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Example inference config shipped with the NeMo repository
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM speaker segments per the config

Option 2: SpeechBrain ECAPA-TDNN

SpeechBrain provides speaker embedding models that can be combined with clustering for diarization:
  • ✅ Apache 2.0 license
  • ✅ Pre-trained on VoxCeleb (1M+ downloads/month)
  • ✅ Easy to integrate with custom clustering
  • ⚠️ Requires building your own diarization pipeline

# SpeechBrain speaker embeddings
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, sample_rate = torchaudio.load("audio.wav")  # 16 kHz mono expected
embeddings = classifier.encode_batch(signal)
# Then cluster the embeddings to identify speakers

Option 3: WhisperX (Easiest Combined Solution)

WhisperX combines Whisper transcription with speaker diarization in a single package:
  • ✅ Combines Whisper + Pyannote + forced alignment
  • ✅ Word-level timestamps with speaker attribution
  • ✅ Single package, easy to deploy
  • ⚠️ Uses Pyannote under the hood (requires accepting the Pyannote license)

# WhisperX combined transcription + diarization
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device)
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio)

# Diarization uses Pyannote under the hood - requires a Hugging Face token
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

Comparison: Pyannote vs Alternatives

Criteria Pyannote NVIDIA NeMo SpeechBrain WhisperX
Accuracy (DER) Best (~12-25%) Very Good Good (needs pipeline) Good (uses Pyannote)
Ease of Use Easy Medium Medium Easiest
License MIT (with contact form) Apache 2.0 Apache 2.0 BSD
Enterprise Support Commercial option NVIDIA Riva Community Community
Overlapped Speech ✅ Best ✅ Good ⚠️ Limited ✅ Good
Production Ready ✅ Yes ✅ Yes (Riva) ⚠️ Custom ⚠️ Custom

GPU Memory Requirements

Running both models requires more VRAM:

Model VRAM Usage Notes
Whisper large-v3 ~3-5 GB Varies by batch size
Whisper medium ~2-3 GB Good quality/cost tradeoff
Pyannote 3.0 ~2-4 GB Speaker diarization
Combined ~8-10 GB Both models loaded

Recommendation: NVIDIA T4 (16GB VRAM) is sufficient for Whisper + Pyannote. A100 (80GB) is overkill.

Processing Pipeline

Typical Whisper + Pyannote workflow:

flowchart TD
    subgraph Pipeline["🎵 WHISPER + PYANNOTE PIPELINE"]
        A[🎤 Audio File<br/>1-2 hours] --> B[Pyannote VAD<br/>Voice Activity Detection]
        B --> C[Pyannote Diarization<br/>Who spoke when]
        C --> D[Whisper Transcription<br/>Per speaker segment]
        D --> E[Merge Results<br/>Speaker-attributed transcript]
    end

    C -.-> C1["Speaker Segments:<br/>[Speaker1: 0:00-0:45]<br/>[Speaker2: 0:45-1:30]"]
    E -.-> E1["📄 Final Output:<br/>Timestamped transcript<br/>with speaker labels"]

    style A fill:#E6F3FF
    style E fill:#90EE90
    style E1 fill:#90EE90
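The final "Merge Results" step of the pipeline above is pure bookkeeping: assign each transcription segment the speaker whose diarization turn overlaps it most. A simplified illustration (the dict shapes are assumptions about the intermediate format, not a library API):

```python
# Sketch: merge Whisper segments with diarization turns by maximum time overlap.
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_turns):
    """transcript_segments: [{"start", "end", "text"}, ...] from Whisper.
    speaker_turns: [{"start", "end", "speaker"}, ...] from the diarizer.
    Returns transcript segments annotated with the best-overlapping speaker."""
    merged = []
    for seg in transcript_segments:
        best = max(
            speaker_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
            default=None,
        )
        merged.append({**seg, "speaker": best["speaker"] if best else None})
    return merged
```

Real pipelines (e.g. WhisperX) do this at word level for finer attribution, but the principle is the same.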

Options Comparison for Whisper + Pyannote

Option Whisper Pyannote Billing Scale-to-Zero Region Availability Cost Rating
ACA Serverless GPU ✅ ✅ Per-second ✅ ⚠️ 17 regions 💰 Lowest
Azure Batch (Spot VMs)* ✅ ✅ Spot (~90% off) ✅ ✅ All regions 💰 Lowest
AI Foundry Managed Compute ✅ ✅ Per-VM-hour ❌ ✅ All regions 💰💰💰💰 High
Azure Speech Batch ✅ ❌ Microsoft's Per-audio-hour ✅ ✅ Most regions 💰💰 Medium

*Spot VMs may be preempted - design jobs to handle checkpointing.

Decision Tree for Whisper + Pyannote

flowchart TD
    A[🎯 Whisper + Pyannote<br/>Audio Processing] --> B{Is Pyannote specifically<br/>required?}

    B -->|No| C[✅ Azure Speech Batch Transcription<br/>Simpler, fully managed]

    B -->|Yes| D{Is your region supported<br/>for ACA serverless GPU?}

    D -->|Yes| E[✅ ACA Serverless GPU<br/>Best: per-second billing<br/>Scale-to-zero]

    D -->|No| F{Is workload<br/>variable/batch?}

    F -->|Yes| G[✅ Azure Batch - Spot VMs<br/>Lowest cost ~90% off<br/>Available in ALL regions]

    F -->|No - 24/7| H[AI Foundry Managed Compute<br/>⚠️ Charges when idle]

    style C fill:#90EE90
    style E fill:#90EE90
    style G fill:#90EE90
    style H fill:#FFB366

5. Best Practices for Open-Source Model Hosting on ACA

Cold Start Optimization

For audio processing jobs, cold start matters less since processing time dominates:

Technique Impact When to Use
ACR Artifact Streaming Container starts before full image pull completes Always (requires Premium ACR)
Storage Mounts for Models Model weights load from persistent storage, not re-downloaded Large models (>5GB)
Azure Files/Blob Volume Persists between container restarts, speeds model loading Production deployments
Minimum Replica = 1 Eliminates cold start entirely Latency-critical workloads
Minimum Replica = 0 Maximum cost savings (scale-to-zero) Batch processing (recommended)

For 1-2 hour audio files: Cold start of 30-60 seconds is negligible compared to processing time. Use minReplicas=0 for cost optimization.

Scaling and Concurrency Tuning

Strategy Description Recommendation
Batch inference per replica Process multiple requests per GPU to maximize utilization Avoid over-scaling on every request
Max replicas limit Set maxReplicas to control costs Prevent runaway scaling
Queue-based triggers Use Storage Queue or Service Bus to trigger processing App only spins up when work arrives
HTTP scaling rules KEDA-based autoscaling on concurrent requests Fine-tune scale thresholds
Monitor GPU utilization If GPU underutilized, handle more concurrent load per replica Consider CPU if GPU usage very low

Key Insight: Each ACA replica can only use ONE GPU. Horizontal scaling adds more replicas (each with its own GPU). Size your workload to maximize single-GPU utilization before scaling out.

Cost Optimization Strategies

Strategy Implementation Savings
Use T4 instead of A100 T4 (~$0.32/hr) vs A100 (~$2.34/hr) for Whisper/audio ~85% savings
Scale to Zero Set minReplicas=0 100% during idle
Use ACA Jobs For batch processing workflows More efficient than always-on
Optimize container image Smaller image = faster cold start Reduced startup latency
Mount models from storage Azure Files/Blob instead of baking into image Faster updates, smaller images
Queue-based architecture Only process when work arrives No idle GPU time

6. Recommendations for Unsupported Regions

If your required region is not supported for ACA serverless GPUs (e.g., Canada East, UK, Germany, Middle East, Africa, etc.), here are alternatives:

Option 1: Azure Batch with Spot VMs (✅ RECOMMENDED)

Best for: Whisper + Pyannote in any region with maximum savings

Pros:
  • ✅ Spot pricing - up to 90% off
  • ✅ Available in ALL Azure regions
  • ✅ Full Docker container flexibility
  • ✅ Scale pools based on job queue
  • ✅ Full Hugging Face model support (Whisper, Pyannote)

Cons:
  • ⚠️ Spot VMs can be preempted anytime
  • ⚠️ More infrastructure setup required
  • ⚠️ Need checkpointing for long jobs

Option 2: AI Foundry Managed Compute (Any Region)

Pros:
  • Full Hugging Face model support
  • Available in all Azure regions - data stays in your chosen region
  • Enterprise security features

Cons:
  • ⚠️ Pay per VM uptime (charges even when idle)
  • Higher operational cost for variable workloads

Cost Mitigation:
  • Use Azure Automation to start/stop VMs on schedule
  • Consider whether batch processing can be done in off-hours

Option 3: ACA Serverless GPU in Supported Region (If Cross-Region Processing OK)

Pros:
  • ✅ Scale-to-zero capability
  • ✅ Per-second billing
  • ✅ Significantly lower cost for variable workloads

Cons:
  • Data leaves your local region (compliance concern)
  • Network latency to the supported region (~20-100ms depending on distance)

For audio processing: Latency is usually acceptable since files are uploaded, processed, and results retrieved asynchronously.

Option 4: Azure Speech Batch Transcription (If Standard Whisper + MS Diarization Works)

Pros:
  • ✅ Available in most Azure regions
  • ✅ Pay-per-use (no idle charges)
  • ✅ Handles files up to 1GB
  • ✅ Diarization and word timestamps

Cons:
  • Standard Whisper only (no custom models)
  • Async processing (may take time to start)

flowchart LR
    subgraph Unsupported["🌍 OPTIONS FOR UNSUPPORTED REGIONS"]
        direction TB

        subgraph A["✅ OPTION A: Azure Batch - Spot VMs (RECOMMENDED)"]
            A1[Audio Files<br/>Your Region] --> A2[Azure Batch<br/>Spot GPU Pool]
            A2 --> A3["• Spot pricing ~90% off<br/>• Custom Docker container<br/>• May be preempted<br/>• Data stays in your region"]
        end

        subgraph B["OPTION B: Azure Speech (If Pyannote NOT required)"]
            B1[Audio Files<br/>Your Region] --> B2[Azure Speech<br/>Batch Transcription]
            B2 --> B3["• Pay-per-audio-hour<br/>• Built-in diarization (up to 36)<br/>• Files up to 1GB"]
        end

        subgraph C["OPTION C: ACA Serverless (If cross-region OK)"]
            C1[Audio Files<br/>Your Region] --> C2[ACA Serverless GPU<br/>Supported Region]
            C2 --> C3["• Scale-to-zero, per-second<br/>• Whisper + Pyannote<br/>• ⚠️ Data processed elsewhere"]
        end

        subgraph D["OPTION D: AI Foundry (24/7 workloads only)"]
            D1[Audio Files<br/>Your Region] --> D2[AI Foundry<br/>Managed Compute]
            D2 --> D3["• Your chosen region<br/>• ⚠️ Charges when idle<br/>• Only for 24/7 loads"]
        end
    end

    style A fill:#90EE90
    style B fill:#E6F3FF
    style C fill:#FFE4B5
    style D fill:#FFB366

7. Is ACA Serverless GPU the Best Solution for Long Audio Processing?

✅ YES, ACA Serverless GPU is THE BEST When:

  • βœ… Workload is variable/batch (not running 24/7)
  • βœ… Processing long files (1-2 hours) where cold start is negligible
  • βœ… Running custom open-source models (Hugging Face Whisper variants, etc.)
  • βœ… Deploying in supported US or European regions
  • βœ… Want minimal cost - pay only during actual processing
  • βœ… Need scale-to-zero to eliminate idle costs
  • βœ… Processing audio files in batches (use ACA Jobs)

❌ Consider Alternatives When:

  • ❌ Your region doesn't support ACA serverless GPU β†’ Use Azure Batch (Spot VMs) - available everywhere
  • ❌ Standard Whisper + MS diarization is sufficient β†’ Use Azure Speech Batch/Fast Transcription (simpler, fully managed)
  • ❌ 24/7 consistent workload β†’ Managed Compute may be simpler (though still more expensive)
  • ❌ Need fractional GPU sharing β†’ Use AKS with GPU sharing
  • ❌ Real-time streaming transcription β†’ Use Azure Speech real-time API
  • ❌ Need lowest possible cost with some preemption tolerance β†’ Use Azure Batch with Spot VMs

8. Complete Cost Comparison Summary

Scenario Best Option Billing Idle Cost
Variable batch audio processing, supported region ACA Serverless GPU Per-second $0
Whisper + Pyannote, unsupported region Azure Batch (Spot VMs) Spot (~90% off) $0
Standard Whisper + MS diarization, files >25MB Azure Speech Batch Transcription Per-audio-hour $0
Standard Whisper, files <2h, fast turnaround Azure Speech Fast Transcription Per-audio-hour $0
Custom model, unsupported region, 24/7 consistent load AI Foundry Managed Compute Per-hour 💸 Continues
Maximum control/customization AKS with GPU Nodes Per-node-hour 💸 Continues

Cost Formula Comparison (Monthly Estimate)

Scenario: 20 hours of audio processing per day, 22 business days/month = 440 processing hours

Option Calculation Est. Monthly Cost
ACA Serverless GPU (T4) 440h × ~$0.32/hr ~$140/month
AI Foundry Managed Compute 24h × 30 days × ~$0.32/hr ~$230/month (3-4x more with larger VM SKUs)
Azure Speech Batch 440 audio-hours × rate ~$Y (competitive for standard Whisper)

Key Insight: ACA Serverless GPU can be 60-85% cheaper than Managed Compute for batch/variable workloads.

⚠️ When Dedicated VMs Might Be Better

For high-volume, steady 24/7 workloads (e.g., continuous audio processing with consistent traffic), a dedicated GPU VM or Azure ML deployment with reserved capacity could be more economical:

Workload Pattern Best Option
Sporadic/variable (< 40% utilization) ACA Serverless GPU (consumption)
Predictable off-hours (business hours only) ACA Serverless GPU + scale-to-zero
24/7 heavy traffic (> 80% utilization) Dedicated VM or Reserved Capacity

Rule of thumb: If GPU utilization exceeds ~80% consistently around the clock, compare per-second costs against reserved/dedicated pricing. The crossover point varies by workload.
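The crossover can be estimated directly: with serverless billed only for active hours and a dedicated VM billed around the clock, the break-even utilization is simply the ratio of the two hourly rates. The serverless rate below is this document's T4 estimate; the dedicated rate is a hypothetical placeholder, so substitute real quotes:

```python
# Sketch: break-even GPU utilization between serverless and dedicated pricing.
# serverless_rate is this doc's T4 estimate; dedicated_rate is HYPOTHETICAL.
def break_even_utilization(serverless_rate: float, dedicated_rate: float) -> float:
    """Fraction of the month above which the dedicated VM becomes cheaper.

    Serverless cost = utilization * hours * serverless_rate;
    dedicated cost  = hours * dedicated_rate; setting them equal gives the ratio.
    """
    return dedicated_rate / serverless_rate

util = break_even_utilization(serverless_rate=0.32, dedicated_rate=0.24)
# If actual utilization exceeds `util` (75% with these placeholder rates),
# dedicated/reserved capacity wins.
```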


9. Action Items

  1. Clarify Data Residency Requirements
     • Check if your required region supports ACA serverless GPU (see availability table)
     • If NOT supported → plan for Azure Batch (Spot VMs) or AI Foundry Managed Compute
     • If cross-region processing is acceptable → ✅ ACA Serverless GPU is optimal

  2. Evaluate if Pyannote is Required (vs Azure Speech diarization)
     • If Pyannote NOT required → Azure Speech Batch Transcription (simplest, most regions)
     • If Pyannote required → custom model hosting is needed (ACA, Azure ML Batch, or Azure Batch)

  3. Evaluate if Standard Whisper Works
     • If YES → Azure Speech Batch/Fast Transcription (simplest, most regions)
     • If NO (need custom Whisper variant) → custom model hosting required

  4. Assess Workload Patterns
     • Variable with idle time → ✅ ACA Serverless GPU or Azure ML Batch/Azure Batch
     • 24/7 consistent → Managed Compute (but still more expensive)

  5. Evaluate Preemption Tolerance (for the Spot VM option)
     • Can tolerate preemption → Azure Batch (Spot VMs) for lowest cost
     • Need guaranteed execution → AI Foundry Managed Compute or dedicated VMs

  6. Request GPU Quota (if choosing ACA)
     • EA and Pay-as-you-go customers have default quota
     • Submit an Azure support case for additional quota

  7. Plan Architecture
     • Use ACA Jobs for batch processing workflows (if using ACA)
     • Use Azure ML Batch Endpoints for Canada with MLOps features
     • Use Azure Batch for maximum control and lowest cost
     • Use Storage Queue triggers for event-driven processing
     • Mount models from Azure Storage to reduce image size

  8. Optimize Cold Start
     • Use Premium ACR with artifact streaming
     • For batch jobs, cold start is negligible vs. 1-2 hour processing time

  9. Design for GPU Memory (Whisper + Pyannote)
     • Combined models need ~8-10GB VRAM
     • NVIDIA T4 (16GB) is sufficient
     • Consider sequential vs parallel model execution

References

Azure Container Apps

Azure Machine Learning

Azure Batch

Azure Speech Services

AI Foundry

Pyannote (Open Source)


Document prepared: January 2026
Last verified against official Microsoft documentation: March 10, 2026
Based on current Azure documentation and service availability

✅ Verification Status

The following information has been verified against official Microsoft Learn documentation:

Claim Verified Source
ACA Serverless GPU: 17 supported regions including Canada Central (many regions NOT available) ✅ ACA GPU Overview
ACA Foundry Models deployment to serverless GPUs (preview) ✅ ACA GPU Overview
ACA T4: 16GB VRAM, A100: 80GB HBM2e ✅ ACA GPU Types
Speech Batch Transcription: 1GB max, 240min with diarization ✅ Speech Quotas
Speech Fast Transcription: <300MB, <120min ✅ Speech Quotas
Speaker Diarization: max 36 speakers ✅ Batch Transcription Create
Azure OpenAI Whisper: 25MB file limit ✅ Whisper Quickstart
Azure Batch Spot VMs: up to 90% discount ✅ Batch Spot VMs
Low-Priority VMs: Retired September 30, 2025 ✅ Batch Spot VMs
ACA Per-second billing, scale-to-zero ✅ ACA Billing
ACA: Only one container per app can use GPU ✅ ACA GPU Overview
ACA: Multi/fractional GPU replicas NOT supported ✅ ACA GPU Overview
EA/Pay-as-you-go default GPU quota ✅ ACA GPU Overview
West Europe requires new workload profile environment ✅ ACA GPU Overview

Note: The specific hourly prices quoted in this document (~$0.32/hr for T4, ~$2.34/hr for A100) are estimates based on the Azure pricing calculator and may vary. Always confirm current pricing with the Azure Pricing Calculator.