Azure GPU Hosting Options for Open-Source AI Models¶
Table of Contents¶
- Executive Summary
- 🎯 Key Use Case: Long Audio File Analysis
- 1. Azure Container Apps (ACA) with Serverless GPUs
- 2. Azure AI Foundry Model Hosting Options
- 3. ALL Azure Options for Long Audio Processing
- 4. Whisper + Pyannote: Speaker Diarization Considerations
- No Managed SaaS/PaaS for Pyannote
- Alternative Hugging Face Models
- 5. Best Practices for Open-Source Model Hosting on ACA
- 6. Recommendations for Unsupported Regions
- 7. Is ACA Serverless GPU the Best Solution?
- 8. Complete Cost Comparison Summary
- 9. Action Items
- References
- Verification Status
Executive Summary¶
This document provides comprehensive guidance for hosting open-source AI models (Hugging Face, Whisper, Pyannote, etc.) on Azure, with a focus on Azure Container Apps (ACA) with serverless GPUs. It addresses regional availability, cost optimization, and best practices for production deployments, specifically for long-running audio analysis workloads (1-2 hours) with speaker diarization.
🎯 Key Use Case: Long Audio File Analysis (1-2 Hours)¶
Your scenario: Processing long audio files (1-2 hours) with custom open-source models using serverless GPU
Critical Cost Comparison¶
| Hosting Option | Billing Model | Cost When Idle | Best For |
|---|---|---|---|
| ACA Serverless GPU | Per-second | $0 (scale-to-zero) | Variable workloads, batch jobs (supported regions) |
| Azure Batch (Spot VMs) | Spot pricing (~90% off) | $0 (scale-to-zero) | Maximum savings, batch jobs, any region |
| AI Foundry Managed Compute | Per-hour (VM uptime) | Continues charging | 24/7 consistent workloads |
| Azure Speech Batch Transcription | Per-audio-hour | $0 (pay-per-use) | Standard Whisper transcription |
| Azure Speech Fast Transcription | Per-audio-hour | $0 (pay-per-use) | Files <2 hours, <300MB |
💰 Cost Analysis for Long Audio Processing¶
Scenario: Processing 10 audio files of 2 hours each per day
| Option | Active Processing | Idle Hours (22h) | Monthly Estimate |
|---|---|---|---|
| ACA Serverless GPU (T4) | Pay only for ~20h processing | $0 | Lower |
| AI Foundry Managed Compute | Pay for 24h VM | Still paying | ~3-4x higher |
Winner for batch/variable workloads: ✅ ACA Serverless GPU
1. Azure Container Apps (ACA) with Serverless GPUs¶
Overview¶
Azure Container Apps serverless GPUs provide a middle ground between two extremes:
- Azure AI Foundry serverless APIs (fully managed, pay-per-token)
- Managed compute (dedicated VMs billed per compute uptime, so charges accrue even when idle)
Key Benefits for Long Audio Processing¶
| Feature | Description | Benefit for Audio Workloads |
|---|---|---|
| Scale-to-Zero | GPUs scale down when not in use | Pay $0 between processing jobs |
| Per-Second Billing | Pay only for actual GPU compute time | 2-hour job = pay for 2 hours only |
| Data Governance | Data never leaves container boundaries | Audio files stay in your container |
| No Infrastructure Management | Serverless - no driver installation | Focus on model, not infrastructure |
| Automatic Scaling | Scales based on workload demand | Handle burst of audio files |
| Jobs Support | ACA Jobs for batch processing | Perfect for audio batch processing |
GPU Types Available¶
| GPU Type | Memory | Approx. Cost/Hour | Best For | Recommendation for Audio |
|---|---|---|---|---|
| NVIDIA T4 | 16GB VRAM | ~$0.32/hr | Cost-effective inference, models <10GB | ✅ Recommended for Whisper/audio models |
| NVIDIA A100 | 80GB HBM2e | ~$2.34/hr | Large models >15GB, training | Overkill for most audio processing |
Cost Insight: The A100 delivers roughly 4× the performance of the T4 but costs ~7-8× more. For Whisper + Pyannote (~8-10GB VRAM combined), the T4 is the cost-effective choice.
ACA Billing Details¶
Billing = (vCPU-seconds × rate) + (GiB-seconds × rate) + (GPU-seconds × rate)
✅ When scaled to ZERO replicas = NO charges
✅ Per-second billing precision
✅ GPU-seconds billed only when a GPU is allocated
Important: Idle usage charges do NOT apply to serverless GPU apps - they're always billed for active usage only.
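The billing formula can be turned into a quick per-job estimate. A minimal sketch, using the document's ~$0.32/hr T4 figure for the GPU rate; the vCPU and memory rates below are illustrative placeholders, not confirmed Azure prices:

```python
# Sketch: ACA serverless GPU cost for a single job, using the per-second
# billing formula above. vCPU/memory rates are placeholders - confirm all
# rates against the Azure pricing calculator.

def aca_job_cost(seconds, vcpu, gib, vcpu_rate=0.000024, mem_rate=0.000003,
                 gpu_rate=0.32 / 3600):
    """Cost = vCPU-seconds + GiB-seconds + GPU-seconds, each at its rate."""
    return seconds * (vcpu * vcpu_rate + gib * mem_rate + gpu_rate)

# A 2-hour audio job on a T4 with 4 vCPU / 16 GiB of memory:
cost = aca_job_cost(seconds=2 * 3600, vcpu=4, gib=16)
print(f"${cost:.2f}")
```

With scale-to-zero, `seconds` between jobs is simply zero, so idle time contributes nothing to the bill.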
🚨 Regional Availability - CRITICAL LIMITATION¶
ACA serverless GPUs are only available in specific regions. Many regions (including UK, Germany, and others) are NOT supported. Check the table below for current availability.
| Region | A100 | T4 |
|---|---|---|
| Australia East | ✅ Yes | ✅ Yes |
| Brazil South | ✅ Yes | ✅ Yes |
| Canada Central | ✅ Yes | ✅ Yes |
| Central India | ❌ No | ✅ Yes |
| East US | ✅ Yes | ✅ Yes |
| France Central | ❌ No | ✅ Yes |
| Italy North | ✅ Yes | ✅ Yes |
| Japan East | ❌ No | ✅ Yes |
| North Central US | ❌ No | ✅ Yes |
| South Central US | ❌ No | ✅ Yes |
| South East Asia | ❌ No | ✅ Yes |
| South India | ❌ No | ✅ Yes |
| Sweden Central | ✅ Yes | ✅ Yes |
| West Europe* | ❌ No | ✅ Yes |
| West US | ✅ Yes | ✅ Yes |
| West US 2 | ❌ No | ✅ Yes |
| West US 3 | ✅ Yes | ✅ Yes |
| Canada East | ❌ Not available | ❌ Not available |
*West Europe requires creating a new workload profile environment.
Considerations for ACA Serverless GPUs¶
- CUDA Version: Supports the latest CUDA version
- Single GPU per Container: Only one container in an app can use the GPU at a time
- No Fractional GPUs: Multi-GPU and fractional-GPU replicas aren't supported
- Quota Required: Request GPU quota via an Azure support case (EA and Pay-as-you-go accounts have a default quota)
- Instance Limits: Max 1× A100 GPU, 24 vCPU, 96 GB RAM per container instance
- Network: Consumes one IP address per replica when using VNet integration
- Foundry Models (Preview): You can deploy Microsoft Foundry models (MLFLOW type) directly to ACA serverless GPUs using az containerapp up with the --model-registry, --model-name, and --model-version parameters
Why NOT Other Azure Services?¶
| Alternative | Scale-to-Zero? | Why Not Preferred |
|---|---|---|
| Azure ML Online Endpoints | ❌ No | Requires at least 1 GPU node running 24/7 - incurs idle costs |
| AKS with GPU Node Pools | ⚠️ Possible but complex | Can configure scale-to-zero but requires Kubernetes expertise, longer cold starts (minutes) |
| Azure Batch | ✅ Yes | Good for batch jobs but not real-time serving, more setup |
| Azure Functions | ❌ No GPU support | No GPU on the Consumption plan |
| Dedicated GPU VMs | ❌ No | Manual management, risk of forgetting to shut down |
2. Azure AI Foundry Model Hosting Options¶
About Azure AI Foundry¶
Azure AI Foundry catalogs 11,000+ models including open-source Hugging Face models. It offers two deployment modes:
- Serverless API - fully managed, pay-per-token (uses ephemeral containers under the hood)
- Managed Compute - dedicated VMs you provision, billed per VM uptime
Deployment Options Comparison (With Cost Impact)¶
| Option | Billing Model | Idle Cost | Hugging Face Support | All Regions |
|---|---|---|---|---|
| Serverless API | Pay-per-token | $0 | ⚠️ Limited (popular models only) | ✅ Most |
| Managed Compute | Pay-per-VM-hour | 💸 Continues charging | ✅ Full | ✅ Yes |
| ACA Serverless GPU | Pay-per-second | $0 (scale-to-zero) | ✅ Full | ⚠️ Limited (17 regions) |
⚠️ Managed Compute Cost Warning¶
AI Foundry Managed Compute charges per VM uptime, NOT per inference.
If your workload is variable (e.g., processing audio files during business hours only), you'll pay for:
- Active processing time
- PLUS all idle time the VM is running
This can result in 3-4x higher costs compared to serverless GPU for batch workloads.
When Serverless API Won't Work¶
Not every model can run on the serverless footprint. Models that exceed these limits require Managed Compute:
- GPU: More than 1× A100 (multi-GPU inference)
- CPU: More than 24 vCPUs
- RAM: More than 96 GB
Large models requiring multi-GPU for inference cannot use the simple serverless deployment.
Hugging Face Models in AI Foundry¶
Important: Hugging Face models in Azure AI Foundry are NOT available as serverless APIs. They require Managed Compute deployment.
"Managed compute deployment is required for model collections that include: Hugging Face, NVIDIA inference microservices (NIMs), Industry models, Databricks, Custom models"
3. ALL Azure Options for Long Audio Processing (1-2 Hours)¶
Complete Options Matrix¶
| Option | Model Flexibility | File Size Limit | Billing | Scale-to-Zero | Custom Models | Pyannote | Region Availability |
|---|---|---|---|---|---|---|---|
| ACA Serverless GPU | ✅ Any model | Unlimited | Per-second | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ 17 regions only |
| ACA Jobs + GPU | ✅ Any model | Unlimited | Per-second | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ 17 regions only |
| Azure Batch (Spot VMs) | ✅ Any model | Unlimited | Spot (~90% off) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ All regions |
| Azure Speech Batch Transcription | Whisper only | Up to 1GB | Per-audio-hour | ✅ Yes | ❌ No | ❌ MS only | ✅ Most regions |
| Azure Speech Fast Transcription | Whisper only | <300MB, <2h | Per-audio-hour | ✅ Yes | ❌ No | ❌ MS only | ✅ Most regions |
| Azure OpenAI Whisper | Whisper only | 25MB | Per-token | ✅ Yes | ❌ No | ❌ No | ✅ Most regions |
| AI Foundry Managed Compute | ✅ Any model | Unlimited | Per-VM-hour | ❌ No | ✅ Yes | ✅ Yes | ✅ All regions |
| AKS with GPU Nodes | ✅ Any model | Unlimited | Per-node-hour | ⚠️ Manual | ✅ Yes | ✅ Yes | ✅ All regions |
Option Details¶
Option 1: ACA Serverless GPU with Container Apps (✅ RECOMMENDED)¶
Best for: Custom open-source models, variable workloads, long audio files
Architecture:
┌─────────────────┐     ┌───────────────────────────────────┐
│  Audio Files    │────▶│  Azure Container Apps             │
│ (Blob Storage)  │     │  - Serverless GPU (T4)            │
└─────────────────┘     │  - Custom Whisper/Hugging Face    │
                        │  - Scale-to-zero when idle        │
                        │  - Per-second billing             │
                        └───────────────────────────────────┘
Pros:
- ✅ Per-second billing (pay only during processing)
- ✅ Scale-to-zero (no cost when idle)
- ✅ Use ANY open-source model
- ✅ Process files of ANY size
- ✅ Full data governance

Cons:
- ❌ Limited to 17 supported regions (see availability table)
- ❌ Requires container image management
- ❌ Cold start time (mitigated with artifact streaming)
Option 2: ACA Jobs with GPU (✅ EXCELLENT FOR BATCH)¶
Best for: Scheduled batch processing of multiple audio files
Architecture:
┌─────────────────┐     ┌───────────────────────────────────┐
│ Queue/Schedule  │────▶│  Azure Container Apps Job         │
│ (Storage Queue) │     │  - Event/Schedule triggered       │
└─────────────────┘     │  - Serverless GPU (T4)            │
                        │  - Runs, processes, exits         │
                        │  - Pay only during execution      │
                        └───────────────────────────────────┘
Job Types:
- Manual Jobs: Trigger on-demand via API
- Scheduled Jobs: Cron-based (e.g., nightly batch)
- Event-driven Jobs: Triggered by queue messages

Pros:
- ✅ Perfect for batch audio processing
- ✅ Automatic retry on failure
- ✅ Pay only during job execution
- ✅ Scales based on queue depth
Option 3: Azure Speech Batch Transcription (For Standard Whisper)¶
Best for: Standard Whisper transcription without custom models
Pros:
- ✅ Fully managed, no infrastructure
- ✅ Pay-per-audio-hour
- ✅ Supports files >25MB (unlike Azure OpenAI Whisper)
- ✅ Available in Canada

Cons:
- ❌ Whisper model only (no custom models)
- ❌ No real-time processing (async only)
- ❌ May take up to 30 min to start at peak hours
Option 4: Azure Speech Fast Transcription API¶
Best for: Quick turnaround on files <2 hours
Limits:
- Files less than 2 hours
- Files less than 300MB
- Synchronous response (faster than real-time)
Pros:
- ✅ Predictable, fast latency
- ✅ Synchronous results
- ✅ Available in Canada

Cons:
- ❌ 2-hour file limit
- ❌ 300MB size limit
- ❌ Standard Whisper only
- ❌ Uses Microsoft's diarization (NOT Pyannote)
Option 5: Azure Batch with Spot VMs (✅ RECOMMENDED FOR LOWEST COST)¶
Best for: Maximum cost savings, large-scale batch processing, full infrastructure control
Architecture:
┌─────────────────┐     ┌───────────────────────────────────┐
│ Storage Queue   │────▶│  Azure Batch Pool (Spot GPU VMs)  │
│ (Job Triggers)  │     │  - NC-series or ND-series GPUs    │
└─────────────────┘     │  - Spot pricing (~90% off)        │
                        │  - Custom container image         │
                        │  - Auto-scale 0 to N nodes        │
                        └───────────────────────────────────┘
Pros:
- ✅ Lowest cost option - Spot VMs up to 90% cheaper
- ✅ Available in ALL Azure regions (including regions without ACA GPU support)
- ✅ Full GPU VM support - Spot pricing applies to NC, NCv3, ND, NDv2, NV series and all GPU SKUs
- ✅ Scale-to-zero (pools auto-scale based on the job queue)
- ✅ Full flexibility - any Docker container
- ✅ Parallel processing across multiple nodes

Cons:
- ⚠️ Spot VMs can be preempted at any time
- ⚠️ More infrastructure setup required
- ⚠️ Need to handle checkpointing for preemption recovery
- ⚠️ Not available in BatchService mode (requires UserSubscription)
Spot VM Consideration: Jobs should be designed to handle preemption. For 1-2 hour audio files, consider checkpointing progress or breaking into smaller segments.
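The segmenting-plus-checkpoint idea can be sketched in a few lines. The function names and the "set of completed segments" checkpoint format are hypothetical; a real job would persist the checkpoint to Blob Storage or a queue:

```python
# Sketch: split a long recording into fixed-size segments and resume from a
# checkpoint after Spot VM preemption. Names and checkpoint format are
# illustrative - adapt to your own storage/queue setup.

def plan_segments(duration_s, segment_s=600):
    """Return (start, end) pairs in seconds covering the full recording."""
    return [(t, min(t + segment_s, duration_s))
            for t in range(0, duration_s, segment_s)]

def resume_point(done_segments, all_segments):
    """First segment not yet processed - where a preempted job restarts."""
    return next((s for s in all_segments if s not in done_segments), None)

segments = plan_segments(2 * 3600)   # 2-hour file in 10-minute chunks
completed = set(segments[:5])        # checkpoint written before preemption
print(len(segments), resume_point(completed, segments))  # 12 (3000, 3600)
```

On restart, the job skips everything in the checkpoint and continues from `resume_point`, so at most one segment of work is lost to preemption.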
4. Whisper + Pyannote: Speaker Diarization Considerations¶
β οΈ Important: No Managed SaaS/PaaS for Pyannote¶
There is NO Azure managed service that offers Pyannote specifically. Pyannote is an open-source library, not a commercial product, so no cloud provider offers it as a SaaS/PaaS solution.
| Service | Whisper | Speaker Diarization | Fully Managed | Pyannote? |
|---|---|---|---|---|
| Azure Speech Batch Transcription | ✅ Yes | ✅ Microsoft's built-in | ✅ Yes | ❌ No |
| Azure Speech Fast Transcription | ✅ Yes | ✅ Microsoft's built-in | ✅ Yes | ❌ No |
| Azure OpenAI Whisper | ✅ Yes | ❌ No diarization | ✅ Yes | ❌ No |
| Self-hosted (ACA, Batch, AKS) | ✅ Yes | ✅ Pyannote | ❌ You manage | ✅ Yes |
🤔 Key Question: Is Pyannote Actually Required?¶
Before investing in custom GPU hosting, evaluate if Azure Speech's built-in diarization meets your needs:
| Capability | Azure Speech Diarization | Pyannote |
|---|---|---|
| Speaker identification | ✅ Yes (up to 36 speakers) | ✅ Yes |
| Overlapping speech | ⚠️ Limited | ✅ Better |
| Custom fine-tuning | ❌ No | ✅ Yes |
| Accuracy on complex audio | Good | State-of-the-art |
| Infrastructure required | None (fully managed) | GPU container hosting |
| Available in Canada | ✅ Yes | Depends on hosting |
Recommendation:
- Try Azure Speech Batch Transcription first - it's fully managed, pay-per-use, and may be "good enough"
- Only use Pyannote if you need higher accuracy on complex scenarios, overlapping speech detection, or custom fine-tuning
What is Pyannote?¶
Pyannote Audio is an open-source speaker diarization toolkit that provides:
- Speaker diarization - identifying "who spoke when"
- Voice activity detection (VAD)
- Speaker segmentation and clustering
- Overlapped speech detection
Why Combine Whisper + Pyannote?¶
Azure Speech Service has built-in diarization, but Pyannote offers:
- Higher accuracy for complex scenarios
- Better handling of overlapping speech
- More customization options
- Fine-tuning capability on your own data
- State-of-the-art performance on benchmarks
Alternative Hugging Face Models for Speaker Diarization¶
If Pyannote doesn't fit your needs (licensing, performance, or ease of use), consider these alternatives:
| Model/Framework | Type | License | VRAM | Best For |
|---|---|---|---|---|
| Pyannote 3.1 | Full pipeline | MIT | ~2-4 GB | State-of-the-art diarization, overlapped speech |
| NVIDIA NeMo | Full toolkit | Apache 2.0 | ~2-6 GB | Enterprise-grade, NVIDIA GPU optimized, Riva deployment |
| SpeechBrain ECAPA-TDNN | Speaker embeddings | Apache 2.0 | ~1-2 GB | Speaker verification, embedding extraction |
| Wav2Vec2 + clustering | DIY approach | MIT | ~2-4 GB | Custom pipelines, research flexibility |
| WhisperX | Whisper + diarization | BSD | ~4-8 GB | Combined transcription + diarization in one |
Option 1: NVIDIA NeMo (✅ Recommended Enterprise Alternative)¶
NVIDIA NeMo is a scalable AI framework with built-in speaker diarization:
- ✅ Apache 2.0 license (enterprise-friendly)
- ✅ Optimized for NVIDIA GPUs
- ✅ Can deploy to production with NVIDIA Riva
- ✅ Active development, 16k+ GitHub stars
- ✅ Includes ASR, TTS, and speaker diarization in one toolkit

# NeMo clustering diarizer - driven by an OmegaConf config rather than a
# single pretrained-model name; the YAML below ships with NeMo's examples
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

cfg = OmegaConf.load("diar_infer_telephony.yaml")
diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()
Option 2: SpeechBrain ECAPA-TDNN¶
SpeechBrain provides speaker embedding models that can be combined with clustering for diarization:
- ✅ Apache 2.0 license
- ✅ Pre-trained on VoxCeleb (1M+ downloads/month)
- ✅ Easy to integrate with custom clustering
- ⚠️ Requires building your own diarization pipeline

# SpeechBrain speaker embeddings (pretrained ECAPA-TDNN)
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
audio_signal, sample_rate = torchaudio.load("audio.wav")
embeddings = classifier.encode_batch(audio_signal)
# Then cluster the embeddings to identify speakers
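The "then cluster" step can be illustrated with scikit-learn. This is a minimal sketch on synthetic stand-in vectors, not real ECAPA-TDNN embeddings; a production pipeline would typically use cosine distance on length-normalized embeddings and estimate the speaker count:

```python
# Sketch of the clustering step: group per-segment speaker embeddings with
# agglomerative clustering. The embeddings here are synthetic stand-ins
# drawn around two centroids to represent two speakers.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (4, 192)),   # "speaker 1" segments
                 rng.normal(1.0, 0.1, (4, 192))])  # "speaker 2" segments

labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)
print(labels)  # segments 0-3 share one label, 4-7 the other
```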
Option 3: WhisperX (Easiest Combined Solution)¶
WhisperX combines Whisper transcription with speaker diarization in a single package:
- ✅ Combines Whisper + Pyannote + forced alignment
- ✅ Word-level timestamps with speaker attribution
- ✅ Single package, easy to deploy
- ⚠️ Uses Pyannote under the hood (requires accepting the Pyannote license)

# WhisperX combined transcription + diarization
import whisperx

audio = whisperx.load_audio(audio_file)  # decode to a 16 kHz waveform first
model = whisperx.load_model("large-v3", device="cuda")
result = model.transcribe(audio)
# Pyannote-based diarization; needs a Hugging Face access token
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cuda")
result = whisperx.assign_word_speakers(diarize_model(audio), result)
Comparison: Pyannote vs Alternatives¶
| Criteria | Pyannote | NVIDIA NeMo | SpeechBrain | WhisperX |
|---|---|---|---|---|
| Accuracy (DER) | Best (~12-25%) | Very Good | Good (needs pipeline) | Good (uses Pyannote) |
| Ease of Use | Easy | Medium | Medium | Easiest |
| License | MIT (with contact form) | Apache 2.0 | Apache 2.0 | BSD |
| Enterprise Support | Commercial option | NVIDIA Riva | Community | Community |
| Overlapped Speech | ✅ Best | ✅ Good | ⚠️ Limited | ✅ Good |
| Production Ready | ✅ Yes | ✅ Yes (Riva) | ⚠️ Custom | ⚠️ Custom |
GPU Memory Requirements¶
Running both models requires more VRAM:
| Model | VRAM Usage | Notes |
|---|---|---|
| Whisper large-v3 | ~3-5 GB | Varies by batch size |
| Whisper medium | ~2-3 GB | Good quality/cost tradeoff |
| Pyannote 3.0 | ~2-4 GB | Speaker diarization |
| Combined | ~8-10 GB | Both models loaded |
Recommendation: NVIDIA T4 (16GB VRAM) is sufficient for Whisper + Pyannote. A100 (80GB) is overkill.
Processing Pipeline¶
Typical Whisper + Pyannote workflow:
flowchart TD
subgraph Pipeline["🎵 WHISPER + PYANNOTE PIPELINE"]
A[🎤 Audio File<br/>1-2 hours] --> B[Pyannote VAD<br/>Voice Activity Detection]
B --> C[Pyannote Diarization<br/>Who spoke when]
C --> D[Whisper Transcription<br/>Per speaker segment]
D --> E[Merge Results<br/>Speaker-attributed transcript]
end
C -.-> C1["Speaker Segments:<br/>[Speaker1: 0:00-0:45]<br/>[Speaker2: 0:45-1:30]"]
E -.-> E1["📄 Final Output:<br/>Timestamped transcript<br/>with speaker labels"]
style A fill:#E6F3FF
style E fill:#90EE90
style E1 fill:#90EE90
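The final "merge results" step of the pipeline can be sketched as plain timestamp arithmetic: each transcript segment is labeled with the speaker whose diarization turn overlaps it most. The segment dicts below are assumed shapes for illustration, not any library's exact output format:

```python
# Sketch of the merge step: label each transcript segment with the speaker
# whose diarization turn overlaps it most (times in seconds).

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def merge(transcript, turns):
    for seg in transcript:
        seg["speaker"] = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )["speaker"]
    return transcript

turns = [{"speaker": "S1", "start": 0, "end": 45},
         {"speaker": "S2", "start": 45, "end": 90}]
transcript = [{"start": 2, "end": 10, "text": "Hello"},
              {"start": 50, "end": 60, "text": "Hi there"}]
print([s["speaker"] for s in merge(transcript, turns)])  # ['S1', 'S2']
```

Majority-overlap assignment is robust to small timestamp disagreements between Whisper and the diarizer, which is why it is a common merge strategy.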
Options Comparison for Whisper + Pyannote¶
| Option | Whisper | Pyannote | Billing | Scale-to-Zero | Region Availability | Cost Rating |
|---|---|---|---|---|---|---|
| ACA Serverless GPU | ✅ | ✅ | Per-second | ✅ | ⚠️ 17 regions | 💰 Lowest |
| Azure Batch (Spot VMs)* | ✅ | ✅ | Spot (~90% off) | ✅ | ✅ All regions | 💰 Lowest |
| AI Foundry Managed Compute | ✅ | ✅ | Per-VM-hour | ❌ | ✅ All regions | 💰💰💰💰 High |
| Azure Speech Batch | ✅ | ❌ Microsoft's | Per-audio-hour | ✅ | ✅ Most regions | 💰💰 Medium |
*Spot VMs may be preempted - design jobs to handle checkpointing.
Decision Tree for Whisper + Pyannote¶
flowchart TD
A[🎯 Whisper + Pyannote<br/>Audio Processing] --> B{Is Pyannote specifically<br/>required?}
B -->|No| C[✅ Azure Speech Batch Transcription<br/>Simpler, fully managed]
B -->|Yes| D{Is your region supported<br/>for ACA serverless GPU?}
D -->|Yes| E[✅ ACA Serverless GPU<br/>Best: per-second billing<br/>Scale-to-zero]
D -->|No| F{Is workload<br/>variable/batch?}
F -->|Yes| G[✅ Azure Batch - Spot VMs<br/>Lowest cost ~90% off<br/>Available in ALL regions]
F -->|No - 24/7| H[AI Foundry Managed Compute<br/>⚠️ Charges when idle]
style C fill:#90EE90
style E fill:#90EE90
style G fill:#90EE90
style H fill:#FFB366
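The decision tree above can also be written as a small function that returns the same recommendations as the flowchart:

```python
# The Whisper + Pyannote decision tree as a function. Returns the same
# recommendation strings as the flowchart above.

def choose_hosting(pyannote_required, region_supported, workload_is_batch):
    if not pyannote_required:
        return "Azure Speech Batch Transcription"
    if region_supported:
        return "ACA Serverless GPU"
    if workload_is_batch:
        return "Azure Batch (Spot VMs)"
    return "AI Foundry Managed Compute"

# e.g. Pyannote required, region unsupported, batch workload:
print(choose_hosting(True, False, True))  # Azure Batch (Spot VMs)
```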
5. Best Practices for Open-Source Model Hosting on ACA¶
Cold Start Optimization¶
For audio processing jobs, cold start matters less since processing time dominates:
| Technique | Impact | When to Use |
|---|---|---|
| ACR Artifact Streaming | Container starts before full image pull completes | Always (requires Premium ACR) |
| Storage Mounts for Models | Model weights load from persistent storage, not re-downloaded | Large models (>5GB) |
| Azure Files/Blob Volume | Persists between container restarts, speeds model loading | Production deployments |
| Minimum Replica = 1 | Eliminates cold start entirely | Latency-critical workloads |
| Minimum Replica = 0 | Maximum cost savings (scale-to-zero) | Batch processing (recommended) |
For 1-2 hour audio files: Cold start of 30-60 seconds is negligible compared to processing time. Use minReplicas=0 for cost optimization.
Scaling and Concurrency Tuning¶
| Strategy | Description | Recommendation |
|---|---|---|
| Batch inference per replica | Process multiple requests per GPU to maximize utilization | Avoid over-scaling on every request |
| Max replicas limit | Set maxReplicas to control costs | Prevent runaway scaling |
| Queue-based triggers | Use Storage Queue or Service Bus to trigger processing | App only spins up when work arrives |
| HTTP scaling rules | KEDA-based autoscaling on concurrent requests | Fine-tune scale thresholds |
| Monitor GPU utilization | If GPU underutilized, handle more concurrent load per replica | Consider CPU if GPU usage very low |
Key Insight: Each ACA replica can only use ONE GPU. Horizontal scaling adds more replicas (each with its own GPU). Size your workload to maximize single-GPU utilization before scaling out.
Cost Optimization Strategies¶
| Strategy | Implementation | Savings |
|---|---|---|
| Use T4 instead of A100 | T4 (~$0.32/hr) vs A100 (~$2.34/hr) for Whisper/audio | ~85% savings |
| Scale to Zero | Set minReplicas=0 | 100% during idle |
| Use ACA Jobs | For batch processing workflows | More efficient than always-on |
| Optimize container image | Smaller image = faster cold start | Reduced startup latency |
| Mount models from storage | Azure Files/Blob instead of baking into image | Faster updates, smaller images |
| Queue-based architecture | Only process when work arrives | No idle GPU time |
6. Recommendations for Unsupported Regions¶
If your required region is not supported for ACA serverless GPUs (e.g., Canada East, UK, Germany, Middle East, Africa, etc.), here are alternatives:
Option 1: Azure Batch with Spot VMs (✅ RECOMMENDED)¶
Best for Whisper + Pyannote in any region with maximum savings
Pros:
- ✅ Spot pricing - up to 90% off
- ✅ Available in ALL Azure regions
- ✅ Full Docker container flexibility
- ✅ Scale pools based on the job queue
- ✅ Full Hugging Face model support (Whisper, Pyannote)

Cons:
- ⚠️ Spot VMs can be preempted anytime
- ⚠️ More infrastructure setup required
- ⚠️ Need checkpointing for long jobs
Option 2: AI Foundry Managed Compute (Any Region)¶
Pros:
- Full Hugging Face model support
- Available in all Azure regions - data stays in your chosen region
- Enterprise security features

Cons:
- ⚠️ Pay per VM uptime (charges even when idle)
- Higher operational cost for variable workloads

Cost Mitigation:
- Use Azure Automation to start/stop VMs on schedule
- Consider if batch processing can be done in off-hours
Option 3: ACA Serverless GPU in Supported Region (If Cross-Region Processing OK)¶
Pros:
- ✅ Scale-to-zero capability
- ✅ Per-second billing
- ✅ Significantly lower cost for variable workloads

Cons:
- Data leaves your local region (compliance concern)
- Network latency to the supported region (~20-100ms depending on distance)
For audio processing: Latency is usually acceptable since files are uploaded, processed, and results retrieved asynchronously.
Option 4: Azure Speech Batch Transcription (If Standard Whisper + MS Diarization Works)¶
Pros:
- ✅ Available in most Azure regions
- ✅ Pay-per-use (no idle charges)
- ✅ Handles files up to 1GB
- ✅ Diarization and word timestamps

Cons:
- Standard Whisper only (no custom models)
- Async processing (may take time to start)
Recommended Architecture for Unsupported Regions (Whisper + Pyannote)¶
flowchart LR
subgraph Unsupported["OPTIONS FOR UNSUPPORTED REGIONS"]
direction TB
subgraph A["✅ OPTION A: Azure Batch - Spot VMs (RECOMMENDED)"]
A1[Audio Files<br/>Your Region] --> A2[Azure Batch<br/>Spot GPU Pool]
A2 --> A3["• Spot pricing ~90% off<br/>• Custom Docker container<br/>• May be preempted<br/>• Data stays in your region"]
end
subgraph B["OPTION B: Azure Speech (If Pyannote NOT required)"]
B1[Audio Files<br/>Your Region] --> B2[Azure Speech<br/>Batch Transcription]
B2 --> B3["• Pay-per-audio-hour<br/>• Built-in diarization (up to 36)<br/>• Files up to 1GB"]
end
subgraph C["OPTION C: ACA Serverless (If cross-region OK)"]
C1[Audio Files<br/>Your Region] --> C2[ACA Serverless GPU<br/>Supported Region]
C2 --> C3["• Scale-to-zero, per-second<br/>• Whisper + Pyannote<br/>• ⚠️ Data processed elsewhere"]
end
subgraph D["OPTION D: AI Foundry (24/7 workloads only)"]
D1[Audio Files<br/>Your Region] --> D2[AI Foundry<br/>Managed Compute]
D2 --> D3["• Your chosen region<br/>• ⚠️ Charges when idle<br/>• Only for 24/7 loads"]
end
end
end
style A fill:#90EE90
style B fill:#E6F3FF
style C fill:#FFE4B5
style D fill:#FFB366
7. Is ACA Serverless GPU the Best Solution for Long Audio Processing?¶
✅ YES, ACA Serverless GPU is THE BEST When:¶
- ✅ Workload is variable/batch (not running 24/7)
- ✅ Processing long files (1-2 hours) where cold start is negligible
- ✅ Running custom open-source models (Hugging Face Whisper variants, etc.)
- ✅ Deploying in supported US or European regions
- ✅ You want minimal cost - pay only during actual processing
- ✅ You need scale-to-zero to eliminate idle costs
- ✅ Processing audio files in batches (use ACA Jobs)
❌ Consider Alternatives When:¶
- ❌ Your region doesn't support ACA serverless GPU → Use Azure Batch (Spot VMs) - available everywhere
- ❌ Standard Whisper + MS diarization is sufficient → Use Azure Speech Batch/Fast Transcription (simpler, fully managed)
- ❌ 24/7 consistent workload → Managed Compute may be simpler (though still more expensive)
- ❌ You need fractional GPU sharing → Use AKS with GPU sharing
- ❌ Real-time streaming transcription → Use Azure Speech real-time API
- ❌ You need the lowest possible cost and can tolerate preemption → Use Azure Batch with Spot VMs
8. Complete Cost Comparison Summary¶
| Scenario | Best Option | Billing | Idle Cost |
|---|---|---|---|
| Variable batch audio processing, supported region | ACA Serverless GPU | Per-second | $0 |
| Whisper + Pyannote, unsupported region | Azure Batch (Spot VMs) | Spot (~90% off) | $0 |
| Standard Whisper + MS diarization, files >25MB | Azure Speech Batch Transcription | Per-audio-hour | $0 |
| Standard Whisper, files <2h, fast turnaround | Azure Speech Fast Transcription | Per-audio-hour | $0 |
| Custom model, unsupported region, 24/7 consistent load | AI Foundry Managed Compute | Per-hour | 💸 Continues |
| Maximum control/customization | AKS with GPU Nodes | Per-node-hour | 💸 Continues |
Cost Formula Comparison (Monthly Estimate)¶
Scenario: 20 hours of audio processing per day, 22 business days/month = 440 processing hours
| Option | Calculation | Est. Monthly Cost |
|---|---|---|
| ACA Serverless GPU (T4) | 440h × ~$0.32/hr | ~$140/month |
| AI Foundry Managed Compute | 24h × 30 days × ~$0.32/hr | ~$230/month (3-4x more if larger VM) |
| Azure Speech Batch | 440 audio-hours × rate | ~$Y (competitive for standard Whisper) |
Key Insight: ACA Serverless GPU can be 60-85% cheaper than Managed Compute for batch/variable workloads.
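The monthly estimates above are just arithmetic on the ~$0.32/hr T4 estimate, and can be reproduced directly (the rate is the document's estimate, not a confirmed price):

```python
# The monthly comparison above as arithmetic. T4_RATE is the document's
# ~$0.32/hr estimate - confirm against current Azure pricing.
T4_RATE = 0.32

processing_hours = 20 * 22            # 20 h/day x 22 business days = 440 h
serverless = processing_hours * T4_RATE
managed = 24 * 30 * T4_RATE           # the VM is billed around the clock

print(f"serverless ~${serverless:.0f}/month, managed ~${managed:.0f}/month")
```

The gap widens further when the managed VM is a larger SKU than the serverless T4, which is where the 3-4x figure comes from.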
β οΈ When Dedicated VMs Might Be Better¶
For high-volume, steady 24/7 workloads (e.g., continuous audio processing with consistent traffic), a dedicated GPU VM or Azure ML deployment with reserved capacity could be more economical:
| Workload Pattern | Best Option |
|---|---|
| Sporadic/variable (< 40% utilization) | ACA Serverless GPU (consumption) |
| Predictable off-hours (business hours only) | ACA Serverless GPU + scale-to-zero |
| 24/7 heavy traffic (> 80% utilization) | Dedicated VM or Reserved Capacity |
Rule of thumb: If GPU utilization exceeds ~80% consistently around the clock, compare per-second costs against reserved/dedicated pricing. The crossover point varies by workload.
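The crossover point can be estimated with one division: serverless wins whenever GPU utilization stays below the ratio of the dedicated rate to the serverless rate. Both rates below are illustrative placeholders (the reserved rate in particular is hypothetical):

```python
# Rule-of-thumb crossover: the utilization at which per-second serverless
# cost equals an always-on dedicated VM. Rates are placeholders; the
# dedicated rate assumes a hypothetical reserved-instance discount.

def crossover_utilization(serverless_rate, dedicated_rate):
    """Fraction of the month above which dedicated becomes cheaper."""
    return dedicated_rate / serverless_rate

# e.g. T4 serverless at ~$0.32/hr vs a reserved rate of $0.20/hr:
u = crossover_utilization(0.32, 0.20)
print(f"{u:.1%}")  # serverless wins below ~62.5% utilization
```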
9. Action Items¶
1. Clarify Data Residency Requirements
   - Check if your required region supports ACA serverless GPU (see availability table)
   - If NOT supported → plan for Azure Batch (Spot VMs) or AI Foundry Managed Compute
   - If cross-region processing is acceptable → ✅ ACA Serverless GPU is optimal
2. Evaluate if Pyannote is Required (vs Azure Speech diarization)
   - If Pyannote NOT required → Azure Speech Batch Transcription (simplest, most regions)
   - If Pyannote required → custom model hosting (ACA, Azure ML Batch, or Azure Batch)
3. Evaluate if Standard Whisper Works
   - If YES → Azure Speech Batch/Fast Transcription (simplest, most regions)
   - If NO (need a custom Whisper variant) → custom model hosting required
4. Assess Workload Patterns
   - Variable with idle time → ✅ ACA Serverless GPU or Azure ML Batch/Azure Batch
   - 24/7 consistent → Managed Compute (but still more expensive)
5. Evaluate Preemption Tolerance (for the Spot VM option)
   - Can tolerate preemption → Azure Batch (Spot VMs) for lowest cost
   - Need guaranteed execution → AI Foundry Managed Compute or dedicated VMs
6. Request GPU Quota (if choosing ACA)
   - EA and Pay-as-you-go customers have default quota
   - Submit an Azure support case for additional quota
7. Plan Architecture
   - Use ACA Jobs for batch processing workflows (if using ACA)
   - Use Azure ML Batch Endpoints for Canada with MLOps features
   - Use Azure Batch for maximum control and lowest cost
   - Use Storage Queue triggers for event-driven processing
   - Mount models from Azure Storage to reduce image size
8. Optimize Cold Start
   - Use Premium ACR with artifact streaming
   - For batch jobs, cold start is negligible vs. 1-2 hour processing time
9. Design for GPU Memory (Whisper + Pyannote)
   - Combined models need ~8-10GB VRAM
   - NVIDIA T4 (16GB) is sufficient
   - Consider sequential vs. parallel model execution
References¶
Azure Container Apps¶
Azure Machine Learning¶
Azure Batch¶
Azure Speech Services¶
- Azure Speech Batch Transcription
- Azure Speech Fast Transcription
- Speaker Diarization
- Whisper Model Overview
AI Foundry¶
Pyannote (Open Source)¶
Document prepared: January 2026
Last verified against official Microsoft documentation: March 10, 2026
Based on current Azure documentation and service availability
✅ Verification Status¶
The following information has been verified against official Microsoft Learn documentation:
| Claim | Verified | Source |
|---|---|---|
| ACA Serverless GPU: 17 supported regions including Canada Central (many regions NOT available) | ✅ | ACA GPU Overview |
| ACA Foundry Models deployment to serverless GPUs (preview) | ✅ | ACA GPU Overview |
| ACA T4: 16GB VRAM, A100: 80GB HBM2e | ✅ | ACA GPU Types |
| Speech Batch Transcription: 1GB max, 240min with diarization | ✅ | Speech Quotas |
| Speech Fast Transcription: <300MB, <120min | ✅ | Speech Quotas |
| Speaker Diarization: max 36 speakers | ✅ | Batch Transcription Create |
| Azure OpenAI Whisper: 25MB file limit | ✅ | Whisper Quickstart |
| Azure Batch Spot VMs: up to 90% discount | ✅ | Batch Spot VMs |
| Low-Priority VMs: Retired September 30, 2025 | ✅ | Batch Spot VMs |
| ACA Per-second billing, scale-to-zero | ✅ | ACA Billing |
| ACA: Only one container per app can use GPU | ✅ | ACA GPU Overview |
| ACA: Multi/fractional GPU replicas NOT supported | ✅ | ACA GPU Overview |
| EA/Pay-as-you-go default GPU quota | ✅ | ACA GPU Overview |
| West Europe requires new workload profile environment | ✅ | ACA GPU Overview |
Note: The specific hourly prices (~$0.32/hr T4, ~$2.34/hr A100) are estimates based on the Azure pricing calculator and may vary. Always confirm current pricing with the Azure Pricing Calculator.