Deployment & Hardware Guide

Srasta runs entirely inside your infrastructure — on-prem, private cloud, or hybrid. This guide covers deployment models, hardware sizing, and GPU selection by tier.

Deployment Models

Srasta does not require public SaaS hosting. All components operate inside your controlled infrastructure.

On-Prem

GPU servers in your data centre. Full control, zero external dependencies. Recommended for regulated industries and air-gapped requirements.

Private Cloud

AWS, Azure, or GCP — deployed inside your VPC. No data leaves your cloud account. Supports GPU instance types across all major providers.

Self-Hosted Private Cloud

VMware, OpenStack, or Proxmox on your existing data centre infrastructure. Srasta deploys via Docker Compose or Kubernetes.
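
For scripted environments, containers can also be driven from the Docker SDK for Python. The sketch below is illustrative only: the image name, port, and environment variable are placeholders, and production installs should use the Compose files or Kubernetes manifests for your Srasta release.

```python
# Minimal sketch using the Docker SDK for Python (pip install docker).
# Image name, port, and environment variable are placeholders, not
# Srasta's actual artefacts.
import docker

client = docker.from_env()

container = client.containers.run(
    "srasta/platform:latest",        # hypothetical image name
    name="srasta-app",
    detach=True,                     # run in the background
    ports={"8080/tcp": 8080},        # expose the app on the host
    environment={"SRASTA_MODE": "evaluation"},  # placeholder setting
)
print(container.name, container.status)
```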

Hybrid

On-prem inference combined with cloud integrations. Run your models on owned hardware while connecting to cloud-based storage or services.

Reference Deployment

Our production reference configuration. The platform you see demonstrated runs on this hardware.

Reference Hardware
Platform: NVIDIA DGX Spark
Architecture: ARM64 / aarch64
Deployed model: 30B-parameter class (FP8)
Inference engine: vLLM (production-grade)
Embeddings: Local embedding model
Vector store: Milvus (hybrid search)

The DGX Spark is NVIDIA's purpose-built enterprise AI platform. It is the recommended starting point for organisations deploying Srasta on-prem at scale.
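
Because vLLM exposes an OpenAI-compatible HTTP API, the reference stack can be smoke-tested with a few lines of Python. The host, port, and model id below are illustrative assumptions:

```python
# Smoke test against a vLLM server via its OpenAI-compatible API.
# Host name and model id are assumptions for illustration.
import requests

resp = requests.post(
    "http://dgx-spark.internal:8000/v1/chat/completions",  # hypothetical host
    json={
        "model": "srasta-30b-fp8",  # placeholder model id
        "messages": [{"role": "user", "content": "What is our PTO policy?"}],
        "max_tokens": 256,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```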

Hardware Requirements by Tier

Sizing scales with subscription tier and workload concurrency.

Foundation Tier
Use cases: Knowledge assistant · Policy lookup · Internal document search · Low concurrency
CPU: 8–16 vCPU
RAM: 32–64 GB
Storage: 500 GB – 1 TB SSD
GPU: Recommended. CPU inference is viable for low-concurrency evaluation only; expect significantly higher latency.
Cloud examples: AWS c6i / m6i · Azure D-series · GCP n2-standard

Enterprise Plus
Use cases: Multi-team deployments · Compliance-sensitive environments · Dedicated model routing · High concurrency
CPU: 32+ cores
RAM: 128–256 GB
Storage: 2 TB+ NVMe (RAID for high availability)
GPU: Multi-GPU cluster (A100 / H100 / L40S). Dedicated per-tenant model instances available.
Supports: Dedicated model instances · Tenant isolation · High-availability configuration · Kubernetes scaling
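
As a rough pre-flight check against these minimums, a candidate host can be inspected in Python; the thresholds below mirror the Foundation Tier figures, and psutil is a third-party package:

```python
# Illustrative pre-flight check against the Foundation Tier minimums
# above. Adjust the thresholds for Enterprise Plus sizing.
import shutil
import psutil  # pip install psutil

MIN_VCPU = 8
MIN_RAM_GB = 32
MIN_DISK_GB = 500

vcpus = psutil.cpu_count(logical=True)
ram_gb = psutil.virtual_memory().total / 1024**3
disk_gb = shutil.disk_usage("/").total / 1024**3

for label, have, need in [
    ("vCPU", vcpus, MIN_VCPU),
    ("RAM (GB)", ram_gb, MIN_RAM_GB),
    ("Disk (GB)", disk_gb, MIN_DISK_GB),
]:
    status = "OK" if have >= need else "BELOW MINIMUM"
    print(f"{label}: {have:.0f} (need {need}) -> {status}")
```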

Model Sizing Guidance

7B–14B (Foundation Tier)

Lower VRAM requirements. Suitable for knowledge retrieval, document Q&A, and policy lookup. Runs on entry-level GPU hardware.
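
As a back-of-envelope check on why this class fits entry-level GPUs: weight memory is roughly parameter count times bytes per parameter, plus runtime overhead for the KV cache and activations. The 1.2x overhead factor below is an illustrative allowance; real requirements vary with context length and batch size.

```python
# Rough VRAM estimate for dense models: weights are
# parameter_count x bytes_per_parameter, scaled by an illustrative
# overhead allowance for KV cache and activations.
def weight_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    return params_billion * bytes_per_param * overhead

for params in (7, 14):
    for precision, nbytes in (("FP16", 2.0), ("FP8", 1.0)):
        print(f"{params}B @ {precision}: ~{weight_vram_gb(params, nbytes):.0f} GB")
# 7B @ FP16 ~17 GB, 7B @ FP8 ~8 GB, 14B @ FP16 ~34 GB, 14B @ FP8 ~17 GB
```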

MoE (Enterprise Plus)

Mixture-of-Experts architectures require higher VRAM and benefit from multi-GPU routing. Srasta supports model pooling and hybrid routing strategies.
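
As a simplified illustration of the pooling idea (not Srasta's actual routing implementation), a round-robin dispatcher over a pool of inference endpoints might look like this:

```python
# Illustrative model pooling: round-robin requests across a pool of
# inference endpoints. URLs are hypothetical; Srasta's real routing is
# configured in the platform, not hand-rolled.
import itertools

POOL = [
    "http://gpu-node-1:8000/v1",
    "http://gpu-node-2:8000/v1",
]
_next = itertools.cycle(POOL)

def pick_endpoint() -> str:
    """Return the next endpoint in round-robin order."""
    return next(_next)

print(pick_endpoint())  # http://gpu-node-1:8000/v1
print(pick_endpoint())  # http://gpu-node-2:8000/v1
```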

High Availability Best Practices

For production deployments:

Separate inference from RAG ingestion

Prevents ingestion workloads from impacting active inference latency.

Dedicated vector database node

Isolate the vector store to its own node for reliable search performance at scale.
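
With the store on its own node, application servers reach Milvus over the network. A minimal connectivity check with pymilvus, the official Milvus client, might look like this (the host name is an assumption):

```python
# Connectivity check from an application server to a dedicated Milvus
# node. "milvus.internal" is a hypothetical host name.
from pymilvus import connections, utility

connections.connect(alias="default", host="milvus.internal", port="19530")
print(utility.get_server_version())  # sanity-check the dedicated node
```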

Observability enabled from day one

Monitor latency, token usage, error rates, and cost per team before scaling.
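
A minimal sketch of what day-one instrumentation can look like, using the Prometheus Python client; the metric names and team label are illustrative, not a built-in Srasta schema:

```python
# Illustrative metrics for the signals above: latency, token usage, and
# errors, labelled per team so cost can be attributed before scaling.
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("inference_latency_seconds", "Inference latency", ["team"])
TOKENS_USED = Counter("tokens_total", "Tokens consumed", ["team"])
ERRORS = Counter("inference_errors_total", "Failed inference calls", ["team"])

start_http_server(9100)  # scrape endpoint for Prometheus

with REQUEST_LATENCY.labels(team="finance").time():
    ...  # call the model here
TOKENS_USED.labels(team="finance").inc(512)
```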

Multi-AZ for cloud deployments

Distribute across availability zones for resilience in AWS, Azure, and GCP environments.

Backup and restore configuration

Snapshot vector collections, knowledge ingestion pipelines, and governance configuration on a scheduled basis.
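
As one illustrative approach for the configuration side, a timestamped copy job can run on a schedule (cron or similar); the paths here are placeholders, and vector collections are usually backed up with the vector store's own tooling rather than file copies.

```python
# Illustrative scheduled snapshot of configuration directories.
# Source and destination paths are hypothetical placeholders.
import shutil
from datetime import datetime, timezone
from pathlib import Path

SOURCES = [Path("/etc/srasta/governance"), Path("/etc/srasta/pipelines")]
DEST = Path("/backups")

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
for src in SOURCES:
    target = DEST / f"{src.name}-{stamp}"
    shutil.copytree(src, target)  # timestamped point-in-time copy
    print(f"snapshotted {src} -> {target}")
```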

Kubernetes for horizontal scaling

Optional but recommended for Enterprise Plus deployments with unpredictable concurrency.

Hardware Sizing Estimator

Answer three questions to get an indicative configuration. For accurate sizing, schedule a deployment session with our team.

1. Primary use case
2. Deployment environment
3. Expected active users
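
For illustration only, the mapping from these answers to an indicative configuration might look like the sketch below; the thresholds and tier boundaries are assumptions, not Srasta's official sizing logic.

```python
# Illustrative version of the three-question estimator. The answer
# vocabulary and the 100-user threshold are assumptions.
def recommend_tier(use_case: str, environment: str, active_users: int) -> str:
    """Map the three estimator answers to an indicative tier."""
    heavy_use = use_case in {"multi-team", "compliance", "dedicated-routing"}
    if heavy_use or active_users > 100:
        return "Enterprise Plus (multi-GPU cluster; A100 / H100 / L40S)"
    return "Foundation (8-16 vCPU, 32-64 GB RAM, single GPU recommended)"

print(recommend_tier("document-search", "private-cloud", 40))
```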

Not sure which configuration fits your environment?

We scope every deployment during the AI Readiness Assessment — including model selection, GPU sizing, cloud vs on-prem trade-offs, and a cost estimate.

Request a Deployment Sizing Session →