AWS Setup Guide

Deploy ML training jobs and manage AWS infrastructure with Nova AI.

Time: 15-20 minutes


What You'll Learn

  • AWS prerequisites and credential configuration
  • Installing AWS Labs MCP servers for enhanced capabilities
  • Using /novaai-aws for ML training deployment
  • DevBox management for cloud development environments
  • CDK infrastructure generation and deployment
  • Cost optimization with spot instances

Prerequisites

AWS Account & Credentials

  1. AWS Account with appropriate permissions
  2. AWS CLI installed and configured:
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure credentials
aws configure
# Enter: Access Key ID, Secret Access Key, Region (us-west-2), Output format (json)
  3. Verify credentials:
aws sts get-caller-identity

Required IAM Permissions

Create an IAM policy with these permissions for Nova AI AWS operations:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "s3:*",
        "iam:CreateRole",
        "iam:PutRolePolicy",
        "iam:GetRole",
        "iam:PassRole",
        "ssm:GetParameter",
        "ssm:PutParameter",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "cloudwatch:PutMetricData",
        "cloudwatch:PutDashboard",
        "cloudformation:*",
        "batch:*",
        "ecr:*"
      ],
      "Resource": "*"
    }
  ]
}

Least Privilege

For production, scope permissions to specific resources using ARNs.
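
As a concrete sketch of that scoping, the wildcard s3:* grant above can be narrowed to a single bucket. The bucket name below is a placeholder for illustration, not something Nova AI requires:

```python
import json

# Placeholder bucket name -- substitute your own.
BUCKET = "my-ml-project-artifacts"

# Narrows the broad "s3:*" grant to one bucket and its objects.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

print(json.dumps(scoped_policy, indent=2))
```

Apply the same pattern to the EC2, Batch, and CloudFormation wildcards once you know which resources your stacks create.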


AWS MCP Servers (Optional Enhancement)

AWS Labs provides official MCP servers that enhance Nova AI's AWS capabilities.

Available Servers

Server          Purpose
CCAPI           Cloud Control API - manage 1,100+ AWS resources
CDK             AWS CDK best practices and code generation
CloudFormation  Template generation and validation
Terraform       Infrastructure as Code with Terraform

Installation

Add the servers you want to your MCP configuration (~/.aws/amazonq/mcp.json, or .amazonq/mcp.json in your project):

{
  "mcpServers": {
    "awslabs.ccapi-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.ccapi-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default",
        "DEFAULT_TAGS": "enabled",
        "SECURITY_SCANNING": "enabled",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.cdk-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cdk-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default"
      }
    }
  }
}

uvx Required

Install uv first: pip install uv or curl -LsSf https://astral.sh/uv/install.sh | sh

CCAPI Permissions

For Cloud Control API server, ensure your AWS credentials have:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudcontrol:ListResources",
        "cloudcontrol:GetResource",
        "cloudcontrol:CreateResource",
        "cloudcontrol:DeleteResource",
        "cloudcontrol:UpdateResource",
        "cloudformation:CreateGeneratedTemplate",
        "cloudformation:DescribeGeneratedTemplate",
        "cloudformation:GetGeneratedTemplate"
      ],
      "Resource": "*"
    }
  ]
}

Quick Start: /novaai-aws

Basic Commands

# Generate deployment files (interactive)
/novaai-aws

# Validate without deploying
/novaai-aws --dry-run

# Full deployment
/novaai-aws --deploy

DevBox Commands

# Create a new devbox
/novaai-aws spawn my-devbox medium

# Set up a GitHub repo on devbox
/novaai-aws setup my-devbox https://github.com/user/ml-project

# Run training
/novaai-aws run my-devbox

# Check status
/novaai-aws status my-devbox

# List all devboxes
/novaai-aws list

# Connect via SSH/Cursor
/novaai-aws connect my-devbox

# Stop/Terminate
/novaai-aws stop my-devbox
/novaai-aws terminate my-devbox

DevBox Setup Workflow

DevBox provides managed EC2 instances optimized for ML development.

1. Spawn a DevBox

/novaai-aws spawn jeff-devbox large

Sizes:

Size    Instance     vCPUs  RAM   GPU      Cost
micro   t3.micro     2      1GB   -        $0.01/hr
small   t3.medium    2      4GB   -        $0.04/hr
medium  g4dn.xlarge  4      16GB  1x T4    $0.53/hr
large   p3.2xlarge   8      61GB  1x V100  $3.06/hr

2. Set Up Repository

/novaai-aws setup jeff-devbox https://github.com/user/ml-project

What happens automatically:

  1. Clone repository to devbox
  2. Scan for dependencies:
       • ML framework (PyTorch, TensorFlow, JAX)
       • Environment files (environment.yml, requirements.txt)
       • GPU requirements (flash-attention, CUDA)
  3. Detect S3 buckets and verify access
  4. Set up conda/mamba environment
  5. Generate run.sh training script
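
The S3 bucket detection above can be sketched as a scan for s3:// URIs in config and source files. find_buckets is an illustrative helper, not the actual scanner:

```python
import re

# Matches the bucket-name portion of an s3:// URI.
S3_URI = re.compile(r"s3://([a-z0-9][a-z0-9.\-]+)")

def find_buckets(text: str) -> set[str]:
    """Collect bucket names from s3:// URIs found in a file's contents."""
    return set(S3_URI.findall(text))

print(find_buckets("data: s3://my-data/train\ncheckpoints: s3://my-ckpts/run1"))
```

Each discovered bucket can then be probed with a head-bucket call to verify access before training starts.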

3. Run Training

/novaai-aws run jeff-devbox

Monitor with:

/novaai-aws status jeff-devbox

ML Training Deployment

For production ML training without DevBox.

Workflow Phases

graph TD
    A[Repository Analysis] --> B[Interactive Setup]
    B --> C{Parallel Generation}
    C --> D[Docker]
    C --> E[Wrapper]
    C --> F[Scripts]
    D --> G[Infrastructure CDK]
    E --> G
    F --> G
    G --> H[Summary & Docs]

Interactive Setup (5-6 Questions)

/novaai-aws

Nova AI will ask:

  1. Framework - PyTorch/TensorFlow/JAX
  2. Lightning? - Using PyTorch Lightning? (Y/n)
  3. Tracking? - W&B/MLflow/None
  4. GPU Type - Instance selection with pricing
  5. Spot? - Use spot instances for 70% savings? (Y/n)
  6. Entrypoint - Training script path

Generated Files

aws-deployment/
├── Dockerfile.env          # CUDA + micromamba
├── docker-compose.yml      # Local GPU testing
├── .env.aws               # SSM parameter references
├── train_wrapper.py       # Training integration
├── scripts/
│   ├── setup-ssm-params.sh
│   ├── deploy.sh
│   ├── monitor.sh
│   ├── logs.sh
│   └── debug.sh
├── infrastructure/aws/    # CDK stack
│   ├── app.py
│   ├── stack.py
│   └── requirements.txt
└── README.md

GPU Options & Pricing

Instance      GPU      VRAM   On-Demand  Spot
g4dn.xlarge   1x T4    16GB   $0.53/hr   $0.16/hr
p3.2xlarge    1x V100  16GB   $3.06/hr   $0.92/hr
p3.8xlarge    4x V100  64GB   $12.24/hr  $3.67/hr
p4d.24xlarge  8x A100  320GB  $32.77/hr  $9.83/hr

Cost Savings

Spot instances typically cut compute costs by around 70%. Nova AI automatically configures checkpointing to handle spot interruptions.


Deployment Steps

1. Configure Secrets

cd aws-deployment
vim .env.aws  # Add AWS account ID, W&B API key
./scripts/setup-ssm-params.sh

2. Deploy Infrastructure

cd infrastructure/aws
pip install -r requirements.txt

# First time only
cdk bootstrap

# Deploy stack
cdk deploy

3. Deploy Training

cd ../..
./scripts/deploy.sh

4. Monitor

./scripts/monitor.sh --watch
./scripts/logs.sh

Autonomous Setup (Zero Questions)

Nova AI can deploy with 0-2 questions using smart defaults.

from src.orchestrator.aws import autonomous_setup

result = autonomous_setup(
    analysis=repo_analysis,
    project_root=Path("."),
    region="us-west-2",
    interactive=False,  # No questions
)

print(result.summary())
# ✅ Configuration generated (asked 0 questions)
#    Project: my-ml-project
#    Instance: p3.2xlarge
#    Cost: $3.06/hr

Auto-detected:

  • ML framework from imports
  • GPU requirements from dependencies
  • Entrypoint from common patterns
  • W&B/MLflow from installed packages

May ask (0-2 questions):

  1. W&B API key (if W&B detected but no key)
  2. Budget confirmation (if instance > $5/hr)

Checkpointing

Multi-tier checkpointing for spot instance resilience:

Tier       Interval    Location   Use Case
Fast       5 min       Local SSD  Quick recovery
Mid        30 min      EBS        Persistent storage
Durable    1 hour      S3         Cross-instance
Emergency  On SIGTERM  S3         Spot interruption
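
The Emergency tier can be sketched as a SIGTERM handler. The local path below is an assumption for illustration, and a real handler would also upload the file to S3 (e.g. via boto3) rather than stop at the local write:

```python
import pickle
import signal
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/checkpoints")  # placeholder local path
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

def save_emergency_checkpoint(state: dict) -> Path:
    """Write training state locally; production code would also push to S3."""
    path = CHECKPOINT_DIR / "emergency.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    # boto3.client("s3").upload_file(str(path), bucket, key)  # durable copy
    return path

def handle_sigterm(signum, frame):
    # Spot reclamation delivers SIGTERM shortly before the instance is taken.
    save_emergency_checkpoint({"step": current_step})

current_step = 0
signal.signal(signal.SIGTERM, handle_sigterm)
```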

Experiment Tracking

Weights & Biases

Automatically configured with:

  • Git commit, branch, dirty status
  • S3 paths (data, checkpoints, outputs)
  • Compute info (GPU, CUDA, instance type)
  • Hyperparameters from config

# train_wrapper.py automatically logs:
import os
import wandb

wandb.init(
    config={
        "git_commit": get_git_commit(),
        "instance_type": get_instance_type(),
        "s3_checkpoint_path": os.environ["S3_CHECKPOINT_PATH"],
        # ... your hyperparameters
    }
)

SSM Parameter Store

Secrets stored encrypted:

/project-name/wandb/api-key           (SecureString)
/project-name/s3/checkpoint-path      (String)
/project-name/training/learning-rate  (String)
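
A small sketch of how a wrapper might address these parameters at runtime; ssm_param_path and get_secret are illustrative helpers, not part of the generated scripts:

```python
def ssm_param_path(project: str, *parts: str) -> str:
    """Build parameter names of the /project/... form shown above."""
    return "/" + "/".join([project, *parts])

def get_secret(name: str) -> str:
    """Fetch and decrypt a SecureString (requires AWS credentials)."""
    import boto3  # imported lazily so path-building works without AWS

    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]

print(ssm_param_path("my-ml-project", "wandb", "api-key"))
# /my-ml-project/wandb/api-key
```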

Local Testing

Test before deploying to AWS:

cd aws-deployment

# Build environment
docker build -f Dockerfile.env -t my-project-env .

# Test with GPU (requires nvidia-docker)
docker-compose up

Troubleshooting

Common Issues

Issue                   Solution
"No GPU detected"       Install nvidia-docker
"CDK bootstrap failed"  Check credentials: aws sts get-caller-identity
"Spot interrupted"      Normal! Emergency checkpoint saved. Job retries.
"S3 access denied"      Verify bucket permissions in IAM policy

Debug Commands

# Check AWS credentials
aws sts get-caller-identity

# Check EC2 instances
aws ec2 describe-instances --filters "Name=tag:Project,Values=my-project"

# Check CloudWatch logs
aws logs tail /aws/ml/my-project/training

# Check Batch jobs
aws batch describe-jobs --jobs <job-id>

Cost Estimation

p3.2xlarge spot (typical ML training):

Usage      Monthly Cost
8 hr/day   ~$221
24/7       ~$664
+ Storage  +$50-100

Total: ~$270-320/month for 8 hr/day training (compute plus storage)
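
A quick check of the figures above, using the p3.2xlarge spot rate from the pricing table (small differences come from rounding and month length):

```python
SPOT_RATE = 0.92  # p3.2xlarge spot, $/hr (from the GPU pricing table)

monthly_8h = SPOT_RATE * 8 * 30    # 8 hr/day for a 30-day month
monthly_247 = SPOT_RATE * 24 * 30  # around the clock

print(f"8 hr/day: ~${monthly_8h:.0f}/mo, 24/7: ~${monthly_247:.0f}/mo")
# 8 hr/day: ~$221/mo, 24/7: ~$662/mo
```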

