AWS Setup Guide

Deploy ML training jobs and manage AWS infrastructure with Nova AI.

Time: 15-20 minutes


What You'll Learn

  • AWS prerequisites and credential configuration
  • Installing AWS Labs MCP servers for enhanced capabilities
  • Using /novaai-aws for ML training deployment
  • DevBox management for cloud development environments
  • CDK infrastructure generation and deployment
  • Cost optimization with spot instances

Prerequisites

AWS Account & Credentials

  1. AWS Account with appropriate permissions
  2. AWS CLI installed and configured:
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure credentials
aws configure
# Enter: Access Key ID, Secret Access Key, Region (us-west-2), Output format (json)
  3. Verify credentials:
aws sts get-caller-identity

Required IAM Permissions

Create an IAM policy with these permissions for Nova AI AWS operations:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "s3:*",
        "iam:CreateRole",
        "iam:PutRolePolicy",
        "iam:GetRole",
        "iam:PassRole",
        "ssm:GetParameter",
        "ssm:PutParameter",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "cloudwatch:PutMetricData",
        "cloudwatch:PutDashboard",
        "cloudformation:*",
        "batch:*",
        "ecr:*"
      ],
      "Resource": "*"
    }
  ]
}

Least Privilege

For production, scope permissions to specific resources using ARNs.
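
As a concrete sketch of that scoping, the wildcard s3:* grant above can be narrowed to a single bucket. The bucket name below is a placeholder for illustration, not something Nova AI requires:

```python
import json

# Placeholder bucket name -- substitute your own.
BUCKET = "my-ml-project-artifacts"

# Narrows the broad "s3:*" grant to one bucket and its objects.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

print(json.dumps(scoped_policy, indent=2))
```

Apply the same pattern to the EC2, Batch, and CloudFormation wildcards once you know which resources your stacks create.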


AWS MCP Servers (Optional Enhancement)

AWS Labs provides official MCP servers that enhance Nova AI's AWS capabilities.

Available Servers

Server          Purpose
CCAPI           Cloud Control API - manage 1,100+ AWS resources
CDK             AWS CDK best practices and code generation
CloudFormation  Template generation and validation
Terraform       Infrastructure as Code with Terraform

Installation

Add the servers you want to your MCP configuration (~/.aws/amazonq/mcp.json, or .amazonq/mcp.json in your project):

{
  "mcpServers": {
    "awslabs.ccapi-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.ccapi-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default",
        "DEFAULT_TAGS": "enabled",
        "SECURITY_SCANNING": "enabled",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.cdk-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cdk-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default"
      }
    }
  }
}

uvx Required

Install uv first: pip install uv or curl -LsSf https://astral.sh/uv/install.sh | sh

CCAPI Permissions

For Cloud Control API server, ensure your AWS credentials have:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudcontrol:ListResources",
        "cloudcontrol:GetResource",
        "cloudcontrol:CreateResource",
        "cloudcontrol:DeleteResource",
        "cloudcontrol:UpdateResource",
        "cloudformation:CreateGeneratedTemplate",
        "cloudformation:DescribeGeneratedTemplate",
        "cloudformation:GetGeneratedTemplate"
      ],
      "Resource": "*"
    }
  ]
}

Quick Start: /novaai-aws

Basic Commands

# Generate deployment files (interactive)
/novaai-aws

# Validate without deploying
/novaai-aws --dry-run

# Full deployment
/novaai-aws --deploy

DevBox Commands

# Create a new devbox
/novaai-aws spawn my-devbox medium

# Set up a GitHub repo on devbox
/novaai-aws setup my-devbox https://github.com/user/ml-project

# Run training
/novaai-aws run my-devbox

# Check status
/novaai-aws status my-devbox

# List all devboxes
/novaai-aws list

# Connect via SSH/Cursor
/novaai-aws connect my-devbox

# Stop/Terminate
/novaai-aws stop my-devbox
/novaai-aws terminate my-devbox

DevBox Setup Workflow

DevBox provides managed EC2 instances optimized for ML development.

1. Spawn a DevBox

/novaai-aws spawn jeff-devbox large

Sizes:

Size    Instance     vCPUs  RAM   GPU      Cost
micro   t3.micro     2      1GB   -        $0.01/hr
small   t3.medium    2      4GB   -        $0.04/hr
medium  g4dn.xlarge  4      16GB  1x T4    $0.53/hr
large   p3.2xlarge   8      61GB  1x V100  $3.06/hr

2. Set Up Repository

/novaai-aws setup jeff-devbox https://github.com/user/ml-project

What happens automatically:

  1. Clone repository to devbox
  2. Scan for dependencies:
       • ML framework (PyTorch, TensorFlow, JAX)
       • Environment files (environment.yml, requirements.txt)
       • GPU requirements (flash-attention, CUDA)
  3. Detect S3 buckets and verify access
  4. Set up conda/mamba environment
  5. Generate run.sh training script
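
The S3 bucket detection above can be sketched as a scan for s3:// URIs in config and source files. find_buckets is an illustrative helper, not the actual scanner:

```python
import re

# Matches the bucket-name portion of an s3:// URI.
S3_URI = re.compile(r"s3://([a-z0-9][a-z0-9.\-]+)")

def find_buckets(text: str) -> set[str]:
    """Collect bucket names from s3:// URIs found in a file's contents."""
    return set(S3_URI.findall(text))

print(find_buckets("data: s3://my-data/train\ncheckpoints: s3://my-ckpts/run1"))
```

Each discovered bucket can then be probed with a head-bucket call to verify access before training starts.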

3. Run Training

/novaai-aws run jeff-devbox

Monitor with:

/novaai-aws status jeff-devbox

ML Training Deployment

For production ML training without DevBox.

Workflow Phases

graph TD
    A[Repository Analysis] --> B[Interactive Setup]
    B --> C{Parallel Generation}
    C --> D[Docker]
    C --> E[Wrapper]
    C --> F[Scripts]
    D --> G[Infrastructure CDK]
    E --> G
    F --> G
    G --> H[Summary & Docs]

Interactive Setup (5-6 Questions)

/novaai-aws

Nova AI will ask:

  1. Framework - PyTorch/TensorFlow/JAX
  2. Lightning? - Using PyTorch Lightning? (Y/n)
  3. Tracking? - W&B/MLflow/None
  4. GPU Type - Instance selection with pricing
  5. Spot? - Use spot instances for 70% savings? (Y/n)
  6. Entrypoint - Training script path

Generated Files

aws-deployment/
├── Dockerfile.env          # CUDA + micromamba
├── docker-compose.yml      # Local GPU testing
├── .env.aws               # SSM parameter references
├── train_wrapper.py       # Training integration
├── scripts/
│   ├── setup-ssm-params.sh
│   ├── deploy.sh
│   ├── monitor.sh
│   ├── logs.sh
│   └── debug.sh
├── infrastructure/aws/    # CDK stack
│   ├── app.py
│   ├── stack.py
│   └── requirements.txt
└── README.md

GPU Options & Pricing

Instance      GPU      VRAM   On-Demand  Spot
g4dn.xlarge   1x T4    16GB   $0.53/hr   $0.16/hr
p3.2xlarge    1x V100  16GB   $3.06/hr   $0.92/hr
p3.8xlarge    4x V100  64GB   $12.24/hr  $3.67/hr
p4d.24xlarge  8x A100  320GB  $32.77/hr  $9.83/hr

Cost Savings

Spot instances typically cut compute costs by around 70%. Nova AI automatically configures checkpointing to handle spot interruptions.


Deployment Steps

1. Configure Secrets

cd aws-deployment
vim .env.aws  # Add AWS account ID, W&B API key
./scripts/setup-ssm-params.sh

2. Deploy Infrastructure

cd infrastructure/aws
pip install -r requirements.txt

# First time only
cdk bootstrap

# Deploy stack
cdk deploy

3. Deploy Training

cd ../..
./scripts/deploy.sh

4. Monitor

./scripts/monitor.sh --watch
./scripts/logs.sh

Autonomous Setup (Zero Questions)

Nova AI can deploy with 0-2 questions using smart defaults.

from src.orchestrator.aws import autonomous_setup

result = autonomous_setup(
    analysis=repo_analysis,
    project_root=Path("."),
    region="us-west-2",
    interactive=False,  # No questions
)

print(result.summary())
# ✅ Configuration generated (asked 0 questions)
#    Project: my-ml-project
#    Instance: p3.2xlarge
#    Cost: $3.06/hr

Auto-detected:

  • ML framework from imports
  • GPU requirements from dependencies
  • Entrypoint from common patterns
  • W&B/MLflow from installed packages

May ask (0-2 questions):

  1. W&B API key (if W&B detected but no key)
  2. Budget confirmation (if instance > $5/hr)

Checkpointing

Multi-tier checkpointing for spot instance resilience:

Tier       Interval    Location   Use Case
Fast       5 min       Local SSD  Quick recovery
Mid        30 min      EBS        Persistent storage
Durable    1 hour      S3         Cross-instance
Emergency  On SIGTERM  S3         Spot interruption
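
The Emergency tier can be sketched as a SIGTERM handler. The local path below is an assumption for illustration, and a real handler would also upload the file to S3 (e.g. via boto3) rather than stop at the local write:

```python
import pickle
import signal
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/checkpoints")  # placeholder local path
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

def save_emergency_checkpoint(state: dict) -> Path:
    """Write training state locally; production code would also push to S3."""
    path = CHECKPOINT_DIR / "emergency.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    # boto3.client("s3").upload_file(str(path), bucket, key)  # durable copy
    return path

def handle_sigterm(signum, frame):
    # Spot reclamation delivers SIGTERM shortly before the instance is taken.
    save_emergency_checkpoint({"step": current_step})

current_step = 0
signal.signal(signal.SIGTERM, handle_sigterm)
```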

Experiment Tracking

Weights & Biases

Automatically configured with:

  • Git commit, branch, dirty status
  • S3 paths (data, checkpoints, outputs)
  • Compute info (GPU, CUDA, instance type)
  • Hyperparameters from config

# train_wrapper.py automatically logs:
import os
import wandb

wandb.init(
    config={
        "git_commit": get_git_commit(),
        "instance_type": get_instance_type(),
        "s3_checkpoint_path": os.environ["S3_CHECKPOINT_PATH"],
        # ... your hyperparameters
    }
)

SSM Parameter Store

Secrets stored encrypted:

/project-name/wandb/api-key           (SecureString)
/project-name/s3/checkpoint-path      (String)
/project-name/training/learning-rate  (String)
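
A small sketch of how a wrapper might address these parameters at runtime; ssm_param_path and get_secret are illustrative helpers, not part of the generated scripts:

```python
def ssm_param_path(project: str, *parts: str) -> str:
    """Build parameter names of the /project/... form shown above."""
    return "/" + "/".join([project, *parts])

def get_secret(name: str) -> str:
    """Fetch and decrypt a SecureString (requires AWS credentials)."""
    import boto3  # imported lazily so path-building works without AWS

    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]

print(ssm_param_path("my-ml-project", "wandb", "api-key"))
# /my-ml-project/wandb/api-key
```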

Local Testing

Test before deploying to AWS:

cd aws-deployment

# Build environment
docker build -f Dockerfile.env -t my-project-env .

# Test with GPU (requires nvidia-docker)
docker-compose up

Troubleshooting

Common Issues

Issue                   Solution
"No GPU detected"       Install nvidia-docker
"CDK bootstrap failed"  Check credentials: aws sts get-caller-identity
"Spot interrupted"      Normal! Emergency checkpoint saved. Job retries.
"S3 access denied"      Verify bucket permissions in IAM policy

Debug Commands

# Check AWS credentials
aws sts get-caller-identity

# Check EC2 instances
aws ec2 describe-instances --filters "Name=tag:Project,Values=my-project"

# Check CloudWatch logs
aws logs tail /aws/ml/my-project/training

# Check Batch jobs
aws batch describe-jobs --jobs <job-id>

Cost Estimation

p3.2xlarge spot (typical ML training):

Usage      Monthly Cost
8 hr/day   ~$221
24/7       ~$664
+ Storage  +$50-100

Total: ~$270-320/month for 8 hr/day training (compute plus storage)
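
A quick check of the figures above, using the p3.2xlarge spot rate from the pricing table (small differences come from rounding and month length):

```python
SPOT_RATE = 0.92  # p3.2xlarge spot, $/hr (from the GPU pricing table)

monthly_8h = SPOT_RATE * 8 * 30    # 8 hr/day for a 30-day month
monthly_247 = SPOT_RATE * 24 * 30  # around the clock

print(f"8 hr/day: ~${monthly_8h:.0f}/mo, 24/7: ~${monthly_247:.0f}/mo")
# 8 hr/day: ~$221/mo, 24/7: ~$662/mo
```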

