AWS Setup Guide¶
Deploy ML training jobs and manage AWS infrastructure with Nova AI.
Time: 15-20 minutes
What You'll Learn¶
- AWS prerequisites and credential configuration
- Installing AWS Labs MCP servers for enhanced capabilities
- Using `/novaai-aws` for ML training deployment
- DevBox management for cloud development environments
- CDK infrastructure generation and deployment
- Cost optimization with spot instances
Prerequisites¶
AWS Account & Credentials¶
- AWS Account with appropriate permissions
- AWS CLI installed and configured:
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Configure credentials
aws configure
# Enter: Access Key ID, Secret Access Key, Region (us-west-2), Output format (json)
- Verify credentials with `aws sts get-caller-identity`
Required IAM Permissions¶
Create an IAM policy with these permissions for Nova AI AWS operations:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:*",
"s3:*",
"iam:CreateRole",
"iam:PutRolePolicy",
"iam:GetRole",
"iam:PassRole",
"ssm:GetParameter",
"ssm:PutParameter",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"cloudwatch:PutMetricData",
"cloudwatch:PutDashboard",
"cloudformation:*",
"batch:*",
"ecr:*"
],
"Resource": "*"
}
]
}
Least Privilege
For production, scope permissions to specific resources using ARNs.
AWS MCP Servers (Optional Enhancement)¶
AWS Labs provides official MCP servers that enhance Nova AI's AWS capabilities.
Available Servers¶
| Server | Purpose |
|---|---|
| CCAPI | Cloud Control API - manage 1,100+ AWS resources |
| CDK | AWS CDK best practices and code generation |
| CloudFormation | Template generation and validation |
| Terraform | Infrastructure as Code with Terraform |
Installation¶
Add to your MCP configuration (~/.aws/amazon/mcp.json or .amazonq/mcp.json):
uvx Required
Install uv first: pip install uv or curl -LsSf https://astral.sh/uv/install.sh | sh
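A minimal configuration might look like the following. The exact server package names are assumptions here; check the AWS Labs MCP repository for the current names and options:

```json
{
  "mcpServers": {
    "awslabs.ccapi-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.ccapi-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default",
        "AWS_REGION": "us-west-2"
      }
    },
    "awslabs.cdk-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cdk-mcp-server@latest"]
    }
  }
}
```

Each server runs as a separate process via `uvx`; the `env` block lets you pin the profile and region the server operates against.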
CCAPI Permissions¶
For Cloud Control API server, ensure your AWS credentials have:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudcontrol:ListResources",
"cloudcontrol:GetResource",
"cloudcontrol:CreateResource",
"cloudcontrol:DeleteResource",
"cloudcontrol:UpdateResource",
"cloudformation:CreateGeneratedTemplate",
"cloudformation:DescribeGeneratedTemplate",
"cloudformation:GetGeneratedTemplate"
],
"Resource": "*"
}
]
}
Quick Start: /novaai-aws¶
Basic Commands¶
# Generate deployment files (interactive)
/novaai-aws
# Validate without deploying
/novaai-aws --dry-run
# Full deployment
/novaai-aws --deploy
DevBox Commands¶
# Create a new devbox
/novaai-aws spawn my-devbox medium
# Set up a GitHub repo on devbox
/novaai-aws setup my-devbox https://github.com/user/ml-project
# Run training
/novaai-aws run my-devbox
# Check status
/novaai-aws status my-devbox
# List all devboxes
/novaai-aws list
# Connect via SSH/Cursor
/novaai-aws connect my-devbox
# Stop/Terminate
/novaai-aws stop my-devbox
/novaai-aws terminate my-devbox
DevBox Setup Workflow¶
DevBox provides managed EC2 instances optimized for ML development.
1. Spawn a DevBox¶
Spawn with `/novaai-aws spawn <name> <size>`. Available sizes:
| Size | Instance | vCPUs | RAM | GPU | Cost |
|---|---|---|---|---|---|
| micro | t3.micro | 2 | 1GB | - | $0.01/hr |
| small | t3.medium | 2 | 4GB | - | $0.04/hr |
| medium | g4dn.xlarge | 4 | 16GB | 1x T4 | $0.53/hr |
| large | p3.2xlarge | 8 | 61GB | 1x V100 | $3.06/hr |
2. Set Up Repository¶
What happens automatically:
- Clone repository to devbox
- Scan for dependencies:
- ML framework (PyTorch, TensorFlow, JAX)
- Environment files (environment.yml, requirements.txt)
- GPU requirements (flash-attention, CUDA)
- Detect S3 buckets and verify access
- Set up conda/mamba environment
- Generate `run.sh` training script
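The dependency scan above can be sketched roughly as follows. This is a hypothetical illustration, not Nova AI's actual implementation; the package-to-framework mapping and GPU hints are assumptions:

```python
# Hypothetical sketch of the dependency scan: map well-known package
# names to the ML framework and GPU hints they imply.
FRAMEWORKS = {"torch": "PyTorch", "tensorflow": "TensorFlow", "jax": "JAX"}
GPU_HINTS = {"flash-attn", "flash-attention", "cupy"}

def scan_requirements(text: str) -> dict:
    """Return detected framework and GPU requirement from a requirements.txt body."""
    framework, needs_gpu = None, False
    for line in text.splitlines():
        # Strip comments and version pins, e.g. "torch==2.3.0  # pinned"
        name = line.split("#")[0].strip().split("==")[0].split(">=")[0].lower()
        if name in FRAMEWORKS:
            framework = FRAMEWORKS[name]
        if name in GPU_HINTS:
            needs_gpu = True
    return {"framework": framework, "needs_gpu": needs_gpu}

print(scan_requirements("torch==2.3.0\nflash-attn>=2.5\nnumpy"))
# {'framework': 'PyTorch', 'needs_gpu': True}
```

Detecting `flash-attn` is what triggers the CUDA-enabled base image in the generated Dockerfile.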
3. Run Training¶
Monitor progress with `/novaai-aws status my-devbox`.
ML Training Deployment¶
For production ML training without DevBox.
Workflow Phases¶
graph TD
A[Repository Analysis] --> B[Interactive Setup]
B --> C{Parallel Generation}
C --> D[Docker]
C --> E[Wrapper]
C --> F[Scripts]
D --> G[Infrastructure CDK]
E --> G
F --> G
G --> H[Summary & Docs]
Interactive Setup (5-6 Questions)¶
Nova AI will ask:
- Framework - PyTorch/TensorFlow/JAX
- Lightning? - Using PyTorch Lightning? (Y/n)
- Tracking? - W&B/MLflow/None
- GPU Type - Instance selection with pricing
- Spot? - Use spot instances for 70% savings? (Y/n)
- Entrypoint - Training script path
Generated Files¶
aws-deployment/
├── Dockerfile.env # CUDA + micromamba
├── docker-compose.yml # Local GPU testing
├── .env.aws # SSM parameter references
├── train_wrapper.py # Training integration
├── scripts/
│ ├── setup-ssm-params.sh
│ ├── deploy.sh
│ ├── monitor.sh
│ ├── logs.sh
│ └── debug.sh
├── infrastructure/aws/ # CDK stack
│ ├── app.py
│ ├── stack.py
│ └── requirements.txt
└── README.md
GPU Options & Pricing¶
| Instance | GPU | VRAM | On-Demand | Spot |
|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16GB | $0.53/hr | $0.16/hr |
| p3.2xlarge | 1x V100 | 16GB | $3.06/hr | $0.92/hr |
| p3.8xlarge | 4x V100 | 64GB | $12.24/hr | $3.67/hr |
| p4d.24xlarge | 8x A100 | 320GB | $32.77/hr | $9.83/hr |
Cost Savings
Spot instances provide 70% savings. Nova AI automatically configures checkpointing for spot interruption handling.
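The 70% figure follows directly from the pricing table above:

```python
# Spot discount implied by the pricing table above (USD per hour).
on_demand = {"g4dn.xlarge": 0.53, "p3.2xlarge": 3.06,
             "p3.8xlarge": 12.24, "p4d.24xlarge": 32.77}
spot = {"g4dn.xlarge": 0.16, "p3.2xlarge": 0.92,
        "p3.8xlarge": 3.67, "p4d.24xlarge": 9.83}

for instance in on_demand:
    saving = 1 - spot[instance] / on_demand[instance]
    print(f"{instance}: {saving:.0%} savings")
# Every row works out to ~70% savings versus on-demand.
```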
Deployment Steps¶
1. Configure Secrets¶
Run `./scripts/setup-ssm-params.sh` to store secrets (W&B API key, S3 paths) in SSM Parameter Store.
2. Deploy Infrastructure¶
cd infrastructure/aws
pip install -r requirements.txt
# First time only
cdk bootstrap
# Deploy stack
cdk deploy
3. Deploy Training¶
Run the generated `./scripts/deploy.sh` to launch the training job.
4. Monitor¶
Use `./scripts/monitor.sh` for job status and `./scripts/logs.sh` for CloudWatch logs.
Autonomous Setup (Zero Questions)¶
Nova AI can deploy with 0-2 questions using smart defaults.
from pathlib import Path
from src.orchestrator.aws import autonomous_setup
result = autonomous_setup(
analysis=repo_analysis,
project_root=Path("."),
region="us-west-2",
interactive=False, # No questions
)
print(result.summary())
# ✅ Configuration generated (asked 0 questions)
# Project: my-ml-project
# Instance: p3.2xlarge
# Cost: $3.06/hr
Auto-detected:
- ML framework from imports
- GPU requirements from dependencies
- Entrypoint from common patterns
- W&B/MLflow from installed packages
May ask (0-2 questions):
- W&B API key (if W&B detected but no key)
- Budget confirmation (if instance > $5/hr)
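The question policy above can be sketched as a small decision function. This is an illustrative assumption about the logic, not the actual `autonomous_setup` internals; the `$5/hr` threshold comes from the list above:

```python
# Hypothetical sketch of the autonomous question policy: ask only when
# a secret is missing or the instance exceeds the budget threshold.
def questions_needed(wandb_detected: bool, wandb_key_set: bool,
                     hourly_cost: float, budget_threshold: float = 5.0) -> list:
    questions = []
    if wandb_detected and not wandb_key_set:
        questions.append("W&B API key")
    if hourly_cost > budget_threshold:
        questions.append(f"Confirm budget (${hourly_cost:.2f}/hr)")
    return questions

# p3.2xlarge with W&B key already stored: zero questions
print(questions_needed(True, True, 3.06))    # []
# p4d.24xlarge, no key stored: both questions
print(questions_needed(True, False, 32.77))
# ['W&B API key', 'Confirm budget ($32.77/hr)']
```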
Checkpointing¶
Multi-tier checkpointing for spot instance resilience:
| Tier | Interval | Location | Use Case |
|---|---|---|---|
| Fast | 5 min | Local SSD | Quick recovery |
| Mid | 30 min | EBS | Persistent storage |
| Durable | 1 hour | S3 | Cross-instance |
| Emergency | On SIGTERM | S3 | Spot interruption |
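The tier schedule can be sketched as follows. This is a hypothetical illustration of the policy in the table, not the generated wrapper's actual code; the one fixed fact it relies on is that AWS delivers SIGTERM roughly two minutes before reclaiming a spot instance:

```python
import signal

# Hypothetical sketch of the multi-tier schedule in the table above.
TIERS = {"fast": 5, "mid": 30, "durable": 60}  # checkpoint interval, minutes

def tiers_due(elapsed_min: int) -> list:
    """Tiers whose interval divides the elapsed wall-clock minutes."""
    return [name for name, interval in TIERS.items()
            if elapsed_min > 0 and elapsed_min % interval == 0]

def install_emergency_handler(save_to_s3):
    """AWS sends SIGTERM ~2 minutes before spot reclaim; flush to S3 then."""
    signal.signal(signal.SIGTERM, lambda signum, frame: save_to_s3())

print(tiers_due(30))   # ['fast', 'mid'] -- 30 is a multiple of both 5 and 30
print(tiers_due(60))   # ['fast', 'mid', 'durable']
```

Lower tiers always fire alongside higher ones, so the local SSD copy is never older than the S3 copy.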
Experiment Tracking¶
Weights & Biases¶
Automatically configured with:
- Git commit, branch, dirty status
- S3 paths (data, checkpoints, outputs)
- Compute info (GPU, CUDA, instance type)
- Hyperparameters from config
# train_wrapper.py automatically logs:
wandb.init(
config={
"git_commit": get_git_commit(),
"instance_type": get_instance_type(),
"s3_checkpoint_path": os.environ["S3_CHECKPOINT_PATH"],
# ... your hyperparameters
}
)
SSM Parameter Store¶
Secrets stored encrypted:
/project-name/wandb/api-key (SecureString)
/project-name/s3/checkpoint-path (String)
/project-name/training/learning-rate (String)
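Reading these back at runtime uses the standard SSM API; `WithDecryption=True` is required for SecureString values. The `ssm_param_name` helper below is hypothetical, shown only to mirror the path layout above:

```python
# Hypothetical helper mirroring the parameter layout above.
def ssm_param_name(project: str, *parts: str) -> str:
    """Build a hierarchical SSM parameter name, e.g. /proj/wandb/api-key."""
    return "/" + "/".join([project, *parts])

def fetch_secret(project: str, *parts: str) -> str:
    """Fetch and decrypt a SecureString parameter (needs AWS credentials)."""
    import boto3
    ssm = boto3.client("ssm")
    resp = ssm.get_parameter(Name=ssm_param_name(project, *parts),
                             WithDecryption=True)
    return resp["Parameter"]["Value"]

print(ssm_param_name("my-ml-project", "wandb", "api-key"))
# /my-ml-project/wandb/api-key
```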
Local Testing¶
Test before deploying to AWS:
cd aws-deployment
# Build environment
docker build -f Dockerfile.env -t my-project-env .
# Test with GPU (requires nvidia-docker)
docker-compose up
Troubleshooting¶
Common Issues¶
| Issue | Solution |
|---|---|
| "No GPU detected" | Install nvidia-docker |
| "CDK bootstrap failed" | Check credentials: aws sts get-caller-identity |
| "Spot interrupted" | Normal! Emergency checkpoint saved. Job retries. |
| "S3 access denied" | Verify bucket permissions in IAM policy |
Debug Commands¶
# Check AWS credentials
aws sts get-caller-identity
# Check EC2 instances
aws ec2 describe-instances --filters "Name=tag:Project,Values=my-project"
# Check CloudWatch logs
aws logs tail /aws/ml/my-project/training
# Check Batch jobs
aws batch describe-jobs --jobs <job-id>
Cost Estimation¶
p3.2xlarge spot (typical ML training):
| Usage | Monthly Cost |
|---|---|
| 8 hr/day | ~$221 |
| 24/7 | ~$664 |
| + Storage | +$50-100 |
Total: ~$270-320/month for 8 hr/day training (compute plus storage)
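The monthly figures follow from the $0.92/hr spot rate, assuming a 30-day month:

```python
# Reproduce the monthly estimate for p3.2xlarge spot at $0.92/hr.
SPOT_HOURLY = 0.92

def monthly_cost(hours_per_day: float, days: int = 30) -> float:
    return SPOT_HOURLY * hours_per_day * days

print(round(monthly_cost(8)))    # 221 -> ~$221/month
print(round(monthly_cost(24)))   # 662 -> ~$664 in the table, depending on month length
```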
What's Next?¶
- Headless Mode - Automated CI/CD deployment
- Commands - All available commands
- Configuration - Environment and settings
Resources¶
- AWS Labs MCP Servers - Official AWS MCP servers
- AWS MCP Documentation - Installation and configuration
- AWS CDK Best Practices - Infrastructure patterns