DevOps AI Assistant with RAG

A production-ready AI assistant that runs local LLMs through Ollama and uses Retrieval-Augmented Generation (RAG) over DevOps documentation. Query Kubernetes, Terraform, Docker, Ansible, and AWS documentation in natural language.

Category: AI/ML + DevOps
Stack: Ollama, FastAPI, React
GPU: NVIDIA RTX 3090 24GB
Models: Llama 3.1, CodeLlama
Tags: Ollama, Llama 3.1, Qdrant, FastAPI, React, Redis, Docker, Python, TypeScript

Project Overview

This AI assistant combines local Large Language Models with Retrieval-Augmented Generation (RAG) to give accurate, context-aware answers to DevOps questions. Unlike cloud-based AI services, it runs entirely on local hardware, which keeps queries on your own machine and avoids API costs.

Key Innovation:

Traditional LLMs can hallucinate or provide outdated information. With RAG, the system first retrieves relevant passages from a vector database and then generates its response from them, so answers are grounded in actual DevOps documentation.
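The retrieve-then-generate idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual rag.py: the function name build_prompt, the prompt wording, and the chunk shape (a dict with "source" and "text" keys) are all assumptions made for the example.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble a grounded prompt from retrieved documentation chunks.

    Each chunk is assumed to be a dict with 'source' and 'text' keys
    (hypothetical shape, chosen for this sketch).
    """
    if not chunks:
        # Make the missing context explicit instead of silently omitting it.
        context = "(no relevant documentation found)"
    else:
        context = "\n\n".join(
            f"[{i}] {c['source']}\n{c['text']}" for i, c in enumerate(chunks, start=1)
        )
    return (
        "Answer the question using only the documentation excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunk, used only to show the resulting prompt layout.
example_chunks = [
    {"source": "kubernetes/deployments.md", "text": "Set spec.replicas to the desired Pod count."},
]
prompt = build_prompt("How do I run 3 replicas?", example_chunks)
```

Numbering the excerpts makes it possible for the model to cite which passage an answer came from, which is one common way to make grounding visible to the user.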

Why This Project Matters:

  • Demonstrates understanding of modern AI/ML infrastructure
  • Shows ability to integrate cutting-edge technology with DevOps workflows
  • Privacy-focused solution for sensitive infrastructure queries
  • Cost-effective alternative to cloud AI services
  • Sits at the intersection of AI and DevOps

System Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Web UI    │────▶│   FastAPI    │────▶│   Ollama    │
│   (React)   │     │   Backend    │     │    (LLM)    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ├──────────────▶┌─────────────┐
                           │               │   Qdrant    │
                           │               │  (Vectors)  │
                           │               └─────────────┘
                           │
                           └──────────────▶┌─────────────┐
                                           │    Redis    │
                                           │  (Memory)   │
                                           └─────────────┘

Component Breakdown:

  • Ollama: Runs LLMs locally (Llama 3.1, CodeLlama, Mistral) with GPU acceleration
  • Qdrant: Vector database for semantic search across documentation
  • FastAPI: High-performance backend with REST API
  • React Frontend: Clean, responsive chat interface
  • Redis: Conversation memory and caching
  • Document Ingestion: Automated pipeline to index DevOps docs

Core Features

Local LLM Inference

Run powerful AI models completely offline using Ollama with GPU acceleration once the models have been pulled. No API keys, no cloud dependencies, complete data privacy.

RAG Pipeline

Retrieval-Augmented Generation ensures responses are grounded in actual documentation, reducing hallucinations and improving accuracy.

DevOps Documentation

Pre-configured to ingest Kubernetes, Terraform, Docker, Ansible, AWS, Azure, GitLab CI/CD, and Prometheus/Grafana documentation.

REST API

FastAPI backend enables integration with other tools, automation scripts, and CI/CD pipelines.

Conversation Memory

Redis-backed chat history maintains context across multiple queries for more natural conversations.
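A bounded chat history of this kind could look something like the sketch below. It is an assumption-laden illustration rather than the project's code: the key scheme chat:<session_id>, the window of 10 turns, and the helper names are hypothetical. The store only needs the Redis list commands RPUSH, LTRIM, and LRANGE, so a small in-memory stand-in is used here to keep the example runnable without a Redis server; a redis-py client exposes methods with the same names.

```python
import json
from typing import Dict, List

MAX_TURNS = 10  # assumed window size, not taken from the project


class InMemoryListStore:
    """Tiny stand-in exposing the three Redis list commands used below."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    @staticmethod
    def _bounds(n: int, start: int, end: int):
        # Redis list ranges are inclusive; negative indices count from the tail.
        lo = start if start >= 0 else max(n + start, 0)
        hi = end if end >= 0 else n + end
        return lo, hi + 1

    def rpush(self, key: str, value: str) -> int:
        self._data.setdefault(key, []).append(value)
        return len(self._data[key])

    def ltrim(self, key: str, start: int, end: int) -> None:
        items = self._data.get(key, [])
        lo, hi = self._bounds(len(items), start, end)
        self._data[key] = items[lo:hi]

    def lrange(self, key: str, start: int, end: int) -> List[str]:
        items = self._data.get(key, [])
        lo, hi = self._bounds(len(items), start, end)
        return items[lo:hi]


def append_message(store, session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"  # hypothetical key scheme
    store.rpush(key, json.dumps({"role": role, "content": content}))
    # Keep only the newest MAX_TURNS * 2 messages (user + assistant per turn).
    store.ltrim(key, -MAX_TURNS * 2, -1)


def load_history(store, session_id: str) -> List[Dict[str, str]]:
    return [json.loads(raw) for raw in store.lrange(f"chat:{session_id}", 0, -1)]


store = InMemoryListStore()
for i in range(30):
    append_message(store, "demo", "user", f"question {i}")
history = load_history(store, "demo")
```

Trimming on every write keeps the history from growing without bound, which also caps how much prior conversation is sent to the model with each query.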

GPU Optimized

Uses an NVIDIA GPU for fast inference. Tested on an RTX 3090 (24GB VRAM) with models ranging from 7B to 33B parameters.

Hardware & Setup

Development Hardware

  • CPU: AMD Ryzen 9 9950X (16 cores)
  • GPU: NVIDIA RTX 3090 24GB VRAM
  • RAM: 128GB DDR5
  • Storage: 1TB NVMe SSD
  • OS: Ubuntu 24.04 LTS

Quick Start:

    # Clone repository
    git clone https://github.com/jconover/ai-rag-stack.git
    cd ai-rag-stack

    # Verify system requirements
    bash scripts/verify_setup.sh

    # Initial setup
    make setup

    # Start all services
    make start

    # Pull LLM model
    docker exec ollama ollama pull llama3.1:8b

    # Ingest documentation
    python scripts/ingest_docs.py

    # Access UI (use xdg-open on Linux if open is unavailable)
    open http://localhost:3000

Supported Models (VRAM figures are approximate and vary with quantization and context length):

  • llama3.1:8b - Best general purpose (8GB VRAM)
  • codellama:13b - Better for code generation (13GB VRAM)
  • mistral:7b - Fast and efficient (7GB VRAM)
  • deepseek-coder:33b - Excellent for code (20GB+ VRAM)
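As a rough sanity check on these figures, the memory needed for model weights alone can be estimated as parameter count times bytes per weight. The sketch below is a back-of-the-envelope calculation under stated assumptions: it uses the nominal parameter counts from the model tags, and it ignores the KV cache, activations, and runtime overhead, all of which add to real VRAM usage. The function name weights_gb is hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal sizes taken from the model tags in the list above.
models = [
    ("llama3.1:8b", 8),
    ("codellama:13b", 13),
    ("mistral:7b", 7),
    ("deepseek-coder:33b", 33),
]

for name, size_b in models:
    print(
        f"{name:20s} 8-bit: {weights_gb(size_b, 8):5.1f} GB   "
        f"4-bit: {weights_gb(size_b, 4):5.1f} GB"
    )
```

Under this estimate, the first three figures in the list line up with about one byte per parameter, while a 33B model only fits in 24GB with a lower-bit quantization, which is consistent with the "20GB+" note.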

Documentation Ingestion

The system automatically indexes comprehensive DevOps documentation:

Pre-configured Documentation Sources:

  • Kubernetes: Official K8s docs (concepts, reference, tutorials)
  • Terraform: HashiCorp Terraform documentation
  • Docker: Engine, Compose, and Swarm documentation
  • Ansible: Ansible docs and best practices
  • AWS: EC2, S3, Lambda, ECS, and more
  • Azure: Azure DevOps, AKS, Container Instances
  • GitLab CI/CD: Pipeline documentation
  • Prometheus/Grafana: Monitoring and observability
  • Custom Docs: Add your own markdown/text files

Vector Search Pipeline:

  • Documents are split into 1000-token chunks with a 200-token overlap
  • Each chunk is embedded using sentence transformers
  • Embeddings are stored in the Qdrant vector database for semantic search
  • The top 5 most relevant chunks are retrieved for each query
  • Retrieved chunks are passed to the LLM as context for response generation
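The chunking and top-5 retrieval steps above can be illustrated with a small, dependency-free sketch. Several things here are assumptions made for the example: tokens are plain list items rather than the output of a real tokenizer, similarity is cosine (a common choice, not confirmed for this project), and the brute-force top_k stands in for the search that Qdrant performs in the actual system. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def chunk_tokens(tokens: Sequence[str], size: int = 1000, overlap: int = 200) -> List[List[str]]:
    """Split a token sequence into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start : start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# A 2500-token document yields three overlapping chunks.
tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)

# Toy 2-dimensional "embeddings" to show the ranking step.
ranked = top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.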

API Integration

The FastAPI backend provides REST endpoints for integration with other tools:

Available Endpoints:

  • POST /api/chat - Send a message and get AI response
  • GET /api/models - List available Ollama models
  • POST /api/ingest - Ingest new documents
  • GET /api/health - Health check
  • GET /api/stats - Vector database statistics

Example Usage:

    # Query via API
    curl -X POST http://localhost:8000/api/chat \
      -H "Content-Type: application/json" \
      -d '{
        "message": "How do I create a Kubernetes deployment with 3 replicas?",
        "model": "llama3.1:8b"
      }'

    # Check available models
    curl http://localhost:8000/api/models

    # Get vector DB stats
    curl http://localhost:8000/api/stats
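For use from automation scripts, the same chat request can be made from Python with only the standard library. This is a sketch under assumptions: the endpoint, port, and JSON fields mirror the curl example, but the response schema is not documented here, so the decoded JSON is returned untouched. The helper names build_chat_request and ask, and the 120-second timeout, are hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # same host and port as the curl example


def build_chat_request(message: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build the POST /api/chat request with the same JSON body as the curl call."""
    body = json.dumps({"message": message, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response as-is.

    Requires the backend to be running; the response fields are not assumed.
    """
    with urllib.request.urlopen(build_chat_request(message, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network; call ask(...) to send it.
req = build_chat_request("How do I create a Kubernetes deployment with 3 replicas?")
```

Separating request construction from sending keeps the payload easy to inspect or log before anything is sent to the backend.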

Technical Implementation

Project Structure:

ai-rag-stack/
├── backend/                 # FastAPI application
│   ├── app/
│   │   ├── main.py          # API entry point
│   │   ├── rag.py           # RAG pipeline
│   │   ├── vectorstore.py   # Qdrant client
│   │   └── models.py        # Pydantic models
├── frontend/                # React web UI
│   ├── src/
│   │   ├── App.tsx
│   │   └── components/
├── scripts/
│   ├── ingest_docs.py       # Documentation ingestion
│   └── download_docs.sh     # Download docs
├── data/
│   ├── docs/                # Downloaded documentation
│   └── custom/              # Custom docs
└── docker-compose.yml

GPU Optimization:

  • Automatic GPU detection and utilization
  • Configurable GPU memory allocation
  • Support for multiple concurrent model loads
  • Optimized batch processing for document ingestion
  • Monitor GPU usage with nvidia-smi
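One way to monitor GPU usage programmatically is to call nvidia-smi with its CSV query mode and parse the result. The sketch below is an illustration, not part of the project: the helper names are hypothetical, and the sample line is made up to show the format rather than measured. It assumes every queried field returns a number; on GPUs that report a field as unavailable, the int conversion would need extra handling.

```python
import subprocess
from typing import Dict, List

QUERY_FIELDS = ["name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(output: str) -> List[Dict[str, object]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in output.strip().splitlines():
        name, used, total, util = [part.strip() for part in line.split(",")]
        gpus.append({
            "name": name,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus() -> List[Dict[str, object]]:
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative sample line, not a measurement from the project.
sample = "NVIDIA GeForce RTX 3090, 6144, 24576, 37\n"
gpus = parse_gpu_csv(sample)
```

Keeping the parser separate from the subprocess call means it can be exercised on machines without a GPU, for example in CI.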

Performance Tuning:

  • Adjustable chunk size and overlap for context optimization
  • Configurable Top K results for retrieval
  • Model-specific thread and GPU settings
  • Redis caching for frequently asked questions
  • Async processing for concurrent requests
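Caching frequently asked questions could work along the lines of the sketch below. It is a hypothetical illustration: the key scheme, the one-hour TTL, and the names cache_key, TTLCache, and answer are assumptions, and the in-memory class stands in for Redis (where a value would be written with an expiry and read back by key) so the example runs without a server.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple


def cache_key(question: str, model: str) -> str:
    """Normalize case and whitespace so trivially different phrasings share a key."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def answer(question: str, model: str, cache: TTLCache, generate: Callable[[str], str]) -> str:
    """Return a cached answer when available, otherwise call `generate` and store it."""
    key = cache_key(question, model)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = generate(question)
    cache.set(key, result)
    return result


calls: List[str] = []


def fake_generate(q: str) -> str:
    """Stand-in for the LLM call; records how often it is invoked."""
    calls.append(q)
    return f"answer to: {q}"


cache = TTLCache(ttl_seconds=3600)
first = answer("How do I scale a Deployment?", "llama3.1:8b", cache, fake_generate)
second = answer("  how do I scale a deployment? ", "llama3.1:8b", cache, fake_generate)
```

Text normalization only catches near-exact repeats. Including the model name in the key avoids serving one model's answer for a request that asked for another.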

Skills Demonstrated

  • 5 Services
  • Local, Privacy-First
  • GPU Accelerated
  • RAG Pipeline

Advanced DevOps + AI Skills:

  • AI/ML Infrastructure: Running and optimizing local LLMs with GPU acceleration
  • RAG Implementation: Building retrieval-augmented generation pipelines
  • Vector Databases: Working with Qdrant for semantic search
  • API Development: FastAPI backend with async processing
  • Frontend Development: React with TypeScript for user interfaces
  • Document Processing: Automated ingestion and embedding pipelines
  • Docker Orchestration: Multi-container application with docker-compose
  • GPU Computing: CUDA optimization and resource management
  • System Integration: Combining multiple technologies into a cohesive solution

Real-World Applications

Use Cases for DevOps Teams:

  • Knowledge Base: Instant answers to Kubernetes, Terraform, and AWS questions
  • Onboarding: Help new team members learn your infrastructure
  • Troubleshooting: Quick reference for error messages and solutions
  • Code Generation: Generate Terraform modules, Kubernetes manifests, and scripts
  • Documentation: Query internal documentation alongside public docs
  • Compliance: Privacy-focused alternative to cloud AI for sensitive environments

Example Queries:

  • "How do I create a Kubernetes deployment with 3 replicas?"
  • "Show me Terraform code for an AWS VPC with public and private subnets"
  • "What's the difference between Docker Compose and Kubernetes?"
  • "How do I configure Prometheus to scrape custom metrics?"
  • "Explain Ansible roles and how to structure a playbook"

Future Enhancements

  • Add support for fine-tuning on custom infrastructure documentation
  • Implement multi-modal support for diagrams and screenshots
  • Add integration with GitHub for code review assistance
  • Implement CI/CD pipeline documentation generation
  • Add Slack/Teams integration for team-wide access
  • Add support for private documentation repositories
  • Implement feedback loop for continuous improvement
  • Add cost analysis and optimization suggestions