DevOps AI Assistant with RAG - DevOps Portfolio

Project Overview

This AI assistant combines locally hosted Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to provide accurate, context-aware answers to DevOps questions. Unlike cloud-based AI services, it runs entirely on local hardware, so queries stay on the machine and there are no API costs.

Key Innovation:

Traditional LLMs can hallucinate or provide outdated information. By implementing RAG, the system retrieves relevant documentation from a vector database before generating responses, ensuring answers are grounded in actual DevOps documentation.

Why This Project Matters:

Demonstrates understanding of modern AI/ML infrastructure
Shows ability to integrate cutting-edge technology with DevOps workflows
Privacy-focused solution for sensitive infrastructure queries
Cost-effective alternative to cloud AI services
Sits at the intersection of AI and DevOps

System Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Web UI    │────▶│   FastAPI    │────▶│   Ollama    │
│   (React)   │     │   Backend    │     │    (LLM)    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ├──────────────▶┌─────────────┐
                           │               │   Qdrant    │
                           │               │  (Vectors)  │
                           │               └─────────────┘
                           │
                           └──────────────▶┌─────────────┐
                                           │    Redis    │
                                           │  (Memory)   │
                                           └─────────────┘

Component Breakdown:

Ollama: Runs LLMs locally (Llama 3.1, CodeLlama, Mistral) with GPU acceleration
Qdrant: Vector database for semantic search across documentation
FastAPI: High-performance backend with REST API
React Frontend: Clean, responsive chat interface
Redis: Conversation memory and caching
Document Ingestion: Automated pipeline to index DevOps docs

Core Features

Local LLM Inference

Run powerful AI models completely offline using Ollama with GPU acceleration. No API keys, no cloud dependencies, complete data privacy.

RAG Pipeline

Retrieval-Augmented Generation ensures responses are grounded in actual documentation, reducing hallucinations and improving accuracy.
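
To make the grounding step concrete, here is a minimal sketch of how retrieved chunks might be combined with the user's question before the prompt is sent to the model. The function name build_rag_prompt and the instruction wording are assumptions for illustration; the repository's actual prompt template is not shown here.

```python
from typing import List


def build_rag_prompt(question: str, chunks: List[str]) -> str:
    """Combine retrieved documentation chunks and a question into one prompt.

    Hypothetical helper: the real project may format its prompt differently.
    """
    # Number each chunk so the model (and the reader) can tell sources apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

Telling the model to admit when the context is insufficient is one common way to discourage answers that go beyond the retrieved documentation.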

DevOps Documentation

Pre-configured to ingest Kubernetes, Terraform, Docker, Ansible, AWS, Azure, and GitLab CI/CD documentation.

REST API

FastAPI backend enables integration with other tools, automation scripts, and CI/CD pipelines.

Conversation Memory

Redis-backed chat history maintains context across multiple queries for more natural conversations.
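
The idea behind bounded chat history can be sketched with an in-memory stand-in. This is not the project's Redis code: the class name ConversationMemory and the max_turns limit are assumptions, and a Python deque replaces the Redis list that a real deployment would use (for example, one list per session trimmed with RPUSH and LTRIM).

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


class ConversationMemory:
    """In-memory stand-in for a per-session, size-limited chat history."""

    def __init__(self, max_turns: int = 10) -> None:
        # One turn is assumed to be a user message plus an assistant reply.
        self._max_messages = max_turns * 2
        self._sessions: Dict[str, Deque[Message]] = defaultdict(
            lambda: deque(maxlen=self._max_messages)
        )

    def append(self, session_id: str, role: str, content: str) -> None:
        # The deque silently drops the oldest message once the limit is hit.
        self._sessions[session_id].append((role, content))

    def history(self, session_id: str) -> List[Message]:
        # .get avoids creating an empty entry for unknown sessions.
        return list(self._sessions.get(session_id, ()))
```

Capping the history keeps the prompt within the model's context window while still carrying recent context across queries.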

GPU Optimized

Uses an NVIDIA GPU for fast inference. Tested on an RTX 3090 and supports a range of model sizes.

Hardware & Setup

Development Hardware

CPU: AMD Ryzen 9 9950X (16 cores)
GPU: NVIDIA RTX 3090 24GB VRAM
RAM: 128GB DDR5
Storage: 1TB NVMe SSD
OS: Ubuntu 24.04 LTS

Quick Start:

# Clone repository
git clone https://github.com/jconover/ai-rag-stack.git
cd ai-rag-stack

# Verify system requirements
bash scripts/verify_setup.sh

# Initial setup
make setup

# Start all services
make start

# Pull LLM model
docker exec ollama ollama pull llama3.1:8b

# Ingest documentation
python scripts/ingest_docs.py

# Access UI
open http://localhost:3000

Supported Models:

llama3.1:8b - Best general purpose (8GB VRAM)
codellama:13b - Better for code generation (13GB VRAM)
mistral:7b - Fast and efficient (7GB VRAM)
deepseek-coder:33b - Excellent for code (20GB+ VRAM)
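
As a rough aid for choosing a model, the VRAM figures listed above can be turned into a small lookup. This is a sketch only: the numbers are copied from the list (with "20GB+" treated as 20), and the models_that_fit helper and its headroom_gb parameter are hypothetical. Actual memory use depends on quantization and context length.

```python
from typing import Dict, List

# Approximate VRAM needs in GB, taken from the Supported Models list.
MODEL_VRAM_GB: Dict[str, int] = {
    "llama3.1:8b": 8,
    "codellama:13b": 13,
    "mistral:7b": 7,
    "deepseek-coder:33b": 20,
}


def models_that_fit(vram_gb: float, headroom_gb: float = 2.0) -> List[str]:
    """Return models whose listed VRAM need fits within the available budget.

    headroom_gb is an assumed safety margin for other GPU consumers.
    """
    budget = vram_gb - headroom_gb
    return [name for name, need in MODEL_VRAM_GB.items() if need <= budget]
```

With the 24GB RTX 3090 listed above and the default 2GB margin, every model in the list fits.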

Documentation Ingestion

The system automatically indexes comprehensive DevOps documentation:

Pre-configured Documentation Sources:

Kubernetes: Official K8s docs (concepts, reference, tutorials)
Terraform: HashiCorp Terraform documentation
Docker: Engine, Compose, and Swarm documentation
Ansible: Ansible docs and best practices
AWS: EC2, S3, Lambda, ECS, and more
Azure: Azure DevOps, AKS, Container Instances
GitLab CI/CD: Pipeline documentation
Prometheus/Grafana: Monitoring and observability
Custom Docs: Add your own markdown/text files

Vector Search Pipeline:

Documents chunked into 1000-token segments with 200-token overlap
Embedded using sentence transformers
Stored in Qdrant vector database for semantic search
Top 5 most relevant chunks retrieved for each query
Context provided to LLM for accurate response generation
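
The pipeline steps above can be sketched in plain Python. This is an illustration only: tokens are approximated by whatever sequence the caller passes in (for example, a whitespace split), embeddings are plain lists of floats, and the similarity search that Qdrant performs in the real stack is replaced by a brute-force cosine ranking. The function names chunk_tokens and top_k are hypothetical and do not come from the repository.

```python
import math
from typing import List, Sequence, Tuple


def chunk_tokens(
    tokens: Sequence[str], size: int = 1000, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]], k: int = 5
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With the default settings each window advances by 800 tokens, so consecutive chunks share 200 tokens and a sentence that straddles a boundary still appears whole in at least one chunk.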

API Integration

The FastAPI backend provides REST endpoints for integration with other tools:

Available Endpoints:

POST /api/chat - Send a message and get AI response
GET /api/models - List available Ollama models
POST /api/ingest - Ingest new documents
GET /api/health - Health check
GET /api/stats - Vector database statistics

Example Usage:

# Query via API
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How do I create a Kubernetes deployment with 3 replicas?",
    "model": "llama3.1:8b"
  }'

# Check available models
curl http://localhost:8000/api/models

# Get vector DB stats
curl http://localhost:8000/api/stats
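
For automation scripts, the same chat request can be built from Python using only the standard library. This is a sketch under assumptions: the helper names build_chat_request and send are hypothetical, the payload mirrors the curl example above, and the shape of the JSON response is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request
from typing import Any

API_BASE = "http://localhost:8000"  # same host and port as the curl examples


def build_chat_request(
    message: str, model: str = "llama3.1:8b", base_url: str = API_BASE
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/chat."""
    payload = json.dumps({"message": message, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> Any:
    """Send the request and return the decoded JSON body.

    Requires the stack to be running; the timeout value is an assumption
    chosen to allow for slow local inference.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or unit test without a running backend.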

Technical Implementation

Project Structure:

ai-rag-stack/
├── backend/              # FastAPI application
│   ├── app/
│   │   ├── main.py      # API entry point
│   │   ├── rag.py       # RAG pipeline
│   │   ├── vectorstore.py  # Qdrant client
│   │   └── models.py    # Pydantic models
├── frontend/            # React web UI
│   ├── src/
│   │   ├── App.tsx
│   │   └── components/
├── scripts/
│   ├── ingest_docs.py  # Documentation ingestion
│   └── download_docs.sh  # Download docs
├── data/
│   ├── docs/           # Downloaded documentation
│   └── custom/         # Custom docs
└── docker-compose.yml

GPU Optimization:

Automatic GPU detection and utilization
Configurable GPU memory allocation
Support for multiple concurrent model loads
Optimized batch processing for document ingestion
Monitor GPU usage with nvidia-smi
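
GPU monitoring can also be scripted. The sketch below parses the CSV output of nvidia-smi's query mode; the GpuStat type and function names are hypothetical, and the parser assumes every field is numeric (some GPUs report "[N/A]" for certain fields, which this sketch does not handle).

```python
import subprocess
from typing import List, NamedTuple

# Query mode prints one comma-separated line per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


class GpuStat(NamedTuple):
    name: str
    memory_used_mib: int
    memory_total_mib: int
    utilization_pct: int


def parse_gpu_stats(csv_text: str) -> List[GpuStat]:
    """Parse nvidia-smi CSV output into one GpuStat per GPU."""
    stats: List[GpuStat] = []
    for line in csv_text.strip().splitlines():
        name, used, total, util = [field.strip() for field in line.split(",")]
        stats.append(GpuStat(name, int(used), int(total), int(util)))
    return stats


def read_gpu_stats() -> List[GpuStat]:
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Polling read_gpu_stats while a model loads is one way to check how much VRAM a given model actually occupies on the target card.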

Performance Tuning:

Adjustable chunk size and overlap for context optimization
Configurable Top K results for retrieval
Model-specific thread and GPU settings
Redis caching for frequently asked questions
Async processing for concurrent requests
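
The tuning knobs above could be gathered into a single validated settings object, and cached answers could be keyed on a normalized form of the question. Both pieces below are assumptions for illustration: RagSettings, cache_key, and the "rag:answer:" key prefix are hypothetical names, and the defaults mirror the chunk size, overlap, and Top K values stated earlier.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RagSettings:
    """Retrieval settings with basic sanity checks."""

    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 5

    def __post_init__(self) -> None:
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")


def cache_key(question: str, model: str) -> str:
    """Build a cache key that ignores case and extra whitespace."""
    normalized = " ".join(question.lower().split())
    digest = hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()
    return f"rag:answer:{digest}"
```

Including the model name in the key keeps answers from different models separate, while normalization lets trivially different phrasings of the same question share one cache entry.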

Skills Demonstrated

5 Services | Local, Privacy-First | GPU Accelerated | RAG Pipeline

Advanced DevOps + AI Skills:

AI/ML Infrastructure: Running and optimizing local LLMs with GPU acceleration
RAG Implementation: Building retrieval-augmented generation pipelines
Vector Databases: Working with Qdrant for semantic search
API Development: FastAPI backend with async processing
Frontend Development: React TypeScript for user interfaces
Document Processing: Automated ingestion and embedding pipelines
Docker Orchestration: Multi-container application with docker-compose
GPU Computing: CUDA optimization and resource management
System Integration: Combining multiple technologies into a cohesive solution

Real-World Applications

Use Cases for DevOps Teams:

Knowledge Base: Instant answers to Kubernetes, Terraform, and AWS questions
Onboarding: Help new team members learn your infrastructure
Troubleshooting: Quick reference for error messages and solutions
Code Generation: Generate Terraform modules, Kubernetes manifests, and scripts
Documentation: Query internal documentation alongside public docs
Compliance: Privacy-focused alternative to cloud AI for sensitive environments

Example Queries:

"How do I create a Kubernetes deployment with 3 replicas?"
"Show me Terraform code for an AWS VPC with public and private subnets"
"What's the difference between Docker Compose and Kubernetes?"
"How do I configure Prometheus to scrape custom metrics?"
"Explain Ansible roles and how to structure a playbook"

Future Enhancements

Add support for fine-tuning on custom infrastructure documentation
Implement multi-modal support for diagrams and screenshots
Add integration with GitHub for code review assistance
Implement CI/CD pipeline documentation generation
Add Slack/Teams integration for team-wide access
Support private documentation repositories
Implement feedback loop for continuous improvement
Add cost analysis and optimization suggestions