Mitigate AI Platform
App Maintenance

Weekly Checks

Weekly maintenance checklist for application performance and reliability

This guide outlines critical maintenance tasks that should be performed weekly to ensure optimal application performance and reliability.

Overview

Frequency: Weekly

Priority: High - These checks catch most issues before they become critical

Pre-Check Requirements

  • Admin access to the application
  • Access to Kubernetes cluster (kubectl configured)
  • Access to LLM provider dashboard (OpenAI/Gemini/Anthropic)
  • Access to Sentry dashboard
  • (Optional) Access to Langfuse dashboard

1. Error Monitoring (Sentry)

Priority: Critical

Steps

  1. Access Sentry Dashboard

    • Get the Sentry URL from the SENTRY_DSN environment variable (see the command after this list)
    • Or go directly to your Sentry project dashboard
  2. Review Error Trends

    • Check for new error types in the last 7 days
    • Review error frequency trends (increasing/decreasing)
    • Identify errors affecting multiple users
  3. Priority Assessment

    • Critical: Errors preventing core functionality (chat, login, document access)
    • High: Errors affecting >10 users
    • Medium: Sporadic errors with workarounds
    • Low: Single-occurrence errors
  4. Action Items

    • Note critical/high priority errors for investigation
    • Apply immediate fixes if possible
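
If you need to locate the Sentry project quickly (step 1 above), the DSN can be read directly from the running application pod's environment. This is a sketch: pod and namespace names are placeholders, and it assumes printenv is available in the container image.

# Read the Sentry DSN from the application pod's environment
kubectl exec <app-pod-name> -n <namespace> -- printenv SENTRY_DSN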

What to Look For

  • 500 Internal Server Errors
  • Database connection errors
  • LLM API failures
  • Authentication failures
  • Background job errors

Expected State

  • Error rate stable or decreasing
  • No critical errors affecting core features
  • Known issues properly tracked

2. System Resources Monitoring

Priority: Critical

Kubernetes Native Monitoring

Check Disk Space

# Check disk usage from within application pod
kubectl exec <app-pod-name> -n <namespace> -- df -h

Thresholds:

  • ⚠️ Warning: >70% usage
  • 🚨 Critical: >80% usage
  • 🔥 Emergency: >90% usage

Actions if high:

  • Consider volume expansion
  • Consider cleaning up application documents
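
Before expanding a volume, it can help to confirm which PersistentVolumeClaim backs the pod and how much capacity it currently requests. Claim and namespace names below are placeholders.

# List PVCs in the namespace with their requested capacity and status
kubectl get pvc -n <namespace>

# Inspect a specific claim before deciding on expansion
kubectl describe pvc <pvc-name> -n <namespace>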

Check Memory (RAM)

# Check pod memory usage
kubectl top pods -A

# Detailed pod resource usage
kubectl describe pod -n <namespace> <pod-name>

Thresholds:

  • ⚠️ Warning: >70% of requested memory
  • 🚨 Critical: >85% of requested memory
  • Check for memory leaks if usage steadily increases

Actions if high:

  • Check for memory leaks in logs
  • Review large data processing jobs
  • Consider scaling horizontally (more pods)
  • Adjust resource limits if needed
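
To find the heaviest consumers quickly, kubectl top can sort by memory, and a pod's last state shows whether it was OOM-killed. Pod and namespace names are placeholders.

# Sort pods by current memory consumption
kubectl top pods -A --sort-by=memory

# Look for an OOMKilled last state on a suspect pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"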

Check CPU Usage

# Check pod CPU usage
kubectl top pods -A

# Check node CPU usage
kubectl top nodes

Thresholds:

  • ⚠️ Warning: Sustained >70% usage
  • 🚨 Critical: Sustained >85% usage
  • Brief spikes to 100% are normal during processing

Actions if high:

  • Identify resource-intensive processes
  • Review background job queues
  • Consider horizontal pod autoscaling
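
The same sort works for CPU, and a Horizontal Pod Autoscaler can be added if sustained load is expected. The deployment name and limits below are illustrative.

# Sort pods by current CPU consumption
kubectl top pods -A --sort-by=cpu

# Example: autoscale a deployment between 1 and 5 replicas targeting 70% CPU
kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=70 --min=1 --max=5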

3. Document Health

Priority: High

Admin Dashboard Review

  1. Access Main Dashboard

    • Navigate to: [APP_HOST]/admin/dashboards
    • Login with admin credentials
  2. Review Core Metrics

    Check the following metrics and trends:

    Documents Status:

    • Total Documents count
    • Total Chunks count
    • Unvectorized Documents percentage
    • Recent Documents (last 7 days)

    System Health Indicators:

    • Unvectorized Documents rate
    • Crawler Error Rate (last 7 days)
    • Empty Sources status
    • Uncrawled sources count

    User Engagement:

    • Daily Active Users (last 24h)
    • Monthly Active Users (last 30 days)
    • Average Messages per Chat
  3. Document Vectorization Check

    Navigate to: [APP_HOST]/admin/documents

    • Review documents list
    • Check "chunks / vectorized" status for each document
    • Identify documents with incomplete vectorization

    Expected State:

    • <10% of documents unvectorized
    • Vectorization completing within 24 hours of upload

    Actions if issues found:

    • Identify stuck vectorization jobs
    • Check background job queue health
    • Review LLM API errors
    • Consider manual rechunking via admin panel
  4. Crawler Health Check

    • Review crawler error rates
    • Check recent crawler runs status
    • Identify failing document sources

    Navigate to: [APP_HOST]/admin/document_sources

    • Review each source's last crawl status
    • Check for sources with repeated failures

Alert Thresholds

Metric                 Warning    Critical
Unvectorized Docs      >10%       >25%
Crawler Error Rate     >5%        >15%
Empty Sources          >0         >3
Failed Chunks          >50        >200
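
For a quick spot check outside the dashboard, the unvectorized percentage can also be computed from the application pod. This is only a sketch: it assumes a Rails app with Document and Chunk models, so adjust the model names to match the actual codebase.

# Hypothetical spot check: percentage of documents with no chunks (model names are assumptions)
kubectl exec <app-pod-name> -n <namespace> -- bin/rails runner \
  'puts "unvectorized: #{(Document.where.missing(:chunks).count * 100.0 / Document.count).round(1)}%"'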

4. LLM Provider Monitoring

Priority: Critical

Provider Dashboard Review

Steps to Review API Usage

  1. Access Billing Dashboard

    • Navigate to your LLM provider's billing/account section
  2. Review Key Metrics

    • API usage (last 7 days)
    • Cost trends
    • Rate limit status
    • Credit balance remaining
    • Any deprecation notices
  3. Alert Thresholds

    • ⚠️ Warning: Low credit balance (<1 week runway based on usage)
    • 🚨 Critical: Very low credit balance (<3 days runway)
    • 🚨 Critical: Rate limit errors detected
  4. Actions if Issues Found

    • Add credits if balance low
    • Review unusual cost spikes
    • Check for inefficient prompts
    • Upgrade API tier if rate limited
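
To turn the runway thresholds above into a concrete number, a rough estimate can be done from any shell. The spend and balance figures below are placeholders to replace with numbers from the billing dashboard.

# Rough runway estimate in days
WEEKLY_SPEND=120      # USD spent over the last 7 days (example value)
CREDIT_BALANCE=250    # USD remaining on the account (example value)
echo "Runway: $(echo "scale=1; 7 * $CREDIT_BALANCE / $WEEKLY_SPEND" | bc) days"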

Example with OpenAI:

  • Navigate to: https://platform.openai.com/usage
  • Review API usage, cost trends, and credit balance
  • Check for rate limit errors or deprecation notices
  • Set up billing alerts if needed

Secondary: Langfuse Monitoring

Access Langfuse dashboard (default: https://cloud.langfuse.com)

Review application-level metrics:

  • Generation counts by model
  • Average latency per model
  • Token usage trends
  • Error rates by operation
  • Cost per generation

Actions if Issues Found

High Costs:

  • Review unusual spike patterns
  • Check for inefficient prompts

Rate Limits:

  • Identify peak usage times
  • Upgrade API tier if needed

5. Background Jobs Monitoring

Priority: Critical

Access Mission Control

  1. Navigate to: [APP_HOST]/jobs

    • Note: Requires super_admin role
    • Login with super admin credentials if needed
  2. Review Job Queue Health

    • Check queue depths (should be low)
    • Review failed jobs count
    • Check average processing times
  3. Failed Jobs Analysis

    For each failed job:

    • Review error message
    • Check failure time
    • Identify job type (document processing, crawler, etc.)
    • Determine if retry is safe
  4. Key Job Types to Monitor

    Job Type                Expected Frequency    Max Duration
    Document Processing     Per upload            5-60 minutes
    Crawler Runs            Scheduled             10-120 minutes
    Embedding Generation    Per document          2-10 minutes

Alert Thresholds

  • ⚠️ Warning: >10 failed jobs
  • 🚨 Critical: >50 failed jobs or jobs stuck >2 hours
  • 🚨 Critical: Queue depth >1000 jobs

Actions if Issues Found

High Failed Job Count:

  • Review error patterns
  • Check LLM API connectivity
  • Verify database connectivity
  • Check for code deployment issues

Stuck Jobs:

  • Identify the operation being performed
  • Check associated logs
  • Consider manual job termination
  • May require application restart

High Queue Depth:

  • Check worker availability
  • Review job priorities
  • Consider scaling workers
  • Identify slow operations
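
For the worker and log checks above, the following commands can be used. The label selector is an assumption and depends on how the worker deployment is labelled in your cluster.

# Confirm worker pods are running (label selector is illustrative)
kubectl get pods -n <namespace> -l app=<worker-label>

# Tail recent worker logs to look for error patterns
kubectl logs -n <namespace> -l app=<worker-label> --tail=200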

6. Health Endpoint Check

Priority: Medium

Manual Check

# Simple check
curl [APP_HOST]/up

# Expected response: 200 OK

What it Checks

The /up endpoint verifies:

  • Application is running
  • Database connectivity
  • Basic Rails stack health

Expected Response

200 OK

Actions if Fails

  • 🚨 Critical: Immediate investigation required
  • Check application logs: kubectl logs -n <namespace> <pod-name>
  • Verify database connectivity
  • Check pod status: kubectl get pods -A
  • Review recent deployments
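
The rollout and event checks map to standard kubectl commands; names are placeholders.

# Review recent deployment rollouts
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Check recent cluster events for probe or scheduling failures
kubectl get events -n <namespace> --sort-by=.lastTimestamp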

Automation Recommendations

While this checklist is designed for manual review, consider automating alerts for:

  • Disk space >80%
  • Failed jobs >10
  • Health endpoint failures
  • Sentry critical error spikes
  • LLM credit balance low
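
A minimal sketch of such automation, assuming a cron-capable host with curl and kubectl access; every host, pod name, and threshold below is a placeholder to adapt.

#!/usr/bin/env bash
# Minimal weekly-check automation sketch. All names and thresholds are illustrative.

APP_HOST="https://app.example.com"   # replace with your deployment host
NAMESPACE="<namespace>"
POD="<app-pod-name>"

# Health endpoint must return 200
status=$(curl -s -o /dev/null -w "%{http_code}" "$APP_HOST/up")
[ "$status" = "200" ] || echo "ALERT: $APP_HOST/up returned HTTP $status"

# Root filesystem usage above 80% in the app pod (adjust the path if data lives on another mount)
usage=$(kubectl exec "$POD" -n "$NAMESPACE" -- df -h / | awk 'NR==2 {gsub("%",""); print $5}')
[ -n "$usage" ] && [ "$usage" -gt 80 ] && echo "ALERT: disk usage at ${usage}%"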
