Mitigate AI Platform
App Maintenance

Weekly Checks

Weekly maintenance checklist for application performance and reliability

This guide outlines critical maintenance tasks that should be performed weekly to ensure optimal application performance and reliability.

Overview

Frequency: Weekly

Priority: High - These checks catch most issues before they become critical

Pre-Check Requirements

  • Admin access to the application
  • Access to Kubernetes cluster (kubectl configured)
  • Access to LLM provider dashboard (OpenAI/Gemini/Anthropic)
  • Access to Sentry dashboard
  • (Optional) Access to Langfuse dashboard

1. Error Monitoring (Sentry)

Priority: Critical

Steps

  1. Access Sentry Dashboard

    • Get the Sentry URL from the SENTRY_DSN environment variable (see the command after this list)
    • Or go directly to your Sentry project dashboard
  2. Review Error Trends

    • Check for new error types in the last 7 days
    • Review error frequency trends (increasing/decreasing)
    • Identify errors affecting multiple users
  3. Priority Assessment

    • Critical: Errors preventing core functionality (chat, login, document access)
    • High: Errors affecting >10 users
    • Medium: Sporadic errors with workarounds
    • Low: Single-occurrence errors
  4. Action Items

    • Note critical/high priority errors for investigation
    • Apply immediate fixes if possible
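
If you need to locate the Sentry project quickly (step 1 above), the DSN can be read directly from the running application pod's environment. This is a sketch: pod and namespace names are placeholders, and it assumes printenv is available in the container image.

# Read the Sentry DSN from the application pod's environment
kubectl exec <app-pod-name> -n <namespace> -- printenv SENTRY_DSN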

What to Look For

  • 500 Internal Server Errors
  • Database connection errors
  • LLM API failures
  • Authentication failures
  • Background job errors

Expected State

  • Error rate stable or decreasing
  • No critical errors affecting core features
  • Known issues properly tracked

2. System Resources Monitoring

Priority: Critical

Kubernetes Native Monitoring

Check Disk Space

# Check disk usage from within application pod
kubectl exec <app-pod-name> -n <namespace> -- df -h

Thresholds:

  • ⚠️ Warning: >70% usage
  • 🚨 Critical: >80% usage
  • 🔥 Emergency: >90% usage

Actions if high:

  • Consider volume expansion
  • Consider cleaning up application documents
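
Before expanding a volume, it can help to confirm which PersistentVolumeClaim backs the pod and how much capacity it currently requests. Claim and namespace names below are placeholders.

# List PVCs in the namespace with their requested capacity and status
kubectl get pvc -n <namespace>

# Inspect a specific claim before deciding on expansion
kubectl describe pvc <pvc-name> -n <namespace>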

Check Memory (RAM)

# Check pod memory usage
kubectl top pods -A

# Detailed pod resource usage
kubectl describe pod -n <namespace> <pod-name>

Thresholds:

  • ⚠️ Warning: >70% of requested memory
  • 🚨 Critical: >85% of requested memory
  • Check for memory leaks if usage steadily increases

Actions if high:

  • Check for memory leaks in logs
  • Review large data processing jobs
  • Consider scaling horizontally (more pods)
  • Adjust resource limits if needed
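
To find the heaviest consumers quickly, kubectl top can sort by memory, and a pod's last state shows whether it was OOM-killed. Pod and namespace names are placeholders.

# Sort pods by current memory consumption
kubectl top pods -A --sort-by=memory

# Look for an OOMKilled last state on a suspect pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"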

Check CPU Usage

# Check pod CPU usage
kubectl top pods -A

# Check node CPU usage
kubectl top nodes

Thresholds:

  • ⚠️ Warning: Sustained >70% usage
  • 🚨 Critical: Sustained >85% usage
  • Brief spikes to 100% are normal during processing

Actions if high:

  • Identify resource-intensive processes
  • Review background job queues
  • Consider horizontal pod autoscaling
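
The same sort works for CPU, and a Horizontal Pod Autoscaler can be added if sustained load is expected. The deployment name and limits below are illustrative.

# Sort pods by current CPU consumption
kubectl top pods -A --sort-by=cpu

# Example: autoscale a deployment between 1 and 5 replicas targeting 70% CPU
kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=70 --min=1 --max=5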

3. Document Health

Priority: High

Admin Dashboard Review

  1. Access Main Dashboard

    • Navigate to: [APP_HOST]/admin/dashboards
    • Login with admin credentials
  2. Review Core Metrics

    Check the following metrics and trends:

    Documents Status:

    • Total Documents count
    • Total Chunks count
    • Unvectorized Documents percentage
    • Recent Documents (last 7 days)

    System Health Indicators:

    • Unvectorized Documents rate
    • Crawler Error Rate (last 7 days)
    • Empty Sources status
    • Uncrawled sources count

    User Engagement:

    • Daily Active Users (last 24h)
    • Monthly Active Users (last 30 days)
    • Average Messages per Chat
  3. Document Vectorization Check

    Navigate to: [APP_HOST]/admin/documents

    • Review documents list
    • Check "chunks / vectorized" status for each document
    • Identify documents with incomplete vectorization

    Expected State:

    • <10% of documents unvectorized
    • Vectorization completing within 24 hours of upload

    Actions if issues found:

    • Identify stuck vectorization jobs
    • Check background job queue health
    • Review LLM API errors
    • Consider manual rechunking via admin panel
  4. Crawler Health Check

    • Review crawler error rates
    • Check recent crawler runs status
    • Identify failing document sources

    Navigate to: [APP_HOST]/admin/document_sources

    • Review each source's last crawl status
    • Check for sources with repeated failures

Alert Thresholds

Metric                 Warning    Critical
Unvectorized Docs      >10%       >25%
Crawler Error Rate     >5%        >15%
Empty Sources          >0         >3
Failed Chunks          >50        >200
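
For a quick spot check outside the dashboard, the unvectorized percentage can also be computed from the application pod. This is only a sketch: it assumes a Rails app with Document and Chunk models, so adjust the model names to match the actual codebase.

# Hypothetical spot check: percentage of documents with no chunks (model names are assumptions)
kubectl exec <app-pod-name> -n <namespace> -- bin/rails runner \
  'puts "unvectorized: #{(Document.where.missing(:chunks).count * 100.0 / Document.count).round(1)}%"'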

4. LLM Provider Monitoring

Priority: Critical

Provider Dashboard Review

Steps to Review API Usage

  1. Access Billing Dashboard

    • Navigate to your LLM provider's billing/account section
  2. Review Key Metrics

    • API usage (last 7 days)
    • Cost trends
    • Rate limit status
    • Credit balance remaining
    • Any deprecation notices
  3. Alert Thresholds

    • ⚠️ Warning: Low credit balance (<1 week runway based on usage)
    • 🚨 Critical: Very low credit balance (<3 days runway)
    • 🚨 Critical: Rate limit errors detected
  4. Actions if Issues Found

    • Add credits if balance low
    • Review unusual cost spikes
    • Check for inefficient prompts
    • Upgrade API tier if rate limited
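
To turn the runway thresholds above into a concrete number, a rough estimate can be done from any shell. The spend and balance figures below are placeholders to replace with numbers from the billing dashboard.

# Rough runway estimate in days
WEEKLY_SPEND=120      # USD spent over the last 7 days (example value)
CREDIT_BALANCE=250    # USD remaining on the account (example value)
echo "Runway: $(echo "scale=1; 7 * $CREDIT_BALANCE / $WEEKLY_SPEND" | bc) days"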

Example with OpenAI:

  • Navigate to: https://platform.openai.com/usage
  • Review API usage, cost trends, and credit balance
  • Check for rate limit errors or deprecation notices
  • Set up billing alerts if needed

Secondary: Langfuse Monitoring

Access Langfuse dashboard (default: https://cloud.langfuse.com)

Review application-level metrics:

  • Generation counts by model
  • Average latency per model
  • Token usage trends
  • Error rates by operation
  • Cost per generation

Actions if Issues Found

High Costs:

  • Review unusual spike patterns
  • Check for inefficient prompts

Rate Limits:

  • Identify peak usage times
  • Upgrade API tier if needed

5. Background Jobs Monitoring

Priority: Critical

Access Mission Control

  1. Navigate to: [APP_HOST]/jobs

    • Note: Requires super_admin role
    • Login with super admin credentials if needed
  2. Review Job Queue Health

    • Check queue depths (should be low)
    • Review failed jobs count
    • Check average processing times
  3. Failed Jobs Analysis

    For each failed job:

    • Review error message
    • Check failure time
    • Identify job type (document processing, crawler, etc.)
    • Determine if retry is safe
  4. Key Job Types to Monitor

    Job Type                Expected Frequency    Max Duration
    Document Processing     Per upload            5-60 minutes
    Crawler Runs            Scheduled             10-120 minutes
    Embedding Generation    Per document          2-10 minutes

Alert Thresholds

  • ⚠️ Warning: >10 failed jobs
  • 🚨 Critical: >50 failed jobs or jobs stuck >2 hours
  • 🚨 Critical: Queue depth >1000 jobs

Actions if Issues Found

High Failed Job Count:

  • Review error patterns
  • Check LLM API connectivity
  • Verify database connectivity
  • Check for code deployment issues

Stuck Jobs:

  • Identify the operation being performed
  • Check associated logs
  • Consider manual job termination
  • May require application restart

High Queue Depth:

  • Check worker availability
  • Review job priorities
  • Consider scaling workers
  • Identify slow operations
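
For the worker and log checks above, the following commands can be used. The label selector is an assumption and depends on how the worker deployment is labelled in your cluster.

# Confirm worker pods are running (label selector is illustrative)
kubectl get pods -n <namespace> -l app=<worker-label>

# Tail recent worker logs to look for error patterns
kubectl logs -n <namespace> -l app=<worker-label> --tail=200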

6. Health Endpoint Check

Priority: Medium

Manual Check

# Simple check
curl [APP_HOST]/up

# Expected response: 200 OK

What it Checks

The /up endpoint verifies:

  • Application is running
  • Database connectivity
  • Basic Rails stack health

Expected Response

200 OK

Actions if Fails

  • 🚨 Critical: Immediate investigation required
  • Check application logs: kubectl logs -n <namespace> <pod-name>
  • Verify database connectivity
  • Check pod status: kubectl get pods -A
  • Review recent deployments
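
The rollout and event checks map to standard kubectl commands; names are placeholders.

# Review recent deployment rollouts
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Check recent cluster events for probe or scheduling failures
kubectl get events -n <namespace> --sort-by=.lastTimestamp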

Automation Recommendations

While this checklist is designed for manual review, consider automating alerts for:

  • Disk space >80%
  • Failed jobs >10
  • Health endpoint failures
  • Sentry critical error spikes
  • LLM credit balance low
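
A minimal sketch of such automation, assuming a cron-capable host with curl and kubectl access; every host, pod name, and threshold below is a placeholder to adapt.

#!/usr/bin/env bash
# Minimal weekly-check automation sketch. All names and thresholds are illustrative.

APP_HOST="https://app.example.com"   # replace with your deployment host
NAMESPACE="<namespace>"
POD="<app-pod-name>"

# Health endpoint must return 200
status=$(curl -s -o /dev/null -w "%{http_code}" "$APP_HOST/up")
[ "$status" = "200" ] || echo "ALERT: $APP_HOST/up returned HTTP $status"

# Root filesystem usage above 80% in the app pod (adjust the path if data lives on another mount)
usage=$(kubectl exec "$POD" -n "$NAMESPACE" -- df -h / | awk 'NR==2 {gsub("%",""); print $5}')
[ -n "$usage" ] && [ "$usage" -gt 80 ] && echo "ALERT: disk usage at ${usage}%"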
