Monitoring & Observability
Monitoring Architecture
Key Metrics
Business Metrics
System Metrics
metrics:
  system:
    - name: cpu_usage
      type: gauge
      threshold:
        warning: 70
        critical: 85
    - name: memory_usage
      type: gauge
      threshold:
        warning: 75
        critical: 90
    - name: disk_usage
      type: gauge
      threshold:
        warning: 80
        critical: 90
  application:
    - name: request_latency
      type: histogram
      threshold:
        warning: 500ms
        critical: 1s
    - name: error_rate
      type: counter
      threshold:
        warning: 1%
        critical: 5%
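The gauge thresholds above can be read as a three-way classification of each reading. The following is a minimal sketch of that reading, not the project's actual evaluation code. The `Threshold` class and `classify` function are hypothetical names, and treating a value equal to a threshold as having crossed it is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Gauge thresholds copied from the system metrics above (assumed to be percent).
SYSTEM_THRESHOLDS = {
    "cpu_usage": Threshold(warning=70, critical=85),
    "memory_usage": Threshold(warning=75, critical=90),
    "disk_usage": Threshold(warning=80, critical=90),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one gauge reading.

    Assumption: a reading equal to a threshold counts as crossing it.
    """
    threshold = SYSTEM_THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"
```

For example, `classify("cpu_usage", 72)` falls between the warning and critical levels and is reported as a warning.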
Dashboard Configuration
Main Dashboard
Grafana Dashboard JSON
{
  "dashboard": {
    "id": null,
    "title": "Oan Finance Overview",
    "tags": ["oan", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "System Health",
        "type": "gauge",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "system_health_score",
            "refId": "A"
          }
        ]
      },
      {
        "title": "API Response Time",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "http_request_duration_seconds",
            "refId": "B"
          }
        ]
      }
    ]
  }
}
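Before a dashboard definition like the one above is imported, a small structural check can catch empty panels or missing queries. The sketch below is illustrative only. `lint_dashboard` is a hypothetical helper, and the rules it applies (a non-empty title, at least one target per panel, an `expr` on every target, unique `refId` values within a panel) are assumptions, not an official Grafana schema.

```python
import json
from typing import List


def lint_dashboard(raw: str) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    data = json.loads(raw)
    dashboard = data.get("dashboard") if isinstance(data, dict) else None
    if not isinstance(dashboard, dict):
        return ["missing top-level 'dashboard' object"]

    problems = []
    if not dashboard.get("title"):
        problems.append("dashboard has no title")

    for index, panel in enumerate(dashboard.get("panels", [])):
        label = panel.get("title") or f"panel #{index}"
        targets = panel.get("targets") or []
        if not targets:
            problems.append(f"{label}: no targets")
        seen_ref_ids = set()
        for target in targets:
            if not target.get("expr"):
                problems.append(f"{label}: target without expr")
            ref_id = target.get("refId")
            if ref_id in seen_ref_ids:
                problems.append(f"{label}: duplicate refId {ref_id}")
            seen_ref_ids.add(ref_id)
    return problems
```

An empty result does not prove that the queries return data. It only shows that the definition has the basic shape assumed here.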
Alert Configuration
Alert Rules
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage detected
      - alert: HighMemoryUsage
        expr: memory_usage > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage detected
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: error_rate > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
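The `for:` field in each rule means the condition has to hold continuously for that long before the alert fires. A single evaluation below the threshold resets the wait. The sketch below models that behaviour for one rule. The `PendingAlert` class is a hypothetical illustration, not Prometheus code, and times are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Tracks one threshold rule across evaluations (times in seconds)."""

    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if value <= self.threshold:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

With `PendingAlert(threshold=80, for_seconds=300)`, a reading of 90 that persists for five minutes moves from pending to firing, while a dip to 70 in between starts the wait over.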
Log Management
Log Configuration
logging:
  level: info
  format: json
  retention:
    hot: 7d
    warm: 30d
    cold: 90d
  indices:
    - name: application
      pattern: oan-app-logs-*
      lifecycle:
        hot_duration: 7d
        warm_duration: 30d
        delete_after: 90d
    - name: audit
      pattern: oan-audit-logs-*
      lifecycle:
        hot_duration: 30d
        warm_duration: 60d
        delete_after: 365d
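One way to read the lifecycle settings above is as age boundaries that move an index from hot to warm to cold storage and finally to deletion. The sketch below assumes the durations are cumulative ages measured from index creation, which the configuration does not state explicitly. `LIFECYCLES` and `tier_for` are hypothetical names used only for illustration.

```python
from datetime import timedelta

# Lifecycle values copied from the configuration above.
# Assumption: each value is an age boundary measured from index creation.
LIFECYCLES = {
    "application": {
        "hot": timedelta(days=7),
        "warm": timedelta(days=30),
        "delete": timedelta(days=90),
    },
    "audit": {
        "hot": timedelta(days=30),
        "warm": timedelta(days=60),
        "delete": timedelta(days=365),
    },
}


def tier_for(index: str, age: timedelta) -> str:
    """Return 'hot', 'warm', 'cold', or 'delete' for an index of the given age."""
    policy = LIFECYCLES[index]
    if age < policy["hot"]:
        return "hot"
    if age < policy["warm"]:
        return "warm"
    if age < policy["delete"]:
        return "cold"
    return "delete"
```

Under this reading, a 45-day-old application index sits in cold storage, while an audit index of the same age is still warm.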
Trace Configuration
Jaeger Configuration
tracing:
  service_name: oan-finance
  sampler:
    type: probabilistic
    param: 0.1
  reporter:
    queue_size: 100
    buffer_flush_interval: 1s
  tags:
    environment: production
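A probabilistic sampler with `param: 0.1` keeps roughly one trace in ten. The sketch below shows one common way to make that decision deterministic per trace, so every service that sees the same trace ID reaches the same verdict. It is an illustration of the idea and not Jaeger's implementation. The SHA-256 hashing step and the `should_sample` name are assumptions.

```python
import hashlib

SAMPLE_RATE = 0.1  # mirrors sampler.param in the configuration above


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Keep roughly `rate` of all traces, decided only by the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # The first 8 bytes give an integer spread evenly over [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partially recorded traces.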
Health Checks
Endpoint Configuration
health_checks:
  endpoints:
    - name: api
      url: /health
      interval: 30s
      timeout: 5s
      success_threshold: 1
      failure_threshold: 3
    - name: database
      url: /health/db
      interval: 1m
      timeout: 10s
      success_threshold: 1
      failure_threshold: 2
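The `success_threshold` and `failure_threshold` settings are usually interpreted as counts of consecutive probe results needed to flip an endpoint's status. The sketch below models that interpretation. `HealthState` is a hypothetical class, and both the consecutive-count reading and the healthy starting state are assumptions, since the configuration does not spell them out.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Tracks one endpoint's status from a stream of probe results."""

    success_threshold: int
    failure_threshold: int
    healthy: bool = True  # assumed initial state
    _successes: int = 0   # consecutive successful probes
    _failures: int = 0    # consecutive failed probes

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the current status."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

With the `api` settings (`success_threshold=1`, `failure_threshold=3`), two failed probes followed by a success leave the endpoint healthy, and only three failures in a row mark it unhealthy.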
Performance Monitoring
Best Practices
Monitoring
- Define clear SLOs/SLIs
- Implement comprehensive monitoring
- Set up proper alerting
- Review metrics regularly
Logging
- Structured logging format
- Proper log levels
- Log rotation
- Log analysis
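As an illustration of the structured-format and log-level points above, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`) and the logger name `oan-app` are assumptions chosen for the example, not a schema taken from this project.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str = "oan-app") -> logging.Logger:
    """Return a logger at INFO level that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

Calling `build_logger().info("payment %s settled", "p-1")` would then write a single JSON line whose `message` field is `payment p-1 settled`, which log pipelines can parse without regular expressions.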
Tracing
- Proper sampling rates
- Context propagation
- Service naming conventions
- Trace correlation
Alerting
- Alert severity levels
- On-call rotations
- Alert routing
- Incident response