OpenSRE · demo.opensre.in

Episodic Memory

Learn from past investigations to improve future ones

Total Episodes: 6 · Resolved: 6 · Unresolved: 0 · Strategies: 2

Investigation Episodes

HTTP 500 Error Rate (critical) · 6m 52s
Services: payment-service, postgres

Payment service returning 500s due to an exhausted PostgreSQL connection pool. Max connections were reached after a deploy removed connection recycling.

Mar 18, 2026
HTTP 500 Error Rate (critical) · 9m 47s
Services: payment-service, postgres, order-service

Cascading 500s from payment-service after the postgres primary failover. Connection strings still pointed to the old primary.

Mar 10, 2026
Latency Spike (high) · 4m 58s
Services: auth-gateway, redis, session-service

Auth gateway p99 latency spiked to 3.1s due to Redis cluster rebalancing after node failure. Session lookups timing out.

Mar 15, 2026
Latency Spike (high) · 7m 25s
Services: auth-gateway, redis

Auth gateway latency increased 20x after the Redis memory limit was reached and the eviction policy started dropping session keys.

Mar 8, 2026
Queue Backlog (critical) · 8m 43s
Services: order-processing, kafka, inventory-service

Order processing consumer lag hit 60k messages after Kafka partition reassignment. Consumer group rebalance took 8 minutes.

Mar 20, 2026
Queue Backlog (critical) · 6m 11s
Services: order-processing, kafka

Order queue depth growing unbounded — consumer deserialization errors after schema registry update. All messages failing validation.

Mar 5, 2026

Auto-Generated Strategies

Database Connection Pool Exhaustion

HTTP 500 Error Rate · from 2 episodes

Investigation strategy for database connection pool issues causing service failures. Derived from recurring PostgreSQL connection exhaustion incidents.

1. Check database connection pool metrics (active, idle, waiting counts) via Prometheus or pg_stat_activity
2. Verify pgBouncer/connection pooler configuration — look for recent config changes in deployment history
3. Inspect application database connection settings (max pool size, connection timeout, idle timeout)
4. Check for connection leaks by correlating open connections with application instance count
5. Validate service discovery and DNS resolution for database endpoints — ensure connections target correct primary
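The pool check in step 1 can be sketched as a small classifier over the counts that pg_stat_activity or a pooler exporter reports. This is a minimal illustration, not part of OpenSRE: the field names and the 90% threshold are assumptions chosen to mirror the Mar 18 incident.

```python
from dataclasses import dataclass

@dataclass
class PoolSnapshot:
    active: int    # connections currently executing queries
    idle: int      # connections held open but unused
    waiting: int   # clients queued for a free connection
    max_conns: int # server/pooler max_connections limit

def pool_status(s: PoolSnapshot) -> str:
    """Classify pool saturation: queued clients or a full pool means exhaustion;
    usage above 90% of the limit is flagged before it becomes an outage."""
    used = s.active + s.idle
    if s.waiting > 0 or used >= s.max_conns:
        return "exhausted"
    if used / s.max_conns > 0.9:
        return "near-limit"
    return "healthy"

# A snapshot resembling the Mar 18 episode: pool fully consumed, clients queued.
print(pool_status(PoolSnapshot(active=95, idle=5, waiting=12, max_conns=100)))  # exhausted
```

In a real investigation the snapshot would come from `SELECT state, count(*) FROM pg_stat_activity GROUP BY state` or the pooler's metrics endpoint; the classification logic stays the same.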

Cache Infrastructure Failures

Latency Spike · from 2 episodes

Investigation strategy for cache layer (Redis/Memcached) issues causing latency spikes and downstream timeouts.

1. Check Redis cluster health — node status, slot coverage, and recent failover events via redis-cli cluster info
2. Monitor Redis memory usage and eviction metrics — verify maxmemory policy and current utilization
3. Inspect client-side connection pool metrics and timeout configurations for affected services
4. Review recent infrastructure changes — node scaling, maintenance windows, configuration updates
5. Validate cache hit/miss ratios and identify if cache stampede or thundering herd is occurring
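Step 2's memory check can be sketched by parsing `redis-cli info memory` / `info stats` output and flagging eviction pressure. The sample text below mimics the real INFO key:value format; the 95% headroom threshold is an illustrative assumption.

```python
def parse_info(text: str) -> dict:
    """Parse Redis INFO output (key:value lines, '#' section headers) into a dict."""
    out = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key.strip()] = value.strip()
    return out

def eviction_risk(info: dict) -> bool:
    """True if keys are already being evicted, or usage is within 5% of maxmemory."""
    used = int(info["used_memory"])
    cap = int(info["maxmemory"])
    evicted = int(info.get("evicted_keys", 0))
    return evicted > 0 or (cap > 0 and used / cap > 0.95)

# Sample resembling the Mar 8 episode: near the memory limit, evictions underway.
sample = """# Memory
used_memory:2040109465
maxmemory:2147483648
maxmemory_policy:allkeys-lru
# Stats
evicted_keys:18342
"""
print(eviction_risk(parse_info(sample)))  # True
```

A nonzero `evicted_keys` under an `allkeys-lru` policy is exactly the failure mode in the Mar 8 episode: the eviction policy drops live session keys, and latency follows.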