OpenSRE · demo.opensre.in

Episodic Memory

Learn from past investigations to improve future ones

Total Episodes: 6 · Resolved: 6 · Unresolved: 0 · Strategies: 2

Investigation Episodes

HTTP 500 Error Rate (critical) · 6m 52s
Services: payment-service, postgres

Payment service returning 500s due to an exhausted PostgreSQL connection pool. Max connections were reached after a deploy removed connection recycling.

Mar 18, 2026
HTTP 500 Error Rate (critical) · 9m 47s
Services: payment-service, postgres, order-service

Cascading 500s from payment-service after the postgres primary failover. Connection strings still pointed to the old primary.

Mar 10, 2026
Latency Spike (high) · 4m 58s
Services: auth-gateway, redis, session-service

Auth gateway p99 latency spiked to 3.1s due to Redis cluster rebalancing after node failure. Session lookups timing out.

Mar 15, 2026
Latency Spike (high) · 7m 25s
Services: auth-gateway, redis

Auth gateway latency increased 20x after the Redis memory limit was reached and the eviction policy started dropping session keys.

Mar 8, 2026
Queue Backlog (critical) · 8m 43s
Services: order-processing, kafka, inventory-service

Order processing consumer lag hit 60k messages after Kafka partition reassignment. Consumer group rebalance took 8 minutes.

Mar 20, 2026
Queue Backlog (critical) · 6m 11s
Services: order-processing, kafka

Order queue depth growing unbounded — consumer deserialization errors after schema registry update. All messages failing validation.

Mar 5, 2026

Auto-Generated Strategies

Database Connection Pool Exhaustion

HTTP 500 Error Rate · from 2 episodes

Investigation strategy for database connection pool issues causing service failures. Derived from recurring PostgreSQL connection exhaustion incidents.

1. Check database connection pool metrics (active, idle, waiting counts) via Prometheus or pg_stat_activity
2. Verify pgBouncer/connection pooler configuration — look for recent config changes in deployment history
3. Inspect application database connection settings (max pool size, connection timeout, idle timeout)
4. Check for connection leaks by correlating open connections with application instance count
5. Validate service discovery and DNS resolution for database endpoints — ensure connections target correct primary
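The pool check in step 1 can be sketched as a small classifier over the counts that pg_stat_activity or a pooler exporter reports. This is a minimal illustration, not part of OpenSRE: the field names and the 90% threshold are assumptions chosen to mirror the Mar 18 incident.

```python
from dataclasses import dataclass

@dataclass
class PoolSnapshot:
    active: int    # connections currently executing queries
    idle: int      # connections held open but unused
    waiting: int   # clients queued for a free connection
    max_conns: int # server/pooler max_connections limit

def pool_status(s: PoolSnapshot) -> str:
    """Classify pool saturation: queued clients or a full pool means exhaustion;
    usage above 90% of the limit is flagged before it becomes an outage."""
    used = s.active + s.idle
    if s.waiting > 0 or used >= s.max_conns:
        return "exhausted"
    if used / s.max_conns > 0.9:
        return "near-limit"
    return "healthy"

# A snapshot resembling the Mar 18 episode: pool fully consumed, clients queued.
print(pool_status(PoolSnapshot(active=95, idle=5, waiting=12, max_conns=100)))  # exhausted
```

In a real investigation the snapshot would come from `SELECT state, count(*) FROM pg_stat_activity GROUP BY state` or the pooler's metrics endpoint; the classification logic stays the same.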

Cache Infrastructure Failures

Latency Spike · from 2 episodes

Investigation strategy for cache layer (Redis/Memcached) issues causing latency spikes and downstream timeouts.

1. Check Redis cluster health — node status, slot coverage, and recent failover events via redis-cli cluster info
2. Monitor Redis memory usage and eviction metrics — verify maxmemory policy and current utilization
3. Inspect client-side connection pool metrics and timeout configurations for affected services
4. Review recent infrastructure changes — node scaling, maintenance windows, configuration updates
5. Validate cache hit/miss ratios and identify if cache stampede or thundering herd is occurring
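Step 2's memory check can be sketched by parsing `redis-cli info memory` / `info stats` output and flagging eviction pressure. The sample text below mimics the real INFO key:value format; the 95% headroom threshold is an illustrative assumption.

```python
def parse_info(text: str) -> dict:
    """Parse Redis INFO output (key:value lines, '#' section headers) into a dict."""
    out = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key.strip()] = value.strip()
    return out

def eviction_risk(info: dict) -> bool:
    """True if keys are already being evicted, or usage is within 5% of maxmemory."""
    used = int(info["used_memory"])
    cap = int(info["maxmemory"])
    evicted = int(info.get("evicted_keys", 0))
    return evicted > 0 or (cap > 0 and used / cap > 0.95)

# Sample resembling the Mar 8 episode: near the memory limit, evictions underway.
sample = """# Memory
used_memory:2040109465
maxmemory:2147483648
maxmemory_policy:allkeys-lru
# Stats
evicted_keys:18342
"""
print(eviction_risk(parse_info(sample)))  # True
```

A nonzero `evicted_keys` under an `allkeys-lru` policy is exactly the failure mode in the Mar 8 episode: the eviction policy drops live session keys, and latency follows.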