Failure Handling

The Starting Point

Everything fails. Databases go down. Networks drop packets. Third-party APIs time out. Servers run out of memory. Disks fill up. The question is never "will something fail" — it is "when something fails, what does the system do?"

A system with no failure handling strategy crashes completely when any one component has a problem. A well-designed system degrades gracefully — some things stop working, but the core functionality continues, and when the failed component recovers, the system recovers with it automatically.

Types of Failure

Before designing a response, you need to understand what you are responding to. Failures in a system like TrafficGrid fall into a few categories.

Transient failures are temporary. A database query times out because the server was briefly under load. A network packet is dropped. The SMS provider returns a 503 for 30 seconds. If you retry the same operation a moment later, it succeeds. Most failures in production are transient.

Permanent failures are not resolved by retrying. A record does not exist. A payment was already processed. A fine category is inactive. Retrying will produce the same error. These require different handling — return the correct error response and do not retry.

Partial failures are the hardest. The database write succeeds but the queue publish fails. The payment provider confirms success but your webhook handler crashes before updating the fine status. Part of the operation completed and part did not. Without a strategy for partial failures, your data ends up in an inconsistent state.

Cascade failures happen when one failing component causes failures to spread. Your application is waiting for a slow EcoCash response. While it waits, the thread is blocked. New requests arrive and their threads also block waiting. Within seconds every thread in the pool is blocked and the entire application stops responding — not because of anything wrong with your own code, but because of a slow external dependency.

Retries and Idempotency

The natural response to a transient failure is to try again. But retries only work safely if the operation is idempotent — meaning running it twice produces the same result as running it once.

A GET request is inherently idempotent. Fetching a fine twice gives you the same fine twice — no harm done. A POST request is not inherently idempotent. If you issue a fine and the response is lost in transit, retrying the POST issues two fines against the same vehicle. That is a real problem.

This is exactly why the payments table has an idempotency_key column. When a client initiates a payment, it generates a unique key for that transaction and includes it in the request. If the network drops and the client retries, the server checks the key — if it already exists, it returns the existing payment record instead of processing a duplicate charge. The operation is safe to retry.
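The key check can be sketched in a few lines of Java. This is a minimal illustration, not the project's actual service: the PaymentService class, its field names, and the in-memory map are assumptions standing in for the payments table, where a unique constraint on idempotency_key would normally do the same job.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the payments table.
final class PaymentService {
    record Payment(String id, String idempotencyKey, long fineId, long amountCents) {}

    // Keyed by idempotency key; a real table would enforce this with a unique constraint.
    private final Map<String, Payment> paymentsByKey = new ConcurrentHashMap<>();

    // Creates a payment for a new key, or returns the existing payment when the
    // key has been seen before, so a client retry never creates a second charge.
    Payment initiate(String idempotencyKey, long fineId, long amountCents) {
        return paymentsByKey.computeIfAbsent(idempotencyKey,
            key -> new Payment(UUID.randomUUID().toString(), key, fineId, amountCents));
    }

    int paymentCount() {
        return paymentsByKey.size();
    }
}
```

Calling initiate twice with the same key returns the same Payment and leaves exactly one record behind, which is the behaviour a retrying client depends on.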

The same principle applies to your queue consumers. When a FINE_CREATED event is picked up and processed, the consumer sends an SMS and writes a notification record. If the consumer crashes after sending the SMS but before writing the record, the message broker will redeliver the event and the consumer will try again. You need to handle the case where the SMS has already been sent: check whether a notification record already exists before sending, and keep the gap between sending and recording as small as possible, because a crash inside that gap can still produce one duplicate SMS.
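A rough sketch of such a consumer follows. The FineCreatedConsumer class, the SmsSender interface, and the in-memory set are hypothetical names used for illustration; the set stands in for a lookup against the notification records.

```java
import java.util.HashSet;
import java.util.Set;

final class FineCreatedConsumer {
    // Hypothetical abstraction over the SMS provider.
    interface SmsSender {
        void send(String phone, String message);
    }

    private final SmsSender sms;
    // Stand-in for the notification records, keyed by fine id.
    private final Set<Long> notifiedFineIds = new HashSet<>();

    FineCreatedConsumer(SmsSender sms) {
        this.sms = sms;
    }

    // Returns true when an SMS was sent, false when the event was a duplicate.
    boolean onFineCreated(long fineId, String phone) {
        if (notifiedFineIds.contains(fineId)) {
            return false; // redelivered event: the notification already exists
        }
        sms.send(phone, "Fine " + fineId + " has been issued");
        notifiedFineIds.add(fineId); // record written only after a successful send
        return true;
    }
}
```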

Retry strategy matters. Retrying immediately in a tight loop makes things worse — if the dependency is overloaded, hammering it with retries adds more load. The correct approach is exponential backoff with jitter: wait 1 second before the first retry, 2 seconds before the second, 4 before the third, and so on, with a small random component added to prevent all clients retrying simultaneously.
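As a sketch of that schedule, the helper below doubles the delay on each retry, caps it, and adds up to 25% random jitter. The class names, the 25% figure, and the TransientFailure marker exception are assumptions made for the example. Only TransientFailure is retried; any other exception is treated as permanent and propagates immediately.

```java
import java.time.Duration;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical marker for failures that are worth retrying.
final class TransientFailure extends RuntimeException {
    TransientFailure(String message) {
        super(message);
    }
}

final class Backoff {
    // Retry 1 waits about base, retry 2 about 2x base, retry 3 about 4x base,
    // capped at max, plus up to 25% random jitter.
    static Duration delayFor(int retryNumber, Duration base, Duration max, Random random) {
        int shift = Math.min(retryNumber - 1, 20); // bound the shift to avoid overflow
        long exponential = base.toMillis() << shift;
        long capped = Math.min(exponential, max.toMillis());
        long jitter = (long) (capped * 0.25 * random.nextDouble());
        return Duration.ofMillis(capped + jitter);
    }

    // Runs the operation, retrying up to maxRetries times on TransientFailure.
    static <T> T retry(Supplier<T> operation, int maxRetries, Duration base, Duration max) {
        Random random = new Random();
        for (int retry = 1; ; retry++) {
            try {
                return operation.get();
            } catch (TransientFailure e) {
                if (retry > maxRetries) {
                    throw e; // retries exhausted: surface the failure
                }
                sleep(delayFor(retry, base, max, random));
            }
        }
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

With a base of 1 second and a cap of 30 seconds, the third retry waits between 4 and 5 seconds, and every retry from the sixth onward waits between 30 and 37.5 seconds.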

Timeouts

Every call to an external system must have a timeout. Without one, a thread can wait indefinitely for a response that will never come.

For TrafficGrid, every outbound call needs a configured timeout:

Call	Recommended Timeout
PostgreSQL queries	5 seconds — if a query takes longer something is wrong
Redis operations	500ms — Redis is in-memory and should be near-instant
EcoCash API	30 seconds — payment providers can be slow
SMS provider	10 seconds
VTS / ZINARA API (future)	15 seconds

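In Spring Boot, HikariCP has a connectionTimeout for how long to wait for a connection from the pool. It does not limit how long a query runs, so query time needs its own limit, such as a statement or transaction timeout. Spring Data Redis has commandTimeout. RestTemplate and WebClient have connectTimeout and readTimeout. Do not rely on the defaults: some are far longer than the values above (HikariCP's connectionTimeout defaults to 30 seconds), and some HTTP clients apply no read timeout at all unless one is set. Configure each explicitly, or threads can block for far too long in production.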
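To make the idea concrete without tying it to one Spring client, here is a sketch using the JDK's built-in java.net.http client, which separates the connect timeout (set on the client) from the per-request timeout. The OutboundTimeouts class, the 5-second connect timeout, and the URL passed by the caller are assumptions for illustration; the 30-second EcoCash and 10-second SMS values come from the table above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class OutboundTimeouts {
    // Assumed value: the table above does not specify a connect timeout.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(5);
    static final Duration ECOCASH_TIMEOUT = Duration.ofSeconds(30);
    static final Duration SMS_TIMEOUT = Duration.ofSeconds(10);

    // The connect timeout applies to every request sent through this client.
    static HttpClient client() {
        return HttpClient.newBuilder()
            .connectTimeout(CONNECT_TIMEOUT)
            .build();
    }

    // The request timeout bounds the wait for a response; when it expires,
    // sending fails with an HttpTimeoutException instead of blocking forever.
    static HttpRequest ecoCashRequest(String url, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(ECOCASH_TIMEOUT)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }
}
```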

Circuit Breakers

A timeout prevents one thread from waiting forever. But if an external dependency is down, every request that depends on it will wait the full timeout duration before failing. With a 30-second EcoCash timeout and 100 requests per minute, about 50 requests are blocked at any moment (100 per minute × 0.5 minutes), which is enough to exhaust a small thread pool.

A circuit breaker solves this. It sits in front of calls to external dependencies and tracks the failure rate. When failures exceed a threshold — say 50% of calls in the last 10 seconds — the circuit opens. While open, calls to that dependency fail immediately without even attempting the network call. After a configured time, the circuit enters a half-open state and allows one request through as a probe. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again.

Normal operation (circuit CLOSED):
Request → Circuit Breaker → EcoCash API → Response

EcoCash goes down, failures accumulate:
Request → Circuit Breaker → EcoCash API → Timeout (×5)

Circuit OPENS:
Request → Circuit Breaker → Immediate failure (no network call)

After recovery window, circuit HALF-OPEN:
One probe request → EcoCash API → Success → Circuit CLOSES

For Spring Boot, Resilience4j is the standard library. You annotate a method with @CircuitBreaker and configure the thresholds. Combined with a fallback method, you can return a degraded response instead of an error — for example, if the vehicle verification API is down, return the vehicle data without the isVerified flag rather than failing the entire request.
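The state machine itself is small enough to sketch by hand. The class below is an illustration only and is not Resilience4j's API: it opens after a number of consecutive failures rather than tracking a failure rate over a window, and it does not limit the half-open state to a single probe. All names and the injected clock are assumptions for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // An open circuit becomes half-open once the recovery window has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the protected call, or returns the fallback without attempting
    // the call at all while the circuit is open.
    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed probe reopens immediately; otherwise open at the threshold.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

In production code a library such as Resilience4j is the better choice; the sketch is only meant to show why an open circuit costs nothing per request.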

Graceful Degradation

Graceful degradation means that when a non-critical component fails, the system continues operating in a reduced capacity rather than failing entirely.

In TrafficGrid, not all functionality is equally critical. Here is how to think about it:

Core operations that must work — issuing fines, recording payments, citizen login, plate search. These cannot degrade. If the database is down, these fail and that is unavoidable.

Operations that can degrade — vehicle verification against VTS/ZINARA, SMS notifications, document expiry reminders. If the VTS API is down, an officer can still issue a fine — the vehicle just will not be marked as verified. If the SMS provider is down, the fine is still recorded and the notification is queued for delivery when the provider recovers.

The pattern for degradation is: attempt the non-critical operation, catch the failure, log it, continue without it. Do not let a failed SMS delivery roll back a successfully issued fine.
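In code, the pattern might look like the sketch below. FineIssuer, SmsSender, and the in-memory list of issued fines are hypothetical stand-ins; the point is only the ordering: the critical write happens first, and a failure in the non-critical step is caught and logged instead of propagating.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class FineIssuer {
    // Hypothetical abstraction over the SMS provider.
    interface SmsSender {
        void send(String phone, String message);
    }

    private static final Logger LOG = Logger.getLogger(FineIssuer.class.getName());

    private final SmsSender sms;
    final List<String> issuedFines = new ArrayList<>(); // stand-in for the fines table

    FineIssuer(SmsSender sms) {
        this.sms = sms;
    }

    // Returns true when the SMS was also sent, false when the fine was
    // recorded but the notification failed.
    boolean issueFine(String plate, String phone) {
        issuedFines.add(plate); // critical step: must not be undone by what follows
        try {
            sms.send(phone, "A fine was issued for vehicle " + plate);
            return true;
        } catch (RuntimeException e) {
            // Non-critical step failed: log it and continue without it.
            LOG.log(Level.WARNING, "SMS failed for " + plate + "; fine kept", e);
            return false;
        }
    }
}
```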

Partial Failure and Data Consistency

This is the hardest category. Consider the payment flow:

1. Client submits payment
2. Your server calls EcoCash
3. EcoCash processes the payment and returns success
4. Your server updates payment status to SUCCESS
5. Your server updates fine status to PAID
6. Your server publishes a PAYMENT_SUCCESS event to the queue
7. Queue consumer sends confirmation SMS

What happens if your server crashes at step 4 — after EcoCash has taken the money but before your database is updated? The citizen has been charged. Your system shows the fine as still PENDING. This is a real-world support nightmare.

The answer is to make the EcoCash callback the source of truth rather than the synchronous response. The flow becomes:

1. Client submits payment
2. Your server creates a PENDING payment record
3. Your server calls EcoCash with the payment reference
4. EcoCash processes the payment
5. EcoCash calls your webhook with the result
6. Your webhook handler updates payment and fine status atomically in a database transaction
7. Webhook handler publishes PAYMENT_SUCCESS event

Now even if your server crashes mid-process, the PENDING payment record survives, and the webhook can be processed correctly once the server is back. This relies on EcoCash retrying a webhook that was not received. The idempotency key on the payment ensures double-processing is safe.

For database operations specifically, transactions are your primary tool for consistency. Operations that must succeed or fail together — updating a payment to SUCCESS and updating the fine to PAID — must be wrapped in a single transaction. If either fails, both are rolled back. In Spring, @Transactional on the service method handles this. The important thing is knowing which operations must be transactional and ensuring they are.
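The webhook handler's core logic can be sketched as follows. This is an in-memory illustration under stated assumptions: the class and method names are hypothetical, the maps stand in for the payments and fines tables, and the synchronized method stands in for the database transaction that @Transactional would provide in a Spring service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PaymentWebhookHandler {
    enum PaymentStatus { PENDING, SUCCESS }
    enum FineStatus { PENDING, PAID }

    private final Map<String, PaymentStatus> payments = new HashMap<>();
    private final Map<String, Long> fineByPayment = new HashMap<>();
    private final Map<Long, FineStatus> fines = new HashMap<>();
    final List<String> publishedEvents = new ArrayList<>(); // stand-in for the queue

    // Creates the PENDING payment record before EcoCash is called.
    synchronized void createPending(String paymentRef, long fineId) {
        payments.put(paymentRef, PaymentStatus.PENDING);
        fineByPayment.put(paymentRef, fineId);
        fines.put(fineId, FineStatus.PENDING);
    }

    // Returns true when the webhook changed state, false when it was a repeat.
    synchronized boolean onSuccessWebhook(String paymentRef) {
        PaymentStatus current = payments.get(paymentRef);
        if (current == null) {
            throw new IllegalArgumentException("unknown payment " + paymentRef);
        }
        if (current == PaymentStatus.SUCCESS) {
            return false; // duplicate webhook: nothing to do
        }
        // Both updates happen together, mirroring a single transaction.
        payments.put(paymentRef, PaymentStatus.SUCCESS);
        fines.put(fineByPayment.get(paymentRef), FineStatus.PAID);
        publishedEvents.add("PAYMENT_SUCCESS:" + paymentRef);
        return true;
    }

    synchronized FineStatus fineStatus(long fineId) {
        return fines.get(fineId);
    }
}
```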

Dead Letter Queues

When a message on a queue fails processing repeatedly — after all retries are exhausted — it should not be discarded silently. It should be moved to a dead letter queue (DLQ).

A dead letter queue is a separate queue that holds messages that could not be processed. An alert fires when messages arrive in the DLQ. A developer can then inspect the message, understand why it failed, fix the underlying issue, and replay the message.

Without a dead letter queue, a failed notification or a failed fine status update disappears silently. You have no way of knowing the failure happened and no way to recover from it.
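Most message brokers provide dead-lettering as a configuration option, so the following is only an in-memory sketch of the behaviour. The RetryingQueue class and its attempt counter are hypothetical, and the backoff between retries is left out to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

final class RetryingQueue {
    record Message(String body, int attempts) {}

    private final Deque<Message> pending = new ArrayDeque<>();
    final List<Message> deadLetters = new ArrayList<>(); // the dead letter queue
    private final int maxAttempts;

    RetryingQueue(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void publish(String body) {
        pending.addLast(new Message(body, 0));
    }

    // Delivers every pending message; a message that keeps failing is moved
    // to the dead letter queue instead of being dropped.
    void drain(Consumer<String> handler) {
        while (!pending.isEmpty()) {
            Message message = pending.pollFirst();
            try {
                handler.accept(message.body());
            } catch (RuntimeException e) {
                Message failed = new Message(message.body(), message.attempts() + 1);
                if (failed.attempts() >= maxAttempts) {
                    deadLetters.add(failed); // an alert would fire here
                } else {
                    pending.addLast(failed); // redeliver later
                }
            }
        }
    }
}
```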

Health Checks

Railway, like any other deployment platform, needs to know whether your application is running correctly. Spring Boot Actuator exposes a /actuator/health endpoint automatically. It checks whether the application can reach the database and Redis and returns an aggregate status.

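You should expose at least two health check endpoints. Outside Kubernetes, Spring Boot only exposes them when management.endpoint.health.probes.enabled=true is set.

/actuator/health/liveness: is the application process alive and able to handle requests? This is checked by the platform to decide whether to restart the instance.

/actuator/health/readiness: is the application ready to receive traffic? For this endpoint to reflect database and Redis connectivity, those checks must be added to the readiness group (for example with management.endpoint.health.group.readiness.include=readinessState,db,redis); by default the group only reports the application's own readiness state. If readiness fails, the load balancer stops sending traffic to this instance without killing it, giving it time to recover.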

The distinction matters. An application that is alive but cannot reach the database should not receive traffic — but it should not be restarted either, because restarting will not fix a database issue. Readiness handles this correctly.
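The difference between the two probes can be modelled in a few lines. This is a plain-Java illustration, not Actuator's API: HealthProbes and its register method are hypothetical, and the status strings merely mirror the UP and OUT_OF_SERVICE values Actuator reports.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class HealthProbes {
    private final Map<String, BooleanSupplier> dependencies = new LinkedHashMap<>();

    HealthProbes register(String name, BooleanSupplier check) {
        dependencies.put(name, check);
        return this;
    }

    // Liveness ignores dependencies: if this code runs, the process is alive,
    // and restarting it would not repair a broken database.
    String liveness() {
        return "UP";
    }

    // Readiness fails as soon as any registered dependency check fails or throws.
    String readiness() {
        for (Map.Entry<String, BooleanSupplier> dependency : dependencies.entrySet()) {
            boolean healthy;
            try {
                healthy = dependency.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                healthy = false;
            }
            if (!healthy) {
                return "OUT_OF_SERVICE";
            }
        }
        return "UP";
    }
}
```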

Failure Handling in TrafficGrid — Summary

Scenario	Handling Strategy
Database query timeout	Query or transaction timeout + Spring `@Transactional` rollback. HikariCP `connectionTimeout` only bounds the wait for a pooled connection.
Redis unavailable	Fail open — skip cache, go directly to database. Log the failure.
EcoCash API slow	30s timeout + circuit breaker. Return `PENDING` payment, rely on webhook for confirmation.
EcoCash API down	Circuit breaker opens. Return error to client. Payment not attempted.
SMS provider down	Notification stays `PENDING` in queue. Retried with exponential backoff when provider recovers.
Webhook received twice	Idempotency key check — second webhook is a no-op if payment already `SUCCESS`.
Consumer crashes mid-processing	Message redelivered by broker. Idempotency checks prevent double-processing.
Message fails all retries	Moved to dead letter queue. Alert fires. Manual investigation.
Application instance crashes	Load balancer detects failed health check, stops routing traffic to that instance. Other instances absorb the load.
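The "fail open" strategy for Redis in the table above could look roughly like the sketch below. FailOpenCache and its nested Cache interface are hypothetical; the function passed as the database loader stands in for the real query.

```java
import java.util.Optional;
import java.util.function.Function;

final class FailOpenCache<K, V> {
    // Hypothetical abstraction over the Redis client.
    interface Cache<K, V> {
        Optional<V> get(K key);
        void put(K key, V value);
    }

    private final Cache<K, V> cache;
    private final Function<K, V> database;

    FailOpenCache(Cache<K, V> cache, Function<K, V> database) {
        this.cache = cache;
        this.database = database;
    }

    // A cache failure is logged and treated as a miss; it never fails the request.
    V load(K key) {
        try {
            Optional<V> hit = cache.get(key);
            if (hit.isPresent()) {
                return hit.get();
            }
        } catch (RuntimeException e) {
            System.err.println("cache read failed, using database: " + e.getMessage());
        }
        V value = database.apply(key);
        try {
            cache.put(key, value);
        } catch (RuntimeException e) {
            System.err.println("cache write failed, continuing: " + e.getMessage());
        }
        return value;
    }
}
```

Failing open keeps requests working while Redis is down, but it sends every read to the database, so the database needs enough headroom to absorb that load.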