Scaling

The Core Problem

A system that works for 100 users will not automatically work for 100,000. The reason is not that the code is wrong — the code is the same. The reason is that certain resources have limits: a database can only handle so many concurrent connections, a server can only hold so much in memory, a disk can only read and write so fast. Scalability is the discipline of identifying which resource will hit its limit first and addressing it before it does.

The word for that first resource is the bottleneck. Every scalability conversation starts with finding it.

Vertical vs Horizontal Scaling

There are two ways to give a system more capacity.

Vertical scaling means making one machine bigger: more CPU cores, more RAM, faster disks. It is the simplest answer and it works immediately. You do not need to change any code. The problems are that there is a physical ceiling to how big one machine can get, bigger machines cost disproportionately more per unit of capacity, and you still have a single machine, which means a single point of failure. If it goes down, everything goes down.

Horizontal scaling means running more machines and distributing work across them. This is how large systems are typically built. The ceiling is much higher, because you can keep adding machines, and individual machines can fail without taking the whole system down. The tradeoff is complexity: your application must be designed to support it.

The critical requirement for horizontal scaling is statelessness. If your application stores anything locally on the machine, such as user sessions in memory or uploaded files on disk, then requests from the same user must always be routed to the same machine, which undermines the point of having interchangeable instances. TrafficGrid is already designed for this. JWTs carry their own authentication state, so any server instance can validate them. Redis holds the token blocklist centrally, so any instance can check it. Nothing is stored locally on an application server that another instance would lack.
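
To make the statelessness argument concrete, here is a minimal sketch of why any instance can validate a token: each instance needs only the shared signing secret plus access to the central blocklist. It assumes an HS256-style signature (one common JWT scheme; TrafficGrid's actual algorithm is not stated here), hand-rolled with the JDK's `javax.crypto.Mac` rather than a JWT library, with no expiry or claim parsing. `StatelessAuthSketch`, `AppInstance` and the in-memory set standing in for Redis are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Simplified HS256-style token check. Not a full JWT implementation:
// no expiry, no claim parsing. The set below stands in for the Redis blocklist.
class StatelessAuthSketch {
    private static final Base64.Encoder B64URL = Base64.getUrlEncoder().withoutPadding();

    // Shared by all instances. In TrafficGrid this lives in Redis, not process memory.
    static final Set<String> BLOCKLIST = ConcurrentHashMap.newKeySet();

    static String sign(byte[] secret, String signingInput) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return B64URL.encodeToString(mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // Token layout: base64url(header).base64url(payload).signature
    static String issue(byte[] secret, String userId) {
        String header = B64URL.encodeToString(
            "{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = B64URL.encodeToString(
            ("{\"sub\":\"" + userId + "\"}").getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + sign(secret, signingInput);
    }

    // An application instance holds only the shared secret: no sessions, no user state.
    static final class AppInstance {
        private final byte[] secret;

        AppInstance(byte[] secret) { this.secret = secret.clone(); }

        boolean accepts(String token) {
            int lastDot = token.lastIndexOf('.');
            if (lastDot < 0) return false;
            String expected = sign(secret, token.substring(0, lastDot));
            String actual = token.substring(lastDot + 1);
            // Constant-time signature comparison, then the shared blocklist check.
            boolean signatureOk = MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.US_ASCII),
                actual.getBytes(StandardCharsets.US_ASCII));
            return signatureOk && !BLOCKLIST.contains(token);
        }
    }
}
```

Blocklisting the whole token string keeps the sketch short; a common alternative is to blocklist the token's `jti` claim with a Redis TTL matching the token's remaining lifetime.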

The Three Bottlenecks in TrafficGrid

Almost every performance problem comes from one of three places. Here is how they map to TrafficGrid specifically.

The database is usually the first bottleneck in a web application, and TrafficGrid is no exception. Every meaningful request eventually touches PostgreSQL: issuing a fine, fetching a vehicle's history, recording a payment. Databases are stateful and cannot be scaled as freely as application servers, which is why most of the effort goes here.

The application server is usually not the bottleneck for a CRUD-heavy system like TrafficGrid, because most of what it does is receive a request, validate it, query the database, and return a response; it is not doing heavy computation. It can become a problem if its request threads block while waiting on slow external calls, such as EcoCash taking 8 seconds to respond or a slow SMS provider. Each waiting request holds a thread, and once the pool of request threads is exhausted, new requests queue behind them. The answer is async processing, covered below.

External dependencies — EcoCash, an SMS gateway, eventually VTS and ZINARA — are a bottleneck you cannot optimise directly because you do not own them. The answer is to never block a user request waiting for them.

Caching

Caching means storing the result of an expensive operation so that the next time the same data is needed, you return the stored result instead of repeating the work. For a database-backed system, that expensive operation is almost always a database query.

The rule for deciding what to cache: read frequently, written rarely, tolerable if briefly stale.

For TrafficGrid specifically:

Fine categories are the clearest cache candidate. Officers load the list every time they open the fine issuing screen. The list changes only when an admin creates or modifies a category — which happens rarely. Cache the full list in Redis with a TTL of several hours. When an admin modifies a category, explicitly invalidate the cache so the next request repopulates it with fresh data.
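
A minimal sketch of that read-and-invalidate cycle, assuming an in-memory map as a stand-in for Redis and categories represented as plain strings; `FineCategoryCacheSketch`, the key name and the two callbacks are hypothetical. The TTL is left out so the invalidation step stands out; in Redis the several-hour TTL would sit alongside it as a safety net.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Cache-aside for the fine category list with explicit invalidation.
// The map stands in for Redis; TTL expiry is omitted to focus on invalidation.
class FineCategoryCacheSketch {
    static final String KEY = "fine-categories:all"; // hypothetical key name

    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Supplier<List<String>> loadFromDatabase;

    FineCategoryCacheSketch(Supplier<List<String>> loadFromDatabase) {
        this.loadFromDatabase = loadFromDatabase;
    }

    // Officer read path: serve from the cache; on a miss, load from the database and store.
    List<String> categories() {
        return cache.computeIfAbsent(KEY, key -> loadFromDatabase.get());
    }

    // Admin write path: write to the database first, then delete the cached
    // copy so the next read repopulates it with fresh data.
    void updateCategories(Runnable writeToDatabase) {
        writeToDatabase.run();
        cache.remove(KEY);
    }
}
```

Deleting the key after the write, rather than rewriting it, keeps the admin path simple: the next officer to open the fine issuing screen triggers the reload.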

Vehicle lookups by number plate are performed constantly by officers in the field. Cache the result of a plate search in Redis with a short TTL — 5 minutes is reasonable. An officer searching the same plate twice in quick succession gets a fast response. Stale data is low risk because vehicle details change infrequently.
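
The plate lookup is the classic cache-aside pattern with a time-to-live. The sketch below assumes an in-memory map in place of Redis (where you would set an expiry on the key instead of storing the expiry time yourself) and an injectable clock so expiry can be exercised; `TtlCacheSketch` and the loader function are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

// Minimal cache-aside with a per-entry TTL. The map stands in for Redis.
class TtlCacheSketch<K, V> {
    private static final class Entry<T> {
        final T value;
        final Instant expiresAt;
        Entry(T value, Instant expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    // ConcurrentHashMap rejects nulls, so a "plate not found" result would need a sentinel value.
    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> clock; // injectable so expiry can be tested

    TtlCacheSketch(Duration ttl, Supplier<Instant> clock) {
        this.ttl = ttl;
        this.clock = clock;
    }

    V get(K key, Function<K, V> loadFromDatabase) {
        Instant now = clock.get();
        Entry<V> cached = store.get(key);
        if (cached != null && now.isBefore(cached.expiresAt)) {
            return cached.value; // hit: no database query
        }
        V fresh = loadFromDatabase.apply(key); // miss or expired: query the database
        store.put(key, new Entry<>(fresh, now.plus(ttl)));
        return fresh;
    }
}
```

With a 5-minute TTL, a second search for the same plate one minute later is served from the cache, while a search six minutes later goes back to the database.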

Authenticated user details: your JWT filter currently loads the full user from the database on every request to verify the token and attach the user to the security context. With hundreds of concurrent officers each making frequent requests, that adds up to a large number of database hits for data that barely changes. Cache the user object in Redis keyed by user ID, with a TTL equal to your access token lifetime (15 minutes). That means roughly one database hit per user every 15 minutes instead of one per request. Evict the entry whenever an admin changes a user's role or deactivates the account, so the change takes effect immediately rather than when the cached copy expires.
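
A sketch of the lookup the filter would perform, assuming an in-memory map standing in for Redis, an injectable clock, and hypothetical `UserDetails` and `CachedUserResolverSketch` types. The 15-minute TTL mirrors the access token lifetime, and the `evict` method is the hook for a role change or deactivation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

// Per-request user lookup with a cache in front of the database. The map
// stands in for Redis; UserDetails and the "user:<id>" key are hypothetical.
class CachedUserResolverSketch {
    static final class UserDetails {
        final String id;
        final String role;
        UserDetails(String id, String role) { this.id = id; this.role = role; }
    }

    private static final class CachedUser {
        final UserDetails user;
        final Instant expiresAt;
        CachedUser(UserDetails user, Instant expiresAt) { this.user = user; this.expiresAt = expiresAt; }
    }

    static final Duration ACCESS_TOKEN_LIFETIME = Duration.ofMinutes(15);

    private final Map<String, CachedUser> cache = new ConcurrentHashMap<>();
    private final Function<String, UserDetails> loadUserFromDatabase;
    private final Supplier<Instant> clock;

    CachedUserResolverSketch(Function<String, UserDetails> loadUserFromDatabase, Supplier<Instant> clock) {
        this.loadUserFromDatabase = loadUserFromDatabase;
        this.clock = clock;
    }

    // Called by the JWT filter after the signature check; userId is the token's subject.
    UserDetails resolve(String userId) {
        String key = "user:" + userId;
        Instant now = clock.get();
        CachedUser cached = cache.get(key);
        if (cached != null && now.isBefore(cached.expiresAt)) {
            return cached.user; // no database hit
        }
        UserDetails fresh = loadUserFromDatabase.apply(userId);
        cache.put(key, new CachedUser(fresh, now.plus(ACCESS_TOKEN_LIFETIME)));
        return fresh;
    }

    // Call when an admin changes a user's role or deactivates the account.
    void evict(String userId) {
        cache.remove("user:" + userId);
    }
}
```

A hundred requests from the same officer within the TTL cost one database query; after `evict`, the next request reloads the user and sees the new role.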

What you must never cache: fine status, payment status, or any financial data. These must always be read from the database. The cost of serving stale data here — telling a citizen their fine is still unpaid when it has already been paid — is a real-world support problem and a trust issue.

Database Optimisation

Since the database is the primary bottleneck, here are the tools available in order of how you should apply them.

Indexes are the first and cheapest tool, and the main ones are already defined. Without an index, finding the fines for a specific vehicle in a table of a million fines means scanning every row; with an index on vehicle_id, the database jumps directly to the matching rows. The cost of an index is slightly slower writes and more disk space, both acceptable for a read-heavy system like this one.
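
A toy model of what the index buys, counting rows examined. PostgreSQL's default index type is a B-tree; here a `TreeMap` keyed by vehicle_id plays that role. `IndexToySketch` and its `Fine` type are hypothetical, and the counter deliberately ignores the (logarithmic) cost of walking the tree itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for a B-tree index on vehicle_id.
class IndexToySketch {
    static final class Fine {
        final long id;
        final long vehicleId;
        Fine(long id, long vehicleId) { this.id = id; this.vehicleId = vehicleId; }
    }

    static long rowsExamined = 0; // crude measure of work done

    // No index: every row has to be checked.
    static List<Fine> fullScan(List<Fine> table, long vehicleId) {
        List<Fine> matches = new ArrayList<>();
        for (Fine f : table) {
            rowsExamined++;
            if (f.vehicleId == vehicleId) matches.add(f);
        }
        return matches;
    }

    // The index is extra storage that every insert must also update...
    static Map<Long, List<Fine>> buildIndex(List<Fine> table) {
        Map<Long, List<Fine>> index = new TreeMap<>();
        for (Fine f : table) index.computeIfAbsent(f.vehicleId, k -> new ArrayList<>()).add(f);
        return index;
    }

    // ...but a lookup goes straight to the matching rows.
    static List<Fine> indexLookup(Map<Long, List<Fine>> index, long vehicleId) {
        List<Fine> matches = index.getOrDefault(vehicleId, List.of());
        rowsExamined += matches.size();
        return matches;
    }
}
```

With 100,000 fines spread over 5,000 vehicles, the scan examines all 100,000 rows to find one vehicle's 20 fines, while the index lookup touches only those 20.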

Query optimisation means ensuring your queries are not doing more work than necessary. The most common problem in JPA applications is the N+1 query problem: loading a list of 20 fines and then issuing 20 separate queries to load the vehicle for each one, totalling 21 queries instead of 1. In Spring Data JPA this is solved with @EntityGraph or JPQL JOIN FETCH to load related entities in a single query. This is worth auditing on every endpoint that returns a list.
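
A query-counting simulation makes the 21-versus-1 arithmetic visible. It assumes hypothetical `Fine`, `Vehicle` and `FakeDatabase` types and only counts round trips; it does not model JPA, so the JPQL in the comment is an illustrative fetch join, not TrafficGrid's actual repository method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Counts round trips to show the N+1 pattern versus a single joined fetch.
class NPlusOneSketch {
    static final class Vehicle {
        final long id;
        final String plate;
        Vehicle(long id, String plate) { this.id = id; this.plate = plate; }
    }

    static final class Fine {
        final long id;
        final long vehicleId;
        Fine(long id, long vehicleId) { this.id = id; this.vehicleId = vehicleId; }
    }

    static final class FineView {
        final Fine fine;
        final Vehicle vehicle;
        FineView(Fine fine, Vehicle vehicle) { this.fine = fine; this.vehicle = vehicle; }
    }

    static final class FakeDatabase {
        final List<Fine> fines = new ArrayList<>();
        final Map<Long, Vehicle> vehicles = new HashMap<>();
        int queryCount = 0;

        List<Fine> selectFines() { queryCount++; return fines; }

        Vehicle selectVehicle(long id) { queryCount++; return vehicles.get(id); }

        // One round trip, comparable to: SELECT f FROM Fine f JOIN FETCH f.vehicle
        List<FineView> selectFinesWithVehicles() {
            queryCount++;
            List<FineView> rows = new ArrayList<>();
            for (Fine f : fines) rows.add(new FineView(f, vehicles.get(f.vehicleId)));
            return rows;
        }
    }

    // N+1: one query for the list, then one per fine. Lazy loading does this
    // silently when each fine's vehicle is first accessed.
    static List<FineView> loadNaively(FakeDatabase db) {
        List<FineView> out = new ArrayList<>();
        for (Fine f : db.selectFines()) out.add(new FineView(f, db.selectVehicle(f.vehicleId)));
        return out;
    }

    // Fetch join: fines and their vehicles together in a single query.
    static List<FineView> loadWithJoin(FakeDatabase db) {
        return db.selectFinesWithVehicles();
    }
}
```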

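Connection pool tuning: HikariCP keeps a pool of open database connections so that requests do not pay the cost of opening a new connection each time. The default pool size (maximumPoolSize) is 10. Under heavy load, if every connection is in use, requests wait for one to be returned, and fail if none becomes free within the connection timeout (30 seconds by default). Tuning the pool size to your expected concurrency is one of the highest-leverage optimisations available and requires only configuration, not code. Bigger is not automatically better, though: each connection consumes memory on the PostgreSQL server, and the combined pools of all application instances must stay below PostgreSQL's max_connections limit.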
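
Before changing `maximumPoolSize` (set in Spring Boot via `spring.datasource.hikari.maximum-pool-size`), a rough estimate helps. This sketch applies Little's law: the average number of connections in use equals the rate of requests needing a connection multiplied by how long each one holds it. The traffic figures in the usage note, the 2x headroom and the 10 reserved connections are hypothetical choices for illustration; measure real hold times before relying on any number.

```java
// Back-of-envelope pool sizing with Little's law.
class PoolSizingSketch {
    // requestsPerSecond: database-touching requests per second on one instance.
    // holdMillis: average time each request keeps a connection checked out.
    static double averageConnectionsInUse(double requestsPerSecond, double holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }

    // Multiply by a headroom factor for bursts and round up.
    static int suggestedPoolSize(double requestsPerSecond, double holdMillis, double headroom) {
        return (int) Math.ceil(averageConnectionsInUse(requestsPerSecond, holdMillis) * headroom);
    }

    // All instances share one PostgreSQL server, so their pools together must fit
    // under max_connections after reserving slots for other clients (migrations, admin tools).
    static boolean fitsServer(int poolPerInstance, int instances, int maxConnections, int reserved) {
        return poolPerInstance * instances <= maxConnections - reserved;
    }
}
```

At an assumed 300 requests per second holding a connection for 20 ms each, about 6 connections are busy on average; with 2x headroom the sketch suggests 12, slightly above HikariCP's default of 10. Two instances at 12 each use 24 connections, well within PostgreSQL's default max_connections of 100.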

Read replicas: PostgreSQL supports replication, in which one primary instance handles all writes and one or more replicas receive a copy of every write and serve read queries. TrafficGrid's workload is heavily skewed toward reads: citizens browsing fines, officers searching plates, admins viewing reports. Routing those reads to a replica takes significant load off the primary. Replicas can lag slightly behind the primary, so reads that must reflect the latest write, such as a fine's payment status, should still go to the primary. This is not needed at initial scale, but it is the standard next step when the primary starts showing strain.
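
In Spring, read/write splitting is commonly built on `AbstractRoutingDataSource`, choosing a target per transaction, for example based on whether it is marked `@Transactional(readOnly = true)`. The sketch below strips that down to the routing decision itself, with data sources represented by name; `ReadWriteRouterSketch` and the `mustBeFresh` flag are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Routing decision for a primary/replica setup. Data sources are plain names
// here; in Spring this logic would sit behind an AbstractRoutingDataSource.
class ReadWriteRouterSketch {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger nextReplica = new AtomicInteger();

    ReadWriteRouterSketch(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = List.copyOf(replicas);
    }

    // Writes, and reads that must see the latest committed data (for example a
    // fine's payment status), go to the primary. Other reads rotate across replicas.
    String route(boolean readOnly, boolean mustBeFresh) {
        if (!readOnly || mustBeFresh || replicas.isEmpty()) {
            return primary;
        }
        int i = Math.floorMod(nextReplica.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}
```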

Async Processing

Some operations do not need to complete before the API responds to the client. Doing them synchronously — making the client wait — is wasteful and fragile.

When an officer issues a fine, the critical path is: validate the request, save the fine to the database, return 201 to the officer's app. That should take under 200ms. Sending an SMS notification to the vehicle owner is important but the officer does not need to wait for it. If the SMS provider is slow or temporarily down, you do not want that to cause the fine issuing request to fail or time out.

The solution is a message queue. Instead of calling the notification service directly, the fine creation handler drops an event onto a queue and returns immediately. A separate consumer process picks events off the queue and handles them independently.

Officer submits fine
        ↓
Validate + save to database       ← must succeed, fast
        ↓
Publish FINE_CREATED event        ← fast, just writing to queue
        ↓
Return 201 to officer             ← done in under 200ms

[Separately, asynchronously]
Queue consumer reads FINE_CREATED
        ↓
Look up linked citizens for vehicle
        ↓
Send SMS / push notification
        ↓
Write to notifications table
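
An in-process sketch of the flow above, assuming a `LinkedBlockingQueue` as a stand-in for a real message broker and hypothetical `FineEventsSketch` and `FineCreated` types. A real broker adds what this sketch lacks: durability across restarts, delivery to consumers running in other processes, and acknowledgements.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for the FINE_CREATED pipeline.
class FineEventsSketch {
    static final class FineCreated {
        final long fineId;
        final String plate;
        FineCreated(long fineId, String plate) { this.fineId = fineId; this.plate = plate; }
    }

    final BlockingQueue<FineCreated> queue = new LinkedBlockingQueue<>();
    final List<Long> savedFines = new CopyOnWriteArrayList<>();
    final List<String> notificationsSent = new CopyOnWriteArrayList<>();

    // Request path: save, publish, return. Nothing here waits on the SMS provider.
    int issueFine(long fineId, String plate) {
        savedFines.add(fineId);                    // stands in for the database insert
        queue.add(new FineCreated(fineId, plate)); // publish FINE_CREATED: just an enqueue
        return 201;                                // HTTP 201 Created
    }

    // Consumer on its own thread, so slow SMS delivery never affects request latency.
    Thread startConsumer() {
        Thread consumer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    FineCreated event = queue.take(); // blocks until an event arrives
                    // Here: look up linked citizens, send the SMS, write the notifications row.
                    notificationsSent.add("SMS for fine " + event.fineId + " (" + event.plate + ")");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutting down
            }
        }, "notification-consumer");
        consumer.setDaemon(true);
        consumer.start();
        return consumer;
    }
}
```

If the consumer starts after a fine is issued, the event simply waits in the queue until it is taken. With an in-memory queue that only holds while the process stays up, which is why a durable broker matters in production.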

The benefits compound. The officer's response time is decoupled from SMS delivery speed. If the SMS provider goes down, notifications wait in the queue and are delivered when it recovers; as long as the queue is durable, nothing is lost. The notification consumer can be scaled independently of the main API, and failed notifications can be retried automatically with exponential backoff.
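
Exponential backoff doubles the wait after each failed attempt, up to a cap. A minimal sketch, assuming a 1-second base and 30-second cap chosen purely for illustration, with a `BooleanSupplier` standing in for the SMS call; `BackoffSketch` and its method names are hypothetical. The "full jitter" step, which randomises each wait between zero and the computed delay, is a common refinement rather than something prescribed here.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

class BackoffSketch {
    // Delay before retry `attempt` (0-based): base * 2^attempt, never above cap.
    static Duration delay(int attempt, Duration base, Duration cap) {
        int shift = Math.min(Math.max(attempt, 0), 30); // bounded to avoid overflow
        Duration d = base.multipliedBy(1L << shift);
        return d.compareTo(cap) > 0 ? cap : d;
    }

    // Full jitter: wait a random time up to the computed delay so that many
    // failed notifications do not all retry at the same instant.
    static Duration withJitter(Duration d) {
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(d.toMillis() + 1));
    }

    // Tries `send` up to maxAttempts times. Returns false when attempts run out,
    // leaving the caller to record the failure for follow-up.
    static boolean sendWithRetry(BooleanSupplier send, int maxAttempts, Duration base, Duration cap) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (send.getAsBoolean()) return true;
            if (attempt == maxAttempts - 1) break; // no wait after the final failure
            try {
                Thread.sleep(withJitter(delay(attempt, base, cap)).toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

With those illustrative values the delays run 1, 2, 4, 8 and 16 seconds, then stay at 30. In a queue-based setup the consumer would often re-enqueue the event with a delay instead of sleeping on a thread, but the schedule is the same.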

In TrafficGrid, the operations that belong in a queue are: all notifications (fine issued, payment confirmed, document expiry reminders), payment webhook processing, and document expiry scanning (the scheduled job that checks for expiring documents and enqueues reminders).
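
The scheduled expiry scan stays small because it only enqueues. A minimal sketch, assuming a hypothetical `Document` type, a 30-day reminder window picked for illustration, and a plain `Queue` standing in for the message queue; in Spring the method would typically be triggered by a `@Scheduled` job.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Queue;

class ExpiryScanSketch {
    static final class Document {
        final long id;
        final LocalDate expiresOn;
        Document(long id, LocalDate expiresOn) { this.id = id; this.expiresOn = expiresOn; }
    }

    // Finds documents expiring between today and today + windowDays (inclusive)
    // and enqueues a reminder for each. It sends nothing itself; the notification
    // consumer does that. In production the filter would be a WHERE clause on an
    // indexed expiry column, not a loop over every row.
    static int scan(List<Document> documents, LocalDate today, int windowDays, Queue<Long> reminders) {
        LocalDate horizon = today.plusDays(windowDays);
        int enqueued = 0;
        for (Document d : documents) {
            boolean notYetExpired = !d.expiresOn.isBefore(today);
            boolean withinWindow = !d.expiresOn.isAfter(horizon);
            if (notYetExpired && withinWindow) {
                reminders.add(d.id);
                enqueued++;
            }
        }
        return enqueued;
    }
}
```

A real job would also need to avoid sending the same reminder every day, for example by recording which reminders have already been enqueued.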

Load Balancing

When you run multiple instances of the application, a load balancer distributes incoming requests across them. The simplest strategy is round-robin: the first request goes to instance 1, the second to instance 2, and so on, cycling back to the first. More sophisticated strategies also take server health and response time into account.
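
Round-robin with a health check fits in a few lines. A minimal sketch, assuming instances identified by name and health status set by some external checker; `RoundRobinSketch` and its methods are hypothetical, and on a managed platform this logic lives in the platform rather than in your code.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinSketch {
    private final List<String> instances;
    private final Set<String> unhealthy = ConcurrentHashMap.newKeySet();
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSketch(List<String> instances) {
        this.instances = List.copyOf(instances);
    }

    void markUnhealthy(String instance) { unhealthy.add(instance); }

    void markHealthy(String instance) { unhealthy.remove(instance); }

    // Cycle through instances in order, skipping any that failed a health check.
    String pick() {
        for (int tries = 0; tries < instances.size(); tries++) {
            int i = Math.floorMod(counter.getAndIncrement(), instances.size());
            String candidate = instances.get(i);
            if (!unhealthy.contains(candidate)) return candidate;
        }
        throw new IllegalStateException("no healthy instances");
    }
}
```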

Because TrafficGrid is stateless at the application layer, any instance can handle any request. The load balancer can send consecutive requests from the same user to different instances and everything works correctly.

For Railway specifically, this is handled automatically when you scale your service to multiple instances in the dashboard.

What This Looks Like for TrafficGrid at Scale

Today the architecture is a single application instance, PostgreSQL, and Redis. That is correct for the current stage — premature optimisation wastes engineering time.

As the system grows, the evolution path is:

First bottleneck hit: database query times increase as data grows. Response: add any indexes that turn out to be missing, fix N+1 queries, and tune the HikariCP pool size.

Second bottleneck hit: database CPU under sustained load. Response: add Redis caching for fine categories, vehicle lookups, and user details. For a read-heavy system with a good cache hit rate, this alone can reduce database load by something like 60–70%; the actual figure depends on the read/write mix and the hit rate.

Third bottleneck hit: application server CPU or connection limits. Response: add a second application instance behind a load balancer. Because the app is already stateless, this requires no code changes, though the combined connection pools of all instances must still fit within PostgreSQL's connection limit.

Fourth bottleneck hit: the database primary is saturated even with caching in place. Response: add a read replica and route read queries to it, leaving the primary's capacity for writes.
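
Whether caching delivers a reduction of that size comes down to two numbers you can measure: the share of database work that is reads, and the cache hit rate. A sketch of that estimate, assuming writes always reach the database, reads reach it only on a cache miss, and every operation costs the same (real queries do not); `CacheSavingsSketch` and the example figures are hypothetical.

```java
// Rough model of how much database load a cache removes.
class CacheSavingsSketch {
    // readShare: fraction of database operations that are reads (0 to 1).
    // hitRate: fraction of those reads served from Redis (0 to 1).
    // Returns the fraction of the original database load that remains.
    static double remainingLoad(double readShare, double hitRate) {
        return 1.0 - readShare * hitRate;
    }
}
```

At 90% reads and a 75% hit rate, about 32.5% of the original load remains, a reduction of roughly two thirds; at 80% reads and a 50% hit rate the reduction is only 40%.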

At no point does this require redesigning the system from scratch. The architecture decisions already made — stateless application, Redis for shared state, async-ready design — were made precisely so this path is straightforward.