Anatomy of a Drupal Performance Crisis: When MySQL Is on Fire, More MySQL Isn't the Fix
A forensic walkthrough of stabilizing a high-traffic Drupal 9 platform under sustained bot pressure. Why the obvious fix was the wrong fix, and what actually worked.
A Drupal 9 platform with roughly 2M pages was buckling under sustained traffic. MySQL was pinned at 100%, the editorial team was locked out of admin, and the obvious instinct in the room was to scale the database server. That would have been the wrong move.
The real cause was three layers upstream:
- Bot traffic hammering search endpoints in coordinated waves.
- The search index queue piling up faster than it could clear.
- No Redis layer, so every request was hitting MySQL for sessions and render cache.
Here is what the diagnosis looked like, and what actually moved the needle.
The setup
The platform was a Drupal 9 content site with around 2M pages of historical content built up over many years. Traffic was healthy and predictable on a normal day. The infrastructure was fairly standard: Nginx, PHP-FPM, MySQL, and Drupal core search behind a default Cloudflare proxy.
There was no Redis. There was no edge WAF beyond Cloudflare's defaults. The site had grown organically, and the caching architecture had not grown with it.
That gap is where the story starts.
The symptoms
The first signals were the ones you would expect:
- Page response times spiking from sub-second to 8-15 seconds.
- MySQL CPU pinned at 100% across long stretches.
- Editorial users reporting they could not log in or save content.
- The PHP-FPM worker pool saturating and starting to reject connections.
The instinct in the room was the usual one: scale the database. More cores, more RAM, a bigger instance. The reasoning is tempting in the moment: the alarm is going off on MySQL, so MySQL must be the problem.
It was not.
Tracing upstream
The first useful move was to stop looking at the database and start looking at request patterns at the edge.
A scan of access logs showed a sustained pattern of requests to search endpoints from a relatively narrow range of user agents, hitting at a rate well outside any plausible human pattern. This was not a classic DDoS. It was distributed, persistent, low-intensity-per-IP bot traffic targeting the search functionality specifically.
That immediately reframed the problem. MySQL was not slow because it was undersized. MySQL was slow because every search request was triggering core Drupal search behavior, which queues indexing work, which writes to the database, which adds more rows under load, which makes subsequent reads slower, which makes more workers wait.
Once you see the loop, the fix sequence becomes obvious. You have to break the loop in three places: at the edge, in the cache layer, and at the source.
Fix 1: Block at the edge
The first move was Cloudflare WAF rules, sketched below the list. Specifically:
- Custom rules to challenge or block requests matching the abusive bot patterns by user agent and request signature.
- Rate limiting on the search endpoints, so even legitimate-looking traffic could not flood that path.
- Geo and ASN filters where the data justified it.
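Here is roughly what those custom rule expressions can look like. They are illustrative rather than the exact rules from this engagement: the path, the user-agent fragment, and the ASN values are placeholders, and the lines starting with # are annotations, not part of Cloudflare's expression syntax.

```
# Challenge scripted traffic to the search path that is not a verified bot.
# Action: Managed Challenge
(http.request.uri.path contains "/search"
  and not cf.client.bot
  and http.user_agent contains "python-requests")

# Block outright where the ASN data justified it.
# Action: Block (64496 and 64511 are reserved documentation ASNs, used as placeholders)
(http.request.uri.path contains "/search"
  and ip.geoip.asnum in {64496 64511})
```

Rate limiting on the same path is configured separately as a Cloudflare rate limiting rule (a request threshold per period, keyed by client IP), so even traffic that passes these filters cannot flood the search endpoint.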
This alone removed a large fraction of the abusive load before it ever reached the origin. No amount of MySQL tuning beats traffic that never arrives.
A pattern I keep coming back to: the cheapest request to handle is the one your origin never sees.
Fix 2: Get sessions and cache off MySQL
Drupal's default backend for both cache and session storage is the database. On a small site this is fine. On a high-traffic site, it is the silent reason your database is on fire.
Installing Redis and pointing the cache backend, session storage, and render cache at it produced the most dramatic single-step improvement of the whole engagement. MySQL went from pinned to comfortable inside an hour. Page response times dropped back into the sub-second range for cache hits.
What got moved over (a settings.php sketch follows the list):
- The Redis module's cache backend, configured for the default cache, render, and dynamic page cache bins.
- Sessions moved to Redis with appropriate TTLs.
- Lock and flood control bins routed to Redis to remove additional MySQL pressure.
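The wiring looks roughly like this, assuming the contrib redis module and the PhpRedis extension. The host, port, and module path are placeholders, and the session handler change is not shown because it means overriding Drupal's session handling service rather than flipping a cache setting.

```php
<?php
// settings.php additions (sketch): assumes drupal/redis plus the PhpRedis extension.

// Connection details for the Redis instance (placeholders).
$settings['redis.connection']['interface'] = 'PhpRedis';
$settings['redis.connection']['host'] = '127.0.0.1';
$settings['redis.connection']['port'] = 6379;

// Make Redis the default cache backend, which picks up the render and
// dynamic page cache bins along with every bin not explicitly pinned elsewhere.
$settings['cache']['default'] = 'cache.backend.redis';

// The module ships a services file that also routes lock, flood, and the
// cache tag checksum services to Redis; review the copy in your installed
// version before including it.
$settings['container_yamls'][] = 'modules/contrib/redis/example.services.yml';
```

Individual bins can still be pinned to another backend with $settings['cache']['bins'] if something needs to stay in the database.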
The lesson here is not subtle. If you are running Drupal at any meaningful scale without Redis or Memcached, your database is doing work it should not be doing.
Fix 3: Throttle the search index queue
The third fix was the most Drupal-specific. Core Search and the Search API ecosystem rely on indexing queues that process content in batches. Under sustained search abuse, those queues grow faster than they clear, and the queue runner workers pile up holding database connections and writing to indexing tables.
The fix involved the following, with a configuration sketch after the list:
- Reducing the queue batch size for search indexing.
- Capping the number of concurrent queue workers that could process the search queue.
- Auditing for runaway queries originating from the indexing path and adding query-level limits where appropriate.
- Scheduling heavy indexing work to off-peak windows.
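For the batch-size piece, the relevant knob in core Search is the number of items indexed per cron run; one way to pin it is a settings.php override, sketched below. The value is an example, not a recommendation, and on a Search API build the equivalent control (the index's cron batch size) lives on each index's own configuration rather than in this file.

```php
<?php
// settings.php sketch: cap how many items core Search indexes per cron run.
// The value is illustrative; tune it against what the database can absorb.
$config['search.settings']['index']['cron_limit'] = 25;
```

Concurrency was capped outside Drupal: when queue processing is driven by drush (for example, a single drush queue:run per queue from the scheduler) rather than web cron, you control how many runners exist and when they fire, which is also how the heavy indexing work was pushed into off-peak windows.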
This stopped the bleed at the source. Even if abusive traffic slipped past the edge, the indexing path could no longer destabilize the database.
The result
The site stabilized within hours, not days. Editorial access came back first because session storage was off MySQL. Public response times normalized once the WAF rules took effect and Redis absorbed the read pressure. The search indexing queue caught up overnight.
Total infrastructure cost change: trivial. A small Redis instance and Cloudflare rule configuration. No database scaling. No origin scaling. The fix was almost entirely architectural, not capacity-driven.
Lessons I keep coming back to
1. The bottleneck is almost never where the alarm is going off.
MySQL was on fire because of decisions made three layers up. If you start tuning at the alarm, you spend money and never reach the cause.
2. Vertical scaling is the most expensive answer to most Drupal performance problems.
It buys time, not solutions. Edge filtering, smart caching, and queue control beat raw hardware in nine out of ten Drupal scaling situations I have seen.
3. Redis is not optional past a certain threshold.
If your traffic is non-trivial and your cache backend is still the database, you are paying a tax on every request.
4. Search is the most under-discussed performance surface in Drupal.
Everyone talks about Views and entity loads. Almost no one talks about how core Search and Search API modules can take down a site when abused.
5. Read access logs first, dashboards second.
The dashboard tells you what is on fire. The logs tell you why.
Closing thought
When something is buckling at the database layer, the temptation is always to look down the stack: more disk, more RAM, faster CPUs, better indexes. Sometimes that is the answer. More often, the answer is upstream — where a request pattern, a missing cache layer, or an unbounded queue is doing damage that no amount of database tuning will undo.
If you are running Drupal at scale and any of this sounds familiar, I am always happy to compare notes. You can reach me through the contact link on this site.