AEM Performance & Troubleshooting: The Complete Guide

June 14, 2026 · 11 min read

A practical toolkit for diagnosing AEM problems — thread dumps, heap dumps, memory leaks, GC logs, query performance, Dispatcher cache analysis, log analysis, and the notorious resource resolver leak. With commands, tools, a cheat sheet, best practices, and do's & don'ts.

Tags: AEM, Performance, Troubleshooting, DevOps, JVM, Reference

When an AEM instance is slow, hung, or running out of memory, the difference between a five-minute fix and a five-hour outage is knowing which tool to reach for. Performance troubleshooting isn't guesswork — it's a small set of diagnostics (thread dumps, heap dumps, GC logs, request logs) and a method for reading each one. This guide is that toolkit.

We'll cover the JVM-level diagnostics first — thread dumps, heap dumps, memory leaks, and GC logs — then the AEM-specific ones — query performance, Dispatcher cache analysis, log analysis, and the single most common AEM memory leak, the unclosed resource resolver. Each section is practical: how to capture the data, how to read it, and how to act. A cheat sheet, best practices, and do's & don'ts close it out.

It draws on material from the JCR & Oak guide (queries), the Dispatcher guide (cache), and the Sling guide (resource resolvers).

The troubleshooting mindset

Before any specific tool, internalize the method: gather evidence, localize the layer, then fix. Slow AEM is almost always one of a handful of causes — a hung thread, a memory leak, an unindexed query, a cold cache, or a leaked resolver — and each leaves a distinct fingerprint in a specific diagnostic. The mistake is changing things before you've identified which. Capture the right artifact first; the fix usually becomes obvious once you have it.

Thread Dumps

A thread dump is a snapshot of every thread in the JVM and exactly what each is executing. It's your go-to for hangs, deadlocks, and high CPU — anything where "the server is stuck" or "requests aren't returning."

Capture one a few different ways:

# By process id (best — full stacks)
jstack <pid> > threaddump.txt

# Or signal the JVM to print to its stdout log
kill -3 <pid>

On AEM you can also grab one from the Web Console (/system/console/status-Threads), and on AEM as a Cloud Service you download thread dumps through Cloud Manager / the Developer Console.

The crucial technique is to take 3–5 dumps a few seconds apart. A single dump is a still frame; the series reveals which threads are stuck (same stack across all dumps) versus merely busy. When you read them, look for: threads in BLOCKED state waiting on the same lock (contention), an explicit deadlock report at the bottom, and many threads parked in the same application stack frame (a slow dependency or a hot path). Tools like fastThread.io or a thread-dump analyzer make patterns jump out.
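As a rough illustration of the "same stack across all dumps" check, here is a minimal sketch in plain Java (assuming Java 11). It assumes jstack-style text: a quoted thread name on the header line and frames on lines starting with `at`. The names `StuckThreads`, `parse`, and `findStuck` are made up for this example. Idle pool threads parked in the same wait also show identical stacks, so treat the output as a list of candidates to read, not a verdict.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Finds threads whose stack is identical in every dump of a series (hypothetical helper). */
final class StuckThreads {

    /** Parses one jstack-style dump into: thread name -> its "at ..." frames joined by newlines. */
    static Map<String, String> parse(String dump) {
        Map<String, String> stacks = new LinkedHashMap<>();
        String name = null;
        StringBuilder frames = new StringBuilder();
        for (String line : dump.split("\\R")) {
            String t = line.trim();
            if (t.startsWith("\"")) {                       // thread header: "name" #12 prio=5 ...
                if (name != null) stacks.put(name, frames.toString());
                int end = t.indexOf('"', 1);
                name = end > 0 ? t.substring(1, end) : t;
                frames.setLength(0);
            } else if (name != null && t.startsWith("at ")) {
                frames.append(t).append('\n');
            }
        }
        if (name != null) stacks.put(name, frames.toString());
        return stacks;
    }

    /** Names of threads present in every dump with the same non-empty stack. */
    static List<String> findStuck(List<String> dumps) {
        List<String> stuck = new ArrayList<>();
        if (dumps.size() < 2) return stuck;                 // a single dump proves nothing
        List<Map<String, String>> parsed = new ArrayList<>();
        for (String dump : dumps) parsed.add(parse(dump));
        for (Map.Entry<String, String> e : parsed.get(0).entrySet()) {
            if (e.getValue().isEmpty()) continue;           // no Java frames, e.g. VM threads
            boolean same = true;
            for (Map<String, String> other : parsed.subList(1, parsed.size())) {
                if (!e.getValue().equals(other.get(e.getKey()))) { same = false; break; }
            }
            if (same) stuck.add(e.getKey());
        }
        return stuck;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java StuckThreads.java dump1.txt dump2.txt [...]");
            return;
        }
        List<String> dumps = new ArrayList<>();
        for (String file : args) dumps.add(Files.readString(Paths.get(file)));
        findStuck(dumps).forEach(System.out::println);
    }
}
```

Run it against the 3 to 5 files you captured, then open the dumps and read the stacks of the threads it names.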

Heap Dumps

A heap dump is a snapshot of every object on the heap and the references between them. It's the definitive tool for memory problems — finding what's consuming memory and why it isn't being released.

# Capture a heap dump on demand
jmap -dump:format=b,file=heap.hprof <pid>

Far better than capturing by hand is to capture automatically on failure by setting these JVM flags, so an out-of-memory crash leaves you a dump to analyze:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/dumps
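To confirm the flags actually took effect, one option is to ask the JVM itself. The sketch below assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `DumpFlagCheck` is invented for this example. It reports the flags of the JVM it runs in, so it is only meaningful when executed inside the process you care about.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Reads HotSpot VM options from the running JVM (hypothetical helper). */
final class DumpFlagCheck {

    /** Returns the current value of a HotSpot VM option, e.g. "true", "false" or a path. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = " + vmOption("HeapDumpOnOutOfMemoryError"));
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

From outside the process, `jcmd <pid> VM.flags` lists the active flags without any code.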

Analyze the .hprof in the Eclipse Memory Analyzer (MAT). Its Leak Suspects report and dominator tree show you which object graph retains the most memory — for AEM that's frequently a collection that keeps growing or a pile of session/resolver objects. Sort by retained heap to find the real culprit, then trace its incoming references to see what's holding it alive.

Important: Capturing a heap dump pauses the JVM for the duration (seconds to minutes on a large heap), so avoid doing it casually on production at peak. On AEM as a Cloud Service (AEMaaCS), request heap dumps through the proper Adobe channel rather than running jmap yourself.

Memory Leaks

A memory leak in a managed runtime means objects that are no longer needed are still referenced, so the garbage collector can't reclaim them — and the heap creeps upward until the instance slows to a crawl or crashes with an OutOfMemoryError. The symptoms are a steadily rising heap (even after full GCs), increasingly frequent and long GC pauses, and eventually OOM.

In AEM, the usual suspects are predictable:

Unclosed ResourceResolvers / JCR Sessions — the number-one cause (its own section below).
Unclosed streams or HTTP connections.
Unbounded caches — a Map used as a cache with no eviction.
Listeners/observers never unregistered (e.g. a component that registers in @Activate but doesn't clean up in @Deactivate).
ThreadLocals not cleared on pooled threads.
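For the unbounded-cache suspect in the list above, a bounded alternative can be sketched with the JDK's `LinkedHashMap` and its `removeEldestEntry` hook. The class name `BoundedCache` and the size limit are illustrative choices, not something AEM provides. This version is not thread-safe; wrap it with `Collections.synchronizedMap` or use a proper caching library if several threads share it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-bounded LRU cache: the least recently used entry is evicted once the limit is exceeded. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder=true keeps entries in LRU order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // called by put(); true means "drop the eldest entry"
    }

    public static void main(String[] args) {
        Map<String, String> cache = new BoundedCache<>(2);
        cache.put("/content/a", "A");
        cache.put("/content/b", "B");
        cache.get("/content/a");         // touch a, so b becomes the eldest entry
        cache.put("/content/c", "C");    // exceeds the limit, so b is evicted
        System.out.println(cache.keySet()); // prints [/content/a, /content/c]
    }
}
```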

The diagnostic path is consistent: confirm the trend in GC logs (heap rising after each full GC), then capture a heap dump and use MAT's dominator tree to identify the growing object graph, fix the reference, and verify the heap flattens.

GC Logs

Garbage collection logs are the cheapest early-warning system you have, and they should be on in every environment. They record every GC event — when it ran, how long it paused, and how much heap it reclaimed.

# Java 11+ unified logging
-Xlog:gc*:file=/var/log/gc.log:time,uptime:filecount=5,filesize=20M

What you're reading the logs for:

Pause times — long stop-the-world pauses hurt response times.
Frequency — GC running constantly means the heap is too small or churning.
Heap after full GC — this is the leak signal. In a healthy JVM, the old generation drops back down after a full GC; in a leaking one, the post-GC floor rises over time until there's no room left.

Feed the log into GCeasy.io or GCViewer for a visual. A healthy graph is a sawtooth that returns to roughly the same baseline; an upward-trending baseline after full GCs is a leak, full stop.
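If you want a quick check without a visual tool, the "rising floor" signal can be approximated in a few lines. The sketch below assumes unified-logging lines that contain `Pause Full` followed by a `before->after(total)` triple in megabytes, such as `900M->310M(1024M)`; other collectors, units, or JDK versions may log differently. The names `GcFloor`, `postFullGcHeapMb`, and `floorIsRising` are hypothetical, and "strictly rising across at least three full GCs" is a deliberately crude heuristic.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the heap size left after each full GC and checks whether it keeps rising. */
final class GcFloor {
    // Assumed shape: "... Pause Full (Allocation Failure) 900M->310M(1024M) 412.7ms"
    private static final Pattern FULL_GC =
        Pattern.compile("Pause Full.*?(\\d+)M->(\\d+)M\\((\\d+)M\\)");

    /** Heap in MB remaining after each full GC, in log order. */
    static List<Integer> postFullGcHeapMb(List<String> lines) {
        List<Integer> floors = new ArrayList<>();
        for (String line : lines) {
            Matcher m = FULL_GC.matcher(line);
            if (m.find()) floors.add(Integer.parseInt(m.group(2)));
        }
        return floors;
    }

    /** True when there are at least three full GCs and each leaves more heap than the last. */
    static boolean floorIsRising(List<Integer> floors) {
        if (floors.size() < 3) return false;
        for (int i = 1; i < floors.size(); i++) {
            if (floors.get(i) <= floors.get(i - 1)) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java GcFloor.java /path/to/gc.log");
            return;
        }
        List<Integer> floors = postFullGcHeapMb(Files.readAllLines(Paths.get(args[0])));
        System.out.println("post-full-GC heap (MB): " + floors);
        System.out.println("rising floor: " + floorIsRising(floors));
    }
}
```

A `true` result is a prompt to capture a heap dump, not proof of a leak; a cache that is still warming up can look the same for a while.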

Query Performance

Slow, unindexed queries are one of the most common causes of a sluggish AEM, because a query without an index makes Oak traverse thousands of nodes per request. The fingerprint is a log warning like Traversed 10000 nodes ... consider creating an index.

The diagnostic workflow — recognize the traversal warning, EXPLAIN the query to see its plan, inspect the Oak Query Statistics JMX MBean for the worst offenders, and fix by adding or tightening an index — is covered step by step in the JCR & Oak guide. The one-line takeaway for triage: if a page is slow and the logs show traversal, you have an indexing problem, not a code problem.
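For triage, it can help to rank the traversal warnings before opening any query. This sketch scans log lines for the `Traversed N nodes ... consider creating an index` pattern quoted above and keeps the largest count per distinct message body. It assumes the whole warning sits on one line; the class and method names (`TraversalWarnings`, `worstByFilter`) are invented for the example.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Ranks Oak traversal warnings by the highest node count seen for each query/filter text. */
final class TraversalWarnings {
    // Group 1: node count. Group 2: whatever the log prints between the count and the hint.
    private static final Pattern TRAVERSED =
        Pattern.compile("Traversed (\\d+) nodes(.*?)consider creating an index");

    /** Maps each distinct warning body to the largest traversal count logged for it. */
    static Map<String, Integer> worstByFilter(List<String> lines) {
        Map<String, Integer> worst = new HashMap<>();
        for (String line : lines) {
            Matcher m = TRAVERSED.matcher(line);
            if (m.find()) {
                worst.merge(m.group(2).trim(), Integer.parseInt(m.group(1)), Integer::max);
            }
        }
        return worst;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TraversalWarnings.java /path/to/error.log");
            return;
        }
        worstByFilter(Files.readAllLines(Paths.get(args[0]))).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

The entries at the top of the output are the queries to run through EXPLAIN first.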

Dispatcher Cache Analysis

If publish CPU is high or pages are slow under load, the question is often "why isn't this cached?" Every cache miss runs Java on publish, so a low cache-hit ratio is a performance problem in disguise.

Analyze it by reading response headers and the docroot. A missing or zero Age header, X-Cache: MISS (both are typically set by a CDN or caching proxy in front of the Dispatcher rather than by the Dispatcher itself), or no file in /docroot for a URL means it isn't being cached. The usual causes are a Set-Cookie on the response, a no-cache/max-age=0 header, query-string fragmentation, or a cache rule that excludes the path. The full cache-troubleshooting playbook (including "content not updating" and "wrong content served") lives in the Dispatcher guide. For performance specifically, aim for the highest hit ratio you can and treat every persistent miss as a bug to investigate.
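The header checks can be scripted so they are easy to repeat across many URLs. The sketch below uses the JDK 11 `java.net.http` client and flags the signals named above (Set-Cookie, no-cache or max-age=0, missing or zero Age, X-Cache: MISS). The class name `CacheHeaderCheck` and the `findings` method are hypothetical, and whether Age or X-Cache appear at all depends on what sits in front of your Dispatcher.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Lists the response-header signals that suggest a URL is not being served from cache. */
final class CacheHeaderCheck {

    /** Expects lower-case header names; returns one finding per suspicious signal. */
    static List<String> findings(Map<String, List<String>> headers) {
        List<String> out = new ArrayList<>();
        if (headers.containsKey("set-cookie")) out.add("Set-Cookie present");
        String cc = String.join(",", headers.getOrDefault("cache-control", List.of()))
            .toLowerCase(Locale.ROOT);
        if (cc.contains("no-cache") || cc.contains("max-age=0")) {
            out.add("Cache-Control prevents caching: " + cc);
        }
        String age = String.join("", headers.getOrDefault("age", List.of())).trim();
        if (age.isEmpty() || age.equals("0")) out.add("Age missing or zero");
        String xCache = String.join(",", headers.getOrDefault("x-cache", List.of()))
            .toUpperCase(Locale.ROOT);
        if (xCache.contains("MISS")) out.add("X-Cache: MISS");
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java CacheHeaderCheck.java https://your-site/page.html");
            return;
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).GET().build();
        HttpResponse<Void> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.discarding());
        Map<String, List<String>> lower = new HashMap<>();
        response.headers().map().forEach((k, v) -> lower.put(k.toLowerCase(Locale.ROOT), v));
        List<String> result = findings(lower);
        System.out.println(result.isEmpty() ? "no cache-miss signals in headers" : result);
    }
}
```

Request the same URL twice: the first response may legitimately be a miss, so the second one is the one to judge.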

Log Analysis

Logs are where most investigations start, so knowing which file says what saves real time. The key AEM logs are:

Log	Tells you
`error.log`	Exceptions, warnings, stack traces — the first place to look
`request.log`	Every request with its duration — find the slow ones
`access.log`	HTTP access records (status, size)
`stdout.log`	JVM output, GC (if logged there), startup

A few high-value moves: grep request.log for long durations to find slow requests, grep error.log for ERROR/repeated WARN and for tell-tale strings (Traversed, OutOfMemoryError, has not been closed), and raise a specific package's log level on the fly via the OSGi Sling Log Support console (/system/console/slinglog) to capture detail without restarting. On AEM as a Cloud Service you don't tail files on a box — you stream and download logs through Cloud Manager, the Developer Console, or aio CLI tooling.

Tip: Add a temporary, narrowly-scoped DEBUG logger (one package, one file) when reproducing an issue, and remove it afterward. A broad DEBUG logger on a busy instance generates enormous noise and can itself become a performance problem.
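Finding slow requests in request.log means pairing each request line with its response line, which is awkward with grep alone. The sketch below assumes the common layout where a request is logged as `[id] -> METHOD /path ...` and its response as `[id] <- STATUS content-type NNNms`; adjust the patterns if your log is configured differently. `SlowRequests` and `slowRequests` are made-up names.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pairs request and response lines by request id and reports the slow ones. */
final class SlowRequests {
    private static final Pattern REQUEST = Pattern.compile("\\[(\\d+)\\] -> (\\S+) (\\S+)");
    private static final Pattern RESPONSE =
        Pattern.compile("\\[(\\d+)\\] <- (\\d+) (?:.* )?(\\d+)ms\\s*$");

    /** Returns "NNNms STATUS METHOD /path" for every response at or above thresholdMs. */
    static List<String> slowRequests(List<String> lines, long thresholdMs) {
        Map<String, String> open = new HashMap<>();    // request id -> "METHOD /path"
        List<String> slow = new ArrayList<>();
        for (String line : lines) {
            Matcher req = REQUEST.matcher(line);
            if (req.find()) {
                open.put(req.group(1), req.group(2) + " " + req.group(3));
                continue;
            }
            Matcher res = RESPONSE.matcher(line);
            if (res.find()) {
                String what = open.remove(res.group(1));
                long ms = Long.parseLong(res.group(3));
                if (what != null && ms >= thresholdMs) {
                    slow.add(ms + "ms " + res.group(2) + " " + what);
                }
            }
        }
        return slow;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java SlowRequests.java /path/to/request.log thresholdMs");
            return;
        }
        slowRequests(Files.readAllLines(Paths.get(args[0])), Long.parseLong(args[1]))
            .forEach(System.out::println);
    }
}
```

Requests that never receive a response line stay unpaired and are not reported; during a hang those are worth a separate look.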

Resource Resolver Leaks

This deserves its own section because it's the most common memory leak in AEM, full stop. Every ResourceResolver (and the JCR Session behind it) holds resources, and if you don't close it, it lingers — thousands of leaked resolvers will steadily exhaust the heap.

The fingerprint is unmistakable. You'll see warnings in error.log such as "Resource resolver ... has not been closed" (with a stack trace pointing at the opening code), the JCR session count climbing in JMX, and a heap dump full of ResourceResolverImpl/SessionImpl objects.

The fix is discipline: always close every resolver, ideally with try-with-resources so it's automatic even on exceptions, and never use an admin resolver:

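// Imports: java.util.Map, org.apache.sling.api.resource.LoginException,
// org.apache.sling.api.resource.ResourceResolver, org.apache.sling.api.resource.ResourceResolverFactory
// Assumes resolverFactory is an injected ResourceResolverFactory and "mysite-reader" maps to a service user.
Map<String, Object> authInfo =
    Map.of(ResourceResolverFactory.SUBSERVICE, "mysite-reader");

try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
    // use the resolver; it is closed automatically at the end of the block
} catch (LoginException e) {
    // thrown when the service user mapping is missing or the login is refused
    throw new IllegalStateException("Cannot open service resolver", e);
}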

The leak almost always comes from a code path that opens a resolver and returns early, or stores it in a field and never closes it. The warning's stack trace tells you exactly where it was opened — start there. (The correct resolver lifecycle is covered in the Sling guide.)
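The early-return leak is easy to see in isolation. The sketch below does not use the Sling API: `FakeResolver` is a hypothetical stand-in that only counts open instances, so the example runs on a plain JDK. It contrasts a method that skips `close()` on one branch with a try-with-resources version that cannot.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Shows why an early return leaks a manually closed resource but not a try-with-resources one. */
final class ResolverCloseDemo {

    /** Hypothetical stand-in for a ResourceResolver; it only tracks how many are open. */
    static final class FakeResolver implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        FakeResolver() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Leaky: the early return skips close(). */
    static String leaky(String path) {
        FakeResolver resolver = new FakeResolver();
        if (path == null) {
            return "no path";              // resolver is never closed on this branch
        }
        resolver.close();
        return "ok";
    }

    /** Safe: the resolver is closed on every exit, including early returns and exceptions. */
    static String safe(String path) {
        try (FakeResolver resolver = new FakeResolver()) {
            if (path == null) {
                return "no path";
            }
            return "ok";
        }
    }

    public static void main(String[] args) {
        safe(null);
        System.out.println("open after safe:  " + FakeResolver.OPEN.get()); // prints 0
        leaky(null);
        System.out.println("open after leaky: " + FakeResolver.OPEN.get()); // prints 1
    }
}
```

The same shape applies to a real resolver: if the open call is the resource expression of a try-with-resources block, no branch can leave it open.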

First-response triage

When an alarm fires, a quick decision tree gets you to the right tool fast:

Symptom	Reach for
Requests hang / high CPU	Thread dumps (3–5, seconds apart)
Heap climbing / OOM	GC logs → heap dump (MAT)
One page slow, others fine	request.log + check the query (EXPLAIN)
Site slow under load	Dispatcher hit ratio + cache analysis
OOM after days of uptime	Resource resolver leak (check `error.log` warnings)
"It just broke"	error.log first, always

Cheat sheet

Tool	Capture	Analyze with
Thread dump	`jstack <pid>` / `kill -3` / `/system/console/status-Threads`	fastThread.io
Heap dump	`jmap -dump` / `HeapDumpOnOutOfMemoryError`	Eclipse MAT
GC log	`-Xlog:gc*`	GCeasy / GCViewer
Slow query	`Traversed N nodes` warning, `EXPLAIN`, Oak Query Stats (JMX)	add index
Cache miss	`Age` / `X-Cache` / docroot file	Dispatcher rules
Slow request	`request.log` durations	grep
Resolver leak	`"has not been closed"` warning, JMX session count	close it (try-with-resources)

Best practices

✅ Enable GC logging and HeapDumpOnOutOfMemoryError in every environment, proactively.
✅ Take multiple thread dumps seconds apart, never just one.
✅ Always close resource resolvers with try-with-resources; never use admin resolvers.
✅ Index every production query and keep the Dispatcher hit ratio high.
✅ Use narrowly-scoped, temporary DEBUG logging; remove it after.
✅ Localize the layer with evidence before changing anything.

Do's and Don'ts

Do

✅ Start with error.log and request.log — they answer most questions.
✅ Correlate GC-log trends with a heap dump to confirm a leak.
✅ Read the "has not been closed" stack trace to find the leaking code.

Don't

❌ Don't capture heap dumps casually on production at peak — they pause the JVM.
❌ Don't leave broad DEBUG logging on a busy instance.
❌ Don't fix by guessing — capture the diagnostic first.
❌ Don't ship code that opens a resolver/session without a guaranteed close.
❌ Don't run unindexed queries; they traverse and degrade the whole instance.

Wrapping up

Performance troubleshooting in AEM is methodical, not magical: thread dumps for hangs, heap dumps + GC logs for memory, the query/EXPLAIN workflow for slow content, Dispatcher hit-ratio analysis for load, logs for everything else — and a hard habit of closing resource resolvers to avoid the most common leak of all. Capture the right evidence, localize the layer, and the fix follows. Build these reflexes and you'll resolve in minutes what otherwise becomes an outage.

Go deeper where you need it: the JCR & Oak guide for query optimization, the Dispatcher guide for cache troubleshooting, and the Sling guide for the resource resolver lifecycle.