String Immutability: From Memory Layout to JVM Internals
Java's String is designed to be Immutable. This is not a casual design choice; it is a fundamental architectural decision that influences the String Pool, hash caching, thread safety, and the security of the JVM. Understanding immutability is the prerequisite for mastering Java's memory and performance optimization.
1. The Source Code Implementation
How does Java enforce immutability at the code level?
In JDK 8:
public final class String implements java.io.Serializable, Comparable<String> {
/** The character array used for storage */
private final char value[];
/** Cache the hash code for the string */
private int hash; // Default to 0
}
The Three Layers of Protection:
- final class: Prevents subclasses from overriding methods to break immutability.
- private field: External code cannot modify the underlying array directly.
- final array reference: The value variable cannot be pointed to another array.
Note: While final on an array prevents the reference from changing, it does not prevent the array's elements from being modified. String's immutability actually holds because the class never exposes the value array and provides no methods that modify it. Every "modification" (like substring() or replace()) returns a String holding the new content and leaves the original untouched.
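A minimal sketch of that behavior, using only public String methods. The class name ImmutabilityDemo and the helper shout are illustrative, not part of the JDK.

```java
import java.util.Locale;

class ImmutabilityDemo {
    /** Returns an upper-cased copy; the argument itself is never altered. */
    static String shout(String input) {
        return input.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String original = "hello";
        String shouted = shout(original);
        String replaced = original.replace('l', 'L');

        System.out.println(original);            // hello (unchanged)
        System.out.println(shouted);             // HELLO
        System.out.println(replaced);            // heLLo
        System.out.println(original == shouted); // false: a different object
    }
}
```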
In JDK 9+: Compact Strings (JEP 254)
JDK 9 introduced a major refactoring to reduce string memory footprint.
public final class String {
private final byte[] value; // Changed from char[] to byte[]
private final byte coder; // 0 for Latin-1, 1 for UTF-16
}
In JDK 8, every character occupied 2 bytes (char). However, Oracle found that most strings in typical apps consist solely of Latin-1 characters (ASCII + Western European chars), where the high byte is always 0x00.
The Optimization: If a string only contains Latin-1 characters, it is stored using 1 byte per character (coder = 0). If it contains characters like Chinese or Emojis, it falls back to 2 bytes per character (coder = 1) using UTF-16 encoding. This can reduce heap usage by up to 50% for English-heavy applications.
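The selection rule can be modeled in a few lines. This is a rough sketch of the idea, not the JDK's actual code: chooseCoder and payloadBytes are hypothetical helpers, and the byte counts cover only the character data, ignoring object and array headers.

```java
class CompactStringSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    /** Latin-1 only if every UTF-16 code unit fits in one byte (<= 0xFF). */
    static byte chooseCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    /** Bytes needed for the character data alone under the chosen coder. */
    static int payloadBytes(String s) {
        return s.length() << chooseCoder(s); // x1 for Latin-1, x2 for UTF-16
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));           // 5
        System.out.println(payloadBytes("h\u00e9llo"));      // 5 (U+00E9 is Latin-1)
        System.out.println(payloadBytes("\u4f60\u597d"));    // 4 (two CJK characters)
        System.out.println(payloadBytes("hi\uD83D\uDE00"));  // 8 (four code units)
    }
}
```

Note that a single non-Latin-1 character switches the whole string to 2 bytes per code unit; the coder applies per string, not per character.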
2. A Deep Dive into Character Encoding
To understand how Compact Strings work, we must understand the encoding standards they rely on.
2.1 The Encoding Spectrum
- Latin-1 (ISO-8859-1): An 8-bit encoding (0-255). It is a superset of ASCII.
- Unicode: A universal map of "Code Points" (U+0000 to U+10FFFF).
- UTF-16: The encoding behind Java's char type and String API (and the storage format for non-Latin-1 strings in JDK 9+). Characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) use 2 bytes. Characters outside the BMP (like Emojis) use 4 bytes via Surrogate Pairs.
2.2 Surrogate Pairs (The "Magic" of D800-DFFF)
How does a decoder know a 16-bit value is a single character or half of a pair?
Unicode permanently reserved the range U+D800 to U+DFFF. These code positions never represent actual characters.
- High Surrogates: 0xD800 to 0xDBFF
- Low Surrogates: 0xDC00 to 0xDFFF
When the Java decoder sees a value in the High Surrogate range, it immediately knows to read the next 16 bits as the Low Surrogate to reconstruct the full character code point.
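The reconstruction is plain arithmetic: each surrogate carries 10 bits of the code point, offset by 0x10000. The sketch below spells this out and can be checked against the JDK's own Character.toCodePoint. The class and method names are illustrative.

```java
class SurrogateDecodeSketch {
    /** Combines a high/low surrogate pair into a single code point. */
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // The high surrogate holds the top 10 bits, the low surrogate the bottom 10.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDE00';
        int codePoint = decode(high, low);
        System.out.println(Integer.toHexString(codePoint));               // 1f600
        System.out.println(codePoint == Character.toCodePoint(high, low)); // true
    }
}
```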
Critical Fact: In Java, a char is a "UTF-16 Code Unit," not necessarily a "Character." An Emoji like 😀 has a length() of 2 because it occupies two Code Units. To count actual code points, use codePointCount(0, s.length()).
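A short illustration of the difference between code units and code points, using only standard String methods. The wrapper class and its two helpers are illustrative names.

```java
class CodePointCountDemo {
    /** Number of UTF-16 code units, which is what length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600 (grinning face), 'b'
        System.out.println(codeUnits(text));  // 4
        System.out.println(codePoints(text)); // 3

        // Iterate by code point rather than by char to avoid splitting a pair.
        text.codePoints().forEach(cp -> System.out.println(String.format("U+%04X", cp)));
    }
}
```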
3. Why Immutability?
3.1 String Constant Pool
The JVM maintains a global String Table to share strings. When you use a literal "abc", the JVM checks the pool. If an equal string already exists there, you get a reference to that same object.
Logic: If strings were mutable, one thread changing its "abc" would unintentionally change the "abc" for all other threads sharing that same reference.
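The sharing is observable with reference comparison (==). A small sketch, with illustrative method names:

```java
class StringPoolDemo {
    /** Two identical literals resolve to the same pooled object. */
    static boolean literalsShareIdentity() {
        String a = "abc";
        String b = "abc";
        return a == b;
    }

    /** new String(...) always allocates a fresh object outside the pool. */
    static boolean newStringSharesIdentity() {
        String literal = "abc";
        String fresh = new String("abc");
        return literal == fresh;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareIdentity());   // true
        System.out.println(newStringSharesIdentity()); // false
        System.out.println("abc".equals(new String("abc"))); // true: same content
    }
}
```

This is also why string content should be compared with equals(), not ==.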
3.2 Hash Caching
Strings are overwhelmingly used as keys in HashMap. Due to immutability, the hash is calculated once (lazily) and cached. Subsequent calls to hashCode() return the cached integer in $O(1)$ time, providing a massive boost for collection performance.
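The lazy caching pattern looks roughly like the sketch below. CachedHashKey is a hypothetical class written for illustration, not the JDK's String, and the computations counter exists only to make the caching visible. It assumes the documented String hash formula (31 * h + c over each character).

```java
import java.util.Arrays;

final class CachedHashKey {
    private final char[] chars;
    private int hash;     // 0 means "not computed yet"
    int computations;     // demo-only counter, not part of the pattern

    CachedHashKey(String s) {
        this.chars = s.toCharArray(); // private copy, never modified afterwards
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && chars.length > 0) {
            computations++;
            for (char c : chars) {
                h = 31 * h + c;
            }
            hash = h; // safe to cache only because chars can never change
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedHashKey
                && Arrays.equals(chars, ((CachedHashKey) o).chars);
    }

    public static void main(String[] args) {
        CachedHashKey key = new CachedHashKey("hello");
        System.out.println(key.hashCode() == "hello".hashCode()); // true
        key.hashCode();
        System.out.println(key.computations); // 1: the second call hit the cache
    }
}
```

If the underlying characters could change, the cached value would go stale and the key would be lost inside any HashMap that already holds it.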
3.3 Security: The TOCTOU Attack
Strings are used to handle file paths, network hostnames, and database credentials.
If a String were mutable, an attacker could pass a valid file path (e.g., "/home/user/log") to a security check. Between the check and the actual file open operation, the attacker could modify the string content to "/etc/passwd".
Since it is immutable, the content is "frozen" the moment it passes the security check, making this Time-of-Check to Time-of-Use (TOCTOU) attack impossible.
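To see what immutability rules out, the sketch below replays the attack against a mutable StringBuilder standing in for a hypothetical mutable string. isAllowed and open are made-up placeholders for a real security check and file operation, and the "attacker" step runs inline rather than on another thread to keep the example deterministic.

```java
class ToctouSketch {
    /** Hypothetical security check: only paths under /home/user/ are allowed. */
    static boolean isAllowed(CharSequence path) {
        return path.toString().startsWith("/home/user/");
    }

    /** Hypothetical "use" step; a real system would open the file here. */
    static String open(CharSequence path) {
        return "opened " + path;
    }

    public static void main(String[] args) {
        // Mutable path: the content changes between check and use.
        StringBuilder mutablePath = new StringBuilder("/home/user/log");
        boolean checked = isAllowed(mutablePath);       // time of check: passes
        mutablePath.setLength(0);
        mutablePath.append("/etc/passwd");              // attacker rewrites the buffer
        System.out.println(checked + " -> " + open(mutablePath)); // true -> opened /etc/passwd

        // Immutable path: there is no way to alter the checked object.
        String safePath = "/home/user/log";
        if (isAllowed(safePath)) {
            System.out.println(open(safePath));         // opened /home/user/log
        }
    }
}
```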
4. JVM Architecture: The Three Layers of Constant Pools
Understanding where strings "live" requires distinguishing three layers:
- Class File Constant Pool: Exists in the .class file. Stores literals as descriptive metadata.
- Runtime Constant Pool: The in-memory representation created in Metaspace when a class is loaded.
- String Table (String Pool): A global hash table, implemented in C++ inside HotSpot, that stores references to unique String objects living on the heap.
The ldc Instruction (Lazy Resolution)
A string literal isn't "interned" as soon as the class is loaded. It happens the first time the code executes the ldc bytecode instruction for that literal:
- JVM checks the String Table.
- If not found, it creates the String object in the heap.
- It puts the reference into the String Table and the Runtime Constant Pool.
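The lookup steps above can be modeled with an ordinary map. This is a toy model for intuition only: ToyStringTable and resolve are hypothetical names, and the real String Table is a native structure inside the JVM, not a Java collection.

```java
import java.util.concurrent.ConcurrentHashMap;

class ToyStringTable {
    // Maps string content to the one canonical instance for that content.
    private final ConcurrentHashMap<String, String> table = new ConcurrentHashMap<>();

    /** Returns the canonical instance, registering the candidate if none exists yet. */
    String resolve(String candidate) {
        String existing = table.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    public static void main(String[] args) {
        ToyStringTable pool = new ToyStringTable();
        String first = new String("abc");
        String second = new String("abc");

        System.out.println(pool.resolve(first) == first);  // true: first one is registered
        System.out.println(pool.resolve(second) == first); // true: later lookups reuse it
    }
}
```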
5. intern() and Memory Optimization
The intern() method ensures that if you have multiple strings with the same content, they all point to the same object in the pool.
- JDK 6: intern() copied the string into the Permanent Generation (PermGen), which had a tiny fixed size, often leading to OOM.
- JDK 7+: The pool was moved to the Main Heap, and intern() now stores a reference to the heap object instead of a full copy, making it much safer.
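A brief demonstration of the identity guarantee intern() provides. The strings are built from char arrays at runtime so that they start out as separate heap objects; the class and method names are illustrative.

```java
class InternDemo {
    /** Builds a fresh, non-pooled String with the content "hello". */
    static String freshHello() {
        return new String(new char[] {'h', 'e', 'l', 'l', 'o'});
    }

    public static void main(String[] args) {
        String a = freshHello();
        String b = freshHello();

        System.out.println(a == b);                   // false: two heap objects
        System.out.println(a.equals(b));              // true: same content
        System.out.println(a.intern() == b.intern()); // true: one canonical object
        System.out.println(a.intern() == "hello");    // true: same object as the literal
    }
}
```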
G1 String Deduplication
Modern JVMs using the G1 collector can perform automatic "String Deduplication" (-XX:+UseStringDeduplication). In the background, the collector finds String objects whose backing arrays hold identical content and makes them share a single array. This saves memory without changing object identity: the String objects stay distinct, and only their internal array references change.
6. StringBuilder vs. StringBuffer
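When concatenating strings in a loop, String is inefficient: each += allocates a new String and copies everything accumulated so far, so N concatenations create $O(N)$ temporary objects and copy $O(N^2)$ characters in total.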
- StringBuilder: Best for single-threaded usage. Fast, uses a resizable internal array.
- StringBuffer: Thread-safe version of StringBuilder. Methods are synchronized.
- Expansion Logic: When the internal array is full, its capacity grows to $(OldCapacity \times 2) + 2$, or to the required size if that is still too small.
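A small comparison of the two approaches, plus a look at capacity growth. The method names are illustrative. The default capacity of 16 is documented for StringBuilder; the exact growth step is an implementation detail, so treat the second capacity value as what the rule above predicts rather than a guarantee.

```java
class ConcatDemo {
    /** Appends into one resizable buffer: a single final copy in toString(). */
    static String joinWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i).append(',');
        }
        return sb.toString();
    }

    /** Allocates a new String on every iteration and recopies the prefix. */
    static String joinWithConcat(int n) {
        String result = "";
        for (int i = 0; i < n; i++) {
            result += i + ",";
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(joinWithBuilder(3)); // 0,1,2,
        System.out.println(joinWithConcat(3));  // 0,1,2,

        StringBuilder sb = new StringBuilder();
        System.out.println(sb.capacity());      // 16 (documented default)
        sb.append("12345678901234567");         // 17 chars, forces one expansion
        System.out.println(sb.capacity());      // (16 * 2) + 2 = 34 under the rule above
    }
}
```

When the final size is roughly known, passing it to the StringBuilder constructor avoids the intermediate expansions entirely.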
7. Breaking Immutability with Reflection
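While the API prevents modification, you can technically use Reflection to access the private value field and change the contents of the array it references. On JDK 16 and later, strong encapsulation blocks this by default unless the JVM is started with --add-opens java.base/java.lang=ALL-UNNAMED.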
Warning: This is extremely dangerous. Modifying a pooled string literal via reflection will change the value of that literal for every other part of the application, leading to unpredictable and catastrophic logical failures.
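For illustration only, the sketch below attempts the modification and reports whether the running JVM allowed it. tryOverwriteFirstChar is a hypothetical helper. It assumes the target holds only Latin-1 characters and that Compact Strings are enabled on JDK 9+, so the first array element corresponds to the first character. Never do this in real code.

```java
import java.lang.reflect.Field;

class ReflectionMutationDemo {
    /** Tries to overwrite the first character in place; returns true if the JVM allowed it. */
    static boolean tryOverwriteFirstChar(String target, char replacement) {
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueField.setAccessible(true); // rejected on JDK 16+ unless java.lang is opened
            Object storage = valueField.get(target);
            if (storage instanceof char[]) {        // JDK 8 layout
                ((char[]) storage)[0] = replacement;
                return true;
            }
            if (storage instanceof byte[]) {        // JDK 9+ layout (Latin-1 assumed)
                ((byte[]) storage)[0] = (byte) replacement;
                return true;
            }
            return false;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false; // access denied: immutability held
        }
    }

    public static void main(String[] args) {
        boolean mutated = tryOverwriteFirstChar("cat", 'b');
        System.out.println("mutation allowed: " + mutated);
        // If the mutation succeeded, this pooled literal now reads "bat" everywhere.
        System.out.println("cat");
    }
}
```

The last line is the point of the warning: the literal in the source says one thing, and the pooled object may now say another.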