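When learning and developing, we often discuss the value ranges of primitive types such as short, int, and long, but we rarely think about the "value range" of String.

So does a String have a length limit?

It does: a String object cannot hold a string of unlimited length. The limit has to be considered from two angles, compile time and run time.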

Compile-time limit

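Readers familiar with the JVM will know that string constants defined in source code are stored in the constant pool of the compiled class file, which is loaded into the runtime constant pool in the method area.

The length of a String constant is limited because the JVM specification constrains the constant pool. Every entry in the constant pool has its own type; CONSTANT_String_info represents a constant object of type String and has the following structure: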

CONSTANT_String_info {
    u1 tag;
    u2 string_index;
}

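Here the value of string_index must be a valid index into the constant pool, and the entry at that index must be a CONSTANT_Utf8_info structure representing the sequence of Unicode code points to which the String object is initialized.

In other words, a string in the constant pool is represented by a CONSTANT_Utf8_info entry, encoded in the class-file variant of UTF-8 (modified UTF-8), with the following structure: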

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

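The bytes array is where the constant data is actually stored, and length is the number of bytes in that array. The type of length is u2, an unsigned two-byte (16-bit) integer, so the theoretical maximum is 2^16-1, meaning the bytes array can hold at most 65535 bytes.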
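The layout above can be tried out with a minimal sketch. It relies on the fact that DataOutputStream.writeUTF and DataInputStream.readUTF use the same shape (a two-byte unsigned length followed by modified UTF-8 bytes); the class name Utf8InfoSketch and its helper methods are made up for illustration and are not part of the JDK or javac.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: writes and reads one CONSTANT_Utf8_info-shaped entry.
class Utf8InfoSketch {
    // Tag value of CONSTANT_Utf8_info in the class-file format.
    static final int CONSTANT_UTF8_TAG = 1;

    /** Serializes text as: u1 tag, u2 length, u1 bytes[length]. */
    static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(CONSTANT_UTF8_TAG);
            out.writeUTF(text); // u2 byte count, then the modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the entry back, checking the tag first. */
    static String decode(byte[] entry) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(entry))) {
            int tag = in.readUnsignedByte();
            if (tag != CONSTANT_UTF8_TAG) {
                throw new IllegalArgumentException("unexpected tag " + tag);
            }
            return in.readUTF(); // reads the u2 length, then decodes that many bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] entry = encode("d\u81EA"); // the letter d plus one 3-byte Chinese character
        System.out.println(entry.length);  // 1 (tag) + 2 (length) + 1 + 3 (bytes)
        System.out.println(decode(entry).equals("d\u81EA"));
    }
}
```

Because the length field is only two bytes wide, no entry of this shape can describe more than 65535 bytes of string data.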

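Yet if you try to compile a string literal of length 65535, compilation still fails, while a literal of length 65534 compiles. This seems to contradict the conclusion above.

This is in fact an extra restriction imposed by the javac compiler. The following method can be found in the javac source code: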

public class Gen extends JCTree.Visitor {
    ...
    /** Check a constant value and report if it is a string that is too large. */
    private void checkStringConstant(DiagnosticPosition pos, Object constValue) {
        if (nerrs != 0 || // only complain about a long string once
            constValue == null ||
            !(constValue instanceof String) ||
            ((String)constValue).length() < PoolWriter.MAX_STRING_LENGTH)
            return;
        log.error(pos, Errors.LimitString);
        nerrs++;
    }
}

There is also one key constant:

public class PoolWriter {

    /** Max number of char in a string constant. */
    public static final int MAX_STRING_LENGTH = 0xFFFF;
    
    ...
}

0xFFFF is 65535 in decimal.

The code shows that when the constant is a String and its length is greater than or equal to 65535, compilation fails.
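The boundary is easy to misread because of the negated condition, so here is a minimal sketch that restates the check on its own. It only mirrors the comparison quoted above; the class name StringConstantCheckSketch and both helper methods are hypothetical and are not javac APIs.

```java
import java.util.Arrays;

// Hypothetical stand-alone restatement of the length comparison in checkStringConstant.
class StringConstantCheckSketch {
    // Same value as PoolWriter.MAX_STRING_LENGTH in the excerpt above.
    static final int MAX_STRING_LENGTH = 0xFFFF;

    /** Returns true when the quoted check would report the constant as too large. */
    static boolean rejectedByLengthCheck(Object constValue) {
        return constValue instanceof String
                && ((String) constValue).length() >= MAX_STRING_LENGTH;
    }

    /** Builds a string of the given length at run time, for testing the boundary. */
    static String letters(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'd');
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(rejectedByLengthCheck(letters(65534))); // false
        System.out.println(rejectedByLengthCheck(letters(65535))); // true
    }
}
```

The check returns early only while the length is strictly less than 0xFFFF, which is why 65534 is the longest literal that passes it.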

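What needs emphasis here is that the constant pool limit is not on the number of characters in the string but on its underlying storage, that is, on the number of encoded bytes. javac's length check applies in addition to it.

For example, string constants in class files are encoded in modified UTF-8. Characters in the ASCII range (other than U+0000) take 1 byte, while most of the Chinese characters in everyday use take 3 bytes. Supplementary characters outside the Basic Multilingual Plane, which include some rare Chinese characters and take 4 bytes in standard UTF-8, are stored in class files as a surrogate pair of two 3-byte sequences, 6 bytes in total.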

// 65534 copies of the letter "d": compiles
String s1 = "dd..d";

// 21845 copies of the Chinese character "自": compiles
String s2 = "自自...自";

// 1 letter "d" followed by 21845 copies of "自": fails to compile
String s3 = "d自自...自";

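For s1, the letter d takes 1 byte in UTF-8, so 65534 letters take 65534 bytes. The length, 65534, is also within javac's limit, so it compiles.

For s2, each Chinese character takes 3 bytes, so 21845 of them take exactly 65535 bytes. The string length is 21845, well under javac's length limit, so it compiles.

For s3, one letter d plus 21845 Chinese characters take 65536 bytes, which exceeds the limit, so compilation fails.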
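The byte counts for s1, s2, and s3 can be checked at run time with a minimal sketch. It uses DataOutputStream.writeUTF, which also writes modified UTF-8 behind a two-byte length and throws UTFDataFormatException when the encoding needs more than 65535 bytes. This is an analogy to the constant pool limit, not javac's own code path; the class name ModifiedUtf8Length and its helpers are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper: measures how many modified UTF-8 bytes a string needs.
class ModifiedUtf8Length {
    /** Builds a string consisting of count copies of c. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    /** Returns the encoded byte count, or -1 if it does not fit a u2 length field. */
    static int encodedLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (UTFDataFormatException tooLong) {
            return -1; // more than 65535 encoded bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2; // subtract the two length bytes
    }

    public static void main(String[] args) {
        String s1 = repeat('d', 65534);
        String s2 = repeat('\u81EA', 21845);       // U+81EA is the character "自"
        String s3 = "d" + repeat('\u81EA', 21845);
        System.out.println(encodedLength(s1)); // 65534
        System.out.println(encodedLength(s2)); // 65535
        System.out.println(encodedLength(s3)); // -1, since 65536 bytes do not fit
    }
}
```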

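Coming back to the beginning: the largest value a u2 can express is 65535, and the error for a string of length 65535 comes from javac's own restriction. If you first compile a string of length 65534 with javac and then manually add one character to the generated class file, you can get a string of length 65535.

In addition, compiling a string longer than 65534 in Eclipse does not report an error, because Eclipse has its own Java compiler. Eclipse uses its own compiler mainly because the JDT core supports incremental compilation: it compiles changes to the code step by step. This is also why Eclipse needs no compile button, since it compiles automatically whenever it detects a change. Oracle's JDK, by contrast, does not support incremental compilation.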

Runtime limit

The runtime limit on String shows up mainly in its constructor:

public final class String implements Serializable, Comparable<String>, CharSequence {
    /**
     * Allocates a new {@code String} that contains characters from a subarray
     * of the character array argument. The {@code offset} argument is the
     * index of the first character of the subarray and the {@code count}
     * argument specifies the length of the subarray. The contents of the
     * subarray are copied; subsequent modification of the character array does
     * not affect the newly created string.
     *
     * @param value  Array that is the source of characters
     * @param offset The initial offset
     * @param count  The length
     * @throws IndexOutOfBoundsException If the {@code offset} and {@code count} arguments index characters outside the bounds of the {@code value} array
     */
    public String(char value[], int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= value.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > value.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        this.value = Arrays.copyOfRange(value, offset, offset + count);
    }
}

count is the length of the resulting string, and it is an int, whose maximum value in Java is 2^31-1. So at run time the maximum length of a String is 2^31-1.
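To make the contrast with the compile-time limit concrete, here is a minimal sketch that builds a string far longer than 65535 characters at run time through the same String(char[], int, int) constructor. The class name RuntimeLengthSketch and the length of 100,000 are arbitrary choices for illustration.

```java
import java.util.Arrays;

// Hypothetical demo: strings created at run time are not bound by the 65535 limit.
class RuntimeLengthSketch {
    /** Builds a string of the requested length from a char array. */
    static String build(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'd');
        // Same constructor as quoted above: offset 0, count = length.
        return new String(chars, 0, length);
    }

    public static void main(String[] args) {
        String big = build(100_000);
        System.out.println(big.length()); // 100000
    }
}
```

Only string constants written into a class file go through the constant pool; a string assembled from an array, a StringBuilder, or I/O is limited by the int length and by available memory.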

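But this too is only a theoretical length; the actual limit depends on the JVM's memory. The largest string would occupy (2^31-1)*16/8/1024/1024/1024 ≈ 4 GB of memory (16 because a Java char is a 16-bit Unicode character, i.e. a u2 of 2 bytes; 8 converts bits to bytes; each 1024 is a unit conversion to KB, MB, and GB).

So in the worst case, a single maximum-length string needs about 4 GB of memory. If the JVM cannot allocate that much, it fails with an error.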
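The arithmetic can be reproduced with a minimal sketch, assuming the Java 8 layout of 2 bytes per char and ignoring object headers. The class name MaxStringMemory is made up for illustration; Runtime.maxMemory() reports the heap ceiling of the JVM the sketch runs in.

```java
// Hypothetical helper: how much char data a maximum-length string would need.
class MaxStringMemory {
    /** Bytes needed for (2^31-1) chars at 2 bytes each. */
    static long maxCharDataBytes() {
        return (long) Integer.MAX_VALUE * Character.BYTES;
    }

    public static void main(String[] args) {
        long needed = maxCharDataBytes();
        double gb = needed / (1024.0 * 1024.0 * 1024.0);
        long maxHeap = Runtime.getRuntime().maxMemory();

        System.out.printf("char data: %d bytes (about %.2f GB)%n", needed, gb);
        System.out.printf("max heap:  %d bytes%n", maxHeap);
        System.out.println(needed <= maxHeap
                ? "the char data could fit in this heap in principle"
                : "the char data cannot fit in this heap");
    }
}
```

The cast to long matters: multiplying two int values would overflow before the result is widened.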

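The source above is from Java 8. Since Java 9, the storage of String has been optimized: the backing store is no longer a char[] but a byte[], which halves the memory used by strings that contain only LATIN1 characters: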

public final class String implements Serializable, Comparable<String>, CharSequence {
    ...
    
    public String(char[] value, int offset, int count) {
        this(value, offset, count, rangeCheck(value, offset, count));
    }
    
    String(char[] value, int off, int len, Void sig) {
        if (len == 0) {
            this.value = "".value;
            this.coder = "".coder;
        } else {
            if (COMPACT_STRINGS) {
                byte[] val = StringUTF16.compress(value, off, len);
                if (val != null) {
                    this.value = val;
                    this.coder = 0;
                    return;
                }
            }
            this.coder = 1;
            this.value = StringUTF16.toBytes(value, off, len);
        }
    }
}
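The idea behind the compact representation can be illustrated with a minimal sketch: if every char fits in one byte (at most 0xFF), the string can be stored with 1 byte per character, otherwise it needs 2. This is a simplified model of the decision, not the JDK's actual implementation; the class name CompactStringSketch and both methods are hypothetical.

```java
// Hypothetical model of the LATIN1 / UTF16 choice made for compact strings.
class CompactStringSketch {
    /** Returns true if every char of s fits in a single byte. */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Estimated size of the backing byte[] contents, ignoring array headers. */
    static long estimatedValueBytes(String s) {
        int bytesPerChar = fitsLatin1(s) ? 1 : 2;
        return (long) s.length() * bytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println(estimatedValueBytes("hello"));   // 5
        System.out.println(estimatedValueBytes("d\u81EA")); // 4, one non-LATIN1 char forces 2 bytes each
    }
}
```

A single character outside LATIN1 switches the whole string to the 2-byte form, so the saving applies per string, not per character.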

Summary

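The length of a String is limited.

Compile-time limit: the UTF-8 encoding of a string constant must not exceed 65535 bytes, and the string's length must not exceed 65534;
Runtime limit: the string's length must not exceed 2^31-1, and the memory it occupies must not exceed what the JVM can provide.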
