Have you ever noticed that a string with just a few characters in Python uses several dozen bytes of memory? We handle a lot of short strings in Python in our team, that works on matching offers of the same products together, and I have been wondering why Python seems to store strings seemingly so inefficiently.
In this article, we will take a brief look at how Python strings are implemented, why they consume so much memory and we will examine certain interesting aspects of Python string.
In the previous blog posts (the first and second), we analysed the implementation of float
and list
types in Python.
We have established that every Python object has a common ancestor, a common basis: the PyObject
struct.
In this article I assume the reader is familiar with character representations and encodings (ASCII, UTF-8 – why variable character length is used) to some extent. Also, I assume a 64-bit *nix system, in any memory calculation.
Python String Structures
In lower-level languages such as C, a „string“ is commonly represented as an array of characters (generally integers; in C, types such as char
, wchar_t
are used) terminated by null character.
Python has three string representations (but the last one, PyUnicodeObject
, is deprecated and it will be removed in Python 3.12 so I’ve omitted it from this article). The first string object, PyASCIIObject
, is for ASCII only strings and the other, PyCompactUnicodeObject
, is for strings that contain at least one non-ASCII character. PyCompactUnicodeObject
actually uses PyASCIIObject
as its base and extends it.
PyObject_HEAD
is a macro for creating PyObject
, which has 16 bytes as we have shown.
The length
is the number of characters in the string and it has 8 bytes. There is also a hash
of the string (it does not have to be calculated) which is 8 bytes.
typedef struct { PyASCIIObject _base; Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the * terminating \0. */ char *utf8; /* UTF-8 representation (null-terminated) */ Py_ssize_t wstr_length; /* Number of code points in wstr, possible * surrogates count as two code points. */ } PyCompactUnicodeObject;
The PyCompactUnicodeObject definition in CPython 3.10
Then there is the struct
with all the bitfields:
interned
says whether the string is interned, not interned or interned and immortal- Interning is a memory-saving optimisation detail. It also improves the performance
of some operations, such as equality comparison (if immutable objects are equal, then their contents are also equal). - By interning a string (or any object for that matter), Python ensures that only one instance of that object is stored in memory, i. e. if
'hello'
was interned, then
for any two stringsa = 'hello'
andb = 'hello'
anywhere in the programa is b
would beTrue
- All function, class, and variable names are interned, and so are the keys of dictionaries, for example. Also, any string can be explicitly interned.
- An interned string marked as immortal is never deleted. However, this feature has been deprecated since Python 3.10 and scheduled for removal in Python 3.12 (and it seems it is not used in practice).
- A side note for readers interested in the immortality of objects: There is a PEP proposal on making some objects (e. g.
None
) truly immortal and immutable, including their reference count. This has several interesting consequences, such as the avoidance of cache invalidation after each reference count change.
- A side note for readers interested in the immortality of objects: There is a PEP proposal on making some objects (e. g.
- Interning is a memory-saving optimisation detail. It also improves the performance
kind
either 1, 2, or 4 byte unicode symbol (ASCII has 1 byte)compact
(legacy bit) for the two described objects is always set to 1ascii
says whether the string is composed of ASCII only charactersready
(legacy bit) for the two described objects is always set to 1- And a 24-bit padding to align the structure to 4 bytes.
Finally, there is a pointer wstr
that will be referred to later.
The string itself is stored just after this structure. Again, this will be explained later.
typedef struct { PyASCIIObject _base; Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the * terminating \0. */ char *utf8; /* UTF-8 representation (null-terminated) */ Py_ssize_t wstr_length; /* Number of code points in wstr, possible * surrogates count as two code points. */ } PyCompactUnicodeObject;
The PyCompactUnicodeObject definition in CPython 3.10
Strings are commonly created through the PyUnicode_New
C function. If they contain a character with a code point higher than 127, the PyUnicodeCompactObject
structure is used, otherwise PyASCIIObject
. There are other ways to create string, and some result in the legacy PyUnicodeObject
.
As mentioned before, the actual string is stored directly after this structure, and it’s not referenced from it. Also, the stored string is none of UTF-{8,16,32} encodings, but only the fixed-size UCS code points (=numbers, IDs,.) are stored. The reason for that is that fixed-size characters are necessary so that, for example, indexing works in constant time. It is generally is much easier to work
with strings where each character takes up the same amount of space. UTF only encodes these UCS code points to variable-sized characters, thus (usually) saving a lot of space to the detriment of some operations.
At the beginning, utf8
is NULL
and utf8_length
is zero. utf-8
is basically a cache for
the UTF-8 representation of the string. This representation is created on-demand, when a string is about to be written to a file (in UTF-8), for example, and it is cached for further usage. utf8_length
is the number of bytes in the UTF-8 encoded version of the string, excluding the null at the end.
The wstr
pointer and wstr_length
are legacy and they are scheduled to be removed in Python 3.12, so I won’t describe them here. An interested reader can refer to string structs definitions and the PyObject_New
function for further information.
Immutability
Strings in Python are immutable as described in the documentation. Immutable means that once the object is created it cannot be changed. There can be many references to such an object and all of them can assume the object will never change. Also, the immutability of an object is a necessary (but not a sufficient) condition for being a key in a dictionary.
However, sometimes strings are actually mutable. See the following example.
counter = 0 string1 = '' for i in range(1_000_000): old_id = id(string1) string1 += str(i) if id(string1) != old_id: counter += 1 print(counter) # returns 47
Example of string mutability
Edit: However, sometimes strings are actually mutable. I have been corrected by @gjmos that strings are always immutable. In this example, the actual string object remains unchanged and on the same memory address, with concatenated string appended at the end, but as @gjmos pointed out, it’s actually a new object, and we should not work with IDs of objects that have died because new object may be allocated at the same memory address, thus having the same ID (in CPython).
After most of the concatenations, the new object occupies the same space as the old object and reuses the data of the old object, appending a string at the end. It may happen only if there is only one reference to the object (and there is free memory space following the object). This is optimisation detail of CPython, making concatenation of strings much faster (in some cases)
There is a blog post about it. Or see the implementation of [unicode_resize]
, [resize_inplace]
functions in CPython.
Memory Consumption
Let us see how much memory strings use. If we create a new empty string, it uses PyASCIIObject
and totals 49 bytes:
import sys print(sys.getsizeof('')) # prints 49
Why 49 bytes? Let’s count
PyObject
uses 16 byteslength
,hash
,wstr*
use 8 bytes each, so 24 bytes totalstate
uses 4 bytes- the actual string uses 1 byte, as it is null-terminated (
\0
)
That makes it 45 bytes. However, the structure needs to be aligned to 8 bytes, so it is ’padded’ using 4 more bytes (the structure then takes 48 bytes).
If we add one ASCII character
print(sys.getsizeof('a')) # prints 50
it uses 50 bytes.
Now what if we use non-ASCII characters like ž
. ž
has code point 382
, so at least 2 bytes are necessary to store this character. And it will use the PyCompactUnicodeObject
:
print(sys.getsizeof('ž')) # prints 76
These strings are getting quite big 🙂. Let’s count again:
- The
PyASCIIObject
uses 48 bytes utf_length
,utf8*
, andwstr_length
use 8 bytes each, so 24B total- 4 bytes for the actual string (two characters, including
\0
, each uses 2 bytes)
So that is 76 bytes in total.
Now let’s add a long ASCII text, 10000 characters:
text = 'a' * 10**4 print(sys.getsizeof(text)) # prints 10049
The size of this is expected to be 10001 bytes for the string, and 48 bytes for the structure.
Let’s add one non-ASCII character to the text, a big 4-byte one:
print(sys.getsizeof(text+🖕)) # prints 40080
The size of the string is about 4 times larger because now each character uses 4 bytes (the size of the largest character). The actual text (10002 characters) uses 40008 bytes, and the PyCompactUnicode
structure uses 72 bytes as we have seen earlier.
Conclusion
In this blog post, we analyzed the structures that are used to store strings in CPython and compared them with actual Python code.
We have also shown that Python strings are not completely immutable.
Note that the above holds specifically for the reference Python implementation, CPython, and it might (and likely does) differ in other Python implementations.