Storing strings in computer words is an old practice, here this optimization technique is modernized for unicode and combined with string interning. The least significant bit of pointers (or hash) is used to indicate if a string is embedded or not - even values are interpreted as embedded strings. By storing short strings directly we avoid cache and lock contention involved in hash-table and RAM overhead..
Calling it a hash might seem like a bit of a misnomer, but the same optimization can be used for with larger hashes in content addressed storage systems like git/IPFS, as well as in parsers with a limited vocabulary so the resulting behavior is that of a perfect hash.
The embedded data is stored in either UTF-8 or transcoded to UTF5+, a 5bit variable length and dynamic window unicode coding scheme.
The benefits of embedding strings in pointers is dataset dependent - but note that in the english language the average word length is 5. If all data fits in caches the added computational overhead might only slightly reduce cache contention. As I understand it microcontrollers have no L1/L2 cache, but there can still be benefits from RAM savings
A series of subvariants have been specified as part of parameterizing the benchmarks, and definining the encoding:
squoze64-utf8 achieves 7x speedup over murmurhash one-at-a-time used for initial string interning and 3x speedup for subsequent lookups of the same string when the strings are shorther than 8bytes of utf8, see the benchmarks for details.
the squoze-64 encoding is a small bitmanipulation of UTF-8 in-memory encoding, for strings that will fit only the first byte is manipulated and only if it ascii and not vertical-tab (ASCII 11) the value is doubled and 1 is added. When the first byte does not match our constraints we store 23 - which means the following 7bytes directly encode the value.
The following example code is an illustration based on the benchmarked code.
/*Copyright (c) 2021-2022 Øyvind Kolås <pippin@gimp.org>
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED \AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFOR-
MANCE OF THIS SOFTWARE.*/
void *squoze64(const char *utf8, size_t len)
{
size_t squoze_dim = 64;
uint64_t hash = 0;
uint8_t *encoded = (uint8_t*)&hash;
uint8_t first_byte = ((uint8_t*)utf8)[0];
if (first_byte < 128
&& first_byte != 11
&& (len <= (squoze_dim/8)))
{
for (int i = 0; utf8[i]; i++)
encoded[i] = utf8[i];
encoded[0] = encoded[0] * 2 + 1;
return (void*)*((uint64_t*)&encoded[0]);
}
else if (len <= (squoze_dim/8)-1)
{
for (int i = 0; utf8[i]; i++)
encoded[i+1] = utf8[i];
encoded[0] = 23;
return (void*)*((uint64_t*)&encoded[0]);
}
return intern_string(utf8); // fall back to
// regular interning
// and rely on an aligned
// pointer for the interned
// string
}
// caller provides a temporary buffer of at least 9 bytes
const char *squoze64_decode (void *squozed, uint8_t *buf)
{
uint64_t bits = (uint64_t)squozed;
if ((bits & 1) == 0)
return (char*)squozed;
buf[8] = 0;
((uint64_t*)buf)[0] = bits;
if (buf[0] == 23)
return buf + 1;
buf[0] /= 2;
return buf;
}
These encodings use UTF5+ to encode strings of up to 6, 10 or 12 characters.
The implementation benchmarked is an open adressing hashtable storing heap allocated chunks containing both the hash, length and raw byte data for the UTF-8 string. The benchmarking code can be downloaded here: squoze.tar.xz it is a small C project that should build on a 64bit linux system, with /usr/share/dict/words available.
The benchmark is goes through all the words in /usr/share/dict/words and respectively creating them for the first time, a second time, and finally having an array of IDs get a copy of the string. The benchmark selects the quickest runtimes out of 10 (or more) runs to average out effects of caches and other tasks on the system. For the shorter strings, the whole datastructure fitting in L2 cache and there being no concurrent cache contention has impact on the results.
The create column is the microseconds taken on average to intern a word, lookup is the time taken the second and subsequent times a string is referenced. For comparisons the handle/pointer of the interned string would normally be used and be the same for all cases, decode is the time taken for getting a copy of the interned string, for interned strings all we need to do is dereference a pointer, in most uses decoding of strings is not where the majority of time is spent. cat is the time taken to concatenate "s" at the end of each word, getting some matches due to pluralisation. The embed% column shows how many of the words got embedded instead of interned. This can also be read as the percentage of possible cache-misses that are avoided for the workload.The RAM use column shows the amount of bytes used by the allocations for interned strings as well as the size taken by the hash table used, without the size taken by tempty slots in the hash-table to be comparable with what a more compact optimizing structure used when targeting memory constrained systems.
The first line in each set of benchmarks is a baseline string-interning implementation using the same infrastructure as the others, but without embedding.
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.080 | 0.017 | 0.006 | 0.044 | 0% | 106849 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.009 | 0.004 | 0.009 | 0.017 | 100% | 40 |
squoze32-utf8 | 0.009 | 0.004 | 0.009 | 0.017 | 100% | 40 |
squoze32-utf5 | 0.063 | 0.032 | 0.025 | 0.081 | 99% | 740 |
squoze52-utf5 | 0.037 | 0.032 | 0.024 | 0.067 | 100% | 40 |
squoze62-utf5 | 0.036 | 0.032 | 0.023 | 0.067 | 100% | 40 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.063 | 0.021 | 0.009 | 0.054 | 0% | 341440 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.012 | 0.006 | 0.012 | 0.020 | 100% | 40 |
squoze32-utf8 | 0.012 | 0.006 | 0.012 | 0.065 | 100% | 40 |
squoze32-utf5 | 0.059 | 0.039 | 0.027 | 0.083 | 93% | 23294 |
squoze52-utf5 | 0.045 | 0.038 | 0.027 | 0.076 | 100% | 40 |
squoze62-utf5 | 0.044 | 0.038 | 0.026 | 0.074 | 100% | 40 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.089 | 0.038 | 0.008 | 0.123 | 0% | 1755254 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.017 | 0.005 | 0.010 | 0.021 | 100% | 40 |
squoze32-utf8 | 0.067 | 0.031 | 0.010 | 0.114 | 21% | 1413854 |
squoze32-utf5 | 0.060 | 0.048 | 0.025 | 0.112 | 70% | 537576 |
squoze52-utf5 | 0.055 | 0.048 | 0.031 | 0.094 | 100% | 2948 |
squoze62-utf5 | 0.055 | 0.047 | 0.029 | 0.088 | 100% | 102 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.116 | 0.067 | 0.013 | 0.179 | 0% | 4530961 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.019 | 0.006 | 0.013 | 0.051 | 100% | 40 |
squoze32-utf8 | 0.110 | 0.060 | 0.013 | 0.176 | 9% | 4189561 |
squoze32-utf5 | 0.102 | 0.070 | 0.022 | 0.159 | 29% | 3313283 |
squoze52-utf5 | 0.065 | 0.057 | 0.033 | 0.102 | 99% | 47597 |
squoze62-utf5 | 0.063 | 0.056 | 0.033 | 0.102 | 100% | 1548 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.143 | 0.088 | 0.014 | 0.582 | 0% | 7603232 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.060 | 0.031 | 0.015 | 0.117 | 62% | 4000911 |
squoze32-utf8 | 0.133 | 0.078 | 0.014 | 0.576 | 5% | 7261832 |
squoze32-utf5 | 0.127 | 0.088 | 0.019 | 0.219 | 18% | 6385554 |
squoze52-utf5 | 0.076 | 0.064 | 0.034 | 0.117 | 92% | 861018 |
squoze62-utf5 | 0.073 | 0.069 | 0.041 | 0.120 | 99% | 62537 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.162 | 0.103 | 0.013 | 0.503 | 0% | 9990936 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.083 | 0.045 | 0.015 | 0.166 | 49% | 7060095 |
squoze32-utf8 | 0.150 | 0.091 | 0.013 | 0.506 | 4% | 9649536 |
squoze32-utf5 | 0.140 | 0.095 | 0.017 | 0.504 | 14% | 8773258 |
squoze52-utf5 | 0.087 | 0.070 | 0.032 | 0.141 | 72% | 3920202 |
squoze62-utf5 | 0.080 | 0.069 | 0.036 | 0.123 | 94% | 805463 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 0.171 | 0.110 | 0.013 | 0.464 | 0% | 12326290 |
---|---|---|---|---|---|---|
squoze64-utf8 | 0.107 | 0.067 | 0.016 | 0.461 | 41% | 9990473 |
squoze32-utf8 | 0.166 | 0.091 | 0.014 | 0.467 | 4% | 11984890 |
squoze32-utf5 | 0.157 | 0.108 | 0.017 | 0.471 | 12% | 11108612 |
squoze52-utf5 | 0.106 | 0.081 | 0.031 | 0.174 | 61% | 6850580 |
squoze62-utf5 | 0.088 | 0.074 | 0.034 | 0.135 | 79% | 3735841 |
As large words do not embed in 32bit, but 4-6 characters still covers many of the most common words in english, as well as many common keyword tokens in programming.
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 2.324 | 0.274 | 0.054 | 0.800 | 0% | 85333 |
---|---|---|---|---|---|---|
squoze32-utf8 | 0.072 | 0.042 | 0.039 | 0.200 | 100% | 24 |
squoze32-utf5 | 0.335 | 0.256 | 0.190 | 2.137 | 99% | 584 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 1.298 | 0.290 | 0.060 | 0.805 | 0% | 275240 |
---|---|---|---|---|---|---|
squoze32-utf8 | 0.078 | 0.045 | 0.043 | 1.157 | 100% | 24 |
squoze32-utf5 | 0.697 | 0.311 | 0.212 | 1.106 | 93% | 18842 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 1.265 | 0.356 | 0.068 | 0.940 | 0% | 3762441 |
---|---|---|---|---|---|---|
squoze32-utf8 | 1.175 | 0.342 | 0.069 | 0.930 | 9% | 3487225 |
squoze32-utf5 | 1.014 | 0.378 | 0.125 | 0.967 | 29% | 2765311 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 1.318 | 0.387 | 0.071 | 1.918 | 0% | 6370412 |
---|---|---|---|---|---|---|
squoze32-utf8 | 1.263 | 0.381 | 0.071 | 1.926 | 5% | 6095196 |
squoze32-utf5 | 1.142 | 0.395 | 0.106 | 1.044 | 18% | 5373282 |
create | lookup | decode | cat | embed% | RAM use | |
murmurhash | 1.361 | 0.415 | 0.073 | 1.701 | 0% | 8422376 |
---|---|---|---|---|---|---|
squoze32-utf8 | 1.320 | 0.412 | 0.073 | 1.711 | 4% | 8147160 |
squoze32-utf5 | 1.216 | 0.415 | 0.102 | 1.760 | 14% | 7425246 |
This work has been done with funding from supports of the account pippin on Patreon and liberapay. To join in to fund similar and unrelated independent research and development.