squoze - unicode encodings for pointers and hashes

compute more, save energy, parse faster;
with strings squozed to fit in computer words

Storing strings in computer words is an old practice, here this optimization technique is modernized for unicode and combined with string interning. The least significant bit of pointers (or hash) is used to indicate if a string is embedded or not - even values are interpreted as embedded strings. By storing short strings directly we avoid cache and lock contention involved in hash-table and RAM overhead..

Calling it a hash might seem like a bit of a misnomer, but the same optimization can be used for with larger hashes in content addressed storage systems like git/IPFS, as well as in parsers with a limited vocabulary so the resulting behavior is that of a perfect hash.

The embedded data is stored in either UTF-8 or transcoded to UTF5+, a 5bit variable length and dynamic window unicode coding scheme.

The benefits of embedding strings in pointers is dataset dependent - but note that in the english language the average word length is 5. If all data fits in caches the added computational overhead might only slightly reduce cache contention. As I understand it microcontrollers have no L1/L2 cache, but there can still be benefits from RAM savings

A series of subvariants have been specified as part of parameterizing the benchmarks, and definining the encoding:

squoze64-utf8 achieves 7x speedup over murmurhash one-at-a-time used for initial string interning and 3x speedup for subsequent lookups of the same string when the strings are shorther than 8bytes of utf8, see the benchmarks for details.

squoze64-utf8: UTF-8 embed encoding, supporting up 8 UTF-8 bytes of embedded data.
squoze32-utf8: UTF-8 embed encoding, supporting up 4 UTF-8 bytes of embedded data
squoze32-utf5: UTF5+ embed encoding, supporting up to 6 lower-case ascii of embedded data
squoze52-utf5: UTF5+ embed encoding, supporting up to 10 lower-case ascii of embedded data, 52bit integers can be stored without loss in a double.
squoze62-utf5: UTF5+ embed encoding, supporting up to 12 unicode code points.

squoze64 implementation

the squoze-64 encoding is a small bitmanipulation of UTF-8 in-memory encoding, for strings that will fit only the first byte is manipulated and only if it ascii and not vertical-tab (ASCII 11) the value is doubled and 1 is added. When the first byte does not match our constraints we store 23 - which means the following 7bytes directly encode the value.

The following example code is an illustration based on the benchmarked code.

/*Copyright (c) 2021-2022 Øyvind Kolås <pippin@gimp.org>

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above 
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED \AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFOR-
MANCE OF THIS SOFTWARE.*/

void *squoze64(const char *utf8, size_t len)
{
  size_t   squoze_dim = 64;
  uint64_t hash       = 0;
  uint8_t *encoded    = (uint8_t*)&hash;
  uint8_t  first_byte = ((uint8_t*)utf8)[0];

  if (first_byte < 128
      && first_byte != 11
      && (len <= (squoze_dim/8)))
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i] = utf8[i];
    encoded[0] = encoded[0] * 2 + 1;
    return (void*)*((uint64_t*)&encoded[0]);
  }
  else if (len <= (squoze_dim/8)-1)
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i+1] = utf8[i];
    encoded[0] = 23;
    return (void*)*((uint64_t*)&encoded[0]);
  }

  return intern_string(utf8); // fall back to
                              // regular interning
                              // and rely on an aligned
                              // pointer for the interned
                              // string
}

// caller provides a temporary buffer of at least 9 bytes 
const char *squoze64_decode (void *squozed, uint8_t *buf)
{
  uint64_t bits = (uint64_t)squozed;
  if ((bits & 1) == 0)
     return (char*)squozed;

  buf[8] = 0;
  ((uint64_t*)buf)[0] = bits;
  if (buf[0] == 23)
    return buf + 1;
  buf[0] /= 2;
  return buf;
}

squoze32, squoze52 and squoze62 implementation

These encodings use UTF5+ to encode strings of up to 6, 10 or 12 characters.

benchmarks

The implementation benchmarked is an open adressing hashtable storing heap allocated chunks containing both the hash, length and raw byte data for the UTF-8 string. The benchmarking code can be downloaded here: squoze.tar.xz it is a small C project that should build on a 64bit linux system, with /usr/share/dict/words available.

The benchmark is goes through all the words in /usr/share/dict/words and respectively creating them for the first time, a second time, and finally having an array of IDs get a copy of the string. The benchmark selects the quickest runtimes out of 10 (or more) runs to average out effects of caches and other tasks on the system. For the shorter strings, the whole datastructure fitting in L2 cache and there being no concurrent cache contention has impact on the results.

The create column is the microseconds taken on average to intern a word, lookup is the time taken the second and subsequent times a string is referenced. For comparisons the handle/pointer of the interned string would normally be used and be the same for all cases, decode is the time taken for getting a copy of the interned string, for interned strings all we need to do is dereference a pointer, in most uses decoding of strings is not where the majority of time is spent. cat is the time taken to concatenate "s" at the end of each word, getting some matches due to pluralisation. The embed% column shows how many of the words got embedded instead of interned. This can also be read as the percentage of possible cache-misses that are avoided for the workload.The RAM use column shows the amount of bytes used by the allocations for interned strings as well as the size taken by the hash table used, without the size taken by tempty slots in the hash-table to be comparable with what a more compact optimizing structure used when targeting memory constrained systems.

The first line in each set of benchmarks is a baseline string-interning implementation using the same infrastructure as the others, but without embedding.

5375 words, max-word-len: 3

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.080	0.017	0.006	0.044	0%	106849
squoze64-utf8	0.009	0.004	0.009	0.017	100%	40
squoze32-utf8	0.009	0.004	0.009	0.017	100%	40
squoze32-utf5	0.063	0.032	0.025	0.081	99%	740
squoze52-utf5	0.037	0.032	0.024	0.067	100%	40
squoze62-utf5	0.036	0.032	0.023	0.067	100%	40

16546 words, max-word-len: 4

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.063	0.021	0.009	0.054	0%	341440
squoze64-utf8	0.012	0.006	0.012	0.020	100%	40
squoze32-utf8	0.012	0.006	0.012	0.065	100%	40
squoze32-utf5	0.059	0.039	0.027	0.083	93%	23294
squoze52-utf5	0.045	0.038	0.027	0.076	100%	40
squoze62-utf5	0.044	0.038	0.026	0.074	100%	40

79014 words, max-word-len: 6

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.089	0.038	0.008	0.123	0%	1755254
squoze64-utf8	0.017	0.005	0.010	0.021	100%	40
squoze32-utf8	0.067	0.031	0.010	0.114	21%	1413854
squoze32-utf5	0.060	0.048	0.025	0.112	70%	537576
squoze52-utf5	0.055	0.048	0.031	0.094	100%	2948
squoze62-utf5	0.055	0.047	0.029	0.088	100%	102

192126 words, max-word-len: 8

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.116	0.067	0.013	0.179	0%	4530961
squoze64-utf8	0.019	0.006	0.013	0.051	100%	40
squoze32-utf8	0.110	0.060	0.013	0.176	9%	4189561
squoze32-utf5	0.102	0.070	0.022	0.159	29%	3313283
squoze52-utf5	0.065	0.057	0.033	0.102	99%	47597
squoze62-utf5	0.063	0.056	0.033	0.102	100%	1548

308201 words, max-word-len: 10

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.143	0.088	0.014	0.582	0%	7603232
squoze64-utf8	0.060	0.031	0.015	0.117	62%	4000911
squoze32-utf8	0.133	0.078	0.014	0.576	5%	7261832
squoze32-utf5	0.127	0.088	0.019	0.219	18%	6385554
squoze52-utf5	0.076	0.064	0.034	0.117	92%	861018
squoze62-utf5	0.073	0.069	0.041	0.120	99%	62537

392136 words, max-word-len: 12

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.162	0.103	0.013	0.503	0%	9990936
squoze64-utf8	0.083	0.045	0.015	0.166	49%	7060095
squoze32-utf8	0.150	0.091	0.013	0.506	4%	9649536
squoze32-utf5	0.140	0.095	0.017	0.504	14%	8773258
squoze52-utf5	0.087	0.070	0.032	0.141	72%	3920202
squoze62-utf5	0.080	0.069	0.036	0.123	94%	805463

466514 words, max-word-len: 23

	create	lookup	decode	cat	embed%	RAM use
murmurhash	0.171	0.110	0.013	0.464	0%	12326290
squoze64-utf8	0.107	0.067	0.016	0.461	41%	9990473
squoze32-utf8	0.166	0.091	0.014	0.467	4%	11984890
squoze32-utf5	0.157	0.108	0.017	0.471	12%	11108612
squoze52-utf5	0.106	0.081	0.031	0.174	61%	6850580
squoze62-utf5	0.088	0.074	0.034	0.135	79%	3735841

32bit ARM benchmarks on a RPI3

As large words do not embed in 32bit, but 4-6 characters still covers many of the most common words in english, as well as many common keyword tokens in programming.

5375 words, max-word-len: 3

	create	lookup	decode	cat	embed%	RAM use
murmurhash	2.324	0.274	0.054	0.800	0%	85333
squoze32-utf8	0.072	0.042	0.039	0.200	100%	24
squoze32-utf5	0.335	0.256	0.190	2.137	99%	584

16546 words, max-word-len: 4

	create	lookup	decode	cat	embed%	RAM use
murmurhash	1.298	0.290	0.060	0.805	0%	275240
squoze32-utf8	0.078	0.045	0.043	1.157	100%	24
squoze32-utf5	0.697	0.311	0.212	1.106	93%	18842

192126 words, max-word-len: 8

	create	lookup	decode	cat	embed%	RAM use
murmurhash	1.265	0.356	0.068	0.940	0%	3762441
squoze32-utf8	1.175	0.342	0.069	0.930	9%	3487225
squoze32-utf5	1.014	0.378	0.125	0.967	29%	2765311

308201 words, max-word-len: 10

	create	lookup	decode	cat	embed%	RAM use
murmurhash	1.318	0.387	0.071	1.918	0%	6370412
squoze32-utf8	1.263	0.381	0.071	1.926	5%	6095196
squoze32-utf5	1.142	0.395	0.106	1.044	18%	5373282

392136 words, max-word-len: 12

	create	lookup	decode	cat	embed%	RAM use
murmurhash	1.361	0.415	0.073	1.701	0%	8422376
squoze32-utf8	1.320	0.412	0.073	1.711	4%	8147160
squoze32-utf5	1.216	0.415	0.102	1.760	14%	7425246

Funding

This work has been done with funding from supports of the account pippin on Patreon and liberapay. To join in to fund similar and unrelated independent research and development.