squoze - unicode encodings for pointers and hashes

compute more, save energy, parse faster;
with strings squozed to fit in computer words

Storing strings in computer words is an old practice, here this optimization technique is modernized for unicode and combined with string interning. The least significant bit of pointers (or hash) is used to indicate if a string is embedded or not - even values are interpreted as embedded strings. By storing short strings directly we avoid cache and lock contention involved in hash-table and RAM overhead..

Calling it a hash might seem like a bit of a misnomer, but the same optimization can be used for with larger hashes in content addressed storage systems like git/IPFS, as well as in parsers with a limited vocabulary so the resulting behavior is that of a perfect hash.

The embedded data is stored in either UTF-8 or transcoded to UTF5+, a 5bit variable length and dynamic window unicode coding scheme.

The benefits of embedding strings in pointers is dataset dependent - but note that in the english language the average word length is 5. If all data fits in caches the added computational overhead might only slightly reduce cache contention. As I understand it microcontrollers have no L1/L2 cache, but there can still be benefits from RAM savings

A series of subvariants have been specified as part of parameterizing the benchmarks, and definining the encoding:

squoze64-utf8 achieves 7x speedup over murmurhash one-at-a-time used for initial string interning and 3x speedup for subsequent lookups of the same string when the strings are shorther than 8bytes of utf8, see the benchmarks for details.

squoze64-utf8
UTF-8 embed encoding, supporting up 8 UTF-8 bytes of embedded data.
squoze32-utf8
UTF-8 embed encoding, supporting up 4 UTF-8 bytes of embedded data
squoze32-utf5
UTF5+ embed encoding, supporting up to 6 lower-case ascii of embedded data
squoze52-utf5
UTF5+ embed encoding, supporting up to 10 lower-case ascii of embedded data, 52bit integers can be stored without loss in a double.
squoze62-utf5
UTF5+ embed encoding, supporting up to 12 unicode code points.

squoze64 implementation

the squoze-64 encoding is a small bitmanipulation of UTF-8 in-memory encoding, for strings that will fit only the first byte is manipulated and only if it ascii and not vertical-tab (ASCII 11) the value is doubled and 1 is added. When the first byte does not match our constraints we store 23 - which means the following 7bytes directly encode the value.

The following example code is an illustration based on the benchmarked code.

/*Copyright (c) 2021-2022 Øyvind Kolås <pippin@gimp.org>

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above 
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED \AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFOR-
MANCE OF THIS SOFTWARE.*/

void *squoze64(const char *utf8, size_t len)
{
  size_t   squoze_dim = 64;
  uint64_t hash       = 0;
  uint8_t *encoded    = (uint8_t*)&hash;
  uint8_t  first_byte = ((uint8_t*)utf8)[0];

  if (first_byte < 128
      && first_byte != 11
      && (len <= (squoze_dim/8)))
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i] = utf8[i];
    encoded[0] = encoded[0] * 2 + 1;
    return (void*)*((uint64_t*)&encoded[0]);
  }
  else if (len <= (squoze_dim/8)-1)
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i+1] = utf8[i];
    encoded[0] = 23;
    return (void*)*((uint64_t*)&encoded[0]);
  }

  return intern_string(utf8); // fall back to
                              // regular interning
                              // and rely on an aligned
                              // pointer for the interned
                              // string
}

// caller provides a temporary buffer of at least 9 bytes 
const char *squoze64_decode (void *squozed, uint8_t *buf)
{
  uint64_t bits = (uint64_t)squozed;
  if ((bits & 1) == 0)
     return (char*)squozed;

  buf[8] = 0;
  ((uint64_t*)buf)[0] = bits;
  if (buf[0] == 23)
    return buf + 1;
  buf[0] /= 2;
  return buf;
}

squoze32, squoze52 and squoze62 implementation

These encodings use UTF5+ to encode strings of up to 6, 10 or 12 characters.

benchmarks

The implementation benchmarked is an open adressing hashtable storing heap allocated chunks containing both the hash, length and raw byte data for the UTF-8 string. The benchmarking code can be downloaded here: squoze.tar.xz it is a small C project that should build on a 64bit linux system, with /usr/share/dict/words available.

The benchmark is goes through all the words in /usr/share/dict/words and respectively creating them for the first time, a second time, and finally having an array of IDs get a copy of the string. The benchmark selects the quickest runtimes out of 10 (or more) runs to average out effects of caches and other tasks on the system. For the shorter strings, the whole datastructure fitting in L2 cache and there being no concurrent cache contention has impact on the results.

The create column is the microseconds taken on average to intern a word, lookup is the time taken the second and subsequent times a string is referenced. For comparisons the handle/pointer of the interned string would normally be used and be the same for all cases, decode is the time taken for getting a copy of the interned string, for interned strings all we need to do is dereference a pointer, in most uses decoding of strings is not where the majority of time is spent. cat is the time taken to concatenate "s" at the end of each word, getting some matches due to pluralisation. The embed% column shows how many of the words got embedded instead of interned. This can also be read as the percentage of possible cache-misses that are avoided for the workload.The RAM use column shows the amount of bytes used by the allocations for interned strings as well as the size taken by the hash table used, without the size taken by tempty slots in the hash-table to be comparable with what a more compact optimizing structure used when targeting memory constrained systems.

The first line in each set of benchmarks is a baseline string-interning implementation using the same infrastructure as the others, but without embedding.

5375 words, max-word-len: 3

createlookupdecodecatembed%RAM use
murmurhash0.0800.0170.0060.0440%106849
squoze64-utf80.0090.0040.0090.017100%40
squoze32-utf80.0090.0040.0090.017100%40
squoze32-utf50.0630.0320.0250.08199%740
squoze52-utf50.0370.0320.0240.067100%40
squoze62-utf50.0360.0320.0230.067100%40

16546 words, max-word-len: 4

createlookupdecodecatembed%RAM use
murmurhash0.0630.0210.0090.0540%341440
squoze64-utf80.0120.0060.0120.020100%40
squoze32-utf80.0120.0060.0120.065100%40
squoze32-utf50.0590.0390.0270.08393%23294
squoze52-utf50.0450.0380.0270.076100%40
squoze62-utf50.0440.0380.0260.074100%40

79014 words, max-word-len: 6

createlookupdecodecatembed%RAM use
murmurhash0.0890.0380.0080.1230%1755254
squoze64-utf80.0170.0050.0100.021100%40
squoze32-utf80.0670.0310.0100.11421%1413854
squoze32-utf50.0600.0480.0250.11270%537576
squoze52-utf50.0550.0480.0310.094100%2948
squoze62-utf50.0550.0470.0290.088100%102

192126 words, max-word-len: 8

createlookupdecodecatembed%RAM use
murmurhash0.1160.0670.0130.1790%4530961
squoze64-utf80.0190.0060.0130.051100%40
squoze32-utf80.1100.0600.0130.1769%4189561
squoze32-utf50.1020.0700.0220.15929%3313283
squoze52-utf50.0650.0570.0330.10299%47597
squoze62-utf50.0630.0560.0330.102100%1548

308201 words, max-word-len: 10

createlookupdecodecatembed%RAM use
murmurhash0.1430.0880.0140.5820%7603232
squoze64-utf80.0600.0310.0150.11762%4000911
squoze32-utf80.1330.0780.0140.5765%7261832
squoze32-utf50.1270.0880.0190.21918%6385554
squoze52-utf50.0760.0640.0340.11792%861018
squoze62-utf50.0730.0690.0410.12099%62537

392136 words, max-word-len: 12

createlookupdecodecatembed%RAM use
murmurhash0.1620.1030.0130.5030%9990936
squoze64-utf80.0830.0450.0150.16649%7060095
squoze32-utf80.1500.0910.0130.5064%9649536
squoze32-utf50.1400.0950.0170.50414%8773258
squoze52-utf50.0870.0700.0320.14172%3920202
squoze62-utf50.0800.0690.0360.12394%805463

466514 words, max-word-len: 23

createlookupdecodecatembed%RAM use
murmurhash0.1710.1100.0130.4640%12326290
squoze64-utf80.1070.0670.0160.46141%9990473
squoze32-utf80.1660.0910.0140.4674%11984890
squoze32-utf50.1570.1080.0170.47112%11108612
squoze52-utf50.1060.0810.0310.17461%6850580
squoze62-utf50.0880.0740.0340.13579%3735841

32bit ARM benchmarks on a RPI3

As large words do not embed in 32bit, but 4-6 characters still covers many of the most common words in english, as well as many common keyword tokens in programming.

5375 words, max-word-len: 3

createlookupdecodecatembed%RAM use
murmurhash2.3240.2740.0540.8000%85333
squoze32-utf80.0720.0420.0390.200100%24
squoze32-utf50.3350.2560.1902.13799%584

16546 words, max-word-len: 4

createlookupdecodecatembed%RAM use
murmurhash1.2980.2900.0600.8050%275240
squoze32-utf80.0780.0450.0431.157100%24
squoze32-utf50.6970.3110.2121.10693%18842

192126 words, max-word-len: 8

createlookupdecodecatembed%RAM use
murmurhash1.2650.3560.0680.9400%3762441
squoze32-utf81.1750.3420.0690.9309%3487225
squoze32-utf51.0140.3780.1250.96729%2765311

308201 words, max-word-len: 10

createlookupdecodecatembed%RAM use
murmurhash1.3180.3870.0711.9180%6370412
squoze32-utf81.2630.3810.0711.9265%6095196
squoze32-utf51.1420.3950.1061.04418%5373282

392136 words, max-word-len: 12

createlookupdecodecatembed%RAM use
murmurhash1.3610.4150.0731.7010%8422376
squoze32-utf81.3200.4120.0731.7114%8147160
squoze32-utf51.2160.4150.1021.76014%7425246

Funding

This work has been done with funding from supports of the account pippin on Patreon and liberapay. To join in to fund similar and unrelated independent research and development.