Squoze - reversible unicode string hashes.

compact embedding of unicode text in integers

Squoze is a type of unicode string hashes for string interning. The hashes trade the least significant bit of digest data for being able to embed digest_size-1 bits of recoverable data in the hash itself.

The embedded data is stored in either UTF8 or transcoded to UTF5+, a 5bit variable length and dynamic window unicode coding scheme. The hashes produced by squoze are collision free for strings up to string length, and rely on a regular hash function as fallback when embedding is not possible.

The benefits of embedding strings in pointers is also very dataset dependent, if all data fits in caches the added computational overhead might not be worth it, while there still is RAM (and ROM) savings that can be beneficial on embedded platforms.

On embedded platforms not having the strings consume heap space can be a significant saving, this should however be weighed against the overhead of needing 32bit values to store/pass around sometimes being able to use 16bit references to strings is a more significant overall saving.

A series of subvariants have been specified as part of parameterizing the benchmarks, and definining the encoding:

squoze-bignum
UTF5+ encoding, supporting arbitrary length unicode strings stored in bignums.
squoze32
UTF5+ embed encoding, supporting up to 6 lower-case ascii of embedded data
squoze52
UTF5+ embed encoding, supporting up to 10 lower-case ascii of embedded data, 52bit integers can be stored without loss in a double.
squoze62
UTF5+ embed encoding, supporting up to 12 unicode code points.
squoze64
UTF-8 embed encoding, supporting up to 8 ASCII chars of embedded data.
squoze256
UTF5+ embed encoding, supporting up to 50 unicode code points

squoze is still under development, preliminary variants for python3 and C are available, the C code is the hashes tested in the string interning benchmarks - it is heavily parameterized with ifdefs to make it configurable and able for a single implementation to cover all the cases.

squoze64-utf8 achieves 7x speedup over murmurhash one-at-a-time used for initial string interning and 2.5x speedup for subsequent lookups of the same string when the strings are shorther than 8bytes of utf8, see the benchmarks for details.

squoze64 implementation

the squoze-64 encoding is a small bitmanipulation of UTF-8 in-memory encoding, for strings that will fit only the first byte is manipulated and only if it ascii and not @ the value is double and 1 is added. When the first byte does not match our constraints we store 129 - which means the following 7bytes directly encode the value

uint64_t squoze64(const char *utf8, size_t len)
{
  size_t   squoze_dim = 64;
  uint64_t hash       = 0;
  uint8_t *encoded    = (uint8_t*)&hash;
  uint8_t  first_byte = ((uint8_t*)utf8)[0];

  if (first_byte<128
      && first_byte != '@'
      && (len <= (squoze_dim/8)))
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i] = utf8[i];
    encoded[0] = encoded[0]*2+1;
    return *((uint64_t*)&encoded[0]);
  }
  else if (len <= (squoze_dim/8)-1)
  {
    for (int i = 0; utf8[i]; i++)
      encoded[i+1] = utf8[i];
    encoded[0] = 129;
    return *((uint64_t*)&encoded[0]);
  }

  // murmurhash one-at-a-time
  hash = 3323198485ul;
  for (unsigned int i = 0; i < len; i++)
  { 
    uint8_t key = utf8[i];
    hash ^= key;
    hash *= 0x5bd1e995;
    hash ^= hash >> 15;
  }
  return hash & ~1; // make even
}
const char *squoze64_decode (uint64_t hash)
{
  static uint8_t buf[10];
  buf[8] = 0;
  ((uint64_t*)buf)[0]= hash; // or memcpy (buf, hash, 8);
  if ((buf[0] & 1) == 0)
     return NULL;
  if (buf[0] == 129)
     return buf+1;
  buf[0] /= 2;
  return buf;
}

UTF5+ and squoze-bignum implementation

The first stage of this encoding is encoding to UTF5+ which extends UTF-5. The symbol 'G' with value 16 does not occur in normal UTF-5 and is used to change encoding mode to a sliding window, valid UTF5 strings are correctly decoded by a UTF5+ decoder.

In squeeze mode the initial offset is set based on the last encoded unicode codepoint in UTF5 mode. Start offsets for a code point follow the pattern 19 + 26 * N, which makes a-z fit in one window. In sliding window mode the following quintets have special meaning:

00emit SPACE
11codepoint at offset + 0
22codepoint at offset + 1
..
10Acodepoint at offset + 9
11Bcodepoint at offset + 10
12Ccodepoint at offset + 11
..
26Qcodepoint at offset + 25
27Roffset += 26 *1
28Soffset += 26 *1
29Toffset += 26 *1
30Uoffset += 26 *1
31Vswitch to UTF-5 mode

For compatibility with UTF-5 we start out in UTF-5 mode rather than window mode.

The encoder decides if the current mode is kept or not for every codepoint. The cost in output quintets is computed for UTF-5 and windowed is computed for both this and the next codepoint. We switch from UTF-5 to windowed when the cost of switching considering this and the next code points is equal or smaller, in the other direction we only switch if there is a gain to be had.

For example the string Hello World is encoded as follows:

H   e  l l o   W  o  r l d                             11 bytes
GT2 U5 C C F 0 TH UF I C 4     16 quintets = 80 bits = 10 bytes

h  e l l o   w o r l d     11 bytes
G8 5 C C F 0 N F I C 4     12 quintets = 60 bits = 7.5bytes padded to 8 bytes

When transforming a quintet sequence into an integer the initial mode is encoded as a bit of 1 if we are starting out in UTF-5 mode, allowing us to skip the G. To create an integer we start with 0, add the integer value of the first quintet. If there are more quintets, multiply by 32 and continue adding quintets. The resulting value is multipled by 4, the second lowest bit set according to windowed or utf-5 initial mode and the lowest bit set.

squoze32, squoze52 and squoze62 implementation

These hashes are just like squoze-bignum if they as UTF5+ encode as fewer than 6, 10 or 12 quintets. If this is not the case a murmurhash is computed and the lowest bit stripped.

benchmarks

The implementation benchmarked is an open adressing hashtable storing heap allocated chunks containing both the hash, length and raw byte data for the UTF8 string.

The alwaysintern variants of squoze are using the squoze hashes without their embedding capability.

The create column is the microseconds taken on average to intern a word, lookup is the time taken the second and subsequent times a string is referenced. For comparisons the handle/pointer of the interned string would normally be used and be the same for all cases, decode is the time taken for getting a copy of the interned string, for interned strings all we need to do is dereference a pointer, in most uses decoding of strings is not where the majority of time is spent.

The embed% column shows how many of the words got embedded instead of interned.

The RAM use column shows the amount of bytes used by the allocations for interned strings as well as the size taken by the hash table used, without the size taken by tempty slots in the hash-table to be comparable with what a more compact optimizing structure used when targeting memory constrained systems.

The hashes with the direct UTF8 embedding are the most reliable optimization when only runtime/energy use is considered. The UTF5 embeddings save more RAM and allow more guarantee collision free strings but are more expensive to compute.

79014 words, max-word-len: 6

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit0.6130.1610.0190%1755262
squoze64-utf80.0500.0480.028100%48
squoze32-utf80.2790.1200.02421%1413862
squoze32-utf50.2200.1540.07470%537584
squoze52-utf50.1650.1290.096100%2956
squoze62-utf50.1720.1480.103100%110
squoze32-utf5 alwaysintern0.3870.2000.0150%1755262
squoze52-utf5 alwaysintern1.7130.9060.0150%2387374
squoze62-utf5 alwaysintern1.8331.0800.0170%2387374

192126 words, max-word-len: 8

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit0.4140.1880.0250%4530969
squoze64-utf80.0530.0540.036100%48
squoze32-utf80.3320.1610.0289%4189569
squoze32-utf50.2990.1570.05729%3313291
squoze52-utf50.1850.1570.10199%47605
squoze62-utf50.1700.1570.101100%1556
squoze52-utf5 alwaysintern3.9942.3270.0210%6067977
squoze62-utf5 alwaysintern3.1971.9070.0200%6067977

308201 words, max-word-len: 10

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit0.5170.2160.0250%7603240
squoze64-utf80.1790.1100.03562%4000919
squoze32-utf80.4590.2050.0295%7261840
squoze32-utf50.3860.1870.04218%6385562
squoze52-utf50.1960.1780.10392%861026
squoze62-utf50.1800.1690.10599%62545
squoze32-utf5 alwaysintern0.5330.2270.0240%7603240
squoze52-utf5 alwaysintern3.4412.2310.0210%10068848
squoze62-utf5 alwaysintern4.4822.6470.0210%10068848

392136 words, max-word-len: 12

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit0.6920.2600.0250%9990944
squoze64-utf80.2820.1470.03449%7060103
squoze32-utf80.5950.2450.0294%9649544
squoze32-utf50.5140.2250.03814%8773266
squoze52-utf50.2320.1780.09172%3920210
squoze62-utf50.1980.1910.11394%805471
squoze32-utf5 alwaysintern0.6230.2530.0240%9990944
squoze52-utf5 alwaysintern3.6162.3100.0200%13128032
squoze62-utf5 alwaysintern5.9513.2680.0200%13128032

466514 words, max-word-len: 23

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit0.7160.2730.0250%12326298
squoze64-utf80.4020.1790.03441%9990481
squoze32-utf80.6860.2750.0284%11984898
squoze32-utf50.5920.2490.03812%11108620
squoze52-utf50.2880.2070.08361%6850588
squoze62-utf50.2340.1950.10279%3735849
squoze32-utf5 alwaysintern0.6930.2810.0260%12326298
squoze52-utf5 alwaysintern3.9742.6380.0200%16058410
squoze62-utf5 alwaysintern7.2493.9690.0200%16058410

32bit ARM benchmarks on a RPI3

As large words do not embed in 32bit, but 4-6 characters still covers many of the most common words in english, as well as many common keyword tokens in programming.

16546 words, max-word-len: 4

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.5960.3330.0560%275244
squoze32-utf80.1060.1050.063100%28
squoze32-utf51.1620.3680.23693%18846
squoze32-utf5 alwaysintern2.2340.5810.0600%275244

79014 words, max-word-len: 6

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.6790.4320.0640%1439186
squoze32-utf81.3380.3310.06821%1163970
squoze32-utf50.9120.4610.21970%442056
squoze32-utf5 alwaysintern1.7640.6560.0620%1439186

192126 words, max-word-len: 8

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.7570.4870.0680%3762445
squoze32-utf81.5660.4080.0709%3487229
squoze32-utf51.3520.4410.13329%2765315
squoze32-utf5 alwaysintern1.8110.5970.0670%3762445

308201 words, max-word-len: 10

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.8270.5400.0710%6370416
squoze32-utf81.6630.4580.0725%6095200
squoze32-utf51.5240.4650.11018%5373286
squoze32-utf5 alwaysintern1.8770.6280.0700%6370416

392136 words, max-word-len: 12

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.8740.5770.0730%8422380
squoze32-utf81.7360.4920.0754%8147164
squoze32-utf51.6060.4910.10514%7425250
squoze32-utf5 alwaysintern1.9400.6700.0720%8422380

466514 words, max-word-len: 23

createlookupdecodeembed%RAM use
murmurhash OAAT 32bit1.9220.6230.0760%10460222
squoze32-utf81.7910.5310.0774%10185006
squoze32-utf51.7000.5240.10312%9463092
squoze32-utf5 alwaysintern2.0320.7230.0760%10460222

Funding

This work has been done with funding from patrons on Patreon and liberapay. To join in to fund similar and unrelated independent research and development.

Licence

Copyright (c) 2021-2022 Øyvind Kolås <pippin@gimp.org>

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.