Ointers: A library for representing pointers where bits have been stolen (2021)

12 days ago (github.com)

Stealing bits from pointers makes sense sometimes, but in most cases an easier and less error-prone way to increase memory efficiency is to just use indices or relative pointers. Instead of shaving individual bits of a pointer take advantage of the fact that most pointers in a program are almost exactly alike.

Instead of storing a 64 bit prev and next pointer for an item in a linked list just store e.g. two 16 bit offsets. Then the next_ptr = (this_ptr & base_mask) + offset * alignment.

Most of the time wasting memory by storing full 64bit pointers is no big deal. But when it is, you can go a lot further than just shaving off a few bits.

If you want to be extra clever you can encode values in the base address of your pointers by taking advantage of how virtual memory is allocated. Every process has its own virtual memory pages and you can request memory pages at specific address ranges. For instance if you align all your memory at a 32gb boundary then you can use 32bit pointers + a base pointer without even having to mask. You can also use virtual memory allocation tricks to convey information about alignment, typing, whether a pointer is allowed to be shared between threads and so on. And again, without having to butcher the pointer itself.

  • Considering how common it is to use indices in lieu of pointers in Rust, it would be nice if it had native NonMax integer types similar to the NonZero types it already has. NonZero gives you the niche-filling optimization (e.g. Option<NonZero> and NonZero are the same size) but with indices you want zero to be a valid value, so it would be natural to use the maximum as the illegal value instead. You can kind of do it by offsetting the index by 1 or inverting the bits then putting it in a NonZero but that's a small extra runtime cost on every use.

    • You can do a handy struct wrapper over a private Nonzero that xors the given value with the prohibited value (max in this case) at the new constructor. And like so for the get method. Xoring is very cheap. That's my favorite way of storing indices/links for nodes, since you can wrap them in Option for free.

      4 replies →

  • That's really not the application for this, it's tags, which are endlessly useful or characterizing the pointer. For instance you can use the MSBs to establish the type of whats pointed to, or for reference counting. I use the LSB to indicate if the value is a pointer or an unsigned integer, whereby you can store both an (integer) key to a record in a database, or a pointer to the record data in memory. You could use another bit in the pointer to indicate that the record is dirty.

    Using unused bits in pointers can make you code more efficient and simpler.

  • This requires using arenas extensively in your program or perhaps playing tricks with virtual memory, because your system allocator will give you arbitrary pointers by default and you'll often want to point to an object from a different allocation.

    • Right. Which is why I favor doing your own memory management entirely (which has some huge advantages) or not worrying about memory at all and trusting a garbage collector. I don't think there are many situations left where memory management is too important to leave to a gc but not important enough to do from scratch.

      1 reply →

  • Definitely indices are a great option, but then you need a base pointer and a way to allocate off of it. That can add significant complexity. So it is all a tradeoff.

Technically it’s an ointer on PowerPC and SPARC, but a pointe everywhere else.

This look like something that debuggers would hate, because following pointers would be broken

  • I was just thinking this. You'd either need to special-case your debuggers or be able to use something that provided bespoke hooks for pointer-following.

    • Wrap it in a class and teach the debugger to interpret it. It might be possible with gdb and python pretty-printers for example.

I've posted about MGS PS1 port we did to PC and how some pointers to the C4 bomb planted posted meant it was on the wall or on the ground - here ->

https://news.ycombinator.com/item?id=9739731

to quote me:

- The game used a tricky pointer hack, basically on the PSX accessing a pointer with the highest-24-bit set means read it from the CPU cache, otherwise not (or maybe the other way around). This was used for example to indicate whether the C4 bomb was planted on the ground, or on the wall instead of keeping a booblean/bit flag for it. Possibly was used for some more things. Since it was 24-bit, that meant 16mb.

Are there ways on Linux to make sure that all user-visible pointers in a process are allocated within some subset of the overall address space? For instance, imagine writing a 64-bit program such that all userspace pointers are allocated within the first 32 bits. This would allow for representing them in memory using four bytes, as is commonly done on 32-bit architectures--without adding more extensive or ad-hoc changes such as a new "x32" binary format. Except that you could also limit to 24-bit (giving 16M of address space), 16-bit (for a "tiny" code model) or anything else. Of course, this would require some amount of support in the linker/loader, userspace libraries and the kernel itself. But the effort might be worthwhile if it can allow for replacing more ad-hoc approaches.

  • Linux mmap (since 2.6) has a MAP_32BIT flag, but only for x86-64:

    Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2 GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.

  • An allocator is free to only `mmap` addresses that are within a range. I think jemalloc could allow you to do that using custom arenas.

  • ASAN relies on kernel internal details, but this has broken a few times.

    Realistically you should just mmap a huge aligned chunk in the first place, then allocate inside that.

> What do you call a pointer with the high bits stolen? An ointer!

What is the name a pointer with the low bits stolen? Machines with strict word addressing will ignore the bottom two bits so you can safely store what you want there.

Not sure I've used, much less programmed a machine with that strict constraint in a while though. These days CPUs are all designed to implement the C abstract machine. At least the GPU folks are not subject to this constraint.

Am I misreading the bitmask code? It looks like (in addition to a few other ideas) it's using the old "stick a few extra bits in an aligned pointer", but it seems to be only manipulating high bits, whereas aligned pointer zeroes are low-order bits.

I'd suggest a heavier investment in testing infrastructure.

  • 64 bit architectures don't actually have 64 bit address spaces, both AMD64 and ARM64 have 48 bit address spaces by default (some CPUs have extensions you can enable to request larger address spaces e.g. LVA on ARM64 and 5LP / LA57 on AMD64 but that's opt-in on a per-process basis).

    So while you have 3 bits available at the bottom of the pointer, there are 16 at the top. That's a lot more payload you can smuggle. There are even CPU extensions which tell the processor to ignore some of that (Linear Address Masking for Intel, Upper Address Ignore for AMD, and Top Byte Ignore for ARM).

    • Small correction. That's true for 4-level paging where virtual memory is capped at 256 TiB (LAM48). There's a 5-level extension that reduces the wasted space to 7 bits (LAM57) to allow for 128 PiB of virtual space.

      I have no idea what the purpose of the extension is that when I don't believe you can get that much secondary storage attached to a CPU unless you go to tape which is pointless for virtual mapping.

      5 replies →

  • The stored representation is packed such that all the stealable bits are contiguous. To get the original pointer value it’s unpacked first.

Every time I've seen similar done in the past, it has come back to hurt the instigator.

  • Apple's been using that for more than a decade: https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-bui... http://www.sealiesoftware.com/blog/archive/2013/09/24/objc_e...

    OCaml's also been using tagging to mark unboxed type pretty much from the start, nearly 30 years ago: https://dev.realworldocaml.org/runtime-memory-layout.html#:~....

    • And it's not like Apple isn't aware of the dangers, either; they used top-byte tagging on the 68000, and it took them a lot of time and effort to migrate away from that to support 16+ MB systems.

      1 reply →

  • It's common for libraries to use the 2 lowest bits in tree structures e.g. RB tree. It's not a problem at all since the data structure is controlled by the library. Even if tree nodes are allocated/provided by the user, having a "4-byte aligned" requirement in the API contract isn't a problem at all -- in fact you need to work quite hard to allocate a structure with a pointer that isn't at least 4 byte aligned (i386), or 8 byte aligned (x64).

    • > in fact you need to work quite hard to allocate a structure with a pointer that isn't at least 4 byte aligned (i386), or 8 byte aligned (x64).

      Well, no, actually; it's:

        p = malloc(size+1)+1;
      

      It's just quite implausible that you'd do that by accident.

      6 replies →

  • Every lisp system that ever existed uses this technique, and it never hurt them any. Emacs or SBCL or Open Genera; they all work perfectly well.

  • This bitvec I wrote for Servo and Gecko uses this technique, and has been shipping in Firefox releases for over six years with no bugs found:

    https://docs.rs/smallbitvec/

    I'm pretty sure you can find several more examples in Firefox, as well as other major browsers.

  • VM writers use this egregiously. I've not heard of it causing issues. It's not that complicated.

    Lua, Lisps, many others.

  • All fast JavaScript VMs use some form of pointer tagging. V8 uses (compressed) tagged pointers and both JSC and Firefox used NaN-boxing.

    WasmGC has a i31ref as part of its reference hierarchy, which is implemented with pointer tagging.

  • I've used tagged pointers with no issues at all over the years to "smuggle" / store values in...

  • If you build your whole app with it, yeah. If you use it tactically for a certain data type it can be very nice. Just don't try to dissimulate it's specialness.

Could something like this be used to lower memory footprint?

  • Besides that it makes some lock-free algorithms even feasible and some easier to implement, yes, it also can be used to downsize the memory pressure at the cost of extra CPU cycles if you know that the metadata you want to track fits in those 16+ bits.