C++ proposal: There are exactly 8 bits in a byte

5 hours ago (open-std.org)

Previously, in JF's "Can we acknowledge that every real computer works this way?" series: "Signed Integers are Two’s Complement" <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p09...>

During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.

  • I've worked on a machine with 9-bit bytes (and 81-bit instructions) and others with 6-bit ones - neither had a C compiler

    • The Nintendo 64 had 9-bit RAM, but C viewed it as 8-bit; the 9th bit was only there for the RSP (the GPU).

    • I think the PDP-10 could have 9-bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this, though; people say lots of confusing, conflicting things. When I Google "pdp-10 byte size" it says a C++ compiler chose to represent char as 36 bits.

  • Somehow this machine found its way onto The Heart of Gold in a highly improbable chain of events.

  • 10-bit arithmetic is actually not uncommon on FPGAs these days and is used in production in relatively modern applications.

    10-bit C, however, ..........

    • How so? Arithmetic on an FPGA usually uses the minimum width that works, because anything wider uses more resources than needed.

      9-bit bytes are pretty common in block RAM though, with the extra bit used either for ECC or for user storage.

  • I programmed the Mattel Intellivision's CPU, which had a 10-bit "decle". A wacky machine. It wasn't powerful enough for C.

D made a great leap forward with the following:

1. bytes are 8 bits

2. shorts are 16 bits

3. ints are 32 bits

4. longs are 64 bits

5. arithmetic is 2's complement

6. IEEE floating point

and a big chunk of time that would have been wasted trying to abstract these away (and getting it wrong anyway) was saved. Millions of people cried out in relief!

Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.
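
As an aside, roughly the same guarantees can be pinned down in today's C++, though only by opting in. Here is a minimal sketch, assuming a C++17-or-later compiler; the exact-width types from <cstdint> are optional, so this simply fails to compile on a platform that can't honour the assumptions:

    #include <climits>
    #include <cstdint>
    #include <limits>

    // Every line here is an assumption being asserted, not something the core
    // language promises on its own.
    static_assert(CHAR_BIT == 8, "8-bit bytes assumed");
    // int16_t/int32_t/int64_t are exact-width, two's-complement types; merely
    // naming them makes the build fail on a platform where they don't exist.
    using i16 = std::int16_t;
    using i32 = std::int32_t;
    using i64 = std::int64_t;
    static_assert(std::numeric_limits<double>::is_iec559, "IEEE-754 doubles assumed");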

  • Zig is even better:

    1. u8 and i8 are 8 bits.

    2. u16 and i16 are 16 bits.

    3. u32 and i32 are 32 bits.

    4. u64 and i64 are 64 bits.

    5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic.

    6. f16, f32, f64, f80, and f128 are IEEE floating-point types of the corresponding bit lengths.

    The question of the length of a byte doesn't even matter. If someone wants to compile to a machine whose bytes are 12 bits, just use u12 and i12.
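
    For contrast, here is a hedged C++ sketch of the three flavours of addition Zig spells '+', '+%' and '+|'. This is purely illustrative: C++26 is adding std::add_sat, but the saturating version below is written out by hand.

      #include <cstdint>
      #include <limits>

      // Unsigned '+' already wraps modulo 2^N in C++, like Zig's '+%'.
      constexpr std::uint8_t wrapping_add(std::uint8_t a, std::uint8_t b) {
          return static_cast<std::uint8_t>(a + b);
      }

      // Hand-rolled saturating add, like Zig's '+|'.
      constexpr std::int8_t saturating_add(std::int8_t a, std::int8_t b) {
          constexpr int lo = std::numeric_limits<std::int8_t>::min();
          constexpr int hi = std::numeric_limits<std::int8_t>::max();
          const int sum = a + b;  // operands promote to int, so this cannot overflow
          return static_cast<std::int8_t>(sum < lo ? lo : sum > hi ? hi : sum);
      }

      static_assert(wrapping_add(250, 10) == 4);       // 260 mod 256
      static_assert(saturating_add(120, 100) == 127);  // clamped at INT8_MAX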

  • "1. bytes are 8 bits"

    How big is a bit?

    • This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.

  • Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got the standardized primitive widths correct:

    byte = 8 bits

    short = 16

    int = 32

    long = 64

    float = 32 bit IEEE

    double = 64 bit IEEE

    • I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.

      On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
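
      For illustration (the names here are invented, not from any particular codebase), such an alias might look like this, with the "it fits" assumption checked at the conversion site:

        #include <cassert>
        #include <cstddef>
        #include <cstdint>

        // "Short" relative to the usual 64-bit size_t: the alias documents the
        // assumption that values are known to be small enough to fit.
        using short_size = std::int32_t;

        inline short_size to_short_size(std::size_t n) {
            assert(n <= static_cast<std::size_t>(INT32_MAX));  // the assumption, checked
            return static_cast<short_size>(n);
        }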


    • Yep. Pity about getting chars / string encoding wrong though. (Java chars are 16 bits).

      But it’s not alone in that mistake. All the languages invented in that era made the same mistake. (C#, JavaScript, etc).


Some people are still dealing with DSPs.

https://thephd.dev/conformance-should-mean-something-fputc-a...

Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.

  • They can just target C++23 or earlier, right? I have a small collection of SHARCs but I am not going to go crying to the committee if they make C++30 (or whatever) not support CHAR_BIT=32

  • no doubt you've got your brainfuck compiler hard at work on this ...

    • TI DSP Assembler is pretty high level; it's "almost C" already.

      Writing geophysical | military signal and image processing applications on custom DSP clusters is surprisingly straightforward and doesn't need C++.

      It's a RISC architecture optimised for DSP | FFT | array processing, with the basic simplification that char text is for hosts, integers and floats are at least 32 bits, and 32 bits (or 64) is the smallest addressable unit.

      Fantastic architecture to work with for numerics and deep computational pipelines: once "primed", you push in raw acquisition samples in chunks every clock cycle and extract processed moving-window data chunks every clock cycle.

      A single ASM instruction in a cycle can accumulate totals from a vector multiplication and modulo-update the indexes on three vectors (two inputs and one output).

      Not your mama's brainfuck.

Is C++ capable of deprecating or simplifying anything?

Honest question, I haven't followed closely. rand() is broken, I'm told unfixably so, and last I heard it still wasn't deprecated.

Is this proposal a test? "Can we even drop support for a solution to a problem literally nobody has?"

  • As mentioned by others, we've dropped trigraphs and deprecated rand (and offer an alternative). I also have:

    * p2809 Trivial infinite loops are not Undefined Behavior
    * p1152 Deprecating volatile
    * p0907 Signed Integers are Two's Complement
    * p2723 Zero-initialize objects of automatic storage duration
    * p2186 Removing Garbage Collection Support

    So it is possible to change things!

  • Signed integers did not have to be 2's complement; there were 3 valid representations: sign-magnitude, 1's complement, and 2's complement. Modern C and C++ dropped this and mandate 2's complement ("as if", but that distinction is moot here; you can do the same for CHAR_BIT). So there is certainly precedent for this sort of thing.

  • I think you are right. Absolutely.

    Don’t break perfection!! Just accumulate more perfection.

    What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.

    I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.

    “unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.

    For the damn accountants.

    “unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.

    For the damn accountants who care about the cost of bytes.

    Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.

    And “float byte8” obviously.

  • C++ long ago crossed the line where making any change is more work than any benefit it could ever create.

  • Well, they managed to get the two's complement requirement into C++20. There is always hope.

    • Well then someone somewhere with some mainframe got so angry they decided to write a manifesto to condemn kids these days and announced a fork of Qt because Qt committed the cardinal sin of adopting C++20. So don’t say “a problem literally nobody has”, someone always has a use case; although at some point it’s okay to make a decision to ignore them.

      https://news.ycombinator.com/item?id=41614949

      Edit: Fixed typo pointed out by child.


What will be the benefit?

- CHAR_BIT cannot go away; reams of code references it.

- You still need the constant 8. It's better if it has a name.

- Neither the C nor the C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change; certain possible implementations will simply be rendered nonconforming.

- There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.
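
For what it's worth, the kind of thing "reams of code" does with CHAR_BIT looks like the generic sketch below; even if the value were pinned to 8, the name still carries the intent better than a bare constant.

    #include <climits>
    #include <cstddef>

    // Bit width of an arbitrary type, as found all over serialization and
    // bit-twiddling code.
    template <typename T>
    constexpr std::size_t bits_in = sizeof(T) * CHAR_BIT;

    static_assert(bits_in<unsigned long long> >= 64);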

I'm totally fine with enforcing that int8_t == char == 8 bits; however, I'm not sure about spreading the misconception that a byte is 8 bits. A byte with 8 bits is called an octet.

At the same time, `std::byte` has already been an "alias" for `unsigned char` since C++17 anyway [1].

[1] https://en.cppreference.com/w/cpp/types/byte

  • I, for one, hate that int8 == signed char.

    std::cout << (int8_t)32 << std::endl; //should print 32 dang it

    • Now you can also enjoy the fact that you can't even compile:

        std::cout << (std::byte)32 << std::endl;
      

      because there is no default operator<< defined.
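
      The usual workarounds, for what it's worth (common practice rather than anything the standard mandates): promote the int8_t before streaming, and unwrap the std::byte explicitly.

        #include <cstddef>
        #include <cstdint>
        #include <iostream>

        int main() {
            std::int8_t v = 32;
            std::cout << +v << '\n';                       // unary + promotes to int: prints 32
            std::byte b{32};
            std::cout << std::to_integer<int>(b) << '\n';  // prints 32
        }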


There are DSP chips that have C compilers, and do not have 8 bit bytes; smallest addressable unit is 16 (or larger).

Less than a decade ago I worked with something like that: the TeakLite III DSP from CEVA.

> We can find vestigial support, for example GCC dropped dsp16xx in 2004, and 1750a in 2002.

Honestly, I'm kind of surprised it was relevant as late as 2004. I thought the era of non-8-bit bytes was the 1970s or earlier.

I just put static_assert(CHAR_BIT == 8); in one place and move on. I haven't had it fire since back when it was the #if equivalent.

Not sure about that, seems pretty controversial to me. Are we forgetting about the UNIVACs?

  • Hopefully we are; it's been a long time, but as I remember indexing in strings on them is a disaster.

    • They still exist. You can still run OS 2200 on a Clearpath Dorado.[1] Although it's actually Intel Xeon processors doing an emulation.

      Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.

      [1] https://www.unisys.com/siteassets/collateral/info-sheets/inf...

  • This would be a great setup for a time travelling science fiction where there is some legacy UNIVAC software that needs to be debugged, and John Titor, instead of looking for an IBM 5100, came back to the year 2024 to find a pre-P3477R0 compiler.

  • idk. By now most software already assumes 8 bits == 1 byte in subtle ways all over the place, to the point that you pretty much have to use a fully custom, or at least fully self-reviewed and patched, stack of C libraries.

    So delegating such by-now very rare edge cases to non-standard C seems fine, i.e. IMHO it doesn't change much at all in practice.

    And C/C++ compilers are full of non-standard extensions anyway; it's not that CHAR_BIT goes away, or that you couldn't, as a non-standard extension, assume it might not be 8.

    • > most software already assumes 8 bit == byte in subtle ways all over the place

      Which is the real reason why 8-bits should be adopted as the standard byte size.

      I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.

      Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential of breaking on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You should probably modify the standard to keep code portable, either by forcing compilers for non-8-bit systems to handle the exceptions, or by simply admitting that the compiler cannot produce portable code for non-8-bit systems.

  • Do UNIVACs care about modern C++ compilers? Do modern C++ compilers care about UNIVACs?

    Given that Wikipedia says UNIVAC was discontinued in 1986 I’m pretty sure the answer is no and no!

    • The UNIVAC 1108 (and descendants) mainframe architecture was not discontinued in 1986. The company that owned it (Sperry) merged with Burroughs in that year to form Unisys. The platform still exists, but now runs as a software emulator under x86-64. The OS is still maintained and had a new release just last year. Around the time of the merger the old school name “UNIVAC” was retired in a rebranding, but the platform survived.

      Its OS, OS 2200, does have a C compiler. Not sure if there ever was a C++ compiler; if there once was, it is no longer around. But that C compiler is not being kept up to date with the latest standards; it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL, and the OS itself is mainly written in assembler and a proprietary Pascal-like language called "PLUS". They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc. is not a goal.

      The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.

So please excuse my ignorance, but is there a "logic"-related reason, other than hardware cost limitations à la "8 was cheaper than 10 for the same number of memory addresses", that bytes are 8 bits instead of 10? Genuinely curious; as a high-level dev of twenty years, I don't know why 8 was selected.

To my naive eye, it seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?

  • One example from the software side: a common thing to do in data processing is to work with bit offsets (compression, video decoding, etc.). If a byte were 10 bits, you would need mod-10 operations everywhere, which are slow and/or complex. In contrast, mod 2^N is a single logic instruction (see the sketch below).
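
    A small illustrative sketch of that point: splitting a bit offset into a unit index plus a bit-within-unit is a shift and a mask when the unit is 8 bits, but a genuine divide and remainder if it were 10.

      #include <cstddef>

      struct BitPos { std::size_t unit; unsigned bit; };

      constexpr BitPos split_8bit(std::size_t bit_offset) {
          return { bit_offset >> 3, static_cast<unsigned>(bit_offset & 7u) };  // shift + mask
      }

      constexpr BitPos split_10bit(std::size_t bit_offset) {
          return { bit_offset / 10, static_cast<unsigned>(bit_offset % 10) };  // divide + remainder
      }

      static_assert(split_8bit(27).unit == 3 && split_8bit(27).bit == 3);
      static_assert(split_10bit(27).unit == 2 && split_10bit(27).bit == 7);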

  • I'm fairly sure it's because the English character set fits nicely into a byte. 7 bits would have worked as well, but 7 is a very odd width for something in a binary computer.

  • If you're ignoring what's efficient, then just use a decimal data type and let the hardware figure out how best to calculate it for you. If efficiency matters, then address management, hardware operation implementations, and data packing are all simplest when the group size is a power of the base.

  • Likely mostly as a concession to ASCII in the end. You used a typewriter to write into, and receive terminal output from, machines back in the day. Terminals would use ASCII. There were machines with all sorts of smallest-addressable sizes, but eight-bit bytes align nicely with ASCII. That makes strings easier; making strings easier makes programming easier; easier programming makes a machine more popular. Once machines started standardizing on eight-bit bytes, others followed. When they went to add more data, they kept the byte, since code was written for bytes, and made their new registers two bytes, then two of those, then two of those. So we're sitting at 64-bit registers on the backs of all that came before.

Hmm, I wonder if any modern languages can work on computers that use trits instead of bits.

https://en.wikipedia.org/wiki/Ternary_computer

  • Possible, but likely slow. There's nothing in the "C abstract machine" that mandates specific hardware, but bit shifts are only fast operations when you have bits. Similarly with bitwise boolean operations.

  • It'd just be a translation/compiler problem. Most languages don't really have a "bit", instead it's usually a byte with the upper bits ignored.

But how many bytes are there in a word?

This is entertaining and probably a good idea but the justification is very abstract.

Specifically, has there ever been a C++ compiler on a system where bytes weren't 8 bits? If so, when was it last updated?

  • There were/are C++ compilers for the PDP-10 (9-bit bytes). Those haven't been maintained, AFAICT, but there are C++ compilers for various DSPs, where the smallest unit of access is 16 or 32 bits, that are still being sold.

As a person who designed and built a hobby CPU with a sixteen-bit byte, I’m not sure how I feel about this proposal.

JF Bastien is a legend for this, haha.

I would be amazed if there's any even remotely relevant code that deals meaningfully with CHAR_BIT != 8 these days.

(... and yes, it's about time.)

  • Here's a bit of 40 year old code I wrote which originally ran on 36-bit PDP-10 machines, but will work on non-36 bit machines.[1] It's a self-contained piece of code to check passwords for being obvious. This will detect any word in the UNIX dictionary, and most English words, using something that's vaguely like a Bloom filter.

    This is so old it predates ANSI C; it's in K&R C. I've seen copies of it on various academic sites over the years, but it's now obsolete enough to have finally scrolled off Google.

    I think we can dispense with non 8-bit bytes at this point.

    [1] https://animats.com/source/obvious/obvious.c

  • DSP chips are a common exception that people bring up. I think some TI-made ones have 64-bit chars.

    Edit: I see TFA mentions them but questions how relevant C++ is in that sort of embedded environment.

  • The tms320c28x DSPs have 16 bit char, so e.g. the Opus audio codec codebase works with 16-bit char (or at least it did at one point -- I wouldn't be shocked if it broke from time to time, since I don't think anyone runs regression tests on such a platform).

    For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::

    I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.

    Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.

    Might it be better instead to migrate many users of char to (u)int8_t?

    The proposed alternative of CHAR_BIT congruent to 0 mod 8 also sounds pretty reasonable, in that it captures the existing non-8-bit char platforms and also the justification for non-8-bit char platforms (that if you're not doing much string processing but instead doing all math processing, the additional hardware for efficient 8 bit access is a total waste).
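
    One wrinkle with "just use (u)int8_t" (a general observation, not specific to any DSP toolchain): the exact-width types are optional precisely on these platforms, while the least-width types always exist, so code meant to survive CHAR_BIT != 8 tends to look something like this sketch.

      #include <climits>
      #include <cstdint>

      // uint_least8_t must exist everywhere; on a 16-bit-char DSP it is simply
      // 16 bits wide. uint8_t, by contrast, is only defined when CHAR_BIT == 8.
      using octet_ish = std::uint_least8_t;

      #if CHAR_BIT == 8
      // Exact 8-bit types are available here.
      using octet = std::uint8_t;
      #endif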

Why? Pls no. We've been told (in school!) that a byte is a byte; it's only sometimes 8 bits long (ok, most of the time these days). Do not destroy the last bits of fun. Is network order little-endian too?

Ignoring this C++ proposal, especially because C and C++ seem like a complete nightmare when it comes to this stuff, I've almost gotten into the habit of treating a "byte" as an abstract concept. Many serial protocols will define a "byte", and it might be 7, 8, 9, 11, 12, or however many bits long.

I wish I knew what a 9-bit byte means.

One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html

  • A 9-bit byte is found on 36-bit machines in quarter-word mode.

    Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.

    Punched cards have their very own encodings, only of historical interest.

    • >A 9-bit byte is found on 36-bit machines in quarter-word mode.

      I've only programmed in high-level programming languages on 8-bit-byte machines. I can't understand what you mean by this sentence.

      So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?

      If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?

      This is all very confusing.

The fact that this isn't already done after all these years is one of the reasons why I no longer use C/C++. It takes years and years to get anything done, even the tiniest, most obvious, drama-free changes. Contrast with Go, which has had this since version 1, in 2012:

https://pkg.go.dev/builtin@go1#byte

In a char, not in a byte. Byte != char