This repository was archived by the owner on Aug 15, 2025. It is now read-only.

char specification mismatch with implementation #789

@InKryption

The specification describes `char` as being encoded as a `u32`:

- `char` is encoded as a 32-bit unsigned integer representing its Unicode Scalar Value

However, the actual implementation appears to encode and decode a `char` as its UTF-8 byte sequence (1 to 4 bytes):

  • Encode:

    bincode/src/enc/impls.rs

    Lines 290 to 294 in 55fd029

    impl Encode for char {
        fn encode<E: Encoder>(&self, encoder: &mut E) -> Result<(), EncodeError> {
            encode_utf8(encoder.writer(), *self)
        }
    }

    \/

    bincode/src/enc/impls.rs

    Lines 325 to 349 in 55fd029

    fn encode_utf8(writer: &mut impl Writer, c: char) -> Result<(), EncodeError> {
        let code = c as u32;
        if code < MAX_ONE_B {
            writer.write(&[c as u8])
        } else if code < MAX_TWO_B {
            let mut buf = [0u8; 2];
            buf[0] = ((code >> 6) & 0x1F) as u8 | TAG_TWO_B;
            buf[1] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        } else if code < MAX_THREE_B {
            let mut buf = [0u8; 3];
            buf[0] = ((code >> 12) & 0x0F) as u8 | TAG_THREE_B;
            buf[1] = ((code >> 6) & 0x3F) as u8 | TAG_CONT;
            buf[2] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        } else {
            let mut buf = [0u8; 4];
            buf[0] = ((code >> 18) & 0x07) as u8 | TAG_FOUR_B;
            buf[1] = ((code >> 12) & 0x3F) as u8 | TAG_CONT;
            buf[2] = ((code >> 6) & 0x3F) as u8 | TAG_CONT;
            buf[3] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        }
    }

  • Decode:

    bincode/src/de/impls.rs

    Lines 425 to 452 in 55fd029

    impl<Context> Decode<Context> for char {
        fn decode<D: Decoder<Context = Context>>(decoder: &mut D) -> Result<Self, DecodeError> {
            let mut array = [0u8; 4];
            // Look at the first byte to see how many bytes must be read
            decoder.reader().read(&mut array[..1])?;
            let width = utf8_char_width(array[0]);
            if width == 0 {
                return Err(DecodeError::InvalidCharEncoding(array));
            }
            // Normally we have to `.claim_bytes_read` before reading, however in this
            // case the amount of bytes read from `char` can vary wildly, and it should
            // only read up to 4 bytes too much.
            decoder.claim_bytes_read(width)?;
            if width == 1 {
                return Ok(array[0] as char);
            }
            // read the remaining pain
            decoder.reader().read(&mut array[1..width])?;
            let res = core::str::from_utf8(&array[..width])
                .ok()
                .and_then(|s| s.chars().next())
                .ok_or(DecodeError::InvalidCharEncoding(array))?;
            Ok(res)
        }
    }
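The decoder above relies on `utf8_char_width`, whose body is not quoted here. As a rough sketch of what such a lookup does, the hypothetical `lead_byte_width` below maps a UTF-8 lead byte to the total sequence length, returning 0 for bytes that cannot start a sequence. The name and the `match` form are my own; I have not checked that bincode's helper is written this way.

```rust
/// Hypothetical stand-in for a lead-byte width lookup (not bincode's code).
/// Returns the total length of the UTF-8 sequence that starts with `b`,
/// or 0 if `b` cannot start a valid sequence.
fn lead_byte_width(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2, // two-byte lead (0xC0 and 0xC1 would be overlong)
        0xE0..=0xEF => 3, // three-byte lead
        0xF0..=0xF4 => 4, // four-byte lead, up to U+10FFFF
        _ => 0,           // continuation bytes and invalid leads
    }
}

fn main() {
    for b in [0x61u8, 0xC3, 0xE2, 0xF0, 0x80, 0xFF] {
        println!("{:#04x} -> width {}", b, lead_byte_width(b));
    }
}
```

A width of 0 corresponds to the `InvalidCharEncoding` early return in the decoder.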

I assume this is a bug in the specification, and if so, it would be helpful to have it rectified.
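To make the difference concrete, here is a small standard-library-only sketch comparing the two byte layouts for a few characters. The helper names are hypothetical, and the fixed-width little-endian `u32` layout is an assumption chosen for illustration; the bytes a real `u32` encoding would produce depend on the configured integer encoding.

```rust
/// UTF-8 bytes of `c`, produced by the standard library (1 to 4 bytes).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Unicode Scalar Value of `c` as a u32.
/// Assumption: fixed-width, little-endian layout, for illustration only.
fn u32_le_bytes(c: char) -> [u8; 4] {
    (c as u32).to_le_bytes()
}

fn main() {
    for c in ['a', 'é', '€', '😀'] {
        println!(
            "{:?}: utf8 = {:02x?}, u32 = {:02x?}",
            c,
            utf8_bytes(c),
            u32_le_bytes(c)
        );
    }
}
```

For `'é'` (U+00E9), the UTF-8 form is the two bytes `c3 a9`, while the `u32` form under this assumption is `e9 00 00 00`, so a decoder written against the specification would not interoperate with the quoted implementation.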
