This repository was archived by the owner on Aug 15, 2025. It is now read-only.
char specification mismatch with implementation #789
Open
Description
char in the specification is described as being encoded as a u32:
Line 58 in 55fd029
> - `char` is encoded as a 32-bit unsigned integer representing its Unicode Scalar Value
But the actual implementation appears to encode and decode them as variable-length UTF-8 byte sequences (1 to 4 bytes per `char`):
- Encode:
Lines 290 to 294 in 55fd029
    impl Encode for char {
        fn encode<E: Encoder>(&self, encoder: &mut E) -> Result<(), EncodeError> {
            encode_utf8(encoder.writer(), *self)
        }
    }
\/
Lines 325 to 349 in 55fd029
    fn encode_utf8(writer: &mut impl Writer, c: char) -> Result<(), EncodeError> {
        let code = c as u32;

        if code < MAX_ONE_B {
            writer.write(&[c as u8])
        } else if code < MAX_TWO_B {
            let mut buf = [0u8; 2];
            buf[0] = ((code >> 6) & 0x1F) as u8 | TAG_TWO_B;
            buf[1] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        } else if code < MAX_THREE_B {
            let mut buf = [0u8; 3];
            buf[0] = ((code >> 12) & 0x0F) as u8 | TAG_THREE_B;
            buf[1] = ((code >> 6) & 0x3F) as u8 | TAG_CONT;
            buf[2] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        } else {
            let mut buf = [0u8; 4];
            buf[0] = ((code >> 18) & 0x07) as u8 | TAG_FOUR_B;
            buf[1] = ((code >> 12) & 0x3F) as u8 | TAG_CONT;
            buf[2] = ((code >> 6) & 0x3F) as u8 | TAG_CONT;
            buf[3] = (code & 0x3F) as u8 | TAG_CONT;
            writer.write(&buf)
        }
    }

- Decode:
Lines 425 to 452 in 55fd029
    impl<Context> Decode<Context> for char {
        fn decode<D: Decoder<Context = Context>>(decoder: &mut D) -> Result<Self, DecodeError> {
            let mut array = [0u8; 4];

            // Look at the first byte to see how many bytes must be read
            decoder.reader().read(&mut array[..1])?;

            let width = utf8_char_width(array[0]);
            if width == 0 {
                return Err(DecodeError::InvalidCharEncoding(array));
            }
            // Normally we have to `.claim_bytes_read` before reading, however in this
            // case the amount of bytes read from `char` can vary wildly, and it should
            // only read up to 4 bytes too much.
            decoder.claim_bytes_read(width)?;
            if width == 1 {
                return Ok(array[0] as char);
            }

            // read the remaining pain
            decoder.reader().read(&mut array[1..width])?;
            let res = core::str::from_utf8(&array[..width])
                .ok()
                .and_then(|s| s.chars().next())
                .ok_or(DecodeError::InvalidCharEncoding(array))?;
            Ok(res)
        }
    }
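The decoder above relies on `utf8_char_width`, whose body is not quoted in this issue. As a rough sketch of what such a helper typically does, the function below maps a leading byte to the expected sequence length according to the UTF-8 encoding rules. The name `utf8_char_width_sketch` is hypothetical, and the repository's own helper may be implemented differently (for example, as a lookup table).

```rust
/// Hypothetical stand-in for the `utf8_char_width` helper used by the decoder.
/// Returns the total length in bytes of a UTF-8 sequence that starts with
/// `first`, or 0 if `first` cannot begin a valid sequence.
fn utf8_char_width_sketch(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1, // ASCII, single byte
        0xC2..=0xDF => 2, // leading byte of a two-byte sequence
        0xE0..=0xEF => 3, // leading byte of a three-byte sequence
        0xF0..=0xF4 => 4, // leading byte of a four-byte sequence
        _ => 0,           // continuation bytes and invalid leading bytes
    }
}
```

Under this sketch, a return value of 0 corresponds to the `InvalidCharEncoding` branch in the decoder, and any other value is the number of bytes the decoder claims and reads.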
I assume this is a bug in the specification, and if so, it would be helpful to have it rectified.
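To make the mismatch concrete, here is a minimal sketch using only the Rust standard library (not this crate). It assumes that "32-bit unsigned integer" means a fixed-width little-endian `u32`, which is only one possible reading of the specification; the function names `encode_as_u32_le` and `encode_as_utf8` are hypothetical and exist purely for illustration.

```rust
/// One reading of the specification: the Unicode Scalar Value as a
/// fixed-width, little-endian u32 (always 4 bytes). This is an assumption
/// about the spec wording, not the crate's behavior.
fn encode_as_u32_le(c: char) -> [u8; 4] {
    (c as u32).to_le_bytes()
}

/// What the quoted implementation appears to produce: the UTF-8 byte
/// sequence for the character (1 to 4 bytes).
fn encode_as_utf8(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For `'é'` (U+00E9), `encode_as_u32_le` gives `[0xE9, 0x00, 0x00, 0x00]`, while `encode_as_utf8` gives `[0xC3, 0xA9]`, so the two encodings differ in both length and content for any non-ASCII character.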