chickadee » srfi-207

SRFI-207: String-notated bytevectors

Abstract

To ease the human reading and writing of Scheme code involving binary data that for mnemonic reasons corresponds as a whole or in part to ASCII-coded text, a notation for bytevectors is defined which allows printable ASCII characters to be used literally without being converted to their corresponding integer forms. In addition, this SRFI provides a set of procedures known as the bytestring library for constructing a bytevector from a sequence of integers, characters, strings, and/or bytevectors, and for manipulating bytevectors as if they were strings as far as possible.

For more information, see: SRFI-207: String-notated bytevectors

Rationale

Binary file formats are usually not self-describing, and if they are, the descriptive portion is itself binary, which makes it hard for human beings to interpret. To assist with this problem, it is common to have a human-readable section at the beginning of the file, or in some cases at the beginning of each distinct section of the file. For historical reasons and to avoid text encoding complications, it is usual for this human-readable section to be expressed as ASCII text.

For example, ZIP files begin with the hex bytes 50 4B which are the ASCII encoding for the characters "PK", the initials of Phil Katz, the inventor of ZIP format. As another example, the GIF image format begins with 47 49 46 38 39 61, the ASCII encoding for "GIF89a", where "89a" is the format version. A third example is the PNG image format, where the file header begins 89 50 4E 47. The first byte is intentionally non-ASCII, but the next three are "PNG". Furthermore, a PNG file is divided into chunks, each of which contains a 4-byte "chunk type" code. The letters in the chunk type are mnemonics for its purpose, such as "PLTE" for a palette, "bKGD" for a default background color, and "iTXt" for descriptive text in UTF-8.

When bytevectors contain string data of this kind, it is much more tractable for human programmers to deal with them in the form #u8"recursion" than in the form #u8(114 101 99 117 114 115 105 111 110). This is true even when non-ASCII bytes are incorporated into the bytevector: the complete 8-bit PNG file header can be written as #u8"\ x89;PNG\r\n\x1A;\n" instead of #u8(0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A).

In addition, this SRFI provides bytevectors with additional procedures that closely resemble those provided for strings. For example, bytevectors can be padded or trimmed, compared case-sensitively or case-insensitively, searched, joined, and split.

In this specification it is assumed that bytevectors are as defined in R7RS-small section 6.9. Implementations may also consider them equivalent to R6RS bytevectors (R6RS 4.3.4) or SRFI 4 u8vectors, depending which kind of homogeneous vectors of unsigned 8-bit integers an implementation supports.

Specification

Most of the procedures of this SRFI begin with bytestring- in order to distinguish them from other bytevector procedures. This does not mean that they accept or return a separate bytestring type: bytestrings and bytevectors are exactly the same type.

External notation

The basic form of a string-notated bytevector is:

   #u8" content "

To avoid character encoding issues within string-notated bytevectors, only printable ASCII characters (that is, Unicode codepoints in the range from U+0020 to U+007E inclusive) are allowed to be used within the content of a string-notated bytevector. All other characters must be expressed through mnemonic or inline hex escapes, and " and \ must also be escaped as in normal Scheme strings.

Within the content of a string-notated bytevector:

   Seq.  Integer
   \a    7
   \b    8
   \t    9
   \n    10
   \r    13
   \|    124

Note: The \| sequence is provided so that string parsing, symbol parsing, and string-notated bytevector parsing can all use the same sequences. However, we give a complete definition of the valid lexical syntax in this SRFI rather than inheriting the native syntax of strings, so that it is clear that #u8"ι" and #u8"\xE000;" are invalid.

When the Scheme reader encounters a string-notated bytevector, it produces a datum as if that bytevector had been written out in full. That is, #u8"A" is exactly equivalent to #u8(65).

A Scheme implementation which supports string-notated bytevectors may not by default use this notation when any of the write family of procedures is called upon a bytevector or upon another datum containing a bytevector. A future SRFI is expected to add a configurable version of the write procedure which may enable the use of this notation in this context.

Formal syntax

The formal syntax of Scheme (defined in R7RS-small 7.1) is amended as follows.

⟨string-notated bytevector⟩ → #u8" ⟨string-notated bytevector element⟩* "
⟨string-notated bytevector element⟩ → ⟨any printable ASCII character other than " or \⟩
| ⟨mnemonic escape⟩ | \" | \\
| \⟨intraline whitespace⟩*⟨line ending⟩⟨intraline whitespace⟩*
| ⟨inline hex escape⟩

Constructors

bytestring arg procedure

Converts args into a sequence of small integers and returns them as a bytevector as follows:

  • If arg is an exact integer in the range 0-255 inclusive, it is added to the result.
  • If arg is an ASCII character (that is, its codepoint is in the range 0-127 inclusive), it is converted to its codepoint and added to the result.
  • If arg is a bytevector, its elements are added to the result.
  • If arg is a string of ASCII characters, it is converted to a sequence of codepoints which are added to the result.

Otherwise, an error satisfying bytestring-error? is signaled.

Examples:

(bytestring "lor" #\r #x65 #u8(#x6d)) ⇒ #u8"lorem"
(bytestring "η" #\space #u8(#x65 #x71 #x75 #x69 #x76)) ⇒ error

(make-bytestring list)

If the elements of list are suitable arguments for bytestring, returns the bytevector that would be the result of applying bytestring to list. Otherwise, an error satisfying bytestring-error? is signaled.

(make-bytestring! bytevector at list)

If the elements of list are suitable arguments for bytestring, writes the bytes of the bytevector that would be the result of calling make-bytevector into bytevector starting at index at.

(define bstring (make-bytevector 10 #x20))
(make-bytestring! bstring 2 '(#\s #\c "he" #u8(#x6d #x65))
bstring ⇒ #u8"  scheme  "

Conversion

bytevector->hex-string bytevectorprocedure
hex-string->bytevector stringprocedure

Converts between a bytevector and a string containing pairs of hexadecimal digits. If string is not pairs of hexadecimal digits, an error satisfying bytestring-error? is raised.

(bytevector->hex-string #u8"Ford") ⇒ "467f7264"
(hex-string->bytevector "5a6170686f64") ⇒ #u8"Zaphod"
bytevector->base64 bytevector #!optional digitsprocedure
base64->bytevector string #!optional digitsprocedure

Converts between a bytevector and its base-64 encoding as a string. The 64 digits are represented by the characters 0-9, A-Z, a-z, and the symbols + and /. However, there are different variants of base-64 encoding which use different representations of the 62nd and 63rd digit. If the optional argument digits (a two-character string) is provided, those two characters will be used as the 62nd and 63rd digit instead. Details can be found in RFC 4648. If string is not in base-64 format, an error satisfying bytestring-error? is raised. However, characters that satisfy char-whitespace? are silently ignored.

(bytevector->base64 #u8(1 2 3 4 5 6)) ⇒ "AQIDBAUG"
(bytevector->base64 #u8"Arthur Dent") ⇒ "QXJ0aHVyIERlbnQ="
(base64->bytevector "+/     /+") ⇒ #u8(#xfb #xff #xfe)
bytestring->list bytevector #!optional start endprocedure

Converts all or part of a bytevector into a list of the same length containing characters for elements in the range 32 to 127 and exact integers for all other elements.

(bytestring->list #u8(#x41 #x42 1 2) 1 3) ⇒ (#\B 1)
make-bytestring-generator arg procedure

Returns a generator that when invoked will return consecutive bytes of the bytevector that bytestring would create when applied to args, but without creating any bytevectors. The args are validated before any bytes are generated; if they are ill-formed, an error satisfying bytestring-error? is raised.

(generator->list (make-bytestring-generator "lorem"))
  ⇒ (#x6c #x6f #x72 #x65 #x6d)

Selection

bytestring-pad bytevector len char-or-u8procedure
bytestring-pad-right bytevector len char-or-u8procedure

Returns a newly allocated bytevector with the contents of bytevector plus sufficient additional bytes at the beginning/end containing char-or-u8 (which can be either an ASCII character or an exact integer in the range 0-255) such that the length of the result is at least len.

(bytestring-pad #u8"Zaphod" 10 #\_) ⇒ #u8"____Zaphod"
(bytestring-pad-right #u8(#x80 #x7f) 8 0) ⇒ #u8(#x80 #x7f 0 0 0 0 0 0)
bytestring-trim bytevector predprocedure
bytestring-trim-right bytevector predprocedure
bytestring-trim-both bytevector predprocedure

Returns a newly allocated bytevector with the contents of bytevector, except that consecutive bytes at the beginning / the end / both the beginning and the end that satisfy pred are not included.

(bytestring-trim #u8"   Trillian" (lambda (b) (= b #x20)))
  ⇒ #u8"Trillian"
(bytestring-trim-both #u8(0 0 #x80 #x7f 0 0 0) zero?) ⇒ #u8(#x80 #x7f)

Replacement

(bytestring-replace bytevector1 bytevector2 start1 end1 [start2 end2])procedure

Returns a newly allocated bytevector with the contents of bytevector1, except that the bytes indexed by start1 and end1 are not included but are replaced by the bytes of bytevector2 indexed by start2 and end2.

(bytestring-replace #u8"Vogon torture" #u8"poetry" 6 13)
  ⇒ #u8"Vogon poetry"

Comparison

To compare bytevectors for equality, use the procedure bytevector=? from either the R6RS library (rnrs bytevectors) or the equivalent R7RS library (scheme bytevector).

bytestring<? bytevector1 bytevector2procedure
bytestring>? bytevector1 bytevector2procedure
bytestring<=? bytevector1 bytevector2procedure
bytestring>=? bytevector1 bytevector2procedure

Returns #t if bytevector1 is less than / greater than / less than or equal to / greater than or equal to bytevector2. Comparisons are lexicographical: shorter bytevectors compare before longer ones, all elements being equal.

(bytestring<? #u8"Heart Of Gold" #u8"Heart of Gold") ⇒ #t
(bytestring<=? #u8(#x81 #x95) #u8(#x80 #xa0)) ⇒ #f
(bytestring>? #u8(1 2 3) #u8(1 2)) ⇒ #t

Searching

bytestring-index bytevector pred #!optional start endprocedure
bytestring-index-right bytevector pred #!optional start endprocedure

Searches bytevector from start to end / from end to start for the first byte that satisfies pred, and returns the index into bytevector containing that byte. In either direction, start is inclusive and end is exclusive. If there are no such bytes, returns #f.

(bytestring-index #u8(#x65 #x72 #x83 #x6f) (lambda (b) (> b #x7f))) ⇒ 2
(bytestring-index #u8"Beeblebrox" (lambda (b) (> b #x7f))) ⇒ #f
(bytestring-index-right #u8"Zaphod" odd?) ⇒ 4
bytestring-break bytevector predprocedure
bytestring-span bytevector predprocedure

Returns two values, a bytevector containing the maximal sequence of characters (searching from the beginning of bytevector to the end) that do not satisfy / do satisfy pred, and another bytevector containing the remaining characters.

(bytestring-break #u8(#x50 #x4b 0 0 #x1 #x5) zero?)
  ⇒ #u8(#x50 #x4b)
    #u8(0 0 #x1 #x5)
(bytestring-span #u8"ABCDefg" (lambda (b) (and (> b 40) (< b 91))))
  ⇒ #u8"ABCD"
    #u8"efg"

Joining and splitting

bytestring-join bytevector-list delimiter #!optional grammarprocedure

Pastes the bytevectors in bytevector-list together using the delimiter, which can be anything suitable as an argument to bytestring. The grammar argument is a symbol that determines how the delimiter is used, and defaults to infix. It is an error for grammar to be any symbol other than these four:

  • infix means an infix or separator grammar: inserts the delimiter between list elements. An empty list will produce an empty bytevector.
  • strict-infix means the same as infix if the list is non-empty, but will signal an error satisfying bytestring-error? if given an empty list.
  • suffix means a suffix or terminator grammar: inserts the delimiter after every list element.
  • prefix means a prefix grammar: inserts the delimiter before every list element.
(bytestring-join '(#u8"Heart" #u8"of" #u8"Gold") #x20) ⇒ #u8"Heart of Gold"
(bytestring-join '(#u8(#xef #xbb) #u8(#xbf)) 0 'prefix) ⇒ #u8(0 #xef #xbb 0 #xbf)
(bytestring-join '() 0 'strict-infix) ⇒ error
bytestring-split bytevector delimiter #!optional grammarprocedure

Divides the elements of bytevector and returns a list of newly allocated bytevectors using the delimiter (an ASCII character or exact integer in the range 0-255 inclusive). Delimiter bytes are not included in the result bytevectors.

The grammar argument is used to control how bytevector is divided. It has the same default and meaning as in bytestring-join, except that infix and strict-infix mean the same thing. That is, if grammar is prefix or suffix, then ignore any delimiter in the first or last position of bytevector respectively.

(bytestring-split #u8"Beeblebrox" #x62) ⇒ (#u8"Bee" #u8"le" #u8"rox")
(bytestring-split #u8(1 0 2 0) 0 'suffix) ⇒ (#u8(1) #u8(2))

I/O

read-textual-bytestring prefix #!optional portprocedure

Reads a string in the external format described in this SRFI from port and return it as a bytevector. If the prefix argument is false, this procedure assumes that "#u8" has already been read from port. If port is omitted, it defaults to the value of (current-input-port). If the characters read are not in the external format, an error satisfying bytestring-error? is raised.

(call-with-port (open-input-string "#u8\"AB\\xad;\\xf0;\\x0d;CD\"")
                (lambda (port)
                  (read-textual-bytestring #t port)))
  ⇒ #u8(#x41 #x42 #xad #xf0 #x0d #x43 #x44)
write-textual-bytestring bytevector #!optional portprocedure

Writes bytevector in the external format described in this SRFI to port. Bytes representing non-graphical ASCII characters are unencoded: all other bytes are encoded with a single letter if possible, otherwise with a \x escape. If port is omitted, it defaults to the value of (current-output-port).

(call-with-port (open-output-string)
                (lambda (port)
                  (write-textual-bytestring
                   #u8(#x9 #x41 #x72 #x74 #x68 #x75 #x72 #xa)
                   port)
                  (get-output-string port)))
  ⇒ "#u8\"\\tArthur\\n\""
write-binary-bytestring port arg procedure

Outputs each arg to the binary output port port using the same interpretations as bytestring, but without creating any bytevectors. The args are validated before any bytes are written to port; if they are ill-formed, an error satisfying bytestring-error? is raised.

(call-with-port (open-output-bytevector)
                (lambda (port)
                  (write-binary-bytestring port #\Z #x61 #x70 "hod")
                  (get-output-bytevector port)))
  ⇒ #u8"Zaphod"

Exception

bytestring-error? objprocedure

Returns #t if obj is an object signaled by any of the following procedures, in the circumstances described above:

Implementation

There is a sample implementation of the procedures, but not the notation, in the repository of this SRFI.

Note: The String-notated bytevector read syntax is not currently supported in Chicken Scheme.

Acknowledgements

Daphne Preston-Kendal devised the string notation for bytevectors; John Cowan, the procedure library; Wolfgang Corcoran-Mathe, the sample implementation of the procedures.

The notation is inspired by the notation used in Python since version 2.6 for bytes objects, which are fundamentally similar in purpose to Scheme bytevectors, especially in R7RS. In addition, many of the procedures are closely analogous to those of SRFI 152.

Thanks is also due to the participants in the SRFI mailing list. In particular: Lassi Kortela corrected an embarrassing technical error; Marc Nieper-Wißkirchen explained why the write procedure ought not to be allowed to use this notation by default.

Authors

by Daphne Preston-Kendal (external notation), John Cowan (procedure design), Wolfgang Corcoran-Mathe (implementation)

Ported to Chicken Scheme 5 by Sergey Goldgaber

Copyright

© 2020 Daphne Preston-Kendal, John Cowan, and Wolfgang Corcoran-Mathe.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Version history

0.1 - Ported to Chicken Scheme 5

Contents »