chickadee » chicken » data-representation

Data representation

There exist two different kinds of data objects in the CHICKEN system: immediate and non-immediate objects.

Immediate objects

Immediate objects are represented by a single machine word, 32 or 64 bits depending on the architecture. They come in four different flavors:

fixnums, that is, small exact integers, where the lowest order bit is set to 1. This gives fixnums a range of 31 bits for the actual numeric value (63 bits on 64-bit architectures).

characters, where the four lowest-order bits are equal to C_CHARACTER_BITS, currently 1010. The Unicode code point of the character is encoded in the next 24 bits.

booleans, where the four lowest-order bits are equal to C_BOOLEAN_BITS, currently 0110. The next bit is one for #t and zero for #f.

other values: the empty list, the value of unbound identifiers, the undefined value (void), and end-of-file. The four lowest-order bits are equal to C_SPECIAL_BITS, currently 1110. The next four bits contain an identifying number for this type of object, one of: C_SCHEME_END_OF_LIST, currently 0000; C_SCHEME_UNDEFINED, currently 0001; C_SCHEME_UNBOUND, currently 0010; or C_SCHEME_END_OF_FILE, currently 0011.

Non-immediate objects

Collectively, the two lowest-order bits are known as the immediate mark bits. When the lowest bit is set, the object is a fixnum, as described above, and the next bit is part of its value. When the lowest bit is clear but the next bit is set, it is an immediate object other than a fixnum. If neither bit is set, the object is non-immediate, as described below.

Non-immediate objects are blocks of data represented by a pointer into the heap. The pointer's immediate mark bits must be zero to indicate the object is non-immediate; this guarantees the data block is aligned on a 4-byte boundary, at minimum. Alignment of data words is required on modern architectures anyway, so we get the ability to distinguish between immediate and non-immediate objects for free.

The first word of the data block contains a header, which gives information about the type of the object. The header is a single machine word.

The 24 (56 on 64-bit systems) lowest-order bits contain the length of the data object, which is either the number of bytes in a string or byte-vector, or the number of elements for a vector or record type. This allows a maximum size for string or byte-vectors of 2^24 bytes, or approximately 16 MB, on 32-bit systems, and 2^56 bytes, or approximately 72 PB, on 64-bit systems.

The remaining bits are placed in the high-order end of the header. The four highest-order bits are used for garbage collection or internal data type dispatching.

C_GC_FORWARDING_BIT
Flag used for forwarding garbage collected object pointers.
C_BYTEBLOCK_BIT
Flag that specifies whether this data object contains raw bytes (a string or blob) or pointers to other data objects.
C_SPECIALBLOCK_BIT
Flag that specifies whether this object contains a special non-object pointer value in its first slot. An example for this kind of objects are closures, which are a vector-type object with the code-pointer as the first item. This is also used to turn a pair's car into a weak reference in the symbol table, to allow its symbol to be collected.
C_8ALIGN_BIT
Flag that specifies whether the data area of this block should be aligned on an 8-byte boundary (floating-point numbers, for example).

After these four bits comes a 4-bit type code representing one of the following types:

vectors: vector objects with type bits C_VECTOR_TYPE, currently 0000.

symbols: vector objects with type bits C_SYMBOL_TYPE, currently 0001. The three slots contain the toplevel variable value, the print-name (a string), and the property list of the symbol. When manipulating these slots, the symbol table's container needs to be manipulated as well, to control garbage collection of the symbol: if the symbol is undefined and has no property list, the symbol table's container should be a weak reference (C_WEAK_PAIR_TYPE), otherwise it should be a strong reference (C_PAIR_TYPE).

strings: byte-vector objects with type bits C_STRING_TYPE, currently 0010.

pairs: vector-like object with type bits C_PAIR_TYPE, currently 0011. The car and the cdr are contained in the first and second slots, respectively.

closures: special vector objects with type bits C_CLOSURE_TYPE, currently 0100. The first slot contains a pointer to a compiled C function. Any extra slots contain the free variables (since a flat closure representation is used).

flonums: byte-vector objects with type bits C_FLONUM_BITS, currently 0101. Slots one and two (or a single slot on 64 bit architectures) contain a 64-bit floating-point number, in the representation used by the host system's C compiler.

bignums: special vector objects with type bits C_BIGNUM_TYPE, currently 0110. This contains only one slot, which points to a bytevector that contains the number's limbs in a special format: The first word contains a 1 if the number is negative, 0 if it is positive. The remaining words form the bignum's limbs. A bytevector is used because the limbs are stored in the raw machine format, which would be invalid Scheme objects when viewed as slots in a vector.

ports: special vector objects with type bits C_PORT_TYPE, currently 0111. The first slot contains a pointer to a file- stream, if this is a file-pointer, or NULL if not. The other slots contain housekeeping data used for this port.

structures: vector objects with type bits C_STRUCTURE_TYPE, currently 1000. The first slot contains a symbol that specifies the kind of structure this record is an instance of. The other slots contain the actual record items.

blob: a raw sequence of bytes with type bits C_BYTEVECTOR_TYPE.

pointer-vectors: vector objects of native pointers - these are actually structures where the first slot holds a blob containing the 32- or 64-bit pointer values.

locatives: special vector objects with type bits C_LOCATIVE_TYPE, currently 1010. A locative object holds 4 slots: a raw pointer to the location inside the object referred to by the locative, the offset in bytes from the start of the object referred to, the type of the location (whether it refers to an unboxed numeric value or a normal object slot that holds a pointer to Scheme data) and a flag indicating whether this locative is "weak". If the locative is non-weak, slot #4 holds a pointer to the object referred to.

pointers: special vector objects with type bits C_POINTER_TYPE, currently 1001. The single slot contains a machine pointer.

tagged pointers: special vector objects with type bits C_TAGGED_POINTER_TYPE, currently 1011, Tagged pointers are similar to pointers, but the object contains an additional slot with a tag (an arbitrary data object) that identifies the type of the pointer.

ratnums: vector-like objects with type-bits C_RATNUM_TYPE, currently 1100. The first slot contains the numerator (which can be positive or negative), the second slot contains the denominator, which is always positive. These numbers are always simplified, so their gcd will always be 1.

lambda infos: byte-vector objects with type-bits C_LAMBDA_INFO_TYPE, currently 1101.

cplxnums: vector-like objects with type-bits C_CPLXNUM_TYPE, currently 1110. The first slot contains the real part, the second slot contains the imaginary part of the complex number. These two numbers are of matching exactness: Either both are flonums or none are.

The actual data follows immediately after the header. Note that block addresses are always aligned to the native machine-word boundary.

Data objects may be allocated outside of the garbage collected heap, as long as their layout follows the above mentioned scheme. But care has to be taken not to mutate these objects with heap-data (i.e. non-immediate objects), because this will confuse the garbage collector.

For more information see the header file chicken.h.


Previous: C interface

Next: Modules