
Outdated egg!

This is an egg for CHICKEN 4, the unsupported old release. You're almost certainly looking for the CHICKEN 5 version of this egg, if it exists.

If it does not exist, there may be equivalent functionality provided by another egg; have a look at the egg index. Otherwise, please consider porting this egg to the current version of CHICKEN.

charconv

Description

Character encoding utilities

Author

Alex Shinn

Requirements

Documentation

This module provides a convenience layer on top of the iconv module, as well as automatic detection of character encoding schemes. It implicitly assumes you are using UTF-8 internally for your strings (you can use the utf8 module to change string semantics to use UTF-8 as well). Given that, all you need to do is specify the external encoding you are working with.

Input/output procedures

The following are direct analogs of the equivalent R5RS procedures:

open-encoded-input-file FILE ENC [procedure]
call-with-encoded-input-file FILE ENC PROC [procedure]
with-input-from-encoded-file FILE ENC THUNK [procedure]
open-encoded-output-file FILE ENC [procedure]
call-with-encoded-output-file FILE ENC PROC [procedure]
with-output-to-encoded-file FILE ENC THUNK [procedure]

Example:

(use charconv)
(with-input-from-encoded-file "/usr/share/edict/edict" "EUC-JP" read-line)
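
The output procedures can be used the same way. The following is a minimal sketch, assuming they mirror their R5RS counterparts as stated above and that the source file itself is saved as UTF-8. The path /tmp/charconv-demo.txt is a hypothetical scratch file, not something the egg provides.

```scheme
(use charconv extras)

;; Hypothetical scratch file used only for this illustration.
(define demo-file "/tmp/charconv-demo.txt")

;; Write: the string is UTF-8 in memory and should be stored as EUC-JP.
(with-output-to-encoded-file demo-file "EUC-JP"
  (lambda ()
    (display "日本語")
    (newline)))

;; Read the first line back, decoding from EUC-JP to UTF-8.
(define first-line
  (call-with-encoded-input-file demo-file "EUC-JP"
    (lambda (in) (read-line in))))

(print first-line)
```

If the round trip works as described, the printed line should match the string that was written.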

read-encoded-string ENC #!optional N PORT [procedure]

An analog of read-string that takes a byte count (not a character count). It may read additional bytes to ensure the read ends on a character boundary. If you really want exactly N bytes regardless of character boundaries, you should combine read-string with ces-convert below.
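
The two approaches might look like the sketch below. It assumes PORT is an ordinary (unencoded) input port and that N is passed as a plain integer; neither detail is spelled out above. The byte count of 64 is arbitrary.

```scheme
(use charconv extras)

;; Open the file as a plain port; decoding happens in the calls below.
(define in (open-input-file "/usr/share/edict/edict"))

;; Read about 64 bytes decoded from EUC-JP. A few extra bytes may be
;; consumed so that the result ends on a character boundary.
(define chunk (read-encoded-string "EUC-JP" 64 in))

;; Alternative: read exactly 64 raw bytes, then convert them explicitly.
;; The final character may be cut off, so the conversion may be incomplete
;; or may fail, depending on how iconv treats the partial sequence.
(define raw (read-string 64 in))
(define converted (ces-convert raw "EUC-JP" "UTF-8"))

(close-input-port in)
```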

Utility procedures

The following are copied from the Gauche API. CES stands for Character Encoding Scheme.

ces-equivalent? CES-A CES-B [procedure]

Returns #t if CES-A and CES-B are equivalent (aliases), #f otherwise.

ces-upper-compatible? CES-A CES-B [procedure]

Returns #t if a string encoded in CES-B can be considered a string in CES-A without conversion.

ces-convert STR FROM #!optional TO [procedure]

Return a new string of STR converted from encoding FROM to encoding TO.
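
A short sketch of how the three utility procedures might be called. The encoding names are illustrative only; which aliases are recognized depends on the egg and the underlying iconv, so no results are shown. The source file is assumed to be saved as UTF-8.

```scheme
(use charconv)

;; Are these two names aliases for the same encoding scheme?
(define same? (ces-equivalent? "EUC-JP" "eucJP"))

;; Can a US-ASCII string be treated as UTF-8 without conversion?
;; Note the argument order: CES-B (second) is the encoding of the data.
(define compatible? (ces-upper-compatible? "UTF-8" "US-ASCII"))

;; Convert a UTF-8 string to ISO-8859-1 and back again.
(define latin1 (ces-convert "café" "UTF-8" "ISO-8859-1"))
(define utf8-again (ces-convert latin1 "ISO-8859-1" "UTF-8"))

(print same? " " compatible? " " utf8-again)
```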

Detection procedures

detect-file-encoding FILE #!optional LOCALE [procedure]
detect-encoding STRING #!optional LOCALE [procedure]

The detection procedures can correctly identify most common 'types' of encodings, such as UTF-8/16/32, EUC-*, ISO-2022-*, Shift_JIS or single-byte, without any need to specify the locale. However, they currently include no statistical or linguistic routines, so they cannot distinguish between EUC-JP and EUC-KR, or between any of the single-byte encodings (including ISO-8859-*). In these cases you can specify a locale: for example, when a single-byte encoding is detected, a "de" locale results in the default German single-byte encoding, ISO-8859-1.

The detect-file-encoding procedure also recognizes the Emacs-style

 -*- coding: foo -*-

signature in either of the first two lines.
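
The detection procedures could be called as in the following sketch. The file name notes.txt is hypothetical, and the form of the return value (for instance, an encoding name string) is an assumption, since it is not specified above.

```scheme
(use charconv extras)

;; Hypothetical file of unknown encoding.
(define path "notes.txt")

;; Guess from the file alone, then again with a German locale hint.
(define guess (detect-file-encoding path))
(define guess-de (detect-file-encoding path "de"))

;; detect-encoding works on a string that is already in memory.
;; (read-string #f in) reads the whole port as raw bytes.
(define raw
  (call-with-input-file path
    (lambda (in) (read-string #f in))))
(define guess-str (detect-encoding raw "de"))

(print guess " " guess-de " " guess-str)
```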

Automatic detection

You can also use the automatic detection implicitly in the input procedures by specifying an encoding of "*" or "*<LOCALE>". For example,

(open-encoded-input-file file "*")    ; guess with no locale
(open-encoded-input-file file "*DE")  ; guess with a German locale

For compatibility with the Gauche convention, the encoding "*JP" is equivalent to "*JA", the Japanese locale.
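
As a slightly fuller sketch, the following counts the lines of a file whose encoding is guessed with a Japanese locale hint. The file name notes.txt is hypothetical.

```scheme
(use charconv extras)

;; "*JA" requests automatic detection with a Japanese locale hint.
(define line-count
  (call-with-encoded-input-file "notes.txt" "*JA"
    (lambda (in)
      ;; Read lines until end of file, counting as we go.
      (let loop ((line (read-line in)) (n 0))
        (if (eof-object? line)
            n
            (loop (read-line in) (+ n 1)))))))

(print line-count)
```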

Changelog

License

 Copyright (c) 2004-2005, Alex Shinn
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
 conditions are met:
 
   Redistributions of source code must retain the above copyright notice, this list of conditions and the following
     disclaimer. 
   Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
     disclaimer in the documentation and/or other materials provided with the distribution. 
   Neither the name of the author nor the names of its contributors may be used to endorse or promote
     products derived from this software without specific prior written permission. 
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
 CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGE.
