Character encoding utilities
This module provides a convenience layer on top of the iconv module, as well as automatic detection of character encoding schemes. It assumes you are using UTF-8 internally for your strings (you can use the utf8 module to give strings UTF-8 semantics as well). Given that, all you need to do is specify the external encoding you are working with.
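The following are direct analogs of the R5RS file procedures (open-input-file, call-with-input-file, with-input-from-file, and their output counterparts), each taking an additional ENC argument that names the external encoding: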
- (open-encoded-input-file FILE ENC) procedure
- (call-with-encoded-input-file FILE ENC PROC) procedure
- (with-input-from-encoded-file FILE ENC THUNK) procedure
- (open-encoded-output-file FILE ENC) procedure
- (call-with-encoded-output-file FILE ENC PROC) procedure
- (with-output-to-encoded-file FILE ENC THUNK) procedure
(use charconv)
(with-input-from-encoded-file "/usr/share/edict/edict" "EUC-JP" read-line)
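As a slightly larger illustration, the input and output procedures can be combined to re-encode a file. The following is a minimal sketch, not part of the egg: transcode-file is a hypothetical helper name, and it assumes read-line and write-line from CHICKEN's extras unit work on the ports these procedures return.

```scheme
(use charconv extras)

;; Hypothetical helper: copy IN-FILE (encoded as IN-ENC) to OUT-FILE
;; (encoded as OUT-ENC), one line at a time. Strings are UTF-8 while
;; they are inside the program.
(define (transcode-file in-file in-enc out-file out-enc)
  (call-with-encoded-input-file in-file in-enc
    (lambda (in)
      (call-with-encoded-output-file out-file out-enc
        (lambda (out)
          (let loop ((line (read-line in)))
            (unless (eof-object? line)
              ;; write-line appends a newline after each line
              (write-line line out)
              (loop (read-line in)))))))))
```

Because the copy goes through read-line and write-line, line endings are normalized to a single newline; a byte-exact copy would need a different loop.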
- (read-encoded-string ENC [N [PORT]]) procedure
An analog of read-string in which N is a byte count (not a character count). It may read a few additional bytes so that the result ends on a character boundary. If you really want exactly N bytes regardless of character boundaries, you should combine read-string with ces-convert below.
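The two approaches described above can be sketched side by side. This is an illustration under assumptions: read-prefix and read-prefix-exact are hypothetical helper names, the file is opened as an ordinary (unencoded) port on the assumption that read-encoded-string performs the conversion itself, and the target encoding is passed to ces-convert explicitly as "UTF-8" rather than relying on the default.

```scheme
(use charconv extras)

;; Hypothetical helper: decode roughly the first N bytes of FILE from ENC.
;; read-encoded-string may consume a few extra bytes to finish the last
;; character.
(define (read-prefix file enc n)
  (call-with-input-file file
    (lambda (port)
      (read-encoded-string enc n port))))

;; Hypothetical helper: read exactly N bytes, then convert them. If byte N
;; falls inside a multi-byte character, the tail may not convert cleanly.
(define (read-prefix-exact file enc n)
  (call-with-input-file file
    (lambda (port)
      (ces-convert (read-string n port) enc "UTF-8"))))
```

The first form is usually the safer choice for text; the second only makes sense when the byte count itself is significant, for example with fixed-width records.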
The following are copied from the Gauche API. CES stands for Character Encoding Scheme.
- (ces-equivalent? CES-A CES-B) procedure
Returns #t if CES-A and CES-B are equivalent (aliases), #f otherwise.
- (ces-upper-compatible? CES-A CES-B) procedure
Returns #t if a string encoded in CES-B can be considered a string in CES-A without conversion.
- (ces-convert STR FROM [TO]) procedure
Returns a new string containing STR converted from encoding FROM to encoding TO.
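One way these three procedures might be used together is to skip a conversion when it is not needed. A minimal sketch, assuming the semantics described above; ->utf8 is a hypothetical helper name, not part of the egg.

```scheme
(use charconv)

;; Hypothetical helper: return STR, known to be encoded in ENC, as UTF-8.
;; If data in ENC is already valid as UTF-8 (for example when ENC is an
;; alias of UTF-8), the string is returned unchanged.
(define (->utf8 str enc)
  (if (or (ces-equivalent? "UTF-8" enc)
          (ces-upper-compatible? "UTF-8" enc))
      str
      (ces-convert str enc "UTF-8")))
```

The TO argument is given explicitly here so the sketch does not depend on whatever default ces-convert uses when it is omitted.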
- (detect-file-encoding FILE [LOCALE]) procedure
- (detect-encoding STRING [LOCALE]) procedure
The detection procedures can correctly identify most common families of encodings, such as UTF-8/16/32, EUC-*, ISO-2022-*, Shift_JIS, or single-byte, without being given a locale. However, they currently include no statistical or linguistic routines, so they cannot distinguish between EUC-JP and EUC-KR, or between any of the single-byte encodings (including ISO-8859-*). In these cases you can specify a locale to break the tie: for example, when a single-byte encoding is detected, a "de" locale yields the default German single-byte encoding, ISO-8859-1.
The detect-file-encoding procedure also recognizes the Emacs-style
-*- coding: foo -*-
signature in either of the first two lines.
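To see how a locale hint changes the result, the detector can be called with and without one. This is a minimal sketch: report-encoding is a hypothetical helper name, and passing the locale as a lowercase string such as "ja" is an assumption based on the "de" example above.

```scheme
(use charconv)

;; Hypothetical helper: print what the detector reports for FILE,
;; first with no locale and then with the LOCALE hint.
(define (report-encoding file locale)
  (print file " (no locale): " (detect-file-encoding file))
  (print file " (" locale "): " (detect-file-encoding file locale)))

;; Example call (the path is the one used earlier in this document):
;; (report-encoding "/usr/share/edict/edict" "ja")
```

For a file in one of the EUC encodings, the two reported values may differ, since the locale is what lets the detector choose between EUC-JP and EUC-KR.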
You can also use the automatic detection implicitly in the input procedures by specifying an encoding of "*" or "*<LOCALE>". For example,
(open-encoded-input-file file "*")   ; guess with no locale
(open-encoded-input-file file "*DE") ; guess with a German locale
For compatibility with the Gauche convention, the encoding "*JP" is equivalent to "*JA", the Japanese locale.
Version history
- 1.3.3 Fixed missing (require-library srfi-69) (reported by Hugo Arregui)
- 1.2 Fixed a bug in pad-euc-input. An error is now signalled when trying to wrap a port with an unknown encoding.
- 1.1 Adapted to SRFI-69-compatible hash-tables
- 1.0 Initial release
License
Copyright (c) 2004-2005, Alex Shinn
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the author nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.