accents-substitute
Description
Substitutes accented characters (Latin-1 and UTF-8) in strings with either unaccented ASCII characters or HTML entities.
The currently supported accented characters for both Latin-1 and UTF-8 are: ã, Ã, á, Á, â, Â, à, À, ä, Ä, é, É, ê, Ê, è, È, ë, Ë, í, Í, î, Î, ì, Ì, ï, Ï, õ, Õ, ó, Ó, ô, Ô, ò, Ò, ö, Ö, ú, Ú, û, Û, ù, Ù, ü, Ü, ç and Ç.
The following characters are supported in UTF-8 only: İ, ı, Ğ, ğ, Ş, ş.
Author
Mario Domenech Goulart
Repository
https://github.com/mario-goulart/accents-substitute
Requirements
None
Usage
This extension provides three modules: accents-substitute, accents-substitute-latin1 and accents-substitute-utf8.
If you want to replace accented characters in Latin-1 strings, use:
(require-extension accents-substitute-latin1)
(accents-substitute "ação") => "acao"
(accents-substitute "ação" mode: 'html) => "a&ccedil;&atilde;o"
If you want to replace accented characters in UTF-8 strings, use:
(require-extension accents-substitute-utf8)
(accents-substitute "ação") => "acao"
(accents-substitute "ação" mode: 'html) => "a&ccedil;&atilde;o"
If you want to replace both Latin-1 and UTF-8 accents, use the accents-substitute module, which exports both the accents-substitute-latin1 and accents-substitute-utf8 procedures.
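As an illustration of the combined module, the sketch below picks one of the two exported procedures based on an encoding symbol. The helper name strip-accents and its encoding argument are made up for this example and are not part of the egg; the expected result assumes the source file is saved as UTF-8.

```scheme
(require-extension accents-substitute)

;; strip-accents is a hypothetical helper, not provided by the egg.
;; ENCODING is expected to be the symbol latin1 or utf8.
(define (strip-accents str encoding)
  (case encoding
    ((latin1) (accents-substitute-latin1 str))
    ((utf8)   (accents-substitute-utf8 str))
    (else (error "unsupported encoding" encoding))))

(strip-accents "ação" 'utf8) ; => "acao"
```

Dispatching on a symbol keeps the choice of encoding in one place, which is essentially what the command line tool in the Example section does with its --encoding option.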
Procedures
Modules accents-substitute-latin1 and accents-substitute-utf8
accents-substitute
- [procedure] (accents-substitute str #!key mode)
Replaces accented characters in str with unaccented ASCII characters (if mode is not given or is given as 'ascii) or with HTML entities (if mode is given as 'html).
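A minimal sketch of the mode keyword, applied to a list of strings. It assumes the accents-substitute-utf8 module is installed and that the source file is saved as UTF-8; the variable name words is illustrative only.

```scheme
(require-extension accents-substitute-utf8)

;; `words` is an illustrative name, not part of the egg.
(define words '("café" "naïve" "ação"))

;; Default mode: equivalent to passing mode: 'ascii.
(map accents-substitute words)
;; => ("cafe" "naive" "acao")

;; Explicit ASCII mode gives the same result as the default.
(map (lambda (w) (accents-substitute w mode: 'ascii)) words)

;; HTML mode: accented characters are replaced by HTML entities.
(map (lambda (w) (accents-substitute w mode: 'html)) words)
```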
Module accents-substitute
This is a convenience module that exports the accents-substitute procedure from both the accents-substitute-latin1 and accents-substitute-utf8 modules, each renamed after the module it comes from.
accents-substitute-latin1
- [procedure] (accents-substitute-latin1 str #!key mode)
Replaces Latin-1 accented characters in str with unaccented ASCII characters (if mode is not given or is given as 'ascii) or with HTML entities (if mode is given as 'html).
accents-substitute-utf8
- [procedure] (accents-substitute-utf8 str #!key mode)
Replaces UTF-8 accented characters in str with unaccented ASCII characters (if mode is not given or is given as 'ascii) or with HTML entities (if mode is given as 'html).
Example
Below you can see the code of a practical command line tool which uses accents-substitute.
Here's how to use it:
Usage: accents-substitute [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]

Default values:
 mode: ascii
 encoding: utf8
Here's the code:
#!/bin/sh
#| -*- scheme -*-
exec csi -s $0 "$@"
|#

(use (rename accents-substitute-latin1
             (accents-substitute accents-substitute-latin1))
     (rename accents-substitute-utf8
             (accents-substitute accents-substitute-utf8)))
(use posix regex (srfi 1 13))

(define (command-line-argument option args)
  ;; Return the argument associated to the command line option OPTION
  ;; in ARGS or #f if OPTION is not found in ARGS or doesn't have any
  ;; argument.
  (let ((val (any (cut string-match (string-append option "=(.*)") <>) args)))
    (and val (cadr val))))

(define (usage #!optional exit-code)
  (print "Usage: " (pathname-strip-directory (program-name))
         " [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]")
  (print "\nDefault values:\n"
         " mode: ascii\n"
         " encoding: utf8")
  (when exit-code (exit exit-code)))

(let* ((args (command-line-arguments))
       (mode (command-line-argument "--mode" args))
       (encoding (command-line-argument "--encoding" args))
       (paramless-args (remove (cut string-prefix? "--" <>) args))
       (accents-substitute accents-substitute-utf8))

  (when (or (member "-h" args)
            (member "--help" args))
    (usage 0))

  (when (and encoding (not (member encoding '("utf8" "latin1"))))
    (print "'" encoding "' is not a valid encoding.")
    (exit 1))

  (when (and mode (not (member mode '("ascii" "html"))))
    (print "'" mode "' is not a valid mode.")
    (exit 1))

  (when (equal? encoding "latin1")
    (set! accents-substitute accents-substitute-latin1))

  (let ((port (if (null? paramless-args)
                  (current-input-port)
                  (open-input-file (car paramless-args)))))
    (let loop ()
      (let ((line (read-line port)))
        (unless (eof-object? line)
          (print (accents-substitute line mode: (and mode (string->symbol mode))))
          (loop))))
    (unless (null? paramless-args)
      (close-input-port port))))
License
Copyright (c) 2010-2018, Mario Domenech Goulart
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The name of the authors may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Version history
Version 0.7
- Drop dependency on the regex egg (use irregex unit)
- CHICKEN 5 support
Version 0.6
- Fix .release-info
Version 0.5
- Fix build order of modules
Version 0.4
- Compiled with -O3
- Install the accents-substitute module, which provides procedures to substitute Latin-1 and UTF-8 accented-characters (as a side-effect, this change also fixes the chicken-install reinstallation problem)
Version 0.3
- Added UTF-8 support for Turkish characters (İ, ı, Ğ, ğ, Ş, ş). Thanks to Mehmet Köse.
Version 0.2
- Use precompiled regexes for html mode (a lot faster). Added regex requirement for compatibility with CHICKEN >= 4.6.2.
Version 0.1
- Initial release