accents-substitute

Description

Replaces accented characters (Latin-1 and UTF-8) in strings with either unaccented ASCII characters or HTML entities.

The accented characters currently supported for both Latin-1 and UTF-8 are: ã, Ã, á, Á, â, Â, à, À, ä, Ä, é, É, ê, Ê, è, È, ë, Ë, í, Í, î, Î, ì, Ì, ï, Ï, õ, Õ, ó, Ó, ô, Ô, ò, Ò, ö, Ö, ú, Ú, û, Û, ù, Ù, ü, Ü, ç and Ç.

The following characters are supported in UTF-8 only: İ, ı, Ğ, ğ, Ş, ş.

Author

Mario Domenech Goulart

Repository

https://github.com/mario-goulart/accents-substitute

Requirements

None

Usage

This extension provides three modules: accents-substitute, accents-substitute-latin1 and accents-substitute-utf8.

If you want to replace accented characters in Latin-1 strings, use:

(require-extension accents-substitute-latin1)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

If you want to replace accented characters in UTF-8 strings, use:

(require-extension accents-substitute-utf8)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

If you want to replace both Latin-1 and UTF-8 accents, you can use the accents-substitute module, which exports both the accents-substitute-latin1 and accents-substitute-utf8 procedures.

Procedures

Modules accents-substitute-latin1 and accents-substitute-utf8

accents-substitute

[procedure] (accents-substitute str #!key mode)

Replaces accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).

Module accents-substitute

This is a convenience module that exports the accents-substitute procedure from both the accents-substitute-latin1 and accents-substitute-utf8 modules, each renamed after the module it comes from.

accents-substitute-latin1

[procedure] (accents-substitute-latin1 str #!key mode)

Replaces Latin-1 accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).

accents-substitute-utf8

[procedure] (accents-substitute-utf8 str #!key mode)

Replaces UTF-8 accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).
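As a minimal sketch of the convenience module in use, the snippet below loads accents-substitute and calls the renamed UTF-8 procedure with each documented mode. It assumes the egg is installed and that the source file is saved as UTF-8; the expected results are the ones given for the UTF-8 module in the Usage section.

```scheme
;; Load the convenience module, which exports both renamed procedures.
(require-extension accents-substitute)

;; Default mode: accented characters become plain ASCII.
(print (accents-substitute-utf8 "ação"))               ; prints acao

;; Passing 'ascii explicitly is documented as equivalent to the default.
(print (accents-substitute-utf8 "ação" mode: 'ascii))  ; prints acao

;; 'html mode: accented characters become HTML entities.
(print (accents-substitute-utf8 "ação" mode: 'html))   ; prints a&ccedil;&atilde;o
```

For Latin-1 input, accents-substitute-latin1 takes the same arguments; the string passed to it must actually be Latin-1 encoded.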

Example

Below is the code of a practical command-line tool that uses accents-substitute. It is written in CHICKEN 4 syntax (the use form and the regex extension).

Here's how to use it:

Usage: accents-substitute [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]

Default values:
   mode: ascii
   encoding: utf8

Here's the code:

#!/bin/sh
#| -*- scheme -*-
exec csi -s "$0" "$@"
|#

(use
 (rename
  accents-substitute-latin1
  (accents-substitute accents-substitute-latin1))
 (rename
  accents-substitute-utf8
  (accents-substitute accents-substitute-utf8)))

(use posix regex (srfi 1 13))

(define (command-line-argument option args)
  ;; Return the argument associated to the command line option OPTION
  ;; in ARGS or #f if OPTION is not found in ARGS or doesn't have any
  ;; argument.
  (let ((val (any (cut string-match (string-append option "=(.*)") <>) args)))
    (and val (cadr val))))

(define (usage #!optional exit-code)
  (print "Usage: " (pathname-strip-directory (program-name))
         " [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]")
  (print "\nDefault values:\n"
         "    mode: ascii\n"
         "    encoding: utf8")
  (when exit-code (exit exit-code)))

(let* ((args (command-line-arguments))
       (mode (command-line-argument "--mode" args))
       (encoding (command-line-argument "--encoding" args))
       (paramless-args (remove (cut string-prefix? "--" <>) args))
       (accents-substitute accents-substitute-utf8))

  (when (or (member "-h" args) (member "--help" args))
    (usage 0))

  (when (and encoding (not (member encoding '("utf8" "latin1"))))
    (print "'" encoding "' is not a valid encoding.")
    (exit 1))

  (when (and mode (not (member mode '("ascii" "html"))))
    (print "'" mode "' is not a valid mode.")
    (exit 1))

  (when (equal? encoding "latin1")
    (set! accents-substitute accents-substitute-latin1))

  (let ((port (if (null? paramless-args)
                  (current-input-port)
                  (open-input-file (car paramless-args)))))
    (let loop ()
      (let ((line (read-line port)))
        (unless (eof-object? line)
          (print (accents-substitute line mode: (and mode (string->symbol mode))))
          (loop))))
    (unless (null? paramless-args)
      (close-input-port port))))

License

 Copyright (c) 2010-2018, Mario Domenech Goulart
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
 are met:
 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. The name of the authors may not be used to endorse or promote products
    derived from this software without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS
 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
 GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
 IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
 IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Version history

Version 0.7

Drop dependency on the regex egg (use irregex unit)
CHICKEN 5 support

Version 0.6

Fix .release-info

Version 0.5

Fix build order of modules

Version 0.4

Compiled with -O3
Install the accents-substitute module, which provides procedures to substitute Latin-1 and UTF-8 accented-characters (as a side-effect, this change also fixes the chicken-install reinstallation problem)

Version 0.3

Added UTF-8 support for Turkish characters (İ, ı, Ğ, ğ, Ş, ş). Thanks to Mehmet Köse.

Version 0.2

Use precompiled regexes for HTML mode (a lot faster). Added regex requirement for compatibility with CHICKEN >= 4.6.2.

Version 0.1

Initial release
