accents-substitute

Description

Substitutes accented characters (Latin-1 and UTF-8) in strings with either unaccented ASCII characters or HTML entities.

The currently supported accented characters for both Latin-1 and UTF-8 are: ã, Ã, á, Á, â, Â, à, À, ä, Ä, é, É, ê, Ê, è, È, ë, Ë, í, Í, î, Î, ì, Ì, ï, Ï, õ, Õ, ó, Ó, ô, Ô, ò, Ò, ö, Ö, ú, Ú, û, Û, ù, Ù, ü, Ü, ç and Ç.

The following characters are supported in UTF-8 only: İ, ı, Ğ, ğ, Ş, ş.

Author

Mario Domenech Goulart

Repository

https://github.com/mario-goulart/accents-substitute

Requirements

None

Usage

This extension provides three modules: accents-substitute, accents-substitute-latin1 and accents-substitute-utf8.

If you want to replace accented characters in Latin-1 strings, use:

(require-extension accents-substitute-latin1)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

If you want to replace accented characters in UTF-8 strings, use:

(require-extension accents-substitute-utf8)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

If you want to replace both Latin-1 and UTF-8 accents, you can use the accents-substitute module, which exports both the accents-substitute-latin1 and accents-substitute-utf8 procedures.
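As a minimal sketch of the combined module in use, the helper below picks one of the two exported procedures based on an encoding symbol. The name strip-accents and the 'latin1 / 'utf8 symbols are hypothetical and not part of the egg; the sketch also assumes the source file is saved as UTF-8.

```scheme
;; Load the convenience module, which exports both renamed procedures.
(require-extension accents-substitute)

;; Hypothetical helper (not part of the egg): choose the procedure
;; that matches the encoding of STR.
(define (strip-accents str encoding)
  (case encoding
    ((latin1) (accents-substitute-latin1 str))
    ((utf8)   (accents-substitute-utf8 str))
    (else (error "unsupported encoding" encoding))))

;; The literal below is UTF-8 only if this file is saved as UTF-8.
(print (strip-accents "ação" 'utf8))
```

Going by the UTF-8 example above, the last line should print acao.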

Procedures

Modules accents-substitute-latin1 and accents-substitute-utf8

[procedure] (accents-substitute str #!key mode)

Substitutes accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).
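The short sketch below exercises the mode keyword. It assumes the egg is installed and that the source file is saved as UTF-8; the variable name s is only for illustration.

```scheme
(require-extension accents-substitute-utf8)

;; Sample input; UTF-8 only if this file is saved as UTF-8.
(define s "ação")

;; Omitting mode and passing 'ascii are documented as equivalent,
;; so this comparison is expected to hold.
(print (equal? (accents-substitute s)
               (accents-substitute s mode: 'ascii)))

;; With 'html, accented characters become HTML entities instead.
(print (accents-substitute s mode: 'html))
```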

Module accents-substitute

This is a convenience module that exports the accents-substitute procedure from both the accents-substitute-latin1 and accents-substitute-utf8 modules, each renamed after the module it comes from.

[procedure] (accents-substitute-latin1 str #!key mode)

Substitutes Latin-1 accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).

[procedure] (accents-substitute-utf8 str #!key mode)

Substitutes UTF-8 accented characters in str with unaccented ASCII characters (if mode is omitted or is 'ascii) or with HTML entities (if mode is 'html).

Example

Below is the code of a small command-line tool that uses accents-substitute.

Here's how to use it:

Usage: accents-substitute [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]

Default values:
   mode: ascii
   encoding: utf8

Here's the code:

#!/bin/sh
#| -*- scheme -*-
exec csi -s "$0" "$@"
|#

(use
 (rename
  accents-substitute-latin1
  (accents-substitute accents-substitute-latin1))
 (rename
  accents-substitute-utf8
  (accents-substitute accents-substitute-utf8)))

(use posix regex (srfi 1 13))

(define (command-line-argument option args)
  ;; Return the argument associated to the command line option OPTION
  ;; in ARGS or #f if OPTION is not found in ARGS or doesn't have any
  ;; argument.
  (let ((val (any (cut string-match (string-append option "=(.*)") <>) args)))
    (and val (cadr val))))

(define (usage #!optional exit-code)
  (print "Usage: " (pathname-strip-directory (program-name))
         " [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]")
  (print "\nDefault values:\n"
         "    mode: ascii\n"
         "    encoding: utf8")
  (when exit-code (exit exit-code)))

(let* ((args (command-line-arguments))
       (mode (command-line-argument "--mode" args))
       (encoding (command-line-argument "--encoding" args))
       (paramless-args (remove (cut string-prefix? "--" <>) args))
       (accents-substitute accents-substitute-utf8))

  (when (or (member "-h" args) (member "--help" args))
    (usage 0))

  (when (and encoding (not (member encoding '("utf8" "latin1"))))
    (print "'" encoding "' is not a valid encoding.")
    (exit 1))

  (when (and mode (not (member mode '("ascii" "html"))))
    (print "'" mode "' is not a valid mode.")
    (exit 1))

  (when (equal? encoding "latin1")
    (set! accents-substitute accents-substitute-latin1))

  (let ((port (if (null? paramless-args)
                  (current-input-port)
                  (open-input-file (car paramless-args)))))
    (let loop ()
      (let ((line (read-line port)))
        (unless (eof-object? line)
          (print (accents-substitute line mode: (and mode (string->symbol mode))))
          (loop))))
    (unless (null? paramless-args)
      (close-input-port port))))

License

 Copyright (c) 2010-2018, Mario Domenech Goulart
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
 are met:
 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. The name of the authors may not be used to endorse or promote products
    derived from this software without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS
 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
 GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
 IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
 IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Version history

Version 0.7

Version 0.6

Version 0.5

Version 0.4

Version 0.3

Version 0.2

Version 0.1
