Outdated egg!
This is an egg for CHICKEN 4, the unsupported old release. You're almost certainly looking for the CHICKEN 5 version of this egg, if it exists.
If it does not exist, there may be equivalent functionality provided by another egg; have a look at the egg index. Otherwise, please consider porting this egg to the current version of CHICKEN.
levenshtein
Levenshtein edit distance
Documentation
Levenshtein is a collection of procedures providing various forms of the Levenshtein edit distance calculation.
The Levenshtein edit distance has been used in areas as diverse as soil sample analysis and language dialect analysis; it is not limited to text strings.
8-bit Values Only
These procedures calculate the edit distance for byte strings and blobs. All return the total edit cost.
levenshtein-distance/byte
Usage
(use levenshtein-byte)
- [procedure] (levenshtein-distance/byte SOURCE TARGET)
Calculates the edit distance from the SOURCE to the TARGET. All costs are unitary.
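To make the quantity concrete, here is a minimal sketch of the textbook unit-cost calculation in portable Scheme. It is not the egg's implementation, and it compares characters rather than bytes; the name edit-distance is hypothetical and used only for illustration.

```scheme
;; Illustration only: unit-cost Levenshtein distance between two strings.
;; `edit-distance` is a hypothetical name, not a procedure exported by the egg.
(define (edit-distance source target)
  (let* ((n (string-length source))
         (m (string-length target))
         (prev (make-vector (+ m 1) 0))   ; row i-1 of the cost table
         (curr (make-vector (+ m 1) 0)))  ; row i of the cost table
    ;; Row 0: building the first j target characters from nothing costs j inserts.
    (do ((j 0 (+ j 1))) ((> j m))
      (vector-set! prev j j))
    (do ((i 1 (+ i 1))) ((> i n))
      ;; Column 0: removing the first i source characters costs i deletes.
      (vector-set! curr 0 i)
      (do ((j 1 (+ j 1))) ((> j m))
        (let ((subst (if (char=? (string-ref source (- i 1))
                                 (string-ref target (- j 1)))
                         0
                         1)))
          (vector-set! curr j
                       (min (+ (vector-ref prev j) 1)               ; delete
                            (+ (vector-ref curr (- j 1)) 1)         ; insert
                            (+ (vector-ref prev (- j 1)) subst))))) ; substitute or match
      ;; The row just filled becomes the previous row for the next iteration.
      (let ((tmp prev))
        (set! prev curr)
        (set! curr tmp)))
    (vector-ref prev m)))

(edit-distance "ctas" "cats")      ; => 2
(edit-distance "kitten" "sitting") ; => 3
```

Only two rows of the cost table are kept, since each cell depends solely on the current and previous rows.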
levenshtein-distance/transpose-byte
Usage
(use levenshtein-transpose-byte)
- [procedure] (levenshtein-distance/transpose-byte SOURCE TARGET)
Calculates the edit distance from the SOURCE to the TARGET, taking into account the Transpose operation. All costs are unitary.
By using the Transpose operation the total edit cost is not at least the difference of the sizes of the two strings.
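As an illustration of how a Transpose operation enters the calculation, the following portable Scheme sketch adds an adjacent-swap case to the usual recurrence (the variant often called optimal string alignment). It is an assumption that this matches the egg's behaviour in every case; the name edit-distance/transpose and its helpers are hypothetical.

```scheme
;; Illustration only: unit-cost edit distance with adjacent transposition.
;; `edit-distance/transpose` is a hypothetical name, not part of the egg.
(define (edit-distance/transpose source target)
  (let* ((n (string-length source))
         (m (string-length target))
         (rows (make-vector (+ n 1) #f))
         (cell (lambda (i j) (vector-ref (vector-ref rows i) j)))
         (cell-set! (lambda (i j v) (vector-set! (vector-ref rows i) j v)))
         ;; Does the i-th source character equal the j-th target character (1-based)?
         (same? (lambda (i j)
                  (char=? (string-ref source (- i 1))
                          (string-ref target (- j 1))))))
    ;; Allocate the table; column 0 holds the cost of deleting i characters.
    (do ((i 0 (+ i 1))) ((> i n))
      (vector-set! rows i (make-vector (+ m 1) 0))
      (cell-set! i 0 i))
    ;; Row 0 holds the cost of inserting j characters.
    (do ((j 0 (+ j 1))) ((> j m))
      (cell-set! 0 j j))
    (do ((i 1 (+ i 1))) ((> i n))
      (do ((j 1 (+ j 1))) ((> j m))
        (let ((best (min (+ (cell (- i 1) j) 1)                  ; delete
                         (+ (cell i (- j 1)) 1)                  ; insert
                         (+ (cell (- i 1) (- j 1))
                            (if (same? i j) 0 1)))))             ; substitute or match
          (cell-set! i j
                     (if (and (> i 1) (> j 1)
                              (same? i (- j 1))
                              (same? (- i 1) j))
                         ;; The last two characters are swapped: one transpose.
                         (min best (+ (cell (- i 2) (- j 2)) 1))
                         best)))))
    (cell n m)))

(edit-distance/transpose "ctas" "cats") ; => 1
```

The swapped pair "ta"/"at" is charged as a single operation here, where the plain calculation charges two substitutions.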
Any "Sequence"
A functor implementing an edit distance algorithm parameterized by cost and sequence operation modules.
See Examples for Usage
levenshtein-distance/sequence
- [procedure] (levenshtein-distance/sequence SOURCE TARGET [#:insert-cost INSERT-COST] [#:delete-cost DELETE-COST] [#:substitute-cost SUBSTITUTE-COST] [#:get-work-vector GET-WORK-VECTOR] [#:elm-eql ELM-EQL] [#:limit-cost LIMIT-COST])
- SOURCE: sequence (for example a string) of a type supported by the instantiating sequence module.
- TARGET: sequence of the same type as SOURCE.
- INSERT-COST: number, default 1.
- DELETE-COST: number, default 1.
- SUBSTITUTE-COST: number, default 1.
- ELM-EQL: procedure; (-> object object boolean), default eqv?. The equality predicate.
- GET-WORK-VECTOR: procedure, default make-vector.
- LIMIT-COST: number or #f, default #f. Stop once the cost exceeds the limit.
SOURCE and TARGET must be of the same type, and that type must be supported by the instantiating sequence module.
Note that the element comparison procedure is passed as a keyword argument, not via the sequence implementation module. This is annoying when using strings but useful when using vectors.
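A hedged usage sketch of the keyword arguments follows. It assumes CHICKEN 4 with this egg installed and the repository on the include path (see Notes), and it reuses the vector instantiation shown under Examples. The sample data and the chosen costs are made up for illustration, and no results are shown because they have not been checked against the egg.

```scheme
;; Assumes the egg is installed and its repository is on the include path.
(import levenshtein-sequence-functor)
(include "levenshtein-cost-fixnum")
(include "levenshtein-sequence-vector")

(module levenshtein-sequence-fixnum-vector = (levenshtein-sequence-functor levenshtein-cost-fixnum levenshtein-sequence-vector))
(import (prefix levenshtein-sequence-fixnum-vector fxvc:))

;; Vectors of symbols: the documented default predicate, eqv?, is sufficient.
(define d1
  (fxvc:levenshtein-distance/sequence (vector 'a 'b 'c) (vector 'a 'c 'b)))

;; Vectors of strings: eqv? does not compare string contents, so pass a
;; predicate explicitly. The substitute cost of 2 is an arbitrary example.
(define d2
  (fxvc:levenshtein-distance/sequence
   (vector "soil" "clay" "sand")
   (vector "soil" "silt" "sand")
   #:elm-eql string=?
   #:substitute-cost 2))
```

Passing the predicate per call is what lets one instantiation serve vectors holding different element types.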
Only Vector - Baroque & Slow
A functor implementing an edit distance algorithm parameterized by a cost operation module.
Performs edit distance calculation for vectors. Allows definition of new edit operations. Will keep track of edit operations performed. Primarily a toy.
See Examples for Usage
levenshtein-distance/vector*
- [procedure] (levenshtein-distance/vector* SOURCE TARGET [EDIT-OPER ...] [#:elm-eql ELM-EQL] [#:operations? OPERATIONS])
Calculates the edit distance from the source vector SOURCE to the target vector TARGET. Returns the total edit cost or (values <total edit cost> <performed operations matrix>).
- SOURCE: vector.
- TARGET: vector.
- EDIT-OPER: levenshtein-operator. Edit operation definitions to apply. Defaults are the basic Insert, Delete, and Substitute.
- ELM-EQL: procedure; (-> object object boolean), default char=?. The equality predicate.
- OPERATIONS: boolean. Include the matrix of edit operations performed? Default #f.
Interface Implementation Modules
levenshtein-cost-fixnum.scm levenshtein-cost-generic.scm levenshtein-cost-numbers.scm
levenshtein-sequence-string.scm levenshtein-sequence-utf8.scm levenshtein-sequence-vector.scm
Edit Operators
Edit operation specification. A set of base operations is predefined, but may be overridden. The base set is identified by the keys Insert, Delete, Substitute, and Transpose. A printer and reader are provided for edit operations.
Usage
(use levenshtein-operators)
levenshtein-operator
- [record] levenshtein-operator
- [procedure] (levenshtein-operator-key OPER)
- [procedure] (levenshtein-operator-name OPER)
- [procedure] (levenshtein-operator-cost OPER)
- [procedure] (levenshtein-operator-above OPER)
- [procedure] (levenshtein-operator-left OPER)
make-levenshtein-operator
- [procedure] (make-levenshtein-operator KEY NAME COST ABOVE LEFT)
Returns a new edit operator.
- KEY: symbol. Key for the operation.
- NAME: string. Describes the operation.
- COST: number. The cost of the operation.
- ABOVE: non-negative-fixnum. How far back in the source.
- LEFT: non-negative-fixnum. How far back in the target.
levenshtein-operator?
- [procedure] (levenshtein-operator? OBJECT)
Is the OBJECT a levenshtein operator?
clone-levenshtein-operator
- [procedure] (clone-levenshtein-operator EDIT-OPERATION [#:key KEY] [#:name NAME] [#:cost COST] [#:above ABOVE] [#:left LEFT])
Returns a duplicate of the EDIT-OPERATION, with field values overridden by any optional keyword arguments supplied. EDIT-OPERATION may be the key of an already defined edit operation.
levenshtein-operator-ref
- [procedure] (levenshtein-operator-ref KEY)
Get the definition of an edit operation.
levenshtein-operator-set!
- [procedure] (levenshtein-operator-set! EDIT-OPERATION)
Define an edit operation.
levenshtein-operator-delete!
- [procedure] (levenshtein-operator-delete! EDIT-OPERATION)
Removes the EDIT-OPERATION definition. EDIT-OPERATION may be the key of an already defined edit operation.
levenshtein-operator-reset
- [procedure] (levenshtein-operator-reset)
Restore defined edit operations to the base set.
levenshtein-operator=?
- [procedure] (levenshtein-operator=? A B)
Are the levenshtein-operator A & levenshtein-operator B equal for all fields?
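The operator procedures above might be combined as in the sketch below. It assumes CHICKEN 4 with this egg installed and follows the signatures as documented in this section. The key Swap2, its name, and all cost values are hypothetical, and passing the base key as the quoted symbol 'Substitute is an assumption about how the predefined keys are spelled.

```scheme
(use levenshtein-operators)

;; Build an operator from scratch: KEY NAME COST ABOVE LEFT.
;; ABOVE 2 and LEFT 2 look two elements back in both source and target.
;; The key, name, and cost here are made up for illustration.
(define swap2 (make-levenshtein-operator 'Swap2 "Swap two elements" 3 2 2))

(levenshtein-operator? swap2)     ; predicate from this module
(levenshtein-operator-key swap2)  ; field accessors
(levenshtein-operator-cost swap2)

;; Derive a variant of a predefined operator, changing only its cost.
;; Assumes the base key is the symbol Substitute.
(define costly-substitute
  (clone-levenshtein-operator 'Substitute #:cost 2))

;; Register the variant, look it up by key, then restore the base set.
(levenshtein-operator-set! costly-substitute)
(levenshtein-operator-ref 'Substitute)
(levenshtein-operator-reset)
```

Calling levenshtein-operator-reset at the end keeps the overridden definition from leaking into later calculations.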
Path Iterator
Usage
(use levenshtein-path-iterator)
levenshtein-path-iterator
- [procedure] (levenshtein-path-iterator PATH-MATRIX)
Creates an optimal edit distance operation path iterator over the performed operations matrix PATH-MATRIX. The matrix is usually the result of an invocation of (levenshtein-distance/vector* ... operations: #t).
Each invocation of the iterator will generate a list of the form: ((cost source-index target-index levenshtein-operator) ...). The last invocation will return #f.
Path Matrix Print
Usage
(use levenshtein-print)
print-levenshtein-matrix
- [procedure] (print-levenshtein-matrix PATH-MATRIX)
Displays a readable representation of the PATH-MATRIX on the current-output-port.
Notes
- The functors are not available before Chicken 4.7.2, and so all support modules are also unavailable.
- To instantiate any of the functors, the implementation module(s) must be included. The modules are installed in the CHICKEN repository, which is not usually on the include path, so the repository must be added to the include path. Setting the environment variable CHICKEN_INCLUDE_PATH is a convenient way to do this.
Bugs & Limitations
- levenshtein-print assumes a levenshtein-operator-key print-name is <= 15 characters and that the cost prints in <= 2 characters.
Examples
(use levenshtein-byte)
(use levenshtein-transpose-byte)

(levenshtein-distance/byte "ctas" "cats") ;=> 2
(levenshtein-distance/transpose-byte "ctas" "cats") ;=> 1 ;cause of transpose

; Instantiate the distance measure algorithm
(use levenshtein-path-iterator)
(import levenshtein-vector-functor)
(include "levenshtein-cost-fixnum")
(module levenshtein-vector-fixnum = (levenshtein-vector-functor levenshtein-cost-fixnum))
(import (prefix levenshtein-vector-fixnum fx:))

(define iter (levenshtein-path-iterator (fx:levenshtein-distance/vector* "YWCQPGK" "LAWYQQKPGKA" operations: #t)))
; ignoring interpreter feedback & we know the distance is 6
(define r0 (iter))
(define t0 r0)
(define r1 (iter))
(define r2 (iter))
(define r3 (iter))
(define r4 (iter))
(define r5 (iter))
(iter)
; r0 now has #f, since the iterator finishes by returning to the initial caller,
; which is the body of '(define r0 (iter))', thus re-binding r0. However, t0 has
; the original returned value.

(import levenshtein-sequence-functor)
(include "levenshtein-cost-fixnum")
(include "levenshtein-sequence-vector")
(module levenshtein-sequence-fixnum-vector = (levenshtein-sequence-functor levenshtein-cost-fixnum levenshtein-sequence-vector))
(import (prefix levenshtein-sequence-fixnum-vector fxvc:))
; Now have 'fxvc:levenshtein-distance/sequence', be sure to verify the #:elm-eql
; keyword parameter default is what is wanted

(include "levenshtein-cost-numbers")
(include "levenshtein-sequence-utf8")
(module levenshtein-sequence-numbers-utf8 = (levenshtein-sequence-functor levenshtein-cost-numbers levenshtein-sequence-utf8))
(import (prefix levenshtein-sequence-numbers-utf8 fnu8:))
; Now have 'fnu8:levenshtein-distance/sequence', be sure to pass the #:elm-eql
; keyword parameter char=?
Requirements
check-errors vector-lib srfi-63 numbers utf8 miscmacros moremacros setup-helper
Author
Version history
- 1.0.3: Added types. Re-flow.
- 1.0.2: Added an "egg tag".
- 1.0.1: Drop "format-compiler".
- 1.0.0: Chicken 4 release.
License
Copyright (c) 2012-2017, Kon Lovett. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.