chickadee » srfi-13 » make-kmp-restart-vector

(make-kmp-restart-vector s [c= start end]) -> integer-vectorprocedure

Build a Knuth-Morris-Pratt "restart vector," which is useful for quickly searching character sequences for the occurrence of string S (or the substring of S demarcated by the optional START/END parameters, if provided). C= is a character-equality function used to construct the restart vector. It defaults to char=?; use char-ci=? instead for case-folded string search.

The definition of the restart vector RV for string S is: If we have matched chars 0..I-1 of S against some search string SS, and S[I] doesn't match SS[K], then reset I := RV[I], and try again to match SS[K]. If RV[I] = -1, then punt SS[K] completely, and move on to SS[K+1] and S[0].

In other words, if you have matched the first I chars of S, but the I+1'th char doesn't match, RV[I] tells you what the next-longest prefix of S is that you have matched.

The following string-search function shows how a restart vector is used to search. Note the attractive feature of the search process: it is "on line," that is, it never needs to back up and reconsider previously seen data. It simply consumes characters one-at-a-time until declaring a complete match or reaching the end of the sequence. Thus, it can be easily adapted to search other character sequences (such as ports) that do not provide random access to their contents.

(define (find-substring pattern source start end)
  (let ((plen (string-length pattern))
        (rv (make-kmp-restart-vector pattern)))
 
    ;; The search loop. SJ & PJ are redundant state.
    (let lp ((si start) (pi 0)
             (sj (- end start))     ; (- end si)  -- how many chars left.
             (pj plen))             ; (- plen pi) -- how many chars left.
 
      (if (= pi plen) (- si plen)                   ; Win.
 
          (and (<= pj sj)                           ; Lose.
 
               (if (char=? (string-ref source si)           ; Test.
                           (string-ref pattern pi))
                   (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1)) ; Advance.
 
                   (let ((pi (vector-ref rv pi)))           ; Retreat.
                     (if (= pi -1)
                         (lp (+ si 1)  0   (- sj 1)  plen)  ; Punt.
                         (lp si        pi  sj        (- plen pi))))))))))

The optional START/END parameters restrict the restart vector to the indicated substring of PAT; RV is END - START elements long. If START > 0, then RV is offset by START elements from PAT. That is, RV[I] describes pattern element PAT[I + START]. Elements of RV are themselves indices that range just over [0, END-START), not [START, END).

Rationale: the actual value of RV is "position independent" -- it does not depend on where in the PAT string the pattern occurs, but only on the actual characters comprising the pattern.