UP | HOME

Suffix array

Regular expression search of a suffix array

Recursively enumerate all possible strings matching a regular expression character by character, alternating the enumeration with narrowing the possible matches by reference to the suffix array.

(Examples assume ASCII strings and an alphabet ordered like ASCII, in particular all digits < all uppercase letters < all lowercase letters. Extension to Unicode is trivial given a suffix array sorted in codepoint order; dealing with a suffix array in (locale-dependent) collation order may admittedly be trickier, but should still be possible.)

Given a regular expression, write a procedure all-derivatives which returns a list mapping character ranges from the alphabet for that regular expression to unique derivatives returned by taking the derivative of that regular expression against the characters in those ranges.

The trivial regular expression /hello world/ has only one derivative, against the character ‘h’. all-derivatives returns only the character range [‘h’..‘g’) mapped to the new regexp /ello world/.

The case-insensitive version /[Hh][Ee][Ll][Ll][Oo] [Ww][Oo][Rr][Ll][Dd]/ can be derived against either ‘H’ or ‘h’. all-derivatives returns the character ranges [‘H’..‘G’) and [‘h’..‘g’), both mapped to /[Ee][Ll][Ll][Oo] [Ww][Oo][Rr][Ll][Dd]/. (It is trivial for both /ello world/ instances to be the same object in memory, and for it to share storage with the original /hello world/ regexp.)

The regular expression /hello world|hi mum/ can be derived only against the character ‘h’. all-derivatives returns the character range [‘h’..‘g’) in two distinct instances: one mapping it to the regexp /ello world/, and another mapping it to the regexp /i mum/.

The regular expression /[a-f][0-9]/ can be derived against the range [‘a’..‘g’) and that range maps to /[0-9]/.

TODO: Finish describing this.