Documentation
¶
Index ¶
- type AhoCorasick
- type AhoCorasickBuilder
- func (b *AhoCorasickBuilder) Build(patterns []string) *AhoCorasick
- func (b *AhoCorasickBuilder) SetAsciiCaseInsensitive(asciiCaseInsensitive bool) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetByteClasses(byteClasses bool) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetDenseDepth(denseDepth *uint) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetKind(kind *AhoCorasickKind) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetMatchKind(matchKind MatchKind) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetPrefilter(prefilter bool) *AhoCorasickBuilder
- func (b *AhoCorasickBuilder) SetStartKind(startKind StartKind) *AhoCorasickBuilder
- type AhoCorasickKind
- type Match
- type MatchKind
- type StartKind
Examples ¶
- AhoCorasick.FindAll (Basic)
- AhoCorasick.FindAll (Leftmost_first)
- AhoCorasick.FindAll (Leftmost_longest)
- AhoCorasick.FindFirst (Basic)
- AhoCorasick.FindFirst (Leftmost_first)
- AhoCorasick.FindFirst (Leftmost_longest)
- AhoCorasick.GetKind
- AhoCorasick.IsMatch
- AhoCorasickBuilder.SetAsciiCaseInsensitive
- AhoCorasickBuilder.SetMatchKind (Leftmost_first)
- AhoCorasickBuilder.SetMatchKind (Leftmost_longest)
- AhoCorasickBuilder.SetMatchKind (Standard_semantics)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AhoCorasick ¶
type AhoCorasick struct {
// contains filtered or unexported fields
}
AhoCorasick is an automaton for searching multiple strings in linear time.
The AhoCorasick type supports a few basic ways of constructing an automaton, with the default being NewAhoCorasick. However, there are a fair number of configurable options that can be set by using AhoCorasickBuilder instead. Such options include, but are not limited to, how matches are determined, simple case insensitivity, whether to use a AhoCorasickKindDFA or not and various knobs for controlling the space-vs-time trade-offs taken when building the automaton.
func NewAhoCorasick ¶
func NewAhoCorasick(patterns []string) *AhoCorasick
NewAhoCorasick creates a new Aho-Corasick automaton using the default configuration.
The default configuration optimizes for less space usage, but at the expense of longer search times. To change the configuration, use AhoCorasickBuilder.
This uses the default [matchkind.MatchKindStandard] match semantics, which reports a match as soon as it is found. This corresponds to the standard match semantics supported by textbook descriptions of the Aho-Corasick algorithm.
func (*AhoCorasick) FindAll ¶
func (ac *AhoCorasick) FindAll(input string) []Match
FindAll returns an iterator of non-overlapping matches, using the match semantics that this automaton was constructed with.
input may be any type that is cheaply convertible to an Input. This includes, but is not limited to, &str and &[u8].
This is the infallible version of [AhoCorasick.TryFindIter].
Example (Basic) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindStandard).Build([]string{"append", "appendage", "app"})
haystack := "append the app to the appendage"
matches := automaton.FindAll(haystack)
for _, match := range matches {
fmt.Println(match.PatternIndex, haystack[match.Start:match.End], match.Start, match.End)
}
Output: 2 app 0 3 2 app 11 14 2 app 22 25
Example (Leftmost_first) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostFirst).Build([]string{"append", "appendage", "app"})
haystack := "append the app to the appendage"
matches := automaton.FindAll(haystack)
for _, match := range matches {
fmt.Println(match.PatternIndex, haystack[match.Start:match.End], match.Start, match.End)
}
Output: 0 append 0 6 2 app 11 14 0 append 22 28
Example (Leftmost_longest) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostLongest).Build([]string{"append", "appendage", "app"})
haystack := "append the app to the appendage"
matches := automaton.FindAll(haystack)
for _, match := range matches {
fmt.Println(match.PatternIndex, haystack[match.Start:match.End], match.Start, match.End)
}
Output: 0 append 0 6 2 app 11 14 1 appendage 22 31
func (*AhoCorasick) FindFirst ¶
func (ac *AhoCorasick) FindFirst(input string) *Match
FindFirst returns the location of the first match according to the match semantics that this automaton was constructed with.
input may be any type that is cheaply convertible to an Input. This includes, but is not limited to, &str and &[u8].
This is the infallible version of [AhoCorasick.TryFind].
Example (Basic) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindStandard).Build([]string{"b", "abc", "abcd"})
haystack := "abcd"
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: b
Example (Leftmost_first) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostFirst).Build([]string{"b", "abc", "abcd"})
haystack := "abcd"
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: abc
Example (Leftmost_longest) ¶
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostLongest).Build([]string{"b", "abc", "abcd"})
haystack := "abcd"
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: abcd
func (*AhoCorasick) GetKind ¶
func (ac *AhoCorasick) GetKind() AhoCorasickKind
GetKind returns the kind of the AhoCorasick automaton used by this searcher.
Knowing the Aho-Corasick kind is principally useful for diagnostic purposes. In particular, if no specific kind was given to AhoCorasickBuilder.SetKind, then one is automatically chosen and this routine will report which one.
Note that the heuristics used for choosing which [ahocorasickkind.AhoCorasickKind] may be changed in a semver compatible release.
Example ¶
automaton := NewAhoCorasick([]string{"foo", "bar", "quux", "baz"})
fmt.Println(automaton.GetKind() == AhoCorasickKindDFA)
Output: true
func (*AhoCorasick) IsMatch ¶
func (ac *AhoCorasick) IsMatch(input string) bool
IsMatch returns true if and only if this automaton matches the haystack at any position.
Input may be any type that is cheaply convertible to an Input. This includes, but is not limited to, &str and &[u8].
Aside from convenience, when AhoCorasick was built with leftmost-first or leftmost-longest semantics, this might result in a search that visits less of the haystack than AhoCorasick.FindFirst would otherwise. (For standard semantics, matches are always immediately returned once they are seen, so there is no way for this to do less work in that case.)
Note that there is no corresponding fallible routine for this method. If you need a fallible version of this, then [AhoCorasick.TryFind] can be used with Input::earliest enabled.
Example ¶
automaton := NewAhoCorasick([]string{"foo", "bar", "quux", "baz"})
fmt.Println(automaton.IsMatch("xxx bar xxx"))
fmt.Println(automaton.IsMatch("xxx qux xxx"))
Output: true false
type AhoCorasickBuilder ¶
type AhoCorasickBuilder struct {
// contains filtered or unexported fields
}
AhoCorasickBuilder is a builder for configuring an AhoCorasick automaton.
func NewAhoCorasickBuilder ¶
func NewAhoCorasickBuilder() *AhoCorasickBuilder
NewAhoCorasickBuilder creates a new builder for configuring an AhoCorasick automaton.
The builder provides a way to configure a number of things, including ASCII case insensitivity and what kind of match semantics are used.
func (*AhoCorasickBuilder) Build ¶
func (b *AhoCorasickBuilder) Build(patterns []string) *AhoCorasick
Build creates an AhoCorasick automaton using the configuration set on this builder.
A builder may be reused to create more automatons.
func (*AhoCorasickBuilder) SetAsciiCaseInsensitive ¶
func (b *AhoCorasickBuilder) SetAsciiCaseInsensitive(asciiCaseInsensitive bool) *AhoCorasickBuilder
SetAsciiCaseInsensitive enables ASCII-aware case-insensitive matching.
When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only.
Enabling this option does not change the search algorithm, but it may increase the size of the automaton.
Example ¶
automaton := NewAhoCorasickBuilder().SetAsciiCaseInsensitive(true).Build([]string{"FOO", "bAr", "BaZ"})
fmt.Println(len(automaton.FindAll("foo bar baz")))
Output: 3
func (*AhoCorasickBuilder) SetByteClasses ¶
func (b *AhoCorasickBuilder) SetByteClasses(byteClasses bool) *AhoCorasickBuilder
SetByteClasses sets a debug setting for whether to attempt to shrink the size of the automaton’s alphabet or not.
This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.
When enabled, some (but not all) AhoCorasick automatons will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the automaton.
The advantage of this map is that the size of the transition table can be reduced drastically from #states * 256 * sizeof(u32) to #states * k * sizeof(u32) where k is the number of equivalence classes (rounded up to the nearest power of 2). As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, automaton compilation becomes faster as well.
WARNING: This is only useful for debugging automatons. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
func (*AhoCorasickBuilder) SetDenseDepth ¶
func (b *AhoCorasickBuilder) SetDenseDepth(denseDepth *uint) *AhoCorasickBuilder
SetDenseDepth sets the limit on how many states use a dense representation for their transitions. Other states will generally use a sparse representation.
A dense representation uses more memory but is generally faster, since the next transition in a dense representation can be computed in a constant number of instructions. A sparse representation uses less memory but is generally slower, since the next transition in a sparse representation requires executing a variable number of instructions.
This setting is only used when an AhoCorasick implementation is used that supports the dense versus sparse representation trade off. Not all do.
This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the automaton. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation.
By default, this is set to a low but non-zero number. Setting this to 0 is almost never what you want, since it is likely to make searches very slow due to the start state itself being forced to use a sparse representation. However, it is unlikely that increasing this number will help things much, since the most active states have a small depth. More to the point, the memory usage increases super-linearly as this number increases.
func (*AhoCorasickBuilder) SetKind ¶
func (b *AhoCorasickBuilder) SetKind(kind *AhoCorasickKind) *AhoCorasickBuilder
SetKind sets the type of underlying automaton to use.
Currently, there are four choices:
[ahocorasickkind.AhoCorasickKindNonContinuousNFA] instructs the searcher to use a noncontinuous::NFA. A noncontinuous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.
[ahocorasickkind.AhoCorasickKindContinuousNFA] instructs the searcher to use a contiguous::NFA. A contiguous NFA is a little slower to build than a noncontinuous NFA, has excellent memory usage and is typically a little slower than a AhoCorasickKindDFA for a search.
[ahocorasickkind.AhoCorasickKindDFA] instructs the searcher to use a dfa::AhoCorasickKindDFA. A AhoCorasickKindDFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.
nil (the default) instructs the searcher to choose the “best” Aho-Corasick implementation. This choice is typically based primarily on the number of patterns.
Setting this configuration does not change the time complexity for constructing the Aho-Corasick automaton (which is O(p) where p is the total number of patterns being compiled). Setting this to [ahocorasickkind.AhoCorasickKindDFA] does however reduce the time complexity of non-overlapping searches from O(n + p) to O(n), where n is the length of the haystack.
In general, you should probably stick to the default unless you have some kind of reason to use a specific Aho-Corasick implementation. For example, you might choose [ahocorasickkind.AhoCorasickKindDFA] if you don’t care about memory usage and want the fastest possible search times.
Setting this guarantees that the searcher returned uses the chosen implementation. If that implementation could not be constructed, then an error will be returned. In contrast, when None is used, it is possible for it to attempt to construct, for example, a contiguous NFA and have it fail. In which case, it will fall back to using a noncontinuous NFA.
If nil is given, then one may use AhoCorasick.GetKind to determine which AhoCorasick implementation was chosen.
Note that the heuristics used for choosing which [ahocorasickkind.AhoCorasickKind] may be changed in a semver compatible release.
func (*AhoCorasickBuilder) SetMatchKind ¶
func (b *AhoCorasickBuilder) SetMatchKind(matchKind MatchKind) *AhoCorasickBuilder
SetMatchKind sets the desired match semantics.
The default is [matchkind.MatchKindStandard], which corresponds to the match semantics supported by the standard textbook description of the Aho-Corasick algorithm. Namely, matches are reported as soon as they are found. Moreover, this is the only way to get overlapping matches or do stream searching.
The other kinds of match semantics that are supported are [matchkind.MatchKindLeftMostFirst] and [matchkind.MatchKindLeftMostLongest]. The former corresponds to the match you would get if you were to try to match each pattern at each position in the haystack in the same order that you give to the automaton. That is, it returns the leftmost match corresponding to the earliest pattern given to the automaton. The latter corresponds to finding the longest possible match among all leftmost matches.
For more details on match semantics, see the documentation for MatchKind.
Note that setting this to [matchkind.MatchKindLeftMostFirst] or [matchkind.MatchKindLeftMostLongest] will cause some search routines on AhoCorasick to return an error (or panic if you’re using the infallible API). Notably, this includes stream and overlapping searches.
Example (Leftmost_first) ¶
haystack := "abcd"
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostFirst).Build([]string{"b", "abc", "abcd"})
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: abc
Example (Leftmost_longest) ¶
haystack := "abcd"
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindLeftMostLongest).Build([]string{"b", "abc", "abcd"})
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: abcd
Example (Standard_semantics) ¶
haystack := "abcd"
automaton := NewAhoCorasickBuilder().SetMatchKind(MatchKindStandard).Build([]string{"b", "abc", "abcd"})
match := automaton.FindFirst(haystack)
fmt.Println(haystack[match.Start:match.End])
Output: b
func (*AhoCorasickBuilder) SetPrefilter ¶
func (b *AhoCorasickBuilder) SetPrefilter(prefilter bool) *AhoCorasickBuilder
SetPrefilter enables heuristic prefilter optimizations.
When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be suoptimal for a particular workload.
Currently, prefilters are typically only active when building searchers with a small (less than 100) number of patterns.
This is enabled by default.
func (*AhoCorasickBuilder) SetStartKind ¶
func (b *AhoCorasickBuilder) SetStartKind(startKind StartKind) *AhoCorasickBuilder
SetStartKind sets the starting state configuration for the automaton.
Every Aho-Corasick automaton is capable of having two start states: one that is used for unanchored searches and one that is used for anchored searches. Some automatons, like the NFAs, support this with almost zero additional cost. Other automatons, like the AhoCorasickKindDFA, require two copies of the underlying transition table to support both simultaneously.
Because there may be an added non-trivial cost to supporting both, it is possible to configure which starting state configuration is needed.
Indeed, since anchored searches tend to be somewhat more rare, only unanchored searches are supported by default. Thus, [startkind.StartKindUnanchored] is the default.
Note that when this is set to [startkind.StartKindUnanchored], then running an anchored search will result in an error (or a panic if using the infallible APIs). Similarly, when this is set to [startkind.StartKindAnchored], then running an unanchored search will result in an error (or a panic if using the infallible APIs). When [startkind.StartKindBoth] is used, then both unanchored and anchored searches are always supported.
Also note that even if an AhoCorasick searcher is using an NFA internally (which always supports both unanchored and anchored searches), an error will still be reported for a search that isn’t supported by the configuration set via this method. This means, for example, that an error is never dependent on which internal implementation of AhoCorasick is used.
type AhoCorasickKind ¶
type AhoCorasickKind int
AhoCorasickKind is principally used as an input to the AhoCorasickBuilder.SetStartKind method. Its documentation goes into more detail about each choice.
const ( AhoCorasickKindNonContinuousNFA AhoCorasickKind = 1 // Use a noncontiguous NFA. AhoCorasickKindContinuousNFA AhoCorasickKind = 2 // Use a contiguous NFA. AhoCorasickKindDFA AhoCorasickKind = 3 // Use a AhoCorasickKindDFA. Warning: DFAs typically use a large amount of memory. )
type Match ¶
type Match struct {
// The ending position of the match.
End uint
// Returns the ID of the pattern that matched.
//
// The ID of a pattern is derived from the position in which it was originally inserted into the corresponding searcher. The first pattern has identifier 0, and each subsequent pattern is 1, 2 and so on.
PatternIndex uint
// The starting position of the match.
Start uint
}
Match represents a match found by an AhoCorasick automaton.
type MatchKind ¶
type MatchKind int
MatchKind is a knob for controlling the match semantics of an Aho-Corasick automaton.
There are two generally different ways that Aho-Corasick automatons can report matches. The first way is the “standard” approach that results from implementing most textbook explanations of Aho-Corasick. The second way is to report only the leftmost non-overlapping matches. The leftmost approach is in turn split into two different ways of resolving ambiguous matches: leftmost-first and leftmost-longest.
The MatchKindStandard match kind is the default and is the only one that supports overlapping matches and stream searching. (Trying to find overlapping or streaming matches using leftmost match semantics will result in an error in fallible APIs and a panic when using infallibe APIs.) The MatchKindStandard match kind will report matches as they are seen. When searching for overlapping matches, then all possible matches are reported. When searching for non-overlapping matches, the first match seen is reported. For example, for non-overlapping matches, given the patterns abcd and b and the haystack abcdef, only a match for b is reported since it is detected first. The abcd match is never reported since it overlaps with the b match.
In contrast, the leftmost match kind always prefers the leftmost match among all possible matches. Given the same example as above with abcd and b as patterns and abcdef as the haystack, the leftmost match is abcd since it begins before the b match, even though the b match is detected before the abcd match. In this case, the b match is not reported at all since it overlaps with the abcd match.
The difference between leftmost-first and leftmost-longest is in how they resolve ambiguous matches when there are multiple leftmost matches to choose from. Leftmost-first always chooses the pattern that was provided earliest, whereas leftmost-longest always chooses the longest matching pattern. For example, given the patterns a and ab and the subject string ab, the leftmost-first match is a but the leftmost-longest match is ab. Conversely, if the patterns were given in reverse order, i.e., ab and a, then both the leftmost-first and leftmost-longest matches would be ab. Stated differently, the leftmost-first match depends on the order in which the patterns were given to the Aho-Corasick automaton. Because of that, when leftmost-first matching is used, if a pattern A that appears before a pattern B is a prefix of B, then it is impossible to ever observe a match of B.
If you’re not sure which match kind to pick, then stick with the standard kind, which is the default. In particular, if you need overlapping or streaming matches, then you must use the standard kind. The leftmost kinds are useful in specific circumstances. For example, leftmost-first can be very useful as a way to implement match priority based on the order of patterns given and leftmost-longest can be useful for dictionary searching such that only the longest matching words are reported.
Relationship with regular expression alternations ¶
Understanding match semantics can be a little tricky, and one easy way to conceptualize non-overlapping matches from an Aho-Corasick automaton is to think about them as a simple alternation of literals in a regular expression. For example, let’s say we wanted to match the strings Sam and Samwise, which would turn into the regex Sam|Samwise. It turns out that regular expression engines have two different ways of matching this alternation. The first way, leftmost-longest, is commonly found in POSIX compatible implementations of regular expressions (such as grep). The second way, leftmost-first, is commonly found in backtracking implementations such as Perl. (Some regex engines, such as RE2 and Rust’s regex engine do not use backtracking, but still implement leftmost-first semantics in an effort to match the behavior of dominant backtracking regex engines such as those found in Perl, Ruby, Python, Javascript and PHP.)
That is, when matching Sam|Samwise against Samwise, a POSIX regex will match Samwise because it is the longest possible match, but a Perl-like regex will match Sam since it appears earlier in the alternation. Indeed, the regex Sam|Samwise in a Perl-like regex engine will never match Samwise since Sam will always have higher priority. Conversely, matching the regex Samwise|Sam against Samwise will lead to a match of Samwise in both POSIX and Perl-like regexes since Samwise is still longest match, but it also appears earlier than Sam.
The “standard” match semantics of Aho-Corasick generally don’t correspond to the match semantics of any large group of regex implementations, so there’s no direct analogy that can be made here. MatchKindStandard match semantics are generally useful for overlapping matches, or if you just want to see matches as they are detected.
The main conclusion to draw from this section is that the match semantics can be tweaked to precisely match either Perl-like regex alternations or POSIX regex alternations.
const ( // Use standard match semantics, which support overlapping matches. When used with non-overlapping matches, matches are reported as they are seen. MatchKindStandard MatchKind = 1 // Use leftmost-longest match semantics, which reports leftmost matches. When there are multiple possible leftmost matches, the longest match is chosen. // //This does not support overlapping matches or stream searching. If this match kind is used, attempting to find overlapping matches or stream matches will fail. MatchKindLeftMostLongest MatchKind = 2 // Use leftmost-first match semantics, which reports leftmost matches. When there are multiple possible leftmost matches, the match corresponding to the pattern that appeared earlier when constructing the automaton is reported. // //This does not support overlapping matches or stream searching. If this match kind is used, attempting to find overlapping matches or stream matches will fail. MatchKindLeftMostFirst MatchKind = 3 )
type StartKind ¶
type StartKind int
The kind of anchored starting configurations to support in an Aho-Corasick searcher.
Depending on which searcher is used internally by AhoCorasick, supporting both unanchored and anchored searches can be quite costly. For this reason, AhoCorasickBuilder::start_kind can be used to configure whether your searcher supports unanchored, anchored or both kinds of searches.
This searcher configuration knob works in concert with the search time configuration Input::anchored. Namely, if one requests an unsupported anchored mode, then the search will either panic or return an error, depending on whether you’re using infallible or fallibe APIs, respectively.
AhoCorasick by default only supports unanchored searches.
const ( StartKindBoth StartKind = 1 // Support both anchored and unanchored searches. StartKindUnanchored StartKind = 2 // Support only unanchored searches. Requesting an anchored search will return an error in fallible APIs and panic in infallible APIs. StartKindAnchored StartKind = 3 // Support only anchored searches. Requesting an unanchored search will return an error in fallible APIs and panic in infallible APIs. )