Algorithms - Trie and KMP String Search - efficient prefix and pattern matching
Intro
At the moment I am looking for a job, so I have been researching coding test platforms. It turns out that the problems which cannot be solved with just a little thinking and some loops are 80% covered by just 8 categories:
- Optimization with overlapping subproblems (coin change, longest increasing subsequence, knapsack variants, matrix chain) - solved with Dynamic Programming (memoization/tabulation).
- Pathfinding, connectivity, or level-order processing in grids/graphs (number of islands, shortest path in maze, rotten oranges) - solved with Graph Traversal: BFS and DFS.
- Hierarchical data processing (tree diameter, in order traversal, BST validation, lowest common ancestor) - solved with Tree Traversals and Properties (pre/in/post-order, level-order, recursion/DFS on trees).
- Exhaustive search with pruning (permutations, subsets, N-Queens, sudoku solver, word search) - solved with Backtracking.
- Local optimal choices that work globally (jump game, task scheduler, fractional knapsack) - solved by Greedy Algorithms.
- Efficient "top-k" or priority processing (merge k sorted lists, kth largest element, task scheduler with cooldown) - solved with Heaps / Priority Queues.
- Efficient union of sets / cycle detection in graphs (redundant connection, number of provinces, Kruskal's MST) - solved with Union-Find (Disjoint Set Union) with path compression & union-by-rank.
- Searching in sorted/monotonic spaces (search in rotated sorted array, minimum in rotated array, binary search on answer for capacity problems) - solved by Binary Search (advanced variants).
Bonus:
- Efficient prefix and pattern matching - solved by Trie and KMP String Search
- Efficient range queries and updates - solved by Segment Tree and Fenwick Tree
- Shortest paths with negative weights and all-pairs - solved by Bellman-Ford and Floyd-Warshall
Two classic algorithms deal with lots of strings elegantly: the Trie (prefix tree) for multiple prefix-based lookups, and KMP (Knuth-Morris-Pratt) for fast single-pattern searching in text. Today I am going to cover them both.
The Trie (Prefix tree) - efficient prefix matching
In computer science, a trie, also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store their associated key. Instead, each node's position within the trie determines its associated key, with the connections between nodes defined by individual characters rather than the entire key.
So if a trie was an actual tree, then each branch would have a letter stamped on it. To find a "cat" you go on the "c" branch, then follow on the "a" subbranch and then on the "t" subbranch of "a". The computer science cat will be lying on that very branch - not to be confused with the spherical cow, that's another animal.
When to use
- Insert many words and frequently check if a word exists
- Check if any word starts with a given prefix (autocomplete!)
- Search for words with wildcards or collect all words with a common prefix
Typical problems: dictionary implementations, word search on boards, replace words in sentences.
The point of a trie is that sharing prefixes saves space and makes prefix queries very fast.
All operations have O(L) complexity, where L = word length:
- Insert
- Search (exact word)
- StartsWith (prefix exists)
Here is an implementation of a trie:
var trie = new Trie();
trie.Insert("apple");
Console.WriteLine(trie.Search("apple")); // true
Console.WriteLine(trie.Search("app")); // false
Console.WriteLine(trie.StartsWith("app")); // true
trie.Insert("app");
Console.WriteLine(trie.Search("app")); // true
public class TrieNode
{
public Dictionary<char, TrieNode> Children { get; } = [];
public bool IsEndOfWord { get; set; }
}
public class Trie
{
private readonly TrieNode root = new();
public void Insert(string word)
{
var node = root;
foreach (var c in word)
{
if (!node.Children.TryGetValue(c, out var child))
{
child = new TrieNode();
node.Children[c] = child;
}
node = child;
}
node.IsEndOfWord = true;
}
public bool Search(string word)
{
var node = GetNode(word);
return node != null && node.IsEndOfWord;
}
public bool StartsWith(string prefix)
{
return GetNode(prefix) != null;
}
private TrieNode GetNode(string s)
{
var node = root;
foreach (var c in s)
{
if (!node.Children.TryGetValue(c, out node))
return null;
}
return node;
}
}
The trie stands as one of the most elegant data structures for associative retrieval over strings. Independently conceived in the late 1950s, it was first formally described by René de la Briandais in 1959 as a way to compress lexicographic dictionaries, but the term "trie" itself - deliberately extracted from the middle syllable of "retrieval" - was coined by Edward Fredkin in his influential 1960 paper "Trie Memory" published in Communications of the ACM. Fredkin presented it as an alternative to unordered lists, ordered lists, or simple pigeonhole hashing for fast prefix-based lookup, insertion, and membership testing in large sets of variable-length keys.
Mathematically, a trie can be viewed as a rooted, ordered tree that realizes a deterministic finite automaton (DFA) whose states correspond to prefixes of the stored strings. Each edge is labeled with a single symbol from a finite alphabet Σ (of size typically |Σ| = 26 for English lowercase, 256 for bytes, etc.), and the path from the root to any node spells a unique prefix. A special end-of-word flag (or terminal marker) distinguishes nodes that complete valid keys. The height of any path equals the length of its corresponding string, so the time complexity for insertion, exact search, and prefix search is strictly O(L), where L is the length of the key, independent of the total number of strings N stored in the structure (unlike balanced binary search trees, whose operations are O(log N + L)). This prefix-independent bound arises because the trie performs exactly one transition per character, with no backtracking or comparison of entire keys.
Space complexity, however, reveals a classic time–space tradeoff. In the worst case (no shared prefixes), a trie requires Θ(∑ L_i) nodes and up to Θ(|Σ| × number of nodes) pointers, which can be far larger than a simple list or hash table. Yet in practice, for natural-language vocabularies or DNA sequences with significant prefix overlap, the compression is dramatic: the number of nodes often grows sublinearly with N, approaching a value proportional to the total number of distinct characters across all strings. Variants such as compressed tries (or Patricia tries / radix trees) further mitigate space by merging chains of unary nodes, achieving tighter bounds in many applications. Modern analyses (e.g., of hybrid or burst tries) frequently model expected node count and access cost using probabilistic methods over random string sets, confirming the trie's asymptotic superiority for prefix-heavy workloads despite its memory footprint.
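The implementation above only answers yes/no questions, but the autocomplete scenario mentioned earlier needs to collect all stored words that share a prefix. Here is a minimal standalone sketch of that extension - the class and method names are my own, mirroring the Trie above but kept self-contained:

```csharp
using System;
using System.Collections.Generic;

// Standalone sketch of trie-based autocomplete.
class Node
{
    public Dictionary<char, Node> Children { get; } = new();
    public bool IsEndOfWord { get; set; }
}

class AutocompleteTrie
{
    private readonly Node root = new();

    public void Insert(string word)
    {
        var node = root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
        }
        node.IsEndOfWord = true;
    }

    // Walk down to the prefix node, then gather every word below it.
    public List<string> WordsWithPrefix(string prefix)
    {
        var results = new List<string>();
        var node = root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return results; // no stored word has this prefix
        Collect(node, prefix, results);
        return results;
    }

    private void Collect(Node node, string current, List<string> results)
    {
        if (node.IsEndOfWord) results.Add(current);
        foreach (var (c, child) in node.Children)
            Collect(child, current + c, results);
    }
}

class AutocompleteDemo
{
    static void Main()
    {
        var trie = new AutocompleteTrie();
        foreach (var w in new[] { "app", "apple", "apply", "banana" })
            trie.Insert(w);
        // Note: Dictionary enumeration order is not guaranteed,
        // so treat the result as an unordered set of matches.
        Console.WriteLine(string.Join(", ", trie.WordsWithPrefix("app")));
    }
}
```

The walk to the prefix node is O(L), and the collection step is proportional to the size of the subtree under it - which is exactly the set of matching words, so no wasted work.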
Knuth-Morris-Pratt - Find the index of first occurrence (strStr)
KMP avoids unnecessary backtracking by precomputing the Longest Proper Prefix which is also Suffix (LPS / pi array) for the pattern.
Core steps:
- Build LPS array for the pattern (what is the longest prefix that is also a suffix at each position)
- Use the LPS to skip characters when mismatch happens - never backtrack in the text
When to use
- Find if/where a pattern appears in a long text (especially if you do it repeatedly)
- Avoid the O(n*m) worst-case of naive string search
Typical problems: implement strStr() / find substring index, repeated string matching.
Yes, you in the back! "But Siderite, isn't Boyer-Moore (with Galil rule) much better than Knuth-Morris-Pratt? Why are we even talking about this?"
That is an excellent question. The answer is that KMP is easier to implement under pressure and to comprehend in a teaching context. Here are more reasons for KMP:
- guaranteed linear time O(n + m)
- boils down to the concept of LPS / failure / prefix table, which is a single array computed in O(m) and used in a clean two-pointer sliding window
- it teaches prefix/suffix overlap, failure transitions, and amortized analysis in a compact way.
- classic problems like "Implement strStr()", repeated substring patterns, or finding the first occurrence often map cleanly to KMP.
- KMP (1977) is taught in almost every undergraduate algorithms textbook (CLRS dedicates space to it), while Boyer-Moore (1977, but refined later) is often presented as an "advanced" or "practical optimization" follow-up.
And speaking of strStr, let's see an implementation:
int StrStr(string haystack, string needle)
{
if (needle.Length == 0) return 0;
if (haystack.Length < needle.Length) return -1;
var lps = ComputeLPS(needle);
var i = 0; // index in haystack
var j = 0; // index in needle
while (i < haystack.Length)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
if (j == needle.Length)
return i - j; // found at i - j
else if (i < haystack.Length && haystack[i] != needle[j])
{
if (j != 0)
j = lps[j - 1]; // smart skip
else
i++;
}
}
return -1;
}
int[] ComputeLPS(string pattern)
{
var lps = new int[pattern.Length];
var len = 0; // length of previous longest prefix suffix
var i = 1;
while (i < pattern.Length)
{
if (pattern[i] == pattern[len])
{
len++;
lps[i] = len;
i++;
}
else
{
if (len != 0)
len = lps[len - 1];
else
{
lps[i] = 0;
i++;
}
}
}
return lps;
}
var haystack = "This is a Knuth-Morris-Pratt algorithm implementation";
Console.WriteLine(StrStr(haystack, "implementation")); //39
Complexity:
- Building LPS: O(m)
- Searching: O(n)
- Total: O(n + m)
KMP is useful for repeated searches on the same pattern (build LPS once), or very long texts with many partial matches. If the problem has multiple patterns, consider Aho-Corasick (Trie + KMP ideas combined), but that's more of a practical real life problem than an interview one, so who cares, right?
The LPS (which stands for Longest Prefix Suffix) is an array built from the pattern alone, before any searching starts. For each position in the pattern, it records the length of the longest proper prefix of the pattern that is also a suffix ending at that position ("proper" meaning it cannot be the whole string up to there). When the search finds a mismatch between pattern and text, instead of starting over from the beginning of the pattern, the algorithm uses the LPS value to resume checking from a later spot in the pattern - without ever moving backward in the text.
In the example above, the LPS is an array of zeroes, except at the second 'i' in 'implementation', where the value is 1. So when a mismatch happens just after that 'i' has matched, the algorithm resumes from index 1 of the pattern, because the already-matched 'i' doubles as the 'i' that 'implementation' starts with.
You may be interested to know that the IndexOf implementation in .NET is neither KMP nor Boyer-Moore; instead it's an optimized ordinal or culture-aware search implemented in native code (inside the CLR and CoreCLR runtime). So if you want either of these algorithms for specific cases, you will have to implement them yourself anyway. Or use a NuGet package...
Conclusion
If you're into algorithms, the ones on strings are beautiful! People have been thinking about these for decades and there are some really cool implementations and ideas. I can't even begin to cover such a vast field.