## [Problem](https://adventofcode.com/2023/day/1)

The problem boils down to scanning lines of text from a file and summing up a two-digit number from each line. The two-digit number is formed from the first digit 1 through 9 found in the string and the last. Part two is the same thing, except the digits can also be spelled out as words, i.e. "one", "two". The input is 1000 lines of lowercase alphanumeric characters ranging from 2 to 56 characters, with an average of around 20 characters in my input.

> [!NOTE] Zero Free
> "zero" and "0" are not found in my input.
> All inputs are valid and have at least a single digit as a number or a word.

> [!Example] Example line of input
> `sbzvmddhnjtwollnjv33d2lbcscstqt`
> Part 1 gives 32 (first digit 3, last digit 2); part 2 gives 22, since the word "two" appears before any literal digit.

## [Solutions](https://github.com/mic-max/advent/blob/master/2023/d01.py)

### Simple

I used a simple algorithm to solve part 1. For each line, iterate through the characters until a digit is found. Since it is the first digit of the two-digit number, add it multiplied by 10 to the result and stop iterating. Then iterate that same line's characters in reverse and add the first digit found to the running sum.

> [!NOTE] ord vs int
> Using `ord(c) - 48` instead of `int(c)` seemed to speed it up by ~16%. This makes sense because `int` is a more generalized function that accepts whole strings rather than single characters.
>
> I wonder what the performance implications would be of not doing the `- 48` after each `ord` and instead subtracting out all the extra values at the end with `res -= len(L) * (480 + 48)` (a sketch of this idea is at the end of this subsection).

```python
res = 0
for line in L:
    for c in line:
        if c.isdigit():
            res += (ord(c) - 48) * 10  # first digit goes in the tens place
            break
    for c in reversed(line):
        if c.isdigit():
            res += ord(c) - 48  # last digit goes in the ones place
            break
return res
```

Part 2 isn't as simple, since the substrings to search for are no longer single characters. For each line I now iterate through all possible target substrings (of which there are 18: the characters `1` through `9` and the words `one` through `nine`) and remember the smallest non-`-1` result from `str.find` and the largest from `str.rfind`. There's a slightly odd modulus operation mapping each target back to its value, but nothing too cryptic overall.

```python
res = 0
digits = list('123456789')
for i in range(1, 10):
    digits.append(num2words.num2words(i))  # 'one' .. 'nine'
for line in L:
    min_value = 0
    min_pos = len(line) - 1
    max_value = 0
    max_pos = 0
    for i, num in enumerate(digits):
        # only search up to (and including) the current leftmost match
        if (pos := line.find(num, 0, min_pos + 1)) != -1:
            min_pos = pos
            min_value = (i % 9) + 1  # map target index back to its value
        # only search from the current rightmost match onward
        if (pos2 := line.rfind(num, max_pos)) != -1:
            max_pos = pos2
            max_value = (i % 9) + 1
    res += min_value * 10
    res += max_value
return res
```
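Circling back to the ord-vs-int note above: here is a minimal sketch of that deferred-subtraction idea for part 1. I haven't benchmarked it, the wrapper function only exists to make the snippet self-contained, and it relies on every line containing at least one literal digit, otherwise the final subtraction over-corrects.

```python
def part1_deferred(L):
    # Illustrative only: the same scan as the part 1 code above, but the ASCII
    # offset (ord('0') == 48) is removed once at the end instead of per match.
    res = 0
    for line in L:
        for c in line:
            if c.isdigit():
                res += ord(c) * 10  # still carries a 48 * 10 offset
                break
        for c in reversed(line):
            if c.isdigit():
                res += ord(c)  # still carries a 48 offset
                break
    # every line contributed 480 + 48 of offset (assuming each line has a digit),
    # so strip it all in one step
    res -= len(L) * (480 + 48)
    return res
```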
### Regex

Would it really be Advent of Code without a regex solution? Don't worry, it's not an incomprehensible code-golf attempt. This solution is actually quite simple. First, as setup, a dictionary is created that maps each string representation to its integer value. Then iterate over every line and perform a `re.search` with all the string representations OR'd together. The match is converted to a number and multiplied by 10 before being added to our result. Now for a neat trick that basically reuses that regex search in reverse, since no `re.rsearch` exists: reverse the string using Python's handy, [performant](https://www.digitalocean.com/community/tutorials/python-reverse-string), and newcomer-friendly `[::-1]` syntax, applied to the line as well as to all the targets. Then reverse the first match and use the lookup dictionary to convert it to an int to add to our result. Simple as that.

```python
nums = {}
for i in range(1, 10):
    nums[str(i)] = i
    if part_two:
        nums[num2words.num2words(i)] = i

res = 0
for line in L:
    # leftmost match = first digit
    m = re.search("|".join(nums.keys()), line)
    res += nums[m[0]] * 10
    # reversed targets against the reversed line = last digit
    m = re.search("|".join([x[::-1] for x in nums.keys()]), line[::-1])
    res += nums[m[0][::-1]]
return res
```

#### Note

> As the target string is scanned, REs separated by `'|'` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once _A_ matches, _B_ will not be tested further, even if it would produce a longer overall match. In other words, the `'|'` operator is never greedy.
>
> Source: https://docs.python.org/3/library/re.html

A slight caveat to this solution: if any target were a proper prefix of another, it would have to come after the longer target in the alternation, otherwise the longer target could never match.

### Aho-Corasick

This algorithm converts our dictionary of words into a finite state automaton, which allows quickly searching a text for multiple patterns at once. Read [this](https://cp-algorithms.com/string/aho_corasick.html) for more information on how it converts a trie into an automaton with suffix and terminal links. Two automatons are created, one forward and one reverse, using the same reversal principle seen already. Add all digits and digit words to each automaton, with the int they represent as the value for easy parsing later, reversing the words for the reverse automaton. Now we have two efficient data structures for string matching. Using the `pyahocorasick` library, simply iterate over each input string's matches and break out of the loop after the first one. Add it to the result, multiplying by 10 for the forward match, and we're done!

```python
a1 = ahocorasick.Automaton()  # forward automaton
a2 = ahocorasick.Automaton()  # reverse automaton
for i in range(1, 10):
    a1.add_word(str(i), i)
    a2.add_word(str(i), i)
    if part_two:
        word = num2words.num2words(i)
        a1.add_word(word, i)
        a2.add_word(word[::-1], i)
a1.make_automaton()
a2.make_automaton()

res = 0
for line in L:
    for _, x in a1.iter(line):
        res += x * 10  # first match in the line
        break
    for _, x in a2.iter(line[::-1]):
        res += x  # first match in the reversed line, i.e. the last in the line
        break
return res
```

This data structure really shows how choosing the best tool for the job makes everything simpler.

## Performance

| Part 1       | Time     | Times Slower |
| ------------ | -------- | ------------ |
| Simple       | 0.000739 | 1.00         |
| Aho-Corasick | 0.000744 | 1.01         |
| Regex        | 0.003144 | 4.25         |

| Part 2       | Time     | Times Slower |
| ------------ | -------- | ------------ |
| Aho-Corasick | 0.000879 | 1.00         |
| Regex        | 0.004710 | 5.36         |
| Simple       | 0.006071 | 6.91         |

Keeping it simple for part 1 worked out, and all the automaton setup work is wasted when searching for single-character targets. Sometimes it's still faster though... maybe because the library is coded in C. Overall, the performance win goes handily to the automaton. Given longer dictionary strings with more suffixes, this solution would pull far into the lead. (If you want to collect timings like these yourself, a sketch of a minimal harness is at the end of this post.)

## Competitive

When solving an Advent of Code puzzle you don't know part 2 until you have solved part 1. This makes it beneficial not to write overly specific, hard-to-modify code for part 1. For instance, assuming the target strings would all be single characters can produce efficient code, at the cost of larger code changes when that assumption no longer holds. This can be seen in the `simple` function's code. The code changes needed for the regex and Aho-Corasick solutions are much simpler in comparison: they only touch the setup, and the search algorithm itself is unchanged.
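As a footnote on the Performance section: a minimal timing harness might look something like the sketch below. This is illustrative, not the code that produced the numbers above, and the commented-out usage relies on placeholder names (`d01.txt`, `part_one_simple`) standing in for the real input file and the solution functions shown earlier.

```python
import timeit

def benchmark(fn, lines, number=100, repeat=5):
    # Time `number` calls of fn(lines), repeated `repeat` times, and return the
    # best average per-call time in seconds.
    runs = timeit.repeat(lambda: fn(lines), number=number, repeat=repeat)
    return min(runs) / number

# Illustrative usage -- the file name and solver name are placeholders:
# with open("d01.txt") as f:
#     lines = f.read().splitlines()
# print(f"simple, part 1: {benchmark(part_one_simple, lines):.6f} s")
```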