The Bitter Lesson Video - Technical Notes

The HARPY speech recognition system represented its knowledge as a finite state transition network: phrases are broken down into their constituent phonemic elements, and the resulting sequences of phones are captured in the network. The original HARPY system's network contained around 15,000 nodes. The exact datasets used for HARPY development were not used here; as a substitute, the Brown Corpus served as the basis for the set of possible phrases. The version of the Brown Corpus used contains 57,340 strings.

To keep the network at a reasonable size, 300 strings were randomly sampled from the full corpus, and the first 50 words were kept, producing a network of about 6,000 nodes. A slightly modified version of the CMU Pronouncing Dictionary was used to map phrases to their phonemes / phones.
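As a rough illustration of how phrases can be turned into a phone network, here is a minimal sketch. It assumes a prefix-sharing (trie-like) layout, which is a simplification and not necessarily how the HARPY network or the one in the video was built. TOY_LEXICON is a hypothetical stand-in for the pronouncing dictionary, and the function names are made up for this example.

```python
# Hypothetical miniature lexicon in ARPAbet style (stand-in for a real dictionary).
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "sat": ["S", "AE", "T"],
}


def phrase_to_phones(phrase, lexicon):
    """Concatenate the phone sequences of every word in the phrase."""
    phones = []
    for word in phrase.lower().split():
        phones.extend(lexicon[word])
    return phones


def build_network(phrases, lexicon):
    """Build a prefix-sharing phone network.

    Returns (transitions, finals): transitions[node] maps a phone to the
    next node index, and finals maps an end node to its phrase.
    """
    transitions = [{}]  # node 0 is the start node
    finals = {}
    for phrase in phrases:
        node = 0
        for phone in phrase_to_phones(phrase, lexicon):
            nxt = transitions[node].get(phone)
            if nxt is None:
                # No existing arc for this phone: create a new node.
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[node][phone] = nxt
            node = nxt
        finals[node] = phrase
    return transitions, finals


network, finals = build_network(["the cat sat", "the cab sat"], TOY_LEXICON)
print(len(network), "nodes")
```

The two toy phrases share the nodes for their common prefix (DH AH K AE) and only branch where the phones differ, which is why the node count grows more slowly than the total number of phones.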

For network traversal, the Linear Predictive Coding (LPC) coefficients were computed for segments of the incoming speech signal. These LPC coefficients define an all-pole filter, and the frequency response of this filter (calculated using the Fourier Transform) gives the spectral envelope of the vocal tract for the given speech segment. The LPC order (the number of coefficients) dictates the spectral accuracy of the result: the higher the order, the more spectral detail the envelope captures. We used an LPC order of 12 for the visualizations.
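The LPC step can be sketched as follows. This is a minimal example assuming the autocorrelation method with the Levinson-Durbin recursion, which is one common way to compute LPC coefficients; the notes do not say which method was used for the video. It uses only the standard library, omits windowing and pre-emphasis, and runs on a synthetic toy frame (the impulse response of a known two-pole filter) so the recovered coefficients can be checked.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k      # remaining prediction error energy
    return a, err


def spectral_envelope(a, err, n_bins=256):
    """Power response err / |A(e^{jw})|^2 at n_bins frequencies in [0, pi]."""
    envelope = []
    for b in range(n_bins):
        w = math.pi * b / (n_bins - 1)
        # Evaluate the Fourier transform of the coefficient sequence at w.
        A = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        envelope.append(err / abs(A) ** 2)
    return envelope


# Toy frame: impulse response of x[n] = 1.3 x[n-1] - 0.8 x[n-2] + impulse.
frame = [0.0] * 400
for n in range(400):
    frame[n] = 1.0 if n == 0 else 0.0
    if n >= 1:
        frame[n] += 1.3 * frame[n - 1]
    if n >= 2:
        frame[n] -= 0.8 * frame[n - 2]

ORDER = 12  # the order used for the visualizations
a, err = levinson_durbin(autocorrelation(frame, ORDER), ORDER)
envelope = spectral_envelope(a, err)
peak_bin = max(range(len(envelope)), key=envelope.__getitem__)
print(round(a[1], 3), round(a[2], 3), peak_bin)
```

Because the toy frame comes from a two-pole filter, the first two coefficients come out close to -1.3 and 0.8 and the remaining ones stay near zero; the envelope shows a single resonance peak. Real speech frames would normally be windowed first.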

The spectral envelope of the incoming signal is compared against the template envelopes for the phones defined in the network using the Itakura-Saito distance, which determines the closest match and the most likely path.
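The comparison step can be sketched with the discrete form of the Itakura-Saito distance, D(P, Q) = mean(P/Q - log(P/Q) - 1), where P and Q are power spectra sampled at the same frequencies. The template values and phone labels below are made up for illustration, and treating the observed envelope as the first argument is an assumption (the distance is not symmetric).

```python
import math


def itakura_saito(p, q):
    """Discrete Itakura-Saito distance between two positive power spectra."""
    if len(p) != len(q):
        raise ValueError("spectra must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        ratio = p_i / q_i
        total += ratio - math.log(ratio) - 1.0
    return total / len(p)


def closest_phone(observed, templates):
    """Return the template label with the smallest distance to the observation."""
    return min(templates, key=lambda phone: itakura_saito(observed, templates[phone]))


# Hypothetical four-bin template envelopes for two phones.
templates = {
    "AA": [4.0, 2.0, 1.0, 0.5],
    "IY": [1.0, 0.5, 2.0, 4.0],
}
observed = [3.8, 2.1, 1.1, 0.5]

print(closest_phone(observed, templates))  # prints AA
```

The distance is zero only when the two spectra are identical and grows as their ratio moves away from 1 in any bin, so the template whose envelope shape tracks the observation most closely wins.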