The theory and practice of NLP come from rich and lively scholarly research as well as corporate research. This amateur NLP project includes a Shiny app that predicts the next word based on how frequently it follows the word (or words) that precede it.
In other words, the algorithm uses a statistical model built from tokenized uni-, bi-, and tri-grams to predict the most likely next word.
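As a rough illustration only (the app itself is written in R/Shiny, and this is not its actual code), the frequency-based n-gram approach can be sketched in Python: count uni-, bi-, and tri-grams from a toy corpus, then back off from the longest matching context to shorter ones when guessing the next word. The function names and the three-sentence corpus below are invented for the example.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Count n-grams (n = 1..max_n) from tokenized sentences.

    Returns a dict mapping a context tuple (length 0..max_n-1)
    to a Counter of next-word frequencies.
    """
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def guess_next(model, words, k=3, max_n=3):
    """Back off from the longest context (two words) to shorter
    ones and return up to k most frequent next-word candidates."""
    for length in range(max_n - 1, -1, -1):
        context = tuple(words[-length:]) if length else ()
        if context in model and model[context]:
            return [w for w, _ in model[context].most_common(k)]
    return []

# Toy corpus standing in for the news/blog/Twitter sample.
corpus = [
    "how are you doing today".split(),
    "how are you doing now".split(),
    "my heart is breaking".split(),
]
model = build_model(corpus)
print(guess_next(model, "how are you".split(), k=1))  # → ['doing']
```

The real app works the same way in spirit: the slider's value plays the role of `k`, and the 127,666 sampled observations play the role of `corpus`.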
The raw data include news articles, blog posts, and Twitter posts from a dataset provided by Coursera.
This algorithm is unusual in its willingness to work with “bad words”: the author values faithful spelling over censorship, and a profanity filter is not formally required in the final project.
The app uses a roughly 1 MB sample of one-, two-, and three-word combinations, which allows zippy loading and lightning-fast word guessing. A total of 127,666 observations feed the algorithm. It is optimized for predicting fairly trivial phrases such as “How are you…” ==> “doing” and “My heart is…” ==> “breaking”. Quad- and quint-grams could be added, along with a larger sample, at the expense of computing resources and time.
To test the Next Word Best Guess (NWBG) app, just type a word (or two, or three) into the labeled box and set the slider to the number of guesses you want to see. The predictions will populate the space below. In this version of the app, you can ignore the input box at the bottom. Other than that, try it out and have fun!