Cuss Word Generation in Toki Pona
Posted: Mon Feb 01, 2010 12:43 am
pakala nena uta!
I'm trying to create a cuss word generator in toki pona. This comes from the observation that if you generate words whose sounds occur with the same frequency as in the rest of a given language's words, the words that come out most often are cuss words. So a random English word generator would generate four-letter words most often, because they embody the most common phonotactic patterns. (Sorry, I can't find the reference for this anymore; I read it in a book somewhere.)
I already wrote one version that didn't use a Markov chain, and the results weren't interesting: nana, lana, nala, nani, nina. That just reflects that, for the distribution stats I was using, n, l, a, and i were the most common letters, so valid combinations of n, l, a, and i were generated most often. Four-letter words in English aren't so tightly bunched in a cluster of common sounds (some of the most common cuss words don't even have the schwa, English's most common vowel), so I figure I'm doing something wrong, probably by not using Markov chains.
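To make the problem concrete, here is a minimal sketch of a frequency-only approach. The original code isn't shown, so this is a guess at it: each letter is scored independently of its neighbors, and every CVCV string is ranked by the product of its letter probabilities. SAMPLE_WORDS is a tiny hand-picked word list standing in for a real corpus, and the function names are made up for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical mini word list standing in for a real corpus.
SAMPLE_WORDS = ["toki", "pona", "pakala", "nena", "uta", "walo", "laso",
                "sina", "ona", "ala", "lili", "kama", "nanpa", "lon"]

CONSONANTS = "jklmnpstw"
VOWELS = "aeiou"
FORBIDDEN = {"ji", "ti", "wo", "wu"}  # syllables toki pona does not allow


def letter_probs(words):
    """Relative frequency of each letter, ignoring its neighbors."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def rank_cvcv(words, top=5):
    """Rank every valid CVCV string by the product of its letter probabilities."""
    p = letter_probs(words)
    scored = []
    for c1, v1, c2, v2 in product(CONSONANTS, VOWELS, CONSONANTS, VOWELS):
        if c1 + v1 in FORBIDDEN or c2 + v2 in FORBIDDEN:
            continue
        score = p.get(c1, 0) * p.get(v1, 0) * p.get(c2, 0) * p.get(v2, 0)
        scored.append((score, c1 + v1 + c2 + v2))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]


print(rank_cvcv(SAMPLE_WORDS))
```

With this word list the top candidate is "nana", because n and a are the most frequent consonant and vowel. Since each letter is scored on its own, the ranking can only ever repeat the few most common letters, which matches the nana/lana/nala pattern above.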
Markov chain transition matrix for toki pona
Should the transition matrix for toki pona be syllable to syllable (the odds of "ka" following "la" are 1%) or letter by letter (the odds of "a" following "k" are 1%)? And for further speculation: given that toki pona has a mostly closed set of isolated morphemes, how would toki pona speakers cuss? Would they possibly use common strings of words instead of common strings of phonemes? Maybe I should also work out a word -> word transition matrix (the odds of walo following laso are .001%).
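For comparing the two granularities, one rough sketch is to build both tables with the same counting function and only change how a word is split into units. Everything here is an assumption for illustration: SAMPLE_WORDS is a tiny stand-in for a real corpus, the "^" and "$" markers for word start and end are my own convention, and the syllable regex assumes the usual (C)V(n) syllable shape.

```python
import random
import re
from collections import Counter, defaultdict

# Hypothetical mini word list standing in for a real corpus.
SAMPLE_WORDS = ["toki", "pona", "pakala", "nena", "uta", "walo", "laso",
                "sina", "ona", "ala", "lili", "kama", "nanpa", "lon"]

# (C)V(n): a final n belongs to the syllable only if no vowel follows it.
SYLLABLE = re.compile(r"[jklmnpstw]?[aeiou](?:n(?![aeiou]))?")


def syllables(word):
    """Split a word into syllables, e.g. 'nanpa' -> ['nan', 'pa']."""
    return SYLLABLE.findall(word)


def transition_probs(sequences):
    """Map each unit to the probabilities of the units that follow it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        units = ["^", *seq, "$"]  # "^" marks word start, "$" marks word end
        for prev, nxt in zip(units, units[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for prev, row in counts.items()}


def generate(probs, rng, max_units=8):
    """Walk the chain from "^" until "$" is drawn or max_units is reached."""
    out, state = [], "^"
    for _ in range(max_units):
        row = probs[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        if state == "$":
            break
        out.append(state)
    return "".join(out)


letter_model = transition_probs(SAMPLE_WORDS)  # a string iterates letter by letter
syllable_model = transition_probs(map(syllables, SAMPLE_WORDS))

rng = random.Random()
print([generate(letter_model, rng) for _ in range(5)])
print([generate(syllable_model, rng) for _ in range(5)])
```

One difference shows up even on this small list: the letter-level chain can produce strings like "npa" (n starts a word, and p follows n in nanpa), so it would need a phonotactic filter, while the syllable-level chain only ever joins syllables it has seen. A word -> word table could be built the same way by passing lists of words per sentence to transition_probs instead of letters or syllables.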
Final question: what corpus are people using? If one doesn't exist, I plan to compile one from this site and the wikia site, since those have permissive enough licenses for republishing content.