so, some more results today. but first a few words for google:
word frequencies in toki pona.
word rates in toki pona.
word abundances in toki pona.
when i tried these queries on google i got nothing, so i just want to improve google a little bit.
i analysed a new corpus from the toki pona community on LJ and the kama sona section of this forum (except gilgamesh): 2188 tp words and 71 proper names (including a handful of nanpa suli). i made a correlation plot of the word frequencies in the gilgamesh data versus this new corpus (i call it mixed). o lukin
(to avoid the log(0) problem and overlapping data points, the value (1/5000)+(RAND/5000) was added to every data point)
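a minimal sketch of that jitter step, in case someone wants to redo the plot. it assumes RAND is a uniform random number in [0, 1) and that the plot uses log10 axes; the function names `jitter` and `log_point` are made up for this example and are not from my actual spreadsheet.

```python
import math
import random

OFFSET = 1 / 5000  # the constant from the post


def jitter(freq, rng=random):
    """return freq + 1/5000 + RAND/5000 (RAND assumed uniform in [0, 1))."""
    return freq + OFFSET + rng.random() * OFFSET


def log_point(freq_gilgamesh, freq_mixed, rng=random):
    """coordinates of one word on a log-log plot (log10 is an assumption)."""
    x = math.log10(jitter(freq_gilgamesh, rng))
    y = math.log10(jitter(freq_mixed, rng))
    return x, y
```

a word that is missing from one corpus (frequency 0) lands somewhere between 1/5000 and 2/5000 on that axis instead of at log(0), and words with identical frequencies no longer sit exactly on top of each other.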
we can see a big spread between the 2 data sets. only a handful of words have comparable frequencies. presumably they comprise the topic-independent core of the language: li e ni pi tomo tawa kama pilin... given this, i see no big reason to analyse a big corpus. such an analysis could give only average results, and those are impractical, because we have neither an average topic nor average people in the tp community; everyone is a bright individuality.
so i ranked the words by the minimum of the two frequencies. jan Kipo can consider the minimum as the logical function "and", e.g. in fuzzy logic. a high value means that the word is useful for both topics. pona is #26, pali is #32, pana is #50, poka is #54, poki is #104. of these words, pali is the closest to the core group (its variation factor is only 1.34). unpa is last in the list. in general topics, people hesitate to mention unpa and olin.
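the ranking can be sketched roughly like this. two assumptions: the "variation factor" is taken to be the larger frequency divided by the smaller one, and the toy numbers are invented for illustration only, they are not my real counts. the name `rank_by_min` is made up too.

```python
def rank_by_min(freq_a, freq_b):
    """rank words by min(freq_a, freq_b), highest first (a fuzzy "and").

    returns a list of (word, min_frequency, variation_factor) tuples.
    variation_factor = larger / smaller frequency (an assumed definition);
    it is infinite when the word is missing from one corpus.
    """
    scored = []
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0.0)
        b = freq_b.get(word, 0.0)
        low, high = min(a, b), max(a, b)
        factor = high / low if low > 0 else float("inf")
        scored.append((word, low, factor))
    # highest minimum first; ties broken alphabetically
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# invented toy frequencies, NOT the real gilgamesh / mixed data
toy_gilgamesh = {"li": 0.10, "pali": 0.012, "unpa": 0.0}
toy_mixed = {"li": 0.09, "pali": 0.010, "unpa": 0.001}

for rank, (word, low, factor) in enumerate(rank_by_min(toy_gilgamesh, toy_mixed), 1):
    print(rank, word, low, factor)
```

a word gets a high rank only if it is frequent in both corpora, and a variation factor close to 1 means the two frequencies nearly agree.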
i think this method should be used to find the "best" words. what do you think?