janKipo wrote:Nice stats. What corpus (I'm sure it says in there, but I don't even read Cyrillic very well)?
I can't find which corpus the guy used anywhere in the thread. And well, when people are saying "мой мозг, мой мозг! Он сейчас взорвётся"* ("my brain, my brain! It's about to explode")... maybe you don't want to know...
* lawa insa mi! lawa insa mi li open wawa sama pi poki pakala wawa!
Lots of talk about Zipf's law. One guy says that if you calculate the distribution of words, it breaks Zipf's law (the tail should be longer). I think it's obvious why: some pairs of words aren't really pairs of words, they're compound words. If the count treated jan, pona, and jan pona (when used to mean friend) as 3 separate words, we'd probably get Zipf's law. As it is, all we have is the distribution of words and parts of words. At least one post in that thread seemed to be agreeing with me.
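To make the compound-word idea concrete, here is a rough Python sketch of how such a count could be done. It is not what the guy in the thread did (I don't know his method). The COMPOUNDS list and both function names are made up for illustration, and the sketch merges every occurrence of a listed pair, since it has no way to tell when jan pona actually means friend.

```python
from collections import Counter

# Hypothetical list of word pairs to count as single lexical items.
# A real list would be much longer and would need human judgment.
COMPOUNDS = {("jan", "pona")}


def tokenize(text, compounds=COMPOUNDS):
    """Split text on whitespace, merging listed word pairs into one token."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in compounds:
            # Count the pair as one word, e.g. "jan pona".
            tokens.append(" ".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def rank_frequency(tokens):
    """Return (rank, token, count) tuples, most frequent token first."""
    counts = Counter(tokens)
    return [
        (rank, token, count)
        for rank, (token, count) in enumerate(counts.most_common(), start=1)
    ]
```

Running rank_frequency(tokenize(text)) over a corpus and plotting count against rank on log-log axes would show whether the tail gets longer once compounds are counted as their own words. Zipf's law predicts a roughly straight line.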
Interestingly, Zipf's law shows up in ecology when counting which species are most successful. Makes sense that some words are more fit for use than others and some words fill an incredibly narrow niche.