
Linguistics question

Posted: Mon Jan 25, 2016 2:01 pm
by janMato
How does one calculate the probability that a given character will appear in a word?

I'm trying to create a new tp word generator. A uniform distribution produces words that end in n 50% of the time, which is not what we see in the dictionary, nor in a typical text.

Say the language had three words: a, ab, and bb. Then there is a 2/5 chance of an a (two of the five letters).

But what if "a" made up 40% of typical texts, "ab" 59%, and "bb" was rare, at 1%?
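
One way to work through the toy case above is to weight each word's letters by how often the word occurs. This is only a sketch: the function name `letter_distribution` is made up for illustration, and the percentages are treated as token counts out of 100.

```python
from collections import Counter
from fractions import Fraction

def letter_distribution(word_weights):
    """Map each letter to its share of all letters, weighting every word by its count."""
    totals = Counter()
    for word, weight in word_weights.items():
        for letter in word:
            totals[letter] += weight
    grand_total = sum(totals.values())
    return {letter: Fraction(count, grand_total) for letter, count in totals.items()}

# Dictionary view: every word counts once.
dictionary_view = letter_distribution({"a": 1, "ab": 1, "bb": 1})

# Running-text view: words weighted by the hypothetical 40% / 59% / 1% shares.
text_view = letter_distribution({"a": 40, "ab": 59, "bb": 1})

print(dictionary_view["a"])  # 2/5
print(text_view["a"])        # 99/160
```

So in this toy case the chance of an a rises from 2/5 in the dictionary to 99/160 (about 62%) in running text, because the two words containing a make up almost all of the text.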

Also, we have tons of particles. If I leave particles in a text when calculating the odds of a letter appearing in a new word, then the letters of li/pi/mi/e are going to be heavily overrepresented.

Maybe I'll just wing it for now, but I was curious if there was a real answer for this.

Re: Linguistics question

Posted: Tue Jan 26, 2016 11:01 am
by janKipo
Well, for existing words, this is just a matter of counting. You can get figures for words or particular syllables of words (with or without particles) or whatever you want. The question is whether this is determinative; that is, do you want your word generator to continue this pattern fairly exactly or to vary in some way (you might look at new words versus old to see if there is a change in patterns, for example). You could go finer by considering what letter follows what, either immediately or at a remove of one or two syllables, or what occurs without a preceding consonant, before a terminal /n/, and so on. Of course, this is only for dictionary entries. For running text you have to add in factors from a word frequency count (which we had at one time, didn't we?). And, of course, you have to decide what word list to use (120, 124, 126, 129, 130, some other).
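
The "what letter follows what" counting could be sketched roughly as follows. This is an illustration rather than a finished generator: the six-word sample stands in for whichever word list you settle on, the names `transition_counts` and `generate` are made up, and the optional `weights` argument is where figures from a word frequency count would go. It works letter by letter, not syllable by syllable, so it will not enforce syllable structure on its own.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for word start and word end

def transition_counts(words, weights=None):
    """Count how often each letter (or START) is followed by each letter (or END)."""
    counts = defaultdict(Counter)
    for word in words:
        weight = 1 if weights is None else weights.get(word, 0)
        if weight <= 0:
            continue  # skip words with no recorded frequency
        symbols = [START, *word, END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += weight
    return counts

def generate(counts, rng, max_len=8):
    """Build one word by repeatedly sampling the next letter from the counts."""
    letters = []
    prev = START
    while len(letters) < max_len:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        prev = nxt
    return "".join(letters)

# A tiny sample only; substitute the full word list of your choice.
sample_words = ["toki", "pona", "jan", "lon", "mi", "li"]
counts = transition_counts(sample_words)
rng = random.Random()
print([generate(counts, rng) for _ in range(5)])
```

Leaving `weights` out gives the dictionary view, where every entry counts once. Passing a dict of per-word counts gives the running-text view, and dropping the particles from that dict keeps li/pi/mi/e from dominating the figures.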