Linguistics question

Discuss any other topic in here.
janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD

Linguistics question

Post by janMato »

How does one calculate the probability distribution for the odds a character will appear in a word?

I'm trying to create a new tp word generator. A uniform distribution over letters creates words that end with n 50% of the time, which is not what we see in the dictionary, nor in a typical text.

Say the language had three words: a, ab, and bb. Then there is a 2/5 chance of an a, since 2 of the 5 letters in the word list are a.

But what if "a" were 40% of typical texts, "ab" 59% of typical texts, and "bb" rare, at 1% of typical texts?
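One way to handle this, sketched below, is to weight each word's letters by how often the word occurs and then normalize. This is only an illustration using the toy numbers above; the function name letter_distribution is made up for the example. Passing a weight of 1 for every word gives the dictionary-style count, and passing text frequencies gives the running-text figure.

```python
from collections import Counter

def letter_distribution(word_weights):
    """Return {letter: probability}, weighting each word's letters by the word's weight."""
    totals = Counter()
    for word, weight in word_weights.items():
        for ch in word:
            totals[ch] += weight
    grand_total = sum(totals.values())
    return {ch: n / grand_total for ch, n in totals.items()}

# Dictionary view: every word counts once.
by_dictionary = letter_distribution({"a": 1, "ab": 1, "bb": 1})

# Running-text view: words weighted by their share of typical texts (toy numbers).
by_text = letter_distribution({"a": 0.40, "ab": 0.59, "bb": 0.01})

print(by_dictionary)
print(by_text)
```

With the toy numbers, the weighted total for a is 0.40 + 0.59 = 0.99 and for b is 0.59 + 2 × 0.01 = 0.61, so a comes out at 0.99 / 1.60, about 62%, instead of 40%.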

Also, we have tons of particles. If I leave particles in a text when calculating the odds of a letter appearing in a new word, then the letters of li/pi/mi/e are going to be way overrepresented.
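A rough sketch of counting letters with and without those words, assuming the text is already lowercase-able and split on whitespace (no punctuation handling). The skip list is just the four words named above, and the sample sentence is made up for illustration.

```python
from collections import Counter

# The words named above; extend this set as needed (assumption, not a full particle list).
PARTICLES = {"li", "pi", "mi", "e"}

def letter_counts(text, skip=frozenset()):
    """Count letters over whitespace-separated tokens, ignoring tokens in `skip`."""
    counts = Counter()
    for token in text.lower().split():
        if token in skip:
            continue
        counts.update(token)  # adds one count per character in the token
    return counts

sample = "mi moku e kili"  # hypothetical sample sentence
with_particles = letter_counts(sample)
without_particles = letter_counts(sample, skip=PARTICLES)

print(with_particles)
print(without_particles)
```

Comparing the two counters shows how much of the i, m, and e counts comes from the skipped words alone.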

Maybe I'll just wing it for now, but I was curious if there was a real answer for this.
janKipo
Posts: 3064
Joined: Fri Oct 09, 2009 2:20 pm

Re: Linguistics question

Post by janKipo »

Well, for existing words, this is just a matter of counting. You can get figures for words or particular syllables of words (with or without particles) or whatever you want. The question is whether this is determinative; that is, do you want your word generator to continue this pattern fairly exactly or to vary in some way? (You might look at new words versus old to see if there is a change in patterns, for example.) You could go finer by considering what letter follows what, either immediately or at one or two syllables' remove: what occurs without a preceding consonant, before a terminal /n/, and so on. Of course, this is only for dictionary entries. For running text you have to add in factors from a word frequency count (which we had at one time, didn't we?). And, of course, you have to decide what word list to use (120, 124, 126, 129, 130, some other).
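The "what letter follows what" idea could be sketched as a simple letter-to-letter transition table, with start and end markers so that word-initial and word-final patterns (such as a terminal n) get counted too. This is a minimal illustration and not a full generator: it works on letters, not syllables, the eight-word list is a tiny stand-in for whichever word list you pick, and the names transition_counts and generate are made up for the example.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for word start and word end

def transition_counts(words):
    """Count how often each symbol is immediately followed by each other symbol."""
    counts = defaultdict(Counter)
    for word in words:
        symbols = [START, *word, END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_len=8):
    """Walk the table from START, sampling each next letter in proportion to its count."""
    out, prev = [], START
    while len(out) < max_len:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)  # may be cut off mid-pattern if max_len is reached

# Tiny stand-in word list (assumption); substitute the real dictionary.
sample_words = ["toki", "pona", "jan", "moku", "kili", "suli", "lon", "telo"]
counts = transition_counts(sample_words)
new_word = generate(counts, random.Random(0))
print(new_word)
```

To reflect running text instead of dictionary entries, the `+= 1` could be replaced by a per-word frequency weight, and the same table could be built over syllables instead of letters.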