A ranged possibility: two-letter codes

Signs and symbols: Writing systems (hieroglyphs, nail writing) and Signed Toki Pona; unofficial scripts too
jan-ante
Posts: 541
Joined: Fri Oct 02, 2009 4:05 pm

Re: A ranged possibility: two-letter codes

Post by jan-ante »

janKipo wrote: Well, slash -- and back-slash -- also dash and equals are easy to write long hand and are lower case on (American) keyboards. I agree that slash and backslash should not both be used, but slash and dash take care of two major breaks, with maybe equals for the less common 'la'. And, again, I like just 'p' for 'pi'.
that is:
la   l   /
li   i   \
e    e   -
pi   p   =
o    o   !
on the numeric keypad one can use * and + as lower case. btw, on a German keyboard, for example, = is upper case and @ is AltGr+Q. so perhaps it is better to have several alternative styles for separators, but forbid mixing them.
What corpus (I'm sure it says in there, but I don't even read Cyrillic very well)?
the author (Maxim Solokhin) was asked but did not reply. i suppose it is a 10kb corpus from the yahoo message board and some texts from jan Pije's and Yves Prudhomme's sites, as mentioned here.
Some at the low end are rather surprising, but the result of the sorts of things we talk about here
yes i expected to see poka much higher.
i analysed the legend of Gilgamesh by jan Ote (1700 tp words and 126 proper names). these are the top results:

 #   word     code   count   frequency
 1   li       *      226     0.132941176
 2   jan      jn     183     0.107647059
 3   e        ~      119     0.07
 4   ona      on     101     0.059411765
 5   ni       ni      70     0.041176471
 6   tawa     tw      64     0.037647059
 7   pi       @       51     0.03
 8   soweli   ow      42     0.024705882
 9   kama     km      37     0.021764706
10   tomo     tm      34     0.02
11   toki     tk      33     0.019411765
12   mi       mi      31     0.018235294
13   sina     sa      30     0.017647059
14   sama     sm      28     0.016470588
15   wawa     ww      26     0.015294118
16   o        !       25     0.014705882
17   sewi     sw      25     0.014705882
18   ala      aa      23     0.013529412
19   la       ^       23     0.013529412
20   ma       ma      22     0.012941176

without table support it is hard to add more results, but pona is #34, poka is #38, pali is #44, pana is #62.
from this, version 01 could be better. as for separators, we should give suitable keys to li, e, pi; the other two are not that important.
interestingly, Zipf's law holds down to "weka" (#53).
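
One way to check the Zipf remark: Zipf's law says a word's count falls off roughly as 1/rank, so a straight-line fit of log(count) against log(rank) should have a slope somewhere near -1. The sketch below is only an illustration, not the method used in the post. It takes the 20 counts from the table above and fits the slope by ordinary least squares; the function name `zipf_slope` is made up for this example.

```python
import math

# Counts of the 20 most frequent words, copied from the table above
# (Gilgamesh text, 1700 toki pona words).
counts = [226, 183, 119, 101, 70, 64, 51, 42, 37, 34,
          33, 31, 30, 28, 26, 25, 25, 23, 23, 22]


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank from 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


slope = zipf_slope(counts)
print(f"log-log slope over the top 20 words: {slope:.2f}")
```

With only 20 points the fit is rough; the full ranked list would be needed to see where the law stops holding.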
jan Ote
Posts: 424
Joined: Thu Oct 08, 2009 1:15 am
Location: ma Posuka

Post by jan Ote »

Interesting results. It's an epic, so I expected a high rank for 'wawa', 'utala' and 'sewi'. Moreover, it shows that this part of the story is about 'soweli', 'sama', 'ma tomo'. The word 'lawa' has a higher rank (#22) than it would have in a good corpus (#33 on Maxim Solokhin's list). Also 'unpa' is more frequent than in other texts.
janKipo
Posts: 3064
Joined: Fri Oct 09, 2009 2:20 pm

Post by janKipo »

The corpus is still very small (relatively speaking) so most content items will fluctuate wildly (another dragon tale would boost 'akesi' for example -- isn't there a snake coming in Kikamesi?), but the relative importance of the structural words should be pretty stable (though 'li' may rise a bit when we get farther from first person accounts). The new words have also not been thoroughly introduced -- and retro'd -- which will make some small differences, as would catching what little there is in the way of spontaneous conversation.
jan Ote
Posts: 424
Joined: Thu Oct 08, 2009 1:15 am
Location: ma Posuka

Post by jan Ote »

janKipo wrote:(another dragon tale would boost 'akesi' for example -- isn't there a snake coming in Kikamesi?)
Yes, there is one. There are scorpion-men too. Even Humbaba could be a jan akesi, because he's a kind of terrible monster. ona li lukin ike li ike.
janKipo
Posts: 3064
Joined: Fri Oct 09, 2009 2:20 pm

Post by janKipo »

'ike lukin' ?
jan Ote
Posts: 424
Joined: Thu Oct 08, 2009 1:15 am
Location: ma Posuka

Post by jan Ote »

janKipo wrote:'ike lukin' ?
Yes, thank you.
jan-ante
Posts: 541
Joined: Fri Oct 02, 2009 4:05 pm

Post by jan-ante »

so, some more results today. but first a few words for google:
---
word frequencies in toki pona.
word rates in toki pona.
word abundances in toki pona.
---
when i tried google for these queries i got nothing, so i just want to improve google a little bit.
i analysed a new corpus from the toki pona community in LJ and the kama sona section of this forum (except gilgamesh). 2188 tp words and 71 proper names (including a handful of nanpa suli). i made a correlation plot of the frequencies in the gilgamesh data versus this new one (i call it mixed). o lukin

[Image: correlation plot of word frequencies, gilgamesh corpus versus mixed corpus]
(to avoid the log(0) problem and overlapping data points, the value (1/5000)+(RAND/5000) was added to every data point)
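
The offset described above can be sketched roughly as follows. This assumes RAND is a uniform random number in [0, 1) and that the plot uses base-10 logarithms; neither is stated in the post. The helper name `jittered_log`, the fixed seed and the sample frequencies are hypothetical.

```python
import math
import random


def jittered_log(freq, rng, scale=1 / 5000):
    """Add scale + rng.random() * scale to freq, then take log10.

    The constant part keeps a frequency of 0 away from log(0);
    the random part spreads out points that would otherwise overlap.
    """
    return math.log10(freq + scale + rng.random() * scale)


rng = random.Random(0)  # fixed seed (an assumption) so the plot is repeatable

# Hypothetical relative frequencies; "unpa" stands in for a word
# that is absent from one of the corpora.
sample = {"li": 0.133, "soweli": 0.025, "unpa": 0.0}
points = {word: jittered_log(freq, rng) for word, freq in sample.items()}

for word, value in points.items():
    print(f"{word}: {value:.3f}")
```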
we can see a big spread between the 2 data sets. only a handful of words have comparable frequencies. presumably they comprise the topic-independent core of the language: li e ni pi tomo tawa kama pilin... given this, i see no strong reason to analyse a big corpus. such an analysis could give only average results, and those are impractical, because we have neither an average topic nor average people in the tp community; everyone is a bright individual. so i ranked the words by the minimum of the two frequencies. jan Kipo can consider the minimum as the logical function "and", e.g. in fuzzy logic. a high value means that the word is useful for both topics. pona is #26, pali is #32, pana is #50, poka is #54, poki is #104. of these words, pali is the closest to the core group (its variation factor is only 1.34). unpa is last in the list. in general topics, people hesitate to mention unpa and olin.
i think this method should be used to find the "best" words. what do you think?
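
A minimal sketch of the ranking proposed above, for anyone who wants to try it on their own counts. The frequencies below are invented for illustration, and the "variation factor" is assumed to be the larger frequency divided by the smaller one, which the post does not spell out. The function name `rank_by_minimum` is hypothetical.

```python
def rank_by_minimum(freq_a, freq_b):
    """Sort words by the smaller of their two frequencies, largest first.

    Returns (word, minimum, variation factor) tuples.
    """
    scored = []
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0.0)
        b = freq_b.get(word, 0.0)
        low, high = min(a, b), max(a, b)
        # Variation factor taken as high / low (an assumption);
        # infinite when the word is missing from one corpus.
        factor = high / low if low > 0 else float("inf")
        scored.append((word, low, factor))
    # Highest minimum first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical relative frequencies, not the real corpus values.
gilgamesh = {"li": 0.133, "soweli": 0.025, "pali": 0.0060, "unpa": 0.0040}
mixed = {"li": 0.120, "soweli": 0.002, "pali": 0.0075, "unpa": 0.0}

ranking = rank_by_minimum(gilgamesh, mixed)
for word, low, factor in ranking:
    print(f"{word}: min={low:.4f} factor={factor:.2f}")
```

Taking the minimum works like a fuzzy "and": a word scores high only if it is frequent in both corpora, so a topic word such as soweli drops even though it is common in one text.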
janKipo
Posts: 3064
Joined: Fri Oct 09, 2009 2:20 pm

Post by janKipo »

Hell, I'm just impressed by your getting the stats that fast. I think that after the first decile or so, the results will fluctuate for words down to about the last decile, when they settle down again (until we get over some hangups like on 'unpa'). I just report my immediate intuitive assignment and preferences, to be fiddled by stats and hard facts.
janJan
Posts: 3
Joined: Sun Nov 29, 2009 10:19 am

Post by janJan »

I find this two-letter code quite enticing; I had looked at a simple "remove the vowels" solution, saw that it was not one-to-one, and, before trying to tackle that, I discovered your solution. Is there a current stable version of the table?
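
To see why plain vowel removal is not one-to-one, here is a small sketch that groups words by their vowel-stripped form and reports the clashes. It uses only a short sample of toki pona words, not the full vocabulary, and the function names are made up for this example.

```python
from collections import defaultdict


def drop_vowels(word):
    """Remove a, e, i, o, u from a word."""
    return "".join(ch for ch in word if ch not in "aeiou")


def collisions(words):
    """Group words by vowel-dropped form; keep forms shared by 2+ words."""
    groups = defaultdict(list)
    for word in words:
        groups[drop_vowels(word)].append(word)
    return {code: group for code, group in groups.items() if len(group) > 1}


# A small sample of toki pona words, not the whole word list.
sample = ["pona", "pana", "pini", "poka", "poki",
          "sina", "suno", "sona", "tawa", "kama"]

clashes = collisions(sample)
for code, group in clashes.items():
    print(code, "<-", ", ".join(group))
```

Any scheme that wants unique codes has to move some of these words away from the simple vowel drop, which is where the two-letter table comes in.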
janKipo
Posts: 3064
Joined: Fri Oct 09, 2009 2:20 pm

Post by janKipo »

Nothing official (obviously), but the last one given seems to work OK. We just argue aesthetics: what code is "better" for what word, how far from the simple vowel drop is it permitted to go, and so on. Join with your taste or wait for a more thorough corpus scan (as if that would really decide anything).