Toki pona corpus project needs your help!

Posted: Mon Feb 01, 2010 7:30 pm
by janMato
I've been working on doing some corpus linguistics for toki pona. In the last hour, I've downloaded 220KB of long (as in paragraph or longer) texts of toki pona. I've substantial samples from Japanese, English, Russian, French, and German speakers.

Much of it is public domain or CC, so I'll repost a zip file shortly. The rest is under an unknown license, but I'll make it available to anyone who wants it for computational linguistics work: contact me directly and we'll arrange an exchange in a dark alley where copyright doesn't apply. If all of this were under a generous license, we could republish it as a comprehensive toki pona anthology on lulu.

As you might know, toki pona had a burning-of-the-libraries-of-Alexandria moment when Geocities went down. All of the following fan sites are gone.
geocities.com/stephentoddpope/tokiponabible.html
geocities.com/yohsweb/
geocities.com/yves_prudhomme/toki_pona/
geocities.com/stephentoddpope/tokiponahome
geocities.com/girlinside123/toki.html

Not only did we lose some irreplaceable works of early toki pona culture, but we've lost the "tenpo ale la mi pali e lipu ni. lipu ni li pini ala." animated gifs. tan ni kin la mi pana anpa e telo oko.

Corpus Research Resources and how you can help
Stand and be counted. Please post medium or long texts you have written to the net, preferably with at least one version without interlinear gloss (i.e. without translations between each sentence).

Set your work free. If you do post somewhere on the web, please post with a generous license, such as public domain, CC or GPL documentation, so that we can copy your works willy-nilly without a guilty conscience. Otherwise we have to buy the linguistic corpus data from our pot dealer in dark alleys.

Post your work to the forum. When you post to this forum, it becomes covered by a CC license per your TOS with jan Sonja.

alasa tawa nimi pi toki pona. I can usually find most toki pona on the net by searching for toki pona key words, many of which do not exist in any other language. There are significant gaps, though, in finding toki pona from Asian countries, probably because the search engines I know of aren't indexing Hangul, Chinese, etc.
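
A rough sketch of the key-word idea in Python: score a page by the share of its tokens that are toki pona words. The word set below is only a small sample of common words, and the function names and the 0.5 threshold are made up for illustration. Short words like "la", "mi" and "e" also occur in other languages, so treat the score as a hint, not proof.

```python
import re

# Assumption: a small sample of common toki pona words.
# A real check would load the full word list instead.
TOKI_PONA_WORDS = {
    "li", "e", "jan", "mi", "ni", "toki", "pi", "ona", "tawa", "lon",
    "ma", "la", "mute", "tenpo", "ala", "pona", "sina", "lili", "kama", "sona",
}

def toki_pona_ratio(text):
    """Return the fraction of tokens in `text` found in TOKI_PONA_WORDS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TOKI_PONA_WORDS)
    return hits / len(tokens)

def looks_like_toki_pona(text, threshold=0.5):
    """Hypothetical cutoff: flag text whose ratio reaches `threshold`."""
    return toki_pona_ratio(text) >= threshold
```

Something like looks_like_toki_pona(page_text) could then be run over saved search results to sort likely toki pona pages from false hits before reading them by hand.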

http://www.suburbandestiny.com/?p=639 <-- Have I missed any toki pona pages on the net?
Search engine restricted to site listed above: http://www.google.com/cse/home?cx=01053 ... rmsq7cfp8g

Re: Toki pona corpus project needs your help!

Posted: Mon Feb 01, 2010 8:20 pm
by janKipo
Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I haven't the time nor patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.

Re: Toki pona corpus project needs your help!

Posted: Mon Feb 01, 2010 9:31 pm
by janMato
janKipo wrote:Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I haven't the time nor patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.
Absolutely! With the quantity of good material I've found, I'm optimistic about compiling an anthology in my spare time. Private message me if you need my email address again.

Re: Toki pona corpus project needs your help!

Posted: Tue Feb 02, 2010 11:51 pm
by janMato
48,000 toki pona words counted.
In my corpus, ali is more common than ale.
kala, selo, and mu are the least successful words.
kipisi, monsuta, and pu don't show up in the results yet.

This is the application I used: http://neon.niederlandistik.fu-berlin.de/textstat/ Very, very handy.

Word	Count	Percent
li	5115	10.52%
e	3504	7.21%
jan	2348	4.83%
mi	2038	4.19%
ni	1724	3.55%
toki	1390	2.86%
pi	1345	2.77%
ona	1307	2.69%
tawa	1305	2.68%
lon	1180	2.43%
ma	1167	2.40%
la	1084	2.23%
mute	1020	2.10%
tenpo	1015	2.09%
ala	949	1.95%
pona	912	1.88%
sina	891	1.83%
lili	799	1.64%
kama	604	1.24%
sona	565	1.16%
tan	557	1.15%
tomo	557	1.15%
suli	541	1.11%
ken	509	1.05%
jo	498	1.02%
o	488	1.00%
pilin	484	1.00%
lawa	475	0.98%
pali	464	0.95%
wile	460	0.95%
sewi	455	0.94%
ike	431	0.89%
telo	407	0.84%
lukin	402	0.83%
tu	356	0.73%
ali	348	0.72%
pana	340	0.70%
wan	320	0.66%
taso	318	0.65%
sama	306	0.63%
nasin	297	0.61%
kasi	290	0.60%
sike	289	0.59%
ante	283	0.58%
soweli	278	0.57%
suno	272	0.56%
ijo	271	0.56%
seme	260	0.53%
kulupu	257	0.53%
en	242	0.50%
nimi	234	0.48%
musi	222	0.46%
weka	219	0.45%
moku	217	0.45%
kepeken	199	0.41%
meli	199	0.41%
moli	198	0.41%
pini	198	0.41%
sitelen	193	0.40%
utala	192	0.39%
wawa	181	0.37%
nanpa	178	0.37%
lipu	176	0.36%
poka	166	0.34%
kin	152	0.31%
awen	145	0.30%
seli	144	0.30%
kon	141	0.29%
pakala	141	0.29%
mani	138	0.28%
loje	136	0.28%
nasa	132	0.27%
anpa	130	0.27%
kalama	128	0.26%
a	127	0.26%
olin	127	0.26%
sin	113	0.23%
pimeja	112	0.23%
luka	109	0.22%
mije	108	0.22%
ilo	102	0.21%
len	100	0.21%
poki	88	0.18%
sijelo	85	0.17%
kili	82	0.17%
nena	79	0.16%
mama	78	0.16%
kiwen	74	0.15%
anu	70	0.14%
lape	70	0.14%
linja	70	0.14%
akesi	69	0.14%
palisa	63	0.13%
sinpin	63	0.13%
waso	59	0.12%
ale	58	0.12%
noka	58	0.12%
mun	55	0.11%
open	50	0.10%
kute	49	0.10%
insa	48	0.10%
lete	48	0.10%
lupa	48	0.10%
jelo	47	0.10%
oko	43	0.09%
suwi	41	0.08%
laso	31	0.06%
uta	31	0.06%
walo	31	0.06%
ko	30	0.06%
monsi	30	0.06%
pipi	30	0.06%
pan	29	0.06%
jaki	28	0.06%
supa	23	0.05%
kule	22	0.05%
esun	21	0.04%
unpa	18	0.04%
kala	17	0.03%
selo	11	0.02%
mu	10	0.02%
Total	48631	
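
For anyone who wants to reproduce a count like this without TextStat, here is a minimal Python sketch. It is not how TextStat works internally; the function name and the tokenizing rule (lowercase runs of a-z) are assumptions for illustration. Lowercasing also folds proper names such as "Sonja" into the counts, which may or may not be what you want.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word tokens in `text`.

    Returns (rows, total), where rows is a list of
    (word, count, percent) sorted by descending count,
    with ties broken alphabetically.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append((word, count, 100.0 * count / total))
    return rows, total

if __name__ == "__main__":
    # Made-up sample sentences, not taken from the corpus.
    rows, total = word_frequencies("jan li toki. jan li pona.")
    for word, count, percent in rows:
        print(f"{word}\t{count}\t{percent:.2f}%")
    print(f"Total\t{total}")
```

Running it over the concatenated corpus files should give a tab-separated table in the same shape as the one above, so variants like ali and ale can be compared directly.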

Re: Toki pona corpus project needs your help!

Posted: Wed Feb 03, 2010 1:03 am
by janMato
Here is a zip file of all the current redistributable toki pona text I can find. So far 50% of the collected corpus is redistributable.

Shout outs to the following awesome people who have published with a license compatible with republication.
Michael Freedman - BY-NC-ND
Everyone who posted to wikia - CC-BY-SA
John Clifford - (by forum post implying he was willing to contribute, but no specific license yet)
Joop Kiefte - Public Domain
Bryant Knight - Public Domain
Sonja Kisa - BY-NC-SA
Dave Raftery - Creative Commons
Rowa Giso (not sure about their name) - AFAIK, these texts are a derivative of B Knight's works, which are now public domain.

Posts to this forum are covered by CC per your TOS with jan Sonja (thanks for pointing that out jan Ote!) I suspect that re-licensing magic only works when the original copyright holder posts, though.

However, as far as I can tell, tokilili, yahoo, live journal, etc. are all under the copyright of the original contributors. The sites that post a TOS usually say that the site gets a limited license to run the mailing list or what have you, but there are no redistribution rights for folk like me. Like any copyright, though, it only matters to the extent that someone can afford to enforce it. But that is a story for another day.

I excluded anything that was a translation of a copyrighted work and wasn't small enough to be considered fair use.

Once I get my toki pona website up again, I'll publish it there, too.

The next step is to rummage through this and start correcting the grammar and removing extraneous English, although the No-Derivatives licenses on some of the texts make me unsure whether I can do that.

Re: Toki pona corpus project needs your help!

Posted: Wed Feb 03, 2010 6:24 am
by jan Ote
http://en.tokipona.org/wiki/Copyright wrote:All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.
There are some tp texts on my tp site: http://tokipl.wikidot.com/teksty
Current license for the site: CC-NC-SA, the same license used for the tp Wiki and this forum (janSonja has chosen the 'unported' version). Derivative works are allowed under this license, so corrected and improved versions can be published. Of all the people mentioned by janMato, only M. Freedman used CC-NC-ND.
Texts there:
  • ma tomo Pape -- by Bryant Knight, from Wikipedia article
  • sike wan -- by Bryant Knight, from his site
  • jan lawa Oliki -- a modified version of a text by soweli Elepanto, sent to tp forum
  • kala -- by François Schwicker (jan Kanso), from the forum archive
  • toki suli Intenasijonale -- by jan-ante, text sent to tp forum
The remaining five or so are by mi en jan lili mi and are the final versions of texts already published and revised here.

Please send me a PM or write here if you need a plain ASCII version.

Re: Toki pona corpus project needs your help!

Posted: Wed Feb 03, 2010 9:35 am
by janMato
jan Ote wrote:
http://en.tokipona.org/wiki/Copyright wrote:All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.
Thanks for pointing that out! I'll have to take time to scavenge this forum for suitable texts; it could end up being as important as wikia. The yahoo forum texts posted to this forum, though, I'm treating as "unknown license."

270KB total, 144KB republishable

jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?

Re: Toki pona corpus project needs your help!

Posted: Wed Feb 03, 2010 11:20 am
by jan Ote
janMato wrote:jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?
jan lili mi li suli. tenpo sike tu wan kama la ona li ken pali lon tomo pali, li ken tawa tomo mani, li ken tawa weka tan tomo mi li ken jo e tomo ona.

tenpo mute la mi tu li toki lili kepeken toki pona lon tomo. tenpo pini la ona li toki e toki musi pi ''kala ma" tawa mi. mi toki e ni: "o sitelen e toki ni tawa mi! tan nasin ni la jan ante ken sona e ni".

jan lili mi li sitelen e toki ante. jan ante li pali e sitelen musi li sitelen e nimi lon sitelen ni kepeken toki Inli. ni li sitelen pi nimi Inli Manka. jan lili mi li kama jo e sitelen nimi ni li sitelen e nimi sin kepeken toki pona. ona li pana e ni tawa mi. mi lukin. taso ona li ken ala pana e ni tawa jan ante. jan ante ken ala lukin e ni. mama pi sitelen Manka ni taso li ken pana e ni tawa jan ante. ni li nasa.

Re: Toki pona corpus project needs your help!

Posted: Tue Mar 23, 2010 6:19 am
by jan Ote
9 toki pona texts, already known on the forum
Creative Commons BY-NC-SA
  • kala ma li lon ala tan seme? (40 words)
  • meli anu mije (60 words)
  • soweli en kili (70 words)
  • pipi musi en pipi pali (100 words)
  • toki suli Intenasijonale (200 words)
  • jan lawa Oliki (300 words)
  • tan jan Eloto. jan pi ma seme li jan nanpa wan? (400 words)
  • jan Kikamesi. jan Enkitu li kama (1800 words)
  • jan Kikamesi. utala pi jan Kuwawa (2800 words)
by jan-ante, jan soweli Elepanto, jan Mika, jan Ote
All texts in toki pona only; lines starting with "+" are titles and headers.
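
A small Python sketch for anyone feeding these files into a word counter: it separates the "+" lines from the running text so titles don't inflate the counts. It assumes every "+" line opens a new section, which is a guess about the files, and the function name is made up for illustration.

```python
def split_texts(lines):
    """Group lines into (title, body_lines) pairs.

    Assumption: each line starting with "+" is a title or header
    that opens a new section; blank lines are skipped.
    Lines before the first "+" line get the title None.
    """
    texts = []
    title, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("+"):
            # Close the previous section before starting a new one.
            if title is not None or body:
                texts.append((title, body))
            title, body = line[1:].strip(), []
        elif line.strip():
            body.append(line)
    if title is not None or body:
        texts.append((title, body))
    return texts

if __name__ == "__main__":
    # Titles come from the list above; the body sentences are made up.
    sample = [
        "+ meli anu mije",
        "jan li toki.",
        "",
        "+ soweli en kili",
        "soweli li moku.",
    ]
    for title, body in split_texts(sample):
        print(title, len(body))
```

The body lines of each pair can then be joined and counted on their own, leaving the headers out of the statistics.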

Re: Toki pona corpus project needs your help!

Posted: Mon Apr 16, 2018 2:05 pm
by linguafrakka
Thought it would be worth mentioning here that I've recently brought the Toki Pona language to life over on Glosbe.com. The site originally had only 24 translated words, but I've now beefed the site up with over 1,200 English words and their corresponding word/phrase in Toki Pona. I made sure to provide thorough English definitions for all translated words to ensure there is no semantic ambiguity regarding English words with multiple corresponding Toki Pona words (e.g. "cause" could be "kama", "pana", or "tan" based on context). Totally recommend checking out the site as a handy reference and contribution hub.

https://glosbe.com/en/mis_tok


(Sorry if my post was badly formatted. Created an account here just to spread the word about Glosbe.)