I've been doing some corpus linguistics for toki pona. In the last hour, I've downloaded 220KB of long (as in paragraph or longer) texts in toki pona. I have substantial samples from Japanese, English, Russian, French, and German speakers.
Much of it is public domain or CC, so I'll repost a zip file shortly. The rest is of unknown license, but I'll make it available to anyone who wants it for computational linguistics work: contact me directly and we'll arrange an exchange in a dark alley where copyright doesn't apply. If this were all under a generous license, we could republish it all as a comprehensive toki pona anthology on Lulu.
As you might know, toki pona had a burning-of-the-libraries-of-Alexandria moment when Geocities went down. All of the following fan sites are gone.
geocities.com/stephentoddpope/tokiponabible.html
geocities.com/yohsweb/
geocities.com/yves_prudhomme/toki_pona/
geocities.com/stephentoddpope/tokiponahome
geocities.com/girlinside123/toki.html
Not only did we lose some irreplaceable works of early toki pona culture, but we've lost the "tenpo ale la mi pali e lipu ni. lipu ni li pini ala." animated gifs. tan ni kin la mi pana anpa e telo oko.
Corpus Research Resources and how you can help
Stand and be counted. Please post medium or long texts you have written to the net, preferably with at least one version without interlinear gloss (i.e. translations between each sentence).
Set your work free. If you do post somewhere on the web, please post with a generous license, public domain, CC or GPL documentation so that we can copy your works willy nilly without a guilty conscience. Otherwise we have to buy the linguistic corpus data from our pot dealer in dark alleys.
Post your work to the forum. When you post to this forum, it becomes covered by a CC license per your TOS with jan Sonja.
alasa tawa nimi pi toki pona. I can usually find most toki pona on the net by searching for toki pona keywords, many of which do not exist in any other language. There are significant gaps, though, in finding toki pona from Asian countries, probably because the search engines I know of aren't indexing Hangul, Chinese, etc.
http://www.suburbandestiny.com/?p=639 <-- Have I missed any toki pona pages on the net?
Search engine restricted to site listed above: http://www.google.com/cse/home?cx=01053 ... rmsq7cfp8g
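For anyone who wants to automate the keyword hunt, here is a minimal sketch of one way to guess whether a scraped page is toki pona: measure what share of its tokens come from a list of toki pona words. The function name, the tokenising rule, and the short word list are all my own assumptions for illustration (a real list would hold the whole vocabulary), and this is not how the search engine above works.

```python
import re

# Assumption: a small hand-picked subset of common toki pona words.
# A real detector would use the full vocabulary.
COMMON_WORDS = {
    "li", "e", "jan", "mi", "ni", "toki", "pi", "ona", "tawa",
    "lon", "ma", "la", "mute", "tenpo", "ala", "pona", "sina",
}

def toki_pona_share(text):
    """Return the fraction of tokens in `text` found in COMMON_WORDS."""
    # Treat every run of ASCII letters as a token, ignoring case.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS)
    return hits / len(tokens)

if __name__ == "__main__":
    print(toki_pona_share("mi toki e ni tawa sina"))
    print(toki_pona_share("the quick brown fox"))
```

A page scoring well above what ordinary English or French text gets is worth a manual look; where to set the cut-off is something you would have to tune on real pages.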
Toki pona corpus project needs your help!
Last edited by janMato on Sun Sep 05, 2010 5:04 pm, edited 2 times in total.
Re: Toki pona corpus project needs your help!
Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I haven't the time nor patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.
Re: Toki pona corpus project needs your help!
janKipo wrote: Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I haven't the time nor patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.
Absolutely! With the quantity of good material I've found, I'm optimistic about compiling an anthology in my spare time. Private message me if you need my email address again.
Re: Toki pona corpus project needs your help!
48,000 toki pona words counted.
In my corpus ali is more common than ale.
kala, selo, mu are the least successful words.
kipisi, monsuta, pu don't show in the results yet.
This is the application I used: http://neon.niederlandistik.fu-berlin.de/textstat/ Very, very handy.
Word Count Percent
li 5115 10.52%
e 3504 7.21%
jan 2348 4.83%
mi 2038 4.19%
ni 1724 3.55%
toki 1390 2.86%
pi 1345 2.77%
ona 1307 2.69%
tawa 1305 2.68%
lon 1180 2.43%
ma 1167 2.40%
la 1084 2.23%
mute 1020 2.10%
tenpo 1015 2.09%
ala 949 1.95%
pona 912 1.88%
sina 891 1.83%
lili 799 1.64%
kama 604 1.24%
sona 565 1.16%
tan 557 1.15%
tomo 557 1.15%
suli 541 1.11%
ken 509 1.05%
jo 498 1.02%
o 488 1.00%
pilin 484 1.00%
lawa 475 0.98%
pali 464 0.95%
wile 460 0.95%
sewi 455 0.94%
ike 431 0.89%
telo 407 0.84%
lukin 402 0.83%
tu 356 0.73%
ali 348 0.72%
pana 340 0.70%
wan 320 0.66%
taso 318 0.65%
sama 306 0.63%
nasin 297 0.61%
kasi 290 0.60%
sike 289 0.59%
ante 283 0.58%
soweli 278 0.57%
suno 272 0.56%
ijo 271 0.56%
seme 260 0.53%
kulupu 257 0.53%
en 242 0.50%
nimi 234 0.48%
musi 222 0.46%
weka 219 0.45%
moku 217 0.45%
kepeken 199 0.41%
meli 199 0.41%
moli 198 0.41%
pini 198 0.41%
sitelen 193 0.40%
utala 192 0.39%
wawa 181 0.37%
nanpa 178 0.37%
lipu 176 0.36%
poka 166 0.34%
kin 152 0.31%
awen 145 0.30%
seli 144 0.30%
kon 141 0.29%
pakala 141 0.29%
mani 138 0.28%
loje 136 0.28%
nasa 132 0.27%
anpa 130 0.27%
kalama 128 0.26%
a 127 0.26%
olin 127 0.26%
sin 113 0.23%
pimeja 112 0.23%
luka 109 0.22%
mije 108 0.22%
ilo 102 0.21%
len 100 0.21%
poki 88 0.18%
sijelo 85 0.17%
kili 82 0.17%
nena 79 0.16%
mama 78 0.16%
kiwen 74 0.15%
anu 70 0.14%
lape 70 0.14%
linja 70 0.14%
akesi 69 0.14%
palisa 63 0.13%
sinpin 63 0.13%
waso 59 0.12%
ale 58 0.12%
noka 58 0.12%
mun 55 0.11%
open 50 0.10%
kute 49 0.10%
insa 48 0.10%
lete 48 0.10%
lupa 48 0.10%
jelo 47 0.10%
oko 43 0.09%
suwi 41 0.08%
laso 31 0.06%
uta 31 0.06%
walo 31 0.06%
ko 30 0.06%
monsi 30 0.06%
pipi 30 0.06%
pan 29 0.06%
jaki 28 0.06%
supa 23 0.05%
kule 22 0.05%
esun 21 0.04%
unpa 18 0.04%
kala 17 0.03%
selo 11 0.02%
mu 10 0.02%
Total 48631
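If you'd rather not install TextStat, a table like the one above can be approximated in a few lines of Python. This is only a sketch under my own assumptions: the function name is made up, every run of ASCII letters counts as a word, and case is ignored, so the numbers need not match TextStat's exactly.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (rows, total): rows are (word, count, percent) tuples,
    most frequent first, with ties broken alphabetically."""
    # Assumption: a word is any run of ASCII letters, case-insensitive.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = [(word, n, 100.0 * n / total) for word, n in ordered]
    return rows, total

if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    sample = "jan li toki. jan li pona. mi toki e ni."
    rows, total = word_frequencies(sample)
    for word, n, percent in rows:
        print(f"{word} {n} {percent:.2f}%")
    print(total)
```

To compare spelling variants such as ali and ale, look up both words in the returned rows and compare their counts.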
Re: Toki pona corpus project needs your help!
Here is a zip file of all the current redistributable toki pona text I can find. So far 50% of the collected corpus is redistributable.
Shout outs to the following awesome people who have published with a license compatible with republication.
Michael Freedman - BY-NC-ND
Everyone who posted to wikia - CC-BY-SA
John Clifford - (by forum post implying he was willing to contribute, but no specific license yet)
Joop Kiefte - Public Domain
Bryant Knight - Public Domain
Sonja Kisa - BY-NC-SA
Dave Raftery - Creative Commons
Rowa Giso (not sure about their name) - AFAIK, these texts are derivatives of B Knight's works, which are now public domain.
Posts to this forum are covered by CC per your TOS with jan Sonja (thanks for pointing that out jan Ote!) I suspect that re-licensing magic only works when the original copyright holder posts, though.
However, as far as I can tell, tokilili, Yahoo, LiveJournal, etc. are all under the copyright of the original contributors. The sites that post a TOS usually say that the site gets a limited license to run the mailing list or what have you, but there are no redistribution rights for folk like me... but like any copyright, it only matters to the extent that someone can afford to enforce it. But that is a story for another day.
I excluded anything that was a translation of a copyrighted work and wasn't small enough to be considered fair use.
Once I get my toki pona website up again, I'll publish it there, too.
The next step is to rummage through this and start correcting the grammar and removing extraneous English, although the NoDerivs licenses on some of the texts make me unsure whether I can do that.
- Attachments
-
- Redistributable Toki Pona Corpus.zip
- (53.29 KiB) Downloaded 790 times
Last edited by janMato on Wed Feb 03, 2010 9:17 am, edited 1 time in total.
Re: Toki pona corpus project needs your help!
http://en.tokipona.org/wiki/Copyright wrote: All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.
There are some tp texts on my tp site: http://tokipl.wikidot.com/teksty
Current license for the site: CC BY-NC-SA, the same license that is used for the tp wiki and this forum (jan Sonja chose the 'Unported' version). Under this license derivative works are allowed, so corrected and improved versions can be published. Of all the people mentioned by janMato, only M. Freedman used CC BY-NC-ND.
Texts there:
- ma tomo Pape -- by Bryant Knight, from Wikipedia article
- sike wan -- by Bryant Knight, from his site
- jan lawa Oliki -- a modified version of a text by soweli Elepanto, sent to tp forum
- kala -- by François Schwicker (jan Kanso), from the forum archive
- toki suli Intenasijonale -- by jan-ante, text sent to tp forum
Please send me a PM or write here if you need a plain ASCII version.
Re: Toki pona corpus project needs your help!
jan Ote wrote: http://en.tokipona.org/wiki/Copyright wrote: All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.
Thanks for pointing that out! I'll have to take time to scavenge this forum for suitable texts -- it could end up being as important as wikia. The Yahoo forum texts reposted to this forum, though, I'm treating as "unknown license."
270kb total, 144kb republishable
jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?
Re: Toki pona corpus project needs your help!
janMato wrote: jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?
jan lili mi li suli. tenpo sike tu wan kama la ona li ken pali lon tomo pali, li ken tawa tomo mani, li ken tawa weka tan tomo mi li ken jo e tomo ona.
tenpo mute la mi tu li toki lili kepeken toki pona lon tomo. tenpo pini la ona li toki e toki musi pi "kala ma" tawa mi. mi toki e ni: "o sitelen e toki ni tawa mi! tan nasin ni la jan ante ken sona e ni".
jan lili mi li sitelen e toki ante. jan ante li pali e sitelen musi li sitelen e nimi lon sitelen ni kepeken toki Inli. ni li sitelen pi nimi Inli Manka. jan lili mi li kama jo e sitelen nimi ni li sitelen e nimi sin kepeken toki pona. ona li pana e ni tawa mi. mi lukin. taso ona li ken ala pana e ni tawa jan ante. jan ante ken ala lukin e ni. mama pi sitelen Manka ni taso li ken pana e ni tawa jan ante. ni li nasa.
Re: Toki pona corpus project needs your help!
9 toki pona texts, already known on the forum
Creative Commons BY-NC-SA
All texts in toki pona only; lines starting with "+" are titles and headers.
- kala ma li lon ala tan seme? (40 words)
- meli anu mije (60 words)
- soweli en kili (70 words)
- pipi musi en pipi pali (100 words)
- toki suli Intenasijonale (200 words)
- jan lawa Oliki (300 words)
- tan jan Eloto. jan pi ma seme li jan nanpa wan? (400 words)
- jan Kikamesi. jan Enkitu li kama (1800 words)
- jan Kikamesi. utala pi jan Kuwawa (2800 words)
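Since lines starting with "+" mark titles and headers in these files, a loader can split a file into separate texts before counting words. A rough sketch follows; I have not inspected the actual files, so the function name, the made-up sample lines, and the choice to let every "+" line start a new section are all assumptions.

```python
def split_texts(lines):
    """Group lines into (title, body_lines) pairs.

    Assumption: each line starting with '+' opens a new section, and
    blank lines are dropped. Lines before the first '+' get title None.
    """
    texts = []
    title, body = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith("+"):
            # Close the previous section before opening a new one.
            if title is not None or body:
                texts.append((title, body))
            title, body = line[1:].strip(), []
        elif line:
            body.append(line)
    if title is not None or body:
        texts.append((title, body))
    return texts

if __name__ == "__main__":
    # Made-up sample, not copied from the zip file.
    sample = ["+ kala", "kala li lon telo.", "", "+ soweli", "soweli li moku."]
    for title, body in split_texts(sample):
        print(title, len(body))
```

Keeping titles apart from body text lets you count words in the running text only, or filter by text length.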
- Attachments
-
- suno-pona-corpus.zip
- 9 tp text (short, medium, long), Creative Commons BY-NC-SA
- (9.2 KiB) Downloaded 705 times
Re: Toki pona corpus project needs your help!
Thought it would be worth mentioning here that I've recently brought the Toki Pona language to life over on Glosbe.com. The site originally had only 24 translated words, but I've now beefed it up with over 1,200 English words and their corresponding words or phrases in Toki Pona. I made sure to provide thorough English definitions for all translated words, so that there is no semantic ambiguity when an English word has several corresponding Toki Pona words (e.g. "cause" could be "kama", "pana", or "tan" based on context). I totally recommend checking out the site as a handy reference and contribution hub.
https://glosbe.com/en/mis_tok
(Sorry if my post was badly formatted. Created an account here just to spread the word about Glosbe.)