Toki pona corpus project needs your help!

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD

Toki pona corpus project needs your help!

Postby janMato » Mon Feb 01, 2010 7:30 pm

I've been working on some corpus linguistics for toki pona. In the last hour, I've downloaded 220 KB of long (paragraph-length or longer) toki pona texts. I have substantial samples from Japanese, English, Russian, French, and German speakers.

Much of it is public domain or CC, so I'll repost a zip file shortly. The rest is under an unknown license, but I'll make it available to anyone who wants it for computational linguistics work. Contact me directly and we'll arrange an exchange in a dark alley where copyright doesn't apply. If this were all under a generous license, we could republish it all as a comprehensive toki pona anthology on Lulu.

As you might know, toki pona had a burning-of-the-libraries-of-Alexandria moment when Geocities went down. All of the following fan sites are gone.
geocities.com/stephentoddpope/tokiponabible.html
geocities.com/yohsweb/
geocities.com/yves_prudhomme/toki_pona/
geocities.com/stephentoddpope/tokiponahome
geocities.com/girlinside123/toki.html

Not only did we lose some irreplaceable works of early toki pona culture, but we've lost the "tenpo ale la mi pali e lipu ni. lipu ni li pini ala." animated gifs. tan ni kin la mi pana anpa e telo oko.

Corpus Research Resources and how you can help
Stand and be counted. Please post medium or long texts you have written to the net, preferably with at least one version without interlinear gloss (i.e., translations between each sentence).

Set your work free. If you do post somewhere on the web, please post under a generous license (public domain, CC, or GPL documentation) so that we can copy your works willy-nilly without a guilty conscience. Otherwise we have to buy the linguistic corpus data from our pot dealer in dark alleys.

Post your work to the forum. When you post to this forum, it becomes covered by a CC license per your TOS with jan Sonja.

alasa tawa nimi pi toki pona. I can usually find most toki pona on the net by searching for toki pona keywords, many of which do not exist in any other language. There are significant gaps, though, in finding toki pona from Asian countries, probably because the search engines I know of aren't indexing Hangul, Chinese, etc.
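Once a keyword search turns up candidate pages, the same idea can be used to score them. The following is only a minimal Python sketch, not the method used for this corpus: the function names and the 0.6 threshold are made up for illustration, the word list is a small sample taken from the frequency table later in this thread, and capitalized tokens are assumed to be proper names.

```python
import re

# A small sample of toki pona words from this thread's frequency table;
# a real detector would load the full word list (assumption).
TOKI_PONA_WORDS = {
    "li", "e", "jan", "mi", "ni", "toki", "pi", "ona", "tawa", "lon",
    "ma", "la", "mute", "tenpo", "ala", "pona", "sina", "lili", "kama", "sona",
}


def toki_pona_ratio(text: str) -> float:
    """Return the fraction of lowercase-initial tokens found in the word list."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Capitalized tokens are treated as proper names (e.g. "Sonja") and skipped.
    candidates = [t for t in tokens if t[0].islower()]
    if not candidates:
        return 0.0
    hits = sum(1 for t in candidates if t in TOKI_PONA_WORDS)
    return hits / len(candidates)


def looks_like_toki_pona(text: str, threshold: float = 0.6) -> bool:
    """Hypothetical cutoff: treat text as toki pona if most tokens match."""
    return toki_pona_ratio(text) >= threshold
```

A scorer like this only works for Latin-script text, so it would not help with the Hangul or Chinese pages mentioned above.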

http://www.suburbandestiny.com/?p=639 <-- Have I missed any toki pona pages on the net?
Search engine restricted to site listed above: http://www.google.com/cse/home?cx=01053 ... rmsq7cfp8g
Last edited by janMato on Sun Sep 05, 2010 5:04 pm, edited 2 times in total.

janKipo
Posts: 2911
Joined: Fri Oct 09, 2009 2:20 pm

Re: Toki pona corpus project needs your help!

Postby janKipo » Mon Feb 01, 2010 8:20 pm

Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I have neither the time nor the patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.

janMato

Re: Toki pona corpus project needs your help!

Postby janMato » Mon Feb 01, 2010 9:31 pm

janKipo wrote: Thank you. I have fallen way behind on this and never reached your coverage. I do, however, have a lot of shorter items, casual realtime (well, almost) exchanges and the like (I don't have the IRC logs). This does include a few items from the lost sites (I think -- the names are familiar), but not complete sets. Shall I ZIP what I have to you? I have neither the time nor the patience at the moment to set up various searches and counts, all of which need doing, as well as all the indexing that will be of use to Sonja as she works on The Book.


Absolutely! With the quantity of good material I've found, I'm optimistic about compiling an anthology in my spare time. Private message me if you need my email address again.

janMato

Re: Toki pona corpus project needs your help!

Postby janMato » Tue Feb 02, 2010 11:51 pm

Over 48,000 toki pona words counted.
In my corpus, ali is more common than ale.
kala, selo, and mu are the least successful words.
kipisi, monsuta, and pu don't show up in the results yet.

This is the application I used: http://neon.niederlandistik.fu-berlin.de/textstat/ Very, very handy.

Word   Count   Share of corpus
li   5115   10.52%
e   3504   7.21%
jan   2348   4.83%
mi   2038   4.19%
ni   1724   3.55%
toki   1390   2.86%
pi   1345   2.77%
ona   1307   2.69%
tawa   1305   2.68%
lon   1180   2.43%
ma   1167   2.40%
la   1084   2.23%
mute   1020   2.10%
tenpo   1015   2.09%
ala   949   1.95%
pona   912   1.88%
sina   891   1.83%
lili   799   1.64%
kama   604   1.24%
sona   565   1.16%
tan   557   1.15%
tomo   557   1.15%
suli   541   1.11%
ken   509   1.05%
jo   498   1.02%
o   488   1.00%
pilin   484   1.00%
lawa   475   0.98%
pali   464   0.95%
wile   460   0.95%
sewi   455   0.94%
ike   431   0.89%
telo   407   0.84%
lukin   402   0.83%
tu   356   0.73%
ali   348   0.72%
pana   340   0.70%
wan   320   0.66%
taso   318   0.65%
sama   306   0.63%
nasin   297   0.61%
kasi   290   0.60%
sike   289   0.59%
ante   283   0.58%
soweli   278   0.57%
suno   272   0.56%
ijo   271   0.56%
seme   260   0.53%
kulupu   257   0.53%
en   242   0.50%
nimi   234   0.48%
musi   222   0.46%
weka   219   0.45%
moku   217   0.45%
kepeken   199   0.41%
meli   199   0.41%
moli   198   0.41%
pini   198   0.41%
sitelen   193   0.40%
utala   192   0.39%
wawa   181   0.37%
nanpa   178   0.37%
lipu   176   0.36%
poka   166   0.34%
kin   152   0.31%
awen   145   0.30%
seli   144   0.30%
kon   141   0.29%
pakala   141   0.29%
mani   138   0.28%
loje   136   0.28%
nasa   132   0.27%
anpa   130   0.27%
kalama   128   0.26%
a   127   0.26%
olin   127   0.26%
sin   113   0.23%
pimeja   112   0.23%
luka   109   0.22%
mije   108   0.22%
ilo   102   0.21%
len   100   0.21%
poki   88   0.18%
sijelo   85   0.17%
kili   82   0.17%
nena   79   0.16%
mama   78   0.16%
kiwen   74   0.15%
anu   70   0.14%
lape   70   0.14%
linja   70   0.14%
akesi   69   0.14%
palisa   63   0.13%
sinpin   63   0.13%
waso   59   0.12%
ale   58   0.12%
noka   58   0.12%
mun   55   0.11%
open   50   0.10%
kute   49   0.10%
insa   48   0.10%
lete   48   0.10%
lupa   48   0.10%
jelo   47   0.10%
oko   43   0.09%
suwi   41   0.08%
laso   31   0.06%
uta   31   0.06%
walo   31   0.06%
ko   30   0.06%
monsi   30   0.06%
pipi   30   0.06%
pan   29   0.06%
jaki   28   0.06%
supa   23   0.05%
kule   22   0.05%
esun   21   0.04%
unpa   18   0.04%
kala   17   0.03%
selo   11   0.02%
mu   10   0.02%
Total   48631   
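
For anyone who wants to reproduce a count like this without TextStat, here is a minimal Python sketch. It is not how the table above was produced: the tokenizing rule (alphabetic tokens only, capitalized tokens skipped as proper names) and the function name are my own assumptions, and the sample sentence is the one quoted earlier in this thread.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> list[tuple[str, int, float]]:
    """Return (word, count, percent of all tokens), most frequent first."""
    # Lowercase-initial alphabetic tokens only; capitalized tokens are
    # assumed to be proper names and are left out of the count.
    tokens = [t for t in re.findall(r"[A-Za-z]+", text) if t[0].islower()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, c, 100.0 * c / total) for w, c in counts.most_common()]


sample = "tenpo ale la mi pali e lipu ni. lipu ni li pini ala."
for word, count, share in word_frequencies(sample):
    # Same column order as the table above: word, count, share.
    print(f"{word}\t{count}\t{share:.2f}%")
```

Comparing spelling variants such as ali and ale is then just a matter of looking up both words in the result.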

janMato

Re: Toki pona corpus project needs your help!

Postby janMato » Wed Feb 03, 2010 1:03 am

Here is a zip file of all the current redistributable toki pona text I can find. So far 50% of the collected corpus is redistributable.

Shout outs to the following awesome people who have published with a license compatible with republication.
Michael Freedman - BY-NC-ND
Everyone who posted to wikia - CC-BY-SA
John Clifford - (by a forum post implying he was willing to contribute, but no specific license yet)
Joop Kiefte - Public Domain
Bryant Knight - Public Domain
Sonja Kisa - BY-NC-SA
Dave Raftery - Creative Commons
Rowa Giso (not sure about their name) - AFAIK, these texts are a derivative of B. Knight's works, which are now public domain.

Posts to this forum are covered by CC per your TOS with jan Sonja (thanks for pointing that out jan Ote!) I suspect that re-licensing magic only works when the original copyright holder posts, though.

However, as far as I can tell, content on tokilili, Yahoo, LiveJournal, etc. remains under the copyright of the original contributors. The sites that post a TOS usually say that the site gets a limited license to run the mailing list or what have you, but there are no redistribution rights for folk like me. Then again, like any copyright, it only matters to the extent that someone can afford to enforce it. But that is a story for another day.

I excluded anything that was a translation of a copyrighted work and wasn't small enough to be considered fair use.

Once I get my toki pona website up again, I'll publish it there, too.

The next step is to rummage through this and start correcting the grammar and removing extraneous English, although the No-Derivs licenses on some of the texts make me worry about whether I can do that.
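
One way to keep the No-Derivs texts apart before any correcting starts is to tag each source with its license label and split on the ND clause. This is a minimal Python sketch, not a description of how the zip file is organized: the `Source` class and `allows_derivatives` helper are hypothetical, the labels are copied from the list above, and contributors whose license is unspecified or unclear are left out. Checking for an "ND" component is a simplification of what the licenses actually say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    contributor: str
    license: str  # label as written in the post above


# Labels copied from the post; "ND" (No Derivatives) is the clause that
# would forbid publishing corrected versions.
SOURCES = [
    Source("Michael Freedman", "BY-NC-ND"),
    Source("wikia contributors", "CC-BY-SA"),
    Source("Joop Kiefte", "Public Domain"),
    Source("Bryant Knight", "Public Domain"),
    Source("Sonja Kisa", "BY-NC-SA"),
]


def allows_derivatives(license_label: str) -> bool:
    """True unless the label carries a No Derivatives (ND) component."""
    parts = license_label.upper().replace(" ", "-").split("-")
    return "ND" not in parts


editable = [s.contributor for s in SOURCES if allows_derivatives(s.license)]
verbatim_only = [s.contributor for s in SOURCES if not allows_derivatives(s.license)]
print("May be corrected:", editable)
print("Republish verbatim only:", verbatim_only)
```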
Attachments
Redistributable Toki Pona Corpus.zip
(53.29 KiB) Downloaded 194 times
Last edited by janMato on Wed Feb 03, 2010 9:17 am, edited 1 time in total.

jan Ote
Posts: 424
Joined: Thu Oct 08, 2009 1:15 am
Location: ma Posuka

Re: Toki pona corpus project needs your help!

Postby jan Ote » Wed Feb 03, 2010 6:24 am

http://en.tokipona.org/wiki/Copyright wrote:All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.


There are some tp texts on my tp site: http://tokipl.wikidot.com/teksty
Current license for the site: CC-NC-SA, the same license used for the tp wiki and this forum (jan Sonja has chosen the 'unported' version). Derivative works are allowed under this license, so corrected and improved versions can be published. Of all the people mentioned by janMato, only M. Freedman used CC-NC-ND.
Texts there:
  • ma tomo Pape -- by Bryant Knight, from Wikipedia article
  • sike wan -- by Bryant Knight, from his site
  • jan lawa Oliki -- a modified version of a text by soweli Elepanto, sent to tp forum
  • kala -- by François Schwicker (jan Kanso), from the forum archive
  • toki suli Intenasijonale -- by jan-ante, text sent to tp forum
The remaining five or so are by mi en jan lili mi (me and my child) and are the final versions of texts already published and revised here.

Please send me a PM or write here if you need a plain ASCII version.

janMato

Re: Toki pona corpus project needs your help!

Postby janMato » Wed Feb 03, 2010 9:35 am

jan Ote wrote:
http://en.tokipona.org/wiki/Copyright wrote:All original text, images, sounds and videos on the Toki Pona website are licensed under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Unported Licence. Anything you contribute to this website's wiki and forums will also be published under this licence.


Thanks for pointing that out! I'll have to take time to scavenge this forum for suitable texts; it could end up being as important as wikia. The Yahoo forum texts reposted to this forum, though, I am treating as "unknown license."

270 KB total, 144 KB republishable

jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?

jan Ote

Re: Toki pona corpus project needs your help!

Postby jan Ote » Wed Feb 03, 2010 11:20 am

janMato wrote:jan lili sina li sitelen e nimi pi toki pona? pona a! jan li jo e sike suno pi seme nanpa?

jan lili mi li suli. tenpo sike tu wan kama la ona li ken pali lon tomo pali, li ken tawa tomo mani, li ken tawa weka tan tomo mi li ken jo e tomo ona.

tenpo mute la mi tu li toki lili kepeken toki pona lon tomo. tenpo pini la ona li toki e toki musi pi ''kala ma" tawa mi. mi toki e ni: "o sitelen e toki ni tawa mi! tan nasin ni la jan ante ken sona e ni".

jan lili mi li sitelen e toki ante. jan ante li pali e sitelen musi li sitelen e nimi lon sitelen ni kepeken toki Inli. ni li sitelen pi nimi Inli Manka. jan lili mi li kama jo e sitelen nimi ni li sitelen e nimi sin kepeken toki pona. ona li pana e ni tawa mi. mi lukin. taso ona li ken ala pana e ni tawa jan ante. jan ante ken ala lukin e ni. mama pi sitelen Manka ni taso li ken pana e ni tawa jan ante. ni li nasa.

jan Ote

Re: Toki pona corpus project needs your help!

Postby jan Ote » Tue Mar 23, 2010 6:19 am

9 toki pona texts, already known on the forum
Creative Commons BY-NC-SA
  • kala ma li lon ala tan seme? (40 words)
  • meli anu mije (60 words)
  • soweli en kili (70 words)
  • pipi musi en pipi pali (100 words)
  • toki suli Intenasijonale (200 words)
  • jan lawa Oliki (300 words)
  • tan jan Eloto. jan pi ma seme li jan nanpa wan? (400 words)
  • jan Kikamesi. jan Enkitu li kama (1800 words)
  • jan Kikamesi. utala pi jan Kuwawa (2800 words)
by jan-ante, jan soweli Elepanto, jan Mika, jan Ote
All texts in toki pona only; lines starting with "+" are titles and headers.
Attachments
suno-pona-corpus.zip
9 tp texts (short, medium, long), Creative Commons BY-NC-SA
(9.2 KiB) Downloaded 165 times
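
Since title and header lines start with "+", they are easy to separate from the body text before counting words. A minimal Python sketch follows; the function name and the sample lines are made up for illustration and are not an excerpt from the archive.

```python
def split_headers(lines):
    """Separate title/header lines (starting with "+") from body lines."""
    headers, body = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("+"):
            # Drop the "+" marker and any spacing after it.
            headers.append(stripped.lstrip("+").strip())
        else:
            body.append(stripped)
    return headers, body


# Illustrative input only, not taken from the zip file.
sample = [
    "+ kala ma li lon ala tan seme?",
    "",
    "jan lili mi li suli.",
]
headers, body = split_headers(sample)
print(headers)
print(body)
```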

linguafrakka
Posts: 3
Joined: Mon Apr 16, 2018 1:53 pm

Re: Toki pona corpus project needs your help!

Postby linguafrakka » Mon Apr 16, 2018 2:05 pm

Thought it would be worth mentioning here that I've recently brought the Toki Pona language to life over on Glosbe.com. The site originally had only 24 translated words, but I've now beefed it up with over 1,200 English words and their corresponding word/phrase in Toki Pona. I made sure to provide thorough English definitions for all translated words to ensure there is no semantic ambiguity regarding English words with multiple corresponding Toki Pona words (e.g., "cause" could be "kama", "pana", or "tan" based on context). Totally recommend checking out the site as a handy reference and contribution hub.

https://glosbe.com/en/mis_tok


(Sorry if my post was badly formatted. Created an account here just to spread the word about Glosbe.)

