New Google tool

Posted: Fri Dec 17, 2010 5:31 pm
by jan-ante
just try it:
http://ngrams.googlelabs.com/
it would be great to have toki pona in the list one day.

Re: New Google tool

Posted: Fri Dec 17, 2010 6:12 pm
by janMato
toki pona doesn't even register. In fact, the only fake language names that register are Klingon and Esperanto.

Some more conlang phrases.

It seems "aux lang" was a common phrase during the war-- maybe it had a military meaning outside of the fake language sense.

Re: New Google tool

Posted: Sat Dec 18, 2010 10:42 am
by jan-ante
janMato wrote: toki pona doesn't even register. In fact, the only fake language names that register are Klingon and Esperanto.
no, i mean the tool to trace language evolution over time
e.g. in english you can see the frequency drop for many function words, and the same in russian. this may reflect the compaction of language structure, use of longer chains of modifiers, etc. but for some words english and russian evolve differently
so, what about tp? measuring f(li), the frequency of "li", we could estimate how the length of tp sentences changes. f(pi) might indicate the change in the complexity of modifier chains, etc
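
A minimal sketch of what measuring f(li) and f(pi) could look like, assuming f(w) means the share of all word tokens in a text that are w. The function name `relative_frequency`, the tokenizer and the sample sentence are made up for illustration; nothing here comes from an existing tool.

```python
import re
from collections import Counter

def relative_frequency(text, words=("li", "pi")):
    """Return each word's share of all word tokens in the text.

    Assumption: a token is any run of ASCII letters, lowercased.
    """
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # An empty text gets 0.0 for every word instead of dividing by zero.
    return {w: (counts[w] / total if total else 0.0) for w in words}

# Hypothetical sample text, not taken from any corpus.
sample = "jan li moku e kili. tomo pi jan lili li pona."
freqs = relative_frequency(sample)
print(freqs)
```

Whole tokens are counted, so "lili" does not count as "li". Running this per document and grouping the results by year would give the kind of curve the ngram tool draws.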

Re: New Google tool

Posted: Sat Dec 18, 2010 6:43 pm
by janMato
It is interesting how the Russian Revolution had such an impact on basic things like how many clause-introducing words people used in published texts.
jan-ante wrote: so, what about tp? measuring f(li) we could estimate how the length of tp sentences changes. f(pi) might indicate the change in the complexity of modifier chains, etc
I wrote some code to try to come up with readability scores for toki pona.

I plan to write some code that will assign readability metrics to each text file in the toki pona corpus-- mostly so that I can sort them and publish them in a graded reader. If I extend that code to spit out the date of the source material, then a graph wouldn't be too much more work.

Outside of sentence length, what metric would you use to measure complexity, or what other metrics would be of interest?

Re: New Google tool

Posted: Sun Dec 19, 2010 7:53 am
by jan-ante
janMato wrote: It is interesting how the Russian Revolution had such an impact on basic things like how many clause-introducing words people used in published texts.
it was the biggest revolution of minds in russian history. it brought precise thinking to the broad masses of people. from that time every schoolchild studied literature, mathematics, chemistry, darwinism, etc. but ngrams could be even more interesting than you expect. look how the defeats in both world wars affected german thinking. you could separately try without "sie", "so", "wenn" to view the effect for low-frequency words. compare how the wars affected english speakers: the effect was the opposite (except "will"). then go "back to the USSR". you can see peaks at 1928, 1942, 1953, 1990. you probably know what the 2nd and 4th dates mean in soviet history. 1928 was the famine, 1953 was Stalin's death. the turning point in the late soviet era was 1975-1977, when (probably) the accumulation of pakala started; 1990 was just the culmination. this refutes the theory of Gorbi's conspiracy as the cause of the soviet collapse.
note that these processes were probably subconscious. some very evidently bad style (like starting a sentence with "Также" or "Далее" ("Also" and "Further")) dropped during the war, but increased abruptly with the advent of "freedom".

i wonder, could somebody check this for french, spanish and (if applicable) for chinese? it would be interesting to compare.

Re: New Google tool

Posted: Sat Jan 01, 2011 10:15 pm
by janMato
Well, it doesn't span any historical periods and there isn't much political talk going on, but I have some metrics and I calculate them for a variety of documents. I'll have to go back to all of these to get what year they were written-- I didn't think to get that when I was gathering files for the corpus.
http://tokipona.net/tp/CorpusReadability.aspx

I also have a primitive corpus search that accepts regex searches
http://tokipona.net/tp/CorpusSearch.aspx
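
For anyone who wants to try the same idea offline, here is a rough sketch of a regex search over a corpus held in memory. It is not the code behind the page above; the `search_corpus` function, the document names and the sample texts are all made up for illustration.

```python
import re

def search_corpus(corpus, pattern):
    """Return (document name, line number, line) for every matching line.

    Assumption: corpus is a dict mapping a document name to its full text.
    """
    regex = re.compile(pattern)
    hits = []
    for name, text in corpus.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, lineno, line))
    return hits

# Hypothetical two-document corpus.
corpus = {
    "doc_a": "jan li moku.\ntomo pi jan lili li pona.",
    "doc_b": "mi tawa tomo.",
}

# Find "pi" followed by at least two more words.
hits = search_corpus(corpus, r"\bpi\s+\w+\s+\w+")
for name, lineno, line in hits:
    print(name, lineno, line)
```

The `\b` keeps the pattern from matching a "pi" that is only the tail of a longer word.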

When I combine these and create a graph, I'll have something close to an N-Gram thingy.

Re: New Google tool

Posted: Thu Jan 06, 2011 6:01 am
by jan Ote
janMato wrote:I have some metrics and I calculate them for a variety of documents. I'll have to go back to all of these to get what year they were written-- I didn't think to get that when I was gathering files for the corpus.
http://tokipona.net/tp/CorpusReadability.aspx
I looked for the easiest text in the corpus. And the winner is... surprise! surprise!... "advanced- jan Kikamesi- jan Enkitu li kama" with a combined readability score equal to 0.0, as all its metrics are equal to zero :D
(The file is empty).

While the hardest to read is your "Troll" (8.9, when 1.0 is the average). Its Complex NP, Function and Words/Sentence measures are extremely high just because the sentences are delimited by commas instead of full stops.
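
The comma effect is easy to reproduce. Below is a small sketch, assuming Words/Sentence is simply the mean number of whitespace-separated words between sentence delimiters; the real metric on the page may be computed differently, and `words_per_sentence` and both sample texts are made up for illustration.

```python
import re

def words_per_sentence(text, delimiters=".!?"):
    """Mean number of words per sentence, splitting on the given delimiters."""
    splitter = "[" + re.escape(delimiters) + "]"
    sentences = [s for s in re.split(splitter, text) if s.strip()]
    if not sentences:
        # An empty file has no sentences, so it scores 0.0.
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

# The same three clauses, delimited by full stops and by commas.
with_stops = "mi moku. mi lape. mi tawa."
with_commas = "mi moku, mi lape, mi tawa."

print(words_per_sentence(with_stops))   # 2.0
print(words_per_sentence(with_commas))  # 6.0
```

The comma version counts as one long sentence, so the score triples even though the content is identical.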

Re: New Google tool

Posted: Thu Jan 06, 2011 8:34 am
by janMato
Empty file- fixed. There's actually a lot of work left to make this corpus usable for a variety of purposes. First is to come up with a system for metadata-- is it poetry, what year was it written, etc.

Crazy minima and maxima -- not fixed yet but addressed with more data. I included my entire compiled corpus including the stuff that isn't strictly redistributable-- I'm supposing I'll use YouTube's rules (post content and take it down when the owner complains) or "fair use" as a defense should I get any care bear stares.

Re: New Google tool

Posted: Sat Jul 30, 2011 8:43 am
by janMato
Ricky6 wrote: Good if google would come up with a Toki Pona version..
Done. http://tokipona.net/tp/ Enter your tp search words in the box and click search. The results are restricted to sites manually determined to have toki pona content.