New Google tool

jan-ante
Posts: 541
Joined: Fri Oct 02, 2009 4:05 pm

New Google tool

Postby jan-ante » Fri Dec 17, 2010 5:31 pm

just try it:
http://ngrams.googlelabs.com/
it would be great to have toki pona in the list one day.

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD
Contact:

Re: New Google tool

Postby janMato » Fri Dec 17, 2010 6:12 pm

toki pona doesn't even register. In fact, the only fake language names that register are Klingon and Esperanto.

Some more conlang phrases.

It seems "aux lang" was a common phrase during the war-- maybe it had a military meaning outside of the fake language sense.

jan-ante
Posts: 541
Joined: Fri Oct 02, 2009 4:05 pm

Re: New Google tool

Postby jan-ante » Sat Dec 18, 2010 10:42 am

janMato wrote:toki pona doesn't even register. In fact, the only fake language names that register are Klingon and Esperanto.

no, i mean the tool to trace language evolution over time
e.g. in english you can see the frequency drop for many function words, as well as in russian. this may reflect the compaction of language structure, use of longer chains of modifiers, etc. but for some words english and russian evolve differently
so, what about tp? by measuring f(li), the relative frequency of "li", we could estimate how the length of a tp sentence changes. f(pi) might indicate the change in the complexity of modifier chains, etc
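
A minimal sketch of that measurement, assuming a corpus where each text is already tagged with a year. The function name `particle_frequency`, the dictionary layout, and the sample sentences are all made up for illustration; none of this is an existing tool.

```python
import re
from collections import Counter

def particle_frequency(texts_by_year, particles=("li", "pi")):
    """Return {year: {particle: relative frequency}} for a dated corpus.

    texts_by_year is a hypothetical mapping of year -> raw text.
    """
    result = {}
    for year, text in sorted(texts_by_year.items()):
        # Crude tokenizer: lowercase runs of ASCII letters.
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(words)
        total = len(words)
        result[year] = {
            p: (counts[p] / total if total else 0.0) for p in particles
        }
    return result

# Invented sample data, only to show the shape of the input and output.
sample = {
    2005: "jan li moku. jan li lape.",
    2010: "jan pi toki pona li moku e kili pi ma ante.",
}
freq = particle_frequency(sample)
for year, f in freq.items():
    print(year, f)
```

A falling f(li) would then suggest longer sentences (fewer predicates per word), and a rising f(pi) would suggest more nested modifier chains, under the assumptions in the post above.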

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD
Contact:

Re: New Google tool

Postby janMato » Sat Dec 18, 2010 6:43 pm

It is interesting how the Russian Revolution had such an impact on basic things like how many clause-introducing words people used in published texts.

jan-ante wrote:so, what about tp? by measuring f(li) we could estimate how the length of a tp sentence changes. f(pi) might indicate the change in the complexity of modifier chains, etc


I wrote some code to try to come up with a readability measure for toki pona.

I plan to write some code that will assign readability metrics to each text file in the toki pona corpus-- mostly so that I can sort them and publish them in a graded reader. If I extend that code to spit out the date of the source material, then a graph wouldn't be too much more work.

Outside of sentence length, what metric would you use to measure complexity, or what other metrics would be of interest?
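
For reference, a rough sketch of two such metrics: average words per sentence and the share of function words. This is not the code behind the corpus pages; the function name `readability_metrics`, the `FUNCTION_WORDS` list, and the sentence-splitting rule are assumptions chosen for illustration.

```python
import re

# Assumed list of toki pona function words; adjust to taste.
FUNCTION_WORDS = {"li", "e", "pi", "la", "o", "en", "anu"}

def readability_metrics(text):
    """Compute simple per-text metrics from raw toki pona text."""
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    if not sentences or not words:
        # An empty file scores zero on everything.
        return {"words_per_sentence": 0.0, "function_word_ratio": 0.0}
    function_count = sum(1 for w in words if w in FUNCTION_WORDS)
    return {
        "words_per_sentence": len(words) / len(sentences),
        "function_word_ratio": function_count / len(words),
    }

metrics = readability_metrics("mi moku. sina lukin e tomo pi jan pona.")
print(metrics)
```

Because the splitter only looks at sentence-final punctuation, a text that separates sentences with commas would be treated as one very long sentence and score as much harder than it is.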

jan-ante
Posts: 541
Joined: Fri Oct 02, 2009 4:05 pm

Re: New Google tool

Postby jan-ante » Sun Dec 19, 2010 7:53 am

janMato wrote:That is interesting how the Russian Revolution had such an impact on basic things like how many clause introducing words people used in published texts.

it was the biggest revolution of minds in russian history. it brought precise thinking to the broad masses of people. from that time every schoolchild studied literature, mathematics, chemistry, darwinism, etc. but ngrams could be even more interesting than you expect. look how defeats in both world wars affected german thinking. you could separately try it without sie, so, wenn to view the effect for low-frequency words. Compare how the wars affected english speakers. the effect was the opposite (except for "will"). then go "back to the USSR". you can see peaks at 1928, 1942, 1953, 1990. you probably know what the 2nd and 4th dates mean in soviet history. 1928 was the famine, 1953 was Stalin's death. the turning point in the late soviet era was 1975-1977, when (probably) the accumulation of pakala started; 1990 was just the culmination. this refutes the theory of Gorbi's conspiracy as the cause of the soviet collapse.
note that these processes were probably subconscious. some very evident bad style (like starting a sentence with "Также" or "Далее" (Also & Further)) dropped off during the war, but increased abruptly with the advent of "freedom".

i wonder, could somebody check this for french, spanish and (if applicable) for chinese? it would be interesting to compare.

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD
Contact:

Re: New Google tool

Postby janMato » Sat Jan 01, 2011 10:15 pm

Well, it doesn't span any historical periods and there isn't much political talk going on, but I have some metrics and I calculate them for a variety of documents. I'll have to go back to all of these to get what year they were written-- I didn't think to get that when I was gathering files for the corpus.
http://tokipona.net/tp/CorpusReadability.aspx

I also got a primitive corpus search working that accepts regex searches
http://tokipona.net/tp/CorpusSearch.aspx

When I combine these and create a graph, I'll have something close to an N-Gram thingy.
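
A sketch of how those two pieces might fit together, assuming each corpus document carries a title and a year (the year metadata is hypothetical here, since it has not been collected yet). The names `search_corpus` and `hits_per_year` and the sample documents are invented for illustration and are not the code behind the pages linked above.

```python
import re
from collections import defaultdict

def search_corpus(corpus, pattern):
    """Yield (title, year, line) for every line matching the regex."""
    rx = re.compile(pattern)
    for title, year, text in corpus:
        for line in text.splitlines():
            if rx.search(line):
                yield title, year, line

def hits_per_year(corpus, pattern):
    """Count regex matches per year, ready to feed into a plot."""
    rx = re.compile(pattern)
    counts = defaultdict(int)
    for _title, year, text in corpus:
        counts[year] += len(rx.findall(text))
    return dict(sorted(counts.items()))

# Invented sample corpus: (title, year, text) tuples.
corpus = [
    ("doc a", 2008, "mi moku e kili.\nsina lape."),
    ("doc b", 2010, "jan li moku e telo.\njan li moku e kili."),
]
matches = list(search_corpus(corpus, r"\bmoku e \w+"))
yearly = hits_per_year(corpus, r"\bmoku\b")
print(matches)
print(yearly)
```

Dividing each year's count by that year's total word count would give a relative frequency, which is closer to what an n-gram viewer plots.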

jan Ote
Posts: 424
Joined: Thu Oct 08, 2009 1:15 am
Location: ma Posuka
Contact:

Re: New Google tool

Postby jan Ote » Thu Jan 06, 2011 6:01 am

janMato wrote:I have some metrics and I calculate them for a variety of documents. I'll have to go back to all of these to get what year they were written-- I didn't think to get that when I was gathering files for the corpus.
http://tokipona.net/tp/CorpusReadability.aspx
I looked for the easiest text in the corpus. And the winner is... surprise! surprise!... "advanced- jan Kikamesi- jan Enkitu li kama" with a combined readability score of 0.0, as all its metrics are equal to zero :D
(The file is empty).

While the hardest to read is your "Troll" (8.9, when 1.0 is the average). Its Complex NP, Function and Words/Sentence measures are extremely high just because the sentences are delimited by commas instead of full stops.

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD
Contact:

Re: New Google tool

Postby janMato » Thu Jan 06, 2011 8:34 am

Empty file- fixed. There's actually a lot of work left to make this corpus usable for a variety of purposes. First is to come up with a system for metadata-- is it poetry, what year was it written, etc.

Crazy minima and maxima -- not fixed yet but addressed with more data. I included my entire compiled corpus including the stuff that isn't strictly redistributable-- I'm supposing I'll use YouTube's rules (post content and take it down when the owner complains) or "fair use" as a defense should I get any care bear stares.

janMato
Posts: 1545
Joined: Wed Dec 02, 2009 12:21 pm
Location: Takoma Park, MD
Contact:

Re: New Google tool

Postby janMato » Sat Jul 30, 2011 8:43 am

Ricky6 wrote:Good if google would come up with a Toki Pona version..


Done. http://tokipona.net/tp/ Enter your tp search words in the box and click search. The results are restricted to sites manually determined to have toki pona content.

