tpp++ toki pona cross compiler

Tinkerers Anonymous: Some people can't help making changes to "fix" Toki Pona. This is a playground for their ideas.

tpp++ toki pona cross compiler

Post by janMato »

I've written two blog posts on cross compiling annotated toki pona to ordinary toki pona.

This is the blog post that tries to express what my idea is. It isn't clever or original, but it is easy to misunderstand.
http://fakelinguist.wakayos.com/?p=831

This is a draft I threw together on annotations that I think would be nice for tp++ to have. It is not complete at all. Some of it is already implemented in my existing toki pona parser (e.g. using commas to mark prepositions), and some of it I'm not sure I'm smart enough to implement (such as declared variables and phrase-to-pronoun binding).
http://fakelinguist.wakayos.com/?p=834
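As a taste of what an annotation buys, here is a minimal sketch of the comma rule. The preposition list, names, and tokenizer are my own simplifications, not the actual parser:

```typescript
// Toy illustration of "a comma marks the next word as a preposition".
const PREPOSITIONS = new Set(["tawa", "lon", "kepeken", "sama", "tan", "poka"]);

type Token = { word: string; role: "preposition" | "content" };

function tokenize(sentence: string): Token[] {
  const tokens: Token[] = [];
  let afterComma = false; // did the previous token end with a comma?
  for (const raw of sentence.split(/\s+/)) {
    const word = raw.replace(/,$/, "");
    const role =
      afterComma && PREPOSITIONS.has(word) ? "preposition" : "content";
    tokens.push({ word, role });
    afterComma = raw.endsWith(",");
  }
  return tokens;
}

// "mi pana e tomo, tawa sina": the comma forces 'tawa' to head a
// prepositional phrase, so the "tomo tawa" (car) reading never arises.
```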

Re: tpp++ toki pona cross compiler

Post by janKipo »

I get the general idea, at least, but I am not clear about what the purpose is, nor how to apply it. Given a corpus, what? We feed a sentence from it into a machine that produces, minimally, a declaration "tp" or "not tp". But, hopefully, also a set of parse trees that lay out all the possible valid readings for the sentence (and as much as it can manage of things that turn out not to be tp). What does this new gizmo add? Apparently, if there are a bunch of pre-editing marks added to the text, it can guide the parser to more accurate parses (eliminate some unreasonable possibilities, for example). But that doesn't seem to add to the parser's power, since the text could have been written with those marks in place already, the marks then dropped when the final sentence is reproduced after the parsing is done. The question comes, then: who or what does the preediting? And, if it is humans, what does the parser actually do? Are there, in this new gizmo, devices to automatically add the preediting marks? Or, at least, to consider the possibilities with and without them?
I admit that I don't get linear grammars well and often find them uninformative. So, most of these rules just seem odd to me. The one that does make sense is the rule that always puts 'li' between subject and predicate and then erases it after 'mi' and 'sina' standing alone. The idea of shoving pieces around is appealing, but except for moving PPs to the front with 'la' (and often losing the preposition), it doesn't seem to have much obvious reference to tp. Unless, of course, we are actually stepping back from the surface text somehow to underlying structures. But that is an area where machines are notoriously unreliable (imagine all the structures that give rise to a 'pi' phrase, for example) or overproductive (imagine listing them all out).
So, I hope your next paper on this is a bit about what is going to happen and where whatever you are suggesting fits in.
I really want a machine to do the scut work, so keep on with this project. And thanks for what you have given so far.

Re: tpp++ toki pona cross compiler

Post by janKipo »

Or are you talking about a production grammar, which inputs a (more or less) abstract structure and outputs a sentence? So the meaning is quite explicit in the structure but may be ambiguous in the sentence. The role of a parsing grammar is then to produce all the production structures which would give rise to the given sentence.

Re: tpp++ toki pona cross compiler

Post by janMato »

From the sound of it, yes, a cross compiler is a production grammar. The input abstract structure is annotated toki pona, with additional meaning-specific synonyms for the particles. The output is a sentence of ordinary toki pona.
janKipo wrote: I get the general idea, at least, but I am not clear about what the purpose is, nor how to apply it. Given a corpus, what? We feed a sentence from it into a machine that produces, minimally, a declaration "tp" or "not tp".
That would be a byproduct of creating a cross compiler. Cross compiling ordinary toki pona to ordinary toki pona would fail on invalid toki pona. It's useful for basic grammar checking, but that's already been done.

The real purpose of a cross compiler of this sort is to allow advanced toki pona users to take advantage of more powerful syntax while still being able to machine-convert their text into ordinary toki pona.
janKipo wrote: But, hopefully, also a set of parse trees that lay out all the possible valid readings for the sentence (and as much as it can manage of things that turn out not to be tp).
In both tp++ and tp, a pi chain parses to one data structure, even though it might mean a combinatorial explosion of different things based on different groupings.
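For illustration, a flat representation, plus a count of how fast the candidate groupings multiply (the representation and names are a sketch of mine, not a spec):

```typescript
// A pi chain stays one flat structure in the parse:
// "soweli pi mun lili pi jan Mato" -> [["soweli"], ["mun","lili"], ["jan","Mato"]]
type PiChain = string[][];

function parsePiChain(phrase: string): PiChain {
  return phrase.split(" pi ").map(part => part.split(" "));
}

// ...while the number of ways to group n segments into nested binary
// modifications grows like the Catalan numbers, so the meanings explode
// even though the data structure does not.
function groupings(n: number): number {
  if (n <= 1) return 1;
  let total = 0;
  for (let k = 1; k < n; k++) total += groupings(k) * groupings(n - k);
  return total;
}
// groupings(3) === 2, groupings(5) === 14, groupings(7) === 132
```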

The cross compiler doesn't know or care how the meaning of the sentence binds to elements in reality. The cross compiler might validate that pronouns bind to something in a predictable manner.
janKipo wrote: What does this new gizmo add? Apparently, if there are a bunch of pre-editing marks added to the text, it can guide the parser to more accurate parses (eliminate some unreasonable possibilities, for example).
The human reader would see fewer parsings (fewer ways to link up the sentence to reality), but the compiler would see only one of the possible ones. E.g. using mon instead of pi when indicating personal possession reduces the ways to interpret the phrase: soweli pi mun lili mon jan Mato would compile down to soweli pi mun lili pi jan Mato. Both versions have problems determining what is being possessed (the mun or the soweli), but in the case of mon jan Mato, the tp++ writer can see that we don't have a soweli or mun of a jan Mato sort. After compilation, soweli pi mun lili pi jan Mato is ordinary toki pona, presumably just as readable as any other toki pona.
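As code, that compile step can be as small as a particle table; mon is the hypothetical possessive particle and the only entry shown here:

```typescript
// Map each tp++ annotation particle to the ordinary-tp particle it
// compiles down to. 'mon' is hypothetical; a real table has more rows.
const PARTICLE_TABLE = new Map<string, string>([["mon", "pi"]]);

function compilePhrase(tpPlusPlus: string): string {
  return tpPlusPlus
    .split(" ")
    .map(word => PARTICLE_TABLE.get(word) ?? word) // rewrite annotations
    .join(" ");                                    // plain tp passes through
}

// compilePhrase("soweli pi mun lili mon jan Mato")
//   === "soweli pi mun lili pi jan Mato"
// The tp++ writer saw an unambiguous possessive; the reader of the
// output sees ordinary, ambiguous pi.
```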
janKipo wrote: But that doesn't seem to add to the parser's power, since the text could have been written with those marks in place already, the marks then dropped when the final sentence is reproduced after the parsing is done. The question comes, then: who or what does the preediting?
People who want to work with a more powerful syntax, but continue to write texts that are readable by people unaware of tp++.
janKipo wrote: And, if it is humans, what does the parser actually do? Are there, in this new gizmo, devices to automatically add the preediting marks? Or, at least, to consider the possibilities with and without them?
Inferring missing symbols is incredibly difficult to do by machine. This is why all programming languages have well-defined terminal symbols and spaces between words. Humans, it seems, can do okay without punctuation, vowels and so on, but writing a compiler that can make sense of that is hard. I had to write a few hundred lines of ugly code to infer "li" when it is dropped between mi/sina and the verb phrase.
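The easy half of that rule fits in a few lines; the few hundred ugly lines handle everything the comments below wave away:

```typescript
// Internal canonical form always has 'li' between subject and predicate;
// re-insert it after a bare 'mi'/'sina' subject. Compound subjects,
// vocatives, quoted speech, etc. are what the real code spends its lines on.
function inferLi(words: string[]): string[] {
  const first = words[0];
  const bare =
    (first === "mi" || first === "sina") &&
    words.length > 1 &&
    !words.includes("li"); // a 'li' elsewhere means the subject wasn't bare
  return bare ? [first, "li", ...words.slice(1)] : [...words];
}

// inferLi(["mi", "moku"]) -> ["mi", "li", "moku"]   (canonical form)
// The emitter applies the inverse rule -- erase 'li' after a bare
// mi/sina -- when printing ordinary tp back out.
```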
janKipo wrote: I admit that I don't get linear grammars well and often find them uninformative. So, most of these rules just seem odd to me. The one that does make sense is the rule that always puts 'li' between subject and predicate and then erases it after 'mi' and 'sina' standing alone. The idea of shoving pieces around is appealing, but except for moving PPs to the front with 'la' (and often losing the preposition), it doesn't seem to have much obvious reference to tp.
tp has a mono-sentence syntax. There is only one sentence shape, and all the phrases are required to be in one order. This is an unnecessary constraint, a rule that could be dispensed with. That some phrases don't have head particles is another irregularity that makes it difficult to break the mono-sentence syntax. If there were a subject header, then you could shuffle all the particle-headed phrases of a sentence and it would mean the same thing, but the language would be more expressive.

A (smallish) language should be expressive and not too verbose. As writers, we've pushed tp to its limits.
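Here is a sketch of what that shuffling could buy, with su as an invented subject-header particle (purely illustrative, not a proposal for a specific word):

```typescript
// If every phrase carried a head particle, phrases could appear in any
// order in tp++ and be compiled back to tp's one fixed order.
// 'su' is the invented subject header; 'prep' covers prepositional phrases.
type Phrase = { head: "la" | "su" | "li" | "e" | "prep"; words: string[] };

const CANONICAL = ["la", "su", "li", "e", "prep"]; // context, subject, predicate, object, PP

function canonicalize(phrases: Phrase[]): Phrase[] {
  return [...phrases].sort(
    (a, b) => CANONICAL.indexOf(a.head) - CANONICAL.indexOf(b.head)
  );
}

// "e kili su mi li moku" and "su mi li moku e kili" canonicalize to the
// same order; the emitter then drops 'su' (and the 'li' after bare 'mi')
// to yield ordinary "mi moku e kili".
```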
janKipo wrote: Unless, of course, we are actually stepping back from the surface text somehow to underlying structures. But that is an area where machines are notoriously unreliable (imagine all the structures that give rise to a 'pi' phrase, for example) or overproductive (imagine listing them all out).
Since this is a cross compiler, it passes long pi chains through undigested from start to end. Only a human brain can reliably make sense of a long pi chain. But while writing, the tp++ author can avail themselves of alternatives to pi that compile down to pi: for example, a particle that means possession, instead of whatever generic thing pi stands for.
janKipo wrote: So, I hope your next paper on this is a bit about what is going to happen and where whatever you are suggesting fits in. I really want a machine to do the scut work, so keep on with this project. And thanks for what you have given so far.
It's an opportunity for me to learn about compilers. I ended up in programming sort of by accident, so if I finish it, this would be the computer science term paper I never wrote.

Re: tpp++ toki pona cross compiler

Post by janKipo »

OK, now I sorta get the program. When I hear about machine language work, I tend to think in terms of parsers, since that seems to be the main thing people want to do, and it is a very useful thing to have done. Too bad computers aren't very good at it (not really that much better -- aside from speed -- than they were when I was working in it in 1960. Oh, and maybe elegance.) Lojban (and Loglan before it) are pretty thoroughly parsable, of course, but that is because they are essentially computer languages made vaguely speakable (as witness the fact that they have a syntax completely controlled by BNF rules and some sort of compiler; I'm talking about official Lojban here: what is spoken and used on the chats is either not parsable or gets the wrong parse most of the time).
So, someone sits down and writes some non-tp which is, however, fairly clear -- unambiguous, or at least less ambiguous. This is now run through your cross-compiler (what's the "cross" part about?) and out comes a sentence in ordinary tp which has the intention of the original formula as one of its possible meanings, hopefully the most probable one in the context. So, in addition to learning tp, our compiler user also has to learn tp++, which he does not use to talk to other people, just to the machine, to get something he can use to talk to other people -- something he could, hopefully, do without the machine, framing sentences in his head in accordance with his intended meaning, maybe even taking steps along the way to block obvious problems in simple solutions.
So the question now is, why tp++; why not (as Chomsky and Montague and ol' Unca Tom Cobbley an' a' say) just tp and combination rules? We know, to take a case from Chapter One, that if you have 'jan li suli', the next time you refer to that jan, you can use 'jan suli'; and so on for all subject/predicate forms: 'x li y' justifies 'x{y}' anywhere x has the original referent (the {}s deal with 'pi' externally and objects and PPs internally). Similarly (from like Chapter Five), 'y li jo e x' justifies (with coreferential x again) 'x{y}'. Indeed, that works with just about any verb, though the notion of possession varies a lot. Voila! no 'mon', just a tp situation. And on and on. 'tawa noka' comes from 'tawa kepeken noka' by part of the process involved in {}s round predicates. The (at least) three meanings of 'ala' arise in (at least) three different ways, as do the (at least) five different meanings of 'kin'. Each way is just a matter of what tp sentences go into the mix to give rise to the final product.

And, of course, Linguistics says that something like this is what actually goes on when we make sentences -- we're just very fast and we know a lot of shortcuts. Understanding is just working backward along the possible paths to a recognizable situation (again, we're fast, we know a lot of shortcuts, and we can correct our mistakes). If you must have a machine involved, at least what you have to work with is tp at its lowest level, ultimately sentences of 3 to 7 words to begin with. [Needless to say, this is the way I think tp should be taught, and the chapter references are to real chapters in a fairly vaguely concrete textbook.]
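(For concreteness, the two derivations above rendered as toy rewrite rules; the function names are invented and the real descent is much richer:)

```typescript
// Toy versions of the two derivations:
//   predication: "x li y"      justifies the modifier form "x y"
//   possession:  "y li jo e x" justifies the modifier form "x y"
function fromPredication(sentence: string): string | null {
  const m = sentence.match(/^(\w+) li (\w+)$/);
  return m ? `${m[1]} ${m[2]}` : null; // "jan li suli" -> "jan suli"
}

function fromPossession(sentence: string): string | null {
  const m = sentence.match(/^(\w+) li jo e (\w+)$/);
  return m ? `${m[2]} ${m[1]}` : null; // "jan li jo e soweli" -> "soweli jan"
}

// Both derivations land on the same surface shape "x y", which is exactly
// where the possession/predication distinction gets lost.
```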

Re: tpp++ toki pona cross compiler

Post by janMato »

janKipo wrote: So the question now is, why tp++; why not (as Chomsky and Montague and ol' Unca Tom Cobbley an' a' say) just tp and combination rules?
The goal isn't so much the handful of example features I listed, but to break this design deadlock. I can't (well, no one can) innovate with toki pona except around the edges, e.g. by seeing an unspecified feature and using it as if it had a defined specification. For example, the verb phrase has more possibilities than we have meanings for: jan li moku pini is suggestive of a perfective, and I could use it consistently when I mean the perfective, but AFAIK no one ever said that pini is a grammatical word like the English "have" of "have eaten". And there are a few other tricks for innovating without significant buy-in from the community or the designer.

But with tp++, the sky is the limit -- polysynthetic toki pona? Sure, why not? Too hard for me to write the cross compiler for, but that is one direction it could go.

For my aesthetic sense, it just feels like the language has too few particles. Japanese has boatloads of them -- we probably don't need that many, but with so few words, more particles are called for. A new basic word would make toki pona 1% more expressive -- hardly noticeable. But a relative clause would revolutionize the language, and depending on how it was implemented, it would compile down to a tedious chain of short sentences connected by ni's whose referents you can't tell for sure.* But in the source code, with a relative clause particle, the referent would be immediately obvious.

*(Maybe it could compile down to ni's with obviatives, e.g. soweli li suwi. mi lukin e ni soweli. Last time I wrote much toki pona, I noticed I preferred using ona with a modifier to disambiguate the referent.)
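A sketch of that compile-down, with ke as an invented relative-clause particle (the name and the transformation details are illustrative only):

```typescript
// 'ke' is an invented relative-clause particle: "mi lukin e soweli ke suwi"
// ~ "I see the animal that is cute". It compiles down to the footnote's
// obviated ni chain.
function compileRelativeClause(sentence: string): string {
  const m = sentence.match(/^(.* e )(\w+) ke (.+)$/);
  if (!m) return sentence; // no relative clause: pass through unchanged
  const [, mainUpToE, head, clause] = m;
  return `${head} li ${clause}. ${mainUpToE}ni ${head}.`;
}

// compileRelativeClause("mi lukin e soweli ke suwi")
//   === "soweli li suwi. mi lukin e ni soweli."
```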

Regarding [y li jo e x] becoming x y and [x li y] becoming x y: in my opinion, this is exactly what makes toki pona difficult to read. Juxtaposition is a universal relationship (i.e. it can mean just about anything) -- not only is it hard for machines to decide what a juxtaposition means, it's hard for human brains as well. Those two transformations discard information that a human brain must then painstakingly reconstruct. Obviously I'm not natively fluent -- a native might be able to do this with ease -- but even seemingly simple content written in toki pona takes me several readings and some pretty high-quality context to guess what relationship was meant.

Anyhow, in the JavaScript/TypeScript world there is a similar discussion going on: lots of people are quite happy with JavaScript, and since TypeScript compiles down to JavaScript, they don't see the value. But TypeScript authors get to write in a language with features that browsers will never implement, gaining expressiveness without having to teach all the web browsers a new language.

Re: tpp++ toki pona cross compiler

Post by janKipo »

Still trying to locate this critter in a linguistic framework. So, at the top of any grammar of a sentence is a cluster of concepts and references and functions, usually represented as a formula of some intensional logic (at least for Montagovians). This is the full meaning of the sentence and contains everything that the sentence might have (and perhaps more besides). This can give rise, even in a given language, to a number of discourses of very different levels of complexity. Using the fewest processing steps before the obligatory ones for a language (declension, word order, agreement and other such superficial surface phenomena, including the isolating-polysynthetic spectrum, and the pronunciation realizations), one ends up with a set of simple sentences, including, perhaps, sentences which merely relate other sentences: "This is past: He comes. This entails that. This is past: She is happy." If the structure that became this group was subjected to further processing before going to the final set, it might be something shorter, a single sentence: either "If he came, she was happy" or "His coming was (or even 'would have been') sufficient for her happiness." And so on.

The point is that the structure at every point (well, maybe not every point) can be taken down to a discourse in the language and called up later to explain sentences that come later, or, indeed, earlier, in the chain of processes. For a lot of these processes lose information, either absolutely (as when tense just drops out in tp) or by having its expression overlap with the expression of some other information (as when the retrofuture falls together with the subjunctive in English), and going back a step may reveal what was lost.

In that context, I take it that you are aiming, if not for the intensional formula, then for something still relatively high on the processing chart, where the distinctions you favor are still in view, but lower than the all-simple-sentences version, so it is reasonably compact and usable (that is, it is the way you would like to write it). My point then is simply that that step, whatever it may be, can be put into tp in a way that helps explain what is lost. The usual case in tp is noun modifiers, which arise from several sources (depending on how you count), "possession" and predication being the chief ones. So, does 'A x y B' come from (the structure which could be brought down as) 'x li y. A x B' (predication), or from 'y li C x D. A x B' (possession)? Presumably, you write (a possibly different descent of) the second form in each case when writing tp++, then set the program to produce the more reduced first form rather than the second. In this case, I imagine that your intermediate form is not tp at all but tp with separate forms of 'pi' for possession and predication added, so that the rules give 'A x pred y B' or 'A x poss y B'. The point is just that this descent to tp++ is not necessary, since the situation was already covered in the tp descent from the complex to the simple. But, of course, "enriched" tp is easier and more natural to write, and thus useful -- so long as you remember what it is you are doing. And, of course, it might become a language in its own right someday.

Re: tpp++ toki pona cross compiler

Post by jan_Lope »

janMato wrote: Some of it is already implemented in my existing toki pona parser (e.g. using commas to mark prepositions)
I think commas before prepositions are useful. I've changed my lessons accordingly.

http://rowa.giso.de/languages/toki-pona ... 0000000000
pona!
jan Lope
https://jan-lope.github.io
(Lessons and the Toki Pona Parser - A tool for spelling, grammar check and ambiguity check of Toki Pona)


Re: tpp++ toki pona cross compiler

Post by janKipo »

Yes!
I would also recommend a comma before a 'pi' phrase that is added to a phrase already ending in a 'pi' phrase, to indicate that the modification applies to everything to the left, not merely to the expression introduced by the previous 'pi'.