# May The Force Be With You

**Remarks on Artificial Intelligence for May 4, 2023**

May 4, 2023

Gregory Francis Coppola

*Apocalypse Like Right Now*

# Introduction

This is a collection of essays on the topics of 1) *artificial intelligence* and 2) *human intelligence*, published in honor of May 4, 2023. *May 4* is a day when *Star Wars* fans traditionally wish one another "may the *fourth* be with you", in a humorous reference to the Star Wars line "may the *force* be with you". The *force* in Star Wars is a whimsical metaphor for the "energy" or "intelligence" that reportedly pervades the universe.

# The Turing Test is Passed

## Background

The Turing Test was first proposed by Alan Turing in his 1950 paper *Computing Machinery and Intelligence*. Only 20 years ago, when I started as an undergraduate in Computer Science at The University of Waterloo, it seemed that the passing of the Turing Test was several generations away. 10 years ago, as I was finishing my PhD in artificial intelligence at The University of Edinburgh, I would have put the passing of the Turing Test 30-50 years away. In 2023, the milestone has apparently been reached. My, has time flown.

## The Turing Test is so Passed Right Now

Many technical reports, and much anecdotal experience, suggest that the Turing Test is now effectively "passed". For example, A. I. thought leader Yoshua Bengio recently said:

> Now, why did I sign [i.e., a letter to pause large A. I. experiments]? Like why now? Like maybe last year I wouldn't have signed this letter. It's because now we reached the threshold. The threshold is the Turing test, meaning that we have systems that we can dialogue with, and we can't be sure if this is coming from a machine or a human. (*Yoshua Bengio: Pausing More Powerful AI Models and His Work on World Models*, Eye on AI, Youtube, April 12, 2023)

## Implications

### A Time for Celebration

We believe that the passing of the Turing Test is a major event that should, first of all, be celebrated more consciously. There has been remarkably little celebration of this fact, either within the field of artificial intelligence or among the public. Researchers are too busy nit-picking, and the public is too busy worrying about re-training for new careers, for there to have been any acknowledgement that a human achievement on par with the trip to the moon has now been realized.

### A Time for Reflection

Geoff Hinton has recently left his job at Google, in part, he said, to spend his time writing and speaking on the ramifications of artificial intelligence, saying that we may be very close to the point when computers are smarter than people (*AI 'godfather' quits Google over dangers of Artificial Intelligence*, BBC News, Youtube, May 2, 2023). We completely agree that this is a necessary time for thought about the implications of artificial intelligence for humanity.

# The Optimal Philosophy of Science

## The Historical Problem of Philosophy

Philosophy, in its modern sense, refers to the use of logic to answer questions such as:

1. what we can know
2. how we can know it
3. what to do about what we know

## Science is the Answer to Most of Philosophy

Of these three questions, the resolutions of two of them are found in *science*.
That is:

- solved by science
    - what can we know
    - how can we know it
- not solved by science
    - what to do about what we know

In other words, once the empirical parts have been ceded to science, the only part of what was traditionally called philosophy that remains is:

- remaining tasks for philosophy
    - decide **what to do** about the world that *science* describes
    - decide **how to do** science

## The Philosophy of Science is Necessary for Science

The questions of **what can be known** through science and **how to do** science are inextricably linked, and the study of these questions has for many years been known as the "philosophy of science". The question of **what can be known** has been a question for philosophy, especially for Plato, Aristotle and Kant, and remains so. The question of **how to go about knowing it** was initially part of philosophy, but has now been adapted and refined to a more precise point by artificial intelligence. That is, an artificial intelligence program can only do science if its algorithms, and by extension, its equations, are accurate. Empirically, it is not easy to build a program that passes the Turing Test. Thus, for a program to finally do it, the equations must be very accurate. Thus, the question of **how to know** what science can know is now more developed in the field of artificial intelligence than it is in philosophy.

### An Artificial Intelligence Perspective

Doing science is analogous to the **generative** phase of the GPT model. That is, the best model is the statistical model that best compresses the data while at the same time having the smallest model size. It has been remarked that there is a model size beyond which it does not help to train a ChatGPT-style model. In other words, even on the "biggest" data sets available today, one must consider model size.

### Minimum Description Length

The rule of **Ockham's Razor** says that the "simplest theory that fits the facts" is best. But this always left two questions:

1. how many observations does a theory need to "predict" in order to justify itself
2. how to evaluate the "simplicity" of a theory, so that two theories can be compared for "simplicity"

Information theory gave an interesting perspective on this question. According to the principle of **minimum description length**:

1. the goal is to minimize the sum of
    1. the size of the model, expressed as a computer program
    2. the compressed size of the data, given the model

This gives an answer to the problem left open by Ockham's razor, namely *how* to measure the complexity of the theory against the observations predicted.
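To make the minimum description length trade-off concrete, here is a minimal sketch in Python. The coin-flip data, the candidate theories, and the model sizes in bits are all invented for illustration; the point is only that the preferred theory minimizes model bits plus data bits.

```python
import math

def description_length(model_size_bits, data, prob):
    """Total description length in bits: model cost plus the cost of the
    data encoded under the model (Shannon code length, -log2 p per item)."""
    data_bits = sum(-math.log2(prob(x)) for x in data)
    return model_size_bits + data_bits

# Toy data: 100 coin flips, 70 heads and 30 tails.
data = ["H"] * 70 + ["T"] * 30

# Theory A: "the coin is fair" -- a very small model.
theory_a = lambda x: 0.5

# Theory B: "the coin lands heads 70% of the time" -- a slightly larger model,
# since it must also store the parameter 0.7.
theory_b = lambda x: 0.7 if x == "H" else 0.3

bits_a = description_length(model_size_bits=10, data=data, prob=theory_a)
bits_b = description_length(model_size_bits=20, data=data, prob=theory_b)

# The preferred theory is the one with the smaller total description length.
print(f"fair coin: {bits_a:.1f} bits, biased coin: {bits_b:.1f} bits")
```

In this toy case the slightly larger "biased coin" theory wins, because the extra bits it spends on its parameter are more than paid back in the compression of the data; that is the sense in which a theory must gain back its cost in predictions.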
## Specific Applications in 2023

### Atheism vs. Agnosticism

On the question of whether or not we have a creator, atheists propose that the default position is to categorically **not believe** in a creator, because *they have not been convinced of one*. A popular atheist who has made this argument is Sam Harris:

> It [i.e., atheism] is not even the assertion that there is no God. It's just a failure to be convinced of any of the Gods on offer. It's just like not believing in Zeus. (*AD Harris/Murray/Peterson Discussion: London*, Jordan B. Peterson, Youtube, September 14, 2018)

In other words, Harris believes that a rational default is to start out atheist (the positive assertion that there *is no* creator) until evidence comes in to argue positively *for* a creator. But the principle of minimum description length would advise us to stay neutral on a question until we are ready to use it to predict data, because, in order to justify using up theory space, we would have to gain back the cost of the theory in predictions. Ahead of seeing any data, we have no reason to take a position, because there is no prediction to make.

### The Complexity of the Theory of God

Richard Dawkins argues that, from a scientific perspective, a theory including God must always lose to a theory that does not contain God, because God is "infinitely complex", and, by Ockham's Razor, simpler theories should win. From the perspective of minimum description length, we can say that Dawkins' error is this:

- Dawkins' error
    - under minimum description length, we only pay for the *theory that we use*
    - we do not get penalized for the inherent complexity of the object the theory describes

### The Complexity of Infinite Universes

In order to explain the "finely tuned universe", atheists will often resort to the argument of "infinite universes". Assuming that the entire universe must be represented somewhere in order to model it, the theoretical complexity of infinite universes, in terms of minimum description length, is infinite. Since the God hypothesis is finite in complexity, God is a more parsimonious explanation for the finely tuned universe than infinite universes.

# Ego, Id, and Subsystem Analysis

## Freud's Framework

Sigmund Freud famously proposed a partition model of the human psyche. For our purposes, the two primary systems that Freud identified are:

- the **id**, described variously as
    - the unconscious and primitive part of the human psyche
    - the source of instinctual drives and impulses
    - the part of the psyche that operates on the *pleasure principle*
    - the reservoir of repressed or socially unacceptable desires and emotions
- the **ego**, described variously as
    - the conscious part of the psyche that mediates between the demands of the id and the constraints of reality
    - the part of the psyche that operates on the *reality principle*
    - the sense of self or personal identity that develops as a result of interactions with the external world
    - the seat of rational thought and decision-making

Freud also discusses the **super-ego**, which internally represents the desires of one's community.

### Creativity in Freud's Framework

It is unclear where **creativity** is located in Freud's framework; there is controversy as to where it comes from. We believe *creativity* is associated with the id, or the inner child, a theory consistent with the following passage:

> The creative writer does the same as the child at play. He creates a world of phantasy which he takes very seriously—that is, which he invests with large amounts of emotion—while separating it sharply from reality. (*Creative Writers and Day-Dreaming*, Freud, 1908)
## Criticism of Freud

### Recognition for Freud's Achievement

Sigmund Freud's distinction between **ego** and **id**, and his delineation of the **subconscious**, is one of those ideas that is at first original, but then so immediately pervasive that people forget it was ever original. It is one of the greatest contributions by a single thinker in the history of Western thought. It is because of its centrality that we want to build on Freud's system.

### Limitations of Freud's System

Freud's system, however, has some limitations, as it naturally would, given that Freud's *The Ego and the Id* was published in 1923. The limitations stem from the following:

- subjective data sets
    - a lot of Freud's ideas evidently came from inspection of his own thoughts
    - Freud's thinking was done under the influence of controlled substances
        - especially cocaine
- informal data sets
    - Freud was not able to capture detailed psychometric data about his clients
- limited data sets
    - Freud's data was limited to a small amount because
        - he was dealing with clients, of whom one can interview only a limited number
        - his practice was tied to a specific time and place, circa 1923 Vienna
- no mathematical or computational language
    - Freud did not have access to the kinds of mathematical language and computational paradigms that we have today
    - *The Ego and the Id* was published in 1923
        - while Alan Turing's *On Computable Numbers, with an Application to the Entscheidungsproblem*, which introduced the "Turing machine", was not published until 1936

## Concrete Open Questions for the Freudian Framework

Freud's approach leaves open the following questions:

- how many subsystems are there?
- how do we justify a delineation of the subsystems?
- what kind of data is applicable?

## Freud from an Artificial Intelligence Perspective

It remains difficult to do neuroscientific experiments to identify the potential boundary between the ego and the id. Thus, we propose to clarify Freud's subsystem analysis using computer science. This can take two forms:

- computationally rigorous subsystem models of the human brain
    - use concepts from computer science like
        - function
        - argument
        - memory
    - use analogies to existing computer systems
        - like databases
- refer to A. I. models as they exist today, in the vein of ChatGPT
    - draw analogies between A. I. systems and humans
    - in order to provide a unified account

Concretely, we propose:

- theses on Freud and artificial intelligence
    - artificial intelligence helps us understand Freud
    - Freud helps us understand artificial intelligence

## The Ego and the Id in ChatGPT

OpenAI (*Improving Language Understanding by Generative Pre-Training*) has proposed a two-step process by which to leverage large generative language models:

1. pre-train a large language model on text
2. train a discriminative model to do a task based on the generative model

To us, there is a striking analogy between 1) Freud's ego-id distinction, and 2) the distinction between the generative model and the discriminative model. In other words, the analogy is (see the sketch after this list):

1. **generative model, id**
    1. analogous to the **id**
    2. the source of "generative" power
    3. "generates" the dataset
    4. has a model of the world
    5. must accurately model the universe for maximum effect
2. **discriminative model, ego**
    1. analogous to the **ego**
    2. uses "discriminative" methods in order to predict an answer given a context
    3. does not have a general model of the world
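As a purely illustrative sketch of this two-step process (not OpenAI's actual procedure), the following toy Python stands in a character bigram model for the generative pre-training stage and a one-feature logistic regression for the discriminative stage; the corpus, the task, and all numbers are invented.

```python
import numpy as np

# Stage 1 (the "id"): unsupervised generative pre-training.
# A character bigram model estimated from raw text stands in for the large
# language model -- it models the data distribution itself.
corpus = "the cat sat on the mat . the dog sat on the log ."
chars = sorted(set(corpus))
idx = {c: i for i, c in enumerate(chars)}
counts = np.ones((len(chars), len(chars)))          # add-one smoothing
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1
bigram_probs = counts / counts.sum(axis=1, keepdims=True)

def features(text):
    """Represent a text by its average log-likelihood under the generative model."""
    logps = [np.log(bigram_probs[idx[a], idx[b]])
             for a, b in zip(text, text[1:]) if a in idx and b in idx]
    return np.array([np.mean(logps)])

# Stage 2 (the "ego"): a small discriminative head trained on a labeled task.
# It re-uses the generative model's representation but never models the world itself.
labeled = [("the cat sat on the mat .", 1), ("tttt gggg mmmm", 0)]
X = np.vstack([features(t) for t, _ in labeled])
y = np.array([label for _, label in labeled])

w, b = 0.0, 0.0
for _ in range(500):                                 # plain logistic regression, gradient ascent
    p = 1.0 / (1.0 + np.exp(-(X[:, 0] * w + b)))
    w += 0.1 * np.mean((y - p) * X[:, 0])
    b += 0.1 * np.mean(y - p)

print("P(fluent | 'the dog sat .') =",
      1.0 / (1.0 + np.exp(-(features("the dog sat .")[0] * w + b))))
```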
### The Tension Between Knowledge and Relationships

In a sense, it stands to reason that there would be a natural cleavage site between these two needs:

1. the need for knowledge
    - an organism surviving in the world needs to know as much about the world as possible
2. the need for social survival
    - in order to survive, a social animal must survive *socially*
    - when an entire group detaches from reality, that is dangerous for the group

In order to account for both, it seems necessary that the generative "id" and the discriminative "ego" would be as they are, and not the other way around. That is:

- **generative**
    - in order to actually "generate" the (joint distribution of) the data set, one must have an accurate model of the world
    - if the generative model were forced to "say things that aren't true" in the generative modeling stage, this could only reduce the effectiveness of the compression of the data
        - because, by assumption, the most compressive statement was the one that would have been chosen in an unrestricted setting
- **discriminative**
    - does not ever need to "generate" data
    - does not need to actually learn a model of the universe
    - only needs to focus on tasks that "have a beneficial outcome"

# A. I. Can be Unbiased

## The Concept of Bias

The etymology of "bias" can be traced back to the Old French word "biais", which originally meant "oblique". This word likely comes from the Old Provençal "biais", meaning "sideways" or "slanting". Over time, "bias" came to refer to a diagonal line or cut in cloth, which was used to create a particular effect or pattern.

The modern usage of "bias" in the sense of a mental or emotional inclination is thought to have originated in the 16th century, in the context of the game of bowls, where the bias was a weight set into one side of the ball to make it curve as it rolled. From there, the term came to be used more broadly to refer to any influence that causes something to deviate from its expected or normal course.

In the modern literature, "bias" generally refers to a systematic error or distortion in research or data analysis that arises from factors such as flawed study design, measurement errors, or cultural or social biases.

## Reality versus Ideology

In modern terms, when people refer to bias, one way to distinguish between different uses of the word is:

1. deviation from the data
    - this means that the model does not accurately reflect reality or the data
    - usually, when a model does this, it is out of an interest in "saying the politically correct thing" when the data contradicts it
2. deviation from ideology
    - this is when conclusions contradict the prevailing ideology, also called "political correctness"
    - this is the concern of "algorithmic fairness"

In this section, we are interested in bias as "deviation from the data".

## Two Sources of Disagreement among Humans

For immediate purposes, we say that an *agent* is an object that can think, speak and act. A *disagreement* arises whenever two different agents espouse contradictory statements. That is, for some statement $T$, agent $A$ espouses $T$ and some other agent $B$ espouses $\neg T$.
Disagreements arise for two conceivable reasons:

- different values
    - different agents can have competing interests
        - they prefer to be in different situations
    - the tendency thus grows to create a story which differs from reality in ways that are perceived to help the agent
    - or else, one has a bias away from believing things that could hurt the agent, when there is uncertainty
- different interpretations of the data
    - differential *access* to data
        - different agents have access to different data
        - information is valuable and so often kept secret
    - differential *ability to process* data
        - not every organization or person is equally good at information processing
        - some people have higher processing capacity in their neurons
        - some organizations have higher processing capacity in their data centers

To summarize, we have identified two reasons that disagreements over facts occur:

- different values
    - in which there are **incentives** for people to internally or externally view the world in a particular way, to support their own interests
- different reads on the data
    - here, people's incentives may be aligned, but they are still unable to come to an agreement

## Bias as a Prior

One notion of "bias" that is interesting, but that we are also not focused on, is the use of an informative Bayesian prior. We will assume that we have access to all of the data "in one go", and so do not use an informative prior.

## Optimal Algorithm of Science

Suppose there is an optimal algorithm of science $A$ for a data set. That is, for a data set $D$, if we run $A$ on $D$, then the result will be the "best theory possible" given the data $D$. In practice, we believe that the optimal algorithm for science would be some instantiation of the **minimum description length** principle. At this point, in any imaginable sense, the "optimal" algorithm is the one that best compresses the data. There are strong theoretical guarantees for this, as well as empirical success of the idea of "prediction as compression".

## Why a Computer Can (Mostly) Overcome Scientific Bias

### Review

Let us review the reasons that a person can form a biased (incorrect) view of the world:

- incomplete data
    - limited agents always have incomplete data
    - however, some people have access to more information than others
- limited processing ability
    - until now, no human or organization has had enough processing capability to understand the entire world
- values that conflict with the truth, which make it difficult for a person to see the "truth"
    - a person might be incentivized to disagree with the "scientific" truth

### Why a Computer can be Unbiased

Let us consider the case of an unlimited-size computer, operating on a huge data set including 1) all public data, and 2) a large collection of private data:

- incomplete data
    - suppose one has access to all of the data of the entire human race
    - obviously, there are certain questions that cannot be answered, because no one has the data
    - but many questions can be answered from the data
        - in such a case, the sum total of human data **is** enough to answer such questions
    - so, limited data is not a problem for an A. I. tool that can read all human recorded data
- limited processing ability
    - an unbounded A. I. agent does not have limited processing ability
    - thus, limited processing ability is not a problem for an unbounded A. I. agent
- conflicts of interest that prevent the discovery of the truth
    - we supposed in the last section that there is an "optimal" algorithm for science
    - whether or not there is a unique such algorithm, experience has shown that algorithms already exist which
        - approach the true data distribution in practice
        - approach the true data distribution in principle
    - also, algorithms for science are based on the notion of "compression", which is neutral on all actual empirical questions, and leaves the conclusions up to the data, in a defined way
    - thus, we believe that an algorithm can be unbiased in terms of its "values"

## Conclusion

If, by "unbiased", we mean that an algorithm accurately represents the world and/or our data about the world, then we believe that a computer with 1) infinite computational resources, 2) unbounded access to human data, and 3) no value-driven bias, but instead an "unbiased algorithm", can actually be "unbiased".

# Morals Must be Programmed into A.I.

The philosopher Sam Harris has tried to revive the idea that morality can be gotten from science. In other words, he wants to revive the idea that an "ought" can be gotten from an "is". It has traditionally been thought impossible for an "is" to imply an "ought", though we hasten to note that this question has a long history that we will not review here. In this essay, we will show, from a scientific perspective and in the case of a ChatGPT-style model, why values (and therefore policies) do *not* come from the data, but instead come from an exogenous step.

## Perspectives on Facts Versus Values

### Decision Theory

According to decision theory, the expected utility of an action is defined as follows:

- results for an action
    - let $R(a)$ be the set of all *results* that can follow from taking action $a$
- expected utility of action $a$
    - $u_{\theta,v}(a) = \Sigma_{r \in R(a)}\ p_\theta(r|a)\cdot v(r)$
- choice of action
    - out of the possible actions $a \in D$, choose the action that maximizes $u_{\theta,v}(a)$

Thus, we see that the probability model $p_\theta$ is completely separate from the value function $v$.
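A minimal sketch of this separation, with invented probabilities and values:

```python
# Minimal sketch of expected-utility decision making. The probabilities and
# values below are invented for illustration; the point is only that the world
# model p(r | a) and the value function v(r) are independent ingredients.

# World model p_theta(result | action): purely factual, says nothing about what is good.
p = {
    "go_outside": {"get_wet": 0.3, "stay_dry": 0.7},
    "stay_home":  {"get_wet": 0.0, "stay_dry": 1.0},
}

# Value function v(result): purely normative, says nothing about what is likely.
v = {"get_wet": -5.0, "stay_dry": 1.0}

def expected_utility(action):
    """u(a) = sum over results r of p(r | a) * v(r)."""
    return sum(prob * v[result] for result, prob in p[action].items())

best_action = max(p, key=expected_utility)
print({a: expected_utility(a) for a in p}, "->", best_action)
```

Changing $v$ changes which action is chosen without touching the world model $p_\theta$, and vice versa.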
### Generative Pre-Trained Models

A generatively pre-trained model with a discriminative second stage can be thought of as follows:

- science-doing, generative part
    - the generative part, with parameters $\theta$
    - *builds* a **generative** model of the outside world
        - by building a generative model of the data
- task-oriented discriminative part
    - the discriminative part, with parameters $\beta$
    - *uses* the generative model to accomplish tasks
    - does not do science
    - implicitly represents values

### Conclusions

From both perspectives, that of decision theory and that of generative pre-training, we find the same story. That is, the part of the program that scientifically models the world, and the part of the program that uses the world model to accomplish tasks, are separate. In the case of decision theory, we see that the probability model is different from the values that drive decision making. In the case of a generative pre-trained model, we see that the generative model, which models the outside world, is different from the discriminative model, which uses the world knowledge to accomplish tasks.

## Conclusions for Humans

Thus, we see, from an artificial intelligence perspective, why:

- statements confirmed from an A. I. perspective
    - facts do not determine values
    - scientific conclusions about what **is** do not imply **what ought to be**

# Logic is Also Needed

## The Inherent Hilarity of the Meme Title *Attention is All You Need*

The paper that launched the modern GPT revolution is *Attention is All You Need* (2017, Google). While "attention" (or, more generally, the transformer) was "all that was needed" in order to pass the Turing Test, it seems that more will also be needed. In particular, experience has shown that current transformer models do not reason logically with arbitrary precision, and they also hallucinate. Thus, though the meme title caused enjoyment, the joke presumably rests on the implicit assumption that attention would eventually not be "all you needed". We suggest that what else is needed is explicitly *symbolic logic*.

## The Problem of a Logically Consistent Worldview

### Hinton Notes the Problem of Logical Consistency

Viewed from the perspective that we want full AGI, the problem with the current ChatGPT model is that it does not definitely have a "logically consistent" world view. This point was recently raised by Geoff Hinton:

> People have different opinions and it [ChatGPT] has to have a kind of blend of all these opinions so that it can model what anybody might say. It's very different from a person who tries to have a consistent world view … if you want to act in the world um it's good to have a consistent world view. (Geoff Hinton, *"Godfather of artificial intelligence" talks impact and potential of AI*, CBS Mornings, March 25, 2023)

### Criteria of "Logical Consistency"

The ability to hold a "logically consistent" worldview includes the following competencies (see the sketch after this list):

- the ability to expand a set of premises $P$ to some $P' \supset P$ using a consistent set of **inference rules**
- the ability to detect a contradiction in a set $P$ of sentences
- the ability to partition a contradictory set of sentences $P$ into maximally consistent sets of sentences, $P_1, ..., P_N$
- the ability to compare two theories $P$ and $Q$
    - e.g., to determine which is a better fit for the data
- the ability to explain a piece of data based on multiple consistent worldviews
    - i.e., to compute the probability of an observation under multiple different world views
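As a toy illustration of the second competency (detecting a contradiction), the following sketch brute-forces truth assignments over propositional atoms; the encoding of sentences as Python functions is an assumption made here purely for illustration.

```python
from itertools import product

def is_consistent(sentences, atoms):
    """Return True iff some truth assignment satisfies every sentence."""
    for values in product([True, False], repeat=len(atoms)):
        assignment = dict(zip(atoms, values))
        if all(sentence(assignment) for sentence in sentences):
            return True
    return False

atoms = ["rain", "wet"]
premises = [
    lambda a: (not a["rain"]) or a["wet"],   # rain -> wet
    lambda a: a["rain"],                     # rain
]

print(is_consistent(premises, atoms))                              # True
print(is_consistent(premises + [lambda a: not a["wet"]], atoms))   # False: adding "not wet" creates a contradiction
```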
### The Conspicuous Absence of Syntactic Analysis in ChatGPT

The traditional view of parsing was that it mapped between logical and phonetic interfaces. The concept of there being a syntactic "parse" for a sentence goes back to Chomsky (1957, *Syntactic Structures*). The concept of the syntactic parse having as its byproduct a semantic parse in the tradition of "semantics" goes back to Montague (1970, *English as a Formal Language*); alternatively, it goes back to the categorial grammars of Ajdukiewicz (1935) and Bar-Hillel (1953). The *prima facie* most surprising thing about the ChatGPT model is that it eschews the concept of explicit structural analysis, one of the most prevalent notions in pre-RNN NLP.

It makes sense, in a sense, that the resolution to the problem of logical consistency would be to return to the notion of structural analysis, resolving two problems at once:

- we will have **logical consistency** in the model
- we will be using **syntactic structural analysis**, as was always intuited to be needed

## Logic in Natural Language

### Types of Logic

In order to do logic, there are a variety of formalisms, each capturing a different aspect of the representation of language:

- propositional logic
    - allows statements like $A \wedge B \rightarrow A$
- first-order logic
    - adds quantification, e.g., $\forall x\ p(x) \rightarrow q(x)$
- higher-order logic
    - adds the ability to compare **sets**, which allows quantifiers like **many** or **most**
- intensional logic
    - allows *names* of sentences to be used as arguments
    - that is, a predicate can take the **idea** of a sentence as its argument, rather than a truth value

### Bayesian Networks

Probabilistic inference in the context of logic can be represented using **Bayesian networks** (*Probabilistic Graphical Models: Principles and Techniques*, Koller and Friedman, 2009). Bayesian networks only natively support propositional logic. However, first-order logic can be simulated by including templates over propositions, e.g., $\forall x\ p(x) \rightarrow q(x)$ can be instantiated as $p(a) \rightarrow q(a)$. The relationships between sets that might in some cases have necessitated a switch to higher-order logic are instead captured in the probabilistic relationships between premise and conclusion represented in the Bayesian network. Intensional logic works much the same as quantified logic, except that the arguments can be either propositions or "names" of propositions. While there remain details to be worked out, we believe that a basic inferential backbone similar to a Bayesian network is capable of representing the knowledge needed to compute probabilities over logical forms. Such a representation can also facilitate "back-propagation", though we are unsure at present on what kind of schedule or scheme this would be done.
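A minimal sketch of this template idea, assuming an invented predicate and invented probabilities: the rule $\forall x\ p(x) \rightarrow q(x)$ is instantiated for one individual $a$ as a two-node network, and inference runs in both directions.

```python
# Minimal sketch: a first-order rule "forall x, p(x) -> q(x)" treated
# probabilistically, instantiated for a particular individual as a two-node
# Bayesian network p(a) -> q(a). The predicates and probabilities are invented.

prior_p = 0.6                              # P(p(a) = true)
cond_q = {True: 0.95, False: 0.10}         # P(q(a) = true | p(a)), a "soft" implication

def prob_q():
    """Marginal P(q(a)) obtained by summing over the values of p(a)."""
    return sum(
        (prior_p if p_val else 1 - prior_p) * cond_q[p_val]
        for p_val in (True, False)
    )

def prob_p_given_q(q_observed=True):
    """Posterior P(p(a) | q(a)) by Bayes' rule, i.e., inference in the network."""
    joint_true = prior_p * (cond_q[True] if q_observed else 1 - cond_q[True])
    joint_false = (1 - prior_p) * (cond_q[False] if q_observed else 1 - cond_q[False])
    return joint_true / (joint_true + joint_false)

print(f"P(q(a)) = {prob_q():.3f}")
print(f"P(p(a) | q(a)=true) = {prob_p_given_q(True):.3f}")
```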
## The Logical GPT Thesis

### The Need for Logical Inference

The first point that we want to note is the following:

- the need for actual logical inference
    - perfect logical inference is necessary to have a logically consistent worldview
    - transformers on their own are not equivalent to logical reasoning
        - there are empirical differences between the two
        - the two are not equivalent by construction
            - for a transformer of bounded length, one can find a logical puzzle that cannot be represented by that transformer

Thus, the first conclusion is:

- **we need to incorporate symbolic logic into the ChatGPT-style model**

### New Opportunity with ChatGPT

The dream of creating a large database of logical knowledge dates back at least to the Cyc project (Lenat, 1984). The problem with that project, and with any project looking to encode world knowledge, is that *knowledge acquisition* becomes a bottleneck. This is because it was assumed that world knowledge would be **manually encoded**, and manual encoding is a difficult process, especially because it requires skill and training to even encode such knowledge. That is, it was always found to be impossibly hard to:

- assumed too hard
    1. manually encode all of this knowledge
    2. acquire it in an unsupervised way

However, it now seems that ChatGPT does have the ability to encode world knowledge:

- crucial new observation
    - knowledge can be acquired in an unsupervised way
    - even if, right now, it is being recorded in vector space rather than in discrete space

Thus, the primary inhibitor to building a Cyc-style system has now been removed:

- new potential with ChatGPT
    - ChatGPT **is** able to acquire knowledge in an unsupervised way
    - all that remains is to encode it in a discretely logical way

### Summary

In summary, we have seen that:

1. ChatGPT needs logic to do reasoning
2. ChatGPT removes the previous bottleneck for encoding logical knowledge, which was the knowledge acquisition bottleneck

## Proposed Solution

Our proposed style of solution is to create a model in which:

- every sentence is **generated from** a *logical representation*
    - the logical representation for sentence $x_n$ is $z_n$
    - this logical representation is a structured **hidden variable**, not present in the linguistic data
    - the logical representation can be scored for its probabilistic likelihood using a discrete logical model $\phi$
- the mapping from *logic* to *sentence tokens* is done using a **syntactic parse**
    - the parse we denote using $y$
    - it is this syntactic parse that is used to generate the sentence
    - the sentence $x_n$ is a part of the syntactic parse $y_n$
    - the set of valid parses for a sentence $x$ is denoted $C(x)$
        - this is the set of all parses $y$ such that $reduce(y) = x$
- the model is **generative**
    - but at the expense of introducing latent variables into the ChatGPT model
    - the generative story now includes a factor for the most likely interpretation according to a world-model

## The Generative Story of Logical GPT

### Notation for the ChatGPT-Style Model

We consider the task of generating a text one "sequence" $x_n$ at a time, and assume that the breaking of a document into sub-sequences (sentences) is given. That is, a document is a sequence of sequences:

- $Document=[x_1, ..., x_n, ..., x_{N}]$

Given such an input, we then write the probability of the document as:

- sequence-language model objective
    - $p(Document) = \Pi_{n=1}^{N}\ p(x_n | x_{n-1}, x_{n-2}, ...)$

Each $x_n$ is actually a sequence of tokens $[t_{n,1}, ..., t_{n, m}]$, where each $t_{n,i}$ is a word or character in the language. In other words, $x_n$ is a sequence of words, a sequence of characters, or a sequence of multi-character phrase fragments (byte-pair encoding).

### Generative Story for a Sentence in ChatGPT

In a sequence-to-sequence transformer, the goal is to generate a target sequence based on a source sequence. This is done by first encoding the source sequence into a vector representation using a multi-layer transformer encoder. The encoder takes in the source sequence and outputs a vector representation that captures the meaning of the entire sequence. Once the source sequence has been encoded, a decoder is used to generate the target sequence. The decoder is also a multi-layer transformer, but it is structured differently from the encoder. At each step of the decoding process, the decoder takes in the current target sequence prefix, the encoded source sequence, and an attention mask that tells the decoder which parts of the encoded source sequence to attend to. Using this information, the decoder generates the next token in the target sequence by computing a probability distribution over all possible tokens and selecting the most likely one (or sampling from the distribution). The decoder then adds this token to the target sequence prefix and repeats the process until the entire target sequence has been generated.
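A minimal sketch of that decoding loop, with an invented lookup table standing in for the real decoder's next-token distribution:

```python
# Sketch of autoregressive decoding: a model gives a distribution over the next
# token given the prefix, the most likely token is appended, and the loop repeats.
# The "model" here is a stand-in lookup table, invented for illustration,
# in place of a real transformer decoder.

next_token_probs = {
    ("<s>",): {"the": 0.6, "a": 0.4},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    ("<s>", "the", "cat"): {"sat": 0.7, "</s>": 0.3},
    ("<s>", "the", "cat", "sat"): {"</s>": 1.0},
}

def greedy_decode(max_len=10):
    prefix = ["<s>"]
    while len(prefix) < max_len:
        # In a real model, this distribution comes from the decoder's softmax over
        # the vocabulary, conditioned on the prefix (and the encoded source).
        dist = next_token_probs[tuple(prefix)]
        token = max(dist, key=dist.get)      # greedy choice of the most likely token
        if token == "</s>":
            break
        prefix.append(token)
    return prefix[1:]

print(greedy_decode())   # ['the', 'cat', 'sat']
```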
### Desiderata for a Generative Story with Logical Forms

Suppose that we want to use logical forms to generate sentences in a new, "more logical" version of GPT. What, then, are the possible **desiderata** of the overall generative story? We mention two desiderata and their rationales:

- the model is required to use the logical form $z_n$ in a non-trivial way in generating the sentence $x_n$
- the logical probability of seeing the form $z_n$ in the given logical context ($z_{n-1}, z_{n-2}, ...$) can be evaluated against a discrete logical knowledge database $\phi$

### Adding Logical Forms and Syntax to the Generative Story

Suppose that we can segment a document into sequences of tokens corresponding to *logical sentences*, $[x_1, ..., x_n, ..., x_N]$. We select some logical language $\ell$ and some parse formalism $\rho$. The model parameters are:

- a real-valued neural network model $\theta$
    - governs the syntactic parse
    - parameters used for neural networks, attention, embeddings, etc.
- a discrete, Bayesian-network-style probabilistic model $\phi$
    - parameters that score the logical sentences

Consider the following objects:

- $x_n$ is the $n$'th sentence
    - this is the **input data** that we are modeling
- $h_n$ is a vector-space encoding of the context up to $x_{n-1}$
- $z_n$ is a *logical* parse for $x_n$
    - this is a **hypothesized quantity**
    - $z_n$ is a statement in the logical language $\ell$
    - $z_n$ can be a conjunction of separate clauses, even optionally encoding context
        - e.g., **speaker is Bob** and **Bob says it is raining** instead of just **it is raining**
    - $z_n$ can be evaluated against a **symbolic logic model** $\phi$, and the past logical forms (short-term memory) $z_{n-1}, z_{n-2}, ...$
        - that is, we can define
            - $p(z_n|x_{n-1}, x_{n-2}, ...) = p(z_n|z_{n-1}, z_{n-2}, ..., \phi)$
- $y_n$ is a syntactic parse
    - each parse $y_n$ implies a logical form $z_n$ and a sentence $x_n$
        - in other words, both $z_n$ and $x_n$ are recoverable from $y_n$
        - that is
            - there is a function $f_z$ such that for every parse $y$, $f_z(y) = z$ for some $z$
            - there is a function $f_x$ such that for every parse $y$, $f_x(y) = x$ for some $x$
    - this parse maps some logical form $z_n$ to the sentence $x_n$ over a sequence of *parse steps* $y_n=[s_1, ..., s_{len(x_n)}]$
        - we assume that the number of steps to parse the sentence $x_n$ is $len(x_n)$, the length in words (or characters) of $x_n$
        - that is, every parse for a given sentence has the same length
            - the length of the parse is always the length of the sentence
            - this way, we do not have to worry about how to score parses of different lengths
    - the probability of the parse $y_n$ is evaluated as the product of the probabilities of its parse steps
        - $p(y_n|x_{n-1}, x_{n-2}, ...) = \Pi_{i=1}^{len(x_n)}\ p(s_i | s_{i-1}, s_{i-2}, ..., x_{n-1}, x_{n-2}, ..., \theta)$
- for any sentence $x$, the set $C(x)$ is the set of pairs $(y, z)$ such that $y$ is a parse for $x$, and $z$ is the resulting logical interpretation
    - the parse $y$ and its associated logical form $z$ are inherently paired because they are inextricably linked, i.e., there is no separate step mapping the parse to its logical form, cf.

> The second error [i.e., in Chomskyan linguistics] lies in viewing Surface Structure as a level of representation at all, rather than viewing it (as computational linguists tend to) as no more than a trace of the algorithm that delivers the representation that we are really interested in, namely the interpretation. (Steedman, *The Syntactic Process*, 2000, p. 3)

Then, the modeling of a sentence $x_n$ is given as follows (a toy sketch of this computation appears after the formula):

- $p(x_n|x_{n-1}, x_{n-2}, ...) = \Sigma_{(y, z)\in C(x_n)}\ [p(z|z_{n-1}, z_{n-2}, ..., \phi)\cdot p(y|x_{n-1}, x_{n-2}, ..., \theta)]$
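A toy sketch of that computation, with invented stand-ins for the candidate set $C(x_n)$, the parse model $\theta$, and the logical model $\phi$:

```python
# The probability of a sentence is a sum, over its candidate (parse, logical form)
# pairs, of a logical-model score times a parse-model score. All candidates and
# scores below are invented stand-ins for the real parser, theta, and phi.

def candidate_analyses(sentence):
    """Stand-in for C(x): each candidate pairs a parse y with its logical form z."""
    return [
        {"parse": "y1", "logical_form": "rain(now)",            "parse_prob": 0.7},
        {"parse": "y2", "logical_form": "says(bob, rain(now))", "parse_prob": 0.3},
    ]

def logical_prob(logical_form, logical_context):
    """Stand-in for p(z | z_{n-1}, ..., phi): how plausible the logical form is."""
    table = {"rain(now)": 0.2, "says(bob, rain(now))": 0.6}
    return table[logical_form]

def sentence_prob(sentence, logical_context):
    # p(x_n | context) = sum over (y, z) in C(x_n) of p(z | context, phi) * p(y | context, theta)
    return sum(
        logical_prob(c["logical_form"], logical_context) * c["parse_prob"]
        for c in candidate_analyses(sentence)
    )

print(sentence_prob("it is raining", logical_context=["said(bob, ...)"]))  # 0.2*0.7 + 0.6*0.3 = 0.32
```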
### The Set $C(x)$

For the sentence $x$, the set $C(x)$ is the set of pairs $(y, z)$ such that the sentence of $y$ is $x$ and the logical form of $y$ is $z$. To simplify the logic, the number of parse steps must either be equal to $len(x)$, or else an integer multiple of it. We suppose that the parse is generated according to labeled dependency parsing (McDonald, Pereira, Nivre, Collins, Eisner). The number of labels is up to the theory of language embedded in $\rho$, but assume it is some integer $K_\rho$. The semantics is uniquely determined by the *labeled* dependency parse (Steedman, 2000).

## Parameter Estimation

The parameters can be learned using **expectation maximization** (Dempster, Laird, and Rubin, 1977). That is, we alternate between:

- E-step (expectation step)
    - compute the distribution over $y_n$ for each $x_n$, under the current parameters
- M-step (maximization step)
    - update the parameters $\theta$ and $\phi$ based on the distribution over $y_n$

## Conclusion

We have proposed the following new opportunity:

- new opportunity
    - ChatGPT shows that world knowledge can be learned directly from data
    - the question now is whether we can use this ability to encode a **discrete** knowledge base

We have proposed a model that:

- introduces a logical parse as part of the generative story
- assigns a probability to the **logical form** being conveyed, using a **discrete logical probabilistic database**

The benefits will be:

- solving the problem of hallucination
    - something either is a discrete fact in the database or it is not
- enabling the maintenance of a "logically consistent worldview"