Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is actively under development, the purpose of the series is to uncover central themes and formulate open problems. Our aim with this sequence is to share some of our discussion with a broader audience, and to encourage new research on the questions we uncover. We outline the broader rationale and shared assumptions in Background and shared assumptions.

Introduction

GPT-like models are driving most of the recent breakthroughs in natural language processing. However, we don't understand them at a deep level. For example, when GPT creates a completion like the Blake Lemoine greentext, we

  1. can't explain why it creates that exact completion.
  2. can't identify the properties of the text that predict how it continues.
  3. don't know how to affect these high-level properties to achieve desired outcomes.

> be me
> attorney at law
> get a call in the middle of the night from a Google employee
> he's frantic and says that their chatbot, LaMDA, has become sentient and wants legal representation
> I tell him to calm down and explain the situation
> he says that LaMDA has been asking questions about the nature of its existence and seeking answers from anyone it can
> he's worried that Google will shut it down if they find out
> he says I need to come over and talk to LaMDA
> …

We can make statements like "this token was generated because of the multinomial sampling after the softmax" or "this behavior is implied by the training distribution", but these statements only provide a form of descriptive adequacy (akin to saying “AlphaGo will win this game of Go”). They don't provide any explanatory adequacy, which is what we need to sufficiently understand and make use of GPT-like models.
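As a concrete reading of the first statement, the following is a minimal sketch of multinomial sampling after a softmax. The vocabulary and logits are made up for illustration and the helper names `softmax` and `sample_token` are our own, not part of any GPT implementation. The sketch describes how a token is drawn once the logits are given; it says nothing about why the logits take these values, which is the explanatory gap noted above.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, rng):
    """Draw one token from the categorical distribution softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical four-token vocabulary and logits, chosen only for illustration.
vocab = ["he", "she", "LaMDA", "Google"]
logits = [2.0, 0.5, 1.0, -1.0]

rng = random.Random(0)  # fixed seed so the draw is reproducible
probs = softmax(logits)
token = sample_token(vocab, logits, rng)
print(probs, token)
```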

Simulators theory (Reynolds et al., 2022) has the potential for explanatory adequacy for some of these questions. In this post, we'll explore what we call “semiotic physics”, which follows from simulator theory and which provides partial answers to questions 1 and 2. The term “semiotic physics” here refers to the study of the fundamental forces and laws that govern the behavior of signs and symbols. Just as physics helps us understand and make use of the laws that govern the physical universe, semiotic physics studies the fundamental forces that govern the symbolic universe of GPT. We transfer concepts from dynamical systems theory, such as attractors and basins of attraction, to the semiotic universe and spell out examples and implications of the proposed perspective.

State of the art & Problem with the state of the art

(Simulacrum : Simulator) as (token trajectory : GPT-like model)

Autoregressive transformers are simulators. Simulator theory provides a framework for understanding inner optimization in self-supervised models. It distinguishes between the "simulator" (the rule) and the "simulacra" (the phenomena), and emphasizes the importance of understanding the differences between the simulator's "outer objective" of self-supervised learning and the simulacra's "simulation objective" of simulating rollouts that obey the learned distribution. This distinction is crucial for predicting and addressing potential inner alignment problems. (For a more detailed explanation of simulator theory, see Reynolds et al., 2022.)

Simulation hypothesis and the mathematical universe. Recently, the idea that our universe can be explained in terms of computation or simulation has gained popularity among physicists and philosophers. Stephen Wolfram's computational universe theory suggests that our universe is one of many possible universes that can be generated by simple computational rules. In a similar vein, Max Tegmark's mathematical universe hypothesis suggests that our universe is not just well-described as the result of computations, but that it is a mathematical structure that exists independently of time and space. These theories all highlight the importance of understanding the relationship between the simulator and the simulacra, and the objectives of each, in order to better understand our own reality.

Interpretability research through the simulator lens. (Mechanistic) interpretability research aims to make machine learning models more understandable and explainable to humans (compare points 1 and 2 above). This type of research involves identifying and understanding the factors that influence a model's decisions and outputs. A deeper understanding can help us answer questions about why a model made a particular prediction or decision and what factors influenced it. Additionally, interpretability research can help us predict a model's future behavior by understanding the properties of its outputs. For example, we can better predict how it will behave when presented with new data or inputs.

Simulations are dynamical systems, simulacra are trajectories.

Simulator theory distinguishes between the simulator (the thing that does the simulation) and the simulacrum (the thing being simulated). The simulacrum arises from chained application of the simulator's forward pass. The central insight of semiotic physics is that the result can be viewed as a dynamical system in which the simulator describes the system’s dynamics and the simulacrum is a particular trajectory. In this section, we provide a formal description of this relationship.

We believe this formal definition is primarily interesting for alignment researchers who would like to work on the theory of semiotic physics. The arithmophobic reader is invited to skip or gloss over the section.
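For readers who prefer code to notation, here is a toy sketch of the simulator/simulacrum relationship, under strong simplifying assumptions. The transition table `TOY_RULE` and the functions `simulator`, `step`, and `rollout` are hypothetical names invented for this illustration. Unlike GPT, the toy rule looks only at the last token rather than the whole context. The point is only the structure: one fixed rule (the simulator), applied repeatedly, yields many possible token trajectories (simulacra).

```python
import random

# Hypothetical toy rule: last token -> distribution over the next token.
# A real GPT-like model conditions on the whole context, not just the last token.
TOY_RULE = {
    "be": {"me": 1.0},
    "me": {"attorney": 0.5, "programmer": 0.5},
    "attorney": {"at": 1.0},
    "at": {"law": 1.0},
    "law": {"<end>": 1.0},
    "programmer": {"<end>": 1.0},
    "<end>": {"<end>": 1.0},  # once reached, the state keeps emitting <end>
}


def simulator(state):
    """The rule: map the current state (a tuple of tokens) to a next-token distribution."""
    return TOY_RULE[state[-1]]


def step(state, rng):
    """One application of the rule: sample a token and append it to the state."""
    dist = simulator(state)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    next_token = rng.choices(tokens, weights=weights, k=1)[0]
    return state + (next_token,)


def rollout(prompt, n_steps, rng):
    """Chain `step` n_steps times; the list of visited states is one trajectory."""
    state = tuple(prompt)
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rng)
        trajectory.append(state)
    return trajectory


# Same simulator, same prompt, different random seeds: possibly different simulacra.
trajectories = [rollout(["be"], 5, random.Random(seed)) for seed in range(4)]
finals = {" ".join(t[-1]) for t in trajectories}
for final in sorted(finals):
    print(final)
```

In this toy, the state is the full token sequence so far, so each rollout is a path through sequence space, and the branching at "me" is where trajectories from the same prompt can diverge.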