Systems and
Formalisms Lab

Research

Background

We explore the science and craft of verified systems engineering.

We're particularly interested in novel compilers and compilation techniques (we've worked on relational code generation and verified cryptographic systems, interactive refinement, binary parsing, verified hardware compilation, hardware simulation, regex-engine specification and implementation, verified just-in-time compilation, etc.). We also think a lot about how to make interactive and automated theorem provers more accessible (we've explored literate programming systems for Coq and Lean, proof-development environments, and even automated-reasoning algorithms).

We co-maintain a number of open source tools related to programming languages and interactive theorem proving, including Alectryon, Proof General, company-coq, Flycheck, F⋆-mode, dafny-mode, and others.

Core research

Systems-software verification

We're generally interested in most systems-verification projects, especially the ones that do not involve shared-memory concurrency (there's no need to introduce concurrent threads — or even I/O — to start running into lots of interesting bugs). Some software that we're interested in verifying:

  • Binary parsing: compressed image formats, network packets, serialization libraries, etc.

  • (Userland) network stacks and network applications: TCP/IP, SSH, VPNs

  • Unix-style utilities (think BusyBox)

The exact programs matter less than developing the technology that makes them easy to verify, so think of the above as case studies for new ways to approach systems verification. Our favored approach at the moment is proof-producing compilation / binary extraction, which we're exploring in Rupicola.

Hardware and whole-system verification

Hardware vulnerabilities and bugs can nullify the guarantees provided by software proofs. In the long term, we'd like to strengthen our verification story by developing lightweight hardware customized to support long-running verified systems in challenging environments (think medical devices, communications, infrastructure, etc.).

Concretely, we're looking to develop languages and compilers in the vein of Bluespec and Kôika to verify embedded-class processors on which we'll run our software, with end-to-end proofs connecting high-level software specs to the I/O behavior of RTL circuits.

Verification tools and methodology

In addition to building verified systems, we are interested in exploring better ways to engineer, maintain, preserve, and communicate formal proofs. These days, one of our main interests is proof presentation, in conjunction with literate programming — especially coming up with visualizations and media that make formal proofs easier to develop and comprehend.

[SLE ’20] Untangling Mechanized Proofs (Alectryon)

Formal methods for real-world DSLs

Programming languages evolve organically, and at times this evolution creates specific and interesting challenges. We are curious to see how PL and formal-methods theory can help improve popular languages and inform the design of new ones.

In that space, we have recently focused on two problems:

  • Effect tracking in Emacs Lisp, a language that makes pervasive use of dynamic variables for configuration, and

  • Linear-time matching of modern regexes (most modern regex engines are vulnerable to denial-of-service attacks due to catastrophic backtracking). Modern regexes have surprisingly little in common with traditional regular expressions (and modern regex languages have little in common with each other!), to the point that even their exact space and time matching complexity is not entirely understood. We have shown that:

    • Lookarounds can be matched in linear time and memory, and lookbehinds in streaming mode and constant memory, with linear-time preprocessing.

    • Semantic peculiarities in JavaScript require special care when implementing quantifiers and groups to avoid quadratic explosion, but the same oddities make it possible to match lookarounds with groups in linear time in JS.

    • Almost all published pen-and-paper formalizations of JS regex make subtly incorrect simplifying assumptions about the language (e.g. assuming that r? is equivalent to (?:r|) — it is not).

    Some of our algorithms have been merged in V8, the engine used by Google Chrome and Node.js.

Umbrella projects

Here are two illustrative medium- to long-term projects that bring most of the above together:

Problem

Connecting a device to the internet is dangerous: it immediately exposes it to a wide world of potential attacks. Firewalls help, but only to some extent (attackers may hide behind a firewall; network equipment itself may be malicious, etc.).

Often a device does not need access to the whole internet; it just needs to talk to some other specific devices.

Proposal

Build a verified software-based communication stack (userland UDP/IP) from high-level specs (~ RFC diagrams), build a VPN on top of that, and run it on a verified chip built on an FPGA. The final device should expose just two Ethernet ports: one to connect to a LAN, and one to plug a computer or other networked device into.

Use this device to build secure point-to-point communication channels: pair two copies of the device, connect them to the internet on two sides of the world, and then plug two potentially-unsecured computers into the devices. These computers can now talk to each other as if they were connected by a physical wire, without fear of data theft, corruption, or attacks — it's as if they were on their own private network.

Context

A reasonable objection here is that this is already the abstraction that a pure-software VPN provides, but the attack surface there is much larger: bugs in the OS, the IP stack, or the VPN software can all lead to compromise of the machines using the communication channel. With a verified physical VPN device, the only exposure of each machine, on either side of the channel, is to the other machine. If we achieve sufficiently strong guarantees, it would be safe to use the internet to connect even critical infrastructure that is too complex to fully secure against network attacks (of course, there is still the quality-of-service issue: we won't provide guarantees against denial of service, network interruptions, etc.).

Problem

Medical devices combine relatively simple software and hardware to deliver vital care; a robust verification platform should be able to bridge the gap from high-level characterization of behavior (think: differential equations for an insulin pump system at night) to low-level implementation (think: voltage shifts on the pins of the physical insulin pump).

Can we develop the technologies and know-how to verify the operation of a simple medical device at reasonable cost?

Can we systematically rule out software- or hardware-based misdesigns, security flaws, and functional bugs, thereby ensuring that the device is as trustworthy as its physical components, and isn't weakened by flawed software or hardware engineering?

Can we significantly reduce the amount of trust that patients are forced to put into vitally important systems? (“don't trust, verify”?)

Context

There is an excellent explanation of the genesis of looping, from its beginnings as an open-source hack, in this Atlantic article from 2017. There are now commercialized, FDA-approved looping devices.

Looping requires simple hardware and reasonably simple software, and the feasibility of a non-verified implementation was demonstrated by online open-source communities a while back.

Proposal

Develop an end-to-end prototype for looped insulin pumps. As an idealized roadmap (but to be successful this project will require collaboration with doctors, hardware-design experts, HCI folks for usability, and even ideally patients, so the steps below are just guidelines):

  1. Translate discrete equations relating blood sugar measurements with insulin release decisions into relations on I/O traces.

  2. Write a functional implementation of the control system; prove it correct against the trace relations.

  3. Lower the functional programs into an imperative language for which there exists a verified compiler: Bedrock2, or CompCert C, or some other option; compile to assembly.

  4. Verify a simple embedded-class processor chip (probably RISC-V?) for execution on an FPGA, exposing a Coq-level spec. Connect the monitor and the pump to that device, exposing them through MMIO.

  5. Link the proofs at each level of the stack, connecting the hardware I/O level with the original equations.

Open projects

Below is a sample of projects that we're interested in exploring in the short to medium term, sorted by topic. Ongoing projects are tagged as such; if there's no tag, it means that we have not actively started working on this project.

This is not a complete list, and the projects are not fully fleshed out: a lot of what's below is intended to be a conversation starter, and some of the projects below will quickly lead to negative results.

Some projects below would be suitable as BS and MS projects for current EPFL students; others are more open project directions that would be suitable for prospective PhD students.

If you have worked on, know of work about, or are starting work on any of these topics, we'd love to hear about it! Please let us know so that we can update the list and add pointers to your work.

Systems and performance engineering

Problem

Web browsers and web servers typically perform decompression and decoding separately. As a result, decoding does not benefit from redundancy in the input: a 10kB compressed JSON stream that decompresses into 10MB worth of JSON will cause 10MB worth of JSON to be fed into the JSON decoder, and 10MB worth of decoding work to be performed. This is a giant waste of time: there's really only 10kB worth of decoding work to do here; the rest is redundant.

Proposal

Explore opportunities to fuse decompression and decoding.

Here is a very simple example: suppose that “compression” is “run-length encoding” (“0000000111111111100” becomes “0*7,1*10,0*2”) and “decoding” is “counting the number of ones”.

Here is the decoder:

count = 0
for bit in decompressed_stream:
    if bit == 1:
        count += 1

This is hopelessly inefficient compared to the fused decompressor/decoder that operates directly on the compressed stream:

count = 0
for bit, repeats in compressed_stream:
    if bit == 1:
        count += repeats

We should be able to achieve similar savings with off-the-shelf formats and compression algorithms, such as gzip+JSON. For example, if a key is repeated 1000 times in a given JSON file, we ideally shouldn't have to re-decode it 1000 times. Instead, assuming (optimistically) that each repeated appearance of that key in the uncompressed stream is replaced in the compressed stream by a backpointer to the location of its first appearance, a fused decompressor-decoder should be able to decode the key only once (the first time it sees it) and then reuse the decoded version every time it sees a new backpointer to it.

(Of course, compressed block boundaries won't nicely align with JSON object boundaries — that's partly what makes the problem so interesting. I suppose it would be reasonable to start by exploring the case where we hack together a gzip compressor that's polite enough to put block boundaries in semantically meaningful places from JSON's perspective.)
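
To make the memoization idea concrete, here is a minimal Python sketch over an idealized LZ77-style token stream (literals plus back-references to earlier tokens) rather than real gzip framing; the token format and the decode_token callback are assumptions made purely for illustration. The point is only that decoding work is memoized per compressed token, so a repeated key is decoded once and reused.

def fused_decode(tokens, decode_token):
    # tokens: list of ("lit", raw_bytes) or ("ref", index_of_an_earlier_token).
    decoded = []                          # decoded value of each token, by position
    out = []
    for kind, payload in tokens:
        if kind == "lit":
            value = decode_token(payload)   # e.g. parse a JSON string
        else:
            value = decoded[payload]        # back-reference: reuse earlier work
        decoded.append(value)
        out.append(value)
    return out

# A key repeated via back-references is decoded only once:
# fused_decode([("lit", b'"user_id"'), ("ref", 0), ("ref", 0)], lambda b: b.decode())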

Context

I first ran into this issue in 2020 in Alectryon. The HTML and JSON that Alectryon generates for unfocused proofs with many open goals is highly redundant, with many literally repeated fragments throughout the output. The result is a web page that weighs 25MB but compresses down to 20kB… and still takes 3 seconds to render, because the HTML decoder has to go through 25MB of HTML after decompressing.

I fixed the issue in Alectryon by using a bespoke compression scheme that's trivial to fuse with decoding: I replace each redundant HTML node by a backpointer (using a custom HTML tag), let the browser decode the deduplicated HTML, and then use a 3-line JavaScript program to replace each backpointer with a copy of the node that it points to (copying nodes is very cheap — much cheaper than decoding the corresponding HTML over and over).

Amazingly, it turned out in that case that the same savings apply to the HTML encoding side as well: instead of repeatedly re-encoding the same objects as HTML, we can memoize the HTML corresponding to repeated objects and perform the encoding only once, shaving off over 90% of the encoding time in some cases. (Again, this is easy because of the bespoke compression scheme.)

The aim of this project is to try to replicate these savings with off-the-shelf compression algorithms.

Problem

Storage media are filled with redundant copies of libraries and binaries. For example, my main workstation has 5 versions of OCaml, 3 versions of LLVM, 4 versions of GCC, 5 versions of Python. VMs and containers contain even more copies of all these. Things are similar or worse on my phone (each app bundles its own copies of all the libraries it uses). File-system compression helps a bit (but not much: most file systems compress data in individual blocks, not across blocks), and deduplication not at all (minor changes in libraries, binaries, or VM layouts move block boundaries, and deduplicating FSs typically deduplicate at block granularity).

Proposal

Design a deduplicating file-system based on content-aware chunking.

Context

Deduplicating backup systems have been using content-aware chunking for years. The key issue with block-based deduplication is misalignment. Suppose a file containing aaabbbaaabbb is stored on a file system that deduplicates blocks of length three. The file is segmented into four blocks aaa, bbb, aaa, bbb, and only two blocks need to be stored. Now an insertion is performed on the file, and the contents become aaaXbbbaaabbb. The file gets segmented into aaa, Xbb, baa, abb, b, and all deduplication is lost.

Content-aware chunking does away with fixed block lengths, using a hash function to split a file into blocks instead: we'll split every time the hash of the previous three characters is 0. Let's use xyz ↦ x + y + z mod 3 as the hash on our example:

aaabbbaaabbb
...^     hash(aaa) = 3*a     mod 3 = 0 → cut after "aaa"
 ...^    hash(aab) = 2*a + b mod 3 ≠ 0
  ...^   hash(abb) = a + 2*b mod 3 ≠ 0
   ...^  hash(bbb) = 3*b     mod 3 = 0 → cut after "bbb"
    ...^ hash(bba) = 2*b + a mod 3 ≠ 0
     ... …etc.

In other words, we get the same segmentation as before: aaa, bbb, aaa, bbb, with only two distinct blocks. What happens when we perform the insertion?

aaaXbbbaaabbb
...^      hash(aaa) = 3*a     mod 3 = 0 → cut after "aaa"
 ...^     hash(aaX) = 2*a + X mod 3 ≠ 0
  ...^    hash(aXb) = a + X + b mod 3 ≠ 0
   ...^   hash(Xbb) = X + 2*b mod 3 ≠ 0
    ...^  hash(bbb) = 3*b     mod 3 = 0 → cut after "bbb"
     ...^ hash(bba) = 2*b + a mod 3 ≠ 0
      ... …etc.

Here we get the blocks aaa, Xbbb, aaa, bbb — using a content-based function to decide on block boundaries yields a much more robust segmentation.
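
Here is a minimal Python sketch of that toy chunker (a three-character window, character codes summed mod 3); real systems use a rolling hash over larger windows and a much larger modulus, but the splitting logic is the same.

def chunks(data, window=3, modulus=3):
    # Cut whenever the hash of the last `window` characters is 0.
    blocks, start = [], 0
    for i in range(window, len(data) + 1):
        if sum(ord(c) for c in data[i - window:i]) % modulus == 0:
            blocks.append(data[start:i])
            start = i
    if start < len(data):
        blocks.append(data[start:])        # trailing partial block
    return blocks

# chunks("aaabbbaaabbb") == ['aaa', 'bbb', 'aaa', 'bbb']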

Content-aware deduplication works great for backup systems where each new snapshot of a system is taken by scanning the whole system and rehashing all of it. Can we make it work on a live system?

Follow-Up: Preprocessing the blocks

Deduplication works well when large swaths of the data are exactly the same, but binaries don't typically work that way: even trivial source code changes lead to massive changes at the binary level, because they shift (renumber) all absolute jump targets. There exist ways to work around this (Windows Update computes patches on decompiled binaries with normalized jump targets to minimize patch sizes, for example); can we integrate these techniques into a live file system?

Problem

Some code patterns common in interpreted and lazy languages give hardware branch predictors a hard time. There should be ways to generate code patterns that work better with branch prediction, or to propagate relevant information to the hardware.

For example, a typical functional language may represent pending or forced computations like this:

type α thunk = Forced of α | Suspended of (unit -> α)

and lazy values like that:

type α lazy = ref (α thunk)

Then, every attempt to force a lazy value will branch on the thunk to determine if it has been forced yet. For a given lazy value, the branch is trivially predictable: it is never taken except the first time (assuming “taken” means that the value has not been forced yet — the Suspended case above). But if we have a single library function that performs that branch for all lazy values, then that branch is much harder to predict (branch histories help, but they unalias branches on the same value happening at two different locations in the program, so they are not ideal either).

Similarly, a typical bytecode interpreter for a non-JITted language will have a single piece of code implementing that bytecode's notion of conditional branches, and that piece of code will use a branch conditioned on a register or stack entry of the bytecode's VM. The result is that all branches of the interpreted program are overlaid onto a single branch of the interpreter.
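
As a tiny illustration (a hypothetical stack-based bytecode, not any real VM), every conditional branch of the guest program below is executed by the single host-level if inside the dispatch loop, which is exactly the overlaying described above.

def run(code):
    stack, pc = [], 0
    while pc < len(code):
        op, arg = code[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "JUMP_IF_FALSE":
            if not stack.pop():       # the one hardware branch onto which every
                pc = arg              # branch of the interpreted program is overlaid
                continue
        elif op == "HALT":
            break
        pc += 1
    return stack

# run([("PUSH", 0), ("JUMP_IF_FALSE", 3), ("PUSH", 1), ("HALT", None)]) == []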

Proposal

Explore software and hardware approaches to mitigate this issue: in the lazy case, maybe this means assigning distinct branch instructions to each lazy value. In the bytecode case, this could mean exposing to the hardware the value of the interpreted PC.

Problem

Some applications use lots of memory. RAM compression helps (recent Windows kernels do it by default), but it comes with high CPU costs, and it is usually done in an application-agnostic way (by the OS or the architecture). Can we improve things using application-specific algorithms and compressed datastructures? Are there application-specific ways to compress RAM that interact better with the application's access patterns?

Proposal

Investigate application-specific compression schemes compatible with operations performed on the decompressed data.

For example, Emacs uses a gap buffer, and every editing operation requires moving the gap to the right place (a linear-time operation) and then performing edits. A standard compression algorithm like gzip applied to the whole buffer would make motion and changes costly (changes at position x may require re-compressing every position after x).

Alternative compression schemes may perform better: for example, compressing both sides of the gap separately, from each end (left-to-right on the left part, right-to-left on the right part), would have better locality, and changes would not require re-encoding (on the other hand, motion would).

(There's a wealth of work out there on elaborate data structures that represent data more densely. This proposal is narrower in scope: can we stick closely to existing datastructures, but carefully apply existing compression schemes to slices of the program's data?)
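
As a simplified sketch of the kind of scheme we have in mind for the gap-buffer case (chunked rather than per-character, and ignoring the reversed right-hand side), the toy class below keeps compressed chunks on both sides of the gap: insertions at the gap never recompress existing text, and gap motion only touches the chunk adjacent to the gap.

import zlib

class CompressedGapBuffer:
    def __init__(self, text=""):
        self.left = [zlib.compress(text.encode())] if text else []  # chunks left of the gap
        self.right = []                                             # chunks right of the gap

    def insert(self, s):
        # Editing at the gap adds a new chunk; existing chunks are untouched.
        self.left.append(zlib.compress(s.encode()))

    def move_gap_left(self):
        # Motion only moves the chunk adjacent to the gap.
        if self.left:
            self.right.insert(0, self.left.pop())

    def text(self):
        return "".join(zlib.decompress(c).decode() for c in self.left + self.right)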

Problem

Some programs, like SMT solvers, put a lot of pressure on the data cache, but not on the instruction cache. Can we use the instruction cache to store data?

Proposal

Encode data in the instruction cache through a form of just-in-time compilation.

For example, consider a multiply-and-add type of computation with a constant matrix M, a constant vector B, an input A, and an output C: C = M * A + B. Assume that M is a matrix that barely fits in cache, without enough space to store B as well. Then we should be able to replace the + B part of the computation (implemented with a loop that consumes very little instruction cache but loads B into the data cache) by a dynamically generated piece of assembly that encodes the + B as a sequence of memory operations with immediates holding the values of B. Concretely, for B = [12, 4, 8, 27, 31, 1, 44, 18], instead of running a small loop with a tiny instruction cache footprint, like this:

for (i = 0; i < 8; i++)
  c[i] += b[i];

… we'll run an unrolled computation that hardcodes B, with a larger instruction cache footprint but no data cache footprint, like that:

c[0] += 12; c[1] += 4; c[2] += 8;  c[3] += 27;
c[4] += 31; c[5] += 1; c[6] += 44; c[7] += 18;

In assembly, this would mean replacing this:

.loop:
     mov     edx, DWORD PTR [rsi+rax]
     add     DWORD PTR [rdi+rax], edx
     add     rax, 4
     cmp     rax, 32
     jne     .loop

with this:

add     DWORD PTR [rdi],    12
add     DWORD PTR [rdi+4],  4
add     DWORD PTR [rdi+8],  8
add     DWORD PTR [rdi+12], 27
add     DWORD PTR [rdi+16], 31
add     DWORD PTR [rdi+20], 1
add     DWORD PTR [rdi+24], 44
add     DWORD PTR [rdi+28], 18

In more general cases, less efficient encodings are possible: for example, we can store a value as an immediate in an instruction that sets a register, followed by a ret to jump back to the caller (this is morally the same as storing an array of values as a flat array of functions that each return the corresponding value of the original array — “flat” because the array does not store pointers to functions, it stores the assembly code of each function directly).

Static analysis, program synthesis

Context

Dynamic variables (defvar and defcustom in Emacs Lisp, make-parameter in Racket) are a convenient way to pass arguments deep down a call chain without polluting every function along the chain with extra arguments (avoiding so-called tramp data). Other languages do this with global variables, implicit parameters [Hughes2004], etc.

Problem

Dependencies on the values of dynamic variables are not easy to track: for example, lots of code in Emacs incorrectly depends on the value of case-fold-search (a variable that controls whether text searches are case-sensitive). Lisps have a great way to locally shadow a global (that's what let-binding is about), but authors often forget these let-bindings.

Proposal

Implement a static analysis pass that bubbles up dependencies on dynamic variables. Evaluate it by annotating Emacs primitives and seeing how bad the problem of dynamic variable pollution is on popular Emacs libraries. Annotating primitives could be done by hand, by instrumenting the Emacs binary, or by statically analyzing its C sources.
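
A minimal sketch of the core propagation step, assuming we have already extracted, for each function, the dynamic variables it reads directly and the functions it calls (the Elisp names and the defs table below are purely illustrative):

def dynamic_signatures(defs):
    # defs: name -> {"reads": set of dynamic variables, "calls": set of callees}
    sigs = {f: set(info["reads"]) for f, info in defs.items()}
    changed = True
    while changed:                        # fixpoint over the call graph
        changed = False
        for f, info in defs.items():
            for callee in info["calls"]:
                new = sigs.get(callee, set()) - sigs[f]
                if new:
                    sigs[f] |= new
                    changed = True
    return sigs

defs = {
    "search-forward": {"reads": {"case-fold-search"}, "calls": set()},
    "my-helper":      {"reads": set(), "calls": {"search-forward"}},
    "my-command":     {"reads": set(), "calls": {"my-helper"}},
}
# dynamic_signatures(defs)["my-command"] == {"case-fold-search"}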

Follow-Up

Implement a caching version of the analysis that is fast enough to run in real-time, and display the “dynamic signature” information as users type their programs.

[Hughes2004]

Hughes, John. 2004. Global variables in Haskell. Journal of Functional Programming 14(5), 489–502. doi:10.1017/S0956796802004471

Problem

Advanced IDEs offer powerful keybindings, but users commonly don't use them optimally.

For example, in Emacs, I can use C-a <tab> C-SPC C-e C-w to cut a whole line of text without its leading indentation:

C-a      ;; beginning-of-visual-line
<tab>    ;; indent-for-tab-command
C-SPC    ;; set-mark-command
C-e      ;; end-of-visual-line
C-w      ;; kill-region

It's much more efficient, however, to use M-m C-k:

M-m      ;; back-to-indentation
C-k      ;; kill-visual-line

The same thing happens all the time in vim and other modal editors: users are constantly writing single-use programs to move around a text buffer and perform edits.

Proposal

Design and implement a program optimizer for keyboard macros to suggest shorter key combinations as users type.

Peephole optimizations are probably a good place to start, but eventually we want context-sensitive optimizations too: for example, C-; (flyspell-auto-correct-previous-word) is a valid alternative to C-5 C-b C-d C-d Em when the buffer contains EMacs right before the current point, and so is M-b M-c (both correct the capitalization to Emacs).

Eventually, we need the system to be fast enough to run as the user types, and there are interesting questions around the feedback mechanisms to use.
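
A peephole pass over key sequences could start as simply as the Python sketch below, with a hypothetical hand-written rule table; a real tool would derive its rules from the active keymaps rather than hardcode them.

RULES = [
    # (observed key sequence, shorter equivalent); illustrative bindings only
    (("C-a", "<tab>", "C-SPC", "C-e", "C-w"), ("M-m", "C-k")),
    (("C-f",) * 4, ("C-4", "C-f")),
]

def peephole(keys):
    out, i = [], 0
    while i < len(keys):
        for lhs, rhs in RULES:
            if tuple(keys[i:i + len(lhs)]) == lhs:
                out.extend(rhs)
                i += len(lhs)
                break
        else:
            out.append(keys[i])
            i += 1
    return out

# peephole(["C-a", "<tab>", "C-SPC", "C-e", "C-w"]) == ["M-m", "C-k"]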

Follow-Up: Hand models

Incorporate a physical hand model to suggest more ergonomic key combinations, not just shorter ones.

Follow-Up: Telemetry

Collect a large corpus of user inputs and aggregate suggestions across multiple users.

Proof presentation and exploration

Context

Plain text isn't the ideal way to present most proof-assistant goals; early experiments with company-coq, Lean widgets, and Alectryon renderings show that there's plenty of demand for domain-specific graphical representations of proof states.

Proposal

Collect examples of domain-specific renderings, and implement them in the browser and in a proof assistant. Study specifically the diagrams used in published textbooks: what representations do people use when they are not constrained by plain text?

Follow-Up: Unified rendering

For this exploratory project we do not need a unifying theory or even particularly convenient interfaces to extend the notation system; instead we're hoping to collect interesting examples and measurably increase comprehensibility in these specific cases.

For a more ambitious follow-up, see the Mixed rendering for proof presentation proposal below, which generalizes this idea by unifying the rendering engine used for the examples that this project will collect.

Context

Proof assistant goals often mix concepts from many different domains, each of which may be rendered differently for optimal clarity. For example, the following goal mixes programming language syntax (C code), logical syntax (Coq matching), graphs (separation logic), math (arithmetic formulae), and data structures (separation logic predicates):

{{ let t := tree p0 (Branch (k0, v0) t1
                            (match Heq return … with
                             | eq_rect => Branch (k1, v1) Empty t1
                             end)) in
   t * (t -* tree p tr) }}
  while (p && p->k != k0) {
    sum += p->u;
    p = p->k < k0 ? p->l : p->r;
  }
  v = p ? p->v : null;
{{ λ out. out["v"] = (find k tr).(v) ∧
          out["sum"] = sum (map v (path k tr)) }}

Problem

Domain-specific visualization tools tend to be designed in a vacuum, and are typically not very modular: different fields have come up with different ways to build visualizations (graphs, diagrams, pictures, pretty-printed text, etc.), but they can't be combined easily. As a concrete example, Graphviz (one of the top graph-drawing programs) does not support displaying anything but text or fixed-sized images within a graph node.

Partly as a result, rich goal-visualization tools are rare in proof assistants, and mostly focus on a single custom kind of visualization. In the vast majority of cases it is plain text (possibly with some notations) that is used to represent everything, and that leads to goals that are much harder to read than needed.

Constraint-solving diagramming and rendering engines partly solve this problem, but:

  1. Many rendering algorithms have not been implemented efficiently as constraint solving problems (I'm not aware of an encoding of Wadler-style pretty printers or Coq's box-layout models as a general constraint-solving problem).

  2. Constraint solvers are often not continuous in the sense that small perturbations of the input preserve most of the output. When we're looking at consecutive proof states, we want to see clearly what changed from one step to the next, even if that leads to a slightly suboptimal rendering for the second goal.

Proposal

  1. Design a reasonably modular constraint-solving-based renderer, encode common kinds of pretty-printing as well as a few graph-layout problems, and use it to render mixed proof goals.

  2. Design an appropriate extension mechanism to allow new kinds of rendering along with a language to describe layout problems, and implement a notation mechanism in Coq that lets users describe how to render specific types (or symbols?) using this layout language.

  3. Study continuity by constraining new renderings to be sufficiently “close” to previous ones.

Problem

Diagram chasing is a technique for proving mathematical statements by manipulating directed graphs. These proofs are tedious to draw and maintain using current computer tools, especially because humans want to see the graphs evolve through the proof with some form of animation, rather than static illustrations of a proof state.

Proof assistants have the potential to improve upon this experience because their natural mode of operation is to evolve a state through proof steps. If they could produce diagram-chasing graph visualizations, they would let proof readers/writers work with familiar high-level mathematical tactics and see diagrams evolve automatically as expected. Indeed, several projects to build a diagram chasing interface for a proof assistant are in progress. The question is: what is the ideal mechanized diagram-chasing user experience?

Proposal

Build lightweight, malleable mockups of a handful of scripted interactions to explore key design axes:

  • What layout algorithms or heuristics work well for common diagrams? And if desired, how might the user customize the layout of the diagram?

  • When stepping through a completed proof, the completed diagram is known in advance; and even otherwise, diagrams are mostly constructed additively. How might the tool avoid re-laying out a diagram as much as possible and make inevitable re-layouts easy to follow?

  • If the diagram has to serve as an input mechanism too, e.g. by selecting edges to apply a lemma or drawing new ones, then how might a proof thus produced be made robust to future changes? How might we record the user’s objective at each step rather than just their actions?

  • Is the diagram better represented as a flat structure (adjacency matrix or list of edges) or an inductive one (?) or something else? If inductive, how might we avoid encumbering the user with reassociating and redistributing terms at every step to focus on subdiagrams?

  • What can we take and learn from manual diagramming tools with mature interfaces like quiver and TikZiT? Especially if users are already familiar with them?

  • How might we translate what we learn from the above into a fully-fledged tool?

Proof engineering

Problem

The magic wand operator -∗ of separation logic isn't a satisfactory way to talk about operations that return parts of a data structure: it allows only one borrow. Lots of extensions have been proposed. What's the right way to represent repeated borrows from a data structure?

As a concrete example, consider the following setup: we have an associative map (say, a tree) that stores keys and values inline (for concreteness we'll say that keys are bytes and values are constant-size objects).

We write map p m to say that map m is at address p and val p v to say that value v is at address p. We can use a simple representation for m, say a list of key * value pairs. For example, we may say map 42 [(37, v1), (12, v2)], which expands to a set of separating conjunctions, including val p1 v1 and val p2 v2 for some pointers p1 and p2.

First, let's consider the single-borrow case (assume that p0 points to a map that contains k1 and lookup returns an inner pointer to the value corresponding to key k1):

val* p1 = lookup(p0, k1);
memset((void*)p1, (int)0, sizeof(val));

How do we reason about this code fragment? To call memset we need ownership of p1, so we need a fact like val p1 v1. Assuming we start with map p0 [(k1, v1), (k2, v2), …], what do we know after running lookup, right before the memset?

It can't be val p1 v1 * map p0 [(k1, v1), (k2, v2), …], because lookup returns an internal pointer, not a copy, so there would be double ownership of p1 (we could say val p1 v1 ∧ map p0 [(k1, v1), …], but the non-separating conjunction is quite hard to work with in separation logic).

It can't be val p1 v1 * map p0 [(k2, v2), …], because this loses track of the space that was allocated for the key k1 inside the map.

This is what the magic wand operator is for; it's the separation-logic version of implication. We write val p1 v1 * (val p1 v1 -* map p0 [(k1, v1), (k2, v2), …]) to indicate that we have “borrowed” val p1 v1 out of the map. This is enough to then call memset, though we need to be careful about how to plug the mutated v1 back into the data structure, since after calling memset we have val p1 zeroes * (val p1 v1 -* map p0 [(k1, v1), (k2, v2), …]), and hence what we have and what's on the left of the magic wand doesn't match anymore (there's a paper about this by Andrew Appel, I think).

But now consider something trickier:

val* p1 = lookup(p0, k1);
val* p2 = lookup(p0, k2);
memmove((void*)p1, (void*)p2, sizeof(val));

We can't even call lookup the second time: we have a precondition like val p1 v1 -* map p0 [(k1, v1), (k2, v2), …], but we need a clean map fact to call lookup.

Proposal

Explore solutions to this problem. Malte Schwerhoff and Alex Summers (and perhaps other folks at ETH) had work on iterated separating conjunctions that approached this problem. I'd like to explore explicitly annotating borrows in the logical representation of the data instead (use list (key * option value) instead of list (key * value), where None means that the map has temporarily relinquished ownership of the corresponding value).
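
The bookkeeping is easy to prototype outside the logic; here is a hedged Python analogue (standing in for the Coq-level representation, not for the separation-logic proof itself) in which a borrowed slot is marked None, so the remaining map fact stays clean and a second lookup can still be justified:

def borrow(m, k):
    # m: list of (key, value) pairs; None marks a slot whose value is borrowed.
    for i, (key, val) in enumerate(m):
        if key == k and val is not None:
            m[i] = (key, None)       # the map relinquishes ownership of this value
            return val
    raise KeyError(k)

def unborrow(m, k, val):
    for i, (key, slot) in enumerate(m):
        if key == k and slot is None:
            m[i] = (key, val)        # ownership (possibly of a mutated value) returns
            return
    raise KeyError(k)

# m = [("k1", "v1"), ("k2", "v2")]
# borrow(m, "k1"); borrow(m, "k2")   # both borrows go through: the map fact stays clean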

We have a more advanced version of this in Rupicola with two levels of borrowing (reserving and borrowing) and a more expressive type (instead of option you use a type that records the original pointer so that you can unborrow more easily). One nice aspect is that it allows you to borrow subparts of the whole data structure if you change the constructors of the type that represents the structure of the object (like list in this case); for example, you can borrow complete subtrees, mutate them separately, and plug them back in.

There are plenty of interesting questions on automating reasoning about this, and plenty to explore around representation choices, too.

Problem

What is a good representation for hardware circuits (specifically one to be used as a compilation target) in the context of a proof assistant like Coq?

Context

For circuits respecting the digital abstraction, there are some properties that we want to enforce (such as absence of combinational cycles), but not necessarily with types (we may want to prove that). Encoding choices have a strong impact on how easy it is to reason about circuits (for example, a fully general graph encoding will not allow us to write a structurally decreasing interpreter, because of termination concerns).

The Kôika language uses a cute encoding hack to avoid reasoning explicitly about cycles and sharing in the circuits it generates: its type of circuits is a tree, and the DAG structure is encoded using implicit in-memory sharing; then, an unverified pass reconstructs a proper DAG by testing for physical equality (in other words, there are no let-bindings in Kôika's circuit type). Let's take a concrete circuit for illustration purposes (A and B are arbitrary circuits, potentially large):

[Figure: an XOR circuit, (¬A ∧ B) ∨ (A ∧ ¬B), drawn as a DAG in which the subcircuits A and B are each shared by two gates.]

In Kôika, the same circuit will be represented as (¬A ∧ B) ∨ (A ∧ ¬B), or more precisely as let a := A in let b := B in (¬a ∧ b) ∨ (a ∧ ¬b); the two instances of a are physically equal, so it's possible to reconstruct the graph structure and the proper circuit sharing, but reasoning and proofs get to treat circuits as trees, which simplifies them greatly.
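
A quick Python analogue of the trick (the class and function names are illustrative, not Kôika's): circuits are built as plain trees, sharing is implicit in object identity, and an unverified pass recovers the DAG by memoizing on id().

class Gate:
    def __init__(self, op, *args):
        self.op, self.args = op, args

def xor(a, b):
    # A tree in which `a` and `b` each occur twice, but as the same objects.
    return Gate("or", Gate("and", Gate("not", a), b),
                      Gate("and", a, Gate("not", b)))

def tree_size(c):
    return 1 + sum(tree_size(x) for x in c.args)

def dag_size(c, seen=None):
    # The unverified pass: detect sharing by testing physical identity.
    seen = set() if seen is None else seen
    if id(c) in seen:
        return 0
    seen.add(id(c))
    return 1 + sum(dag_size(x, seen) for x in c.args)

a, b = Gate("input"), Gate("input")
c = xor(a, b)
# tree_size(c) == 9, but dag_size(c) == 7: the duplicated a and b are recovered as shared.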

(It is interesting to think of what kind of optimizations are expressible on this tree representation: in particular, graph simplifications and circuit optimizations can be performed as part of the construction of the graph, but not as a separate pass after the fact, since applying a recursive function that reconstructs a tree circuit destroys sharing.)

In general, there are many constraints to satisfy: simplicity, ease of reasoning, support for local and global optimizations, compatibility with (or ease of export to) other representations, expressiveness, performance, etc.

Proposal

Design or identify a usable representation of circuits, and use it to retarget the Kôika compiler, reducing its trusted base. Explore extensions to a general representation of modules, including theorems about linking.

Programming language implementation

Problem

Coq's type inference does not introduce new type variables (it does not allow generalization). For example, Coq will refuse to typecheck the following definition:

Fail Fixpoint map f xs :=
  match xs with
  | [] => []
  | x :: xs => f x :: map f xs
  end.

This makes writing non-dependently typed programs in Coq more painful than in (say) OCaml.

Proposal

The problem is not solvable in general because of the power of Coq's type system; but it would still be nice to handle simple cases like the above; it's possible that it would be enough to try and plug a type variable in each remaining hole after Coq's own type inference (Hindley-Milner's standard generalization rules). The most interesting insight here would probably be to see in which way this fails: does it ever lead to confusing errors? Does it lead to other issues?

Problem

The following Scala program runs out of memory:

def print_forever() =
  val ints = LazyList.from(0)
  ints.foreach(println)

In theory, this program should use constant memory: values are read one by one from ints, and each value can be discarded immediately after being printed, before the next value is generated.

Unfortunately, things don't work that way in Scala today. Instead, the ints variable, which lives on the stack, pins the whole lazy list in memory until foreach returns (which never happens).

Proposal

Change the JVM backend of the Scala compiler to null out stack variables right after their last use. In the above case, ints should be pushed onto the call stack and nulled out before control is transferred to foreach.

Follow-Up: Avoiding memory leaks in iterators and coroutines

A similar issue pops up when control is transferred out of a (co)routine while it keeps a reference to dead variables. For example:

while (var item = queue.pop()) {
  await DoWork(item);
}

Even if DoWork releases its reference to item immediately, the contents of item are pinned by the reference on the stack (var item) until DoWork returns.

My simplest proposals for improving Coq and its ecosystem live on the Coq bug tracker. The difficulty ranges from a solid afternoon to a few days of work. Beyond that, here are things that don't fit on the bug tracker:

Proposal: Better unification

Coq's unification errors are not very informative: often Coq just reports that the whole goal fails to unify with a whole lemma; then you spend a lot of time manually padding arguments to the lemma until it applies (typically because of a dependent type) or until you get a proper error that pinpoints the issue. The proposal is to report on a more eager unification attempt by pinpointing where a naive unification ended up failing.

Proposal: Single-step eauto

For extensible forward reasoning I use a trick with typeclasses eauto + shelve. This trick is too complicated and should be replaced by an eauto_one with db that applies any lemma in database db that matches the current goal.

Proposal: An Alectryon backend for coqdoc

Coqdoc is hard to migrate from, but its syntax is reasonably simple. Writing a coqdoc backend that generates Alectryon documents would make it easier to switch between tools.

Proposal: Printing with sharing

Debugging Coq performance issues can require careful analysis of term sizes and memory usage. An option similar to Lisp's *print-circle* would make it much easier to see what's going on.

Proposal: Automatic goal naming

Coq functions (and hence theorems) can have named arguments (e.g. substring: forall (s: string) (start: nat) (len: nat), string), but most tactics do not make use of this fact. Automatically naming goals after arguments would make it much nicer to navigate large manual proofs.

Productivity and debugging

Problem

Backtraces are a terrible way to understand and debug issues. At worst they show nothing but a list of lines that led to an error (without concrete values); at best they show the state of the program when it crashed, but not how it got there.

Proposal

Implement a tracing debugger that records every single line executed by a program, with concrete values substituted for variables and expressions. The result should be a full trace, with indentation indicating call depth.

This will generate huge traces. That's OK. It would be particularly useful for bugs deep in unfamiliar code (and it would beat detailed stack traces by far).

Here is a concrete example. Take this Python program:

def swap(l, src, dst):
    l[src], l[dst] = l[dst], l[src]

def shuffle(l: List[int]):
    for i in range(len(l)):
        swap(l, i, random.randint(i, len(l)))

l = MutableString("abcde")
random.seed(1)
shuffle(l)

It has a bug: the bounds of randint should be i, len(l) - 1. Running it prints the following backtrace, which does not make it clear exactly what happened:

Traceback (most recent call last):
  File "~/shuffle.py", line 10, in <module>
    shuffle(l)
  File "~/shuffle.py", line 6, in shuffle
    swap(l, i, random.randint(i, len(l)))
  File "~/shuffle.py", line 2, in swap
    l[src], l[dst] = l[dst], l[src]
IndexError: list index out of range

The proposal is to generate this instead:

# for i in range(len(l)):
  for i⇒0 in range(len('abcde')⇒5):
     # swap(l, i, random.randint(i, len(l)))
       swap('abcde', 0, random.randint(0, 5)⇒1)
       # def swap(l, src, dst):
       ⤷ def swap(l⇒'abcde', src⇒0, dst⇒1):
           # l[src], l[dst] = l[dst], l[src]
             l[0], l[1] = l[1]⇒b, l[0]⇒a
# for i in range(len(l)):
  for i⇒1 in range(len('bacde')⇒5):
     # swap(l, i, random.randint(i, len(l)))
       swap('bacde', 1, random.randint(1, 5)⇒5)
       # def swap(l, src, dst):
       ⤷ def swap(l⇒'bacde', src⇒1, dst⇒5):
           # l[src], l[dst] = l[dst], l[src]
             l[1], l[5] = l[5]⇒💥IndexError, l[1]⇒a
             💥 IndexError: list index out of range

Follow-Up: Reduce clutter by handling nesting interactively

Start by displaying this:

# swap(l, i, random.randint(i, len(l)))
  swap('bacde', 1, 5)

Then if the user clicks the 5, change the line to

# swap(l, i, random.randint(i, len(l)))
  swap('bacde', 1, random.randint(1, 5)⇒5)

Then if the user clicks the first 5, change to

# swap(l, i, random.randint(i, len(l)))
  swap('bacde', 1, random.randint(1, len('bacde')⇒5)⇒5)

etc.

Follow-Up: Show implicit state

Some languages operate mostly on hidden state (think Ltac in Coq, for example). How do we represent traces in these languages?

Follow-Up: Compute trace diffs

This would make it trivial to detect where execution diverges between two separate runs. This is a particularly tricky problem today for porting proof scripts across versions of Coq, so this would be especially useful for Ltac traces in Coq.

Follow-Up: Make the traces editable

Changing a value anywhere should update the resulting trace in the rest of the program.

Context

Traditional profilers are great at identifying obvious performance bottlenecks and blatant performance bugs. Unfortunately (perhaps because profilers work so well for that?), blatant performance bugs are not why most programs are slow — there are only so many places in a program where you can replace a slow algorithm with a fast one.

In fact, many established codebases are slow despite using efficient algorithms, and show no obvious bottlenecks with a typical profiler. Instead, inefficiencies often come from doing redundant or unnecessary work.

Problem

Traditional profilers don't help with identifying and eliminating useless computations: they don't automatically identify opportunities for memoization, for example.

Proposal

Design a different kind of profiler that uses memory snapshots and taint tracking to flag redundant or unused work.

Concretely, we'd want to show programmers information on how many times each function ended up computing distinct results from distinct inputs — not just how many times each function was called (this would help decide whether it's worth caching (or “memoizing”) the results of the function).
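
A first cut at this measurement can be prototyped as a decorator; this is a Python sketch that assumes hashable arguments and results, nothing more.

import functools

STATS = {}   # function name -> {"calls": int, "distinct": set of (args, result)}

def profile_redundancy(f):
    stats = STATS.setdefault(f.__name__, {"calls": 0, "distinct": set()})
    @functools.wraps(f)
    def wrapper(*args):
        result = f(*args)
        stats["calls"] += 1
        stats["distinct"].add((args, result))
        return result
    return wrapper

def report():
    # A low ratio of distinct computations to calls suggests memoization would pay off.
    for name, s in STATS.items():
        ratio = len(s["distinct"]) / max(s["calls"], 1)
        print(f"{name}: {s['calls']} calls, {len(s['distinct'])} distinct computations ({ratio:.0%})")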

We'd also want to get information about how many times the result of a computation was used: if the results of a computation are unused, we'd want to suggest removing that computation, or disabling it for certain inputs.

Finally, we might want to do this at a finer granularity than a single function, to identify places where computations performed by a given function differ in trivial ways, and the bulk of the work could be reused across calls.

Problem

The world depends on massive amounts of code written in legacy languages (think COBOL, early versions of Fortran, etc.), and we don't have enough developers to maintain them.

Proposal

Write a translator from a legacy language (say, COBOL) into a more popular one (say, Python). The result should be as readable and idiomatic as possible, so that a developer experienced in Python will be able to work with the result, using the standard suite of Python tools: debuggers, static analyzers, etc. (maybe even profilers!).

Then write a reverse translator, so that changes performed on the translation can be back-propagated to the source. Not all changes can be represented, of course, but it should be possible to reorder computations, change constants, add function parameters, introduce logging instructions, move code from one project to another (e.g. support for a new input format), etc.

Follow-Up: Back-propagating manual optimizations

This doesn't need to be just about legacy languages: suppose I write a program in C and I don't like GCC's output (maybe it allocated registers suboptimally, or used an inefficient instruction, or failed to realize that an optimization was possible). Can I change the assembly and reverse-engineer a source-level change that would achieve that result?

(The answer is trivially yes, since GCC supports inline assembly in C programs, but we want more reasonable changes like adding assertions that help the optimizer, or pragmas that control it, or even small code changes that don't change the meaning of the program but interact better with the optimizer.)

Problem

Adding support for a new language (think syntax highlighting, semantic navigation, indentation, completion, structural editing, code folding, etc.) to a text editor is a very complex and time-consuming process. Initiatives like the language server protocol and tree-sitter mitigate the n×m problem (n editors × m languages), but implementing an LSP server or tools on top of a tree-sitter AST is still quite time-consuming.

Proposal

Enhance a parser-generator framework (say, Menhir) to accept indentation and other annotations in such a way that the grammar serves as a single source of truth for most syntactic features of the IDE (highlighting, indentation, structural editing, code folding, and related features). Leverage existing error-recovery mechanisms to (mostly) correctly highlight and indent partial or broken programs.

Follow-Up: Leverage parsing information for completion

IDE completion works well for symbol names (usually a single word), but is seldom needed in mainstream languages for syntax. With languages like Coq that support syntax extensions (think Coq's tactic notations), and to some extent with languages that support Lisp-style macros (think CL's loop), more powerful completion mechanisms would be very welcome: completion would be rephrased as completing the last word of an incomplete program in a way compatible with an annotated grammar, so a user of Coq typing apply H in would be offered completions based on the productions allowed by the mini-language of goal and hypotheses selection.

Follow-Up: Bounded lookback

A standard parser would require parsing from the beginning of a file, which is not just costly but also undesirable: if a block of three lines is supposed to be at the same indentation and I purposefully indent the first one by hand in a way that disagrees with the automated system, reindenting the next line should line it up with the first one, not reparse from the beginning and line it up with where the first line should have been (Stefan Monnier has a working version of this in a tool called SMIE in Emacs). Additionally, there's a high cost to reparsing from the start of a buffer for large buffers; instead we prefer (need) to indent and highlight relative to some local context.

Software engineering

Problem

Semantic versioning is supposed to eliminate dependency hell by making version numbers informative and predictable. In practice, developers can make mistakes (incrementing version numbers too conservatively or too eagerly, e.g. by forgetting certain changes). How common are these mistakes, and how often do they cause issues? (Some ecosystems, like Elm, automatically enforce a form of semver by diffing APIs.)

Proposal

Use static analysis (API diffs) to check assigned version numbers for consistency with semver. Do human-assigned version numbers conform with semantic versioning?

Compare this with tools like semantic-release. How often are these tools wrong? (They use human annotations to determine the next version number, so they leave some responsibility to developers).

How about automatic enforcement with API diffs? How often does that make mistakes? (That technique can miss relevant changes that don't alter the signature or name of a function but change its semantics.) To answer this question API diffs will not be enough, but we can gather interesting data by looking at the amount of breakage, if any, that follows an update (if program P depends on library L and library L gets a minor or patch release, then the test suite of P should not be affected; if it is, then L's change probably should have been classified as major, and the misclassification went unnoticed).
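
The core check is easy to sketch once API surfaces have been extracted (here, hypothetically, as dictionaries mapping public names to signature strings):

def required_bump(old_api, new_api):
    if any(new_api.get(name) != sig for name, sig in old_api.items()):
        return "major"   # something public was removed or its signature changed
    if set(new_api) - set(old_api):
        return "minor"   # additions only
    return "patch"       # identical public surface

old = {"parse": "(text: str) -> Ast", "dump": "(ast: Ast) -> str"}
new = {"parse": "(text: str, strict: bool) -> Ast", "dump": "(ast: Ast) -> str"}
# required_bump(old, new) == "major"; note that semantic changes behind an
# unchanged signature are exactly what this kind of check cannot see.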

Notes

Théo Zimmermann kindly pointed me to this very relevant paper, which suggests that there is wide deviation between human-assigned version numbers and proper semver. This is a great starting point to refine the question above (specifically: do projects that claim to follow semver in fact follow it?), and to explore the other questions that follow it.

(I suspect this has been done in various forms, so it might be enough to do a literature search and produce a nice summary of the results across various blogs and papers and the CIs of some large FLOSS projects.)

Problem

Compiler updates add a mix of new optimizations and new features, often at the cost of higher compilation times. How much do these updates actually buy you in terms of performance? Is that worth the added compilation cost?

Proposal

Pick a few interesting programs from the 60s to the 80s and compile them with every version of, say, GCC from then to now, on a modern machine. How much faster does the result get? How much slower does the compilation process get?

Follow-Up

There is a common adage in compiler development (I forget who coined it) that the only optimizations that are worth it are those that make the compiler itself faster (this is assuming that the compiler is written in the same language that it compiles). Is it true? Which optimizations make, say, GCC compile itself faster, and which ones make it compile itself slower? Does this correlate with the usefulness of these optimizations in the wild?

Problem

Most free software uses text-only unit and integration tests (Emacs is like that, for example). Unfortunately, these tests fail to catch all sorts of interesting bugs that only manifest graphically (such bugs are hard to test for non-graphically, so large parts of the codebase are in fact untested).

Proposal

Apply the visual regression testing methodology to existing, large free software apps: use a DSL (essentially keyboard macros and mouse clicks) to drive interactions with the software, taking regular screenshots, and compare these screenshots across versions. Flag revisions that create display changes and review them for bugs. Does this find many new bugs? Does it flag bugs that were later reported by users and manually diagnosed?
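
The comparison step itself is cheap; here is a minimal sketch using Pillow, where the run_script driver, the scenario list, and the file layout are hypothetical placeholders for whatever DSL ends up driving the application.

from PIL import Image, ImageChops

def screenshots_differ(old_png, new_png):
    diff = ImageChops.difference(Image.open(old_png), Image.open(new_png))
    return diff.getbbox() is not None    # None means the screenshots are pixel-identical

# for scenario in scenarios:
#     run_script(new_build, scenario, out=f"new/{scenario}.png")   # hypothetical driver
#     if screenshots_differ(f"old/{scenario}.png", f"new/{scenario}.png"):
#         flag_for_review(scenario)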

Context

I implemented something like this a while back for Emacs, specifically in the context of automatic bisecting: I had observed a specific display bug, and automated screenshotting combined with git bisect pinpointed the source of the issue in a few hours of automatic recompile-and-screenshot cycles. Visual regression testing has also been very useful in testing company-coq, Alectryon, and other packages (it caught many bugs that were not caught by other testing mechanisms).

PL+X

Context

Role-playing video games are built around “quests” that players collect and resolve as they explore the world: to attain a goal, players need to perform a series of steps (talking to a character, collecting an item, visiting a location, etc.). A simple model for understanding these quests is labeled transition systems: actions taken by the player influence the state of the game and hence the state of a given quest (the labels are the entries in the player's “journal” or “quest log”).

Additionally, RP video games are very often extensible, so quests are expressed in a custom programming language that so-called “modders” use to describe new quests or patch existing ones.

Problem

RP video games are very buggy. It's common for quests to get stuck (a place in the state machine where no transitions are possible), to require repeated interactions (the transition system fails to take a step despite a trigger), to require a reset (the state of the whole transition system is corrupt), etc.

Proposal

Use automated program verification to prove functional assertions and reachability / liveness properties about complex quests, such as:

  • There is always a way to complete a quest: either success or failure, but no quest that persists forever, unfinished in the journal (a form of liveness).

  • It's not possible to obtain the reward without completing a specific step (a form of functional correctness).

  • It is not possible to receive a quest reward twice (a form of linearity).

etc.: existing bugs and patches provide a rich source of inspiration for interesting properties.
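
As a toy illustration of the liveness property (a hypothetical quest modeled as a labeled transition system in Python; real quest scripts would be translated into something like this):

from collections import deque

TRANSITIONS = {                       # state -> {player action: next state}
    "offered":  {"accept": "accepted"},
    "accepted": {"find_item": "has_item", "abandon": "failed"},
    "has_item": {"return_item": "rewarded"},
    "rewarded": {},                   # terminal: success
    "failed":   {},                   # terminal: failure
}
TERMINAL = {"rewarded", "failed"}

def reachable(start):
    seen, todo = {start}, deque([start])
    while todo:
        for nxt in TRANSITIONS[todo.popleft()].values():
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def always_completable(start="offered"):
    # From every reachable state, some terminal state must remain reachable.
    return all(reachable(s) & TERMINAL for s in reachable(start))

# always_completable() == True; dropping the "return_item" transition makes it False.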

Problem

Checking Coq proofs can be extremely slow, because it requires re-executing an arbitrarily complex program. We'd like a consistently fast way to check a proof, so as to be able to cheaply verify a claimed proof of a property, possibly using a protocol or encoding that requires extra work on the prover's side (that is, on the side of the person supplying the proof, as opposed to the verifier who checks it).

Context

Research on PCP (probabilistically checkable proofs) has progressed a lot lately, and PCP in theory provides a cheap way to verify proofs (albeit at the cost of generating very large certificates).

Proposal

Investigate a PCP encoding of traces of Coq's VM, with the goal of confirming the correctness of a claimed reduction.

Literate programming and documentation

Context

Alectryon currently supports Coq and has preliminary support for Lean3; folks are working on porting it to Lean4. Experimenting with new proof languages yields interesting insights into similarities and differences and how different communities like to present their proofs.

Proposal

Extend Alectryon to support other protocols and proof languages. We have early support for EasyCrypt; F* would be another natural candidate (early F* work was the inspiration for Alectryon), and so would HOL Light.

Updates
  • Niklas Bülow added support for Lean4 to Alectryon as part of his Bachelor's thesis at KIT.

Context

A lot of programming languages and software projects are converging towards Sphinx as their documentation system. Sphinx is built on top of Docutils, an extensible document-processing framework originally written for Python's documentation.

Problem

Nominally, Docutils supports arbitrary frontends. In practice, one of them dominates by far: reStructuredText (reST). Unfortunately, while reST is extensible at the block level, it does not support nesting for inline constructs. Other frontends like MyST (for Markdown) suffer from various related problems.

Proposal

Design and implement a conservative extension of reST (ideally with minimal modifications) that solves its long-standing issues. Can its parser be fixed to support nesting in a simple way? (See the old notes about this on the reST website) Is there a reasonable alternative to significant indentation that avoids having to write pages and pages of text indented 9 spaces to the right?

Problem

We are developing ways to present proofs and documents in a convenient and accessible way, but these documents are static: readers cannot readily attach comments to relevant parts and discuss inline. (Contrast this with in-context discussions on e.g. Gitlab or SourceHut.)

Proposal

Extend reStructuredText to support inline annotations and discussions; ideally these annotations should be robust to changes in the document (perhaps comments visible by everyone could be expressed as patches to the original document, so that there remains a single source of truth).

Context

Umut Acar and Anil Ada at CMU have a CMS (https://www.diderot.one/) that does this sort of stuff, though in a custom format. If Diderot's source were available this would be a valid option.

Problem

latexdiff is a script that compares two LaTeX source files and produces a third based on their diff (with some smart chunking decisions), such that when compiled, the resulting document shows additions and deletions with visual markup like color and strikethrough. latexdiff naturally does not fully parse, macro-expand, and evaluate its input, but relies on pattern matching and heuristics to catch the most common cases. This poses a problem when using custom convenience macros: latexdiff diffs the invocations of those macros and not the LaTeX commands they expand into (example).

Proposal

Patch latexdiff to expand simple custom macros before running its diff algorithm. Gradually expand the scope of handled macros based on issues surfacing in practical use. (Note that latexdiffcite already does this for BibTeX-based citations.)

Offensive security

I keep a private list of attack ideas (proposals for offensive security research); I'm happy to chat about it if we meet in person.

Events

Our lab co-organized the following events, under the proofs.swiss umbrella: