The first architecture is reminiscent of the map-reduce paradigm. The idea is to split analysis into a relatively simple indexing phase and a separate full analysis phase.
The core constraint of indexing is that it runs on a per-file basis. The indexer takes the text of a single file, parses it, and spits out some data about the file. The indexer can’t touch other files.
Full analysis can read other files, and it leverages information from the index to save work.
This all sounds way too abstract, so let’s look at a specific example — Java. In Java, each file starts with a package declaration. The indexer concatenates the name of the package with a class name to get a fully-qualified name (FQN). It also collects the set of methods declared in the class, the list of superclasses and interfaces, etc.
Per-file data is merged into an index which maps FQNs to classes. Note that constructing this mapping is an embarrassingly parallel task — all files are parsed independently. Moreover, this map is cheap to update. When a file change arrives, this file’s contribution from the index is removed, the text of the file is changed and the indexer runs on the new text and adds the new contributions. The amount of work to do is proportional to the number of changed files, and is independent from the total number of files.
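As a minimal sketch, the index with its remove-then-add update might look like this (the class and method names here are hypothetical, not taken from any real IDE):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an FQN index.
class FqnIndex {
    // FQN -> set of files that declare a class with that FQN.
    private final Map<String, Set<String>> fqnToFiles = new HashMap<>();
    // file -> FQNs it contributed, remembered so we can undo them later.
    private final Map<String, Set<String>> fileToFqns = new HashMap<>();

    // Called when a file changes: remove the file's old contribution,
    // then add the contribution computed from the new text.
    void update(String file, Set<String> newFqns) {
        for (String fqn : fileToFqns.getOrDefault(file, Set.of())) {
            Set<String> files = fqnToFiles.get(fqn);
            files.remove(file);
            if (files.isEmpty()) fqnToFiles.remove(fqn);
        }
        for (String fqn : newFqns) {
            fqnToFiles.computeIfAbsent(fqn, k -> new HashSet<>()).add(file);
        }
        fileToFqns.put(file, newFqns);
    }

    Set<String> lookup(String fqn) {
        return fqnToFiles.getOrDefault(fqn, Set.of());
    }
}
```

The key property is that `update` touches only the changed file's old and new contributions, never the whole map.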
Let’s see how the FQN index can be used to quickly provide completion.
```java
// File ./mypackage/Foo.java
package mypackage;
import java.util.*;

public class Foo {
    public static Bar f() {
        return new Bar();
    }
}

// File ./mypackage/Bar.java
package mypackage;

public class Bar {
    public void g() {}
}

// File ./Main.java
import mypackage.Foo;

public class Main {
    public static void main(String[] args) {
        Foo.f().
    }
}
```
The user has just typed Foo.f()., and we need to figure out that the type of the receiver expression is Bar, and suggest g as a completion.
First, as the file Main.java is modified, we run the indexer on this single file.
Nothing has changed (the file still contains the class Main with a static main method), so we don’t need to update the FQN index.
Next, we need to resolve the name Foo.
We parse the file, notice an import and look up mypackage.Foo in the FQN index.
In the index, we also find that Foo has a static method f, so we resolve the call as well.
The index also stores the return type of f, but, and this is crucial, it stores it as a string "Bar", and not as a direct reference to the class Bar.
The reason for that is import java.util.* in Foo.java.
Bar can refer either to java.util.Bar or to mypackage.Bar.
The indexer doesn’t know which one, because it can look only at the text of Foo.java.
In other words, while the index does store the return types of methods, it stores them in an unresolved form.
The next step is to resolve the identifier Bar in the context of Foo.java.
This uses the FQN index, and lands in the class mypackage.Bar.
There the desired method g is found.
Altogether, only three files were touched during completion.
The FQN index allowed us to completely ignore all the other files in the project.
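The walkthrough above can be sketched in code. In this hypothetical simplification, the index maps an FQN to the class's method stubs, and a method stub is just the method name mapped to its unresolved return type string:

```java
import java.util.Map;

// Hypothetical sketch of completion over stubs with unresolved types.
class Completion {
    static String receiverType(Map<String, Map<String, String>> fqnIndex) {
        // Step 1: the import `mypackage.Foo` resolves `Foo` to its FQN.
        Map<String, String> fooStub = fqnIndex.get("mypackage.Foo");
        // Step 2: the stub stores f's return type as the *string* "Bar".
        String unresolved = fooStub.get("f");
        // Step 3: resolve "Bar" in the context of Foo.java. A class in the
        // same package takes precedence over the wildcard import of
        // java.util, so try mypackage first.
        String samePackage = "mypackage." + unresolved;
        return fqnIndex.containsKey(samePackage)
            ? samePackage
            : "java.util." + unresolved;
    }
}
```

Note that the last step consults only the index, not the text of Bar.java, which is why so few files need to be touched.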
One problem with the approach described thus far is that resolving types from the index requires a non-trivial amount of work.
This work might be duplicated if, for example, Foo.f is called several times.
The fix is to add a cache.
Name resolution results are memoized, so that the cost is paid only once.
The cache is blown away completely on any change — with an index, reconstructing the cache is not that costly.
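A minimal sketch of such a memoizing cache (the names are hypothetical, and real resolution is of course more involved):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: memoize name-resolution results, and drop the
// whole cache on any change.
class ResolveCache {
    private final Map<String, String> memo = new HashMap<>();

    // Resolve an unresolved type name (e.g. "Bar" in the context of
    // Foo.java), paying the resolution cost only once per name.
    String resolve(String name, Function<String, String> resolver) {
        return memo.computeIfAbsent(name, resolver);
    }

    // Any edit anywhere invalidates everything; the index makes
    // re-filling the cache cheap.
    void invalidate() {
        memo.clear();
    }
}
```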
To sum up, the first approach works like this:
- Each file is indexed, independently and in parallel, producing a "stub" — a set of visible top-level declarations, with unresolved types.
- All stubs are merged into a single index data structure.
- Name resolution and type inference work primarily off the stubs.
- Name resolution is lazy (we only resolve a type from the stub when we need it) and memoized (each type is resolved only once).
- The caches are completely invalidated on every change.
- The index is updated incrementally:
  - if the edit doesn’t change the file’s stub, no change to the index is required;
  - otherwise, old keys are removed and new keys are added.
Note an interesting interplay between "dumb" indexes which can be updated incrementally, and "smart" caches, which are re-computed from scratch.
This approach combines simplicity and stellar performance.
The bulk of the work is the indexing phase, and you can parallelize and even distribute it across several machines.
Two examples of this architecture are IntelliJ and Sorbet.
The main drawback of this approach is that it works only when it works — not every language has a well-defined FQN concept.
I think overall it’s a good idea to design name resolution and module systems (mostly boring parts of a language) such that they work well with the map-reduce paradigm.
- Require package declarations or infer them from the file-system layout.
- Forbid meta-programming facilities which add new top-level declarations, or restrict them in such a way that they can be used by the indexer.
  For example, preprocessor-like compiler plugins that access a single file at a time might be fine.
- Make sure that each source element corresponds to a single semantic element.
  For example, if the language supports conditional compilation, make sure that it works during name resolution (like Kotlin’s expect/actual) and not during parsing (like conditional compilation in most other languages).
  Otherwise, you’d have to index the same file with different conditional compilation settings, and that is messy.
- Make sure that FQNs are enough for most of the name resolution.
The last point is worth elaborating. Let’s look at the following Rust example:
```rust
// File: ./foo.rs
trait T {
    fn f(&self) {}
}

// File: ./bar.rs
struct S;

// File: ./somewhere/else.rs
impl T for S {}

// File: ./main.rs
use foo::T;
use bar::S;

fn main() {
    let s = S;
    s.f();
}
```
Here, we can easily find the S struct and the T trait (as they are imported directly).
However, to make sure that s.f indeed refers to f from T, we also need to find the corresponding impl, and that can be roughly anywhere!