Performance Evaluation

| query                                          | timing (s) |
| ---------------------------------------------- | ---------- |
| Call(Name("len"))                              | 0.025985   |
| BinOp(op=Add() \| Sub())                       | 0.030508   |
| Try(handlers=LEN(min=3, max=5))                | 0.033486   |
| BinOp(left=Constant(), right=Constant())       | 0.146516   |
| FunctionDef(f"run_%", returns=not None)        | 0.0216     |
| ClassDef(body=[Assign(), *..., FunctionDef()]) | 0.28737    |

Analysis

There are two major steps that account for nearly 95% of the whole query operation. The first, and the most obvious one, is actually running the query in the database. There are a few things Reiz can do to optimize this step, such as generating the best possible query while still processing the ReizQL match in a single linear pass (which is needed to support constructs like reference variables). The code generator (reiz.reizql.compiler) went through a couple of major refactors for performance reasons (e.g. #12). There is also a simple, naive AST optimization pass over the IR (EdgeQL) itself.
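To make the last point concrete, here is a minimal sketch of what a naive IR-level optimization pass could look like. This is not Reiz's actual IR or compiler code; the classes and function below are hypothetical stand-ins that only illustrate the general idea of simplifying the generated query (here, flattening nested conjunctions) before emitting the final EdgeQL.

```python
# Hypothetical toy IR, not reiz.reizql.compiler's real data model.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Filter:
    expression: str  # e.g. ".op = 'Add'"


@dataclass
class And:
    operands: List[Union["And", Filter]]


def flatten_conjunctions(node: Union[And, Filter]) -> Union[And, Filter]:
    """Collapse And(And(a, b), c) into And(a, b, c) before code emission."""
    if isinstance(node, Filter):
        return node

    flattened: List[Union[And, Filter]] = []
    for operand in node.operands:
        operand = flatten_conjunctions(operand)
        if isinstance(operand, And):
            # Merge nested conjunctions into the parent one.
            flattened.extend(operand.operands)
        else:
            flattened.append(operand)
    return And(flattened)
```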

The second step is retrieving the code snippets from disk. We already store a lot of metadata (such as start/end positions and the GitHub project), but the actual source is still kept on disk. So after the query returns the file names, we simply read those files and extract the related segments, as sketched below. This area is open to further optimization (we could statically determine the byte range and fetch only that, or parallelize the reads for multiple matches [the default result set comes with 10 matches], …), though these won't have nearly the same impact as making the database query itself faster.
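The sketch below illustrates this post-query step under the assumptions described above: each match carries a file name plus start/end line metadata, the file is read, and the matched lines are sliced out. The `Match`, `read_segment`, and `fetch_segments` names are hypothetical, not Reiz's actual API; the thread pool shows one way the per-match reads could be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, NamedTuple


class Match(NamedTuple):
    filename: Path
    lineno: int      # 1-based start line from the stored metadata
    end_lineno: int  # inclusive end line


def read_segment(match: Match) -> str:
    """Read the whole file and slice out the matched lines."""
    lines = match.filename.read_text().splitlines()
    return "\n".join(lines[match.lineno - 1 : match.end_lineno])


def fetch_segments(matches: List[Match]) -> List[str]:
    """Fetch segments for all matches concurrently (purely I/O-bound work)."""
    with ThreadPoolExecutor() as executor:
        return list(executor.map(read_segment, matches))
```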

Of course, alongside these, there are plenty of ways to tune PostgreSQL itself for different workloads, though that falls outside the scope of the Reiz project.

Setup

Machine:

- provider: DigitalOcean
- service type: droplet (basic plan)
- cpu: 2 vCPU (shared)
- ram: 2 GB
- disk: regular SSD (not NVMe)

IndexDB:

- total files: 53k
- total AST nodes: 17,521,894

The benchmark script is available in the source checkout (scripts/benchmark_doc.py).
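For reference, the timings in the table above can be reproduced with a harness along these lines. This is only a sketch: `run_query` is a hypothetical stand-in for whatever executes a ReizQL query against the index; see scripts/benchmark_doc.py for the actual script.

```python
import time


def time_query(run_query, query: str, repeat: int = 5) -> float:
    """Return the best wall-clock time (in seconds) over `repeat` runs."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_query(query)  # hypothetical query executor
        timings.append(time.perf_counter() - start)
    return min(timings)
```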