This article is an English translation of my original blog post, written mainly to practice English writing.

The DiskANN series includes DiskANN, FreshDiskANN, and Filtered-DiskANN, which address several real-world needs: handling billions of vectors on a single machine, real-time updates, and query filtering. I am mainly interested in DiskANN and Filtered-DiskANN, so I will briefly introduce these two papers.

# DiskANN

Compared to HNSW, DiskANN provides a solution for data stored on disk, which the ever-growing scale of vector data demands. Obviously, a single-machine service is easier to operate than a distributed one.

Search on a graph usually uses a greedy algorithm: starting from a candidate list initialized with a start node s, it repeatedly takes the nearest unvisited node in the list, adds that node's neighbors to the list, and continues until no unvisited nodes remain. Finally, it returns the top-k **approximately** nearest nodes.
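A minimal Python sketch of this greedy search (the function name, the `graph` adjacency dict, and the `vectors` coordinate dict are my own illustrative assumptions, not the paper's notation):

```python
import math

def greedy_search(graph, vectors, start, query, k, L):
    """Greedy graph search: repeatedly expand the nearest unvisited
    candidate, keeping at most L candidates, until none are left.
    Returns the k approximate nearest nodes to `query`."""
    dist = lambda v: math.dist(vectors[v], query)
    candidates, visited = {start}, set()
    while candidates - visited:
        p = min(candidates - visited, key=dist)   # nearest unvisited node
        visited.add(p)
        candidates |= set(graph[p])               # add p's neighbors
        candidates = set(sorted(candidates, key=dist)[:L])  # keep L best
    return sorted(candidates, key=dist)[:k]
```

Note that the answer is only approximate: the search can converge to a local minimum if the graph lacks the right edges, which is exactly the problem the construction algorithms below address.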

But this approach has some issues:

- The number of neighbors per node must be bounded, or the graph becomes too dense to search efficiently.
- Nodes near the query but far from the other nearby nodes may never be reached, so they will not be returned.

The issues above point to how the graph must be constructed for greedy search to work. The "Sparse Neighborhood Graph" (SNG) describes an ideal state in which greedy search from any starting node converges to the nearest node:

- For each node p in the visited set P that GreedySearch returned, initialize a set S $\leftarrow$ P \ {p}
    - While S is not empty:
        - find the nearest $p^* \in S$ to p and add $p^*$ as an out-neighbor of p
        - delete every node $p'$ from S that satisfies $||p^* - p'|| < ||p - p'||$
    - end while
- end for
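The loop above can be sketched in Python (a toy sketch; `sng_prune` and the `vectors` coordinate dict are my own names, and points are plain tuples):

```python
import math

def sng_prune(p, P, vectors):
    """SNG neighbor selection for node p: repeatedly take the nearest
    remaining candidate p* as an out-neighbor, then drop every
    candidate p' with ||p* - p'|| < ||p - p'||."""
    S = set(P) - {p}
    out = []
    while S:
        p_star = min(S, key=lambda q: math.dist(vectors[q], vectors[p]))
        out.append(p_star)
        # keep only candidates that are not closer to p* than to p
        S = {q for q in S
             if math.dist(vectors[p_star], vectors[q]) >= math.dist(vectors[p], vectors[q])}
    return out
```

For points on a line, this keeps only the single nearest neighbor of p, which illustrates both the sparsity of the result and the slow-convergence problem discussed next.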

SNG is an excellent way to satisfy the convergence property of greedy search, but it is not the fastest structure to converge on. If all nodes lie on a straight line, greedy search must visit every node, costing tens of milliseconds (a single SSD read takes about 100 µs, and hundreds of reads are needed), while real-world workloads demand millisecond-level latency.

So, in the second phase, we need to speed up convergence to the nearest neighbor. DiskANN adds some long-range edges to the graph via the RobustPrune algorithm, which relaxes the deletion condition with a multiplier $\alpha$ ($\alpha > 1$).
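A Python sketch of RobustPrune under the same toy conventions as before (my own function signature; the paper operates on a candidate set V and writes the result into p's out-neighbor list, capped at degree R):

```python
import math

def robust_prune(p, V, alpha, R, vectors, graph):
    """RobustPrune sketch: like the SNG rule, but a candidate p' is only
    dropped when alpha * d(p*, p') <= d(p, p'); with alpha > 1 some
    longer edges survive and act as shortcuts during search."""
    V = (set(V) | set(graph[p])) - {p}
    graph[p] = []
    while V:
        p_star = min(V, key=lambda q: math.dist(vectors[q], vectors[p]))
        graph[p].append(p_star)
        if len(graph[p]) == R:        # degree bound reached
            break
        V = {q for q in V
             if alpha * math.dist(vectors[p_star], vectors[q]) > math.dist(vectors[p], vectors[q])}
```

With $\alpha = 1$ this reduces to the SNG rule; a larger $\alpha$ makes the deletion condition harder to satisfy, so more (and longer) edges are kept.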

Finally, DiskANN proposes a two-phase construction of the graph used for greedy search. In summary:

- Initialize a graph in which each node has R random neighbors.
- For each node p, apply GreedySearch(s, p, 1, L), then RobustPrune(p, V, $\alpha$, R), then add the edge (p', p) for every $p' \in N_{out}(p)$; if $|N_{out}(p')| > R$, fix the graph with RobustPrune(p', $N_{out}(p')$, $\alpha$, R).
- Apply phase 2 twice, first with $\alpha = 1$ and then with $\alpha > 1$.
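The whole two-phase build can be sketched end to end (a toy, in-memory sketch with the search and prune helpers inlined; all names, and the default $\alpha$ values, are my own assumptions):

```python
import math, random

def build_vamana(vectors, R, L, alphas=(1.0, 1.2)):
    """Two-phase Vamana construction sketch.  Start from a random
    R-regular graph; then, once per alpha, re-route every node through a
    greedy search from the medoid, prune its neighbors RobustPrune-style,
    and add (pruned) back-edges."""
    nodes = list(vectors)
    dist = lambda a, b: math.dist(vectors[a], vectors[b])
    graph = {p: random.sample([q for q in nodes if q != p],
                              min(R, len(nodes) - 1)) for p in nodes}
    s = min(nodes, key=lambda p: sum(dist(p, q) for q in nodes))  # medoid entry

    def greedy_visited(p):
        # nodes visited while greedily searching for p, starting from s
        cand, visited = {s}, set()
        while cand - visited:
            c = min(cand - visited, key=lambda q: dist(q, p))
            visited.add(c)
            cand = set(sorted(cand | set(graph[c]), key=lambda q: dist(q, p))[:L])
        return visited

    def robust_prune(p, V, alpha):
        V = (V | set(graph[p])) - {p}
        graph[p] = []
        while V and len(graph[p]) < R:
            p_star = min(V, key=lambda q: dist(q, p))
            graph[p].append(p_star)
            V = {q for q in V if alpha * dist(p_star, q) > dist(p, q)}

    for alpha in alphas:                  # phase 2 runs twice: alpha=1, then alpha>1
        for p in nodes:
            robust_prune(p, greedy_visited(p), alpha)
            for q in list(graph[p]):      # add back-edges (q, p)
                if p not in graph[q]:
                    graph[q].append(p)
                    if len(graph[q]) > R:
                        robust_prune(q, set(graph[q]), alpha)
    return graph, s
```

The back-edge step is what keeps the graph navigable from the single entry point s; pruning overfull neighbors preserves the degree bound R.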

This algorithm is called "Vamana", and it is essentially an "$\alpha$-NSG" built from a random initial graph.

From the graphs in the second row of the figure, we can see that with the larger $\alpha$ the algorithm removes fewer connections.

Here are some other techniques in the DiskANN search path, which may sacrifice some accuracy to speed up the process.

- Divide the dataset into k clusters with k-means and assign each node to its 2 nearest clusters. Build a Vamana graph for each cluster, then merge them into the complete graph by taking the union of out-edges.
- The original vectors are compressed by product quantization (PQ), and the quantized codes are stored in memory, while the full-precision vectors are kept on the SSD for exact distance computation and re-ranking. The quantized codes drive the greedy search, since they reduce both memory usage and distance-computation cost.
- Each node on the SSD is stored as a fixed-size structure whose neighbor list is padded to R entries, enabling us to locate its position directly from its ID.
- Store the entry node and its 3-4 hops of neighbors in memory to reduce SSD accesses.
- A beam-search strategy is added to Algorithm 1: the neighbor lists of multiple candidates are fetched in parallel to reduce SSD round trips.
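The fixed-size node layout deserves a concrete illustration. A hypothetical sketch (the layout details here are my own assumption, not the paper's exact on-disk format): each record holds d float32 coordinates followed by exactly R int32 neighbor ids, padded with -1, so a node's byte offset is a pure function of its id.

```python
import struct

def node_offset(node_id, d, R):
    """Byte offset of a node record: every record has the same size,
    so no index structure is needed to seek to it."""
    return node_id * (4 * d + 4 * R)   # 4 bytes per float32 / int32

def pack_node(vector, neighbors, R):
    """Serialize one node, padding the neighbor list to exactly R ids."""
    padded = list(neighbors) + [-1] * (R - len(neighbors))
    return struct.pack(f"{len(vector)}f{R}i", *vector, *padded)

def unpack_node(buf, d, R):
    """Deserialize one node, dropping the -1 padding."""
    values = struct.unpack(f"{d}f{R}i", buf)
    return values[:d], [n for n in values[d:] if n != -1]
```

The padding wastes some space on low-degree nodes, but in exchange a single seek-and-read per node replaces any lookup table.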

# Filtered-DiskANN

Filtered-DiskANN adds support for the "OR" filter function to DiskANN.

The only difference between FilteredGreedySearch and GreedySearch is that FilteredGreedySearch only adds nodes satisfying the filter constraints to the candidate list.

Similarly, edge pruning also takes the filter constraints into account: the connection (a, c) can be removed only if there exists a node b with $F_c \cap F_a \subset F_b$.
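The filtered search can be sketched by adding one predicate to the plain greedy search (a toy sketch; the function name and the `labels` dict mapping each node to its set of tags are my own assumptions):

```python
import math

def filtered_greedy_search(graph, vectors, labels, starts, query, q_filter, k, L):
    """FilteredGreedySearch sketch: identical to plain greedy search
    except that only nodes whose label set intersects the query filter
    may enter the candidate list (an "OR" over the filter's labels)."""
    ok = lambda v: labels[v] & q_filter           # filter predicate
    dist = lambda v: math.dist(vectors[v], query)
    candidates, visited = {p for p in starts if ok(p)}, set()
    while candidates - visited:
        p = min(candidates - visited, key=dist)
        visited.add(p)
        candidates |= {q for q in graph[p] if ok(q)}   # filtered expansion
        candidates = set(sorted(candidates, key=dist)[:L])
    return sorted(candidates, key=dist)[:k]
```

Because non-matching nodes never enter the candidate list, the pruning rule above must guarantee that matching nodes stay reachable through matching intermediates; that is exactly why edge removal is conditioned on $F_c \cap F_a \subset F_b$.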

# Questions

- “How can we use the k-means algorithm to balance class sizes?”
- Initializing the graph with r random neighbors can cause significant performance fluctuations between builds.
- The $\alpha$ parameter adds both long and short edges, so the actual performance is hard to control.