Precomputed embeddings
This file explains what you can do with directories with precomputed embeddings.
Some examples of precomputed embedding directories are:
- ProNE-s
- Specter
- SciNCL
- GNN
Other directories with precomputed embeddings are also available.
It is assumed that each of these directories contains the following files:
- embedding.f: a sequence of N by K floats, where N is the number of nodes (papers) in the embedding, and K is the number of hidden dimensions
- record_size: defines a few configuration variables such as K (number of hidden dimensions) and B (number of random bytes in approximate nearest neighbors)
- record_size.sh: similar to above
- map.old_to_new.i: mappings between corpus ids (old) and offsets into embedding.f (new)
- map.new_to_old.i: inverse of above
- indexing files for approximate nearest neighbors
  - old version
    - idx.*.i: permutation of the N nodes, used for approximate nearest neighbor search. Papers that are near one another in the permutation should have large cosines.
    - idx.*.i.inv: inverse of above
  - new version
    - landmarks.i
    - postings.i
    - postings.idx.i
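As a rough illustration of how the files above fit together, the sketch below looks up the row of embedding.f for a given corpus id. It rests on assumptions this README does not confirm: that embedding.f holds raw float32 values in row-major order, that the map files hold raw int32 values, and that map.old_to_new.i is indexed directly by corpus id. K is passed in by hand because the format of record_size is not described here. The function names are hypothetical, and the demo writes a tiny synthetic directory rather than reading a real one.

```python
import os
import tempfile

import numpy as np


def load_embedding(directory, K):
    """Memory-map embedding.f as an (N, K) array. ASSUMPTION: raw float32."""
    path = os.path.join(directory, "embedding.f")
    flat = np.memmap(path, dtype=np.float32, mode="r")
    return flat.reshape(-1, K)  # N is inferred from the file size


def load_map(directory, name):
    """Memory-map a map file. ASSUMPTION: raw int32."""
    return np.memmap(os.path.join(directory, name), dtype=np.int32, mode="r")


def vector_for_corpus_id(emb, old_to_new, corpus_id):
    """Return the embedding row for a corpus id (old) via its offset (new)."""
    return emb[old_to_new[corpus_id]]


def demo():
    """Build a synthetic 3-node directory and look up corpus id 5."""
    with tempfile.TemporaryDirectory() as d:
        K = 4
        new_to_old = np.array([7, 2, 5], dtype=np.int32)  # row -> corpus id
        # -1 marks unused corpus ids in this synthetic example only.
        old_to_new = np.full(8, -1, dtype=np.int32)
        old_to_new[new_to_old] = np.arange(3, dtype=np.int32)
        emb = np.arange(12, dtype=np.float32).reshape(3, K)

        emb.tofile(os.path.join(d, "embedding.f"))
        old_to_new.tofile(os.path.join(d, "map.old_to_new.i"))
        new_to_old.tofile(os.path.join(d, "map.new_to_old.i"))

        E = load_embedding(d, K)
        o2n = load_map(d, "map.old_to_new.i")
        # Copy out of the memory map before the directory is removed.
        return np.array(vector_for_corpus_id(E, o2n, 5))
```

Memory-mapping avoids reading the whole N by K matrix into memory when only a few rows are needed; if the real files use a different element type, only the dtype arguments would need to change.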
Example of usage:
for dir in proposed sandeep scincl
do
    echo '3 154392' | python src/simple_pairs_to_cos.py --dir "$dir"
done
The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
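The computation described above can be sketched as follows. This is not the code of src/simple_pairs_to_cos.py, which is not reproduced in this README. It is a minimal illustration that assumes the embedding is already available as an (N, K) NumPy array and the old-to-new map as an integer array indexed by corpus id. The function names are hypothetical and the data are synthetic.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two vectors; NaN if either has zero length."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else float("nan")


def pairs_to_cos(lines, emb, old_to_new):
    """For each input line 'id1 id2', yield (id1, id2, cosine)."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        id1, id2 = int(fields[0]), int(fields[1])
        # Map corpus ids (old) to row offsets (new), then compare the rows.
        yield id1, id2, cosine(emb[old_to_new[id1]], emb[old_to_new[id2]])


# Synthetic data: three 2-dimensional vectors, corpus id i stored in row i.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
old_to_new = np.array([0, 1, 2], dtype=np.int32)
results = list(pairs_to_cos(["0 1", "0 2"], emb, old_to_new))
```

Running the same pairs against several embedding directories, as in the loop above, amounts to repeating this computation with a different emb and old_to_new for each directory.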