Index of /files

Name	Last modified	Size

Parent Directory		-
ProNE/	2024-05-02 00:40	-
README.html	2024-05-08 21:30	1.8K
bigrams	2023-08-22 04:24	142G
proposed/	2024-03-26 04:12	-
sandeep/	2023-09-07 00:09	-
scincl/	2023-08-07 18:34	-
specter/	2023-08-17 21:04	-
specter2/	2023-08-11 22:09	-
specter2_from_doug/	2024-05-01 22:01	-
src/	2024-02-24 22:29	-

Precomputed embeddings

This file explains what you can do with directories of precomputed embeddings. Some examples of precomputed embedding directories are:

There are also other directories with precomputed embeddings. Each of these directories is assumed to contain the following files:

embedding.f: a sequence of N by K floats, where N is the number of nodes (papers) in the embedding, and K is the number of hidden dimensions
record_size: defines a few configuration variables, such as K (number of hidden dimensions) and B (number of random bytes in approximate nearest neighbors)
record_size.sh: similar to above
map.old_to_new.i: mappings between corpus ids (old) and offsets into embedding.f (new)
map.new_to_old.i: inverse of above
indexing files for approximate nearest neighbors
1. old version
  1. idx.*.i: permutation of N, used in . Papers that are near one another in the permutation should have large cosines.
  2. idx.*.i.inv: inverse of above
2. new version
  1. landmarks.i
  2. postings.i
  3. postings.idx.i
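
The file layout above can be read directly with NumPy. The following is a minimal sketch, not the project's own loader. It assumes that embedding.f holds 32-bit floats and that the *.i files hold 32-bit integers, both in native byte order; these data types are guesses based on the file suffixes and should be checked against record_size. The function names load_embedding and load_map are hypothetical, and K must be supplied by the caller.

```python
import os
import numpy as np

def load_embedding(dirname, K, dtype=np.float32):
    """Memory-map embedding.f as an N x K matrix.

    Assumption: the file is a flat sequence of N*K values of `dtype`
    (float32 by default) with no header. N is inferred from the file size.
    """
    path = os.path.join(dirname, "embedding.f")
    row_bytes = K * np.dtype(dtype).itemsize
    nbytes = os.path.getsize(path)
    if nbytes % row_bytes != 0:
        raise ValueError("file size is not a multiple of K * itemsize")
    N = nbytes // row_bytes
    # mode="r" keeps the file read-only and avoids loading it all into memory
    return np.memmap(path, dtype=dtype, mode="r", shape=(N, K))

def load_map(dirname, name, dtype=np.int32):
    """Read a mapping file such as map.old_to_new.i as a flat integer array.

    Assumption: the file is a flat sequence of `dtype` integers (int32 by default).
    """
    return np.fromfile(os.path.join(dirname, name), dtype=dtype)
```

Memory-mapping is used here because embedding files can be large; only the rows that are actually accessed are read from disk.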

Example of usage:

for dir in proposed sandeep scincl
do
    echo '3    154392' | python src/simple_pairs_to_cos.py --dir "$dir"
done

The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
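
The computation itself can be sketched as follows. This is an illustration of the idea, not the contents of src/simple_pairs_to_cos.py. It assumes that the old-to-new mapping is an integer array that can be indexed directly by corpus id to obtain a row offset into the embedding matrix; the function name cosine_for_pair is hypothetical.

```python
import numpy as np

def cosine_for_pair(emb, old_to_new, id1, id2):
    """Cosine similarity between the embeddings of two corpus ids.

    emb:        N x K array (or memmap) of embeddings
    old_to_new: integer array; old_to_new[corpus_id] is assumed to be the
                row of emb that holds that paper's vector
    """
    # Translate corpus ids (old) into row offsets (new)
    r1, r2 = old_to_new[id1], old_to_new[id2]
    # Work in float64 to limit rounding error in the dot product and norms
    u = np.asarray(emb[r1], dtype=np.float64)
    v = np.asarray(emb[r2], dtype=np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # A zero vector has no direction, so the cosine is undefined
    return float(u @ v / denom) if denom > 0 else float("nan")
```

Running the same pair through several embedding directories, as in the loop above, then amounts to calling this function once per directory with that directory's matrix and mapping.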