Index of /files

Name                  Last modified      Size

Parent Directory                         -
ProNE/                2024-05-02 00:40   -
README.html           2024-05-08 21:30   1.8K
bigrams               2023-08-22 04:24   142G
proposed/             2024-03-26 04:12   -
sandeep/              2023-09-07 00:09   -
scincl/               2023-08-07 18:34   -
specter/              2023-08-17 21:04   -
specter2/             2023-08-11 22:09   -
specter2_from_doug/   2024-05-01 22:01   -
src/                  2024-02-24 22:29   -

Precomputed embeddings

This file explains what you can do with the directories of precomputed embeddings. Examples of such directories include:
  1. ProNE-s
  2. Specter
  3. SciNCL
  4. GNN
Other directories also hold precomputed embeddings. Each of these directories is assumed to contain the following files:
  1. embedding.f: a sequence of N by K floats, where N is the number of nodes (papers) in the embedding, and K is the number of hidden dimensions
  2. record_size: defines a few configuration variables such as K (number of hidden dimensions) and B (number of random bytes in approximate nearest neighbors)
  3. record_size.sh: similar to above
  4. map.old_to_new.i: mappings between corpus ids (old) and offsets into embedding.f (new)
  5. map.new_to_old.i: inverse of above
  6. indexing files for approximate nearest neighbors
    1. old version
      1. idx.*.i: a permutation of the N papers. Papers that are near one another in the permutation should have large cosines.
      2. idx.*.i.inv: inverse of above
    2. new version
      1. landmarks.i
      2. postings.i
      3. postings.idx.i
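The layout of embedding.f and the map files described above can be read with a few lines of NumPy. This is only a minimal sketch under assumptions that this file does not confirm: that embedding.f holds 32-bit floats, that the map files hold 32-bit integers, and that the caller supplies K (for example, after looking it up in record_size). The function names load_embedding and load_map are hypothetical and are not part of src/.

```python
import numpy as np


def load_embedding(path, K, dtype=np.float32):
    """Memory-map a flat file of N*K floats as an (N, K) array.

    Assumption: values are stored as float32 with no header.
    """
    flat = np.memmap(path, dtype=dtype, mode="r")
    if flat.size % K != 0:
        raise ValueError(
            f"{path}: {flat.size} values is not a multiple of K={K}"
        )
    # N is inferred from the file size: N = number of values / K.
    return flat.reshape(-1, K)


def load_map(path, dtype=np.int32):
    """Memory-map a map file (e.g. map.old_to_new.i) as a 1-D integer array.

    Assumption: values are stored as int32 with no header.
    """
    return np.memmap(path, dtype=dtype, mode="r")
```

With these helpers, load_embedding(dir + "/embedding.f", K) would return an array whose row count is N. Memory mapping avoids reading the whole file up front, which matters when N is large.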
Example of usage:
for dir in proposed sandeep scincl
do
    echo '3    154392' | python src/simple_pairs_to_cos.py --dir "$dir"
done
The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
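For readers who want to see the arithmetic, the cosine for one pair of corpus ids could be computed roughly as follows. This is a sketch, not the code in src/simple_pairs_to_cos.py. It assumes that the old-to-new map is a dense array indexed by corpus id whose entries are row numbers in the embedding, and it only checks bounds, since how missing ids are marked is not stated here. The name cosine_for_pair is hypothetical.

```python
import numpy as np


def cosine_for_pair(emb, old_to_new, id1, id2):
    """Return the cosine of two papers given by corpus id, or None.

    emb:        (N, K) array of embedding vectors
    old_to_new: 1-D integer array; old_to_new[corpus_id] is assumed to be
                the row of that paper in emb
    """
    rows = []
    for cid in (id1, id2):
        # Corpus ids outside the map cannot be looked up.
        if cid < 0 or cid >= len(old_to_new):
            return None
        row = int(old_to_new[cid])
        # Rows outside the embedding are treated as missing.
        if row < 0 or row >= len(emb):
            return None
        rows.append(row)

    # Use float64 for the dot product and norms to limit rounding error.
    v1 = np.asarray(emb[rows[0]], dtype=np.float64)
    v2 = np.asarray(emb[rows[1]], dtype=np.float64)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return None  # cosine is undefined for a zero vector
    return float(np.dot(v1, v2) / denom)
```

Returning None rather than raising keeps a batch of pairs going when one id is absent from a given embedding, which is likely when the same pairs are run against several directories.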