Bin Assignments

The bin_assignments/ directory contains 100 files with 3-digit names (the name of a bin). Each file lists the corpus ids assigned to that bin.
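As an illustration, the bins can be loaded into a corpus-id lookup with a short Python function. This is a sketch that assumes one corpus id per line in each bin file (an assumption; check the actual file format):

```python
import os

def load_bins(bin_dir):
    """Map each corpus id to the 3-digit bin it was assigned to.

    Assumes one corpus id per line in each bin file (unverified).
    """
    corpus_to_bin = {}
    for name in sorted(os.listdir(bin_dir)):
        if len(name) == 3 and name.isdigit():
            with open(os.path.join(bin_dir, name)) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        corpus_to_bin[int(line)] = name
    return corpus_to_bin
```

Inverting the mapping (bin → list of corpus ids) is a one-line change if that direction is more convenient.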

Precomputed embeddings

There are several large embeddings that you can download here. The embeddings/ directory contains the following subdirectories:
Subdirectory   Papers (N)   Dimensions (K)   Size of embedding.f
ProNE-s        112M         280              125GB
Specter2       119M         768              365GB
SciNCL          90M         768              279GB
GNN             99M         200               79GB
This README file explains what you can do with those subdirectories.

There are also some embeddings for a few ProNE-s models trained on subgraphs:
Subdirectory       Papers (N)   Dimensions (K)   Size of embedding.f
ProNE-s (bin 10)    8M          280               9GB
ProNE-s (bin 20)   19M          280              21GB
ProNE-s (bin 30)   29M          280              32GB
ProNE-s (bin 40)   39M          280              44GB
ProNE-s (bin 50)   49M          280              55GB
Each of these directories is assumed to contain the following files:

  1. embedding.f: a sequence of N by K floats, where N is the number of nodes (papers) in the embedding, and K is the number of hidden dimensions
  2. record_size: defines a few configuration variables such as K (the number of hidden dimensions) and B (the number of random bytes used in approximate nearest neighbors)
  3. record_size.sh: similar to above
  4. map.old_to_new.i: mappings between corpus ids (old) and offsets into embedding.f (new)
  5. map.new_to_old.i: inverse of above
  6. indexing files for approximate nearest neighbors
    1. old version
      1. idx.*.i: a permutation of N; papers that are near one another in the permutation should have large cosines.
      2. idx.*.i.inv: inverse of above
    2. new version
      1. landmarks.i
      2. postings.i
      3. postings.idx.i
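Given the layout above, a single vector can be read without loading the whole file by memory-mapping embedding.f. This is a minimal sketch, assuming K is known (record_size defines the real value) and that the floats are 32-bit (also an assumption); the code in src/ is the authoritative version:

```python
import numpy as np

def load_vector(dir_path, row, K=280, dtype=np.float32):
    """Memory-map embedding.f (a sequence of N x K floats) and return one row.

    K and the float width are assumptions; record_size in each
    embedding directory defines the actual values.  `row` is an offset
    into embedding.f, i.e. the "new" id from map.old_to_new.i.
    """
    emb = np.memmap(f"{dir_path}/embedding.f", dtype=dtype, mode="r")
    emb = emb.reshape(-1, K)          # N rows of K floats
    return np.asarray(emb[row])
```

Because np.memmap only pages in the bytes actually touched, this stays cheap even for the 365GB Specter2 file.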
See here for documentation on an API that provides convenient access to pieces of these embeddings.

Consider this example:

  1. Find the id for a paper from (part of) its title with this
  2. Use that id to recommend papers with:
    1. a recommendation API from Semantic Scholar (S2)
    2. ProNE
The links above show that the top recommendation for 316030 is 47066000 by the S2 API, and 6496359 by ProNE-s.

The code below shows how to get cosine scores for these pairs from the embeddings in this directory. The code in the src directory illustrates how to memory-map these embeddings in Python; it should be easy to modify that code to extract vectors.

Note: when a vector is missing, the cosine is -1.

for dir in embeddings/*
do
    echo $dir
    echo '316030    47066000' | python src/simple_pairs_to_cos.py --dir $dir
done
for dir in embeddings/*
do
    echo $dir
    echo '316030    6496359' | python src/simple_pairs_to_cos.py --dir $dir
done
The following is like above but for several ProNE-s models.
for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
do
    echo $dir
    echo '316030    47066000' | python src/simple_pairs_to_cos.py --dir $dir
done
for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
do
    echo $dir
    echo '316030    6496359' | python src/simple_pairs_to_cos.py --dir $dir
done
The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
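For reference, the per-pair computation can be sketched as below. This is an illustration, not the actual src/simple_pairs_to_cos.py: look up each corpus id's vector and return the cosine, or -1 when either vector is missing, as noted above. In the real code the lookup goes through map.old_to_new.i and embedding.f; here a plain dict stands in for that path:

```python
import numpy as np

def pair_cosine(vectors, id1, id2):
    """Cosine between two papers' embedding vectors.

    `vectors` maps corpus id -> 1-D numpy array (a stand-in for the
    map.old_to_new.i + embedding.f lookup).  Returns -1.0 when either
    vector is missing, matching the convention in this README.
    """
    v1, v2 = vectors.get(id1), vectors.get(id2)
    if v1 is None or v2 is None:
        return -1.0
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```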

The following implements the approximate nearest neighbor search. The Python program, src/near.py, inputs a corpus id and outputs topN corpus ids. The code assumes the --dir argument contains several indexing files named postings.* and landmarks.*. These indexing files are available for ProNE-s and Specter2, but not for SciNCL and GNN.

echo 316030 | src/near.py --dir embeddings/ProNE-s --topN 5
echo 316030 | src/near.py --dir embeddings/Specter2 --topN 5
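The landmarks.i/postings.i files suggest a landmark-based index. As a rough sketch of how such a search might work (an assumption about the design; src/near.py is the authoritative implementation): route the query to its nearest landmark, then score only the papers on that landmark's postings list instead of all N rows:

```python
import numpy as np

def near(query, landmarks, postings, emb, topN=5):
    """Approximate nearest neighbors via landmarks (illustrative only).

    landmarks: (L, K) array of row-normalized landmark vectors
    postings:  list of L arrays; postings[l] = rows assigned to landmark l
    emb:       (N, K) array of row-normalized embedding vectors
    Returns up to topN (row, cosine) pairs, best first.
    """
    # Route the query to the landmark with the largest cosine.
    q = query / np.linalg.norm(query)
    best = int(np.argmax(landmarks @ q))
    # Score only that landmark's candidates, then take the top N.
    cand = postings[best]
    scores = emb[cand] @ q
    order = np.argsort(-scores)[:topN]
    return [(int(cand[i]), float(scores[i])) for i in order]
```

A production index would probe several landmarks, not just one, to trade recall against speed; this sketch keeps only the single-probe core of the idea.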