Bin Assignments

There are 100 files here with 3-digit names (the name of a bin). Each file contains a list of corpus ids in that bin.

Precomputed embeddings

There are several large embeddings that you can download here. That file contains the following subdirectories:

Subdirectory	Papers (N)	Dimensions (K)	Size of embedding.f
ProNE-s	112M	280	125GB
Specter2	119M	768	365GB
SciNCL	90M	768	279GB
GNN	99M	200	79GB

This README file explains what you can do with those subdirectories.
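
As a rough sanity check on the table above, the listed sizes are close to N × K × 4 bytes. This suggests, but does not confirm, that embedding.f stores 32-bit floats. The sketch below redoes that arithmetic; the 4-byte width is an assumption, and the paper counts are the rounded values from the table, so the estimates are only approximate.

```python
# Rough size check: bytes = N * K * bytes_per_float.
# Assumption: 4-byte (float32) values; N is the rounded count from the table.
BYTES_PER_FLOAT = 4

# name -> (papers N, dimensions K, size in GB as listed in the table)
TABLE = {
    "ProNE-s": (112e6, 280, 125),
    "Specter2": (119e6, 768, 365),
    "SciNCL": (90e6, 768, 279),
    "GNN": (99e6, 200, 79),
}

def estimated_gb(n_papers, k_dims):
    """Estimated size of embedding.f in GB (10**9 bytes)."""
    return n_papers * k_dims * BYTES_PER_FLOAT / 1e9

for name, (n, k, listed_gb) in TABLE.items():
    print(f"{name}: estimated {estimated_gb(n, k):.0f} GB, listed {listed_gb} GB")
```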

There are also some embeddings for a few ProNE-s models trained on subgraphs:

Subdirectory	Papers (N)	Dimensions (K)	Size of embedding.f
ProNE-s (bin 10)	8M	280	9GB
ProNE-s (bin 20)	19M	280	21GB
ProNE-s (bin 30)	29M	280	32GB
ProNE-s (bin 40)	39M	280	44GB
ProNE-s (bin 50)	49M	280	55GB

Each of these directories is assumed to contain the following files:

embedding.f: a sequence of N by K floats, where N is the number of nodes (papers) in the embedding, and K is the number of hidden dimensions
record_size: defines a few configuration variables such as K (number of hidden dimensions) and B (number of random bytes in approximate nearest neighbors)
record_size.sh: similar to above
map.old_to_new.i: mappings between corpus ids (old) and offsets into embedding.f (new)
map.new_to_old.i: inverse of above
indexing files for approximate nearest neighbors
1. old version
  1. idx.*.i: permutation of the N papers. Papers that are near one another in the permutation should have large cosines.
  2. idx.*.i.inv: inverse of above
2. new version
  1. landmarks.i
  2. postings.i
  3. postings.idx.i
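
The file layout above can be read with a memory map, so only the rows that are touched are loaded from disk. The following is a minimal sketch, not the code in src. It assumes that embedding.f holds float32 values, that the map files hold int32 values, and that K is supplied by the caller (for example, after reading it from record_size). The function names `load_embedding` and `vector_for` are hypothetical.

```python
import os
import numpy as np

def load_embedding(directory, K, float_dtype=np.float32, int_dtype=np.int32):
    """Memory-map embedding.f and the two id maps (dtypes are assumptions)."""
    emb = np.memmap(os.path.join(directory, "embedding.f"),
                    dtype=float_dtype, mode="r").reshape(-1, K)  # N by K
    old_to_new = np.memmap(os.path.join(directory, "map.old_to_new.i"),
                           dtype=int_dtype, mode="r")
    new_to_old = np.memmap(os.path.join(directory, "map.new_to_old.i"),
                           dtype=int_dtype, mode="r")
    return emb, old_to_new, new_to_old

def vector_for(corpus_id, emb, old_to_new):
    """Return the K-dimensional vector for a corpus id, or None if unavailable."""
    if corpus_id < 0 or corpus_id >= len(old_to_new):
        return None
    row = int(old_to_new[corpus_id])  # offset (row) into embedding.f
    # Assumption: an out-of-range row means the paper has no vector.
    if row < 0 or row >= emb.shape[0]:
        return None
    return np.asarray(emb[row])
```

With the ProNE-s directory, a call might look like `emb, old_to_new, new_to_old = load_embedding("embeddings/ProNE-s", K=280)` followed by `vector_for(316030, emb, old_to_new)`.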

See here for documentation on an API that provides convenient access to pieces of these embeddings.

Consider this example:

Find the id for a paper from (part of) its title with this
Use that id to recommend papers with:
1. a recommendation API from Semantic Scholar (S2)
2. ProNE-s

The links above show that the top recommendation for 316030 is 47066000 by the S2 API, and 6496359 by ProNE-s.

The code below shows how to get cosine scores for these pairs by the embeddings in this directory. The code in the src directory is intended to illustrate how to memory map these embeddings into Python. It should be easy to modify that code to get vectors.

Note: when a vector is missing, the cosine is -1.

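# Cosine between 316030 and the top S2 recommendation, for each embedding
for dir in embeddings/*
do
    echo "$dir"
    echo '316030    47066000' | python src/simple_pairs_to_cos.py --dir "$dir"
done

# Cosine between 316030 and the top ProNE-s recommendation, for each embedding
for dir in embeddings/*
do
    echo "$dir"
    echo '316030    6496359' | python src/simple_pairs_to_cos.py --dir "$dir"
done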

The following is like above but for several ProNE-s models.

# Same pair as the first loop above, for the ProNE-s bin models and the full ProNE-s model
for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
do
    echo "$dir"
    echo '316030    47066000' | python src/simple_pairs_to_cos.py --dir "$dir"
done

# Same pair as the second loop above
for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
do
    echo "$dir"
    echo '316030    6496359' | python src/simple_pairs_to_cos.py --dir "$dir"
done

The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
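
The cosine computation for a pair of corpus ids, including the -1 convention for missing vectors, can be sketched as follows. This is an illustration under assumptions, not the code in src/simple_pairs_to_cos.py. The function name `pair_to_cos` is hypothetical, and the way a missing vector is detected (an id outside the map, an out-of-range row, or an all-zero vector) is a guess. The arrays are passed in, so the same function works on small in-memory arrays and on memory-mapped files.

```python
import numpy as np

def pair_to_cos(id1, id2, emb, old_to_new):
    """Cosine between the vectors of two corpus ids; -1.0 if either is missing."""
    rows = []
    for corpus_id in (id1, id2):
        if corpus_id < 0 or corpus_id >= len(old_to_new):
            return -1.0  # id not covered by the map
        row = int(old_to_new[corpus_id])
        if row < 0 or row >= emb.shape[0]:
            return -1.0  # assumption: out-of-range row means no vector
        rows.append(row)
    v1 = np.asarray(emb[rows[0]], dtype=np.float64)
    v2 = np.asarray(emb[rows[1]], dtype=np.float64)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return -1.0  # assumption: treat an all-zero vector as missing
    return float(v1 @ v2 / denom)

# Toy data: 4 papers in 2 dimensions, with corpus ids equal to row numbers.
toy_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
toy_map = np.arange(4)
print(pair_to_cos(0, 2, toy_emb, toy_map))   # cosine of a 45-degree angle
print(pair_to_cos(0, 99, toy_emb, toy_map))  # id 99 is not in the map
```

Because -1 is also a valid cosine, a caller that needs to tell "missing" apart from "opposite direction" would have to check for missing vectors separately.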

The following runs the approximate nearest neighbor search. The Python program, src/near.py, reads a corpus id and outputs the topN nearest corpus ids. The code assumes the directory given by the --dir argument contains several indexing files named postings.* and landmarks.*. These indexing files are available for ProNE-s and Specter2, but not for SciNCL and GNN.

echo 316030 | src/near.py --dir embeddings/ProNE-s --topN 5
echo 316030 | src/near.py --dir embeddings/Specter2 --topN 5
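
For intuition, the older idx.*.i files described earlier suggest a simple candidate search: since papers that are close in a permutation should have large cosines, one can look at a small window around a paper's position and rank those candidates by exact cosine. The sketch below illustrates that idea only. It is not what src/near.py does (that program uses the postings.* and landmarks.* files), and it assumes that `perm[pos]` gives a row of embedding.f and that `perm_inv[row]` gives the position of that row. The function names and the `window` parameter are hypothetical.

```python
import numpy as np

def window_candidates(row, perm, perm_inv, window=2):
    """Rows within `window` positions of `row` in one permutation."""
    pos = int(perm_inv[row])
    lo = max(0, pos - window)
    hi = min(len(perm), pos + window + 1)
    return [int(r) for r in perm[lo:hi] if int(r) != row]

def near_by_permutation(row, emb, perm, perm_inv, window=2, topN=5):
    """Rank the window candidates by cosine with the query row."""
    cands = window_candidates(row, perm, perm_inv, window)
    if not cands:
        return []
    q = np.asarray(emb[row], dtype=np.float64)
    c = np.asarray(emb[cands], dtype=np.float64)
    denom = np.linalg.norm(c, axis=1) * np.linalg.norm(q)
    scores = (c @ q) / np.maximum(denom, 1e-12)  # guard against zero vectors
    order = np.argsort(-scores)[:topN]
    return [(cands[i], float(scores[i])) for i in order]
```

With several permutations (idx.*.i matches more than one file), the candidate lists could be merged before ranking; how the original code combines them is not stated here.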