Name | Last modified | Size | Description | |
---|---|---|---|---|
README.html | 2024-05-25 11:37 | 5.2K | ||
bin_assignments/ | 2024-05-24 21:55 | - | ||
embeddings.old/ | 2024-05-23 07:51 | - | ||
embeddings/ | 2024-07-21 19:29 | - | ||
nohup.out | 2024-07-15 22:40 | 2.7M | ||
src/ | 2024-05-23 09:27 | - | ||
Subdirectory | Papers (N) | Dimensions (K) | Size of embedding.f |
---|---|---|---|
ProNE-s | 112M | 280 | 125GB |
Specter2 | 119M | 768 | 365GB |
SciNCL | 90M | 768 | 279GB |
GNN | 99M | 200 | 79GB |
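The sizes in the table appear consistent with storing each embedding as an N × K matrix of 4-byte floats. That storage layout is an assumption inferred from the arithmetic, not something stated here. The sketch below checks it; the helper name `embedding_size_gb` is hypothetical.

```python
# Assumption: embedding.f holds N * K float32 values (4 bytes each).
BYTES_PER_FLOAT32 = 4


def embedding_size_gb(n_papers: float, k_dims: int) -> float:
    """Return the size in GB (10^9 bytes) of an N x K float32 matrix."""
    return n_papers * k_dims * BYTES_PER_FLOAT32 / 1e9


# (papers N, dimensions K, reported size in GB), copied from the table above.
reported = {
    "ProNE-s": (112e6, 280, 125),
    "Specter2": (119e6, 768, 365),
    "SciNCL": (90e6, 768, 279),
    "GNN": (99e6, 200, 79),
}

for name, (n, k, size_gb) in reported.items():
    estimate = embedding_size_gb(n, k)
    print(f"{name}: estimated {estimate:.1f}GB, reported {size_gb}GB")
```

The estimates land within a few percent of the reported sizes. The paper counts are rounded to the nearest million, so an exact match is not expected.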
There are also some embeddings for a few ProNE-s models trained on subgraphs:
Subdirectory | Papers (N) | Dimensions (K) | Size of embedding.f |
---|---|---|---|
ProNE-s (bin 10) | 8M | 280 | 9GB |
ProNE-s (bin 20) | 19M | 280 | 21GB |
ProNE-s (bin 30) | 29M | 280 | 32GB |
ProNE-s (bin 40) | 39M | 280 | 44GB |
ProNE-s (bin 50) | 49M | 280 | 55GB |
Consider this example:
The code below shows how to compute cosine scores for these pairs using the embeddings in this directory. The code in the src directory illustrates how to memory-map these embeddings into Python; it should be easy to modify that code to retrieve the vectors themselves.
Note: when a vector is missing, the cosine is -1.

    for dir in embeddings/*
    do
        echo "$dir"
        echo '316030 47066000' | python src/simple_pairs_to_cos.py --dir "$dir"
    done

    for dir in embeddings/*
    do
        echo "$dir"
        echo '316030 6496359' | python src/simple_pairs_to_cos.py --dir "$dir"
    done

The following is like the above, but for several ProNE-s models.

    for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
    do
        echo "$dir"
        echo '316030 47066000' | python src/simple_pairs_to_cos.py --dir "$dir"
    done

    for dir in embeddings/ProNE-s/bins/0?? embeddings/ProNE-s
    do
        echo "$dir"
        echo '316030 6496359' | python src/simple_pairs_to_cos.py --dir "$dir"
    done

The input is a pair of corpus ids from Semantic Scholar. See here for the code; it shows how to compute cosine similarities for pairs of corpus ids using several different embeddings.
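As a rough illustration of what memory mapping an embedding and scoring a pair might look like, here is a minimal sketch. It is not the code in src. It assumes the file is a raw row-major float32 matrix with K columns, it takes row indices rather than corpus ids (the mapping from corpus id to row is not covered), and it treats an out-of-range or all-zero row as a missing vector. The helper names `load_embedding` and `cosine` are hypothetical.

```python
import os
import tempfile

import numpy as np


def load_embedding(path, k):
    """Memory-map a raw float32 file as an (N, K) matrix without loading it into RAM."""
    return np.memmap(path, dtype=np.float32, mode="r").reshape(-1, k)


def cosine(emb, row_a, row_b):
    """Cosine of two rows; -1.0 if either row is out of range or all zeros.

    Treating a zero row as "missing" is an assumption made for this sketch.
    """
    n = emb.shape[0]
    if not (0 <= row_a < n and 0 <= row_b < n):
        return -1.0
    a = np.asarray(emb[row_a], dtype=np.float64)
    b = np.asarray(emb[row_b], dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return -1.0
    return float(a @ b / denom)


# Toy demonstration: a 4 x 3 matrix written to a temporary file.
toy = np.array([[1, 0, 0], [2, 0, 0], [0, 3, 0], [0, 0, 0]], dtype=np.float32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedding.f")
    toy.tofile(path)
    emb = load_embedding(path, k=3)
    scores = [
        cosine(emb, 0, 1),   # parallel vectors
        cosine(emb, 0, 2),   # orthogonal vectors
        cosine(emb, 0, 3),   # all-zero row, treated as missing
        cosine(emb, 0, 99),  # row out of range, treated as missing
    ]
    del emb  # release the mapping before the temporary directory is removed
print(scores)
```

Only the rows that are actually indexed get paged in from disk, which is why this approach stays practical for files in the hundreds of gigabytes.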
The following performs approximate nearest neighbor search. The Python program src/near.py takes a corpus id as input and outputs the topN nearest corpus ids. The code assumes that the directory given by the --dir argument contains several index files named postings.* and landmarks.*. These index files are available for ProNE-s and Specter2, but not for SciNCL and GNN.

    echo 316030 | src/near.py --dir embeddings/ProNE-s --topN 5
    echo 316030 | src/near.py --dir embeddings/Specter2 --topN 5
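For embeddings without index files, an exact brute-force search is a possible baseline on small subsets. The sketch below is not how src/near.py works, and it does not use the postings.* or landmarks.* files. It scores every row against the query by cosine and keeps the best few. It works on row indices rather than corpus ids, and the function name `top_n_by_cosine` is hypothetical.

```python
import numpy as np


def top_n_by_cosine(emb, query_row, top_n=5):
    """Return (row, score) pairs for the top_n rows most similar to query_row.

    Exact brute force: every row is scored, so cost grows linearly with N.
    The query row itself is included in the result.
    """
    emb = np.asarray(emb, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1)
    norms[norms == 0.0] = 1.0  # avoid division by zero; zero rows score 0
    unit = emb / norms[:, None]
    scores = unit @ unit[query_row]
    order = np.argsort(-scores, kind="stable")
    return [(int(i), float(scores[i])) for i in order[:top_n]]


# Toy demonstration with four 2-dimensional vectors.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
neighbors = top_n_by_cosine(toy, query_row=0, top_n=3)
for row, score in neighbors:
    print(row, round(score, 4))
```

At the scale of the full embeddings (on the order of 100M rows), a scan like this touches the whole file for each query, which is the cost an approximate index is meant to avoid.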