Documentation for Similar (and related capabilities)

Tool for finding similar documents (and more). All of these tools are based on Semantic Scholar (S2) APIs.

APIs
Quick Tour (out of date)

APIs

The following APIs return json objects:

API	Examples	Arguments	Description
Paper Search	example	help, query, fields	Find papers matching input query (a string); output fields from S2 for each paper. See documentation on fields for more information on fields in S2. A common use case is to request paper ids from titles of papers since many of the APIs below are based on ids in Semantic Scholar (and other sources).
Author Search	simple example, more challenging example	help, query, fields	Find authors matching input query (a string); output fields from S2 for each author. See documentation on fields for more information on fields in S2. Sort_by can be any of the fields that can be converted to integers. The limit argument truncates results (after sorting). Note: author fields are different from paper fields.
Lookup Paper	simple example, more challenging example (with score2), more challenging example (with embeddings)	help, id, fields, embeddings, score2	Input one or more comma separated paper ids and output fields from S2, as well as embeddings. If the embeddings argument is specified, output an embedding vector for each input paper (missing values are returned as vectors of zeros). See documentation on embeddings for details on how to specify combinations of different embeddings to return.
Lookup Author	example1, more challenging example (with score2), more challenging example (with embeddings)	help, id, fields, sort_by, limit, embeddings, score2	Input author id and output author fields from S2. The fields argument can request a list of papers. The limit argument truncates the list of papers returned. The sort_by argument can be citationCount; if so, the fields argument should contain papers.citationCount. Embeddings can return vectors; see documentation on embeddings. Note: author ids are different from paper ids and author fields are different from paper fields.
Lookup Citations	example	help, offset (defaults to 0), limit (defaults to 100; max is 1000), id, fields	Look up citations for paper id and output fields from S2 for each citation. A useful field to request is contexts; that field returns citing sentences, that is, sentences from other papers that cite the input paper id. For papers with more than 1000 citations, call this API multiple times with different offsets.
Coauthors	example	help, query, after_year	Input query (a string); for each matching author id, return a list of coauthors filtered by after_year (a 4 digit number). Note: since Semantic Scholar may have multiple author ids for the same author, the json object contains a list of coauthors for each author id matching the input query.
Recommend Papers	example	help, id, limit, recommend_method, fields, sort_by, score1, score2	Recommend papers similar to paper id using recommend_method. See documentation on recommend_method for choices of recommend_methods that are currently supported. Output fields from S2 for each recommended paper. The optional arguments, score1 and score2, score recommendations one at a time (for score1) and pairwise (for score2), using one or more of four embeddings.
Recommend Authors	example	help, id, limit, recommend_method, fields, sort_by, score1, score2	Recommend authors near paper id using recommend_method. Output fields from S2 for each recommended author.
Compare and Contrast	example1, example2	help, ids (two or more ids, separated by commas)	Use RAG to compare and contrast the first id with the rest.
Compare and Contrast Texts	example	help, text1, text2	Use RAG to compare and contrast text1 with text2, where both texts are strings.
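
As a rough sketch of how the arguments in the table combine into a request, the snippet below builds query strings for Lookup Citations and pages through a long citation list with offset and limit. The base URL, the use of lookup_citations as a path, the CorpusId: prefix on the id and the helper names are all assumptions made for illustration; only the argument names and the maximum limit of 1000 come from the table.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of the actual service.
BASE_URL = "http://localhost:8000/"
MAX_LIMIT = 1000  # maximum value of limit for Lookup Citations, per the table


def build_url(endpoint, **params):
    """Return BASE_URL + endpoint + '?' + URL-encoded params."""
    return BASE_URL + endpoint + "?" + urlencode(params)


def citation_page_urls(paper_id, total_citations, fields="title,contexts"):
    """Return one URL per page of at most MAX_LIMIT citations."""
    return [
        build_url("lookup_citations", id=paper_id, fields=fields,
                  offset=offset, limit=MAX_LIMIT)
        for offset in range(0, total_citations, MAX_LIMIT)
    ]


# A paper with 2500 citations needs three calls (offsets 0, 1000 and 2000).
# The "CorpusId:" prefix is an assumed id format, not confirmed by this page.
for url in citation_page_urls("CorpusId:3051291", 2500):
    print(url)
```

Each URL could then be fetched with any HTTP client and the json results concatenated in offset order.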

Arguments

help: return short documentation
query: input string
id: input for lookup_paper, recommendations and lookup_citations; many of the externalId formats are supported, including:
1. sha (40 hex characters); example
2. CorpusId (the primary key in Semantic Scholar); example
3. PMID (pubmed ids); example
4. ACL (acl anthology ids); example
5. arXiv; example
6. MAG (Microsoft Academic Graph); example
id: input for lookup_author (Note: author ids are different from paper ids)
offset: start of papers to return (defaults to 0)
limit: number of results to return
fields: one or more comma separated values. Many values are supported including title, authors, publication year, bibtex entries, references, citations, citing sentences and much more (see discussion below)
recommend_method (for generating recommendations); recommend_method should be one of the following (case insensitive):
1. combined: example (a fast precomputed combination of ProNE and Specter)
2. ProNE: example
3. Specter: example
4. s2_api: example (based on an API from Semantic Scholar)
5. pubmed_api: example
6. ads_api: example (based on an API from Astrophysics Data System)
ProNE and Specter use cached embeddings to generate recommendations. The last three generate recommendations from APIs at Semantic Scholar (S2), PubMed and the Astrophysics Data System, respectively.
Embeddings (for lookup paper); one or more of the following (comma separated and case insensitive):
1. ProNE: example
2. Specter: example
3. SciNCL: example
4. GNN: example
5. s2_api: example
The first four, ProNE, Specter, SciNCL and GNN, use cached vectors. The last one, s2_api, uses an API from Semantic Scholar to return the most recent values. Specter and s2_api should return the same vectors, as long as the Specter vector is not missing. There is one vector for each input paper; vectors of zeros are returned for missing values.
score1: outputs a 1d vector of cosine scores, one for each recommendation. The value of score1 is a (case insensitive) comma separated list of embeddings: ProNE, Specter, SciNCL, GNN; example.
Note: missing values will have cosines of 0
score2: outputs pairwise cosine scores between each pair of recommendations. The value of score2 is a (case insensitive) comma separated list of embeddings: ProNE, Specter, SciNCL, GNN; example.
Note: missing values will have cosines of 0
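
To make the notes on score2 and missing values concrete, here is a minimal sketch of pairwise cosine scoring in which an all-zero vector (a missing embedding) scores 0 against everything. It is an illustration only and not the code behind these APIs; the function names and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine of two vectors; 0.0 if either vector is all zeros (missing)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def pairwise_cosines(vectors):
    """Square matrix of cosines between every pair of vectors (as in score2)."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy 2-d vectors; real embeddings are much longer.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # the last one is "missing"
scores = pairwise_cosines(vectors)
for row in scores:
    print(row)
```

In this sketch a missing vector also scores 0 against itself; whether the service does the same on the diagonal is an assumption.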

Fields

Fields are based on Semantic Scholar APIs; see the Semantic Scholar API documentation for details. Some useful values are shown below (with separate lists for papers, authors and citations). The fields argument is set to a comma separated list such as fields=title,authors.

Fields for papers:
1. externalIds (outputs one or more of the following ids from the 8 sources behind Semantic Scholar)
  1. CorpusId (the primary key for Semantic Scholar); example
  2. MAG (Microsoft Academic Graph); example
  3. DOI; example
  4. PubMed; example
  5. PubMedCentral; example
  6. DBLP; example
  7. arXiv; example
  8. ACL; example
2. url (pointer into Semantic Scholar); example
3. title (of paper); example
4. abstract; example
5. tldr; example
6. authors; you can request a list of author objects and/or specific fields from the author objects:
  1. authors (list of author objects); example
  2. authors.name (list of author names); example
  3. authors.authorId (list of author ids); example
7. year; example
8. venue; example
9. citationStyles; example (outputs bibtex entries)
10. referenceCount; example
11. citationCount; example
12. openAccessPdf (pointer to PDF file, if known); example
13. fieldsOfStudy (probably from MAG); example
14. s2FieldsOfStudy (like fieldsOfStudy, but from Semantic Scholar); example
15. embedding.specter_v2 (vector of 768 floats based on an encoding of the title and abstract using a BERT-like model); example
16. citations (list of papers that cite this paper); example
  1. citations.title; example
  2. citations.authors; example
  3. citations.citationCount; example
  4. citations.xyz where xyz is authors, citationCount and most other paper fields
17. references; example
  1. references.title; example
  2. references.authors; example
  3. references.citationCount; example
  4. references.xyz where xyz is authors, citationCount and most other paper fields
Fields for authors:
1. authorId; example
2. externalIds; example
3. url; example
4. name; example
5. affiliations; example
6. homepage; example
7. paperCount; example
8. hIndex; example
9. citationCount; example
10. papers (list of papers); example
  1. papers.title; example
  2. papers.externalIds; example
  3. papers.xyz where xyz is authors, citationCount and most other paper fields
Fields for citations:
1. contexts (citing sentences); example
2. intents; example
3. xyz where xyz is authors, citationCount and most other paper fields
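
Since nested fields such as authors.name or citations.title follow a regular parent.subfield pattern, a fields value can be assembled mechanically. The helper below is a hypothetical convenience for doing so, not part of these APIs; the field names it uses are taken from the lists above.

```python
def fields_param(top_level, **nested):
    """Build a comma separated fields value.

    top_level: list of plain field names, e.g. ["title", "year"]
    nested: parent=list of subfields, e.g. citations=["title"]
            becomes "citations.title"
    """
    parts = list(top_level)
    for parent, subfields in nested.items():
        parts.extend(f"{parent}.{sub}" for sub in subfields)
    return ",".join(parts)


paper_fields = fields_param(
    ["title", "year", "externalIds"],
    authors=["name", "authorId"],
    citations=["title", "citationCount"],
)
print("fields=" + paper_fields)
```

The resulting string would be passed as the fields argument of a paper API; author and citation APIs take their own field names.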

Quick Tour

Specifying Input Documents

Many inputs are supported. The simplest case is a corpus id from Semantic Scholar:

similar?CorpusId=3051291

Search, query and author are also supported.

All of these use APIs from Semantic Scholar. Query uses autocomplete, and Search uses another method from Semantic Scholar for mapping strings to one or more corpus ids. Author maps strings to author ids. For each author id, we list a number of papers sorted by citations, with links to find similar papers for each of them.

Specifying Embeddings

There is an embedding option. Four values are currently supported: specter, specter2, scincl and proposed [default].

Json Output

Example:

similar?query=deepwalk
similar?query=deepwalk&output_mode=json (same as above, but outputs json)

limit

Most of the commands above take a limit option [default=50]

similar?author=Povey&limit=10
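
The query strings shown in this tour can also be assembled in code. The sketch below reproduces them as relative URLs with the standard library; the helper name is hypothetical, and the host that serves similar is deliberately left out because it is not given here.

```python
from urllib.parse import urlencode


def similar_query(**options):
    """Return a relative 'similar?...' URL; options set to None are dropped."""
    kept = {k: v for k, v in options.items() if v is not None}
    return "similar?" + urlencode(kept)


print(similar_query(CorpusId=3051291))
print(similar_query(query="deepwalk", output_mode="json"))
print(similar_query(author="Povey", limit=10))
```

Using urlencode rather than string concatenation keeps queries with spaces or punctuation correctly escaped.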

More Details

search
query
author

Output Mode

None (default)
json
bibtex
RAG