Documentation for Similar (and related capabilities)

Tool for finding similar documents (and more). All of these tools are based on Semantic Scholar (S2) APIs.

Table of Contents
  1. APIs
  2. Quick Tour (out of date)

APIs

The following APIs return json objects:

API (examples): arguments, followed by a description
Lookup Paper: help, id, fields, embeddings, score2
  • Input one or more comma separated paper ids and output fields from S2, as well as embeddings.
  • If the embeddings argument is specified, then output an embedding vector for each input paper (missing values will have vectors of 0).
  • See documentation on embeddings for details on how to specify combinations of different embeddings to return.
Lookup Author: help, id, fields, sort_by, limit, embeddings, score2
  • Input author id and output author fields from S2.
  • The fields argument can request a list of papers.
  • The limit argument truncates the list of papers returned.
  • The sort_by argument can be citationCount; if so, then the fields argument should contain papers.citationCount.
  • Embeddings can return vectors; see documentation on embeddings.
  • Note: author ids are different from paper ids and author fields are different from paper fields.
Lookup Citations (example): help, offset (defaults to 0), limit (defaults to 100; max is 1000), id, fields
  • Look up citations for paper id and output fields from S2 for each citation.
  • A useful field to request is contexts; that field returns citing sentences, that is, sentences from other papers that cite the input paper id.
  • For papers with more than 1000 citations, call this API multiple times with different offsets.
Coauthors (example): help, query, after_year
  • Input query (a string); for each matching author id, return a list of coauthors filtered by after_year (a 4 digit number).
  • Note: since Semantic Scholar may have multiple author ids for the same author, the json object contains a list of coauthors for each author id matching the input query.
Recommend Papers (example): help, id, limit, recommend_method, fields, sort_by, score1, score2
  • Recommend papers similar to paper id using recommend_method.
  • See documentation on recommend_method for the choices that are currently supported.
  • Output fields from S2 for each recommended paper.
  • The optional arguments, score1 and score2, score recommendations one at a time (for score1) and pairwise (for score2), using one or more of four embeddings.
Recommend Authors (example): help, id, limit, recommend_method, fields, sort_by, score1, score2
Compare and Contrast (example1, example2): help, ids (two or more ids, separated by commas)
  • Use RAG to compare and contrast the first id with the rest.
Compare and Contrast Texts (example): help, text1, text2
  • Use RAG to compare and contrast text1 with text2, where both texts are strings.
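
The paging advice for Lookup Citations (call the API several times with different offsets) can be sketched as follows. This is only an illustration: citation_pages is a hypothetical helper, not part of the tool, and it assumes the total number of citations is already known (for example, from the citationCount field of the paper).

```python
from typing import Iterator, Tuple

MAX_LIMIT = 1000  # maximum limit documented for Lookup Citations


def citation_pages(total: int, limit: int = MAX_LIMIT) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover `total` citations.

    Hypothetical helper: each pair corresponds to one call to Lookup Citations.
    """
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    for offset in range(0, total, limit):
        # The last page may be shorter than `limit`.
        yield offset, min(limit, total - offset)


# A paper with 2500 citations needs three calls.
for offset, limit in citation_pages(2500):
    print(f"offset={offset}&limit={limit}")
```

Each printed line is the offset/limit portion of one request; the remaining arguments (id, fields) stay the same across calls.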

Arguments

  1. help: return short documentation
  2. query: input string
  3. id: input for lookup_paper, recommendations and lookup_citations; many of the externalId formats are supported, including:
    1. sha (40 hex characters); example
    2. CorpusId (the primary key in Semantic Scholar); example
    3. PMID (pubmed ids); example
    4. ACL (acl anthology ids); example
    5. arXiv; example
    6. MAG (Microsoft Academic Graph); example
  4. id: input for lookup_author (Note: author ids are different from paper ids)
  5. offset: start of papers to return (defaults to 0)
  6. limit: number of results to return
  7. fields: one or more comma separated values. Many values are supported including title, authors, publication year, bibtex entries, references, citations, citing sentences and much more (see discussion below)
  8. recommend_method (for generating recommendations); recommend_method should be one of the following (case insensitive):
    1. combined: example (a fast precomputed combination of ProNE and Specter)
    2. ProNE: example
    3. Specter: example
    4. s2_api: example (based on an API from Semantic Scholar)
    5. pubmed_api: example
    6. ads_api: example (based on an API from Astrophysics Data System)
    ProNE and Specter use cached embeddings to generate recommendations. The last three generate recommendations from APIs provided by Semantic Scholar (S2), PubMed and the Astrophysics Data System, respectively.
  9. Embeddings (for lookup paper); one or more of the following (comma separated and case insensitive):
    1. ProNE: example
    2. Specter: example
    3. SciNCL: example
    4. GNN: example
    5. s2_api: example
    The first four, ProNE, Specter, SciNCL and GNN, use cached vectors. The last one, s2_api, uses an API from Semantic Scholar to return the most recent values. Specter and s2_api should return the same vectors, as long as the Specter vector is not missing. There will be one vector for each input paper. Vectors of zeros are returned for missing values.
  10. prone, scincl, specter, gnn, s2_api (case insensitive).
  11. score1: outputs 1d vectors of cosine scores, with one score for each recommendation. The value of score1 is a (case insensitive) comma separated list of embeddings: ProNE, Specter, SciNCL, GNN; example.
    Note: missing values will have cosines of 0
  12. score2: outputs pairwise cosine scores between each pair of recommendations. The value of score2 is a (case insensitive) comma separated list of embeddings: ProNE, Specter, SciNCL, GNN; example.
    Note: missing values will have cosines of 0
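
To make the scoring arguments concrete, here is a minimal sketch of the two kinds of cosine scores. It is not the tool's implementation. It assumes that score1 compares the input paper with each recommendation (the text does not say what each recommendation is compared against), and that a missing embedding is represented by a vector of zeros, which yields a cosine of 0. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros (a missing value)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def score1(query: Sequence[float], recs: List[Sequence[float]]) -> List[float]:
    """One score per recommendation (assumed: query paper vs. recommendation)."""
    return [cosine(query, r) for r in recs]


def score2(recs: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise scores between every pair of recommendations."""
    return [[cosine(a, b) for b in recs] for a in recs]


query = [1.0, 0.0]
recs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # the last vector is "missing"
print(score1(query, recs))
print(score2(recs))
```

Note that under this convention a missing vector scores 0 even against itself, so the diagonal of the pairwise matrix is not always 1.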

Fields

Fields are based on Semantic Scholar APIs; see the Semantic Scholar documentation for details. Some useful values are shown below (with separate lists for papers, authors and citations). Set fields to a comma separated list, such as fields=title,authors.
  1. Fields for papers:
    1. externalIds (outputs one or more of the following ids from the 8 sources behind Semantic Scholar)
      1. CorpusId (the primary key for Semantic Scholar); example
      2. MAG (Microsoft Academic Graph); example
      3. DOI; example
      4. PubMed; example
      5. PubMedCentral; example
      6. DBLP; example
      7. arXiv; example
      8. ACL; example
    2. url (pointer into Semantic Scholar); example
    3. title (of paper); example
    4. abstract; example
    5. tldr; example
    6. authors; you can request a list of author objects and/or specific fields from the author objects:
      1. authors (list of author objects); example
      2. authors.name (list of author names); example
      3. authors.authorId (list of author ids); example
    7. year; example
    8. venue; example
    9. citationStyles; example (outputs bibtex entries)
    10. referenceCount; example
    11. citationCount; example
    12. openAccessPdf (pointer to PDF file, if known); example
    13. fieldsOfStudy (probably from MAG); example
    14. s2FieldsOfStudy (like fieldsOfStudy, but from Semantic Scholar); example
    15. embedding.specter_v2 (vector of 768 floats based on an encoding of the title and abstract using a BERT-like model); example
    16. citations (list of papers that cite this paper); example
      1. citations.title; example
      2. citations.authors; example
      3. citations.citationCount; example
      4. citations.xyz where xyz is authors, citationCount and most other paper fields
    17. references; example
      1. references.title; example
      2. references.authors; example
      3. references.citationCount; example
      4. references.xyz where xyz is authors, citationCount and most other paper fields
  2. Fields for authors:
    1. authorId; example
    2. externalIds; example
    3. url; example
    4. name; example
    5. affiliations; example
    6. homepage; example
    7. paperCount; example
    8. hIndex; example
    9. citationCount; example
    10. papers (list of papers); example
      1. papers.title; example
      2. papers.externalIds; example
      3. papers.xyz where xyz is authors, citationCount and most other paper fields
  3. Fields for citations:
    1. contexts (citing sentences): example
    2. intents: example
    3. xyz where xyz is authors, citationCount and most other paper fields
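
The nested names above (citations.xyz, references.xyz, papers.xyz) all follow the same parent.subfield pattern, so the value of the fields argument can be assembled programmatically. The sketch below is one possible way to do that; nested_fields and fields_value are hypothetical helpers, and the sketch does not check that a field name is actually supported by Semantic Scholar.

```python
from typing import Iterable, List


def nested_fields(parent: str, subfields: Iterable[str]) -> List[str]:
    """Expand a parent field into names such as citations.title."""
    return [f"{parent}.{sub}" for sub in subfields]


def fields_value(fields: Iterable[str]) -> str:
    """Join field names into a comma separated value, dropping duplicates."""
    seen: List[str] = []
    for name in fields:
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return ",".join(seen)


value = fields_value(
    ["title", "authors.name"] + nested_fields("citations", ["title", "citationCount"])
)
print("fields=" + value)
```

The printed string can be used as the fields portion of a request, in the same way as fields=title,authors.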

Quick Tour

Specifying Input Documents

Many inputs are supported. The simplest case is a corpus id from Semantic Scholar:
  1. similar?CorpusId=3051291
Search, query and author are also supported:
  1. similar?search=deepwalk
  2. similar?query=deepwalk
  3. similar?author=Povey&limit=10
All of these use APIs from Semantic Scholar. Query uses autocomplete, and Search uses another API from Semantic Scholar for mapping strings to one or more corpus ids. Author maps strings to author ids. For each author id, we list a number of papers sorted by citations, with links to find similar papers for each of them.

Specifying Embeddings

There is an embedding option. Four values are currently supported: specter, specter2, scincl and proposed [default].
  1. embedding=specter
  2. embedding=specter2
  3. embedding=scincl
  4. embedding=proposed

Json Output

Example:
  1. similar?query=deepwalk
  2. similar?query=deepwalk&output_mode=json Same as above, but outputs json

limit

Most of the commands above take a limit option [default=50]
  1. similar?author=Povey&limit=10
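
The query strings in this tour can also be generated in code. The following is a minimal sketch using Python's standard library; similar_url is a hypothetical helper, and the result is a relative URL because the host serving similar is not given here.

```python
from urllib.parse import urlencode


def similar_url(**options) -> str:
    """Build a relative URL such as similar?query=deepwalk&output_mode=json.

    Options set to None are omitted, so the server-side defaults apply.
    """
    params = {key: value for key, value in options.items() if value is not None}
    return "similar?" + urlencode(params)


print(similar_url(CorpusId=3051291))
print(similar_url(author="Povey", limit=10))
print(similar_url(query="deepwalk", output_mode="json", embedding=None))
```

urlencode also escapes characters such as spaces, which matters for multi-word search or query strings.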

More Details

  1. search
  2. query
  3. author

Output Mode

  1. None (default)
  2. json
  3. bibtex
  4. RAG

Limit

Input Embedding