goarxiv

package module
v0.1.0 Latest
Published: Jan 29, 2026 License: MIT Imports: 22 Imported by: 0

README

goarxiv

Go SDK for the arXiv API. Provides an ergonomic, type-safe client for searching and retrieving arXiv articles while honoring the platform's usage guidelines.

Features

  • Idiomatic Client with functional options, request/response hooks, and context support
  • Declarative query builder with full category taxonomy validation
  • Built-in 3-second rate limiter (required by arXiv ToU)
  • Pagination helpers (SearchAll, StreamResults, Iterate) capped at 30,000 results
  • Export helpers (BibTeX, JSON, CSV) for downstream workflows
  • Rate-limit-friendly PDF downloader with progress callbacks

Installation

go get github.com/mtreilly/goarxiv

Quick Start

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/mtreilly/goarxiv"
)

func main() {
    client, err := goarxiv.New()
    if err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()
    results, err := client.Search(ctx, "all:quantum computing", nil)
    if err != nil {
        log.Fatal(err)
    }

    for _, article := range results.Articles {
        fmt.Printf("%s - %s\n", article.ID, article.Title)
    }
}

Pagination

// Iterate through results one at a time
iter := client.Iterate("cat:cs.LG", 50, nil)
for iter.Next(ctx) {
    article := iter.Article()
    fmt.Printf("%s - %s\n", article.ID, article.Title)
}
if err := iter.Err(); err != nil {
    log.Fatal(err)
}

Export Formats

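// Export to BibTeX
bibtex := article.ToBibTeX()

// Export to JSON
jsonData, err := article.ToJSON()
if err != nil {
    log.Fatal(err)
}

// Export to CSV (ArticlesToCSV takes a slice; article is assumed to be a *goarxiv.Article)
csvData, err := goarxiv.ArticlesToCSV([]*goarxiv.Article{article})
if err != nil {
    log.Fatal(err)
}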

Download PDFs

err := client.DownloadPDF(ctx, article, &goarxiv.DownloadOptions{
    OutputDir: "./papers",
    Progress: func(downloaded, total int64) {
        if total > 0 { // skip the percentage when the size is unknown
            fmt.Printf("%.1f%%\n", float64(downloaded)/float64(total)*100)
        }
    },
})
if err != nil {
    log.Fatal(err)
}

Rate Limiting

The arXiv API requires a minimum 3-second delay between requests. This client enforces rate limiting automatically. For local testing against mock servers, you can use WithDebugMode(), but never use this against the real API.

Attribution

This project uses the arXiv API. Thank you to arXiv for use of its open access interoperability.

arXiv API Terms of Use: https://arxiv.org/help/api/tou

Development

# Run unit tests
go test ./...

# Run integration tests (hits real arXiv API)
ARXIV_INTEGRATION_TESTS=1 go test ./...

License

MIT License - see LICENSE for details.

Documentation

Overview

Package goarxiv provides an idiomatic Go SDK for the arXiv API.

Terms of Use and Attribution

All usage of this package must comply with the arXiv API Terms of Use: https://arxiv.org/help/api/tou. When redistributing work derived from arXiv metadata, include the attribution "Thank you to arXiv for use of its open access interoperability." Article PDFs/source files retain their original licenses; check individual articles for redistribution policies.

Rate Limiting

arXiv enforces a mandatory 3-second delay between requests. `Client` enforces this automatically. For local testing against mock servers you may opt into `WithDebugMode`, but never use that mode against the real API because it violates the ToU.

Caching Guidance

arXiv refreshes metadata daily. Cache search responses (for at least 24 hours) to reduce load. The SDK does not include a cache layer so you can integrate with your preferred caching solution.
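
Since the SDK leaves caching to the caller, the following is a minimal sketch of one way to keep search responses for 24 hours. The `ttlCache` type, its methods, and the injectable clock are hypothetical names for illustration and are not part of goarxiv; a production setup might prefer an existing cache library or an external store.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheEntry pairs a cached payload with its expiry time.
type cacheEntry struct {
	value   []byte
	expires time.Time
}

// ttlCache is a hypothetical in-memory cache; goarxiv ships no cache layer.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock, handy in tests
	entries map[string]cacheEntry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, entries: make(map[string]cacheEntry)}
}

// Get returns the cached value unless it is missing or expired.
func (c *ttlCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || !c.now().Before(e.expires) {
		delete(c.entries, key) // drop stale entries lazily
		return nil, false
	}
	return e.value, true
}

// Set stores a value that expires ttl after the current clock reading.
func (c *ttlCache) Set(key string, value []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cacheEntry{value: value, expires: c.now().Add(c.ttl)}
}

// cacheDemo checks the same key 23 hours and 25 hours after it was stored.
func cacheDemo() (hitBefore, hitAfter bool) {
	clock := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	c := newTTLCache(24*time.Hour, func() time.Time { return clock })
	c.Set("all:quantum", []byte("<feed/>"))
	clock = clock.Add(23 * time.Hour)
	_, hitBefore = c.Get("all:quantum")
	clock = clock.Add(2 * time.Hour)
	_, hitAfter = c.Get("all:quantum")
	return hitBefore, hitAfter
}

func main() {
	before, after := cacheDemo()
	fmt.Println("hit after 23h:", before, "hit after 25h:", after)
}
```

Keying the cache on the full query string plus pagination and sort options would avoid serving one page in place of another.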

Quick Start

client, err := goarxiv.New()
if err != nil {
    log.Fatal(err)
}
results, err := client.Search(ctx, "all:quantum", nil)
if err != nil {
    log.Fatal(err)
}
for _, article := range results.Articles {
    fmt.Printf("%s — %s\n", article.ID, article.Title)
}

Pagination

Use `Client.SearchAll`, `Client.StreamResults`, or `Client.Iterate` to traverse multiple pages. The client enforces arXiv's global limit of 30,000 total results per query.

Downloads & Exports

`DownloadPDF` respects rate limiting and provides optional progress callbacks. `Article` helpers convert metadata to BibTeX, JSON, or CSV for downstream workflows.

Observability

Register hooks via `WithRequestHook`/`WithResponseHook` to integrate logging, metrics, or tracing without forking the SDK.

Constants

const (
	MaxResultsPerRequest = 2000
	MaxResultsTotal      = 30000
)
const MinRateLimit = 3 * time.Second

MinRateLimit defines the minimum delay enforced by arXiv ToU.

Variables

var (
	// Categories maps a category code to its metadata.
	Categories map[string]CategoryInfo
	// FieldCategories lists category codes grouped by high-level field.
	FieldCategories map[string][]string
)
var (
	// ErrInvalidID indicates the provided identifier failed validation.
	ErrInvalidID = errors.New("arxiv: invalid ID format")
	// ErrRateLimit indicates the API's rate limit has been exceeded.
	ErrRateLimit = errors.New("arxiv: rate limit exceeded")
	// ErrMaxResults indicates a request exceeded the maximum total items allowed.
	ErrMaxResults = errors.New("arxiv: max_results cannot exceed 30000")
	// ErrNetworkTimeout captures transport timeouts when reaching arXiv.
	ErrNetworkTimeout = errors.New("arxiv: network timeout")
	// ErrNotImplemented marks APIs that are still under construction.
	ErrNotImplemented = errors.New("goarxiv: not implemented")
)

Functions

func ArticlesToCSV

func ArticlesToCSV(articles []*Article) ([]byte, error)

ArticlesToCSV converts the provided article slice into CSV bytes.

func GetCategoryField

func GetCategoryField(code string) string

GetCategoryField returns the high-level field for a code.

func GetFieldCategories

func GetFieldCategories(field string) []string

GetFieldCategories returns a copy of codes for the specified field (case-insensitive).

func IsValidArxivID

func IsValidArxivID(id string) bool

IsValidArxivID validates both legacy and modern arXiv identifiers.

func IsValidCategory

func IsValidCategory(code string) bool

IsValidCategory verifies that the code exists in the taxonomy.

func NormalizeArxivID

func NormalizeArxivID(id string) (string, error)

NormalizeArxivID trims whitespace and validates the identifier format.

func ParseArxivID

func ParseArxivID(id string) (string, int, error)

ParseArxivID splits the identifier into base ID and version number.
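
To illustrate the kind of split described here, the sketch below separates a base ID from its version suffix for the two public arXiv identifier shapes (modern `YYMM.NNNNN` and legacy `archive/YYMMNNN`). The `splitID` function and its regular expressions are hypothetical and may be stricter or looser than the package's own validation; defaulting to version 1 mirrors what `Article.Version` documents, and whether `ParseArxivID` does the same is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

var (
	// Modern IDs: four digits, a dot, four or five digits, optional vN suffix.
	modernID = regexp.MustCompile(`^(\d{4}\.\d{4,5})(?:v(\d+))?$`)
	// Legacy IDs: archive (optionally with a subject class), slash, seven digits.
	legacyID = regexp.MustCompile(`^([a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v(\d+))?$`)

	errBadID = errors.New("invalid arXiv ID format")
)

// splitID is a hypothetical helper, not the SDK's ParseArxivID.
func splitID(id string) (base string, version int, err error) {
	m := modernID.FindStringSubmatch(id)
	if m == nil {
		m = legacyID.FindStringSubmatch(id)
	}
	if m == nil {
		return "", 0, errBadID
	}
	if m[2] == "" {
		return m[1], 1, nil // assumption: no suffix is treated as version 1
	}
	v, convErr := strconv.Atoi(m[2])
	if convErr != nil || v < 1 {
		return "", 0, errBadID
	}
	return m[1], v, nil
}

func main() {
	for _, id := range []string{"2101.00001v2", "hep-th/9901001", "not-an-id"} {
		base, version, err := splitID(id)
		fmt.Printf("%q -> base=%q version=%d err=%v\n", id, base, version, err)
	}
}
```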

func RequireCategory

func RequireCategory(code string) error

RequireCategory ensures the code exists or returns an error.

func SearchCategories

func SearchCategories(keyword string) []string

SearchCategories performs a case-insensitive search across code, name, and description.

Types

type Article

type Article struct {
	ID              string
	Title           string
	Summary         string
	Authors         []Author
	Published       time.Time
	Updated         time.Time
	PrimaryCategory string
	Categories      []string
	Links           []Link
	Comment         *string
	JournalRef      *string
	DOI             *string
}

Article represents a single entry returned by the arXiv API.

func (Article) AbstractURL

func (a Article) AbstractURL() string

AbstractURL returns the abstract URL for the article.

func (Article) BaseID

func (a Article) BaseID() string

BaseID removes any version suffix from the article identifier.

func (Article) PDFURL

func (a Article) PDFURL() string

PDFURL returns the canonical PDF URL for the article.

func (*Article) ToBibTeX

func (a *Article) ToBibTeX() string

ToBibTeX renders the article metadata as a simple BibTeX entry.

func (*Article) ToJSON

func (a *Article) ToJSON() ([]byte, error)

ToJSON marshals the article into JSON for export.

func (Article) Version

func (a Article) Version() int

Version extracts the numeric version; returns 1 when unspecified.

type ArticleResult

type ArticleResult struct {
	Article *Article
	Err     error
}

ArticleResult represents a streamed pagination value.

type Author

type Author struct {
	Name        string
	Affiliation *string
}

Author captures attribution details for a paper author.

type CategoryInfo

type CategoryInfo struct {
	Code        string `json:"code"`
	Name        string `json:"name"`
	Description string `json:"description"`
	Field       string `json:"field"`
}

CategoryInfo describes an arXiv subject classification entry.

func GetCategoryInfo

func GetCategoryInfo(code string) (*CategoryInfo, error)

GetCategoryInfo returns metadata for the provided code.

func ListCategories

func ListCategories() []CategoryInfo

ListCategories returns a copy of all category infos sorted by code.

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client exposes high-level methods for interacting with the arXiv API.

func New

func New(opts ...Option) (*Client, error)

New constructs a Client using the functional options pattern.

func (*Client) BaseURL

func (c *Client) BaseURL() string

BaseURL returns the effective API endpoint for the client.

func (*Client) DownloadPDF

func (c *Client) DownloadPDF(ctx context.Context, article *Article, opts *DownloadOptions) error

DownloadPDF downloads the PDF for a single article.

func (*Client) DownloadPDFs

func (c *Client) DownloadPDFs(ctx context.Context, articles []*Article, opts *DownloadOptions) error

DownloadPDFs downloads PDFs for multiple articles sequentially.

func (*Client) GetByID

func (c *Client) GetByID(ctx context.Context, id string) (*Article, error)

GetByID fetches a single article by arXiv identifier.

func (*Client) GetByIDs

func (c *Client) GetByIDs(ctx context.Context, ids []string) ([]*Article, error)

GetByIDs fetches multiple articles by their identifiers.

func (*Client) IsDebugMode

func (c *Client) IsDebugMode() bool

IsDebugMode reports whether the client is bypassing rate limiting safeguards.

func (*Client) Iterate

func (c *Client) Iterate(query string, pageSize int, opts *SearchOptions) *Iterator

Iterate constructs an Iterator that pages through results up to 30k articles.

func (*Client) Search

func (c *Client) Search(ctx context.Context, query string, opts *SearchOptions) (*SearchResults, error)

Search executes a query with the provided options and returns a single page.

func (*Client) SearchAll

func (c *Client) SearchAll(ctx context.Context, query string, maxTotal int, opts *SearchOptions) ([]*Article, error)

SearchAll returns up to maxTotal results for a query by automatically paging through responses.
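
As a rough illustration of automatic paging, the sketch below splits a requested total into `(start, size)` windows that respect the two documented constants (2000 per request, 30000 overall). The `planPages` helper is hypothetical and says nothing about how `SearchAll` is implemented; in particular it clamps oversized requests, whereas the SDK may return `ErrMaxResults` instead.

```go
package main

import "fmt"

const (
	maxPerRequest = 2000  // mirrors MaxResultsPerRequest
	maxTotal      = 30000 // mirrors MaxResultsTotal
)

// page is one request window: offset and number of results to ask for.
type page struct{ Start, Size int }

// planPages is a hypothetical helper that splits a request for `want`
// results into windows of at most pageSize results each.
func planPages(want, pageSize int) []page {
	if want > maxTotal {
		want = maxTotal // assumption: clamp rather than fail
	}
	if pageSize <= 0 || pageSize > maxPerRequest {
		pageSize = maxPerRequest
	}
	var pages []page
	for start := 0; start < want; start += pageSize {
		size := pageSize
		if start+size > want {
			size = want - start // the last window may be shorter
		}
		pages = append(pages, page{Start: start, Size: size})
	}
	return pages
}

func main() {
	for _, p := range planPages(250, 100) {
		fmt.Printf("start=%d max_results=%d\n", p.Start, p.Size)
	}
}
```

With one request every 3 seconds, a plan of 15 windows implies at least 42 seconds between the first and last request, which is worth factoring into context deadlines.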

func (*Client) StreamResults

func (c *Client) StreamResults(ctx context.Context, query string, pageSize int, opts *SearchOptions) <-chan ArticleResult
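
StreamResults has no description here, so the following is only a sketch of how a receive-only channel of `ArticleResult`-shaped values is typically consumed: range over the channel, stop at the first non-nil `Err`, and treat channel close as the end of the stream. The local `article`, `articleResult`, `stream`, and `collect` names are hypothetical stand-ins so the example runs without the SDK or network access.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Local stand-ins shaped like the SDK's Article and ArticleResult.
type article struct{ ID, Title string }

type articleResult struct {
	Article *article
	Err     error
}

// stream is a hypothetical producer: it emits items in order and, if
// failAt is a valid index, emits an error there and stops.
func stream(ctx context.Context, items []article, failAt int) <-chan articleResult {
	out := make(chan articleResult)
	go func() {
		defer close(out)
		for i := range items {
			res := articleResult{Article: &items[i]}
			if i == failAt {
				res = articleResult{Err: errors.New("simulated fetch failure")}
			}
			select {
			case out <- res:
			case <-ctx.Done():
				return // consumer went away; avoid leaking the goroutine
			}
			if res.Err != nil {
				return
			}
		}
	}()
	return out
}

// collect drains the channel until it closes or yields an error.
func collect(ch <-chan articleResult) ([]string, error) {
	var ids []string
	for res := range ch {
		if res.Err != nil {
			return ids, res.Err
		}
		ids = append(ids, res.Article.ID)
	}
	return ids, nil
}

// demo streams three made-up articles, optionally failing at one index.
func demo(failAt int) ([]string, error) {
	items := []article{{"a1", "First"}, {"a2", "Second"}, {"a3", "Third"}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // releases the producer if we stop reading early
	return collect(stream(ctx, items, failAt))
}

func main() {
	ids, err := demo(-1)
	fmt.Println(ids, err)
	ids, err = demo(1)
	fmt.Println(ids, err)
}
```

Cancelling the context when abandoning the loop early is the usual way to let the producing goroutine exit.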

type DownloadOptions

type DownloadOptions struct {
	OutputDir      string
	AllowOverwrite bool
	Progress       func(downloaded, total int64)
}

DownloadOptions configures PDF download behavior.
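
To show how a callback with the `Progress` signature can be driven, here is a minimal sketch that counts bytes as they are copied. The `progressWriter` type and `copyWithProgress` function are hypothetical; the SDK's downloader may report progress differently (for example at other granularity).

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// progressWriter counts bytes written through it and reports the running
// total to a callback shaped like DownloadOptions.Progress.
type progressWriter struct {
	total      int64
	downloaded int64
	onProgress func(downloaded, total int64)
}

func (p *progressWriter) Write(b []byte) (int, error) {
	p.downloaded += int64(len(b))
	if p.onProgress != nil {
		p.onProgress(p.downloaded, p.total)
	}
	return len(b), nil
}

// copyWithProgress copies src to dst and invokes cb after each chunk.
func copyWithProgress(dst io.Writer, src io.Reader, total int64, cb func(downloaded, total int64)) (int64, error) {
	pw := &progressWriter{total: total, onProgress: cb}
	return io.Copy(dst, io.TeeReader(src, pw))
}

// demoCopy copies a string to io.Discard and returns the bytes copied
// together with the last value the callback saw.
func demoCopy(payload string) (copied, last int64) {
	cb := func(downloaded, total int64) { last = downloaded }
	n, err := copyWithProgress(io.Discard, strings.NewReader(payload), int64(len(payload)), cb)
	if err != nil {
		return 0, 0
	}
	return n, last
}

func main() {
	copied, last := demoCopy("hello world")
	fmt.Println("copied:", copied, "last progress value:", last)
}
```

A callback that prints a percentage should guard against a non-positive `total`, which is a common way for servers to signal an unknown length.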

type Error

type Error struct {
	Code       string
	Message    string
	StatusCode int
	ID         string
	Err        error
}

Error wraps arXiv-specific metadata around a root cause while remaining compatible with errors.Is/As.

func (*Error) Error

func (e *Error) Error() string

Error implements the error interface.

func (*Error) Is

func (e *Error) Is(target error) bool

Is allows comparisons against other Error instances or sentinel errors.

func (*Error) Unwrap

func (e *Error) Unwrap() error

Unwrap exposes the underlying error for errors.Is/As.
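
The sketch below shows the general pattern behind an error type that stays compatible with `errors.Is` and `errors.As`: a struct carrying metadata plus an `Unwrap` method that exposes the root cause. The `apiError` type, `errRateLimit` sentinel, and `classify` function are hypothetical local stand-ins, not the SDK's `Error` or `ErrRateLimit`.

```go
package main

import (
	"errors"
	"fmt"
)

// errRateLimit is a local sentinel shaped like the SDK's ErrRateLimit.
var errRateLimit = errors.New("arxiv: rate limit exceeded")

// apiError is a hypothetical wrapper carrying metadata around a cause.
type apiError struct {
	Code       string
	StatusCode int
	Err        error
}

func (e *apiError) Error() string {
	return fmt.Sprintf("%s (status %d): %v", e.Code, e.StatusCode, e.Err)
}

// Unwrap lets errors.Is and errors.As look through the wrapper.
func (e *apiError) Unwrap() error { return e.Err }

// classify shows how a caller might branch on the error chain.
func classify(err error) string {
	var apiErr *apiError
	switch {
	case errors.Is(err, errRateLimit):
		return "rate-limited"
	case errors.As(err, &apiErr):
		return "api:" + apiErr.Code
	default:
		return "other"
	}
}

func main() {
	wrapped := fmt.Errorf("search failed: %w",
		&apiError{Code: "too_many_requests", StatusCode: 429, Err: errRateLimit})
	fmt.Println(classify(wrapped))
	fmt.Println(classify(&apiError{Code: "bad_request", StatusCode: 400}))
	fmt.Println(classify(errors.New("unrelated")))
}
```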

type FixedWindowLimiter

type FixedWindowLimiter struct {
	// contains filtered or unexported fields
}

FixedWindowLimiter enforces a minimum delay between requests.

func NewRateLimiter

func NewRateLimiter(interval time.Duration, debug bool) *FixedWindowLimiter

NewRateLimiter returns a limiter configured with the desired interval and debug mode.

func (*FixedWindowLimiter) IsDebugMode

func (l *FixedWindowLimiter) IsDebugMode() bool

IsDebugMode reports whether the limiter bypasses waiting.

func (*FixedWindowLimiter) Wait

func (l *FixedWindowLimiter) Wait(ctx context.Context) error

Wait blocks until the next request is allowed or the context is cancelled.
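
A minimal sketch of a limiter with this `Wait` contract is shown below, assuming a simple "reserve the next slot under a mutex, then sleep" design. The `intervalLimiter` type is hypothetical and is not the SDK's `FixedWindowLimiter`; note that in this sketch a cancelled caller still consumes its reserved slot.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// intervalLimiter is a hypothetical stand-in for a minimum-delay limiter.
type intervalLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	next     time.Time // earliest time the next request may start
}

func newIntervalLimiter(interval time.Duration) *intervalLimiter {
	return &intervalLimiter{interval: interval}
}

// Wait blocks until this caller's slot arrives, or returns ctx.Err()
// if the context ends first.
func (l *intervalLimiter) Wait(ctx context.Context) error {
	l.mu.Lock()
	now := time.Now()
	start := l.next
	if start.Before(now) {
		start = now
	}
	l.next = start.Add(l.interval) // reserve the slot for this caller
	l.mu.Unlock()

	timer := time.NewTimer(start.Sub(now))
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// waitTwice reports how long two back-to-back Wait calls take.
func waitTwice(interval time.Duration) time.Duration {
	l := newIntervalLimiter(interval)
	begin := time.Now()
	_ = l.Wait(context.Background())
	_ = l.Wait(context.Background())
	return time.Since(begin)
}

// waitCancelled calls Wait with an already-cancelled context while the
// next slot is an hour away, so only the context can end the wait.
func waitCancelled() error {
	l := newIntervalLimiter(time.Hour)
	_ = l.Wait(context.Background()) // the first call passes immediately
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return l.Wait(ctx)
}

func main() {
	// A short interval keeps the demo quick; arXiv itself requires 3 seconds.
	fmt.Println("two calls took", waitTwice(100*time.Millisecond).Round(10*time.Millisecond))
	fmt.Println("cancelled wait returned:", waitCancelled())
}
```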

type Iterator

type Iterator struct {
	// contains filtered or unexported fields
}

Iterator streams articles across paginated API responses.

func (*Iterator) Article

func (it *Iterator) Article() *Article

Article returns the current article if Next returned true.

func (*Iterator) Err

func (it *Iterator) Err() error

Err exposes the iterator's terminal error.

func (*Iterator) Next

func (it *Iterator) Next(ctx context.Context) bool

Next fetches the next article into the iterator buffer.

func (*Iterator) TotalResults

func (it *Iterator) TotalResults() int

TotalResults returns the total number of matches reported by arXiv.

type Link

type Link struct {
	Href        string
	Rel         string
	Title       *string
	ContentType *string
}

Link describes related resources that accompany an article.

type Option

type Option func(*config) error

Option mutates Client configuration.

func WithBaseURL

func WithBaseURL(url string) Option

WithBaseURL overrides the default arXiv endpoint.

func WithDebugMode

func WithDebugMode() Option

WithDebugMode disables rate limiting safeguards. WARNING: violates arXiv ToU; testing only.

func WithHTTPClient

func WithHTTPClient(c *http.Client) Option

WithHTTPClient injects a custom HTTP client.

func WithRateLimit

func WithRateLimit(delay time.Duration) Option

WithRateLimit sets the minimum delay between requests (floor at arXiv's 3s limit).
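
The floor behavior can be pictured with a small functional-options sketch: each option is a function that mutates a private config, and the rate-limit option raises any value below 3 seconds. The `config` fields, `withDelay`, `withRetries`, and `build` are hypothetical names; the SDK's internal config and validation rules are not shown in this documentation, and rejecting negative retry counts is purely an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const minDelay = 3 * time.Second // mirrors MinRateLimit

// config is a hypothetical private settings struct.
type config struct {
	delay   time.Duration
	retries int
}

// option mutates config and may reject invalid input.
type option func(*config) error

// withDelay sets the delay between requests, never below minDelay.
func withDelay(d time.Duration) option {
	return func(c *config) error {
		if d < minDelay {
			d = minDelay // floor at the mandated minimum
		}
		c.delay = d
		return nil
	}
}

// withRetries sets the retry count (assumption: negatives are rejected).
func withRetries(n int) option {
	return func(c *config) error {
		if n < 0 {
			return errors.New("retries must be non-negative")
		}
		c.retries = n
		return nil
	}
}

// build applies options in order on top of safe defaults.
func build(opts ...option) (config, error) {
	cfg := config{delay: minDelay}
	for _, opt := range opts {
		if err := opt(&cfg); err != nil {
			return config{}, err
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := build(withDelay(time.Second), withRetries(2))
	fmt.Println(cfg.delay, cfg.retries, err)
}
```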

func WithRateLimiter

func WithRateLimiter(limiter RateLimiter) Option

WithRateLimiter supplies a custom rate limiter implementation (advanced usage).

func WithRequestHook

func WithRequestHook(hook RequestHook) Option

WithRequestHook registers a hook invoked before each HTTP request.

func WithResponseHook

func WithResponseHook(hook ResponseHook) Option

WithResponseHook registers a hook invoked after each HTTP response.

func WithRetries

func WithRetries(maxRetries int) Option

WithRetries sets the number of retry attempts for transient errors.

func WithTimeout

func WithTimeout(timeout time.Duration) Option

WithTimeout configures the HTTP client timeout used for requests.

func WithUserAgent

func WithUserAgent(extra string) Option

WithUserAgent appends identifying text to the default SDK User-Agent string.

type Query

type Query struct {
	// contains filtered or unexported fields
}

Query represents a fluent builder for arXiv query parameters.

func NewQuery

func NewQuery() *Query

NewQuery creates a Query with sane defaults.

func (*Query) Encode

func (q *Query) Encode() url.Values

Encode builds the query parameters expected by the API.
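
For orientation, the sketch below assembles the query parameters that the public arXiv API accepts (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) into `url.Values`. The `encodeQuery` function is hypothetical; which parameters `Query.Encode` actually emits, and in what form, is not specified on this page.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// encodeQuery is a hypothetical helper that builds arXiv API parameters.
// Empty sort values are omitted so the API applies its defaults.
func encodeQuery(search string, start, maxResults int, sortBy, sortOrder string) url.Values {
	v := url.Values{}
	v.Set("search_query", search)
	v.Set("start", strconv.Itoa(start))
	v.Set("max_results", strconv.Itoa(maxResults))
	if sortBy != "" {
		v.Set("sortBy", sortBy)
	}
	if sortOrder != "" {
		v.Set("sortOrder", sortOrder)
	}
	return v
}

func main() {
	v := encodeQuery("cat:cs.LG AND ti:attention", 0, 50, "submittedDate", "descending")
	// The public arXiv endpoint; a client configured with WithBaseURL would differ.
	fmt.Println("http://export.arxiv.org/api/query?" + v.Encode())
}
```

`url.Values.Encode` sorts keys and percent-encodes values, so the search clause can be written with plain spaces and colons.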

func (*Query) IDs

func (q *Query) IDs(ids ...string) *Query

IDs filters the query to explicit identifiers.

func (*Query) MaxResults

func (q *Query) MaxResults(limit int) *Query

MaxResults caps the number of results returned.

func (*Query) Search

func (q *Query) Search(clause string) *Query

Search adds a raw search clause.

func (*Query) Sort

func (q *Query) Sort(by SortBy, order SortOrder) *Query

Sort configures sorting behavior.

func (*Query) Start

func (q *Query) Start(start int) *Query

Start sets the pagination offset.

func (*Query) Where

func (q *Query) Where(builder *QueryBuilder) *Query

Where attaches the result of a QueryBuilder to this query.

type QueryBuilder

type QueryBuilder struct {
	// contains filtered or unexported fields
}

QueryBuilder provides a fluent API for constructing search_query strings with field prefixes.

func NewQueryBuilder

func NewQueryBuilder() *QueryBuilder

NewQueryBuilder constructs a builder with no clauses.

func (*QueryBuilder) Abstract

func (b *QueryBuilder) Abstract(text string) *QueryBuilder

Abstract searches within the abstract field (abs: prefix).

func (*QueryBuilder) AllFields

func (b *QueryBuilder) AllFields(text string) *QueryBuilder

AllFields performs a generic search across title, abstract, and comments (all: prefix).

func (*QueryBuilder) And

func (b *QueryBuilder) And() *QueryBuilder

And inserts a logical AND between clauses.

func (*QueryBuilder) AndNot

func (b *QueryBuilder) AndNot() *QueryBuilder

AndNot inserts a logical AND NOT between clauses.

func (*QueryBuilder) Author

func (b *QueryBuilder) Author(name string) *QueryBuilder

Author searches within the author field (au: prefix).

func (*QueryBuilder) Build

func (b *QueryBuilder) Build() string

Build returns the final query string (already URL-safe).

func (*QueryBuilder) Category

func (b *QueryBuilder) Category(code string) *QueryBuilder

Category filters results by arXiv subject category (cat: prefix).

func (*QueryBuilder) Comment

func (b *QueryBuilder) Comment(text string) *QueryBuilder

Comment searches within the comments field (co: prefix).

func (*QueryBuilder) HasClauses

func (b *QueryBuilder) HasClauses() bool

HasClauses reports whether the builder contains any query clauses.

func (*QueryBuilder) JournalRef

func (b *QueryBuilder) JournalRef(ref string) *QueryBuilder

JournalRef searches within the journal reference field (jr: prefix).

func (*QueryBuilder) Or

func (b *QueryBuilder) Or() *QueryBuilder

Or inserts a logical OR between clauses.

func (*QueryBuilder) Raw

func (b *QueryBuilder) Raw(clause string) *QueryBuilder

Raw appends a literal clause, useful for advanced filters.

func (*QueryBuilder) ReportNumber

func (b *QueryBuilder) ReportNumber(rn string) *QueryBuilder

ReportNumber searches within the report number field (rn: prefix).

func (*QueryBuilder) SubmittedAfter

func (b *QueryBuilder) SubmittedAfter(t time.Time) *QueryBuilder

SubmittedAfter restricts submissions to those after the provided time.

func (*QueryBuilder) SubmittedBefore

func (b *QueryBuilder) SubmittedBefore(t time.Time) *QueryBuilder

SubmittedBefore restricts submissions to those before the provided time.

func (*QueryBuilder) SubmittedBetween

func (b *QueryBuilder) SubmittedBetween(start, end time.Time) *QueryBuilder

SubmittedBetween restricts submissions to a specific time window.
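
The arXiv API expresses submission windows as a `submittedDate:[YYYYMMDDTTTT TO YYYYMMDDTTTT]` clause with times in GMT to the minute. The sketch below formats such a clause from two `time.Time` values; `submittedBetween` and `clauseFor` are hypothetical helpers, and whether the builder emits exactly this string (or validates ranges the same way) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// arxivStamp is the Go layout for YYYYMMDDHHMM.
const arxivStamp = "200601021504"

// submittedBetween is a hypothetical helper that renders a date-range
// clause and rejects windows whose end precedes their start.
func submittedBetween(start, end time.Time) (string, error) {
	if end.Before(start) {
		return "", errors.New("end of range precedes start")
	}
	return fmt.Sprintf("submittedDate:[%s TO %s]",
		start.UTC().Format(arxivStamp), end.UTC().Format(arxivStamp)), nil
}

// clauseFor builds a clause from two calendar dates at midnight UTC.
func clauseFor(y1, m1, d1, y2, m2, d2 int) (string, error) {
	start := time.Date(y1, time.Month(m1), d1, 0, 0, 0, 0, time.UTC)
	end := time.Date(y2, time.Month(m2), d2, 0, 0, 0, 0, time.UTC)
	return submittedBetween(start, end)
}

func main() {
	clause, err := clauseFor(2024, 1, 1, 2024, 2, 1)
	fmt.Println(clause, err)
}
```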

func (*QueryBuilder) Title

func (b *QueryBuilder) Title(text string) *QueryBuilder

Title searches within the title field (ti: prefix).

func (*QueryBuilder) TitleExact

func (b *QueryBuilder) TitleExact(text string) *QueryBuilder

TitleExact searches for an exact phrase within the title.

func (*QueryBuilder) Validate

func (b *QueryBuilder) Validate() error

Validate checks for empty queries, invalid categories, and malformed date ranges.

type RateLimiter

type RateLimiter interface {
	Wait(ctx context.Context) error
	IsDebugMode() bool
}

RateLimiter gates outbound requests to honor API constraints.

type RequestHook

type RequestHook func(ctx context.Context, req *http.Request) error

RequestHook runs before an HTTP request is sent.

type ResponseHook

type ResponseHook func(ctx context.Context, resp *http.Response, duration time.Duration) error

ResponseHook runs after an HTTP response is received.
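
The following sketch shows one plausible way hooks with these two signatures could wrap a request: the request hook runs before sending and may abort, and the response hook receives the response plus the elapsed time. The `doWithHooks` function, the fake transport, and the `X-Trace-ID` header are hypothetical; the SDK's actual call order and error handling are not documented here. The fake `RoundTripper` keeps the example off the network.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// Local types shaped like the SDK's RequestHook and ResponseHook.
type requestHook func(ctx context.Context, req *http.Request) error
type responseHook func(ctx context.Context, resp *http.Response, duration time.Duration) error

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// doWithHooks is a hypothetical helper that runs the hooks around one GET.
func doWithHooks(ctx context.Context, c *http.Client, url string, before requestHook, after responseHook) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if before != nil {
		if err := before(ctx, req); err != nil {
			return nil, err // a failing request hook aborts the call
		}
	}
	begin := time.Now()
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	if after != nil {
		if err := after(ctx, resp, time.Since(begin)); err != nil {
			resp.Body.Close()
			return nil, err
		}
	}
	return resp, nil
}

// demoHooks returns the header the fake server saw and the status code
// recorded by the response hook.
func demoHooks() (seen string, status int, err error) {
	transport := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("X-Trace-ID")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("ok")),
			Request:    r,
		}, nil
	})
	client := &http.Client{Transport: transport}
	before := func(_ context.Context, req *http.Request) error {
		req.Header.Set("X-Trace-ID", "abc123")
		return nil
	}
	after := func(_ context.Context, resp *http.Response, _ time.Duration) error {
		status = resp.StatusCode
		return nil
	}
	resp, err := doWithHooks(context.Background(), client, "http://example.invalid/api", before, after)
	if err != nil {
		return seen, status, err
	}
	resp.Body.Close()
	return seen, status, nil
}

func main() {
	seen, status, err := demoHooks()
	fmt.Println(seen, status, err)
}
```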

type SearchOptions

type SearchOptions struct {
	Start      int
	MaxResults int
	SortBy     SortBy
	SortOrder  SortOrder
}

SearchOptions controls pagination and sorting for arXiv queries.

type SearchResults

type SearchResults struct {
	Articles     []Article
	TotalResults int
	StartIndex   int
	ItemsPerPage int
	Query        string
}

SearchResults wraps paginated metadata returned from the API.

func ParseSearchResults

func ParseSearchResults(r io.Reader) (SearchResults, error)

ParseSearchResults converts Atom XML into typed SearchResults structures.
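
To give a feel for the Atom-to-struct conversion, the sketch below decodes a tiny hand-written feed with `encoding/xml`. The `atomFeed` and `atomEntry` types, the `parseFeed` function, and the sample document (including its made-up identifiers) are hypothetical; real arXiv responses carry many more elements, and the SDK's internal bindings live in its `internal/atom` package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// atomFeed and atomEntry are hypothetical bindings for a few Atom elements.
// Tags without a namespace match elements by local name.
type atomFeed struct {
	Total   int         `xml:"totalResults"`
	Entries []atomEntry `xml:"entry"`
}

type atomEntry struct {
	ID    string `xml:"id"`
	Title string `xml:"title"`
}

// parseFeed decodes one Atom document from r.
func parseFeed(r io.Reader) (atomFeed, error) {
	var f atomFeed
	err := xml.NewDecoder(r).Decode(&f)
	return f, err
}

// parseString is a convenience wrapper for in-memory documents.
func parseString(s string) (atomFeed, error) {
	return parseFeed(strings.NewReader(s))
}

// sample is a hand-written feed with made-up entries, not real API output.
const sample = `<feed xmlns="http://www.w3.org/2005/Atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <opensearch:totalResults>2</opensearch:totalResults>
  <entry><id>http://arxiv.org/abs/2101.00001v1</id><title>First</title></entry>
  <entry><id>http://arxiv.org/abs/2101.00002v2</id><title>Second</title></entry>
</feed>`

func main() {
	feed, err := parseString(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("total:", feed.Total)
	for _, e := range feed.Entries {
		fmt.Println(e.ID, e.Title)
	}
}
```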

type SortBy

type SortBy string

SortBy enumerates supported sort keys.

const (
	SortByRelevance       SortBy = "relevance"
	SortByLastUpdatedDate SortBy = "lastUpdatedDate"
	SortBySubmittedDate   SortBy = "submittedDate"
)

type SortOrder

type SortOrder string

SortOrder enumerates valid ordering options.

const (
	SortOrderAscending  SortOrder = "ascending"
	SortOrderDescending SortOrder = "descending"
)

Directories

Path               Synopsis
examples/download  command
examples/paginate  command
examples/search    command
internal/atom      Package atom contains low-level XML bindings used by the higher level parser.
