A Go package to parse XML Sitemaps compliant with the Sitemaps.org protocol.
For information on reporting security vulnerabilities, see SECURITY.md.
- Recursive parsing (sitemap index → sitemaps → URLs)
- Concurrent (multi-threaded) fetching and parsing
- Configurable follow rules to filter which sitemaps to parse
- Configurable URL rules to filter which URLs to include
- Configurable HTTP response size limit
- Tolerant mode (default): resolves relative URLs in `<loc>` elements; rejects URLs exceeding 2,048 characters after resolution
- Strict mode: validates URLs per the sitemaps.org specification
- Google Image Sitemap extension (`<image:image>`)
- Google News Sitemap extension (`<news:news>`)
- Google Video Sitemap extension (`<video:video>`)
- XHTML hreflang extension (`<xhtml:link>`)
- Typed errors: `*ConfigError`, `*NetworkError`, `*ParseError`, `*ValidationError` — inspectable via `errors.As`
- Thread-safe

Supported input formats:

- `robots.txt`
- XML (`.xml`)
- RSS 2.0
- Atom 1.0
- Plain text (`.txt`)
- Gzip compressed files (e.g., `.xml.gz`, `.txt.gz`)
```shell
go get github.com/aafeher/go-sitemap-parser
```

```go
import "github.com/aafeher/go-sitemap-parser"
```

To create a new instance with default settings, you can simply call the New() function.

```go
s := sitemap.New()
```

The default settings are:

- userAgent: `"go-sitemap-parser (+https://github.com/aafeher/go-sitemap-parser/blob/main/README.md)"`
- fetchTimeout: `3` seconds
- maxResponseSize: `52428800` (50 MB)
- maxDepth: `10`
- maxConcurrency: `16`
- multiThread: `true`
- strict: `false`
- httpClient: `nil` (a default `*http.Client` is created per call with the configured `fetchTimeout`)
To set the user agent, use the SetUserAgent() function.
```go
s := sitemap.New()
s = s.SetUserAgent("YourUserAgent")
```

... or ...

```go
s := sitemap.New().SetUserAgent("YourUserAgent")
```

To set the fetch timeout, use the SetFetchTimeout() function. It should be specified in seconds as a uint16 value (1–65535 seconds). A value of 0 is rejected and a *ConfigError is recorded. Note: when a custom HTTP client is set via SetHTTPClient(), this value has no effect — the client's own Timeout field controls the request deadline.

```go
s := sitemap.New()
s = s.SetFetchTimeout(10)
```

... or ...

```go
s := sitemap.New().SetFetchTimeout(10)
```

To set the maximum allowed HTTP response size, use the SetMaxResponseSize() function. It should be specified in bytes as an int64 value. The default is 50 MB, matching the sitemaps.org protocol limit. Responses exceeding this limit will result in an error.

```go
s := sitemap.New()
s = s.SetMaxResponseSize(10 * 1024 * 1024) // 10 MB
```

... or ...

```go
s := sitemap.New().SetMaxResponseSize(10 * 1024 * 1024) // 10 MB
```

To set the maximum recursion depth for following sitemap indexes, use the SetMaxDepth() function. A sitemap index may reference other sitemap indexes; this limits how many levels deep the parser will follow. The default is 10.

```go
s := sitemap.New()
s = s.SetMaxDepth(5)
```

... or ...

```go
s := sitemap.New().SetMaxDepth(5)
```

See examples/maxdepth for a runnable example.
When multi-threaded parsing is enabled, the parser spawns one goroutine per sitemap location and per robots.txt sitemap directive. For very large sitemap indexes this can lead to a large number of concurrent goroutines and HTTP connections. To bound the maximum number of in-flight fetches across the whole Parse() / ParseContext() call, use the SetMaxConcurrency() function.
The value is an int:

- `0`: unlimited concurrency
- a positive value: at most that many concurrent fetches will run at any time

The default is `16`. Negative values are rejected and an error is recorded in GetErrors().

```go
s := sitemap.New()
s = s.SetMaxConcurrency(8)
```

... or ...

```go
s := sitemap.New().SetMaxConcurrency(8)
```

Cancelling the supplied context.Context while goroutines are queued for a slot causes them to return immediately with the context error, just like an in-flight fetch.
By default, the package uses multi-threading to fetch and parse sitemaps concurrently.
To set the multi-thread flag on/off, use the SetMultiThread() function.
```go
s := sitemap.New()
s = s.SetMultiThread(false)
```

... or ...

```go
s := sitemap.New().SetMultiThread(false)
```

To set the follow rules, use the SetFollow() function. It takes a []string value.
It is a list of regular expressions. When parsing a sitemap index, only sitemaps with a loc that matches one of these expressions will be followed and parsed.
If no follow rules are provided, all sitemaps in the index are followed.
Patterns longer than 1,000 characters are rejected and reported via GetErrors().
```go
s := sitemap.New()
s.SetFollow([]string{
	`\.xml$`,
	`\.xml\.gz$`,
})
```

... or ...

```go
s := sitemap.New().SetFollow([]string{
	`\.xml$`,
	`\.xml\.gz$`,
})
```

To set the URL rules, use the SetRules() function. It takes a []string value.
It is a list of regular expressions. Only URLs that match one of these expressions will be included in the final result.
If no rules are provided, all URLs found are included.
Patterns longer than 1,000 characters are rejected and reported via GetErrors().
```go
s := sitemap.New()
s.SetRules([]string{
	`product/`,
	`category/`,
})
```

... or ...

```go
s := sitemap.New().SetRules([]string{
	`product/`,
	`category/`,
})
```

To use a custom HTTP client for all requests, use the SetHTTPClient() function.
This is useful when you need a custom transport, proxy, TLS configuration, or
authentication via a custom http.RoundTripper.
When a custom client is provided, SetFetchTimeout has no effect — the client's
own Timeout field controls the request deadline. Pass nil to reset to the
default client behaviour.
```go
s := sitemap.New()
s = s.SetHTTPClient(&http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	},
})
```

... or ...

```go
s := sitemap.New().SetHTTPClient(&http.Client{Timeout: 30 * time.Second})
```

See examples/httpclient for a runnable example.
By default, the parser operates in tolerant mode: relative URLs found in <loc> elements are automatically resolved against the parent sitemap URL. This handles real-world sitemaps that may not fully comply with the specification.
To enable strict mode, use the SetStrict() function. In strict mode, all URL entries are validated per the sitemaps.org protocol:
- `<loc>` must be an absolute HTTP or HTTPS URL
- `<loc>` must use the same host and protocol as the sitemap file
- `<loc>` must not exceed 2,048 characters
- `<priority>` must be between `0.0` and `1.0` inclusive (if present)

In tolerant mode (the default):

- Relative `<loc>` URLs are resolved against the parent sitemap URL
- `<loc>` URLs exceeding 2,048 characters after resolution are rejected
- `<priority>` values outside `[0.0, 1.0]` are accepted as-is

Entries that fail validation are skipped and reported via GetErrors().

```go
s := sitemap.New()
s = s.SetStrict(true)
```

... or ...

```go
s := sitemap.New().SetStrict(true)
```

In both cases, the functions return a pointer to the main object of the package, allowing you to chain these setting methods in a fluent interface style:

```go
s := sitemap.New().SetUserAgent("YourUserAgent").SetFetchTimeout(10)
```

Each configuration setting can be read back via a corresponding Get* method. All getters are thread-safe.
| Getter | Return type | Description |
|---|---|---|
| `GetUserAgent()` | `string` | Current user agent string |
| `GetFetchTimeout()` | `uint16` | Fetch timeout in seconds |
| `GetMultiThread()` | `bool` | Whether multi-threaded fetching is enabled |
| `GetMaxResponseSize()` | `int64` | Maximum HTTP response size in bytes |
| `GetMaxDepth()` | `int` | Maximum sitemap index recursion depth |
| `GetMaxConcurrency()` | `int` | Maximum concurrent fetches (0 = unlimited) |
| `GetFollow()` | `[]string` | Copy of the follow regex pattern list |
| `GetRules()` | `[]string` | Copy of the URL filter regex pattern list |
| `GetHTTPClient()` | `*http.Client` | Custom HTTP client, or nil if using the default |
| `GetStrict()` | `bool` | Whether strict validation mode is enabled |

GetFollow() and GetRules() return copies — mutating the returned slice does not affect the parser's internal state.

```go
s := sitemap.New().SetMaxConcurrency(8).SetStrict(true)
fmt.Println(s.GetMaxConcurrency()) // 8
fmt.Println(s.GetStrict())         // true
```

All public methods on *S are safe to call from multiple goroutines. Internal state (configuration, collected URLs, errors) is protected by a mutex.
However, two important constraints apply:
- Concurrent Parse()/ParseContext() calls on the same instance are serialised. A second call blocks until the first completes. If you need to parse multiple sitemaps concurrently, create a separate *S instance per goroutine with New().
- Configure before parsing. Calling a Set* method while Parse() is running on the same instance is safe (the write is mutex-protected), but the outcome is non-deterministic — the new value may or may not be picked up mid-parse. Set all options before calling Parse().
Deadlock note: when SetMaxConcurrency is used together with a robots.txt entry that lists multiple sitemaps, the semaphore slot is released immediately after each HTTP fetch and before the recursive parse step. This prevents goroutines from holding a slot while waiting for a child fetch slot, which would otherwise deadlock.
Once you have properly initialized and configured your instance, you can parse sitemaps using the Parse() function.
The Parse() function takes in two parameters:
- url: the URL of the sitemap to be parsed; it can be a robots.txt, a sitemapindex, or a sitemap (urlset)
- urlContent: an optional string pointer for the content of the URL
If you wish to provide the content yourself, pass the content as the second parameter. If not, simply pass nil and the function will fetch the content on its own.
The Parse() function performs concurrent parsing and fetching optimized by the use of Go's goroutines and sync package, ensuring efficient sitemap handling.
```go
s, err := s.Parse("https://www.sitemaps.org/sitemap.xml", nil)
```

In this example, the sitemap is parsed from "https://www.sitemaps.org/sitemap.xml". The function fetches the content itself, as we passed nil as the urlContent.
For new code, prefer ParseContext() so that callers can propagate cancellation
and deadlines to every HTTP request issued by the parser (the initial fetch as
well as the recursive sitemap-index/urlset fetches). The legacy Parse() is a
backward-compatible wrapper around ParseContext() that uses
context.Background().
```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

s, err := sitemap.New().ParseContext(ctx, "https://www.sitemaps.org/sitemap.xml", nil)
```

Cancelling ctx aborts in-flight downloads and prevents new ones from starting.
Already-parsed URLs accumulated before cancellation remain available via
GetURLs(); the cancellation cause is also recorded in the error list and
returned by ParseContext.
See examples/context for a runnable example.
After parsing, you can retrieve the results using the following methods:
Returns all parsed URLs as a []URL slice.

```go
urls := s.GetURLs()
```

Each URL struct contains the following fields:

- `Loc` (string) — the URL location
- `LastMod` (*LastModTime) — last modification time (embeds time.Time), may be nil
- `ChangeFreq` (*URLChangeFreq) — change frequency hint, may be nil. Use the exported constants for comparison: `ChangeFreqAlways`, `ChangeFreqHourly`, `ChangeFreqDaily`, `ChangeFreqWeekly`, `ChangeFreqMonthly`, `ChangeFreqYearly`, `ChangeFreqNever`
- `Priority` (*float32) — crawl priority between 0.0 and 1.0, may be nil
- `Images` ([]Image) — images associated with this URL via the Google Image Sitemap extension, may be nil
- `News` (*News) — news metadata associated with this URL via the Google News Sitemap extension, may be nil
- `Videos` ([]Video) — videos associated with this URL via the Google Video Sitemap extension, may be nil
- `Hreflangs` ([]AlternateLink) — alternate language/region versions of this URL via the XHTML extension, may be nil
Each Image struct contains the following fields (all string):

- `Loc` — image URL (required by the spec; images with an empty `Loc` are silently dropped in tolerant mode, or produce an error in strict mode)
- `Title` — image title (optional)
- `Caption` — image caption (optional)
- `GeoLocation` — geographic location of the image subject (optional)
- `License` — URL of the image licence (optional)
See examples/image for a runnable example.
Each News struct contains:
- `Publication` (NewsPublication) — publication metadata:
  - `Name` (string) — publication name (required in strict mode)
  - `Language` (string) — BCP 47 language code, e.g. `"en"` (required in strict mode)
- `PublicationDate` (*LastModTime) — article publication date; embeds time.Time, may be nil if absent (required in strict mode)
- `Title` (string) — article title (required in strict mode)
In strict mode, all four required fields (Title, Publication.Name, Publication.Language, PublicationDate) must be present; missing fields are each reported via GetErrors() and the News entry is still included with whatever data was parsed. In tolerant mode no validation is performed.
See examples/news for a runnable example.
Each AlternateLink struct contains:
- `Rel` (string) — relationship, should be `"alternate"`
- `Hreflang` (string) — language/region code (e.g. `"en"`, `"de-ch"`)
- `Href` (string) — the URL of the alternate version
See examples/hreflang for a runnable example.
Each Video struct contains:
- `ThumbnailLoc` (string) — thumbnail image URL (required; videos with an empty `ThumbnailLoc` are silently dropped in tolerant mode, or produce an error in strict mode)
- `Title` (string) — video title (required in strict mode)
- `Description` (string) — video description (required in strict mode)
- `ContentLoc` (string) — direct URL to the video file (at least one of `ContentLoc` or `PlayerLoc` required in strict mode)
- `PlayerLoc` (string) — URL of an embedded video player
- `Duration` (*int) — duration in seconds (1–28800); validated in strict mode if present
- `ExpirationDate` (*LastModTime) — date after which the video should not be shown; embeds time.Time, may be nil
- `Rating` (*float32) — rating between 0.0 and 5.0; validated in strict mode if present
- `ViewCount` (*int) — number of views
- `PublicationDate` (*LastModTime) — publication date; embeds time.Time, may be nil
- `FamilyFriendly` (string) — `"yes"` or `"no"`
- `Restriction` (*VideoRestriction) — country restriction with `Relationship` (`"allow"`/`"deny"`) and `Value` (space-separated country codes)
- `Platform` (*VideoPlatform) — platform restriction with `Relationship` and `Value` (e.g. `"web mobile tv"`)
- `RequiresSubscription` (string) — `"yes"` or `"no"`
- `Uploader` (*VideoUploader) — uploader name (`Value`) and optional profile URL (`Info`)
- `Live` (string) — `"yes"` or `"no"`
- `Tags` ([]string) — content tags; maximum 32 validated in strict mode
See examples/video for a runnable example.
Returns the number of parsed URLs.

```go
count := s.GetURLCount()
```

Returns a slice of n randomly selected URLs without duplicates.

```go
randomURLs := s.GetRandomURLs(5)
```

Returns all errors encountered during parsing.

```go
errs := s.GetErrors()
```

Errors are typed and can be inspected with errors.As:
| Type | When returned | Useful fields |
|---|---|---|
| `*ConfigError` | A Set* method received an invalid value | `Field` (setting name), `Err` (root cause) |
| `*NetworkError` | An HTTP fetch failed | `URL` (requested URL), `Err` (root cause) |
| `*ParseError` | XML or gzip parsing failed | `URL` (sitemap URL), `Err` (root cause) |
| `*ValidationError` | A URL or field value failed validation | `URL` (value being validated), `Err` (root cause) |
All types implement Unwrap(), enabling errors.Is traversal to the root cause.
```go
for _, err := range s.GetErrors() {
	var netErr *sitemap.NetworkError
	if errors.As(err, &netErr) {
		fmt.Printf("fetch failed for %s: %v\n", netErr.URL, netErr.Err)
		continue
	}
	var valErr *sitemap.ValidationError
	if errors.As(err, &valErr) {
		fmt.Printf("validation error for %s: %v\n", valErr.URL, valErr.Err)
		continue
	}
}
```

See examples/errors for a runnable example.
Returns the number of errors encountered during parsing.
```go
errCount := s.GetErrorsCount()
```

Examples can be found in /examples.