Sitemaps: remove post limit on sitemap#5503
Conversation
modules/sitemaps/sitemap-buffer.php
Outdated
There was a problem hiding this comment.
I would suggest checking if mb_strlen exists before to use it, as some hosts have it disabled.
There was a problem hiding this comment.
On further inspection mb_strlen was the wrong thing to use here anyway- it specifically counts characters instead of bytes, which is what we need. I changed this to use just strlen instead, although reading some documentation I'm concerned that this will not do the right thing either as some hosts alias strlen to mb_strlen.
modules/sitemaps/sitemap-buffer.php
Outdated
There was a problem hiding this comment.
It might be good to check your use of spaces and yoda conditions across your changes, to match WordPress' coding standards>
You can use Code Sniffer rules to make that process easier.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
Did you deprecate this filter? If so, you can add it to the list here.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
You can probably remove this comment now.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
This would need a textdomain so it can be translated.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
You'll probably need to take the jetpack_sitemap_location filter into account, since some people may use the filter to define a custom sitemap location.
There was a problem hiding this comment.
The jetpack_sitemap_location filter is now respected. The way it's used did change a bit though: long story short, instead of setting the location as /path/to/the/sitemap.xml we only set the location to /path/to/the/.
There was a problem hiding this comment.
Can you explain the long story? :) Why is it important that jetpack_sitemap_location changed behavior? Is it possible to implement this without changing the behavior?
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
It'd be nice to keep that filter around somehow, it's useful for some site owners.
There was a problem hiding this comment.
The jetpack_sitemap_skip_post filter is now respected. As far as I can tell there is no change in behavior compared to the old version.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
Some site owners already use that filter, so I think it would be nice to keep it.
There was a problem hiding this comment.
As of 7afe5e7 the jetpack_print_sitemap filter is respected.
I'd be interested in seeing what this is used for. If this filter is kept there probably need to be analogous filters for image sitemaps and sitemap indices.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
Another filter to add to the list of deprecated hooks.
|
Hi Nathan, it's working now 🎉 I tried accessing |
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
This call uses 3 parameters but query_posts_after_id only accepts 2.
There was a problem hiding this comment.
I fixed this by making the logger object a class member, rather than passing one around.
modules/sitemaps/sitemap-stylist.php
Outdated
There was a problem hiding this comment.
The selectors #description, #content, .odd, #footer, img.thumbnail are never used.
There was a problem hiding this comment.
#description, .odd, #content, #footer, img.thumbnail
In the browser sitemap.xml is transformed by an XML stylesheet, these are used in the transformed HTML. You can see the stylesheet at sitemap.xsl.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
What's the advantage of creating this inline function, against a foreach that goes through array elements defined as
(
'jp_sitemap_master',
'Sitemap Master',
'jetpack-sitemap-master'
)
and calls register_post_type directly?
There was a problem hiding this comment.
I can't think of one. :) The latest commit uses an array with foreach.
There was a problem hiding this comment.
Scratch that- sitemaps are now stored in a custom database table, so this code is not relevant anymore.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
How can we handle default permalinks and PATHINFO permalinks?
There was a problem hiding this comment.
Does the URL of robots.txt depend on the permalink settings? (Edit: I just checked and omg it does. I don't know what's real anymore :))
This is doable, though I have to confess I'm not sure what the benefit is given that sitemaps are meant for robots. But then so is robots.txt.
Edit again- this module hijacks the usual URL routing by looking at $_SERVER['REQUEST_URI'] directly. Does this cause problems with any of the permalink schemes?
There was a problem hiding this comment.
The latest commit handles plain and PATHINFO permalinks. Plain permalinks are served at /?jetpack-sitemap=sitemap.xml.
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
Is it possible to register only the post type that is needed at a certain time? So for example if I'm seeing the image sitemap, I can't skip registering this?
Why did you choose a custom post type to model, well, other post types? What is the advantage of this vs rows in a custom database table?
There was a problem hiding this comment.
What's really needed is a way to store (1) the generated sitemap xml, with (2) a name, (3) a unique numeric ID, and (4) a timestamp, which are needed during the generation phase. Using a custom post type seemed like a natural fit for this. (Also this is how msm-sitemap does it, if I understand correctly.)
But also I did not realize that using a custom database table was an option. That might be a better solution, because sitemaps are sort of "out of band" compared to posts. It should be easy to switch, because the actual database queries are abstracted behind the query_ methods. I will have to look into this.
As for registering the post types, it's possible that this is not necessary at all. What exactly does register_post_type do and when is it required? Can we store custom post types in the database without having to register them? Because the sitemap post types are not needed in the admin panel.
There was a problem hiding this comment.
The latest commit uses a custom database table to store the sitemap data.
|
So |
|
@eliorivero : To figure out why
|
|
Hi Natha, sorry for the delay. I've tested this now and it works perfectly with pretty, default and pathinfo permalinks. Great work! 🎉 One concern is the name of the sitemaps with the numbers in them. Maybe the main one could be named |
modules/sitemaps/sitemaps.php
Outdated
There was a problem hiding this comment.
Maybe this can only be instantiated when WP_DEBUG is defined and set to true? It's better not to load this every time during production.
modules/sitemaps/sitemap-builder.php
Outdated
There was a problem hiding this comment.
This file and the class instantiation in line 87 should be done only when WP_DEBUG is defined and set to true.
modules/sitemaps/sitemap-builder.php
Outdated
There was a problem hiding this comment.
This inline doc for the filter must be moved one line down, right before apply_filters.
There was a problem hiding this comment.
The same mistake was in two more places- all are fixed now.
There was a problem hiding this comment.
Conditional can be simplified to return 1 === $num_rows_deleted;
There was a problem hiding this comment.
Missing @since 4.5.0 instead of @module sitemaps
There was a problem hiding this comment.
Every method should now have a @since annotation.
modules/sitemaps/sitemap-stylist.php
Outdated
There was a problem hiding this comment.
Jetpack_Sitemap_Stylist::sanitize_with_links strips out the em tag. Just remove the em.
modules/sitemaps/sitemap-finder.php
Outdated
There was a problem hiding this comment.
It turns out there were lots of missing @returns- I checked again and every function that returns something should now have a @return annotation.
modules/sitemaps/sitemap-finder.php
Outdated
There was a problem hiding this comment.
Will this benefit from using Jetpack_Options::get_option_and_ensure_autoload( 'jetpack_sitemap_location', '' )?
How about others get_option calls in this PR?
There was a problem hiding this comment.
Reading through the source of get_option, it looks like a bunch of stuff happens when get_option is called on a nonexistent name- stuff that is unnecessary when we provide a default value. So I imagine there will be some benefit to storing the default, especially for the calls that happen many times (which is all of them).
|
@mdawaffe Okay- I think I'm done. There are always improvements to be made but I've hit the point of diminishing returns.
|
That's all that I mean. How does the cade cache/store/otherwise preserve the stuff it generates? What stuff is cached? For how long? How is the cache invalidated, or is it? What stuff is not cached and always generated on the fly? |
|
This is also copied to the P2 post: This module deals with data of a few different kinds, that are cached in different ways. There's
Sitemap pages are stored in the posts table using several custom post types. This storage doesn't really act like a cache, because we don't worry about whether the data is stale. These files are stored indefinitely (until replaced by an updated version by the cron job). The news sitemap page (singular) is cached using a transient. It is invalidated whenever a post is published but not when a comment is posted (see callback_action_flush_news_sitemap_cache), and has a lifetime of JP_NEWS_SITEMAP_INTERVAL (set to 12 hours) (see news_sitemap_xml). The sitemap state, stored in an option and used to keep track of what needs to be generated next, isn't really cached either. But it is locked to prevent a race condition, and this lock (stored as a transient) has a lifetime of JP_SITEMAP_LOCK_INTERVAL (set to 15 minutes). The sitemap XSL files are generated fresh on each request. These would be constant if not for some translated UI strings. |
|
Looks good and tests well. @jeherve, can you make sure that we include this in the testing instructions to make sure people test it out thoroughly? |
* Readme: remove old release and add skeleton for 4.8. * Changelog: add #6572 * Changelog: add #6567 * Changelog: add #6542 * Changelog: add #6527 * Changelog: add #6508 * Changelog: add #6478 * Changelog: add #6477 * Changelog: add #6249 * Update stable version and remove old version from readme. * Changelog: add 4.7.1 to changelog. * Readme: add new contributor. * Sync: update docblock @SInCE version. Related: #6053 * Changelog: add release post. * changelog: add #6053 * Changelog: add #6413 * Changelog: add #6482 * Changelog: add #6584 * Changelog add #6603 * Changelog: add #6606 * Changelog: add #6611 * Changelog: add #6635 * Changelog: add #6639 * Changelog: add #6684 * Changelog: add #6710 * Changelog: add #6711 * Changelog: add #5461 * Testing list: update Settings UI feedback prompt. Props @MichaelArestad * Changelog: add #6789 * Changelog: add #6778 * Changelog: add #6777 * Changelog: add #6775 * Changelog: add #6755 * Changelog: add #6731 * Changelog: add #6721 * Changelog: add #6705 * Changelog: add #6702 * Changelog: add #6671 * Changelog: add #6637 * Changelog: add #6582 * Changelog: add #6566 * Changelog: add #6555 * Changelog: add #6529 * Changelog: add #6344 * Changelog: add #5763 * Changelog: add #5503 * Changelog: update #6637 changelog. @see 40e115c#commitcomment-21523982 * Changelog: add #6699 * Changelog: add #6632 * Changelog: add #6769 * Changelog: add #6707 * Changelog: add #6590
Addresses #3314: Remove 1000 post limit on sitemaps.
Addresses #3750: Support video sitemaps (in progress)
Changes proposed in this Pull Request:
Replace sitemap generating code.
Note that this PR is an update of #5394, which became broken because I am awesome at git
Documentation is on the Team Luna P2: p7rd6c-B0-p2
Also some design ideas: p7rd6c-fS-p2
Testing instructions:
example.com/news-sitemap.xml.example.com/sitemap.xml. (It may take a refresh or two the first time.)example.com/?jetpack-sitemap=sitemap.xmlandexample.com/?jetpack-sitemap=news-sitemap.xmlThese changes allow for arbitrarily large sitemap trees per the spec. To test this functionality without having to create an arbitrarily large site, adjust the constant
SITEMAP_MAX_ITEMSinmodules/sitemaps/sitemap-buffer.php, (5 is a good value.) This constant determines the maximum number of URLs on a single sitemap, so that if it's set to 5 and we have 6 posts/pages then the tree structure kicks in.The constant
SITEMAP_INTERVALinmodules/sitemaps/sitemaps.phpis the minimum number of seconds between rebuilds; it's currently set to 60 for testing.Past related issues/PRs:
example.com/index.php/sitemap.xml.ent2ncrto avoid this bug.Proposed changelog entry for your changes:
New filters:
jetpack_sitemap_image_skip_post: For selectively excluding images from the sitemap.jetpack_sitemap_image_sitemap_item: For manipulating the XML of image entries in the sitemap.jetpack_sitemap_video_skip_post: For selectively excluding videos from the sitemap.jetpack_sitemap_video_sitemap_item: For manipulating the XML of video entries in the sitemap.jetpack_sitemap_news_ns: Add XML namespaces for the news sitemap.jetpack_sitemap_image_ns: Add XML namespaces for image sitemaps.Filters with changed behavior:
jetpack_sitemap_location: The old version had only two sitemap URLs to keep track of, so it was feasible to make both of them configurable. This version keeps track of two URLs and 6 indexed families of unknown size. Do we make them all configurable?gulp js:hintbefore to commit your changes. It will allow you to detect errors in Javascript files.