Skip to content

Incrementally rebuild when a data file is changed#8771

Merged
jekyllbot merged 18 commits intojekyll:masterfrom
ashmaroli:site_data-module
Sep 29, 2022
Merged

Incrementally rebuild when a data file is changed#8771
jekyllbot merged 18 commits intojekyll:masterfrom
ashmaroli:site_data-module

Conversation

@ashmaroli
Copy link
Copy Markdown
Member

  • This is an 🙋 enhancement.
  • I've added tests.
  • The test suite passes locally.

Summary

Currently, data files are read and loaded directly as Ruby primitives. This prevents us from registering data files as dependencies of a page or document.
Therefore, initialize data files into dedicated Jekyll objects.

Context

Resolves #7682

@ashmaroli ashmaroli requested review from a team, mattr- and parkr and removed request for a team August 19, 2021 16:50
@parkr
Copy link
Copy Markdown
Member

parkr commented Nov 22, 2021

This seems like a great idea! I haven't been able to dig into the code yet but it's a great plan. Thanks for tackling this!

@ashmaroli
Copy link
Copy Markdown
Member Author

@parkr @mattr- Will you be able to look into this over the weekend? I'd like to include this in 4.3.0..
Thanks.

@ashmaroli
Copy link
Copy Markdown
Member Author

Pinging @parkr for a fresh set of 👀

Copy link
Copy Markdown
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, @ashmaroli! Overall, it looks good. I left a few inline comments about code readability / structure and maybe one or two gotchas. Let me know what you think.

class Directory
# Delegate given (zero-arity) method(s) to the Hash object stored in instance
# variable `@meta`.
def self.delegate_to_meta(*symbols)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do a lot of delegation throughout the Jekyll codebase. Is the "Forwardable" module insufficient for this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the main motivation behind this was to avoid initialising interim objects. At the time, using Forwardable's def_delegators method resulted in allocating an Array object on every invocation of the delegated method. Anyways, this method has been designated as a private_class_method so that it can be changed or removed with minimal side-effects.

attr_accessor :context

def initialize
@meta = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meta is a very generic name which doesn't explain what this variable is used for. Could we use a more descriptive name that tells us what it holds?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Hash instance assigned to the variable contains the data of constituent data files. I had therefore considered naming the variable @data, then @metadata and finally just @meta.
Naming things is hard 😢
I will try to come up with better names..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's @files right? in DataReader, the code does data[key] = DataFile.new... which stores into this hash by method_missing. https://github.com/jekyll/jekyll/pull/8771/files#diff-ae0ebeb5a3a44f07ee61a0e5851d165c646ddab160824029c467bffea3717e7bR51

I'd call it @files or something like that since that's what it's storing? Is it storing anything else?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parkr, @files makes me think of an Array of file objects like how we have an array of staticfiles. But DataFile isn't technically a file object either, only a container for the "parsed data" of the actual data file on disk. The closest I think that will express the intention correctly is calling this variable a @registry and DataFile to be DataEntry instead.

end

def to_liquid
self
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_liquid is generally a means of sanitizing output, converting into a Drop, or returning a simpler data structure. I don't see a private below here so we'd be exposing #merge, #merge!, and all other Hash methods (from method_missing). It seems like we'd rather simply expose the #[](key) method to Liquid? I'd rather see a simple Drop here which grants basic access to the underlying Hash of File objects.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intentional to maintain backwards-compatibility.
As of Jekyll 4.2.x, the top-level objects of site_data are Hash instances. Hash#to_liquid returns the current Hash instance. So all of the public methods should continue to be invokable in the future.

attr_accessor :context
attr_reader :content

def initialize(site, path, content)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a method comment describing the inputs, since it's a bit vague (e.g. is "path" relative or absolute?)

def initialize(site, path, content)
@site = site
@path = path
@content = content
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@content throughout the Jekyll codebase generally refers to plain text. Perhaps @data or @structured_content would help indicate that this is generally not a text value, but rather False, Array, or Hash.


module Jekyll
module SiteData
class File
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In generally, I try to not use class names which match stdlib names, like File. It can just confuse things a bit. We could try Jekyll::DataFile and Jekyll::DataDirectory, or something which doesn't match the stdlib name File.


module Jekyll
module SiteData
class File
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no unit tests for this anywhere – is that intentional? I would be happy to see simple unit tests for creating an instance and checking equality, where you have some custom code.

@ashmaroli
Copy link
Copy Markdown
Member Author

Thank you for the constructive feedback, @parkr. I shall make the requested improvements over a few days.

Copy link
Copy Markdown
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my feedback, @ashmaroli! This is looking pretty good. I would just like to see a unit test for DataFile, especially testing for equality (<=>).

And I should see "John Doe -- Admin" in "_site/about.html"
And I should see "Rendering: index.html" in the build output
And I should see "Rendering: _posts/2009-03-27-wargames.markdown" in the build output
When I wait 1 second
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cucumber test suite is quite slow as it is. An optimization in a later PR could be to limit these waits to a smaller value and/or use a more granular mtime in our incremental meantime. 🤔

attr_accessor :context
attr_reader :data

def initialize(site, abs_path, data)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment here would still be nice, since I have to make certain assumptions about these variables. Something like:

# Create a DataFile.
# site - the Jekyll::Site this file belongs to
# abs_path - the absolute path to the file
# data - the parsed contents of the data file

If the comment is wrong, then it only helps to illustrate why a comment is helpful 😜

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your assumptions here are correct. I'll add the suggested comment to avoid assumptions. 🙂

attr_accessor :context

def initialize
@meta = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's @files right? in DataReader, the code does data[key] = DataFile.new... which stores into this hash by method_missing. https://github.com/jekyll/jekyll/pull/8771/files#diff-ae0ebeb5a3a44f07ee61a0e5851d165c646ddab160824029c467bffea3717e7bR51

I'd call it @files or something like that since that's what it's storing? Is it storing anything else?

@ashmaroli ashmaroli requested a review from parkr September 25, 2022 15:22

def [](key)
@registry[key].tap do |value|
value.context = context if value.respond_to?(:context=)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the only reason we have DataHash for this one line?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primarily yes. But also to avoid manipulating the Ruby Core class Hash.
Ultimately, we need a way to detect what entities are dependent on a given data file or directory of data files to trigger the entity's re-rendering when data file(s) change. The most common way end-users directly consume Jekyll::Site#site_data is via Liquid drops: {{ site.data[my_var] }}. Therefore, the only access to the parent entity's path attribute is via the Liquid context. So, we need an object that behaves like Ruby's own Hash yet be aware of render context and remain the same subclass after execution of certain methods (that return a Hash object irrespective of the caller class).
The last point is debatable because I don't quite remember which methods exactly and therefore cannot assert whether modern Ruby versions have addressed that oddity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks for the reiteration.

@ashmaroli
Copy link
Copy Markdown
Member Author

@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit 160a681 into jekyll:master Sep 29, 2022
jekyllbot added a commit that referenced this pull request Sep 29, 2022
github-actions bot pushed a commit that referenced this pull request Sep 29, 2022
Ashwin Maroli: Incrementally rebuild when a data file is changed (#8771)

Merge pull request 8771
ashmaroli added a commit that referenced this pull request Oct 26, 2022
jekyllbot pushed a commit that referenced this pull request Oct 26, 2022
@ashmaroli ashmaroli deleted the site_data-module branch October 26, 2022 16:38
github-actions bot pushed a commit that referenced this pull request Oct 26, 2022
Ashwin Maroli: Revert "Incrementally rebuild when a data file is changed (#8771)" (#9170)

Merge pull request 9170
ashmaroli added a commit that referenced this pull request Oct 27, 2022
lylo pushed a commit to lylo/jekyll that referenced this pull request Oct 27, 2022
lylo pushed a commit to lylo/jekyll that referenced this pull request Oct 27, 2022
* master: (27 commits)
  Remove noise in `features/highlighting.feature`
  Release 💎 v4.3.1
  Update history to reflect merge of jekyll#9171 [ci skip]
  Release post for v4.3.1 (jekyll#9171)
  Update history to reflect merge of jekyll#9170 [ci skip]
  Revert "Incrementally rebuild when a data file is changed (jekyll#8771)" (jekyll#9170)
  Update history to reflect merge of jekyll#9167 [ci skip]
  Respect user-defined name attribute in documents (jekyll#9167)
  Revert back to developing 4.3.x
  Mark initiation of v5.0 development
  Disable critical GH Actions on `master`
  Fix spelling errors in History document
  Release 💎 v4.3.0
  Update history to reflect merge of jekyll#9157 [ci skip]
  Release post for v4.3.0 (jekyll#9157)
  Clean up HEAD section in History document
  Document xz dependency on macOS (jekyll#9098)
  Fix URL to Liquid documentation (jekyll#9158)
  Bump RuboCop to `v1.37.x`
  Update history to reflect merge of jekyll#9132 [ci skip]
  ...
@jekyll jekyll locked and limited conversation to collaborators Oct 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incremental regeneration ignores changes to data files

4 participants