Skip to content

DOC Update/clarify plot_unveil_tree_structure.py#16942

Merged
glemaitre merged 10 commits intoscikit-learn:masterfrom
lucyleeow:DOC_treestruct
May 19, 2020
Merged

DOC Update/clarify plot_unveil_tree_structure.py#16942
glemaitre merged 10 commits intoscikit-learn:masterfrom
lucyleeow:DOC_treestruct

Conversation

@lucyleeow
Copy link
Copy Markdown
Member

@lucyleeow lucyleeow commented Apr 16, 2020

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

  • Change example to notebook style with alternating text and code blocks
  • Clarify explanations and comment code more
  • Consistently use the term 'split node', instead of alternating between 'split node' and 'test node' (happy to change back)

Any other comments?

@lucyleeow lucyleeow changed the title WIP Update/clarify plot_unveil_tree_structure.py DOC Update/clarify plot_unveil_tree_structure.py Apr 17, 2020
Comment on lines +54 to +61
# Among these arrays, we have:
# - ``children_left[i]`` - id of the left child of node i or -1 if leaf node
# - ``children_right[i]`` - id of the right child of node i or -1 if leaf
# node
# - ``feature[i]`` - feature used for splitting node i
# - ``threshold[i]`` - threshold value at node i
# - ``n_node_samples[i]`` - the number of of training samples reaching node i
# - ``impurity[i]`` - the impurity at node i
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Among these arrays, we have:
# - ``children_left[i]`` - id of the left child of node i or -1 if leaf node
# - ``children_right[i]`` - id of the right child of node i or -1 if leaf
# node
# - ``feature[i]`` - feature used for splitting node i
# - ``threshold[i]`` - threshold value at node i
# - ``n_node_samples[i]`` - the number of of training samples reaching node i
# - ``impurity[i]`` - the impurity at node i
# Among these arrays, we have:
# - ``children_left[i]``: id of the left child of node `i` or -1 if leaf node
# - ``children_right[i]``: id of the right child of node i or -1 if leaf node
# - ``feature[i]``: feature used for splitting node `i`
# - ``threshold[i]``: threshold value at node `i`
# - ``n_node_samples[i]``: the number of of training samples reaching
# node `i`
# - ``impurity[i]`` - the impurity at node `i`

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @lucyleeow Could you replace the string with f-string when relevant. It could make the code more readable here.

# to low level attributes such as ``node_count``, the total number of nodes,
# and ``max_depth``, the maximal depth of the tree. It also stores the
# entire binary tree structure, represented as a number of parallel arrays. The
# i-th element of each array holds information about the node i. Node 0 is the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# i-th element of each array holds information about the node i. Node 0 is the
# i-th element of each array holds information about the node `i`. Node 0 is the

# We can also retrieve the decision path of samples of interest. The
# ``decision_path`` method outputs an indicator matrix that allows us to
# retrieve the nodes the samples of interest traverse through. A non zero
# element in the indicator matrix at position (i, j) indicates that
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# element in the indicator matrix at position (i, j) indicates that
# element in the indicator matrix at position `(i, j)` indicates that

# ``decision_path`` method outputs an indicator matrix that allows us to
# retrieve the nodes the samples of interest traverse through. A non zero
# element in the indicator matrix at position (i, j) indicates that
# the sample i goes through the node j. Or, for one sample, i, the positions of
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the sample i goes through the node j. Or, for one sample, i, the positions of
# the sample `i` goes through the node j. Or, for one sample, `i`, the positions of

# retrieve the nodes the samples of interest traverse through. A non zero
# element in the indicator matrix at position (i, j) indicates that
# the sample i goes through the node j. Or, for one sample, i, the positions of
# the non zero elements in row i of the indicator matrix designate the ids
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the non zero elements in row i of the indicator matrix designate the ids
# the non zero elements in row `i` of the indicator matrix designate the ids

# ``apply`` method. This returns an array of the node ids of the leaves
# reached by each sample of interest. Using the leaf ids and the
# ``decision_path`` we can obtain the tests that were used to predict a sample
# or a group of samples. First, let's do it for one sample. Note:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# or a group of samples. First, let's do it for one sample. Note:
# or a group of samples. First, let's do it for one sample. Note that

Copy link
Copy Markdown
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @lucyleeow , looks good

I haven't checked the rendered docs though

# ``decision_path`` method outputs an indicator matrix that allows us to
# retrieve the nodes the samples of interest traverse through. A non zero
# element in the indicator matrix at position ``(i, j)`` indicates that
# the sample ``i`` goes through the node ``j``. Or, for one sample, ``i``, the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the sample ``i`` goes through the node ``j``. Or, for one sample, ``i``, the
# the sample ``i`` goes through the node ``j``. Or, for one sample ``i``, the

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)] # seed is the root node id and its parent depth
stack = [(0, -1)] # start with the root node id (0) and its parent depth (-1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of a nit but this could just be [(0, 0)] and we could have

node_id, current_depth = stack.pop()

below. It seems unnecessarily contrived to deal with the parent's depth

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I don't know why I didn't question this.

common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
len(sample_ids))

# obstain node ids using position in array
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# obstain node ids using position in array
# obtain node ids using position in array

# The leaf ids reached by samples of interest can be obtained with the
# ``apply`` method. This returns an array of the node ids of the leaves
# reached by each sample of interest. Using the leaf ids and the
# ``decision_path`` we can obtain the tests that were used to predict a sample
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "split tests" or "splitting conditions"

@glemaitre glemaitre merged commit d46663c into scikit-learn:master May 19, 2020
@lucyleeow lucyleeow deleted the DOC_treestruct branch May 19, 2020 10:59
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants