(WIP) Make controller go through attribute manager to clear transient attributes by kgaillot · Pull Request #1695 · ClusterLabs/pacemaker

kgaillot · 2019-02-01T23:21:48Z

This is for discussion only. I have made sure this compiles but otherwise not done any testing.

@todo

ATTRD_OP_PEER_CLEAR will remove all attributes set for a particular node, and write the changes to the CIB. This differs from ATTRD_OP_PEER_REMOVE in that the changes are written, and the node is not purged from the peer caches. This will enable putting attrd in full control of writing transient attribute changes to the CIB. @todo This will not work in mixed-version clusters

@todo

…ibutes Previously, the controller would clear a node's transient attributes from the CIB directly, but if the attribute manager has any attributes in memory, this simply results in writing them back out again. The only right time to clear the CIB directly is when the attribute manager's memory has already been purged. This may close some CLBZ's @todo This has not been tested or fully thought through

HideoYamauchi · 2019-02-02T06:45:15Z

Hi Ken,

Right now, I am testing the behavior.

Apparently, next week, I am likely to be able to present the first amendment that I can discuss.

I am planning to present the following modifications.

Mark when receiving the ATTRD_OP_PEER_CLEAR message.
In order to cope with crmd restarting, we will not immediately delete the attribute.
For remote node attributes, delete them immediately.
Deletes the attribute when the mark of the request to delete the attribute from crmd and the event of the node leaving attrd are collected.(It should be noted that the order in which this mark and the event occur may be reversed.)
With this fix, attributes are always deleted via the ATTRD_OP_PEER_CLEAR message.
Although this fix alone does not solve it, it seems to be effective even if only attrd is restarted in the future.
Add ATTRD_OP_ATTR_REMOVE as a new message.
After completing the update of cib, the Writer is delete the memory attribute of attrd.
At this time, send ATTRD_OP_ATTR_REMOVE to the attrd member of the cluster.
Other attrds that received ATTRD_OP_ATTR_REMOVE also delete the memory attributes.
In order to reliably synchronize the attributes in the cluster, we have made such a mechanism.
It does not incorporate the correspondence of the old CRM_ATTR_PROTOCOL version.
When the attrd leaving the cluster once again participates, the processing flag is cleared.
It is when attrd received the CRM_ATTR_PROTOCOL message
Actually, you should clear the processing flag when crmd comes in, but attrd can not detect it at the moment.
Forced write decision by force_write has been moved to write_attributes () write_attribute ().
This is because writing from write_attribute () may occur....
For now it seems to be difficult to delete the attribute of an isolated attrd if crmd does not restart.
etc..

(begin memo)

Memo for me
(My memo may be wrong, so please regard it as a memo)

(thinking....now)

~~1: Probably attrd that received ATTRD_OP_PEER_CLEAR will need to send_attrd_message the message as is.~~ (no need ^^)

2:In the case of processing after executing erase_status_tag (node_name, XML_CIB_TAG_LRM,...), I think that there is a problem depending on the timing.When pengine calculation is early...(->need test)
->Since it is only send_stonith_update (), there seems to be no problem.
--->This is the case when STONITH is executed, so there will be no problem.

3: The attrd thinks that after receiving ATTRD_OP_PEER_CLEAR it really needs to check that the node has left the member. (-->need test)
For example, when only the crmd of the DC node restarts, the attributes of that node are lost.
->If the target is a remote node, I think it is OK to erase it immediately.
->In other cases, set only the flag to clean up, and clear it when the node actually leaves.

4: It seems effective to delete the execution of this attrd_peer_remove ().(-->need test)
If this is not deleted, the attribute of the stopped DC node may remain in the CIB when the DC node stops.
This is because attrd detects node leaving before controller and ATTRD_OP_PEER_CLEAR message is not processed.
Where to erase is a problem.
->Writer deleted memory attributes with attrd_cib_callback () after actually clearing.
->Transfer deletion to other nodes.
-->There are two patterns for detection as to node withdrawal
-->The first is the pattern that attrd detects first, in this case, crmd will be detected later
-->The second is the pattern that attrd detects after crmd first detects

                /* TODO The controller should ask us to clear attributes from
                 * lost nodes, so we shouldn't do that here. But we need to
                 * consider corner cases such as the controller not being up.
                 */
                // Remove all attribute values associated with lost nodes
#if 0
//YAMAUCHI
                attrd_peer_remove(peer->uname, FALSE, "loss");
#endif

5: In the case of deleting, it will be necessary to immediately reflect even if there is a timer. (->need test)
6: Election of attrd may be delayed.
Especially in the case of large clusters, even if crmd sends ATTRD_OP_PEER_CLEAR, it may be discarded.(no need : Attributes are written after Writer is decided.)

The problem of attrd remaining after this correspondence.
7) However, if Election is delayed, problems remain. (https://bugs.clusterlabs.org/show_bug.cgi?id=5358#c21).
-->Currently, setting the "transition-delay" only prevents the problem.

Unlike this problem, we must also take into consideration that the attribute will be lost if only attrd is restarted in the future.
-->The current attrd deletes the attribute when it starts up.
Even if we exchange sync_response messages with other nodes, we reset them with NULL.
As a result, when attrd restarts, its attributes disappear.
-->Especially when configuring a cluster with a single node, the attribute is completely lost.
-->For example, when restarting attrd, it may be effective not to erase the attribute by pacemakerd passing the parameter.

(end memo)

Best Regards,
Hideo Yamauchi.

HideoYamauchi · 2019-02-20T02:22:59Z

Hi Ken,

There seems to be a mistake in your correction as much as 2 points in relation to the remote node.
Please confirm.

And...
It is almost time for my test to be completed.
Because there are many corrections, I would like to discuss by implementing another PR.

The PR that I issue includes your fix.
The subject of PR will be the same.

Best Regards,
Hideo Yamauchi.

HideoYamauchi · 2019-02-20T04:57:19Z

daemons/controld/controld_remote_ra.c

-    erase_status_tag(node_name, XML_TAG_TRANSIENT_NODEATTRS, call_opt);

-    /* Clear node's probed attribute */
-    update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE);


It sounded like a magic trick...
Execution of update_attrd () is required after update_attrd_clear_node (node_name, TRUE).
The remote node can not be recognized.

HideoYamauchi · 2019-02-20T05:01:44Z

daemons/controld/controld_remote_ra.c

+     * the CRM_OP_PROBED ("probe_complete") attribute.
+     */
+    update_attrd_clear_node(node_name, TRUE);



/* Set the CRM_OP_PROBED ("probe_complete") attribute. */
update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE);

Looking at the code again, I am a little confused. I believe the CRM_OP_PROBED attribute should be unnecessary since 8f76b78 . Also, remote_node_up() explictly clears it in the case of a remote node. I'm not completely sure, but I think we can remove all references to CRM_OP_PROBED from the controller (we probably need to keep the references in the scheduler for regression test purposes).

Hi Ken,

I understood this cause.

When attrd now handles remote node, we use this "probe_complete" attribute to register remote node.(crm_remote_peer_get())
Therefore, unless the "probe_complete" attribute is sent from crmd, attrd will not recognize the remote node properly.

(snip) static attribute_value_t * attrd_lookup_or_create_value(GHashTable *values, const char *host, xmlNode *xml) { attribute_value_t *v = g_hash_table_lookup(values, host); int is_remote = 0; crm_element_value_int(xml, F_ATTRD_IS_REMOTE, &is_remote); if (is_remote) { /* If we previously assumed this node was an unseen cluster node, * remove its entry from the cluster peer cache. */ crm_node_t *dup = crm_find_peer(0, host); if (dup && (dup->uuid == NULL)) { reap_crm_member(0, host); } /* Ensure this host is in the remote peer cache */ CRM_ASSERT(crm_remote_peer_get(host) != NULL); } (snip)

By the way, the attribute at this time works properly even if it is not "probe_complete".

#if 0 //YAMAUCHI /* Set the CRM_OP_PROBED ("probe_complete") attribute. */ update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE); #endif update_attrd(node_name, "remote_up", NULL, NULL, TRUE);

It is like a magic trick to be recognized by the remote node to the last.

It may be better to fix it in the future, but this update_attrd () is necessary at present.

Best Regards,
Hideo Yamauchi.

It is like a magic trick to be recognized by the remote node to the last.

Ah, I see, thanks.

It may be better to fix it in the future, but this update_attrd () is necessary at present.

Agreed.

kgaillot · 2019-02-20T18:25:30Z

Because there are many corrections, I would like to discuss by implementing another PR.

Hi Hideo,

That sounds good. I'll close this one.

HideoYamauchi · 2019-02-20T23:29:47Z

Hi Ken,

Thank you for closing this PR.

I will be able to implement PR for new discussions tomorrow.

Best Regards,
Hideo Yamauchi.

gao-yan

Thanks for the info, @kgaillot , @HideoYamauchi !

gao-yan · 2024-09-11T15:44:01Z

daemons/controld/controld_callbacks.c

                    te_trigger_stonith_history_sync();
                } else {
-                    erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
+                    update_attrd_clear_node(node->uname, is_remote);


IIUC, the approach is trying to clear node attributes via attrd and the CIB will be in sync, but so far in here controld being down will make its node attributes get cleared in attrd anyway despite that attrd might still be running on the node, meaning we are discarding its existing attributes anyway?

the current goal is to avoid that (T139). the plan is that the controller won't request attribute removal, the attribute manager will do it when a node leaves the cluster layer

Ken Gaillot added 2 commits February 1, 2019 17:21

HideoYamauchi reviewed Feb 20, 2019

View reviewed changes

kgaillot closed this Feb 20, 2019

HideoYamauchi mentioned this pull request Feb 21, 2019

(WIP) Make controller go through attribute manager to clear transient attributes. #1699

Closed

kgaillot deleted the attrd branch June 21, 2023 17:24

gao-yan reviewed Sep 11, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(WIP) Make controller go through attribute manager to clear transient attributes#1695

(WIP) Make controller go through attribute manager to clear transient attributes#1695
kgaillot wants to merge 2 commits intoClusterLabs:masterfrom
kgaillot:attrd

kgaillot commented Feb 1, 2019

Uh oh!

HideoYamauchi commented Feb 2, 2019 •

edited

Loading

Uh oh!

HideoYamauchi commented Feb 20, 2019 •

edited

Loading

Uh oh!

HideoYamauchi Feb 20, 2019 •

edited

Loading

Uh oh!

HideoYamauchi Feb 20, 2019 •

edited

Loading

Uh oh!

kgaillot Feb 20, 2019

Uh oh!

HideoYamauchi Feb 20, 2019

Uh oh!

kgaillot Feb 21, 2019

Uh oh!

kgaillot commented Feb 20, 2019

Uh oh!

HideoYamauchi commented Feb 20, 2019

Uh oh!

gao-yan left a comment

Uh oh!

gao-yan Sep 11, 2024

Uh oh!

kgaillot Sep 11, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kgaillot commented Feb 1, 2019

Uh oh!

HideoYamauchi commented Feb 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HideoYamauchi commented Feb 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HideoYamauchi Feb 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HideoYamauchi Feb 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kgaillot Feb 20, 2019

Choose a reason for hiding this comment

Uh oh!

HideoYamauchi Feb 20, 2019

Choose a reason for hiding this comment

Uh oh!

kgaillot Feb 21, 2019

Choose a reason for hiding this comment

Uh oh!

kgaillot commented Feb 20, 2019

Uh oh!

HideoYamauchi commented Feb 20, 2019

Uh oh!

gao-yan left a comment

Choose a reason for hiding this comment

Uh oh!

gao-yan Sep 11, 2024

Choose a reason for hiding this comment

Uh oh!

kgaillot Sep 11, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HideoYamauchi commented Feb 2, 2019 •

edited

Loading

HideoYamauchi commented Feb 20, 2019 •

edited

Loading

HideoYamauchi Feb 20, 2019 •

edited

Loading

HideoYamauchi Feb 20, 2019 •

edited

Loading