(WIP) Make controller go through attribute manager to clear transient attributes#1695
(WIP) Make controller go through attribute manager to clear transient attributes#1695kgaillot wants to merge 2 commits intoClusterLabs:masterfrom
Conversation
ATTRD_OP_PEER_CLEAR will remove all attributes set for a particular node, and write the changes to the CIB. This differs from ATTRD_OP_PEER_REMOVE in that the changes are written, and the node is not purged from the peer caches. This will enable putting attrd in full control of writing transient attribute changes to the CIB. @todo This will not work in mixed-version clusters
…ibutes Previously, the controller would clear a node's transient attributes from the CIB directly, but if the attribute manager has any attributes in memory, this simply results in writing them back out again. The only right time to clear the CIB directly is when the attribute manager's memory has already been purged. This may close some CLBZ's @todo This has not been tested or fully thought through
|
Hi Ken, Right now, I am testing the behavior. Apparently, next week, I am likely to be able to present the first amendment that I can discuss. I am planning to present the following modifications.
(begin memo) Memo for me (thinking....now)
2:In the case of processing after executing erase_status_tag (node_name, XML_CIB_TAG_LRM,...), I think that there is a problem depending on the timing.When pengine calculation is early...(->need test) 3: The attrd thinks that after receiving ATTRD_OP_PEER_CLEAR it really needs to check that the node has left the member. (-->need test) 4: It seems effective to delete the execution of this attrd_peer_remove ().(-->need test) 5: In the case of deleting, it will be necessary to immediately reflect even if there is a timer. (->need test) The problem of attrd remaining after this correspondence.
(end memo) Best Regards, |
|
Hi Ken, There seems to be a mistake in your correction as much as 2 points in relation to the remote node. And... The PR that I issue includes your fix. Best Regards, |
| erase_status_tag(node_name, XML_TAG_TRANSIENT_NODEATTRS, call_opt); | ||
|
|
||
| /* Clear node's probed attribute */ | ||
| update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE); |
There was a problem hiding this comment.
It sounded like a magic trick...
Execution of update_attrd () is required after update_attrd_clear_node (node_name, TRUE).
The remote node can not be recognized.
| * the CRM_OP_PROBED ("probe_complete") attribute. | ||
| */ | ||
| update_attrd_clear_node(node_name, TRUE); | ||
|
|
There was a problem hiding this comment.
/* Set the CRM_OP_PROBED ("probe_complete") attribute. */
update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE);
There was a problem hiding this comment.
Looking at the code again, I am a little confused. I believe the CRM_OP_PROBED attribute should be unnecessary since 8f76b78 . Also, remote_node_up() explictly clears it in the case of a remote node. I'm not completely sure, but I think we can remove all references to CRM_OP_PROBED from the controller (we probably need to keep the references in the scheduler for regression test purposes).
There was a problem hiding this comment.
Hi Ken,
I understood this cause.
When attrd now handles remote node, we use this "probe_complete" attribute to register remote node.(crm_remote_peer_get())
Therefore, unless the "probe_complete" attribute is sent from crmd, attrd will not recognize the remote node properly.
(snip)
static attribute_value_t *
attrd_lookup_or_create_value(GHashTable *values, const char *host, xmlNode *xml)
{
attribute_value_t *v = g_hash_table_lookup(values, host);
int is_remote = 0;
crm_element_value_int(xml, F_ATTRD_IS_REMOTE, &is_remote);
if (is_remote) {
/* If we previously assumed this node was an unseen cluster node,
* remove its entry from the cluster peer cache.
*/
crm_node_t *dup = crm_find_peer(0, host);
if (dup && (dup->uuid == NULL)) {
reap_crm_member(0, host);
}
/* Ensure this host is in the remote peer cache */
CRM_ASSERT(crm_remote_peer_get(host) != NULL);
}
(snip)
By the way, the attribute at this time works properly even if it is not "probe_complete".
#if 0
//YAMAUCHI
/* Set the CRM_OP_PROBED ("probe_complete") attribute. */
update_attrd(node_name, CRM_OP_PROBED, NULL, NULL, TRUE);
#endif
update_attrd(node_name, "remote_up", NULL, NULL, TRUE);
It is like a magic trick to be recognized by the remote node to the last.
It may be better to fix it in the future, but this update_attrd () is necessary at present.
Best Regards,
Hideo Yamauchi.
There was a problem hiding this comment.
It is like a magic trick to be recognized by the remote node to the last.
Ah, I see, thanks.
It may be better to fix it in the future, but this update_attrd () is necessary at present.
Agreed.
Hi Hideo, That sounds good. I'll close this one. |
|
Hi Ken, Thank you for closing this PR. I will be able to implement PR for new discussions tomorrow. Best Regards, |
gao-yan
left a comment
There was a problem hiding this comment.
Thanks for the info, @kgaillot , @HideoYamauchi !
| te_trigger_stonith_history_sync(); | ||
| } else { | ||
| erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local); | ||
| update_attrd_clear_node(node->uname, is_remote); |
There was a problem hiding this comment.
IIUC, the approach is trying to clear node attributes via attrd and the CIB will be in sync, but so far in here controld being down will make its node attributes get cleared in attrd anyway despite that attrd might still be running on the node, meaning we are discarding its existing attributes anyway?
There was a problem hiding this comment.
the current goal is to avoid that (T139). the plan is that the controller won't request attribute removal, the attribute manager will do it when a node leaves the cluster layer
This is for discussion only. I have made sure this compiles but otherwise not done any testing.