osd: allow FULL_TRY after failsafe#17177

Merged
liewegas merged 1 commit into ceph:master from liupan1111:wip-fix-rm
Aug 29, 2017

Conversation

@liupan1111
Contributor

@liupan1111 liupan1111 commented Aug 23, 2017

In #12627 and #14193 I added support for "rbd rm" when an OSD is full. That support turns out to be incomplete: "rbd rm" only works when the full OSD is not the primary. I ran an experiment: use vstart to create a single OSD, write until it is full, then run rm; the command still hangs. The fix in this PR resolves that.

Signed-off-by: Pan Liu <wanjun.lp@alibaba-inc.com>

@liupan1111 liupan1111 requested a review from liewegas August 23, 2017 04:07
@liupan1111
Contributor Author

retest this please

@liewegas
Member

Hmm, it doesn't seem like you should be hitting the failsafe threshold.

Oh, it's because vstart sets the thresholds too high:

        osd failsafe full ratio = .99
        mon osd nearfull ratio = .99
        mon osd backfillfull ratio = .99
        mon osd full ratio = .99

These should be .99, .96, .97, .98, or similar. Update vstart.sh?
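The ordering being asked for here can be written down as a small check. This is only a sketch: `check_full_ratios` is a hypothetical helper, not part of Ceph or vstart.sh, and it assumes the intended ordering is nearfull < backfillfull < full < failsafe, as the suggested values imply.

```python
def check_full_ratios(nearfull, backfillfull, full, failsafe):
    """Return a list of ordering problems among the four ratios.

    Hypothetical helper for illustration; assumes the intended order is
    nearfull < backfillfull < full < failsafe.
    """
    ordered = [
        ("mon osd nearfull ratio", nearfull),
        ("mon osd backfillfull ratio", backfillfull),
        ("mon osd full ratio", full),
        ("osd failsafe full ratio", failsafe),
    ]
    problems = []
    # Compare each ratio with the next one up the chain.
    for (lo_name, lo), (hi_name, hi) in zip(ordered, ordered[1:]):
        if not lo < hi:
            problems.append(f"{lo_name} ({lo}) should be below {hi_name} ({hi})")
    return problems


# The vstart values quoted above: every ratio at .99.
for problem in check_full_ratios(0.99, 0.99, 0.99, 0.99):
    print(problem)

# The suggested values: no problems reported.
print(check_full_ratios(0.96, 0.97, 0.98, 0.99))
```

With all four ratios equal, every adjacent pair is flagged, which is the situation where the failsafe can trip at the same moment the cluster would be marked full.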

@liupan1111
Contributor Author

@liewegas I got the test result by setting all of these options to 15, so that the OSD would fill up quickly. I don't think this issue is related to the option values... Could you give me some suggestions for resolving this issue if we don't make this change?

@liewegas
Member

The important thing is that the full_ratio is less than the failsafe ratio, so that the cluster is marked full and clients stop writing before hitting the failsafe.

The failsafe is a last-ditch safety check to prevent the OSD from filling itself up. You shouldn't be allowed to override it with the force flag.

@liupan1111
Contributor Author

@liewegas Yes, full_ratio is normally less than the failsafe ratio, but in my case it is possible for the failsafe to be reached first. That is because "statfs" is called in the OSD periodically (every one or five seconds?), which sets cur_stat of this OSD to full based on the failsafe ratio; the stats are then sent to the monitor, which checks them against full_ratio and changes the osdmap; the new map is sent to the client, and only then is client I/O paused... So there is a time interval...

In addition, I didn't override it with full_force, but with full_try.

I won't insist if you think we should tune the failsafe ratio and full_ratio to avoid this issue... But I think that tuning may be a little tricky...
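The time interval described in this comment can be illustrated with a toy model. It is a sketch under stated assumptions, not a description of Ceph's actual reporting path: usage grows linearly, usage is sampled every `statfs_interval` seconds, and everything between the sample and the client pausing is lumped into a single `propagation_delay`. The function name and all parameters are hypothetical.

```python
import math


def usage_when_client_pauses(rate_pct_per_s, statfs_interval_s,
                             propagation_delay_s, full_pct, failsafe_pct):
    """Toy model of the window between 'OSD is full' and 'client stops'.

    Returns (usage_pct_at_pause, failsafe_hit). All inputs are assumptions
    chosen for illustration.
    """
    # Time at which real usage crosses the full ratio.
    t_full = full_pct / rate_pct_per_s
    # The crossing is only noticed at the next periodic sample.
    t_sample = math.ceil(t_full / statfs_interval_s) * statfs_interval_s
    # The client keeps writing until the full state reaches it.
    t_pause = t_sample + propagation_delay_s
    usage = min(100.0, rate_pct_per_s * t_pause)
    return usage, usage >= failsafe_pct


# Fast writer: 2% of capacity per second, 5 s sampling, 2 s propagation.
print(usage_when_client_pauses(2, 5, 2, full_pct=15, failsafe_pct=16))
# → (24, True)

# Slow writer with prompt sampling stays below the failsafe.
print(usage_when_client_pauses(1, 1, 0, full_pct=15, failsafe_pct=20))
# → (15, False)
```

In this model, the overshoot past full_ratio depends on the write rate and the delays rather than on the gap between the two ratios, so a small gap only helps when the device fills slowly relative to those delays.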

@liupan1111
Contributor Author

@liewegas And in this case, I set all of these options to 15%, but I found both full_ratio and full_try were at 20% when fio paused... I used a 1m bs for the writes... For this case, I think a 1 or 2 percent difference between full_try and fail_safe could not resolve it...

Member

@liewegas liewegas left a comment

Ok, since this is just a FULL_TRY, it's probably harmless... we will only proceed if the transaction is a net reduction in usage. There is still some risk, though: it may be that the operation forces recovery of an object that then fills things up. The failsafe should block that from happening, though!
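The behaviour described in this review comment can be sketched as a single admission decision. This is a simplified illustration, not the code from this PR: `OpFlag` and `admit_write` are hypothetical names, the flag values are arbitrary rather than the real `CEPH_OSD_FLAG_*` constants, and the ordinary cluster-full handling is left out. It assumes FULL_FORCE alone does not bypass the failsafe, per the earlier comment in this thread.

```python
from enum import Flag, auto


class OpFlag(Flag):
    """Illustrative stand-ins for the op flags; values are arbitrary."""
    NONE = 0
    FULL_TRY = auto()
    FULL_FORCE = auto()


def admit_write(flags, failsafe_full, usage_delta_bytes):
    """Decide whether a write-ordered op may proceed on this OSD.

    Sketch only: regular (non-failsafe) full checks are omitted.
    """
    if not failsafe_full:
        return True
    # Past the failsafe, only FULL_TRY ops are considered at all;
    # FULL_FORCE on its own does not override the failsafe.
    if OpFlag.FULL_TRY not in flags:
        return False
    # A FULL_TRY op proceeds only if it is a net reduction in usage.
    return usage_delta_bytes < 0


# A delete-like op (frees 4 MiB) with FULL_TRY goes through.
print(admit_write(OpFlag.FULL_TRY, True, -4 * 1024 * 1024))   # → True
# The same op without the flag is refused once the failsafe trips.
print(admit_write(OpFlag.NONE, True, -4 * 1024 * 1024))       # → False
# FULL_TRY does not help an op that would grow usage.
print(admit_write(OpFlag.FULL_TRY, True, 4096))               # → False
```

The sketch does not cover the remaining risk mentioned in the comment, where the op itself shrinks usage but triggers recovery of an object that consumes space.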

@liewegas liewegas changed the title osd: support "rbd rm" when osd is full osd: allow FULL_TRY after failsafe Aug 25, 2017
@liewegas
Member

Do you mind updating the commit description?

Signed-off-by: Pan Liu <wanjun.lp@alibaba-inc.com>
@liupan1111
Contributor Author

liupan1111 commented Aug 26, 2017

@liewegas commit description has been updated, thanks.

> Ok, since this is just a FULL_TRY, it's probably harmless... we will only proceed if the transaction is a net reduction in usage. There is still some risk, though: it may be that the operation forces recovery of an object that then fills things up. The failsafe should block that from happening, though!

I searched the code and found no other places that set CEPH_OSD_FLAG_FULL_TRY... I think we could avoid this risk by strictly limiting which operations may set the FULL_TRY/FULL_FORCE flags, so that the failsafe can really block that safely.
