osd/ReplicatedBackend: read data portion of pushing object in batch#29588

Closed
xiexingguo wants to merge 1 commit into ceph:master from xiexingguo:wip-build-fie-in-batch

Conversation

@xiexingguo
Member

For sparse objects, this allows us to read the whole fiemap data in one batch by leveraging BlueStore's asynchronous read support, which is 2x~4x faster when the objects being pushed are massively fragmented.

Signed-off-by: xie xingguo xie.xingguo@zte.com.cn

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@tchaikov tchaikov self-requested a review August 16, 2019 12:26
Contributor

@tchaikov tchaikov left a comment


I think this change depends on some assumptions:

  1. the scatter/gather I/O can offset the overhead introduced by reading the holes in data_included
  2. the memory fragmentation can be neglected

I cannot find the analysis in the commit message, nor can I find the test results.

Could you help address the concerns above?

@xiexingguo
Member Author

My main concern is the performance of small randomized reads/writes, so I only measured the time spent reading the whole fiemap content off the disk, namely build_push_op, since we barely use omap in our testbed.
Before this change, the typical time to build a push op was 0.4~0.6s:

2019-08-07 16:42:16.975905 7f71f743a700  7 osd.56 pg_epoch: 26740 pg[2.15a( v 26734'14248 (26326'12476,26734'14248] local-lis/les=26443/26444 n=40 e
c=206/206 lis/c 26443/26401 les/c/f 26444/26402/0 26442/26443/26401) [56,25,57]/[56,25] async=[57] r=0 lpr=26443 pi=[26401,26443)/1 rops=2 crt=26734
'14248 lcod 26734'14247 mlcod 26407'14043 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=11}}] build_push_op 2:5a892e98:::rbd_data.1
25664d73937c.0000000000008436:head v 26662'14238 size 4194304 recovery_info: ObjectRecoveryInfo(2:5a892e98:::rbd_data.125664d73937c.0000000000008436
:head@26662'14238, size: 4194304, copy_subset: [0~4194304], clone_subset: {}, snapset: 0=[]:[])

2019-08-07 16:42:17.365134 7f71f743a700 20 osd.56 pg_epoch: 26740 pg[2.15a( v 26734'14248 (26326'12476,26734'14248] local-lis/les=26443/26444 n=40 e
c=206/206 lis/c 26443/26401 les/c/f 26444/26402/0 26442/26443/26401) [56,25,57]/[56,25] async=[57] r=0 lpr=26443 pi=[26401,26443)/1 rops=2 crt=26734
'14248 lcod 26734'14247 mlcod 26407'14043 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=11}}] send_pushes: sending push PushOp(2:5a
892e98:::rbd_data.125664d73937c.0000000000008436:head, version: 26662'14238, data_included: [8192~8192,32768~24576,65536~24576,98304~24576,139264~24
576,172032~8192,188416~8192,229376~16384,262144~8192,278528~65536,352256~16384,376832~8192,393216~24576,425984~8192,450560~40960,516096~16384,540672
~8192,557056~49152,622592~16384,647168~16384,671744~8192,696320~32768,737280~8192,761856~16384,794624~24576,835584~16384,868352~8192,909312~8192,925
696~16384,966656~24576,999424~8192,1015808~8192,1032192~8192,1064960~8192,1081344~16384,1105920~8192,1122304~16384,1146880~24576,1179648~16384,12206
08~24576,1253376~24576,1294336~8192,1310720~16384,1335296~24576,1376256~16384,1409024~16384,1433600~8192,1449984~24576,1490944~16384,1515520~8192,15
31904~8192,1548288~8192,1564672~16384,1589248~8192,1671168~8192,1695744~24576,1744896~8192,1769472~16384,1794048~32768,1851392~8192,1900544~24576,19
41504~40960,1990656~8192,2023424~8192,2048000~16384,2072576~24576,2113536~24576,2146304~8192,2170880~8192,2187264~8192,2220032~16384,2252800~32768,2
293760~8192,2310144~16384,2342912~16384,2400256~32768,2441216~16384,2473984~24576,2514944~8192,2531328~32768,2588672~16384,2621440~24576,2654208~819
2,2711552~24576,2752512~24576,2785280~16384,2809856~8192,2826240~24576,2867200~8192,2883584~8192,2916352~40960,2965504~40960,3022848~8192,3063808~16
384,3088384~8192,3104768~40960,3153920~32768,3194880~8192,3219456~16384,3244032~16384,3284992~16384,3334144~8192,3350528~8192,3375104~32768,3416064~
8192,3432448~16384,3465216~16384,3489792~8192,3522560~8192,3538944~8192,3555328~16384,3588096~8192,3612672~16384,3637248~49152,3702784~16384,3743744
~16384,3768320~8192,3801088~16384,3825664~49152,3883008~32768,3964928~16384,3989504~8192,4005888~8192,4022272~24576,4055040~8192,4071424~16384,41041
92~8192,4128768~16384,4153344~8192,4169728~8192,4186112~8192], data_size: 2285568, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recov
ery_info: ObjectRecoveryInfo(2:5a892e98:::rbd_data.125664d73937c.0000000000008436:head@26662'14238, size: 4194304, copy_subset: [0~4194304], clone_s
ubset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_complete:true, omap_recovered_to:, omap
_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_comp
lete:false, error:false)) to osd.57


2019-08-07 16:42:17.368688 7f71f4c35700  7 osd.56 pg_epoch: 26740 pg[2.15a( v 26734'14248 (26326'12476,26734'14248] local-lis/les=26443/26444 n=40 e
c=206/206 lis/c 26443/26401 les/c/f 26444/26402/0 26442/26443/26401) [56,25,57]/[56,25] async=[57] r=0 lpr=26443 pi=[26401,26443)/1 rops=2 crt=26734
'14248 lcod 26734'14247 mlcod 26407'14043 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=10}}] build_push_op 2:5a82ae87:::rbd_data.1
25664d73937c.000000000001e5be:head v 26668'14239 size 4194304 recovery_info: ObjectRecoveryInfo(2:5a82ae87:::rbd_data.125664d73937c.000000000001e5be
:head@26668'14239, size: 4194304, copy_subset: [0~4194304], clone_subset: {}, snapset: 0=[]:[])

2019-08-07 16:42:17.835942 7f71f4c35700 20 osd.56 pg_epoch: 26740 pg[2.15a( v 26734'14248 (26326'12476,26734'14248] local-lis/les=26443/26444 n=40 e
c=206/206 lis/c 26443/26401 les/c/f 26444/26402/0 26442/26443/26401) [56,25,57]/[56,25] async=[57] r=0 lpr=26443 pi=[26401,26443)/1 rops=2 crt=26734
'14248 lcod 26734'14247 mlcod 26407'14043 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=10}}] send_pushes: sending push PushOp(2:5a
82ae87:::rbd_data.125664d73937c.000000000001e5be:head, version: 26668'14239, data_included: [16384~16384,40960~8192,65536~16384,106496~32768,188416~
8192,221184~8192,245760~8192,262144~16384,286720~8192,311296~8192,327680~32768,368640~8192,385024~32768,434176~8192,458752~8192,491520~8192,507904~2
4576,548864~16384,589824~8192,606208~8192,622592~24576,655360~32768,704512~16384,729088~16384,761856~16384,786432~24576,827392~16384,851968~16384,87
6544~16384,901120~16384,925696~8192,942080~8192,958464~40960,1024000~73728,1122304~8192,1171456~8192,1196032~8192,1228800~32768,1269760~49152,134348
8~8192,1376256~8192,1417216~32768,1466368~24576,1499136~8192,1523712~16384,1548288~8192,1564672~16384,1589248~16384,1613824~16384,1638400~8192,16629
76~8192,1695744~8192,1712128~8192,1761280~8192,1777664~8192,1794048~16384,1843200~8192,1859584~8192,1892352~24576,1941504~16384,1966080~8192,1982464
~8192,1998848~8192,2015232~8192,2031616~40960,2080768~8192,2097152~16384,2121728~16384,2154496~24576,2203648~8192,2236416~8192,2252800~8192,2285568~
16384,2318336~24576,2351104~16384,2383872~8192,2408448~40960,2482176~8192,2498560~32768,2555904~16384,2605056~8192,2646016~24576,2695168~8192,271155
2~8192,2736128~49152,2793472~16384,2826240~8192,2859008~8192,2883584~16384,2908160~8192,2932736~40960,3031040~8192,3047424~8192,3063808~32768,310476
8~8192,3137536~16384,3162112~57344,3235840~8192,3268608~16384,3293184~16384,3325952~8192,3358720~8192,3391488~8192,3407872~16384,3457024~8192,347340
8~16384,3506176~24576,3547136~16384,3588096~8192,3629056~8192,3645440~16384,3678208~8192,3702784~8192,3719168~16384,3743744~8192,3776512~8192,379289
6~8192,3809280~8192,3825664~8192,3842048~16384,3866624~8192,3883008~8192,3915776~24576,3948544~8192,3964928~16384,3989504~8192,4014080~8192,4030464~
8192,4046848~16384,4071424~24576,4104192~8192,4120576~8192,4136960~8192,4153344~8192,4169728~24576], data_size: 2113536, omap_header_size: 0, omap_e
ntries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(2:5a82ae87:::rbd_data.125664d73937c.000000000001e5be:head@26668'14239, size: 4194
304, copy_subset: [0~4194304], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4194304, data_c
omplete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complet
e:false, omap_recovered_to:, omap_complete:false, error:false)) to osd.57

After this change, the time decreased to 0.2s (note that the objects were logically less fragmented, but physically more fragmented, since we kept writing more data in between):

2019-08-08 16:57:26.293065 7fcc74caa700  7 osd.25 pg_epoch: 32256 pg[2.7c( v 32252'17065 (31733'15378,32252'17065] local-lis/les=31820/31821 n=24 ec
=206/206 lis/c 31820/31769 les/c/f 31821/31770/0 31818/31820/31820) [26,25,52]/[25,52] async=[26] r=0 lpr=31820 pi=[31769,31820)/2 rops=2 crt=32252'
17065 lcod 32246'17064 mlcod 31802'16905 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=14}}] build_push_op 2:3e0e0395:::rbd_data.12
5664d73937c.0000000000009d3e:head v 32099'17044 size 4194304 recovery_info: ObjectRecoveryInfo(2:3e0e0395:::rbd_data.125664d73937c.0000000000009d3e:
head@32099'17044, size: 4194304, copy_subset: [0~4194304], clone_subset: {}, snapset: 0=[]:[])


2019-08-08 16:57:26.488996 7fcc74caa700 20 osd.25 pg_epoch: 32256 pg[2.7c( v 32252'17065 (31733'15378,32252'17065] local-lis/les=31820/31821 n=24 ec
=206/206 lis/c 31820/31769 les/c/f 31821/31770/0 31818/31820/31820) [26,25,52]/[25,52] async=[26] r=0 lpr=31820 pi=[31769,31820)/2 rops=2 crt=32252'
17065 lcod 32246'17064 mlcod 31802'16905 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=14}}] send_pushes: sending push PushOp(2:3e0
e0395:::rbd_data.125664d73937c.0000000000009d3e:head, version: 32099'17044, data_included: [0~8192,32768~49152,98304~8192,139264~32768,180224~16384,
204800~16384,229376~8192,245760~16384,278528~32768,319488~32768,360448~32768,401408~81920,491520~8192,507904~65536,581632~81920,671744~32768,729088~
40960,786432~16384,811008~16384,835584~8192,851968~73728,933888~24576,966656~8192,991232~8192,1007616~32768,1048576~57344,1114112~49152,1171456~2457
6,1204224~40960,1253376~32768,1294336~8192,1310720~16384,1335296~24576,1376256~24576,1417216~90112,1515520~8192,1531904~16384,1556480~106496,1671168
~16384,1695744~40960,1744896~98304,1859584~8192,1875968~49152,1933312~32768,1974272~24576,2007040~8192,2023424~32768,2072576~16384,2097152~49152,215
4496~90112,2269184~24576,2326528~65536,2400256~24576,2433024~57344,2506752~24576,2539520~32768,2588672~8192,2613248~8192,2629632~24576,2662400~16384
,2686976~24576,2719744~90112,2818048~16384,2842624~8192,2867200~16384,2891776~16384,2924544~65536,3006464~57344,3072000~73728,3162112~32768,3203072~
49152,3268608~24576,3309568~73728,3399680~65536,3481600~40960,3530752~16384,3555328~24576,3596288~8192,3620864~65536,3694592~24576,3727360~8192,3743
744~16384,3768320~8192,3784704~98304,3891200~8192,3907584~8192,3923968~24576,3956736~98304,4063232~65536,4153344~40960], data_size: 3178496, omap_he
ader_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(2:3e0e0395:::rbd_data.125664d73937c.0000000000009d3e:head@320
99'17044, size: 4194304, copy_subset: [0~4194304], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovere
d_to:4194304, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovere
d_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false)) to osd.26


2019-08-08 16:57:26.492689 7fcc774af700  7 osd.25 pg_epoch: 32256 pg[2.7c( v 32252'17065 (31733'15378,32252'17065] local-lis/les=31820/31821 n=24 ec
=206/206 lis/c 31820/31769 les/c/f 31821/31770/0 31818/31820/31820) [26,25,52]/[25,52] async=[26] r=0 lpr=31820 pi=[31769,31820)/2 rops=2 crt=32252'
17065 lcod 32246'17064 mlcod 31802'16905 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=13}}] build_push_op 2:3e0a39bb:::rbd_data.12
5664d73937c.00000000000033c8:head v 32121'17050 size 4177920 recovery_info: ObjectRecoveryInfo(2:3e0a39bb:::rbd_data.125664d73937c.00000000000033c8:
head@32121'17050, size: 4177920, copy_subset: [0~4177920], clone_subset: {}, snapset: 0=[]:[])

2019-08-08 16:57:26.681088 7fcc774af700 20 osd.25 pg_epoch: 32256 pg[2.7c( v 32252'17065 (31733'15378,32252'17065] local-lis/les=31820/31821 n=24 ec
=206/206 lis/c 31820/31769 les/c/f 31821/31770/0 31818/31820/31820) [26,25,52]/[25,52] async=[26] r=0 lpr=31820 pi=[31769,31820)/2 rops=2 crt=32252'
17065 lcod 32246'17064 mlcod 31802'16905 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=13}}] send_pushes: sending push PushOp(2:3e0
a39bb:::rbd_data.125664d73937c.00000000000033c8:head, version: 32121'17050, data_included: [0~8192,16384~16384,40960~73728,122880~16384,147456~16384
,172032~8192,188416~8192,204800~57344,286720~32768,335872~8192,352256~8192,368640~16384,401408~24576,434176~57344,499712~65536,573440~8192,589824~16
384,614400~40960,663552~49152,720896~24576,753664~8192,770048~139264,917504~8192,950272~8192,966656~73728,1048576~40960,1105920~40960,1155072~32768,
1204224~57344,1269760~16384,1294336~8192,1310720~24576,1343488~40960,1392640~16384,1417216~65536,1499136~16384,1523712~16384,1548288~90112,1654784~8
192,1671168~49152,1728512~32768,1769472~16384,1802240~8192,1818624~131072,1957888~49152,2015232~24576,2048000~81920,2154496~16384,2179072~57344,2244
608~8192,2260992~8192,2285568~8192,2301952~16384,2326528~131072,2465792~24576,2498560~106496,2629632~16384,2670592~16384,2695168~32768,2736128~8192,
2760704~24576,2793472~49152,2850816~24576,2883584~32768,2932736~16384,2957312~24576,2998272~90112,3096576~65536,3186688~16384,3211264~98304,3317760~
40960,3366912~172032,3547136~8192,3571712~24576,3604480~65536,3686400~24576,3735552~40960,3784704~8192,3801088~40960,3850240~40960,3899392~122880,40
38656~24576,4071424~8192,4087808~16384,4120576~16384,4145152~8192,4161536~16384], data_size: 3235840, omap_header_size: 0, omap_entries_size: 0, att
rset_size: 2, recovery_info: ObjectRecoveryInfo(2:3e0a39bb:::rbd_data.125664d73937c.00000000000033c8:head@32121'17050, size: 4177920, copy_subset: [
0~4177920], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4177920, data_complete:true, omap_
recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recov
ered_to:, omap_complete:false, error:false)) to osd.26

@xiexingguo
Member Author

@liewegas Care to comment?

@liewegas
Member

When I looked before, I had the same concern as @tchaikov that there is an implicit tradeoff here: it doesn't seem right to read the zeros in the holes only to then muck around tossing them out.

I think a more elegant solution would be to add a read_sparse() method to ObjectStore that allows BlueStore to do the read in parallel and efficiently. A wrapper can be put in ObjectStore.h that does basically what this PR does (either a fiemap + read the pieces, or read + fiemap + chop up result), and then BlueStore can implement it efficiently.

What do you think?

@xiexingguo
Member Author

xiexingguo commented Aug 22, 2019

I think a more elegant solution would be to add a read_sparse() method to ObjectStore that allows BlueStore to do the read in parallel and efficiently.

Yeah, that was my first version. But since I was planning to backport this fix to luminous, I just ended up posting a minimal (and hence much more reliable) fix at the last minute...

Will come back to this later.

@xiexingguo xiexingguo mentioned this pull request Sep 2, 2019
3 tasks
@xiexingguo
Member Author

#30061 merged.

@xiexingguo xiexingguo closed this Sep 7, 2019
@xiexingguo xiexingguo deleted the wip-build-fie-in-batch branch September 7, 2019 02:35