Optimized arrays of pointers to arrays of arrays. #721
Conversation
|
I'm assuming it's called nxt_week and not nxt_day as it's matching tm.tm_wday (week day)? |
@ac000 |
|
The patch does two things.
With (1), from the point of view of readability, the original way is better, and that style is widely used. But there is no issue with the original way: its code is clear, and it hasn't changed in a long time. If similar operations are required in the future, which way should we use? Both ways are fine for me; what I'm afraid of is that this introduces more rules. |
Yes. In the case of a standalone program such as unitd(8) (as opposed to a shared library), I'm not sure who does the job of initializing the array of pointers; I'm not sure if it's ld(1) or something else. If it's done at runtime, we have a startup performance problem. @ac000, do you know the answer to this?
I think the readability of
Global variables are normally undesired for many reasons. For one (applicable here), they pollute the global namespace, but so do functions, and I don't think we might want But the most common reason to avoid globals is that they can change value in places that one wouldn't expect, since they can be accessed anywhere in the code (not
There are two issues actually:
I'd say yes. The array of arrays reduces code size, and maybe also startup time. Not a bad thing if the only downside is getting devs' eyes used to it, I think. :) And if an array of pointers is preferred, we should use
I'd also prefer that, yes; but only for read-only things. Global variables are bad. Global constants are good.
Yup. I hope they are not complex ones. I opened the following question to help clarify what happens in this case: https://software.codidact.com/posts/286578 |
|
Some experiments seem to confirm my theory; the numbers don't lie. I'm actually surprised that the total binary size is larger after this patch. But the good thing is that the extra size goes to the text section. The data section is smaller by 0.6 KiB (wiped 2% of the data section in a single no-op patch; I'm proud of that xD), which is a considerable improvement. I'll try to use

Edit: making it

Edit2: More detailed sizes:
And a diff, which is more interesting (removing the addresses, of course):

$ diff -u master const_array
--- master 2022-06-12 00:12:39.011952337 +0200
+++ const_array 2022-06-12 00:13:07.047952166 +0200
@@ -4,36 +4,36 @@
.interp 28
.note.gnu.build-id 36
.note.ABI-tag 32
-.gnu.hash 2512
-.dynsym 11040
-.dynstr 7041
-.gnu.version 920
+.gnu.hash 2520
+.dynsym 11088
+.dynstr 7071
+.gnu.version 924
.gnu.version_r 336
-.rela.dyn 28128
+.rela.dyn 26472
.rela.plt 3720
.init 23
.plt 2496
.plt.got 8
-.text 219793
+.text 221921
.fini 9
-.rodata 43490
-.eh_frame_hdr 8164
-.eh_frame 46312
+.rodata 43938
+.eh_frame_hdr 8180
+.eh_frame 46464
.tbss 560
.init_array 8
.fini_array 8
-.data.rel.ro 14192
+.data.rel.ro 13552
.dynamic 560
.got 88
.got.plt 1264
.data 13520
.bss 664
.comment 39
-.debug_aranges 4656
-.debug_info 1523680
-.debug_abbrev 100266
-.debug_line 292434
-.debug_str 68973
-.debug_loc 665566
+.debug_aranges 4704
+.debug_info 1526026
+.debug_abbrev 100821
+.debug_line 294460
+.debug_str 69162
+.debug_loc 673514
.debug_ranges 22384
-Total 3082950
+Total 3096600 |
|
Fascinating stuff! Whether this helps or not, I'll join the party! Let's cut it down to the bare bones.

a.c

const char *wday[] = { "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" };

.file	"a.c"
.text
.globl wday
.section .rodata
.LC0:
.string "Sun"
.LC1:
.string "Mon"
.LC2:
.string "Tue"
.LC3:
.string "Wed"
.LC4:
.string "Thu"
.LC5:
.string "Fri"
.LC6:
.string "Sat"
.data
.align 32
.type wday, @object
.size wday, 56
wday:
.quad .LC0
.quad .LC1
.quad .LC2
.quad .LC3
.quad .LC4
.quad .LC5
.quad .LC6
.ident "GCC: (GNU) 12.1.1 20220507 (Red Hat 12.1.1-1)"
.section .note.GNU-stack,"",@progbits

So we can see the seven strings in the .rodata section and the seven pointers to those strings in the .data section, and the wday variable is also in .data. So yeah, .data is our seven pointers and .rodata is the seven strings.

a2.c

const char wday[][4] = { "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" };

.file	"a2.c"
.text
.globl wday
.section .rodata
.align 16
.type wday, @object
.size wday, 28
wday:
.string "Sun"
.string "Mon"
.string "Tue"
.string "Wed"
.string "Thu"
.string "Fri"
.string "Sat"
.ident "GCC: (GNU) 12.1.1 20220507 (Red Hat 12.1.1-1)"
.section .note.GNU-stack,"",@progbits

So this time we just have the seven strings in .rodata along with the wday variable. And just for completeness, here's how 'const char *const wday[] =' changes things:

.file	"aa.c"
.text
.globl wday
.section .rodata
.LC0:
.string "Sun"
.LC1:
.string "Mon"
.LC2:
.string "Tue"
.LC3:
.string "Wed"
.LC4:
.string "Thu"
.LC5:
.string "Fri"
.LC6:
.string "Sat"
.align 32
.type wday, @object
.size wday, 56
wday:
.quad .LC0
.quad .LC1
.quad .LC2
.quad .LC3
.quad .LC4
.quad .LC5
.quad .LC6
.ident "GCC: (GNU) 12.1.1 20220507 (Red Hat 12.1.1-1)"
.section .note.GNU-stack,"",@progbits

So you can see that the wday variable has moved into the .rodata section. .rodata is now 56 + 28 for pointers and strings + 4 bytes padding.

Compiling the 'const char wday[][4] =' version as a shared object doesn't really change anything, while compiling the 'const char *wday[] =' version as a shared object seems to place the wday variable and pointers into a relocation section, so I guess that matches up with Ulrich's paper.

So while I agree that 'const char *wday[]' seems nicer, I think in this case wday[][4] (and the same with month) is worth doing: Alex showed a nice saving, and all the strings are the same length, so there's no wasted memory there either.

As for making them globals and removing duplicated code: of course you generally want to minimize duplicated code, and whilst sometimes duplication may be warranted, I don't think that applies in this case. And as Alex says, they will be marked const and placed in read-only memory, and the compiler will noisily complain if some code tries to modify them. |
|
Thanks! My feeling is that it's a good thing to make devs happy. If the patch introduced a new feature, it would be fine to use [][n] and global helper variables as its author likes. There would be no problem at all; both ways are fine. But since it's a refactoring and not a new feature, we need to weigh its pros and cons. The size is 56: sizeof(char *) * 7. The size is 28: 4 * 7. The size would be only 8. I would say there is one more way to improve performance. With the duplicated code, there is also another way: we should put all the related data into only one place; maybe it's OK to put it in just one .h file rather than a .c file. Back to the previous point: if this were a patch adding new features, it would be good. But it's a refactoring, so we need to compare its pros and cons. |
Heh, OK, this one I can definitely say I'm not a fan of!
And here we now have a magic number that you need to remember and no, I don't think making it a #define would be any better.
Well, 8 + the length of the strings, which is 48 (actually 49; there is an extra '.string ""' entry at the end), so 57. Compared to 48 for month[][4].
OK, so I like that better than the above! Although it does take quite a bit more memory (than if it were just arrays of arrays). Even without the mday field, .data is the 19 pointers for the days/months.
Perhaps this should be split into two patches, one to do the de-duplication and one to do the optimisation, or not, depending on which way the first patch is done. |
It's inspired by
Well, it can be replaced with
There are four types:

For me, all of them are fine; it's important to make devs happy with clear stuff. I would not ask them to use only one way. Currently, the pros and cons:

cons: |
|
Re: @ac000
It is =)
[...]
Just a catch: as we use compiler optimizations, and the variables are local to a function, the compiler is being smart enough to act as if we were using As expected, [...]
Interesting.
Actually, I changed another string array which does introduce some wasted 0s, but for some reason we're not using that code yet: Re: @hongzhidao
And you're forgetting the 28 bytes of the strings themselves. So this is 56 + 28 = 84 B.
This is almost literally what the compiler does when I do the array of arrays approach ( But, this code has some issues:
You're forgetting about the string literals themselves, which need to be stored in memory. This uses the exact same size as the array of arrays, + 8.
I like the idea of having a new file for all of this. Not sure about the Maybe
Yeah, I wasn't very happy with that file.
That'll be for another topic :)
Okay, but remember you have a patch that's adding another use case for
Sure, I'm going to do a more detailed review. Coming soon! Re: @ac000
Yes, I'm going to do that. It helps review the performance implications of each small change. I didn't expect this patch to be so fascinating when I first wrote it! Re: @hongzhidao
That's inefficient, as we've discussed.
That's the most efficient thing to do. And not so ugly, IMO.
That's even more inefficient than the first one (
Not very different from 3. Same memory savings as 2, but still has serious alignment and usability issues; and even uglier than 3.
The savings are huge. Let's continue discussing it, but if the final findings are similar to what we've seen until now, I think the minor readability concerns are outweighed by the performance improvements. I'd kindly ask devs to get used to seeing strings as char arrays, which is what they've always been underneath :) I'll publish some very interesting things soon. |
|
Demonstration of the alignment issues of 3 and 4 (as numbered in the discussion above):

$ cat string.c
extern const char s[];
const char *foo(int i)
{
return &s[i * 4];
}

$ cat string.s
.file "pointer_string.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
leal 0(,%rdi,4), %eax
leaq s(%rip), %rdi
cltq
addq %rdi, %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits

$ cat array_of_strings4.c
extern const char s[][4];
const char *foo(int i)
{
return s[i];
}

$ cat array_of_strings4.s
.file "array_of_strings4.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
leaq s(%rip), %rax
leaq (%rax,%rdi,4), %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits

All of them compiled with |
|
Other important issues to consider: non-power-of-two array sizes should be rounded up to the next power of two (we waste a little bit of space in exchange for a faster run time); this aligns the inner arrays (i.e., the strings) to power-of-two boundaries:

$ cat array_of_strings7.c
extern const char s[][7];
const char *foo(int i)
{
return s[i];
}

$ cat array_of_strings7.s
.file "array_of_strings.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
leaq s(%rip), %rdx
leaq 0(,%rdi,8), %rax
subq %rdi, %rax
addq %rdx, %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits
$ cat array_of_strings8.c
extern const char s[][8];
const char *foo(int i)
{
return s[i];
}

$ cat array_of_strings8.s
.file "array_of_strings8.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
leaq s(%rip), %rax
leaq (%rax,%rdi,8), %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits |
|
Makes sense. Good to know. |
|
I pushed a set of commits that break this change into very specific changes. All of them have a very detailed analysis of the performance impact they create in our code (and a theoretical rationale that explains it). I hope you enjoy reading them. I enjoyed optimizing this! :) I hope this documents for posterity that string literals should always be stored in arrays of The first commit, |
|
So I'll summarize my feelings on this. I'm perfectly happy with the global arrays of arrays for this case. Although I do like the struct idea... FWIW the patches look good to me. |
Oh, yeah, after the trip I forgot some of this. I wanted to test a 5th commit that puts the variables in a struct, per @hongzhidao's suggestion, to see how it affects the resulting binary. We may see surprises there. Hold my hand :)
=) |
Constant string pointers that don't change the string to which they point can use const to help the compiler create optimal code.

This patch doesn't affect the current resulting binary at all: since our strings are stored in static variables, the compiler has enough information to know that it can make them const. But it's still a good thing to add it, because the compiler may not always be able to optimize things, and readers can also see that the pointers never point to other things.

===

See the sizes of the binary before and after this patch
(`size -B build/unitd > ...`):

$ cat sz.B.ptr
   text    data     bss     dec     hex filename
 436246   29648    1224  467118   720ae build/unitd
$ cat sz.B.const
   text    data     bss     dec     hex filename
 436246   29648    1224  467118   720ae build/unitd
If a string is stored as an array, the binary only contains the
string literal. But if it is stored as a pointer, the binary
needs to contain the string literal _and_ a pointer to it, which
means an extra 8 bytes for every string.
In the case of arrays of strings, the size of the inner arrays
(the strings) has to be the same for all of them, so we need, at
least, the size required by the longest string (strlen + 1).
===
See the sizes of the binary before and after this patch
(`size -B build/unitd > ...`):
$ cat sz.B.const
text data bss dec hex filename
436246 29648 1224 467118 720ae build/unitd
$ cat sz.B.arr
text data bss dec hex filename
434846 29008 1224 465078 718b6 build/unitd
Both 'text' and 'data' have decreased after this patch. Good.
text by 1.4 KB (0.3%), and data by 0.6 KB (2%).
===
See a more detailed difference of the sizes
(`size -A build/unitd | sed -E 's/^(\S+\s+\S+)\S+\s+/\1/' > ...`):
$ diff -u sz.A.{const,arr}
--- sz.A.const 2022-06-12 20:32:15.029643211 +0200
+++ sz.A.arr 2022-06-12 20:33:12.187521769 +0200
@@ -9,20 +9,20 @@
.dynstr 7151
.gnu.version 932
.gnu.version_r 352
-.rela.dyn 27984
+.rela.dyn 26328
.rela.plt 3744
.init 23
.plt 2512
.plt.got 8
.text 264353
.fini 9
-.rodata 57626
+.rodata 57882
.eh_frame_hdr 8220
.eh_frame 49488
.tbss 560
.init_array 8
.fini_array 8
-.data.rel.ro 14192
+.data.rel.ro 13552
.dynamic 560
.got 88
.got.plt 1272
@@ -30,13 +30,13 @@
.bss 664
.comment 30
.debug_aranges 4656
-.debug_info 1610066
+.debug_info 1610114
.debug_abbrev 112065
-.debug_line 250577
+.debug_line 250579
.debug_str 67547
.debug_line_str 4061
.debug_loclists 282985
.debug_rnglists 12075
-Total 2811180
+Total 2809190
Ignoring debug sections, we see that:
.rela.dyn decreased by ~ 1.6 KB (6%)
.rodata INCREASED by ~ 0.3 KB (0.6%)
.data.rel.ro decreased by ~ 0.6 KB (4%)
This also looks great. Most of the changes are for the better, and
there's a section that increases, but not by much (and .rodata
will decrease again in the last patch of this patch set,
"Merged identical strings into unique extern constant variables.").
Arrays of strings are being stored as arrays of arrays of char.
Since arrays can't have padding, the compiler is forced to create
arrays of arrays of the size we specify. If that size is not a
power of 2, the compiler is able to align the outer array to the
next power of two, but it can't help with the alignment of the
inner arrays.
To improve performance when getting pointers to a string contained
in an array of strings, we can add some padding ourselves, by
rounding up the array size to a power of two.
Since the strings are not huge, the padding shouldn't cause too
much wasted space, and it's still far from the waste that an array
of pointers would cause.
Although this should have an impact on the binary size, it
doesn't, which makes me think this code is unused.
===
See the sizes of the binary before and after this patch
(`size -B build/unitd > ...`):
$ cat sz.B.arr
text data bss dec hex filename
434846 29008 1224 465078 718b6 build/unitd
$ cat sz.B.pow
text data bss dec hex filename
434846 29008 1224 465078 718b6 build/unitd
===
Although our code doesn't show an improvement, see the following
experiment, which demonstrates the improvement:
```c
$ cat array_of_strings7.c
extern const char s[][7];
const char *foo(int i)
{
return s[i];
}
```
```asm
$ cat array_of_strings7.s
.file "array_of_strings.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
leaq s(%rip), %rdx
leaq 0(,%rdi,8), %rax
subq %rdi, %rax
addq %rdx, %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits
```
```c
$ cat array_of_strings8.c
extern const char s[][8];
const char *foo(int i)
{
return s[i];
}
```
```asm
$ cat array_of_strings8.s
.file "array_of_strings8.c"
.text
.p2align 4
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
movslq %edi, %rdi
leaq s(%rip), %rax
leaq (%rax,%rdi,8), %rax
ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits
```
Although the linker is smart enough to merge most of the
duplicated code, it's better not to duplicate source code
(unless it has significant advantages).
In this case, we can see that we get a small improvement after
this patch.
===
See the sizes of the binary before and after this patch
(`size -B build/unitd > ...`):
$ cat sz.B.pow
text data bss dec hex filename
434846 29008 1224 465078 718b6 build/unitd
$ cat sz.B.ext
text data bss dec hex filename
434590 29008 1224 464822 717b6 build/unitd
'text' has decreased by 0.35 KB (0.08%).
===
See a more detailed difference of the sizes
(`size -A build/unitd | sed -E 's/^(\S+\s+\S+)\S+\s+/\1/' > ...`):
$ diff -u sz.A.{pow,ext}
--- sz.A.pow 2022-06-12 21:06:19.925747567 +0200
+++ sz.A.ext 2022-06-12 21:06:19.925747567 +0200
@@ -16,7 +16,7 @@
.plt.got 8
.text 264353
.fini 9
-.rodata 57882
+.rodata 57626
.eh_frame_hdr 8220
.eh_frame 49488
.tbss 560
@@ -29,12 +29,12 @@
.data 13520
.bss 664
.comment 30
-.debug_aranges 4656
-.debug_info 1610114
-.debug_abbrev 112065
-.debug_line 250579
-.debug_str 67547
-.debug_line_str 4061
-.debug_loclists 282985
+.debug_aranges 4688
+.debug_info 1611151
+.debug_abbrev 112343
+.debug_line 250702
+.debug_str 67555
+.debug_line_str 4095
+.debug_loclists 282939
.debug_rnglists 12075
-Total 2809190
+Total 2810400
Ignoring debug sections, we see that:
.rodata decreased by 0.35 KB (6%).
Which is the exact amount that it increased in the previous patch
"Storing string literals as 'char []' instead of 'char *'.".
It's good that after the complete patch set transforming strings
no section (other than the debug ones) has seen a net increase.
This reduces the binary size, probably because the compiler can
derive the address of one of the variables from the address of
the other, as an offset within the same structure.
===
See the sizes of the binary before and after this patch
(`size -B build/unitd > ...`):
$ cat sz.B.ext
text data bss dec hex filename
434590 29008 1224 464822 717b6 build/unitd
$ cat sz.B.struct
text data bss dec hex filename
434574 29008 1224 464806 717a6 build/unitd
'text' has decreased by 16 B (0.004%).
This seems to confirm the prediction that the compiler derives
one address from the other, since nxt_calendar.wday is used in
exactly two places (8 * 2 = 16).
===
See a more detailed difference of the sizes
(`size -A build/unitd | sed -E 's/^(\S+\s+\S+)\S+\s+/\1/' > ...`):
$ diff -u sz.A.{ext,struct}
--- sz.A.ext 2022-06-12 21:06:19.925747567 +0200
+++ sz.A.struct 2022-06-12 23:15:12.114790906 +0200
@@ -14,7 +14,7 @@
.init 23
.plt 2512
.plt.got 8
-.text 264353
+.text 264337
.fini 9
.rodata 57626
.eh_frame_hdr 8220
@@ -30,11 +30,11 @@
.bss 664
.comment 30
.debug_aranges 4688
-.debug_info 1611151
-.debug_abbrev 112343
-.debug_line 250702
-.debug_str 67555
+.debug_info 1611450
+.debug_abbrev 112329
+.debug_line 250700
+.debug_str 67570
.debug_line_str 4095
-.debug_loclists 282939
+.debug_loclists 282961
.debug_rnglists 12075
-Total 2810400
+Total 2810704
Ignoring debug sections, we see that:
.text decreased by 16 B (0.006%).
===
I didn't move mday <src/nxt_time_parse.c> into the structure too,
because it increases the binary size, which looks like it removes
some optimization opportunity (which makes sense, because we're
changing a variable from being static to being extern).
|
Aaand (suspense drums), yet another surprise: the struct helps remove two addresses from the .text section. So we shrink the binary by another 16 B; not much, but I like it, and it's for free. Thanks for the suggestion! I didn't move

$ diff -u sz.A.{struct,mday}
--- sz.A.struct 2022-06-12 23:15:12.114790906 +0200
+++ sz.A.mday 2022-06-12 23:04:42.055173590 +0200
@@ -16,7 +16,7 @@
.plt.got 8
.text 264337
.fini 9
-.rodata 57626
+.rodata 57690
.eh_frame_hdr 8220
.eh_frame 49488
.tbss 560
@@ -30,11 +30,11 @@
.bss 664
.comment 30
.debug_aranges 4688
-.debug_info 1611450
+.debug_info 1611657
.debug_abbrev 112329
-.debug_line 250700
+.debug_line 250705
.debug_str 67570
.debug_line_str 4095
.debug_loclists 282961
.debug_rnglists 12075
-Total 2810704
+Total 2810980 |
|
Just a reminder. There are only two places using

But now, we introduced two more files

I like the design introduced by Igor: he puts the most basic dependencies at the top, and it has not changed since the first version. But now the whole project has to depend on one more file. I'm not sure if that's a good idea. |
I'm not sure I see that as a big problem. As the project grows we are likely to add more files, either with new functionality or with existing functionality factored out. So I only see the number of source files increasing. One thing I do notice is that nxt_main.h includes a lot of header files and then often nxt_main.h is the only included header file in a .c file. Now I don't know if that's something that will continue or if the plan is to move to a more 'include what you use' style system? |
Sorry, I didn't get it, could you show more details? |
|
Sure and I think I get what you mean by
In that, if you change one header file, you can end up rebuilding the whole project, perhaps unnecessarily. I suspect that's more a symptom of how the #includes are currently done, which could be resolved by something along the lines of include-what-you-use. You don't even need to use the tools, just follow the principles...
Did you mean to do like below?
Please correct me if I've misunderstood. |
|
Re: @hongzhidao :
I've checked that |
|
Re: @hongzhidao :
Yes, that's the basics. Each file should include exactly the header files that provide the identifiers it uses. Not more; not less. It's based on Google's C++ style guide https://google.github.io/styleguide/cppguide.html#Header_Files, and then a tool was developed around it, to check that a piece of code has the correct includes: iwyu(1) https://include-what-you-use.org/ |
There's no plan, but yes, I have the same feeling about the include system. As you can probably guess by my recent contributions to iwyu(1), I have in mind cleaning this up to at least follow the principles, and maybe even use it in our build system :) |
Did you mean you would do that by the steps?
Do you plan to rewrite the design in how to use header files in the unit project? |
Yes. Unless you have a plan to clean BTW,
I wouldn't call that a plan. But I'll have it floating in my mind. If some day I feel like attempting to do it, I'll write some proposal and show it for discussion. It has some benefits, such as faster build and (especially) rebuild times, a better (more precise) dependency tree (which is the cause of the previous benefit), and also smaller preprocessed files ( If your question is whether I'm planning to do it soon, then no. |
It is three behind the patch you are reviewing now. In short, we moved |
Ahhh, that's great then! I'll hold this patch set to after that. |
|
Just a few thoughts from me to share:
|
I'd be very happy if @igorsysoev , @VBart , and @mar0x would like to comment. And it's one of the main reasons I moved my reviews from the internal review board to the public GitHub (and in some cases the mailing list). I welcome anyone's review. I think reducing some important sections of the binary, by the order of KB (some sections are reduced by up to 6% of their current size; check the commit logs), is a strong argument by itself. I also pointed to the paper by @drepper , who has extensive experience optimizing glibc and gcc(1), so I also consider that a strong argument. Also, it was my opinion (beforehand), that

const char *const c = "ccc";
const char *b = "bbb";
const char a[] = "aaa";
My question now that I know your aesthetic opinion is: Does it really matter so much? Let's summarize the net gains of the patch set, stated in the commit logs:
I didn't count the changes in the debug sections, which increased a bit, but they're debug. If someone wants the best performance, those can be stripped from the binary easily.
That's not so easy to measure precisely, and we don't have benchmarks already written, which would be great. I can try to measure an estimate of program startup, as I'm also curious. That would be very system-dependent, but still interesting, in the absence of other timings.
Binary size, especially in initialized data sections, is directly translated into startup time.
It's not a huge problem, especially in a program expected to live for an extended period of time. But it comes at the small cost of

Small systems, such as the Raspberry Pi, might also find it nice to have a slimmer binary and faster startup times. |
|
Okay, so here go some numbers for startup-time performance. Test specs:

I applied the following patch for testing only the startup time:

diff --git a/src/nxt_main.c b/src/nxt_main.c
index 26bee873..ef9f451d 100644
--- a/src/nxt_main.c
+++ b/src/nxt_main.c
@@ -16,6 +16,8 @@ main(int argc, char **argv)
{
nxt_int_t ret;
+ exit(4);
+
if (nxt_lib_start("unit", argv, &environ) != NXT_OK) {
return 1;
}

And I measured a loop of 1000 executions, 9 times before this patch set, and another 9 after this patch set.

Before:

alx@asus5775:~/src/nginx/unit$ git switch master
M src/nxt_main.c
Switched to branch 'master'
Your branch is up to date with 'alx/master'.
alx@asus5775:~/src/nginx/unit$ git clean -dffx >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ ./configure >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ make -j >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.747s
user 0m0.544s
sys 0m0.276s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.751s
user 0m0.554s
sys 0m0.275s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.750s
user 0m0.550s
sys 0m0.277s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.751s
user 0m0.560s
sys 0m0.269s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.645s
user 0m0.470s
sys 0m0.240s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.738s
user 0m0.552s
sys 0m0.262s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.750s
user 0m0.556s
sys 0m0.270s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.655s
user 0m0.497s
sys 0m0.224s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.750s
user 0m0.574s
sys	0m0.253s

After:

alx@asus5775:~/src/nginx/unit$ git switch const_array
M src/nxt_main.c
Switched to branch 'const_array'
Your branch is up to date with 'alx/const_array'.
alx@asus5775:~/src/nginx/unit$ git clean -dffx >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ ./configure >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ make -j >/dev/null; echo $?
0
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.750s
user 0m0.568s
sys 0m0.260s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.537s
user 0m0.418s
sys 0m0.173s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.659s
user 0m0.498s
sys 0m0.229s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.638s
user 0m0.489s
sys 0m0.213s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.588s
user 0m0.457s
sys 0m0.192s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.726s
user 0m0.534s
sys 0m0.266s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.746s
user 0m0.557s
sys 0m0.264s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.559s
user 0m0.430s
sys 0m0.187s
alx@asus5775:~/src/nginx/unit$ time for i in {1..1000}; do ./build/unitd; done
real 0m0.748s
user 0m0.558s
sys	0m0.268s

For a simple visual difference, see the 'real' times, sorted, so that we can see the median easily:

alx@asus5775:~/tmp$ grep real <times_before.txt | cut -f2 | sort
0m0.645s
0m0.655s
0m0.738s
0m0.747s
0m0.750s
0m0.750s
0m0.750s
0m0.751s
0m0.751s
alx@asus5775:~/tmp$ grep real <times_after.txt | cut -f2 | sort
0m0.537s
0m0.559s
0m0.588s
0m0.638s
0m0.659s
0m0.726s
0m0.746s
0m0.748s
0m0.750s

The median before was

Worst case seems to be almost the same. Best case improved considerably. As always, these tests are not perfect. |
I'm not sure I'd go as far as calling it a change in coding style. As
are not entirely equivalent (even less so when you remove the consts) and indeed you may end up using all three variants in a code base. The same goes for arrays of strings, sometimes arrays of pointers would be the right choice, sometimes arrays of arrays would be better. Of course the other thing that this patch set does is remove code duplication. |
|
@alejandro-colomar to evaluate any numbers, it's better to use statistical methods rather than visual comparison, as a direct comparison of numbers doesn't take the confidence range into account. Here nothing can be concluded from the numbers. It may be possible to prove something if many more measurements are made, but with only 9 measurements, nothing is proven. And this is pretty much expected; I'd be surprised to see any measurable difference in start-up time from such a small (in absolute numbers) reduction in size. I'm not even sure that it makes any difference until the relevant memory page is actually accessed (and is there any difference in the number of memory pages read at all?). The little variations in time are likely caused by various other factors in your system, like interference from other processes running in parallel. Note also that the difference in size is smaller than the difference between any two versions of Unit. So overall, since this patch doesn't cause any meaningful effect for users, I'd think it should only be evaluated from the code style, maintainability, and development cost points of view. The burden of those evaluations I'd prefer to leave to active colleagues. =) That's my 2 cents. |
|
I did a test with 10000 samples. The result confirms your theory. The time differences are non-existent:
The program to get the times is much more precise (and resistant to interference, thanks to randomness) this time:

#include <err.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/random.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
extern char **environ;
static void alx_iterate(ptrdiff_t n, char *const prog[2]);
int
main(int argc, char *argv[])
{
if (argc != 4)
errx(EXIT_FAILURE, "Usage: %s N PROG_A PROG_B", argv[0]);
alx_iterate(atoi(argv[1]), &argv[2]);
exit(EXIT_SUCCESS);
}
static void
alx_iterate(ptrdiff_t n, char *const prog[2])
{
int wstatus;
char *argv[2];
pid_t pid, w;
ptrdiff_t x;
ptrdiff_t i[2];
clock_t t[2][n][4];
/* The first one is for warming up */
for (int j = 0; j < 2; j++) {
memset(t, 0, sizeof(t));
memset(i, 0, sizeof(i));
while (i[0] < n || i[1] < n) {
if (i[0] == n) {
x = 1;
} else if (i[1] == n) {
x = 0;
} else {
getrandom(&x, 1, 0);
x &= 1;
}
argv[0] = prog[x];
argv[1] = NULL;
t[x][i[x]][0] = clock();
pid = vfork();
switch (pid) {
case 0:
execve(argv[0], argv, environ);
err(EXIT_FAILURE, "execve(2)");
case -1:
err(EXIT_FAILURE, "vfork(2)");
default:
do {
w = waitpid(pid, &wstatus, 0);
if (w == -1)
err(EXIT_FAILURE, "waitpid(2)");
} while (!WIFEXITED(wstatus));
t[x][i[x]][1] = clock();
if (WEXITSTATUS(wstatus) != 4)
errx(EXIT_FAILURE, prog[x]);
}
t[x][i[x]][2] = t[x][i[x]][1] - t[x][i[x]][0];
i[x]++;
}
}
printf("%s\t%s\n", prog[0], prog[1]);
for (ptrdiff_t j = 0; j < n; j++) {
printf("%ji\t%ji\n", (intmax_t) t[0][j][2],
(intmax_t) t[1][j][2]);
}
}And then ./cmp-exec 10000 $HOME/tmp/unitd-before $HOME/tmp/unitd-after > times.tsvThe statistics were calculated with So, the only benefit right now would be having a smaller binary. I'd expect that when we continue adding features, we might arrive to a point where it does affect startup time. But there's the possibility that we won't arrive to that point ever. Right now, the only benefit is size, and the only drawback is readability/maintainability. A second run of the whole experiment, with the order of the parameters reversed (just in case), shows more or less the same: ./cmp-exec 10000 $HOME/tmp/unitd-after $HOME/tmp/unitd-before > times-rev.tsv
|
|
Since the size gains are not great in absolute numbers (they improve important sections of the binary, but that would only be relevant to performance), and the total binary size doesn't change by a meaningful amount, I'll consider this change superfluous. And since some of us here tend to consider arrays uglier than pointers for storing string literals, I'll weigh that as more important in this case. So I'm closing this patch set for now. Maybe we can revisit this in the future if our string-literal usage increases. |
|
But the code de-duplication part is still valid, right? Though that could maybe go in a new PR... |
Wait a bit more. I did something really dumb. This is what I had compiled:

```shell
alx@asus5775:~/tmp$ size -B unitd-*
   text	   data	    bss	    dec	    hex	filename
   1827	    584	      8	   2419	    973	unitd-after
   1827	    584	      8	   2419	    973	unitd-before
```

I recompiled just now the binaries with an opaque call to `alx_exit()`:

```shell
alx@asus5775:~/tmp$ size -B unitd-*
   text	   data	    bss	    dec	    hex	filename
 372957	  29000	   1224	 403181	  626ed	unitd-after
 374629	  29640	   1224	 405493	  62ff5	unitd-before
```

```diff
diff --git a/src/nxt_main.c b/src/nxt_main.c
index 26bee873..2a00d478 100644
--- a/src/nxt_main.c
+++ b/src/nxt_main.c
@@ -11,11 +11,16 @@
 extern char **environ;
 
 
+void alx_exit(int x);
+
+
 int nxt_cdecl
 main(int argc, char **argv)
 {
     nxt_int_t  ret;
 
+    alx_exit(4);
+
     if (nxt_lib_start("unit", argv, &environ) != NXT_OK) {
         return 1;
     }
diff --git a/src/nxt_time.c b/src/nxt_time.c
index dfead51c..a67a96ed 100644
--- a/src/nxt_time.c
+++ b/src/nxt_time.c
@@ -7,6 +7,16 @@
 #include <nxt_main.h>
 
 
+void alx_exit(int x);
+
+
+void
+alx_exit(int x)
+{
+    exit(x);
+}
+
+
 /* OS-specific real, monotonic, and local times and timezone update. */
 
 
```

I still wonder if the arrays are initialized at program startup or at first use (which would be much harder to measure), but I'll run the tests and see if we see something there. |
|
Okay, I can confirm, now with the right binaries in place, that at program startup there's no performance improvement from this patch set. I can't talk about first use, though, which my experiment didn't test (and I don't even know how, or whether, we could test that):
So, I'll keep this PR closed. I'll keep the patch set in my git repository, in case someone wants to reopen it in the future. Regarding the de-duplication of code, I still think it's a good move. @hongzhidao would you like to apply that change? |
|
I prefer not to do so at this stage: there are only two places with the same variables, which doesn't justify introducing more changes. |
Not for me, to be honest. I never checked binary size before this conversation, or thought about it as a relevant metric that should be taken into account. With infinite time and resources I could probably agree with you that keeping binary size, build time, and configure time down is a good idea. But still, we would be trading developer time for a few KiB, or speeding up the build by tenths of a second (and I'm not sure that's worth it).
I am just used to the existing code style. From my side, the drawback I can see is: if somebody wants to add or remove an element of such an array, the length also has to be changed, increasing the overall diff.
Sorry for incorrect terminology (I am really bad at C). The main idea I was trying to convey is that this project was developed by people with solid C expertise too, so before changing something it makes sense to understand or ask why it was done that way (to avoid several rewriting). Again, I am not addressing this statement to you personally, it's just my thoughts.
Why? Nginx works fine without this principle (even on raspberry pi afaik). |
I guess it can also help with CPU cache utilisation.
Well, if you have a header file that's included in every .c file whether it needs to be or not, editing that header file will cause the whole project to be rebuilt, when maybe only one or two files need to be. |
I'm sure there are many places with similar cases; it doesn't only happen in Unit. Just take Unit as an example:

Line 773 in 862f51b

Line 948 in 862f51b

They are quite similar: they both do parsing and return an array whose elements have the same type, nxt_http_name_value_t. Currently, their parsing logic is independent of each other. If we refactor them into a more generic API just because of duplicated code, we have to spend more time thinking about an ideal solution. I think we can do it, but it will take more time. Sometimes it's worth giving up the urge to do premature refactoring, especially if the code is already working well and rarely changing. But that doesn't mean refactoring is not welcome. Every member of the team did a lot, including Alex. Take this patch as an example: these variables have already worked well for a long time without being changed.
It's good to do such refactoring if we actually want to change these variables, and we encourage that, because then the value is greater and more obvious. But as Andrei said, we don't have unlimited time and resources. There are times to prioritize, especially since Unit has a lot of functionality to support. From the point of view of value, it's not clear enough that this patch should be merged at this stage.
Then we would spend N times as much effort on the precise management of header files. Unit, like nginx and njs, just includes everything so that we don't have to care about it while programming; otherwise, with any change, we would need to find the relevant headers and include them... and remove the ones no longer needed... again and again. Note that the header files of the router and http parts, which are changed more often, are not in the |
|
Re: iwyu(1):

In this specific case, where we have a really small project, and a complete build is relatively fast:

```shell
$ time make -j >/dev/null

real	0m3.150s
user	0m9.785s
sys	0m1.092s
```

I think I could even agree with the simplicity of having a main header included everywhere. It's really simple, and reduces the time we have to think about updating header files. Okay, that time would ideally also be 0 if iwyu(1) were part of the build system, if iwyu(1) didn't have important bugs, and if iwyu(1) had full knowledge of glibc internals. But I've worked on iwyu(1), and I know its glibc support is close to 0 (I've been improving that, but mostly, it's far from usable), and it also has important bugs. As long as there are no cyclic dependencies, the main header can be easily written, just by putting every include as high as it can be, and I don't expect it to cause much trouble. 3 seconds is more than 10 milliseconds for a rebuild, but I can live with that. The idea will still float in my mind, but I don't see this changing in the near future. Maybe when iwyu(1) improves considerably, I'll rethink. |
I've been thinking about this overnight. I still believe this optimization is meaningful; only that I can't demonstrate it, and in this program, which is long-lived, it's hard to test specifically for that. But let me summarize my thoughts.

In the worst case, which is a global array of pointers in a position-independent executable, the dynamic loader has to apply a relocation per pointer at startup, and the array ends up in writable memory (or in .data.rel.ro at best). In the case of an array of arrays, the strings need no relocations at all and can live in .rodata.

So my expectations of performance would be that when we use the strings, we need to load 2 pages from memory (if they aren't hot in cache): one for the array of pointers, and one for the strings. Then there's the issue that each string will (probably) be unaligned, and operations such as memcpy() might have to do a few extra operations (not so bad, but it's a thing). With arrays, we only need to load the strings themselves (1 page); that's less opportunity for cache misses. And we can control the alignment of the strings, which, as long as it's a power of 2, should be more or less fine (ideally multiples of 8).

However, I'm not able to prove any of this right now. So I'm keeping this closed. I suggested to some open source project, in which I actively participate on the mailing list, that they apply this optimization to their program; since that program is a one-shot, we could measure things easily. We could then extrapolate those numbers to Unit. So this is not (yet) a dead end, but I won't push on this until I have some proof for my theories. |
Heh, my numbers look a little different...

```shell
$ time make -j >/dev/null

real	0m7.753s
user	0m23.475s
sys	0m3.381s
```
 |
That allows the linker to put them in read-only memory, and
optimize program startup time.
I also merged identical definitions into a single one, which the
linker would have done anyway, but that way we reduce source code
duplication.
See also: https://www.akkadia.org/drepper/dsohowto.pdf 2.4.3
Closes: #720