Skip to content

Conversation

@yliniemi
Copy link

@yliniemi yliniemi commented Oct 8, 2023

I made a fast and accurate look up table and interpolation based version of sin and cos functions. They take half as long and are 62 times more accurate. They can be used with the flag #define USE_SIN_32 before including FastLED. They add two new funtions called sin32() and cos32() and improve the accuracy of sin16() and cos16() functions. The look up table only consumes 640 bytes of extra ROM.

Here is a little program to benchmark the two implementations.
https://github.com/yliniemi/fastled_sin32_benchmark

@zackees
Copy link
Member

zackees commented Oct 9, 2023

__attribute__((always_inline)) static inline float sqrtInvApprox(float number)          // optimized by Jan Kadlec
{
    union { float f; uint32_t u; } y = {number};
    y.u = 0x5F1FFFF9ul - (y.u >> 1);
    return 0.703952253f * y.f * (2.38924456f - number * y.f * y.f);
}

__attribute__((always_inline)) static inline float sqrtApprox(float number)             // optimized by Jan Kadlec
{
    union { float f; uint32_t u; } y = {number};
    y.u = 0x5F1FFFF9ul - (y.u >> 1);
    return number * 0.703952253f * y.f * (2.38924456f - number * y.f * y.f);
}

You are doing float calculations in integer space.

I'm really skeptical that this is cross-platform compatible. I've had trauma from these types of calculations working on the Google Earth project where fast sin/cos were off on the QNX platform. Are some of the MCU's little/big endian?

It would be nice if FastLED had cross-platform unit tests to ensure an upper bound of error on each chipset. But right now we only have cross-platform compile tests.

@yliniemi
Copy link
Author

Thanks for your comment. Perhaps I should have made the benchmark more simple. These functions are not part of my pull request just the benchmark. They are actually the inverse square root approximation that Quake 3 made famous. Just a more optimized version that's three times more accurate at no additional computational cost. I was going to ask about it after this pull request.

I did not realise that there would be issues with this. I just checked and both on x86 and ARM the floats and integers are both saved in little endian byte order. The problem only arises when there are processors where integers are stored little endian and floats are stored big endian. I did some research and there are some CPUs that save integers and float in this inconsistent way. I tested this on x86 and

But back to this pull request. On FastLED I changed trig8.h file and added the files sin32.h and sin32.c. I increased the accuracy of sin16 and cos16. The old implementation of these functions is optimised for 8 bit microcontrollers with very little ram and rom. I on the other hand do the calculations in 32 bits and truncate the answer to int16_t in the end. This increases the accuracy and is no slower on 32 bit systems. The 32 bit versions of these functions, sin32 and cos32, are even faster because there are even less calculations and even higher accuracy because no truncating has to take place. All the calculations are done with integers because I wanted to maximize the speed and on esp32 the floating point math is not available during interrupts.

I would be honored if you cloned the branch with this pull request and tested it out.
git clone https://github.com/yliniemi/FastLED -b sin32

@kriegsman
Copy link
Member

Thanks for this contribution; you are correct that the lib8tion math functions were definitely originally written to be as fast as possible on 8-bit systems (AVR), and to run acceptably on 32-bit systems.

Do you know how much the new 32-bit-native version expands the code size? I'm thinking mostly of the lookup table size; I think the executable is basically the same. I'm definitely in favor of having 32-bit-native versions; I just would like to document what the increase in file size is when we include the lookup table.

@yliniemi
Copy link
Author

Thanks for your comment. The lookup table uses just 642 bytes of extra rom. I could cut that down to 128 bytes but that would slow the functions down substantially. The current implementation takes only 13 clock cycles on an esp32.

I could make a version of the functions that uses less ram and more cpu cycles for cases where rom space is short supply.

Copy link
Member

@kriegsman kriegsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really liking this contribution, and I'd like to accept it. I had one question about whether to use an int32_t or a uint32_t, and I'd like to hear your thoughts on that. After that possible small change, I'm ready to merge this in.

__attribute__((always_inline)) static int32_t sin32(uint32_t angle)
{
uint8_t angle256 = angle / 65536;
int32_t subAngle = angle % 65536;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should subAngle be a uint32_t, or is the sign actually needed here? (Same question applies to the copies of the code below)

@zackees
Copy link
Member

zackees commented Mar 3, 2025

This looks good, i'm going to accept this. I needed to be rebased so I applied fixes. @yliniemi if you want the contributor badge then feel free to copy the new pr as your own and resubmit.

#1896

@zackees
Copy link
Member

zackees commented Mar 5, 2025

merged

@zackees zackees closed this Mar 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants