Skip to content

Enhance detect singletons#231

Merged
s3alfisc merged 2 commits intopy-econometrics:masterfrom
styfenschaer:enhance-detect_singletons
Dec 31, 2023
Merged

Enhance detect singletons#231
s3alfisc merged 2 commits intopy-econometrics:masterfrom
styfenschaer:enhance-detect_singletons

Conversation

@styfenschaer
Copy link
Copy Markdown
Contributor

This pull request addresses a bug in the _detect_singletons function and enhances its efficiency (see issue #229). The performance improvement is primarily due to the elimination of array copies, achieved through the use of explicit loops implemented with Numba. The function performs optimally when provided with a column-major order array, which avoids the need for additional copies of the input array. Additionally, this pull request introduces a small test suite to further validate the changes.

To facilitate a comparison between the old and new implementations, the following code can be executed. It's important to note that for this comparison, the bug in the old implementation has been corrected.

See comparison
from time import perf_counter

import numba as nb
import numpy as np
from numba.extending import overload


def _prepare_fixed_effects(ary):
    pass


@overload(_prepare_fixed_effects)
def _ol_preproc_fixed_effects(ary):
    # If array is already an F-array we tolerate
    # any dtype because it saves us a copy
    if ary.layout == "F":
        return lambda ary: ary

    if not isinstance(ary.dtype, nb.types.Integer):
        raise nb.TypingError("Fixed effects must be integers")

    max_nbits = 32
    nbits = min(max_nbits, ary.dtype.bitwidth)
    dtype = nb.types.Integer.from_bitwidth(nbits, signed=False)

    def impl(ary):
        n, m = ary.shape
        out = np.empty((m, n), dtype=dtype).T
        out[:] = ary[:]
        return out

    return impl


@nb.njit
def new_detect_singletons(ids):
    """
    Detect singleton fixed effects in a dataset.

    Args:
        ids (np.ndarray): A 2D numpy array representing fixed effects, with a
                          shape of (n_samples, n_features).
                          Elements should be non-negative integers representing
                          fixed effect identifiers.

    Returns:
        np.ndarray: A boolean array of shape (n_samples,), indicating which
                    observations have a singleton fixed effect.

    Note:
        The algorithm iterates over columns to identify fixed effects.
        After completing each column traversal, it updates the record of
        non-singleton rows. This approach is preferred as removing an
        observation in column 'i' may lead to the emergence of new singletons
        in subsequent columns (i.e., columns > i).

        Since we are operating on columns, we enforce column-major order for
        the input array. Working on a row-major array leads to considerable
        performance losses.
    """
    ids = _prepare_fixed_effects(ids)
    n_samples, n_features = ids.shape

    max_fixef = np.max(ids)
    counts = np.empty(max_fixef + 1, dtype=np.uint32)

    n_non_singletons = n_samples
    non_singletons = np.arange(n_non_singletons, dtype=np.uint32)

    while True:
        n_non_singletons_curr = n_non_singletons

        for j in range(n_features):
            col = ids[:, j]

            counts[:] = 0
            n_singletons = 0
            for i in range(n_non_singletons):
                e = col[non_singletons[i]]
                c = counts[e]
                # Branchless version of:
                #
                # if counts[e] == 1:
                #     n_singletons -= 1
                # elif counts[e] == 0:
                #     n_singletons += 1
                #
                n_singletons += (c == 0) - (c == 1)
                counts[e] += 1

            if not n_singletons:
                continue

            cnt = 0
            for i in range(n_non_singletons):
                e = col[non_singletons[i]]
                if counts[e] != 1:
                    non_singletons[cnt] = non_singletons[i]
                    cnt += 1

            n_non_singletons = cnt

        if n_non_singletons_curr == n_non_singletons:
            break

    is_singleton = np.ones(n_samples, dtype=np.bool_)
    for i in range(n_non_singletons):
        is_singleton[non_singletons[i]] = False

    return is_singleton


def old_detect_singletons(ids):
    """
    Detect singleton fixed effects
    Args:have a
        ids (np.ndarray): A numpy array of fixed effects.
    Returns:
        An array of booleans indicating which observations have a singleton fixed effect.
    """

    N, k = ids.shape

    singleton_idx = np.full(N, False, dtype=bool)
    singleton_idx_tmp = np.full(N, False, dtype=bool)
    singleton_idx_tmp_old = singleton_idx.copy()

    ids_tmp = ids.copy()

    while True:
        for x in range(k):
            col = ids[:, x]
            col_tmp = ids_tmp[:, x]

            # note that this only "works" as fixed effects are integers from 0, 1, ..., n_fixef -1
            # and np.bincount orders results in ascending (integer) order
            counts = np.bincount(col_tmp)

            if np.any(counts == 1):
                idx = np.where(counts == 1)

                singleton_idx[np.isin(col, idx)] = True
                singleton_idx_tmp[np.isin(col_tmp, idx)] = True

                ids_tmp = ids_tmp[~singleton_idx_tmp, :].astype(ids_tmp.dtype)
                singleton_idx_tmp = singleton_idx_tmp[~singleton_idx_tmp]

            # break

        if np.array_equal(singleton_idx_tmp, singleton_idx_tmp_old):
            break

        singleton_idx_tmp_old = singleton_idx_tmp.copy()

    return singleton_idx


@nb.njit
def make_deterministic(shape, n_singletons, dtype=np.int64):
    out = np.zeros(shape, dtype=dtype)
    assert n_singletons <= shape[0]
    for i in range(n_singletons):
        j = i % shape[1]
        out[i, j] = i + 1
    return out


class SimpleTimer:
    def __init__(self):
        self.tic = None
        self.toc = None

    def __enter__(self):
        self.tic = perf_counter()
        return self

    def __exit__(self, *args):
        self.toc = perf_counter()

    def __str__(self):
        return f"Elapsed [s]: {self.elapsed:.2f}"

    def __truediv__(self, other):
        return self.elapsed / other.elapsed

    @property
    def elapsed(self):
        return self.toc - self.tic


if __name__ == "__main__":
    # warmup
    new_detect_singletons(np.zeros((1, 1), dtype=np.int64))

    N, M = 100_000, 100

    for i, a in enumerate(
        (
            make_deterministic((N, M), n_singletons=0),
            make_deterministic((N, M), n_singletons=N // 100_000),
            make_deterministic((N, M), n_singletons=N // 10_000),
            make_deterministic((N, M), n_singletons=N // 1_000),
            make_deterministic((N, M), n_singletons=N // 100),
            make_deterministic((N, M), n_singletons=N // 10),
            make_deterministic((N, M), n_singletons=N),
        )
    ):
        print(f"\nRunning case {i+1}")

        with SimpleTimer() as t1:
            r1 = old_detect_singletons(a)

        with SimpleTimer() as t2:
            r2 = new_detect_singletons(a)

        assert np.array_equal(r1, r2)

        print(t1)
        print(t2)
        print(f"Speedup: {t1 / t2:.1f} (#singletons {r1.sum()}/{r1.shape[0]})")

@s3alfisc
Copy link
Copy Markdown
Member

Hi @styfenschaer, I finally managed to take a first look - sorry for the delay! Two questions:

  • are you not using the _ol_preproc_fixed_effects() function? If yes, can it be deleted?
  • what is the purpose of the _prepare_fixed_effects() function? It does not seem to do anything?

I will make some updates to where _detect_singletons() should be applied - I will move it from demean_model() to model_matrix_fixest(), which is a more natural place for it (I believe). On the fly I will also fix some other bugs in the code that relate to singleton fixed effects and will push them to this branch.

Also, once again, thanks so much for this!

Best, Alex

@s3alfisc
Copy link
Copy Markdown
Member

Btw these speedups are once again pretty amazing:

Running case 1
Elapsed [s]: 0.20
Elapsed [s]: 0.05
Speedup: 4.0 (#singletons 0/100000)

Running case 2
Elapsed [s]: 0.48
Elapsed [s]: 0.07
Speedup: 7.0 (#singletons 1/100000)

Running case 3
Elapsed [s]: 1.02
Elapsed [s]: 0.07
Speedup: 15.6 (#singletons 10/100000)

Running case 4
Elapsed [s]: 6.43
Elapsed [s]: 0.08
Speedup: 83.6 (#singletons 100/100000)

Running case 5
Elapsed [s]: 6.43
Elapsed [s]: 0.07
Speedup: 88.4 (#singletons 1000/100000)

Running case 6
Elapsed [s]: 5.91
Elapsed [s]: 0.08
Speedup: 77.1 (#singletons 10000/100000)

Running case 7
Elapsed [s]: 3.10
Elapsed [s]: 0.05
Speedup: 68.9 (#singletons 100000/100000)

@s3alfisc s3alfisc merged commit 08aedbe into py-econometrics:master Dec 31, 2023
@styfenschaer
Copy link
Copy Markdown
Contributor Author

Hello @s3alfisc

Don't worry, there is no deadline, so there is no delay.

These functions are both needed. _prepare_fixed_effects is just a dummy python function. When Numba encounters this function, it will call _ol_preproc_fixed_effects with the types of the arguments and _ol_preproc_fixed_effects must return the appropriate implementation. You can read about this here. I did it this way because Numba does not support the dtype argument to the numpy.asfortranarray function. And even then, I wanted a bit more control over the data type of the resulting array.

Sure, I was also wondering if this function should become public, which would mean that the leading underscore should be removed from the name.

@s3alfisc
Copy link
Copy Markdown
Member

s3alfisc commented Dec 31, 2023

Thanks for the explanation Styfen. As you can see, I've already merged the PR, added some simple tests against fixest and put 'detect_singletons' into its own module & also made it public. I hope you'll have a great start into 2024 tonight!

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @styfenschaer for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

File README.md was not found in the repository (s3alfisc/pyfixest).

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @styfenschaer for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

I've put up a pull request to add @styfenschaer! 🎉

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @styfenschaer for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

@styfenschaer already contributed before to code

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @NKeleher for code and infra

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

I've put up a pull request to add @NKeleher! 🎉

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @Wenzhi-Ding for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

I've put up a pull request to add @Wenzhi-Ding! 🎉

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @apoorvalal for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

I've put up a pull request to add @apoorvalal! 🎉

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @styfenschaer for code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

@styfenschaer already contributed before to code

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @juanitorduz for infra, code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

This project's configuration file has malformed JSON: .all-contributorsrc. Error:: Unexpected string in JSON at position 400

@s3alfisc
Copy link
Copy Markdown
Member

@all-contributors please add @juanitorduz for infra, code

@allcontributors
Copy link
Copy Markdown
Contributor

@s3alfisc

I've put up a pull request to add @juanitorduz! 🎉

damandhaliwal pushed a commit to damandhaliwal/pyfixest that referenced this pull request Jun 17, 2025
* fix bug and improve performance of _detect_singletons (issue 229)

* add test cases for _detect_singletons (issue 229)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhancing _detect_singletons Bug when selecting fixef_rm = "singleton"

2 participants