feat: implement virtual arrays #3364

Draft
wants to merge 83 commits into
base: main

ikrommyd
Collaborator

@ikrommyd ikrommyd commented Jan 9, 2025

This PR implements virtual buffers that are materialized only when their data is actually needed. Fetching that data would typically involve an expensive disk read.
The description will be updated later and made much more detailed.
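
To make the idea concrete, here is a minimal pure-Python sketch of a buffer that defers its load until the first access. The `LazyBuffer` class, its `materialize` method, and `read_from_disk` are hypothetical names chosen for illustration; they are not the classes or API added by this PR.

```python
from typing import Callable, List, Optional


class LazyBuffer:
    """Hypothetical stand-in for a virtual buffer; not the class added by this PR."""

    def __init__(self, generator: Callable[[], List[float]]) -> None:
        self._generator = generator  # deferred (possibly expensive) load
        self._data: Optional[List[float]] = None

    @property
    def is_materialized(self) -> bool:
        return self._data is not None

    def materialize(self) -> List[float]:
        # Run the generator at most once and cache the result.
        if self._data is None:
            self._data = self._generator()
        return self._data

    def __getitem__(self, index: int) -> float:
        # Any access to actual values triggers materialization.
        return self.materialize()[index]


calls: List[str] = []


def read_from_disk() -> List[float]:
    # Placeholder for an expensive disk read; records each invocation.
    calls.append("read")
    return [1.0, 2.0, 3.0]


buf = LazyBuffer(read_from_disk)
untouched = buf.is_materialized  # nothing has been read yet
first = buf[0]                   # first access triggers the read
second = buf[1]                  # served from the cached data
```

The point of the sketch is only that constructing the buffer costs nothing, and that the load happens once, on first use.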

@ikrommyd ikrommyd marked this pull request as draft January 9, 2025 21:48
@ikrommyd ikrommyd force-pushed the virtual-arrays branch 3 times, most recently from e76182b to 3b9ad2f Compare January 21, 2025 13:41
@pfackeldey
Collaborator

> @pfackeldey This should be ready for the first review round. In the meantime, I'll be writing unit tests the following week.
>
> cc @ianna @agoose77 @jpivarski

Let's first figure out unknown_length handling with virtual arrays. Then, we should have all required functionality in place. I'll start a review afterwards! It looks already in very good shape from first sight 🎉

@ikrommyd
Collaborator Author

> > @pfackeldey This should be ready for the first review round. In the meantime, I'll be writing unit tests the following week.
> > cc @ianna @agoose77 @jpivarski
>
> Let's first figure out unknown_length handling with virtual arrays. Then, we should have all required functionality in place. I'll start a review afterwards! It looks already in very good shape from first sight 🎉

Yup agreed. I wrote this comment before I realised we have this problem.

Collaborator

@pfackeldey pfackeldey left a comment

These changes are great! Thank you very much @ikrommyd 🎉
I have only minor comments: most of them would trim a few lines of code, and a few ask whether delaying materialization is actually needed.

@ikrommyd
Collaborator Author

Needs unit tests and some analysis-like testing. AGC and coffea benchmarks look good at the moment.

@ikrommyd
Collaborator Author

ikrommyd commented Feb 21, 2025

Does anyone know a good place to add an ak.materialize(stuff) call before a Numba-compiled ArrayBuilder operation runs? In arrayview.py, perhaps?
For example:

import numba


@numba.njit
def find_4lep(events_leptons, builder):
    """Search for valid 4-lepton combinations from an array of events * leptons {charge, ...}

    A valid candidate has two pairs of leptons that each have balanced charge
    Outputs an array of events * candidates {indices 0..3} corresponding to all valid
    permutations of all valid combinations of unique leptons in each event
    (omitting permutations of the pairs)
    """
    for leptons in events_leptons:
        builder.begin_list()
        nlep = len(leptons)
        for i0 in range(nlep):
            for i1 in range(i0 + 1, nlep):
                if leptons[i0].charge + leptons[i1].charge != 0:
                    continue
                for i2 in range(nlep):
                    for i3 in range(i2 + 1, nlep):
                        if len({i0, i1, i2, i3}) < 4:
                            continue
                        if leptons[i2].charge + leptons[i3].charge != 0:
                            continue
                        builder.begin_tuple(4)
                        builder.index(0).integer(i0)
                        builder.index(1).integer(i1)
                        builder.index(2).integer(i2)
                        builder.index(3).integer(i3)
                        builder.end_tuple()
        builder.end_list()

    return builder

This actually materializes on its own, so we don't have to materialize ourselves, but I don't think that's always going to happen. In this case it's going to materialize all the buffers of whatever events_leptons is. For external Numba-compiled functions we can't do anything, but for ArrayBuilder stuff I think we could add an intentional ak.materialize call before the kernel runs.

Oh, in _lookup.py maybe? A lookup is going to run before calling the ArrayBuilder, right? So we can just materialize the whole layout that goes into the lookup.
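
Roughly, the "materialize before the kernel runs" idea could look like the sketch below. It is pure Python and does not touch the real lookup machinery: `VirtualColumn`, `materialize_inputs`, and `count_pairs` are hypothetical names, and the plain function stands in for a compiled kernel.

```python
import functools
from typing import Any, Callable


class VirtualColumn:
    """Hypothetical lazy column; stands in for a layout with virtual buffers."""

    def __init__(self, loader: Callable[[], list]) -> None:
        self._loader = loader
        self.data = None  # None means "not materialized yet"

    def materialize(self) -> "VirtualColumn":
        # Load the data once and keep it.
        if self.data is None:
            self.data = self._loader()
        return self


def materialize_inputs(kernel: Callable[..., Any]) -> Callable[..., Any]:
    """Materialize every lazy argument before the kernel body runs."""

    @functools.wraps(kernel)
    def wrapper(*args: Any) -> Any:
        ready = [a.materialize() if isinstance(a, VirtualColumn) else a for a in args]
        return kernel(*ready)

    return wrapper


@materialize_inputs
def count_pairs(charges: VirtualColumn) -> int:
    # Stand-in for a compiled kernel: it reads raw data and cannot trigger loads itself.
    values = charges.data
    return sum(
        1
        for i in range(len(values))
        for j in range(i + 1, len(values))
        if values[i] + values[j] == 0
    )


column = VirtualColumn(lambda: [1, -1, 1, -1])
n_pairs = count_pairs(column)  # the wrapper loads the data before the kernel sees it
```

The trade-off is that everything passed in gets loaded, whether or not the kernel ends up reading it.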

@ikrommyd ikrommyd requested a review from pfackeldey February 21, 2025 04:11
@ikrommyd
Collaborator Author

Oh, I take that back actually; it may be best to do the opposite: raise an error in the lookup and prompt the user to intentionally materialize.
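
As a rough illustration of the "raise instead" option, here is a pure-Python sketch of such a guard. `VirtualColumn` and `check_materialized` are hypothetical names, and the choice of `TypeError` and the message wording are assumptions, not what the lookup code actually does.

```python
from typing import Callable


class VirtualColumn:
    """Hypothetical lazy column; stands in for a layout with virtual buffers."""

    def __init__(self, loader: Callable[[], list]) -> None:
        self._loader = loader
        self.data = None  # None means "not materialized yet"

    def materialize(self) -> "VirtualColumn":
        # Load the data once and keep it.
        if self.data is None:
            self.data = self._loader()
        return self


def check_materialized(column: VirtualColumn, name: str) -> None:
    """Refuse lazy inputs instead of loading them implicitly."""
    if column.data is None:
        raise TypeError(
            f"{name!r} still has virtual buffers; materialize it explicitly "
            "before passing it to a compiled function"
        )


column = VirtualColumn(lambda: [1, -1])

try:
    check_materialized(column, "events_leptons")
    message = None
except TypeError as err:
    message = str(err)  # the user is told to materialize on purpose

column.materialize()
check_materialized(column, "events_leptons")  # passes silently once the data is loaded
```

This keeps the expensive load an explicit decision by the user rather than a side effect of calling a compiled function.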

@pfackeldey
Collaborator

> Oh, I take that back actually; it may be best to do the opposite: raise an error in the lookup and prompt the user to intentionally materialize.

I like that! 👍
