
Can we have global work size a multiple of 16? #101

Open
@Anaphory

Description


Out of a toy interest, I am trying to get OpenCL and the tree-likelihood computation library BEAGLE running on a Pi. BEAGLE assumes that work sizes are divisible by 16, because that's handy for nucleotide substitution matrices, and it fails to run against the 12×12×12 work-size limit of VC4CL on the Pi.
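
For reference, this is how I look at the limits on the device (a minimal sketch using the standard OpenCL host API, assuming the Pi's GPU is the first platform and device):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    size_t item_sizes[3], group_size;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(item_sizes), item_sizes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(group_size), &group_size, NULL);

    /* on VC4CL this reports 12 x 12 x 12 and 12 */
    printf("max work-item sizes: %zu x %zu x %zu\n",
           item_sizes[0], item_sizes[1], item_sizes[2]);
    printf("max work-group size: %zu\n", group_size);
    return 0;
}
```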

Unfortunately, I don't know much about low-level programming and hardware (and I really don't understand any of OpenCL, the Pi's GPU architecture, or what the work size actually means, sorry), so the question I ask may be a bit dumb:
Would it be possible to change the work size?

I have been looking for the source of this magic number in the repository and found this comment:

VC4CL/src/vc4cl_config.h, lines 140 to 143 in 842d444:

* "The work-items in a given work-group execute concurrently on the processing elements of a single compute
* unit." (page 24) Since there is no limitation, that work-groups need to be executed in parallel, we set 1
* compute unit with all 12 QPUs, allowing us to run 12 work-items in a single work-group in parallel and run
* the work-groups sequentially.
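
If I read that comment (and the OpenCL spec) correctly, the 12 only caps the local work-group size; the global size just has to be a multiple of the local size, so VC4CL should already accept something like this (a sketch, assuming an existing queue and kernel):

```c
/* 48 is a multiple of both 12 and 16; VC4CL would run the
 * four work-groups of 12 one after another, per the comment above */
size_t global_size = 48;
size_t local_size = 12; /* must stay within the 12-item limit */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
```

So I suspect what actually breaks for BEAGLE is the work-group size itself rather than the global size, but I may be wrong about that.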

If work items can in part be executed sequentially – could I be taught how to set some of the work-size limits to 48 (the lcm of 12 and 16), at a small performance cost, or is that number embedded too deeply in the code, so that changing it would require a lot of changes in other places, like here?
V3D::instance()->getSystemInfo(SystemInfo::QPU_COUNT), param_value_size, param_value, param_value_size_ret);
