The Wayback Machine - https://web.archive.org/web/20220404155203/https://github.com/arrayfire/arrayfire/issues/3042

[Question]Worse performance on matrix dot product than numpy using cpu backend #3042

Open
HO-COOH opened this issue Nov 5, 2020 · 8 comments

@HO-COOH HO-COOH commented Nov 5, 2020

Using almost the same code as blas.cpp, slightly modified to benchmark the product of two randomly initialized matrices:

#include <arrayfire.h>
#include <math.h>
#include <stdio.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A;  // populated before each timing
static array B;
static void fn() {
    //array B= matmul(A, A);  // matrix multiply
    array C = matmul(A, B);
}

int main(int argc, char** argv) {
    double peak = 0;
    try {
        int device = argc > 1 ? atoi(argv[1]) : 0;
        setDevice(device);
        info();
        putchar('\n');

        for (int n = 1024; n <= 4096; n += 512) {
            printf("%4d x %4d: ", n, n);
            A = randu({ n, n }, f32);
            B = randu({ n, n }, f32);
            double time   = timeit(fn);  // time in seconds
            double gflops = 2.0 * pow(double(n), 3) / (time * 1e9);
            if (gflops > peak) peak = gflops;

            printf(" %4.0f Gflops\t time = %f\n", gflops, time);
            fflush(stdout);
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    printf(" ### peak %g GFLOPS\n", peak);

    return 0;
}
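The GFLOP/s figure uses the standard 2n³ flop count for an n × n matrix product (n multiplies plus n−1 adds per output element, n² elements). As a quick sanity check (a side note in Python, not part of the original report), the formula reproduces the rates printed in the ArrayFire run below:

```python
def gflops(n, seconds):
    # ~2*n**3 floating-point operations for a dense n x n matrix product
    return 2.0 * n**3 / (seconds * 1e9)

# (n, seconds) pairs taken from the ArrayFire output reported below
print(round(gflops(1024, 0.019148)))  # -> 112
print(round(gflops(4096, 0.868543)))  # -> 158
```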

I am getting worse performance than the equivalent numpy code:

import time
import numpy

for n in range(1024, 4097, 512):
    a=numpy.random.randn(n, n).astype('f')
    b=numpy.random.randn(n, n).astype('f')

    before = time.perf_counter_ns()
    c=numpy.matmul(a, b)
    dtime = time.perf_counter_ns()-before
    gflops = 2.0*(n**3)/dtime
    print(f'{n} x {n}: {gflops} Gflops\t time = {dtime/(10**9)}')
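One methodological caveat (my observation, not from the thread): the numpy script times a single call, which includes first-run effects such as thread-pool spin-up, whereas ArrayFire's timeit() runs the function repeatedly to stabilize the measurement. A sketch that takes the best of a few runs on the numpy side (helper names here are made up for illustration):

```python
import time
import numpy

def timed_ns(fn):
    # Wall-clock time of one call, in nanoseconds
    before = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - before

def best_of(fn, runs=3):
    # Fastest of several runs, roughly mirroring af::timeit's repetition
    return min(timed_ns(fn) for _ in range(runs))

n = 1024
a = numpy.random.randn(n, n).astype('f')
b = numpy.random.randn(n, n).astype('f')
dtime = best_of(lambda: numpy.matmul(a, b))
print(f'{n} x {n}: {2.0 * n**3 / dtime:.0f} Gflops')
```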

From arrayfire, I got

ArrayFire v3.7.2 (CPU, 64-bit Windows, build 218dd2c)
[0] AMD: AMD Ryzen 7 1700 Eight-Core Processor
1024 x 1024:   112 Gflops        time = 0.019148
1536 x 1536:   118 Gflops        time = 0.061292
2048 x 2048:   142 Gflops        time = 0.121193
2560 x 2560:   149 Gflops        time = 0.224779
3072 x 3072:   153 Gflops        time = 0.379377
3584 x 3584:   159 Gflops        time = 0.577787
4096 x 4096:   158 Gflops        time = 0.868543
 ### peak 159.355 GFLOPS

From numpy I got

1024 x 1024: 263.95806728370025 Gflops	 time = 0.0081357
1536 x 1536: 260.4895596543941 Gflops	 time = 0.0278236
2048 x 2048: 253.4296488984282 Gflops	 time = 0.0677895
2560 x 2560: 283.2759427073044 Gflops	 time = 0.1184514
3072 x 3072: 306.46378034335623 Gflops	 time = 0.1891971
3584 x 3584: 309.9085367580112 Gflops	 time = 0.2970985
4096 x 4096: 287.6440082610381 Gflops	 time = 0.4778092
@HO-COOH HO-COOH commented Nov 5, 2020

@umar456 Thanks for the link. I was looking for information about ArrayFire's CPU backend but couldn't find any.
I did not change any environment variables for numpy, so I suppose it uses OpenBLAS? The main issue is that ArrayFire is slower out of the box.

@umar456 umar456 commented Nov 6, 2020

ArrayFire uses MKL as the default BLAS library. I am not sure what library is being used by numpy for this task. Because you are using an AMD CPU for this test, you could be running into issues that could be alleviated by the steps listed in that blog post.

ArrayFire can also be built using other BLAS libraries but this is not offered through the official installers.

@HO-COOH HO-COOH commented Nov 6, 2020

@umar456 Nice. After adding MKL_DEBUG_CPU_TYPE=5 as the link described, the performance is now comparable to numpy on my machine. Given that, would you consider using OpenBLAS for AMD CPUs? The performance impact of the default MKL on AMD CPUs seems quite dramatic in that article (although it's not as dramatic in my test), or at least mention it in the docs.
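For reference, the variable just needs to be set in the environment before launching the benchmark. The exact commands below are an illustration, not copied from the thread; the binary name is hypothetical. Note this is an undocumented MKL knob that was reportedly removed in later MKL releases:

```shell
# Windows (cmd.exe):
#   set MKL_DEBUG_CPU_TYPE=5
# Linux/macOS (bash):
export MKL_DEBUG_CPU_TYPE=5
# ./blas_benchmark        # hypothetical name for the compiled example above
echo "MKL_DEBUG_CPU_TYPE=$MKL_DEBUG_CPU_TYPE"
```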
For anyone interested, this is what I got after setting the environment variable:

ArrayFire v3.7.2 (CPU, 64-bit Windows, build 218dd2c)
[0] AMD: AMD Ryzen 7 1700 Eight-Core Processor
1024 x 1024:   141 Gflops        time = 0.015234
1536 x 1536:   178 Gflops        time = 0.040671
2048 x 2048:   254 Gflops        time = 0.067612
2560 x 2560:   253 Gflops        time = 0.132779
3072 x 3072:   265 Gflops        time = 0.219109
3584 x 3584:   311 Gflops        time = 0.296531
4096 x 4096:   307 Gflops        time = 0.447387
 ### peak 310.501 GFLOPS

And the numpy.show_config() output:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_info']
    libraries = ['openblas_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_info']
    libraries = ['openblas_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_lapack_info']
    libraries = ['openblas_lapack_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_lapack_info']
    libraries = ['openblas_lapack_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
None

@WilliamTambellini WilliamTambellini commented Nov 6, 2020

@HO-COOH Perhaps there are OpenCL drivers for AMD CPUs; if so, it could be worth trying afopencl?

@HO-COOH HO-COOH commented Nov 6, 2020

@WilliamTambellini AMD has dropped OpenCL support for their CPUs. You can see the discussion here.

@umar456 umar456 commented Nov 6, 2020

@HO-COOH You can use the intel runtime on AMD CPUs.

@HO-COOH HO-COOH commented Nov 6, 2020

@umar456 Tried, but no luck. My CPU does not appear in clinfo output.
