The Wayback Machine - https://web.archive.org/web/20220404155203/https://github.com/arrayfire/arrayfire/issues/3042

[Question]Worse performance on matrix dot product than numpy using cpu backend #3042

Open
HO-COOH opened this issue Nov 5, 2020 · 8 comments

@HO-COOH HO-COOH commented Nov 5, 2020

Using almost the same code as blas.cpp, slightly modified to benchmark the product of two randomly initialized matrices:

#include <arrayfire.h>
#include <math.h>
#include <stdio.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A;  // populated before each timing
static array B;
static void fn() {
    //array B= matmul(A, A);  // matrix multiply
    array C = matmul(A, B);
}

int main(int argc, char** argv) {
    double peak = 0;
    try {
        int device = argc > 1 ? atoi(argv[1]) : 0;
        setDevice(device);
        info();
        putchar('\n');

        for (int n = 1024; n <= 4096; n += 512) {
            printf("%4d x %4d: ", n, n);
            A = randu({ n, n }, f32);
            B = randu({ n, n }, f32);
            double time   = timeit(fn);  // time in seconds
            double gflops = 2.0 * pow(double(n), 3) / (time * 1e9);
            if (gflops > peak) peak = gflops;

            printf(" %4.0f Gflops\t time = %f\n", gflops, time);
            fflush(stdout);
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    printf(" ### peak %g GFLOPS\n", peak);

    return 0;
}
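The GFLOP/s figure uses the standard 2n³ flop count for an n × n matrix product (n multiplies plus n−1 adds per output element, n² elements). As a quick sanity check (a side note in Python, not part of the original report), the formula reproduces the rates printed in the ArrayFire run below:

```python
def gflops(n, seconds):
    # ~2*n**3 floating-point operations for a dense n x n matrix product
    return 2.0 * n**3 / (seconds * 1e9)

# (n, seconds) pairs taken from the ArrayFire output reported below
print(round(gflops(1024, 0.019148)))  # -> 112
print(round(gflops(4096, 0.868543)))  # -> 158
```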

I am getting worse performance than the equivalent numpy code:

import time
import numpy

for n in range(1024, 4097, 512):
    a=numpy.random.randn(n, n).astype('f')
    b=numpy.random.randn(n, n).astype('f')

    before = time.perf_counter_ns()
    c=numpy.matmul(a, b)
    dtime = time.perf_counter_ns()-before
    gflops = 2.0*(n**3)/dtime
    print(f'{n} x {n}: {gflops} Gflops\t time = {dtime/(10**9)}')
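One methodological caveat (my observation, not from the thread): the numpy script times a single call, which includes first-run effects such as thread-pool spin-up, whereas ArrayFire's timeit() runs the function repeatedly to stabilize the measurement. A sketch that takes the best of a few runs on the numpy side (helper names here are made up for illustration):

```python
import time
import numpy

def timed_ns(fn):
    # Wall-clock time of one call, in nanoseconds
    before = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - before

def best_of(fn, runs=3):
    # Fastest of several runs, roughly mirroring af::timeit's repetition
    return min(timed_ns(fn) for _ in range(runs))

n = 1024
a = numpy.random.randn(n, n).astype('f')
b = numpy.random.randn(n, n).astype('f')
dtime = best_of(lambda: numpy.matmul(a, b))
print(f'{n} x {n}: {2.0 * n**3 / dtime:.0f} Gflops')
```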

From arrayfire, I got

ArrayFire v3.7.2 (CPU, 64-bit Windows, build 218dd2c)
[0] AMD: AMD Ryzen 7 1700 Eight-Core Processor
1024 x 1024:   112 Gflops        time = 0.019148
1536 x 1536:   118 Gflops        time = 0.061292
2048 x 2048:   142 Gflops        time = 0.121193
2560 x 2560:   149 Gflops        time = 0.224779
3072 x 3072:   153 Gflops        time = 0.379377
3584 x 3584:   159 Gflops        time = 0.577787
4096 x 4096:   158 Gflops        time = 0.868543
 ### peak 159.355 GFLOPS

From numpy I got

1024 x 1024: 263.95806728370025 Gflops	 time = 0.0081357
1536 x 1536: 260.4895596543941 Gflops	 time = 0.0278236
2048 x 2048: 253.4296488984282 Gflops	 time = 0.0677895
2560 x 2560: 283.2759427073044 Gflops	 time = 0.1184514
3072 x 3072: 306.46378034335623 Gflops	 time = 0.1891971
3584 x 3584: 309.9085367580112 Gflops	 time = 0.2970985
4096 x 4096: 287.6440082610381 Gflops	 time = 0.4778092
@HO-COOH HO-COOH commented Nov 5, 2020

@umar456 Thanks for the link. I was looking for information about ArrayFire's CPU backend but couldn't find any.
I did not change any environment variables for numpy, so I suppose it uses OpenBLAS? The main issue is that ArrayFire is slower out of the box.

@umar456 umar456 commented Nov 6, 2020

ArrayFire uses MKL as the default BLAS library. I am not sure what library is being used by numpy for this task. Because you are using an AMD CPU for this test, you could be running into issues that could be alleviated by the steps listed in that blog post.

ArrayFire can also be built using other BLAS libraries but this is not offered through the official installers.

@HO-COOH HO-COOH commented Nov 6, 2020

@umar456 Nice. After adding MKL_DEBUG_CPU_TYPE=5 as the link described, the performance is now comparable to numpy on my machine. Given that, would you consider using OpenBLAS for AMD CPUs? The performance impact of the default MKL on AMD CPUs seems quite dramatic in that article (although it's not as dramatic in my test), or at least mention it in the docs.
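For reference, the variable just needs to be set in the environment before launching the benchmark. The exact commands below are an illustration, not copied from the thread; the binary name is hypothetical. Note this is an undocumented MKL knob that was reportedly removed in later MKL releases:

```shell
# Windows (cmd.exe):
#   set MKL_DEBUG_CPU_TYPE=5
# Linux/macOS (bash):
export MKL_DEBUG_CPU_TYPE=5
# ./blas_benchmark        # hypothetical name for the compiled example above
echo "MKL_DEBUG_CPU_TYPE=$MKL_DEBUG_CPU_TYPE"
```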
For anyone interested, this is what I got after setting the environment variable:

ArrayFire v3.7.2 (CPU, 64-bit Windows, build 218dd2c)
[0] AMD: AMD Ryzen 7 1700 Eight-Core Processor
1024 x 1024:   141 Gflops        time = 0.015234
1536 x 1536:   178 Gflops        time = 0.040671
2048 x 2048:   254 Gflops        time = 0.067612
2560 x 2560:   253 Gflops        time = 0.132779
3072 x 3072:   265 Gflops        time = 0.219109
3584 x 3584:   311 Gflops        time = 0.296531
4096 x 4096:   307 Gflops        time = 0.447387
 ### peak 310.501 GFLOPS

And the numpy.show_config() output:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_info']
    libraries = ['openblas_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_info']
    libraries = ['openblas_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_lapack_info']
    libraries = ['openblas_lapack_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = ['D:\\a\\1\\s\\numpy\\build\\openblas_lapack_info']
    libraries = ['openblas_lapack_info']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
None

@WilliamTambellini WilliamTambellini commented Nov 6, 2020

@HO-COOH Perhaps there are OpenCL drivers for AMD CPUs; if so, it could be worth trying afopencl?

@HO-COOH HO-COOH commented Nov 6, 2020

@WilliamTambellini AMD has dropped OpenCL support for their CPUs. You can see the discussion here.

@umar456 umar456 commented Nov 6, 2020

@HO-COOH You can use the intel runtime on AMD CPUs.

@HO-COOH HO-COOH commented Nov 6, 2020

@umar456 Tried, but no luck. My CPU does not appear in clinfo output.
