Memory diagnoser fix for Tiered Compilation #1543

adamsitnik · 2020-10-01T15:03:23Z

In #1542 @ronbrogan has reported a very unusual bug - a code that was CPU bound and not allocating at all was reporting allocations for .NET Core 3.1 (it works fine for 2.1 as there we use GC.GetAllocatedBytesForCurrentThread instead GC.GetTotalAllocatedBytes):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

namespace BenchmarkAllocationDebug
{
    class Program
    {
        static void Main() => BenchmarkRunner.Run<Bench>();
    }

    [KeepBenchmarkFiles]
    public class Bench
    {
        [Benchmark] public ulong Benchmark1() => TestMethod();
        [Benchmark] public ulong Benchmark2() => TestMethod();
        [Benchmark] public ulong Benchmark3() => TestMethod();
        [Benchmark] public ulong Benchmark4() => TestMethod();
        [Benchmark] public ulong Benchmark5() => TestMethod();

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static ulong TestMethod()
        {
            var r = 1ul;
            for (var i = 0; i < 50_000_000; i++)
            {
                r /= 1;
            }
            return r;
        }
    }
}

After some investigation, I've narrowed down the problem to Tiered JIT thread that from time to time would be promoting methods from Tier 0 to Tier 1 and allocating memory during the iteration where we call GC.GetTotalAllocatedBytes.

I had few ideas, but the only one worked was putting the thread to sleep for 250ms before we call GC.GetTotalAllocatedBytes. In this time TC thread kicks-in and promotes the methods. It's of course far from perfect as TC might not finish the promotion before we make the first call to GC.GetTotalAllocatedBytes. I don't want to prolong the sleeping period because it would increase the time we need to run the benchmarks.

@kouvel do you have a better idea of how we could prevent TC from working at a given point of time?

I've confirmed that it works as expected by modifying the GetExtraStats method to emit some extra events and filtering the TC events in PerfView to this particular period of time:

private (GcStats, ThreadingStats) GetExtraStats(IterationData data)
{
    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -2);

    if (RuntimeInformation.IsNetCore)
    {
        // we put the current thread to sleep so Tiered Compiler can kick in, compile it's stuff
        // and NOT allocate anything on the background thread when we are measuring allocations
        System.Threading.Thread.Sleep(TimeSpan.FromMilliseconds(250));
    }

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -2);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -1);

    // we enable monitoring after main target run, for this single iteration which is executed at the end
    // so even if we enable AppDomain monitoring in separate process
    // it does not matter, because we have already obtained the results!
    EnableMonitoring();

    IterationSetupAction(); // we run iteration setup first, so even if it allocates, it is not included in the results

    var initialThreadingStats = ThreadingStats.ReadInitial(); // this method might allocate
    var initialGcStats = GcStats.ReadInitial();

    WorkloadAction(data.InvokeCount / data.UnrollFactor);

    var finalGcStats = GcStats.ReadFinal();
    var finalThreadingStats = ThreadingStats.ReadFinal();

    IterationCleanupAction(); // we run iteration cleanup after collecting GC stats

    GcStats gcStats = (finalGcStats - initialGcStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);
    ThreadingStats threadingStats = (finalThreadingStats - initialThreadingStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -1);

    return (gcStats, threadingStats);
}


          add failing test


          store the results of Runtime checks


          for .NET Core, when MemoryDiagnoser is used and we perform one extra …

…iteration, the Tiered JIT might kick-in and allocate some memory and affect the results as a workaround, we can put the thread to sleep for more than 200ms to TC thread kicks in before we start memory measurements it's far from perfect but it works fixes #1542

kouvel · 2020-10-01T18:05:55Z

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1? Timings have changed between 3.1 and 5.0 such that the rejits would happen sooner in 5.0 but it seems unlikely that that would affect microbenchmarks (the change is easily visible in real-world cases with many thousands of methods but there would only be a few methods here). For allocation measurements especially with long-running methods and with a strong assertion about allocation (that there shouldn't be any) generally my suggestion is to disable tiered compilation. Tier 0 jitted code may allocate where optimized code would not, and the former may allocate much more than the latter. An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1, which should tier-up more quickly, but it would still not guarantee that things would get tiered-up in time, if the involved methods are very long-running some methods may not reach tier 1.

Aside from that, I wonder what is actually causing allocation during jitting. There are some small allocation-dependent things I'm aware of regarding virtual methods that can be fixed, but otherwise on a rejit since all static construction, etc. should have already been done I'm not sure what would be allocating. Do you have an idea?

kouvel · 2020-10-01T18:12:12Z

Also I don't think there should be any guarantee that there would not be any allocation happening in the background. As we move more stuff to managed code like the thread pool, things may happen in the background unrelated to the app's code that may cause allocation and that should be ok.

adamsitnik · 2020-10-02T16:23:23Z

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1?

Yes, it looks like it's specific to 3.1 (I did not test 2.2 and 3.0 as they are not supported anymore)

BenchmarkDotNet=v0.12.1.20201002-develop, OS=Windows 10.0.18363.1082 (1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-WTPPRD : .NET 5.0.0 (5.0.20.45114), X64 RyuJIT
  Job-MUUDGX : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-TXYPKS : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-IEOLEB : .NET Framework 4.8 (4.8.4220.0), X64 RyuJIT


|     Method |              Runtime | Allocated |
|----------- |--------------------- |----------:|
| Benchmark1 |             .NET 5.0 |         - |
| Benchmark1 |        .NET Core 2.1 |         - |
| Benchmark1 |        .NET Core 3.1 |       9 B |
| Benchmark1 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark2 |             .NET 5.0 |         - |
| Benchmark2 |        .NET Core 2.1 |         - |
| Benchmark2 |        .NET Core 3.1 |      20 B |
| Benchmark2 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark3 |             .NET 5.0 |         - |
| Benchmark3 |        .NET Core 2.1 |         - |
| Benchmark3 |        .NET Core 3.1 |      21 B |
| Benchmark3 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark4 |             .NET 5.0 |         - |
| Benchmark4 |        .NET Core 2.1 |         - |
| Benchmark4 |        .NET Core 3.1 |     178 B |
| Benchmark4 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark5 |             .NET 5.0 |         - |
| Benchmark5 |        .NET Core 2.1 |         - |
| Benchmark5 |        .NET Core 3.1 |     101 B |
| Benchmark5 |   .NET Framework 4.8 |         - |

kouvel · 2020-10-02T16:54:27Z

Ok I think it is the virtual slot backpatching storage. It's probably the rejit timing that is causing the allocation to happen earlier and not show up in the benchmark in 5.0.

adamsitnik · 2020-10-02T17:00:58Z

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Aside from that, I wonder what is actually causing allocation during jitting

I was also curious and tried to use VS Memory Profiler to find out, but I've failed.

In this particular case the VS Profiler shows me memory allocated for JITTing of the methods that are executed for the first time:

But it does not show me anything attributed to the TP Thread and TieredCompilationManager.OptimizeMethods

adamsitnik · 2020-10-02T17:03:26Z

Also I don't think there should be any guarantee that there would not be any allocation happening in the background.

I totally agree. But I also expect the users to be quite suprised when they see that BDN reports allocated memory for a code that clearly does not allocate anything in explicit way.

kouvel · 2020-10-02T17:14:24Z

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Possibly. With aggressive tiering, it still goes through the normal tiering stages, just more quickly. It would change rejit timings and more methods may be rejitted, though likely the code quality would be similar to default mode (no guarantee though since it can depend on timing and code paths hit in the calls before rejit). The other option is to give the test more warmup time to stabilize.

adamsitnik added 3 commits Oct 1, 2020

add failing test

82daa11

store the results of Runtime checks

7db9adc

adamsitnik added the Area:Diagnosers label Oct 1, 2020

adamsitnik added this to the v0.12.2 milestone Oct 1, 2020

adamsitnik requested a review from AndreyAkinshin Oct 1, 2020

dotnet / BenchmarkDotNet

Memory diagnoser fix for Tiered Compilation #1543

Memory diagnoser fix for Tiered Compilation #1543

adamsitnik commented Oct 1, 2020

kouvel commented Oct 1, 2020 •

edited

kouvel commented Oct 1, 2020

adamsitnik commented Oct 2, 2020

kouvel commented Oct 2, 2020 •

edited

adamsitnik commented Oct 2, 2020

adamsitnik commented Oct 2, 2020

kouvel commented Oct 2, 2020

Sep	OCT	Nov
	11
2019	2020	2021

dotnet / BenchmarkDotNet

Join GitHub today

Memory diagnoser fix for Tiered Compilation #1543

Memory diagnoser fix for Tiered Compilation #1543

Conversation

adamsitnik commented Oct 1, 2020

kouvel commented Oct 1, 2020 • edited

kouvel commented Oct 1, 2020

adamsitnik commented Oct 2, 2020

kouvel commented Oct 2, 2020 • edited

adamsitnik commented Oct 2, 2020

adamsitnik commented Oct 2, 2020

kouvel commented Oct 2, 2020

Essential cookies

Always active

Analytics cookies

kouvel commented Oct 1, 2020 •

edited

kouvel commented Oct 2, 2020 •

edited