My Journey to the Land of Extreme Concurrent Performance in .NET

Szymon Kulec

@Scooletz

http://scooletz.com

Agenda

RampUp

Goals

CPU

CPU - true || false

  • The CPU executes operations in the same order they were written: FALSE
  • Accessing memory in a predictable manner can improve performance: TRUE
  • The CPU tries to predict the outcome of conditional expressions: TRUE
  • Accessing the same CPU cache line from multiple threads on a modern CPU affects performance: TRUE

CPU - some architecture

  • registers
  • caches (L1, L2, L3)
  • prefetcher
  • predictor
  • pipeline
  • write buffer

Checkpoint - CPU friendly code

  • Boringly (yawn here) predictable
  • Linear access patterns (see the sketch below)
  • No locks, let the CPU buffer & reorder some writes
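
As a quick illustration (my example, not from the talk): summing a 2D array row-by-row walks memory linearly, so the prefetcher can help; summing it column-by-column touches one element per cache line and is typically several times slower.

using System;
using System.Diagnostics;

class LinearAccessDemo
{
    const int N = 4096;

    static void Main()
    {
        var data = new int[N, N];
        var sw = Stopwatch.StartNew();
        long sum = 0;

        for (var row = 0; row < N; row++)       // linear: consecutive addresses
            for (var col = 0; col < N; col++)
                sum += data[row, col];
        Console.WriteLine($"row-major: {sw.ElapsedMilliseconds} ms, sum: {sum}");

        sw.Restart();
        sum = 0;
        for (var col = 0; col < N; col++)       // strided: jumps N*4 bytes each step
            for (var row = 0; row < N; row++)
                sum += data[row, col];
        Console.WriteLine($"col-major: {sw.ElapsedMilliseconds} ms, sum: {sum}");
    }
}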

Memory models

Memory models

  • Consider two virtual operations: LOAD & STORE
  • Four possible combinations:
    • LOAD-LOAD
    • LOAD-STORE
    • STORE-STORE
    • STORE-LOAD

Memory models - barriers

  • Inserted between two LOAD/STORE operations.
  • A memory barrier is a simple mechanism that disables some reorderings
  • A full memory barrier prohibits ANY reorderings
  • A full memory barrier is emitted by (see the sketch below):
    • lock (obj)
    • Interlocked.Add/Exchange
    • when using high level constructs like WaitHandle
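
A minimal sketch of those constructs side by side (names are mine):

using System.Threading;

class FullBarriers
{
    static readonly object Gate = new object();
    static long _counter;

    static void Increment()
    {
        lock (Gate)                             // Monitor.Enter/Exit: full barriers
        {
            _counter++;
        }

        Interlocked.Add(ref _counter, 1);       // atomic add + full barrier
        Interlocked.Exchange(ref _counter, 0);  // atomic swap + full barrier

        Thread.MemoryBarrier();                 // an explicit full barrier on its own
    }
}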

Memory models - barriers

  • Acquire fence
    • Volatile.Read
    • Prohibits LOAD-LOAD
    • Prohibits LOAD-STORE
  • Release fence
    • Volatile.Write
    • Prohibits LOAD-STORE
    • Prohibits STORE-STORE

Writer uses Volatile.Write to ensure that Status becomes visible ONLY after Value has been stored:

// writer
o.Value = 1;
Volatile.Write(ref o.Status, 1);

// this is equal to:
// STORE Value 1
// RELEASE FENCE
// STORE Status 1

Reader uses Volatile.Read to ensure that when it reads the changed Status, the Value stored before it is visible as well:

// reader
if (Volatile.Read(ref o.Status) == 1)
{
    var value = o.Value;
}

// this is equal to:
// LOAD Status
// ACQUIRE FENCE
// LOAD Value
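
For completeness, a self-contained version of the pair (class and field names are mine). The release fence keeps the STORE of Value before the STORE of Status; the acquire fence keeps the LOAD of Status before the LOAD of Value, so the reader can never observe Status == 1 with a stale Value:

using System;
using System.Threading;

class Handoff
{
    int _value;
    int _status;

    public void Writer()
    {
        _value = 1;                              // STORE Value
        Volatile.Write(ref _status, 1);          // RELEASE FENCE + STORE Status
    }

    public void Reader()
    {
        if (Volatile.Read(ref _status) == 1)     // LOAD Status + ACQUIRE FENCE
        {
            Console.WriteLine(_value);           // LOAD Value, guaranteed to be 1
        }
    }

    static void Main()
    {
        var h = new Handoff();
        var reader = new Thread(() =>
        {
            while (Volatile.Read(ref h._status) == 0) { } // spin until published
            h.Reader();
        });
        reader.Start();
        h.Writer();
        reader.Join();
    }
}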

Memory models - allowed reorderings

Name                  LOAD-LOAD   LOAD-STORE   STORE-STORE   STORE-LOAD
Full barrier (lock)   no          no           no            no
Acquire fence         no          no           yes           yes
Release fence         yes         no           no            yes
No fence              yes         yes          yes           yes

Checkpoint - memory models

  • Even in a high-level language like C# we can ensure nice performance properties
  • To achieve extreme performance you need to leave some room for CPU optimizations
  • Acquire & release fences are a pair of memory barriers that should be used together, as in the reader/writer example
  • The reader/writer example can be treated as a producer/consumer pair as well

Allocations

Allocations

  • Allocations are not fun at all
  • Objects are allocated on the heap and require GC
  • Heap objects are scattered, so the memory cannot be read in a linear way
  • But hey, structs are stack allocated, so it's cheap to create them (no friction, no garbage collection)
  • Another approach would be to run in situ...

Latency & throughput

Latency & throughput

  • response time = latency + service time
  • They say "throughput = 1/latency", but that holds only when requests are processed strictly one at a time

Latency & throughput

  • Consider disk IO and batching writes. Batching reduces the average response time, amortizing the costly system calls while adding just a bit of overhead per batch
  • Consider accessing a concurrent queue. Is it cheaper to access it 100 times in a row, or once for 100 messages, lowering the friction?
  • Using batches is the way to go (see the sketch below)
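
A hypothetical consumer loop illustrating the idea (Handle and Flush are made-up names, not RampUp's API): drain whatever has accumulated into a local buffer, process it, and pay the expensive call once per batch instead of once per message.

using System;
using System.Collections.Concurrent;
using System.Threading;

class BatchingConsumer
{
    readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();

    public void Run(CancellationToken token)
    {
        var batch = new string[100];
        while (!token.IsCancellationRequested)
        {
            var count = 0;                      // one pass over the queue...
            while (count < batch.Length && _queue.TryDequeue(out batch[count]))
                count++;

            for (var i = 0; i < count; i++)     // ...then process locally
                Handle(batch[i]);

            if (count > 0)
                Flush();                        // one costly call per batch
        }
    }

    void Handle(string msg) => Console.WriteLine(msg);
    void Flush() { /* e.g. a single write + fsync for the whole batch */ }
}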

Actors & SEDA

Actors

Actors

  • Inbox - a queue of incoming messages
  • Ability to send/publish a message to other inboxes
  • Multiple actors - multiple possible producers for a queue
  • Can be described as a single-threaded consumer of a queue (see the sketch below)
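
A minimal actor sketch (illustrative only, not RampUp's implementation): many producers post to the inbox, one dedicated thread consumes it, so the actor's state never needs a lock.

using System;
using System.Collections.Concurrent;
using System.Threading;

class Actor<TMessage>
{
    readonly BlockingCollection<TMessage> _inbox = new BlockingCollection<TMessage>();
    readonly Thread _worker;

    public Actor(Action<TMessage> handler)
    {
        _worker = new Thread(() =>
        {
            foreach (var msg in _inbox.GetConsumingEnumerable())
                handler(msg);                   // single-threaded: no races on state
        });
        _worker.Start();
    }

    public void Send(TMessage msg) => _inbox.Add(msg);  // safe from any thread

    public void Stop()
    {
        _inbox.CompleteAdding();
        _worker.Join();
    }
}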

SEDA

  • Staged event-driven architecture
  • Queues + multiple workers
  • Can be thought of as multiple actors sharing the same inbox
  • Consumers of the queue use resources concurrently
  • No longer wanted in its original shape (see the Cassandra issue)
  • EventStore is described as SEDA, but it uses single threads for the important queues

Checkpoint - Actors & SEDA

  • Workers consuming messages sent to a queue
  • It's multi-producer/single-consumer (MPSC), with other actors as producers and a worker as the consumer

Checkpoints of checkpoints :)

  • The CPU loves predictability, like linear access to memory
  • C# enables the use of low-level primitives, limiting the need for locking
  • The multi-producer/single-consumer approach used by actor models & partially by SEDA is a special case of a reader/writer

A plan of attack to provide The Fastest .NET App Ever (TM)

A plan of attack to provide The Fastest .NET App Ever (TM)

  • Build a good multi-producer/single-consumer queue
  • Support obtaining a few messages in one query to allow "smart batching"
  • Don't allocate at all, keeping in mind that structs on the stack are ok
  • Use structs for message types, as they're stack allocated
  • Pass messages by ref, otherwise structs are copied (see the sketch below)
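
A tiny illustration of the last two points (the message type is made up): a blittable struct allocates nothing on the heap, and ref avoids copying it on every call.

using System;

struct PriceChanged
{
    public long InstrumentId;
    public long Price;
}

static class Example
{
    static void Publish<TMessage>(ref TMessage msg) where TMessage : struct
    {
        // the queue writer copies the bytes exactly once, into the buffer
    }

    static void Main()
    {
        var msg = new PriceChanged { InstrumentId = 1, Price = 100 };
        Publish(ref msg);                       // no boxing, no GC, no extra copy
    }
}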

Implementation

Implementation

  • An unmanaged buffer allocated with VirtualAlloc
  • Big enough to handle peaks
  • Represents a circular, overlapping buffer
  • Uses its end for storing control variables

[Diagram: the ring buffer with byte offsets 0, 8, 16, ..., 128; head: 0, tail: 0]

Implementation - writer

  • A writer reserves the next bytes by atomically increasing the tail (Interlocked.Add) by the message length (8) + the header (8)

[Diagram: the same buffer after the reservation; head: 0, tail moved from 0 to 16]

Implementation - writer

  • Checks that it does not overlap with the head
  • Writes its (-length, message id) header with Volatile.Write

[Diagram: the header (-8|5) written at the reserved position; head: 0, tail: 16]

Implementation - writer

  • Copies the data
  • Overwrites its (-length) with (length) using Volatile.Write

[Diagram: the header flipped to (8|5) with the message data in place; head: 0, tail: 16]
A pseudo code for the writer:

byte* buffer;
long* tail = (long*)(buffer + TAIL_OFFSET);

// Interlocked.Add returns the new tail, so step back by the reserved size
var position = Interlocked.Add(ref *tail, 16) - 16;
byte* writeTo = buffer + position;

// message id in the high 32 bits, negated length in the low 32 bits
var negativeHeader = ((long)messageId << 32) | (uint)(-length);
Volatile.Write(ref *(long*)writeTo, negativeHeader);

CopyMessage(buffer, writeTo + 8);
Volatile.Write(ref *(int*)writeTo, length);

// this is equal to:
// STORE -length|messageId
// RELEASE FENCE
// COPY MESSAGE
// RELEASE FENCE
// STORE length

Implementation - writer

  • Only one Interlocked.Add, though it will create some friction with other writers
  • The rest of the operations are ordered with Volatile.Write
  • An interesting way of writing the length twice: first the negated value, then an ack by writing the positive value

Implementation - reader

Implementation - reader

  • Loops, reading the length located at the position marked by the head
  • Uses Volatile.Read
  • Once the length is positive, reads the message
  • Zeroes the whole slot afterwards (a sketch follows)
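
A pseudo-code sketch of that loop, mirroring the writer above (HandleMessage and ZeroMemory are illustrative helpers):

byte* buffer;
long* head = (long*)(buffer + HEAD_OFFSET);
byte* readFrom = buffer + *head;

int length;
while ((length = Volatile.Read(ref *(int*)readFrom)) <= 0)
{
    // a non-positive length means the writer reserved the slot
    // but has not finished copying yet: spin
}

var messageId = *(int*)(readFrom + 4);          // high half of the header
HandleMessage(messageId, readFrom + 8, length); // consume the payload

ZeroMemory(readFrom, 8 + length);               // zero the whole slot for reuse
Volatile.Write(ref *head, *head + 8 + length);  // publish the new head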

Checkpoint - reader & writer

         Head            Tail              Length
Writer   Volatile.Read   Interlocked.Add   Volatile.Write x2
Reader   Volatile.Write  -                 Volatile.Read

Implementation - API


public interface IBus
{
    void Publish<TMessage>(ref TMessage msg) 
        where TMessage : struct;

    void Send<TMessage>(ref TMessage msg, 
        ActorId receiver) 
        where TMessage : struct;
}
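
Hypothetical usage of the API, reusing the PriceChanged struct from the earlier sketch (the ActorId construction is assumed):

class Quoter
{
    readonly IBus _bus;

    public Quoter(IBus bus)
    {
        _bus = bus;
    }

    public void OnTick()
    {
        var msg = new PriceChanged { InstrumentId = 1, Price = 100 };
        _bus.Publish(ref msg);                  // broadcast to interested actors
        _bus.Send(ref msg, new ActorId(2));     // or address a specific actor
    }
}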

Implementation - API

  • The API has generic methods based on message types
  • A message must be converted to a slice of bytes to be written by the writer
  • You could try to take a pointer to the parameter, but the C# compiler won't let you :(
  • IL Emit it (a sketch follows)
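
A minimal sketch of the trick (my code; it assumes DynamicMethod may run unverifiable IL, which holds on current runtimes): ldarg.0 loads the managed ref, conv.i reinterprets it as a native pointer, which is exactly the cast the C# compiler refuses to do. The resulting pointer is only valid while the struct stays on the stack (or is pinned).

using System;
using System.Reflection.Emit;

delegate IntPtr RefToPointer<TMessage>(ref TMessage msg) where TMessage : struct;

static class PointerEmitter
{
    public static RefToPointer<TMessage> Build<TMessage>() where TMessage : struct
    {
        var method = new DynamicMethod(
            "RefToPointer",
            typeof(IntPtr),
            new[] { typeof(TMessage).MakeByRefType() },
            typeof(PointerEmitter).Module,
            skipVisibility: true);

        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);   // the ref TMessage argument (a managed pointer)
        il.Emit(OpCodes.Conv_I);    // reinterpret it as a native int
        il.Emit(OpCodes.Ret);

        return (RefToPointer<TMessage>)method.CreateDelegate(typeof(RefToPointer<TMessage>));
    }
}

On modern .NET the same conversion is available as System.Runtime.CompilerServices.Unsafe.AsPointer(ref msg), with no hand-written IL.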

Implementation - API

  • If a struct is stack allocated (it's not an array element)
  • and if it has no managed members
  • then the CLR can treat ref msg as TMessage*
  • Once you have a pointer, you can cast it to (byte*)
  • This is the slice we need

Summary

Summary

  • You can write ultra-fast code in .NET
  • You can go low level with Volatile and Interlocked
  • Proper abstractions are enablers
  • Single-threaded processing is simpler
  • Sometimes you need to emit IL
  • Writing code that fast is simple, but not easy
Simplicity is prerequisite for reliability.

Edsger W. Dijkstra

Questions?

Szymon Kulec

@Scooletz

http://blog.scooletz.com