Loved the details about how memory access actually maps addresses to channels, ranks, banks and whatever; this is rarely discussed.
Not sure how this works for larger data structures, but my first thought was that this should be implemented as some microcode or instruction.
Most computation is not that jitter-sensitive; perception is not really on the nano- to microsecond scale. But maybe it's a cool gadget for things like dtrace or interrupt handlers.
* Announcement [1]
* Video [2]
1. https://x.com/lauriewired/status/2041566601426956391 (https://xcancel.com/lauriewired/status/2041566601426956391)
2. https://www.youtube.com/watch?v=KKbgulTp3FE
I like the project's arc: taking it from refresh-induced tail latency to racing threads assigned to addresses that are de-correlated by memory channel. Connecting this to a lookup table that is broadcast across memory channels so the lookup paths can race makes for a nice narrative. But framing this as reducing tail latency confused me, because I was expecting a join where a single reader gets the faster of the two racers.
From a narrative standpoint, I agree it makes more sense to focus on a duplicated lookup table where the fastest read wins. From an engineering standpoint, though, framing it in terms of channel-de-correlated reads has more possibilities. For example, if you need to evaluate multiple ML models in parallel to get a result, then by intentionally partitioning your models by channel you could ensure that a given model reads only fast data or only slow data. ML models might not be that interesting, though, since they are good candidates for being resident in L3.
I wonder if there will be a hardware solution in the future that duplicates memory over multiple channels and gives the first result back transparently without threads and racing.
That somewhat existed in servers in the past, my R720xd had RAM mirroring mode. IDK if it used it for reducing latency, but you could take out a stick and the server would continue running as normal and report an alarm in iDRAC.
Out of my area, but yeah, I have never heard of an optimized read using that. On the surface, it seems like a task much better suited for HW and there are companies that would probably pay for the ram per core penalty to get that low jitter in latency.
No, as far as I can tell, it does not reduce latency for reads. Write latency is worse in both the average and worst case, since writes have to be sent to two DIMMs. The purpose is high reliability. I believe it's most analogous to RAID1 systems, which generally issue a read to a single device rather than taking the first of two simultaneous reads to succeed.
Source: not only do I have an R720xd (and two regular R720s), I checked the Intel Xeon E5-2600v2 reference manuals.
@lauriewired, I think the most interesting thing that I learned from this is that memory refresh causes read/write stalls. For some reason I thought refresh was completely asynchronous.
But otherwise, nice work tying all the concepts together. You might want to get some better model trains though.
Very interesting work.
But practically speaking, in a real application - isn’t any performance benefit going to be lost by the reduced cache hit rate caused by having a larger working set? Or are the reads of all-but-one of the replicas non-cached?
Apologies if I am missing something.
Once your cache hit ratio for some data structure drops below 0.1%, I'd rather have 75% less tail latency even if it reduces the hit rate further.
Maybe - but if that’s the case you are likely using the wrong data structure.
Additionally you are going to be memory starving every other thread/process because you are hogging all the memory channels, and making an already bad L3 cache situation worse.
Outside of extremely niche realtime use cases (which would generally fit in L3 cache) I can’t see how this would improve overall throughput, once you take into account other processes running on the same box.
Do you have an example use case?
> Do you have an example use case?
The one that comes to mind is HPC, where you avoid over-allocation of the physical cores. If the process has the whole node to itself for a brief period, inefficient memory access might have a bigger impact than memory starvation.
IBM also has their RAID-like memory for mainframes that might be able to do something similar. This feels like software implemented RAID-1.
This addresses the "short long tail" (known bounded variance due to the multiple physical operations underlying a single logical memory op), but for hard real time applications the "long long tail" of correctable-ECC-error-and-scrub may be the critical case.
Kudos... Wonderful work that I think has real value in certain situations... Thanks for sharing!
My understanding is that this is making a trade off of using more space to get shorter access times. Do I have that right?
OT: Tail Slayer. Not Tails Layer. My brain took longer to parse that than I’d have wanted.
Yeah, it improves mean (but not median) access time by using more memory.
[flagged]
Nope, there isn’t a tradeoff; median latency isn’t affected. I don’t think you understand the code. The p50 is identical between a single read and the hedged strategy.
The clflush is there because the technique targets data that will miss the cache anyway. If your working set fits in L1, you don’t need this.
Also, AWS Graviton instances absolutely do not expose per-channel memory-controller PMU counters. That's why you have to use timing-based channel discovery.
The IBM z-system is neat! But my technique will work on commodity hardware in userspace, and you can sacrifice only half the space if you accept 2-way instead of 8+-way hedging. It's entirely up to you how many channel copies you want to use.
Your reply was quite rude, but I hope this is informative.
I was just trying to reconcile his reply with the charts. Have you tested how this scales down for smaller systems, as one might find on the management side of a network switch?
[flagged]
You were rude for absolutely no reason. You could point out where you think the article comes short and make suggestions on how to improve it. With this approach, you achieved nothing.
Being competent requires being knowledgeable AND getting things done. You might be knowledgeable, but you need to learn how to work with other people.
You were rude. Be nice or don't post.
I really hope you one day re-read these comments and understand just how horrible they are. For absolutely no reason.
So yeah, you will be 'tone-policed' because you're clearly a very rude person.
Did you make a YouTube video explaining memory details to a broad audience?
No, you did not.
So I give your comment an F on all fronts.
The video was about how rowhammer works; the lib was a byproduct.