Valve Hardware Day 2006 - Multithreaded Edition
by Jarred Walton on November 7, 2006 6:00 AM EST - Posted in Trade Shows
Test Setup
Obviously Valve is pretty excited about what can be done with additional processing power, and they have invested a lot of time and resources into building tools that will take advantage of the possibilities. However, Valve is a software developer rather than a hardware review site, and our impression is that most of their systems are typical of any business these days: they are purchased from Dell or some other large OEM, which limits the range of hardware available. That's not to say that Valve hasn't tested AMD hardware, because they have, but once they reached the conclusion that Core 2 Duo/Core 2 Quad would be faster, they probably didn't bother doing a lot of additional testing. We of course are more interested in seeing what these new multiprocessor benchmarks can tell us about AMD and Intel hardware -- past, present, and future -- and we plan on utilizing these tests in future articles. As a brief introduction to these benchmark utilities, however, we thought it would be useful to run them on a few of our current platforms to see how they fare.
In the interest of time, we did not try to keep all of the tested platforms identical in terms of components. Limited testing did show that the processor is definitely the major bottleneck in both benchmarks, with a variance between benchmark runs of less than 5% on all platforms. Besides the processor, the only other area that seems to have any significant impact on benchmark performance is memory bandwidth and timings. We ran both benchmarks three times on each platform, threw out the high and low scores, and reported the remaining (median) score. In many instances, the first run of the particle simulation benchmark was slightly slower than the next two runs, which were usually equal in performance. The variability between runs of the map compilation test was less than 1%, so those results were very consistent.
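As a rough illustration of the scoring procedure described above, here is a minimal sketch. The function names and the sample scores are hypothetical, and "spread" is only one possible reading of the run-to-run variance mentioned in the text.

```python
def reported_score(runs):
    """Return the middle score of exactly three benchmark runs."""
    if len(runs) != 3:
        raise ValueError("expected exactly three runs")
    # Dropping the highest and lowest of three values leaves the median.
    return sorted(runs)[1]


def run_to_run_spread(runs):
    """Gap between the best and worst run, as a fraction of the reported score.

    Assumption: the article does not say how it computed variance; this is
    just one simple way to express it.
    """
    return (max(runs) - min(runs)) / reported_score(runs)


# Hypothetical scores (not measured results): a slower first run, then two equal runs.
example_runs = [61.0, 63.0, 63.0]
print("reported score:", reported_score(example_runs))
print("spread: %.1f%%" % (run_to_run_spread(example_runs) * 100))
```

With three runs, discarding the extremes and taking the median are the same operation, which is why the first, slightly slower run rarely affects the reported number.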
Here are the details of the tested systems.
Athlon 64 3200+ 939 | |
CPU | Athlon 64 3200+ (939) - 2.0GHz 512K; OC: 3200+ @ 10x240 HTT = 2.40GHz |
Motherboard | ASUS A8N-VM CSM - nForce 6150 |
Memory | 2x1GB OCZ OCZ5001024EBPE - DDR-400 2-3-2-7 1T; OC: DDR-480 3-3-2-7 1T |
GPU | X1900 XT |
HDD | Seagate SATA 3.0Gbps 7200.9 250GB 8MB cache 7200 RPM |
Athlon X2 3800+ 939 | |
CPU | Athlon X2 3800+ (939) - 2.0GHz 2x512K; OC: 3800+ @ 10x240 HTT = 2.40GHz |
Motherboard | ASUS A8R32-MVP - ATI Xpress 3200 |
Memory | 2x1GB OCZ OCZ5001024EBPE - DDR-400 2-3-2-7 1T; OC: DDR-480 3-3-2-7 1T |
GPU | X1900 XT |
HDD | Western Digital SATA 3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM |
Athlon X2 3800+ AM2 | |
CPU | Athlon X2 3800+ (AM2) - 2.0GHz 2x512K; OC: 3800+ @ 10x240 HTT = 2.40GHz |
Motherboard | Foxconn C51XEM2AA - nForce 590 SLI |
Memory | 2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC: DDR2-960 4-4-4-12 |
GPU | X1900 XT |
HDD | Western Digital SATA 3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM |
Core 2 Duo E6700 NF570 | |
CPU | Core 2 Duo E6700 - 2.67GHz 4096K; OC: E6700 @ 10x320 FSB = 3.20GHz |
Motherboard | ASUS P5NSLI - nForce 570 SLI for Intel |
Memory | 2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC: DDR2-960 4-4-4-12 |
GPU | X1900 XT |
HDD | Western Digital Raptor 150GB 16MB cache 10000 RPM |
Core 2 Quad QX6700 975X | |
CPU | Core 2 Quad QX6700 - 2.67GHz 2x4096K; OC: QX6700 @ 10x320 FSB = 3.20GHz |
Motherboard | ASUS P5W DH Deluxe - 975X |
Memory | 2x1GB Corsair PC2-8500C5 - DDR2-800 4-4-4-12; OC: DDR2-960 4-4-4-12 |
GPU | X1900 XT |
HDD | 2 x Western Digital Raptor 150GB in RAID 0 |
Pentium D 920 945P | |
CPU | Pentium D 920 - 2.8GHz 2x2048K; OC: 920 @ 14x240 FSB = 3.36GHz |
Motherboard | ASUS P5LD2 Deluxe - 945P |
Memory | 2x1GB Corsair PC2-8500C5 - DDR2-667 4-4-4-12; OC: DDR2-800 4-4-4-12 |
GPU | X1900 XT |
HDD | Western Digital SATA 3.0Gbps SE16 WD2500KS 250GB 16MB cache 7200 RPM |
We did test all of the systems with the same graphics card configuration, just to be consistent, but it really made little to no difference. On the Athlon 64 configuration, for example, we got the same results using the integrated graphics as we got with the X1900. We also tested at different resolutions, and found once again that, on the graphics cards we used, resolution had no impact on the final score. 640x480 generated the same results as 1920x1200, even when enabling all of the eye candy at the high resolution and disabling everything at the low resolution. To be consistent, all of the benchmarking was done at the default 1024x768 0xAA/8xAF. We tried to stay consistent on the memory that we used for each memory type (DDR or DDR2). The Pentium D test system had issues and would not run the particle simulation benchmark. Finally, to give a quick look at performance scaling, we overclocked all of the tested systems by 20%.
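The 20% figure can be checked against the system tables with a few lines of arithmetic. This is only a sketch: the dictionary simply restates the stock clock, overclocked multiplier, and overclocked bus speed listed for each CPU, and the function name is made up for illustration.

```python
# name: (stock clock in GHz, OC multiplier, OC bus clock in MHz), as listed in the tables
SYSTEMS = {
    "Athlon 64 3200+ (939)": (2.0, 10, 240),
    "Athlon X2 3800+ (939)": (2.0, 10, 240),
    "Athlon X2 3800+ (AM2)": (2.0, 10, 240),
    "Core 2 Duo E6700": (2.67, 10, 320),
    "Core 2 Quad QX6700": (2.67, 10, 320),
    "Pentium D 920": (2.8, 14, 240),
}


def overclock_gain(stock_ghz, multiplier, bus_mhz):
    """Return (overclocked clock in GHz, percent gain over the stock clock)."""
    oc_ghz = multiplier * bus_mhz / 1000.0
    gain_percent = (oc_ghz / stock_ghz - 1.0) * 100.0
    return oc_ghz, gain_percent


for name, (stock, mult, bus) in SYSTEMS.items():
    oc, gain = overclock_gain(stock, mult, bus)
    print(f"{name}: {stock:.2f} GHz -> {oc:.2f} GHz (+{gain:.1f}%)")
```

The Core 2 parts come out a hair under 20% only because the table rounds their stock clock to 2.67GHz.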
For now we are merely providing a short look at what Valve has been working on, along with some preliminary benchmarks. We intend to use these benchmarks in future articles as well, where we will look at additional system configurations. Note that performance differences of one or two points should not be taken as significant in the particle simulation test, as the granularity of the reported scores is relatively coarse.
55 Comments
JarredWalton - Tuesday, November 7, 2006 - link
What's with the octal posting? Too many CPU cores running? ;)
I deleted the other 7 identical posts for you. Careful with that Post Comment button!
saratoga - Tuesday, November 7, 2006 - link
Server kept timing out when I hit post, so I assumed it wasn't committing :)
exdeath - Tuesday, November 7, 2006 - link
You can see my recent comments on this topic here: http://www.dailytech.com/Article.aspx?newsid=4847&...
In my experience, relying on atomic CPU swap operations isn't enough, as each one only works on a single value (a 32-bit word, for example).
While you are waiting to lock and swap a 32-bit Y value, someone else who has just finished reading the newly written X value can beat you to the lock and read the old Y value before you've updated it. Clearly whole data structures need to be coherent, not just small atomic values.
Also, it's unusual to modify an object's observable state mid-frame. Even if you avoided the above example so that the X,Y pair was always updated together, you'd still have different objects interpreting the position as a whole of that object in different places at different times. State data must be held constant for all observers throughout the context of a single frame.
exdeath - Tuesday, November 7, 2006 - link
Even if you avoided the above example so that the X,Y pair was always updated together, you'd still have different objects interpreting the position as a whole of that object in different places at different times in the same frame.
JarredWalton - Tuesday, November 7, 2006 - link
I'm assuming your comment is in regards to the PS3/Cell comments on the last page? It sort of sounds like you're arguing about the way Valve has chosen to go about doing things, or that you disagree with some of the opinions they've expressed concerning other hardware. We have only tried to provide a very high-level overview of what Valve is doing, and we hardly touched the low-level details -- Valve didn't spend a lot of time on specific implementation issues either. All they did was provide us with some information about what they are doing, and a bit of opinion on what they think of the rest of the hardware options.
Preventing anything else from doing write operations to the world state during an entire frame in order to keep things coherent is a big problem with multithreading. Apparently Valve has found a way around that, or at least found a way to do it more efficiently, using lock-free and wait-free algorithms. No, I can't honestly say I really understand what those algorithms do, but if they say it worked better for their code base I'm willing to trust them.
As far as the PS3/Cell processor goes, Valve did say that they have various thoughts on how to properly utilize the architecture. It is simply going to be more difficult to do relative to Xbox 360 and PC. It's not impossible, and companies are definitely going to tackle this problem. As far as how they tackle it, I'm more than a bit rusty on my coding background, and other than high-level details I'm not too concerned how they improve their multithreading code on any specific platform, just that they do it.
exdeath - Tuesday, November 7, 2006 - link
The other issue is OS support.
Compiler add-ons or third-party APIs can only serve to hide the details or make things look cleaner. But no matter what, the final barrier between the application and the OS is the set of API calls provided by the OS threading model. Thus no third-party implementation can be better than the OS thread model itself in terms of performance and overhead. All those can do is make it easier to use at the top by handling the OS details.
I imagine threading APIs on popular OSes will start to evolve, just like graphics APIs have, once everyone gets on the multi-core bandwagon and starts to get a feel for what's available in the OS APIs and what they'd rather have. So far, Vista's thread pool API looks good, but I still don't see an API to determine such basic things as checking if the work queue is empty and all threads are idle, etc.
Currently I find it's easier to implement my own thread pool manager, which does atomic increments and decrements on a 'task count' variable as tasks are entered into or completed from the queue. Checking if all tasks are done involves testing that task count against 0 and signaling an event flag that wakes any management threads sleeping until all of their work tasks complete. It also allows for more flexibility in 'before and after' housekeeping as work threads move from task to task, and that kind of control isn't offered in XP's built-in thread pool API, nor in Vista's as far as I can tell.
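A minimal sketch of the task-count idea described in this comment, written in Python for brevity. Everything here (the class name `CountingPool`, its methods, the sample task) is hypothetical, and a lock stands in for the hardware atomic increment/decrement a native implementation would use.

```python
import queue
import threading


class CountingPool:
    """Worker pool that tracks outstanding tasks and signals when none remain."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._count = 0                      # tasks queued or still running
        self._count_lock = threading.Lock()  # stands in for atomic inc/dec
        self._idle = threading.Event()       # set while the task count is zero
        self._idle.set()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, fn, *args):
        with self._count_lock:
            self._count += 1                 # "atomic increment" on entry
            self._idle.clear()
        self._tasks.put((fn, args))

    def _run(self):
        while True:
            item = self._tasks.get()
            if item is None:                 # shutdown marker
                return
            fn, args = item
            try:
                fn(*args)
            finally:
                with self._count_lock:
                    self._count -= 1         # "atomic decrement" on completion
                    if self._count == 0:
                        self._idle.set()     # wake any sleeping manager thread

    def wait_idle(self, timeout=None):
        """Block until every submitted task has completed."""
        return self._idle.wait(timeout)

    def shutdown(self):
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


results = []
results_lock = threading.Lock()


def square(n):
    with results_lock:
        results.append(n * n)


pool = CountingPool(4)
for n in range(8):
    pool.submit(square, n)
pool.wait_idle()
pool.shutdown()
print(sorted(results))
```

Counting tasks from submission rather than from the moment a worker picks them up is what lets a single test against zero mean "queue empty and all threads idle".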
exdeath - Tuesday, November 7, 2006 - link
Not arguing their methods; a lot of things in this article are in line with my own opinions on multithreading, and are pretty much the best way to go about it. I'm just pointing out that atomic lock/swap operations in hardware are very primitive and typically operate only on CPU word-size values, not entire data structures. Thus it's possible that, between two atomic operations on two variables on one core, another core can get an old version of one variable and a new version of another.
core1: compute X
core2: ...
core1: lock/write X
core2: read X, get newly written version
core1: compute Y
core2: read Y, get old Y before the update
core1: lock/write Y
core2: ...
The task on core2 is working with inconsistent data: the new X and the old Y. If the task on core2 only uses the data as input (e.g., an AI tracking another AI entity), it has the wrong position and won't know about it, since it has no need to perform its own lock/write (so it never gets the exception that says the value changed). Even if it did, it would have to throw out all of its work and redo it with the new Y, and then the value could possibly change again.
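The interleaving in that trace can be replayed step by step. The sketch below is single-threaded on purpose so that the ordering is fixed; the class and function names are hypothetical, and publishing the pair as one value is shown only as a contrast (it loosely corresponds to swapping a single pointer to an immutable structure in native code).

```python
class Entity:
    """Position stored as two separately written fields, as in the trace."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def replay_trace(entity, new_x, new_y):
    """Replay the core1/core2 ordering and return what core2 observed."""
    entity.x = new_x       # core1: lock/write X
    seen_x = entity.x      # core2: read X, gets the newly written version
    seen_y = entity.y      # core2: read Y, gets the old Y
    entity.y = new_y       # core1: lock/write Y
    return seen_x, seen_y


class PairEntity:
    """Position published as one immutable (x, y) value."""

    def __init__(self, x, y):
        self.pos = (x, y)


def replay_pair(entity, new_x, new_y):
    """A reader sees either the whole old pair or the whole new pair."""
    before = entity.pos            # core2 reading before the publish
    entity.pos = (new_x, new_y)    # core1 publishes both values in one step
    after = entity.pos             # core2 reading after the publish
    return before, after


print(replay_trace(Entity(0, 0), 10, 10))       # (10, 0): new X with old Y
print(replay_pair(PairEntity(0, 0), 10, 10))    # ((0, 0), (10, 10))
```

Publishing the pair together removes the torn read, but it does not address the separate point that different observers may still see different positions within the same frame.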
Looping and retrying seems wasteful. And I’m thinking the only way to catch such a hardware error on a failed lock/write update is via exceptions, and handling a thrown exception on an attempt to write a single 32 bit value is very wasteful of CPU cycles.
In my own research I have had excellent results with double buffering any modified data. Each threaded task only updates its hidden internal working state for frame n+1 while all reads to the object are read from its external current state for frame n. At the end of the frame when all parallel tasks have completed, the current/working states are swapped, and the work queue is filled again to start the next frame.
This ensures that throughout the entire computation of frame n+1, the current frame n state will be available to all threads, and guaranteed to not be modified through the duration of current frame. So basically all threads can read anything they want and modify their own data. On PC/360 the time to swap everything is basically nothing; you just swap a few pointers, or a single pointer to an array/structure of current/working data for the frame.
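Here is a minimal sketch of that double-buffering scheme, assuming a toy world where each entity's next position depends on another entity's current position. The names (`World`, `chase_next`, `step`) are hypothetical, and Python's thread pool merely stands in for the work queue.

```python
from concurrent.futures import ThreadPoolExecutor


class World:
    """Holds the read-only state for frame n and the working state for frame n+1."""

    def __init__(self, positions):
        self.current = list(positions)   # frame n: only read during the frame
        self.working = list(positions)   # frame n+1: each task writes its own slot

    def step(self, executor, update):
        def task(i):
            # Read anything from frame n, write only this entity's slot.
            self.working[i] = update(i, self.current)

        # Consuming the map waits until every parallel task has completed.
        list(executor.map(task, range(len(self.current))))
        # End of frame: swap the two buffers (a reference swap, no copying).
        self.current, self.working = self.working, self.current


def chase_next(i, current):
    """Each entity jumps to where its neighbour was in frame n."""
    return current[(i + 1) % len(current)]


world = World([0, 10, 20])
with ThreadPoolExecutor(max_workers=3) as executor:
    world.step(executor, chase_next)
    print(world.current)   # [10, 20, 0]
    world.step(executor, chase_next)
    print(world.current)   # [20, 0, 10]
```

Updating a single list in place would make the result depend on task order (entity 2 would see entity 0's already-updated position), which is exactly the mid-frame inconsistency the two buffers avoid.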
On the PS3 some data copying and moving will be required, but this is mandatory due to design anyway and assisted by an extremely smart and powerful DMAC.
One place to be critical about is message passing between objects since it requires posting (writing) data to be picked up by another object. But the time to lock/post/unlock a queue is negligible compared to the time it takes to process the results leading up to the creation of the message. This is similar to the D3D notion of doing as much as you can before you lock and only do the minimal work needed inside the lock and unlock as quickly as possible.
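The lock/post/unlock pattern for messages might look like the following sketch. The `Mailbox` class, `expensive_work`, and the message format are all hypothetical; the point is only that the costly computation happens outside the lock and the lock is held just long enough to append.

```python
import threading
from collections import deque


class Mailbox:
    """Message queue whose lock is held only for the append or the hand-off."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = deque()

    def post(self, message):
        with self._lock:                 # lock / post / unlock
            self._messages.append(message)

    def drain(self):
        with self._lock:                 # swap out the queue, then release
            pending, self._messages = self._messages, deque()
        return list(pending)


def expensive_work(n):
    """Stand-in for the processing that leads up to creating a message."""
    return sum(i * i for i in range(n))


def producer(mailbox, sender_id, n):
    result = expensive_work(n)           # all heavy work done before locking
    mailbox.post((sender_id, result))    # minimal work inside the lock


mailbox = Mailbox()
threads = [
    threading.Thread(target=producer, args=(mailbox, i, 1000)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

received = sorted(mailbox.drain())
print(len(received), "messages received")
```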
GhandiInstinct - Tuesday, November 7, 2006 - link
Jarred Walton,
My question: Will Valve's games in 2007 be released with specifications such as "For minimum requirements you need a dual-core CPU, for maximum results you need a quad-core," or anything of that nature? I'm confused about what Valve is targeting (dual, quad, both, neither, or something different) and what I should get to best utilize their games and multi-core software in general.
Thanks.
JarredWalton - Tuesday, November 7, 2006 - link
Episode Two should come out sometime in 2007, and before that happens you will get the multithreading patch affecting previous Source engine titles. Right now, it doesn't sound like anything released in the next year or so from Valve is going to require dual cores. That's what I was trying to get across on the conclusion page, where I mentioned that they are targeting an "equivalent experience" regardless of what sort of processor you are running.
So just like you could turn down the level of detail in Half-Life 2 and run it on DX8 or even DX7 hardware, the Source engine should be able to accommodate single-core processors all the way up through N-core processors. The engine will spawn as many threads as you have processor cores, with one main thread serving as the controller and N - 1 helper threads. Xbox 360, for example, would have 5 helper threads plus the master thread, because it has three cores, each capable of executing two threads simultaneously.
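The thread-count rule described here is simple arithmetic: one controlling thread plus N - 1 helpers for N hardware threads. The sketch below is an illustration only; `thread_layout` is a made-up name, and using `os.cpu_count()` (which reports logical CPUs) as N is an assumption, not a description of how the Source engine detects hardware.

```python
import os


def thread_layout(hardware_threads):
    """One main (controller) thread plus N - 1 helpers for N hardware threads."""
    if hardware_threads < 1:
        raise ValueError("need at least one hardware thread")
    return {"main": 1, "helpers": hardware_threads - 1}


# Xbox 360 example from the text: 3 cores x 2 threads each = 6 hardware threads.
print(thread_layout(3 * 2))               # {'main': 1, 'helpers': 5}

# A single-core CPU gets no helpers, so everything runs on the main thread.
print(thread_layout(1))                   # {'main': 1, 'helpers': 0}

# On the machine running this sketch (assumption: logical CPU count is used as N).
print(thread_layout(os.cpu_count() or 1))
```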
Patrese - Tuesday, November 7, 2006 - link
Great article, good to see dual-quad cores being used for something in games. By the way, the kitchen examples made me hungry... :)