
Intel’s new Capabilities Assessment Tools


They’re benchmarks, designed to test a PC’s gaming or “Digital Home” capabilities. It’s probably the first time that a processor company is trying to create benchmarks for a whole PC rather than specific components. It’s also the first time in the PC industry that real “User Experience” based on perceptual research is embedded in benchmarks. We present a first look at beta versions of these very interesting new performance assessment tools.

Sept 13~Oct 24, 2005 by Devon Cooke with Mike Chin

The gigahertz race has ended, and Intel is heading in a new direction. Instead
of focusing its marketing efforts purely on processor speed, Intel is taking
advantage of its presence in all sectors of the tech industry — and perhaps
the success of Centrino with its integrated suite of components — to sell its products
together as a platform rather than individually.

Mike Chin, Editor of SPCR, recently reported on this phenomenon in his article,
Paradigm Shift at Intel: IDF Fall 2005. Among the elements of the paradigm shift he mentioned
was a new approach to platform testing that moves away from synthetic benchmarks
and timedemos. In its place, Intel recommends using benchmarks based on how
a platform will actually be used. Two tools have been announced,
one for evaluating gaming performance and one that assesses a system’s capabilities
for use as a home theater PC.

Both of these tools are of interest in their own right but of limited use to
SPCR; we generally focus on noise and heat rather than pure performance. What
is relevant to SPCR is the thinking behind these tools: Intel is attempting
to “Objectify the Subjective”, to put down in formal, verifiable terms
how a particular system will behave in terms of the end user’s experience. As
Mike pointed out, this echoes SPCR’s raison d’être: “The essence
of our interest is the enhancement of the computing experience… You could
call it ergonomics in the broadest sense.”

What follows is a preliminary
evaluation of these tools, based on some firsthand experience, Intel’s
press presentation about these tools at IDF in August, and a series of exchanges between Intel’s CAT development team members and SPCR.

METHODOLOGY

There is no shortage of benchmarks on the web. Most performance-centered web
sites run at least half a dozen benchmarks in a typical hardware review. So,
why does Intel think it’s worth it to design its own tools? Current benchmarks
tend to be of two varieties: Synthetic benchmarks and timedemos. Both of these
do a good job of testing and compiling the raw technical capabilities of a piece
of hardware. Synthetic benchmarks are excellent for testing things like data
throughput and memory latency, while timedemos do a reasonable job of testing
complex loads that combine the various performance aspects of a piece of hardware.

The problem with these benchmarks is that they are difficult to interpret:
What is the actual, subjective effect of a 500 MB/s increase in memory bandwidth,
for example? It may be a 25% increase compared to a 2 GB/s baseline, but what
does this mean in terms of actual user experience? Will a user actually be 25%
more satisfied with it? Probably not; as long as a system does
what is needed, the amount of extra resources is almost irrelevant.
On the other hand, if the 25% increase is enough to suddenly make a new game
playable, the user may be 100% more satisfied. After all, he can do something
he couldn’t before!

The problem is similar with timedemos, which typically report average frames
per second. Often the results are 100 FPS or higher — high enough that some
frames may never actually be displayed on the monitor, which typically refreshes
60-85 times per second. So, once again, users must assume that a higher average
FPS will translate into a better overall experience, but it is almost impossible
to tell what frame rate they need for their purposes.

The goal of Intel’s new tools is to integrate the hard data from a traditional
benchmark with a model that predicts the user’s actual experience. To do this,
the tools rely heavily on research about how (and when) raw performance affects
the usability of a system.

While Intel’s tools measure the same kinds of things that
other benchmarks do — latency, bandwidth, FPS, etc. — they do not
simply report the result. Instead, they make an effort to
interpret the results in a way that is meaningful to the end user. Thus,
instead of saying, “System X ran the demo at 24 FPS and System Y ran it
at 100 FPS”, Intel’s tools would say “System X would be rated ‘Poor’
by an average user, but System Y would be rated ‘Excellent'”.
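
To make the distinction concrete, here is a trivial Python sketch of what “interpreting” a number rather than reporting it might look like. The rating boundaries are invented for illustration only, and (as the research described below shows) average FPS by itself turns out to be a poor basis for such a mapping; this is emphatically not Intel’s model.

# Illustrative only: the cutoffs below are invented for this example,
# not the thresholds used by Intel's G-CAT.
def describe_experience(avg_fps: float) -> str:
    """Translate a raw benchmark number into a user-facing rating."""
    if avg_fps < 25:
        return "Poor"
    elif avg_fps < 35:
        return "Fair"
    elif avg_fps < 45:
        return "Good"
    return "Excellent"

print(describe_experience(24))   # "Poor" -- the raw 24 FPS figure alone says little
print(describe_experience(100))  # "Excellent"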

A significant amount of
research has been done to correlate the raw data of traditional
benchmarks to actual user experience. What is
new is not how measurement is done, but how it is interpreted.

Intel wants to call these “Capabilities
Assessment Tools” rather than benchmarks. That said, the
definition of “benchmark” at dictionary.com
fits Intel’s tools
as well as any other benchmark: “A standard by which something can be measured
or judged”. Our opinion is that Intel’s tools are benchmarks, but we’ll call them by Intel’s names:

  • Gaming Capabilities Assessment Tool (G-CAT)
  • Digital Home Capabilities Assessment Tool (DH-CAT)

G-CAT

With any benchmark, it is a good idea to ask what is being measured. If you
can’t answer this question, then you are unlikely to understand what its
results mean. There are actually two questions here:

  • What hardware components are being measured?
  • What kind of performance is being measured?

These are related, as different things can be measured on different
kinds of hardware.

Traditionally, the approach has been to test individual components separately.
A benchmark that tests the VGA card attempts to isolate it from
the rest of the system, for example, by disabling CPU-intensive tasks, such
as physics and AI. Then, individual tests are run for each type of performance:
Memory bandwidth, clock speed, latency, rendering time, etc. Finally, the results
are compared to similar tests from competing components, and some judgment
is made between them.

This is a useful approach when considering a single piece of hardware. Assuming
that no other components have the potential to affect the tests (i.e. there
are no bottlenecks that come from other components), such tests can help judge
between two similar products, at least as far as performance is concerned.

What these tests cannot do is create a link between the performance numbers and
their usefulness in real-world applications; they cannot predict whether the
gaming experience for a specific game with specific settings will be different
when a GeForce 6800GT is used instead of a 6600GT. In fact, such a link is impossible:
Games are not played on a VGA card in isolation; the other system components
also affect the gaming experience.

This is why it is important to ask what kind of performance is being measured.
Most benchmarks measure a specific aspect of performance on a specific piece
of hardware. But, most users aren’t interested in the raw performance numbers;
what they want to know is whether that hardware will make a noticeable difference
in their system.

So, why not measure what is noticeable? This is the question that motivated
the research behind Intel’s Gaming Capabilities Assessment Tool.

The G-CAT is unusual in two respects:

  • Results are based on three-minute sessions of actual gameplay.
  • Results are given not in average frames per second (although the data is
    available), but on a five-point “user satisfaction” scale.

The G-CAT uses software that already exists: FRAPS,
a video capture / benchmark application that collects statistical data about
frame rate during a gaming session. Test sessions are three minutes long and
use specific game settings, although an experimental mode that allows other
game settings to be used is also available. Once the three minutes are up, the
statistics from FRAPS are loaded into the G-CAT, which then transforms them
into a gaming experience rating based on research about how actual gamers rate
their experience.

THE RESEARCH BEHIND THE G-CAT

Intel contracted the research firm Gartner to conduct a study on how gamers respond to changes
in frame rate. In December 2004, Gartner conducted a large-scale test at the Cyberathlete
Professional League event. Approximately 175 people participated in the test, which
invited the participants to play three different games, Doom 3, Half-Life 2,
and Unreal Tournament 2004, on five different systems and then
rate their gaming experience on a five-point scale. FRAPS was used to collect
statistics about the frame rate during each gaming session.

All of the usual statistical safeguards were in place: Users were not told
in advance what kind of machine they were using, the sequence in which the games
and systems were tested was varied, and the five-point scale was deliberately
left vague so that participants were not shepherded to a particular result.
Test cases where the user died in the game were also excluded to
ensure that they did not affect the results.

Once the test was complete, the users’ opinions were plotted against the average
frame rate to see if a relationship between the two could be found. Surprisingly,
no model could be found that could predict how the users would rate their experience
on the basis of average frame rate. To quote Aashish Rao, who presented the
tool at a special presentation for the press: “Average FPS cannot be used to
predict the gaming experience on a PC”. So, another measurement needed to be
found that could predict how users would react.


User satisfaction depends on more than just average FPS.

PREDICTING USER RESPONSES

What Intel came up with is still related to frame rate, but it is no longer
the average. Instead, two separate mathematical models are used to predict
how actual users would react: The Threshold Model and the Bayesian Model.


Intel summarizes the Pros and Cons of the two models in this table.

The Threshold Model takes into account the fact that frame rate is irrelevant
as long as it is imperceptible to humans. In each of the three games tested,
frame rate had no effect on how the users rated their experience so long as
it was above 40-45 fps. Below this threshold, frame rate did
affect the user experience. So, instead of using the average
frame rate to predict user experience, the Threshold Model uses the number
of frames below the threshold (in a three minute period). The higher the number
of frames below the threshold, the lower the users would rate their experience.


Average FPS does not predict how users experience the game, especially above 60 FPS.
Note that this is how high the average frame rate needs to be — the minimum is around 40-45 FPS.


Graphing the number of FPS against time makes it easy to see when the
frame rate drops below the threshold.
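
To make the Threshold Model’s core idea concrete, the sketch below (in Python) counts the frames in a session that fall below the threshold. The 40-45 FPS range comes from Intel’s research; the log format (one per-frame render time in milliseconds per line) is our own assumption for the example, and the final step of mapping the count onto a five-point rating is Intel’s statistical model, which we cannot reproduce here.

# A rough sketch of the Threshold Model's central measurement. Only the
# 40 FPS threshold comes from the research described above; the input
# format is an assumption made for this example.
THRESHOLD_FPS = 40.0                      # low end of the 40-45 FPS range
CUTOFF_MS = 1000.0 / THRESHOLD_FPS        # frames slower than this fall below it

def frames_below_threshold(frame_times_ms):
    """Count frames rendered more slowly than the perceptibility threshold."""
    return sum(1 for t in frame_times_ms if t > CUTOFF_MS)

# Example: a three-minute gameplay session logged as one frame time per line.
with open("session_frametimes.txt") as f:
    times = [float(line) for line in f if line.strip()]

slow = frames_below_threshold(times)
print(f"{slow} of {len(times)} frames fell below {THRESHOLD_FPS:.0f} FPS")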

This model turned out to correspond well to the data collected in the Gartner
study, but it still takes only a single factor into account. In hopes of finding
a more accurate model, the Bayesian Model was developed. This model takes
the variability of the frame rate into account as well as the speed, but its
mathematical complexity makes it difficult to understand exactly how it works.
Although it works well for most scenarios, there are still a few cases where
its error of prediction can be quite high.

The end result of running the tool is a frame rate graph from FRAPS, plus two
“Gaming Experience” ratings, one for each model. In keeping with the
statistical methods behind the tool, the “score” is not a single number but a confidence
interval. This shows the margin of error and allows different results
to be properly compared.


Both models produced statistically identical results in this test, although
the confidence interval of the Bayesian Model is much smaller (more reliable).
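
The value of reporting an interval rather than a bare score is easy to illustrate. In the Python snippet below (the numbers are invented, and the overlap rule is a generic statistical convention, not Intel’s specific machinery), two ratings are treated as equivalent unless their confidence intervals fail to overlap.

def intervals_overlap(a, b):
    """a, b: (low, high) bounds of a rating's confidence interval."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical results for one test run:
threshold_rating = (3.1, 3.9)   # Threshold Model rating, wider interval
bayesian_rating = (3.5, 3.7)    # Bayesian Model rating, tighter interval

if intervals_overlap(threshold_rating, bayesian_rating):
    print("Statistically identical: any difference may just be noise.")
else:
    print("A genuine difference in predicted gaming experience.")

The same overlap test is essentially what the repeatability question raised later in this article comes down to: do repeated runs on the same system produce overlapping intervals?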

Neither model is perfect. However, both reflect the actual user’s
experience, not just the average frame rate, which turns out to be a poor
indicator of the gaming experience.

DIGITAL HOME CAPABILITIES ASSESSMENT TOOL

The purpose of the DH-CAT is to evaluate the performance of a digital media system. Although
almost any modern system is powerful enough to handle simple DVD playback, the
recording and playing of HDTV footage is more demanding.

The perceptual science behind the DH-CAT isn’t based on market research, per se; it is based on the opinions of many end users (thousands in some cases). The DH-CAT incorporates a video quality assessment tool developed by Psytechnics, a leading firm in this area. Intel’s User-Centered Design group in Oregon was also instrumental in providing video quality assessment tools for other usage scenarios (playback/streaming).

DH-CAT evaluates a system based on which tasks
it can perform with adequate performance. Because the tool is designed to test
a system’s suitability for use as a Media Center / Home Theater PC, it is
obvious what tasks it must be capable of: Playing, recording, and streaming
various audio and video formats. The exact capabilities that it tests are shown
below in a slide from Intel’s presentation about the tool.


Different tasks are tested for capability in different ways.

Each of these tasks is tested in a different way. Some are straightforward;
either the system can handle the task, or it can’t. Others are based on video
quality; the system must maintain a certain threshold of video
quality to pass the test. Finally, some tasks are based on response time,
meaning that the system must complete the task in a certain amount
of time.

These core tasks are grouped into three standard
levels of capability:

  • Basic (level 1)
  • HD (level 2)
  • Connected (level 3)

Most systems should be able to handle the Basic level, which
requires playing, recording, and transcoding standard definition TV and DVD
content. The next level is HD capability, which requires the same tasks to be
done using HDTV footage. Level three adds the ability to stream media as a DLNA
compliant stream — guaranteeing compatibility with other DLNA compliant devices.


Three basic levels of capability are tested.

The way that core tasks are divided into the three levels is complex. Each level of capability
contains a number of mandatory and optional “scenarios”, which themselves contain
one or more core capabilities. The most difficult capability tests involve multitasking:
Two, three, four or more core capabilities are tested simultaneously, and if
any of them cannot be completed to a satisfactory level of quality, the system
is considered incapable of performing that scenario. A system must be able to
complete all mandatory scenarios to be considered capable of a given level.

In addition, optional “extra credit” scenarios not considered
in the capability levels are used to differentiate between systems that are
otherwise equally capable. These scenarios may be mandatory for a higher level,
or simply an unusual (or especially system-intensive) pattern of core tasks.
Extra credit scenarios contribute to the “Overall Capability Level Score”. A more detailed table
of how well the system performed each scenario can also be viewed, which makes
it possible to tell at a glance exactly which scenarios the system can — and
can’t — handle.


The hierarchy of capabilities: Individual tasks (“usage primitives”) are
combined into scenarios,
which must all be completed for compliance with a particular level of capability.
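
The pass/fail logic of this hierarchy is simple enough to sketch in Python. The scenario names and results below are hypothetical; the real scenario definitions, quality targets, and scoring weights belong to Intel.

from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    mandatory: bool
    # primitive task name -> did it meet its quality/response-time target?
    results: dict = field(default_factory=dict)

    def passed(self) -> bool:
        # A multitasking scenario fails if any one of its primitives fails.
        return all(self.results.values())

def level_passed(scenarios) -> bool:
    return all(s.passed() for s in scenarios if s.mandatory)

def extra_credit(scenarios) -> int:
    return sum(1 for s in scenarios if not s.mandatory and s.passed())

# Hypothetical scenarios loosely in the spirit of the Basic level:
basic = [
    Scenario("Record SD TV while playing back SD TV", True,
             {"SD record": True, "SD playback": True}),
    Scenario("Transcode DVD content", True, {"Transcode": True}),
    Scenario("Play SD TV while transcoding two streams", False,
             {"SD playback": True, "Transcode x2": False}),
]

print("Basic level:", "Pass" if level_passed(basic) else "Fail")
print("Extra credit scenarios passed:", extra_credit(basic))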

THE RESEARCH BEHIND THE DH-CAT

As with the G-CAT, the DH-CAT uses the opinions of real viewers to determine
what level of performance is acceptable. A number of factors are considered relevant to the viewing
experience: Dropped frames for playback, frame delay for streaming video, and
image quality for recording. Each of these factors was researched separately. The results are
summarized below:

Dropped Frames

Each test subject was asked to rate their viewing experience
on a five-point scale. All factors except the number of dropped frames were
held constant so that changes in the ratings could be attributed to the number
of dropped frames. All test subjects were shown the same 24 second video clip
under the same viewing conditions. At the end of it all, the results were
compiled and, on average, the acceptable threshold (rated “Fair” or above)
was found to be 88 dropped frames in a 720 frame video, or about 12%.


A maximum acceptable threshold for the number of dropped frames determines
what “acceptable” video quality is.
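
Expressed as code, the criterion is nothing more than a ratio check. Only the 88-in-720 figure comes from the study; the little wrapper around it is ours.

# The dropped-frame criterion described above, as a simple check.
MAX_DROPPED_FRACTION = 88 / 720            # roughly 12%

def playback_acceptable(dropped_frames: int, total_frames: int) -> bool:
    return dropped_frames / total_frames <= MAX_DROPPED_FRACTION

print(playback_acceptable(88, 720))        # True  -- right at the limit
print(playback_acceptable(120, 720))       # False -- about 17% dropped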

Frame Delay

A similar approach was used to determine acceptable
frame delay for streaming video. Viewers were shown a 24 second clip and
asked to rate the quality on a five point scale. This time, the independent
variable was the difference between actual playback time and theoretical playback
time. The threshold is around
a 12% deviation: Video quality was deemed unacceptable when the 24 second
video was delayed by more than 2.92 seconds.


Video streaming quality is determined by comparing theoretical vs. actual
playing time.
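
The streaming check is the same kind of ratio, applied to elapsed playback time rather than frame counts; again, only the 2.92 second / 24 second figure comes from the research.

# The streaming criterion: more than ~12% of delay is unacceptable.
MAX_DELAY_FRACTION = 2.92 / 24.0           # roughly 12%

def streaming_acceptable(actual_seconds: float, nominal_seconds: float) -> bool:
    delay = actual_seconds - nominal_seconds
    return delay / nominal_seconds <= MAX_DELAY_FRACTION

print(streaming_acceptable(26.0, 24.0))    # True  -- about 8% delay
print(streaming_acceptable(27.5, 24.0))    # False -- about 15% delay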

Image Quality

For recording or capturing video, image quality was identified as the most
important factor in user experience. But, how do you measure image quality?
Unlike dropped frames or frame delay, image quality cannot be easily quantified,
which makes it very difficult to measure.

Fortunately, the broadcast industry has studied this subject for years. Instead of duplicating this research, Intel adopted an existing standard: ITU
Standard J.144. The underlying research and perceptual models were developed
by a British company called Psytechnics,
and Intel has licensed their software for use in the DH-CAT. The software
works in much the same way as the G-CAT: It is a mathematical model
based on a database of viewers’ responses.

Psytechnics’ software rates image quality on a five-point “Mean Opinion Score”. The Mean Opinion Score is supposed to reflect the amount of degradation from
an original reference source, not an “absolute” measurement of the
video quality. This removes the possibility that the image content might affect
the final score. A perfect quality score would be given to a video that is
identical to the original source. Any deviation from this source is assumed
to be unwanted and thus reduces the final score.

Four basic factors are taken into account:

  1. Spatial Frequency Analysis: A mathematical comparison of the degraded
    video with the original source.
  2. Color Analysis: Measures how well the degraded video maintains
    the color information in the original source.
  3. Texture Analysis: Measures how much detail is maintained.
  4. Contour Analysis: Measures how well sharp edges are displayed.
    Also takes into account video “blocking”, where rapid movement appears as
    a mosaic of squares.


A complex mathematical model maps measurable changes in image quality
to the opinions of actual viewers.
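
Psytechnics’ model is proprietary, so we can only gesture at how a full-reference metric works. The sketch below uses per-frame PSNR, a far cruder measure than ITU J.144, purely to show the principle of scoring degradation relative to the original source rather than judging the image in the abstract. Frames are assumed to be 8-bit NumPy arrays of identical dimensions.

import numpy as np

def psnr(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Peak signal-to-noise ratio of one degraded frame vs. its reference."""
    mse = np.mean((reference.astype(float) - degraded.astype(float)) ** 2)
    if mse == 0:
        return float("inf")                # identical to the source
    return 10.0 * np.log10(255.0 ** 2 / mse)

def clip_quality(reference_frames, degraded_frames) -> float:
    """Average the per-frame scores over the whole clip."""
    scores = [psnr(r, d) for r, d in zip(reference_frames, degraded_frames)]
    return float(np.mean(scores))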

Because image quality is evaluated using Psytechnics’ methods, it is the
only part of the tool that does not belong to Intel. Intel will be making pieces of the DH-CAT’s code available to developers, who will be able to examine and modify it in order to make improvements, but the tools won’t be “open source” in the GPL sense of that term: the terms under which the source code is obtained will not be the GPL.

EVALUATING THE INTEL CATs

Earlier, I mentioned that understanding the results of a benchmark requires
answering two questions:

  • What hardware components are being measured?
  • What kind of performance is being measured?

Now that both of Intel’s tools have been described, it is easy enough to
answer these questions. The tools are similar enough that the answer is the
same for both.

The answer to the first question is that they measure a whole system, not a
single component. For measuring hardware in isolation, timedemos and synthetic
benchmarks are still just as valid as they always were. However, instead of
trying to filter out the effects of other components, the tools measure all
the components together as a whole. Let me repeat that in different words: the
G-CAT and DH-CAT are designed for testing PC systems, not individual components.

So, what’s the use of testing whole systems if users buy only single components
at a time? I can think of two uses:

  1. The vast majority of PC users do not build their own systems — they buy
    them pre-configured from the major OEMs. DIYers make up only 1~2% of the PC marketplace. The CATs are designed to provide
    useful results to a wide market segment, not just the enthusiasts.
  2. Testing a whole system while changing only a single component
    is a good method for judging the effect of that component on a system. This
    shows the actual effect of the component on the system rather than its isolated
    performance, which may be limited by the rest of the system.

Intel’s new CATs — benchmarks — will probably be adopted by the mainstream tech media (such as the glossy computer magazines in drug stores) and by OEMs as part of their marketing. They will be useful for consumers who have no time or interest in doing extensive research before making a purchase. Certainly, something needs to replace Intel’s long focus on CPU clock speed as a simple measure of PC performance.

The second use is of direct interest to a hardware review web site like SPCR.

In the past, the approach to evaluating performance has been to test and judge
the hardware in isolation. When testing a graphics card, for example, the rest
of the test bench was usually as fast a machine as possible to reduce the possibility
of it affecting the benchmark. But this approach is less valid for Intel’s tools
because the rest of the system is supposed to influence the results.

Using the G-CAT or DH-CAT to evaluate a specific component requires maintaining
a static system and changing only the component being tested between different
tests. Any change in the result can then be attributed to the new component.
However, it is not the absolute performance of the hardware that is being
evaluated. Instead, the CATs measure the impact of that hardware on a specific
system. In other words, the result will only change if
the new hardware produces a subjective difference in how the system performs.

The answer to the second question — What kind of performance is being measured?
— is that, in essence, user experience is being measured. Strictly speaking, that’s
not quite true; the G-CAT uses FRAPS to measure frame rate, and the DH-CAT measures
dropped frames, frame delays and variations in image quality. However, with
the research that Intel has done, these raw measurements are interpreted
in more meaningful ways.
These benchmarks are most useful to everyday users who are not technically inclined
but want to know what kind of system they need to do what they want.

RELIABILITY, REPEATABILITY

Anyone who is seriously considering using one of Intel’s tools will
inevitably ask the question, how accurate are they? The short
answer is that there is simply no way of knowing without actually using them.
Ultimately, the accuracy of these tools will not be known until (and unless)
they have been exposed to the wild. A huge range of systems needs to be tested under
a huge number of circumstances before any serious attempt at answering this
question can be made.

However, a guess at the accuracy of the tools can be made by examining two
factors:

  1. How closely the tools model the experiences of actual users, i.e. how reliably
    they report what they intend to report.
  2. How much inter-run variance there is and whether it is possible to generate
    different results on the same machine, i.e. how repeatable the
    testing is.

Reliability

The reliability of the tools will be governed by how thorough the research
into user experience is, and how well developed the prediction models are.
This is the question of how well the tools predict the user experience from
the data. If the confidence intervals produced by the G-CAT and the capability
table produced by the DH-CAT actually do predict how end users of the tested
system would judge the system, the tools can be considered reliable.

We are cautiously optimistic that the tools will produce reliable results.
Ultimately, we only have Intel’s word about the research behind the tools,
but everything that we’ve been shown has indicated that a lot of care has
been taken to conduct the research in a way that produces statistically valid
and useful results. In fact, I was asked to refrain from using the term “market
research” to describe the research because of the imprecision and sloppiness
that the term implies.

If anything, this is the most important thing for Intel to get right, as
they will be suspected of biasing the benchmark if the tools do not produce
impartial results. Intel is certainly making an effort to remain impartial:
Some of the code in the tools will be available for close examination, and much research has been contracted to independent
firms. Perhaps the best proof of Intel’s good intentions is one of the people behind the tools. Dave Salvator was recently hired by Intel after a stint with the
tech web site ExtremeTech. Prominently on his resume is a piece published on ET in 2003 about nVidia’s GeForce 5000 series using some questionable optimizations on the 3DMark03 benchmark. Personally, I find it unlikely that Intel would want to release a biased benchmark with this fellow around.

Repeatability

Repeatability refers to the amount of variance between different test runs.
No matter how reliably the tools interpret the data, the results are useless
if the data is not accurate in the first place. The key to obtaining accurate
data is to make sure that a single system always produces the same data.
Unfortunately, the focus of the tools on real-world testing will inevitably
hurt the repeatability of the tools, especially the G-CAT, which bases its
results on actual gaming sessions.

There is no better way to simulate “real game play” than to actually play
the game. Unfortunately, this creates a problem. Gaming is a random activity,
which means that tests are not repeatable. Every session of testing will produce
a slightly different result because the game will not be played the same way
every time.

Timedemos or even scripted bots produce “gameplay” that is the same every time.
For this reason they are reliable and repeatable, which is very nice for testing
purposes. However, these tests are often done with AI and Physics disabled
and do not accurately reflect the load on the system during actual gaming.

So, should the realism that comes with the randomness of gaming be included
in the test, or should it be sacrificed in the name of repeatability? This
is a question that cannot be answered without examining the tool in person,
so I will not try to answer it here. Instead, I would like to put it in a
slightly different way for each tool.

For the G-CAT: Will all tests on a single system have confidence intervals
that overlap, no matter what game, what level, and what settings are used?
Intel admits that a player death during the testing can skew results (and
so recommends re-running such tests), but are there other factors that also
affect the end result?

For the DH-CAT: Will the capability level (and extra-credit
score) stay consistent across different tests on the same platform?

LIMITATIONS AND DRAWBACKS

In their current form, the tools are still quite limited. Neither feels like
it has progressed beyond the beta stage, and there are a multitude of factors
that might be relevant but have not yet been researched enough to know whether
they affect the user’s experience. That said, Intel also does not yet consider either
tool a finished product. Work is still being done on both tools, and it will
continue even after they are released. Both tools have
substantial roadmaps for future features and improvements. For the G-CAT, this
means further research on different games (perhaps including new genres), while
the DH-CAT can expect to see more primitive tasks and a completely new category:
Premium Connected.


Many improvements for the G-CAT are planned.

Of the planned improvements to the G-CAT, the most pressing is probably the
need to determine the effect of different game settings. During the presentation
of the tool to the press, many people expressed concern that the settings used
— default graphics options at 1024 x 768 resolution — do not test the full
capabilities of most graphics cards. This is less of a concern than it might
seem — good quality testing is possible no matter what settings are used so
long as the basic test parameters remain the same — but expanding the capabilities
of the tool to encompass more variables could definitely improve the reliability
of the end result.

The amount of research needed to make this
feature possible is substantial. Proper experimental technique dictates that only
one setting can be changed at a time, which means that testing the effect of
multiple settings requires multiple tests and an increase in the sample size.
In my opinion, the basic settings used in the initial market research are perfectly
adequate for the tool in its present form. The average gamer often does not
even look at the graphics options in his games — he simply plays at the default
settings unless he needs to turn things off for performance reasons.


Some of the features that may show up in future versions of the tool.

Of the features listed above, the most important from the standpoint of user
experience is probably A/V sync. Intel has stated that this feature is in
active development, so we expect to see it sooner than later.

For SPCR, the most interesting feature is “Thermal / Acoustic“.
A software tool that could determine whether a system produces an acceptable
level of noise or heat would be invaluable — but it’s probably a pipe
dream. Even thermal testing would be difficult to do properly, due to the
variations of thermal monitoring on different systems, and acoustic testing
seems almost impossible. How could a self-contained system measure how it
sounds from a meter away? Even if a sound meter was embedded in the motherboard
chipset, how could it determine how the system sounds from a standard position outside the
case?

CONCLUSIONS

There is much to like about Intel’s new Capabilities Assessment Tools. Measuring
performance in terms of subjective user experience is a significant step forward.
So is the idea of measuring a “system” instead of a single piece of hardware.
And, last but not least, evaluating a system in terms of its capabilities rather
than abstract numerical terms is exactly the way we believe it should be evaluated.

So much for the good. The theoretical goals behind the CATs are beyond reproach,
but these lofty goals must contend with the limitations of what can be tested,
and determining what affects user experience is by no means simple.
In their current form, Intel’s tools do not yet feel complete. Intel has recognized
this, and has promised much in the roadmaps for the two tools, but many questions
still remain.

Group perceptual research is useful for determining what
level of performance on specific parameters is satisfactory, but it says very little about what factors
contribute to user experience. Before any market research is done, a specific
theory about what affects user experience must be developed. In other words,
there may be factors that affect user experience that are simply not being measured
because nobody has recognized that they are significant.

This is why it is helpful to examine as many factors as possible. This is especially
important for the DH-CAT, which basically provides the user with a set of checkboxes
for each possible task. The more checkboxes there are, the more likely it is
that the tool will be helpful. This approach is commendable — it moves
away from trying to simplify performance as a single number — but if the
checkboxes don’t reflect what users are actually doing with their computers,
it’s not much help.

All in all, the Capabilities Assessment Tools reflect a holistic approach to
computer performance. Thinking about performance in
terms of capabilities and actual benefits is a big step towards maturity
in this industry. Perhaps the biggest obstacle that the tools face is not their
actual implementation, but convincing people that this is the way to think about
performance, and helping them adapt to the language of user experience rather
than bandwidth, throughput and latency. In addition, hardware sites will
need to re-think their test methodologies, since the CATs test entire systems,
not separate components.

Another potential obstacle is the fact that the market research is based on
the “average user”, while most review sites cater to the enthusiast
market. While the occasional drop below the FPS threshold hardly matters to
a casual gamer, it may be of great concern to a serious gamer who has been
lag-killed. Similarly, the default settings in the G-CAT tool reflect the settings
used by everyday users, not enthusiasts. The same issue applies to the research
behind video quality: How does the opinion of an “average user” apply
to a couch-potato who has trained his eye to notice visual flaws? Has too much
detail been glossed over by calibrating the “acceptable” level of
quality to the average user’s tastes?

Intel has some appreciation of the challenges to acceptance of their new performance
assessment approach. They’ve taken steps to try and include the media as well
as early adopters among PC users by creating a site about the new CATs: http://www.intel.com/performance/newtools.htm.
In the IDF Fall 2005 press conference where much of the info here was presented,
Intel promised to make both of these tools available for free download. They
also strongly urged the hundred or so attendees to provide feedback on the tools to
make them better. SPCR staff members have been invited to participate in the beta
testing program for both tools. There appears to be an open attitude regarding
the development of these tools in order to ensure that the end results will
be embraced by the review community.

In the end, we can toss in our two cents worth, then wait and see. Once the tools are released, it is a matter of enough testers testing enough systems to verify the accuracy of the tools in predicting user experience. It’s safe to predict that flaws will surely be found, and the tools will likely evolve to keep up with changes in PC components.

* * *

EDITOR’S NOTE: We believe Intel’s new approach to performance assessment is an important step for the industry. It’s possible that they’re being developed partly because Intel has been edged consistently by AMD in standard processor benchmarks for the last year or two. A scan through the major gaming and performance oriented web sites shows a broad consensus of praise for AMD’s top models as the processors of choice. It’s possible that this step down from the performance pinnacle of market perception is a factor. A new way of assessing PC performance that deemphasizes the role of raw speed might be helpful to cool the current enthusiasm for AMD processors among performance nuts. This is conjecture, indeed. In our view, it hardly matters. What does matter is that real users’ experience and perception are finally being included in assessment tools being developed by the most powerful entity in the PC hardware world.

We’ve been invited to visit the labs in Oregon where some of Intel’s research on acoustics is being done. One fascinating and ambitious project undertaken by Intel was to measure the ambient noise levels inside certain “standard” types of buildings in selected cities around the globe. This was an effort to answer the question, What is typical ambient background noise? It’s an issue that is fundamental to silent computing. In general, a computer only has to be quieter than ambient by some small degree (aha, another research project!) for it to be effectively silent. Some of the cities mentioned were Shanghai, NY, and Berlin. Obviously, there are many methodological challenges, but the simple fact that Intel has embarked on such a research project bodes well for the silent PC future. We’ll certainly make a report if this visit materializes.

* * *

Comment on this article in our Forums.
