AMD Opteron Coverage - Part 1: Intro to Opteron/K8 Architecture
by Anand Lal Shimpi on April 23, 2003 3:03 AM EST - Posted in CPUs
Go Deep
At this point, AMD had two options: go wide, or go deep. Going wide would mean increasing the number of execution units in the core, which would have made sense if the K8 were an enterprise-only core, since a wider core is better suited to massively parallel applications. However, the desire to make the K8 a desktop solution as well, and the higher clock speeds that implies, led AMD to go deep; that is, to deepen the pipeline.
Regardless of which route AMD chose, one thing was certain: in order to take advantage of the fact that more instructions would be coming down the pipe between branch mispredicts, the core would need a larger buffer to store those instructions. The buffer that keeps instructions "in flight" is known as the scheduling window, and if you look back at the increase in the number of entries in the integer scheduler you'll now understand why AMD made the change; more instructions in flight require more entries in the schedulers to keep up. We can only assume that the floating point scheduling window was already large enough that it did not need any expansion; keep in mind that most branch instructions occur in integer code, which helps explain why the floating point scheduler remains unchanged from the K7 despite the more accurate branch predictor.
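The scheduler-sizing argument above can be sketched with Little's law: a core that issues W instructions per cycle and keeps each one in flight for roughly D cycles must track about W x D instructions at once. The widths and depths below are illustrative assumptions for the sake of the arithmetic, not AMD's published design parameters.

```python
# Back-of-the-envelope scheduling window sizing via Little's law.
# All numbers here are hypothetical, chosen only to illustrate the trend.

def in_flight_instructions(issue_width, pipeline_depth):
    # Instructions in flight = issue rate (per cycle) * cycles each
    # instruction spends in the pipeline before retiring.
    return issue_width * pipeline_depth

# A 3-wide core with a 10-stage pipeline (K7-like depth):
shallow = in_flight_instructions(3, 10)  # 30 in flight

# The same width with a 12-stage pipeline (K8-like depth):
deep = in_flight_instructions(3, 12)     # 36 in flight

# Deepening the pipeline raises the number of instructions in flight,
# which is why a deeper core needs a larger scheduling window.
print(shallow, deep)
```

The exact figures don't matter; the point is that window size must scale with depth, or the extra stages sit idle.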
AMD Integer Pipeline Comparison

| Clock Cycle | K7 Architecture | K8 Architecture |
|---|---|---|
| 1 | Fetch | Fetch 1 |
| 2 | Scan | Fetch 2 |
| 3 | Align 1 | Pick |
| 4 | Align 2 | Decode 1 |
| 5 | EDEC | Decode 2 |
| 6 | IDEQ/Rename | Pack |
| 7 | Schedule | Pack/Decode |
| 8 | AGU/ALU | Dispatch |
| 9 | L1 Address Generation | Schedule |
| 10 | Data Cache | AGU/ALU |
| 11 | | Data Cache 1 |
| 12 | | Data Cache 2 |
So AMD opened up the integer scheduling window and gave the K8 a couple more pipeline stages to play with. The purpose of these additional stages, as we mentioned before, is solely to allow the K8 core to reach higher clock speeds. The more pipeline stages you have, the less work is done per stage and thus the higher you're able to clock your CPU; this is why the 20-stage Xeon is currently at speeds of 3GHz, while the 12-stage Opteron is debuting at 1.8GHz.
The difference in pipeline architectures is what makes a clock-for-clock comparison between the Xeon and Opteron invalid (much like the Pentium 4 to Athlon XP comparison was invalid on a clock-for-clock basis). The Xeon's architecture allows it to reach high clock speeds at the expense of doing less work per clock cycle; the appropriate comparison ends up being one of cost and real-world performance, not one of clock speed.
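The point about clock-for-clock comparisons can be made concrete with a rough throughput model: performance is clock rate times instructions per clock (IPC), and deeper pipelines trade IPC for clock. The IPC figures below are invented purely for illustration; they are not measured values for either CPU.

```python
# Rough model: throughput = clock frequency * average IPC.
# The IPC values are hypothetical, chosen to show that two very
# different clock speeds can deliver similar real-world throughput.

def effective_perf(clock_ghz, ipc):
    # Billions of instructions retired per second.
    return clock_ghz * ipc

# A deep-pipeline design: high clock, less work per cycle (Xeon-style).
deep_pipe = effective_perf(3.0, 0.9)

# A shorter-pipeline design: lower clock, more work per cycle (Opteron-style).
short_pipe = effective_perf(1.8, 1.5)

# By construction these come out equal here -- the clock speed gap
# alone tells you nothing about which chip is actually faster.
print(deep_pipe, short_pipe)
```

This is why the article compares the two on cost and measured performance rather than megahertz.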
The K8's cache sizes and organization are identical to the K7's, although with slightly lower latencies and higher bandwidth, as you can see from the following performance comparisons:
[Charts: K7 vs. K8 cache latency and bandwidth comparisons]
The final microarchitectural changes (before we get into the two big ones) to the K7 core come in the form of what AMD likes to call the K8's "Large Workload TLBs."
The number of entries in the K8's Translation Lookaside Buffers (TLBs) has been increased to cope with the usage model the Opteron will most likely find itself in: servers with very large memory requirements. The performance impact of the larger TLBs on the desktop will most likely be minimal, as we noticed when AMD first increased TLB sizes with the K7 Palomino core; if you recall, 3D rendering applications did receive a modest performance boost at the time.
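The benefit of more TLB entries can be quantified as "TLB reach": the amount of memory the TLB can map before a miss forces a page-table walk. The entry counts below are illustrative round numbers, not the K8's actual TLB sizes.

```python
# TLB reach: how much memory can be addressed without taking a TLB miss.
# Entry counts here are hypothetical, for illustration only.

def tlb_reach_bytes(entries, page_size_bytes):
    # Each TLB entry maps one page, so reach = entries * page size.
    return entries * page_size_bytes

PAGE_4K = 4096

small_tlb = tlb_reach_bytes(512, PAGE_4K)   # 2 MiB of reach
large_tlb = tlb_reach_bytes(1024, PAGE_4K)  # 4 MiB of reach

# A server working set of many gigabytes blows through either figure,
# but doubling the entries halves the miss pressure for a given
# access pattern -- which is the kind of workload the Opteron targets.
print(small_tlb, large_tlb)
```

On a desktop working set that already fits in the smaller reach, the extra entries buy little, which matches the minimal desktop impact noted above.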