Introduction 2024
Computer Architecture
Introduction to Computer Architecture and Programming Models
Fernando.Rincón@uclm.es, JoseAntonio.Torre@uclm.es
Outline
● Architecture
The Concept of Architecture
● Considers three aspects:
● ISA (Instruction Set Architecture): design and properties of the
machine instructions.
– External architecture of the hardware.
● Organization: the structure formed by the main components of a
computer (memory, CPU, buses...), as well as the more abstract
characteristics of those components.
– Also referred to as internal architecture or microarchitecture.
● Hardware: detailed implementation of the structure. Main aspects:
– Detailed logical design
– Implementation as an Integrated Circuit (IC)
Main Aspects in the Design of a Computer
Organization & Implementation
● All 3 aspects are interdependent
● Ex.: an ideal architecture for an MSI implementation may not be right for a VLSI one.
● The computer designer cannot design an ISA ignoring
implementation aspects
● Implementation technology evolution has a deep impact
on the performance of microarchitectural units:
– Ex: memory keeps improving bandwidth but not latency → need
for deeper caches
– Ex: power consumption increases, but there is less area for dissipation → downclocking
Current Architectures
[Images: sensor networks, cameras, set-top boxes, game consoles, robots]
But this is how we started
● ENIAC (1946)
● 30 tons
● 17,000 vacuum tubes
– Transistors had not been invented yet
● 5K additions per second
● 175 kW
● 167 m² of area
● $6,904,265 equivalent price in 2022
Computer technology evolution
● IC technology evolution quite steady
● Computer architecture evolution not that much
● 5 phases:
● Initial 25 years: improvement mainly due to IC technology
– Clock and size
● Birth of RISC architectures: performance boost (ILP)
– Better compilers thanks to simpler architectures
● Power & ILP wall
– Power dissipation in smaller areas → heat
– Bandwidth & latency mismatch
● Birth of multicores
– From implicit parallelism (ILP) to explicit parallelism
● The end of Moore’s Law
– Specialized processors as an alternative to the von Neumann architecture
CPUs Performance
[Chart: CPU performance over the years, marking the end of Moore’s Law and the era of increasing clock frequency]
Trends in technology
● Integrated circuit technology (Moore’s Law)
● Transistor density: 35%/year
● Die size: 10-20%/year
● Integration overall: 40-55%/year
● Nowadays coming to an end
● DRAM capacity: 25-40%/year (slowing)
● 8 Gb (2014), 16 Gb (2019), possibly no 32 Gb
● Flash capacity: 50-60%/year
● 8-10X cheaper/bit than DRAM
● Magnetic disk capacity: recently slowed to 5%/year
● Density increases may no longer be possible, maybe increase from 7 to 9 platters
● 8-10X cheaper/bit than Flash
● 200-300X cheaper/bit than DRAM
Trends in technology
Latency vs Bandwidth
[Chart: latency vs bandwidth trends; the bottleneck between memory and CPU]
Trends in technology
Milestones
Power and Energy
(Power consumption capacity.)
Energy and power in the chips
Complementary MOS (silicon) → transistors are combined → each behaves as a switch → logic gates are combinations of transistors
Dynamic energy and power
● Dynamic energy consumed by a single transistor during a
transition (0→1 or 1→0):
(When designing the circuit you face certain variables.)
● Edyn = k × CL × V²
– CL: capacitive load; related to the physical characteristics of the transistors
– V: voltage
– k: proportionality constant
Dynamic energy and power
● If there are N active transistors in the integrated
circuit (IC):
● Pdyn,IC = Pdyn × N = k × CL × V² × F × N
– F: switching (clock) frequency
– N: number of active transistors (the transitions that actually take place)
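A minimal sketch of this power model in Python; the constant k and every numeric value below are illustrative assumptions, not real process data:

    # Dynamic power of a whole IC: Pdyn,IC = k * CL * V^2 * F * N
    def dynamic_power(k, c_load, voltage, freq, n_switching):
        return k * c_load * voltage**2 * freq * n_switching

    # Illustrative values only: halving V cuts dynamic power to a quarter.
    p_nominal = dynamic_power(0.5, 1e-15, 1.0, 2e9, 1e8)
    p_low_v = dynamic_power(0.5, 1e-15, 0.5, 2e9, 1e8)
    print(p_nominal, p_low_v, p_low_v / p_nominal)  # ratio -> 0.25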
Static energy and power
It is fixed: it does not depend on the switching activity
● Static power per transistor:
● Pst ∝ Ist × V
● Ist: the static (leakage) current flowing inside the chip
● It increases due to the rising number of transistors
● Some subsystems may be switched off when not used
– May affect performance
● It is around 25% of the overall power consumed
– Rising up to 50% for high-performance designs
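Continuing the sketch above, total chip power adds the static term; the leakage current and voltage are assumed numbers chosen to reproduce the ~25% share quoted above:

    # Total power = dynamic + static. All values are illustrative.
    p_dynamic = 75.0           # watts, e.g. from the dynamic model above
    i_static = 25.0            # amperes of leakage current (assumed)
    v = 1.0                    # supply voltage, volts
    p_static = i_static * v    # Pst ~ Ist * V
    p_total = p_dynamic + p_static
    print(p_static / p_total)  # -> 0.25, the ~25% static share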
Trends in power and energy
● Energy consumption per unit of time (power) in microprocessors keeps rising:
● For a single transistor it decreased (lower V and CL)
● But the number of transistors keeps increasing
● And so does the clock frequency (but not by much: the power wall; see next slide)
● More heat is generated, which must be dissipated
● Otherwise the chip would burn out
● Power consumption is one of the main concerns of current chip designers
Evolution of clock frequency in microprocessors
The problem of power consumption
● How to design the power supply system to provide
as much power as required
● How to design the cooling system to avoid
overheating
● Current chips reduce their clock frequency when they
reach a critical temperature
● But efficiency is better measured with energy than
with power. Which chip is more efficient, A or B?
(The task is scheduled accordingly: V is raised or lowered.)
● Compared to B, A requires 20% more power but takes 75% of the time to complete a task
– A consumes 90% of the energy (1.2 × 0.75 = 0.9) to complete the same task
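A quick check of that arithmetic, with power and time normalized to chip B:

    # Energy = power x time. Values normalized to chip B (1.0 each).
    power_a, time_a = 1.20, 0.75  # A: 20% more power, 75% of the time
    energy_a = power_a * time_a
    print(energy_a)  # -> 0.9: A uses 90% of B's energy, so A is more efficient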
Minimizing energy consumption
● Several alternatives:
● Turn off inactive modules
– Trade-off: the static consumption saved vs. the cost of bringing the module back up
● Underclocking
– (Raising the frequency back afterwards carries risk.)
Dark silicon problem
We cannot power all parts of the chip at once (a large part must remain switched off)
Evolution of the dark silicon problem with the arrival of multicores
Trends in architecture
● Cannot continue to leverage Instruction-Level parallelism (ILP)
(Hidden parallelism: steps that can be executed in parallel; compilers and architectures are able to exploit this.)
● Single processor performance improvement ended in 2003
● New models for performance:
(Explicit parallelism: it depends on the programmer.)
● Data-level parallelism (DLP): GPUs
● Thread-level parallelism (TLP)
● Request-level parallelism (RLP): the cloud
● These require explicit restructuring of the application
● ILP was automatically handled by the compiler
● Parallel programming requires the programmer to be aware of the architecture
– Granularity of the parallelism
– Memory hierarchy bandwidth and latencies
– Data structures and their behavior with respect to caches
Trends in architecture:
Example: the evolution of CPUs
Design complexity
● CPI evolution
● 80s: 5.0 → 1.15 (decade of pipelining)
● 90s: 1.15 → 0.5 (decade of superscalar)
– (A proportional reduction factor, but never exact: limited by the percentage of parallelism in the program.)
– (Superscalar designs increase performance but consume more: they are more aggressive and discard speculated conditional instructions.)
The End of Uni-processors
● In the early 2000s we got to a point where increasing ILP and clock frequency resulted in more problems than benefits:
● Consequence:
● Around 2005 Intel and other manufacturers switched to
multicore as the way to increase performance
– Integrating several identical processors in the same chip, versus a bigger and more complex single core, is:
● Simpler to design and verify → cheaper
● More energy efficient (no clock increase and selective powering of the
cores used)
(More degrees of freedom.)
● The birth of multi-cores is the most important landmark in
computer architecture since pipelining and ILP
The Context Changes
● The arrival of multicore can be seen as the response to new big challenges
● With respect to consumption
● Before: Energy cheap / transistors expensive
● Now: “Power Wall”. Energy expensive / transistors very cheap
– A chip can integrate more transistors than can be powered
● With respect to parallelism
● Before: enough with ILP extraction via compilers and hardware (out-of-order, speculation, VLIW, …)
● Now: “ILP Wall”. Hardware complexity prevents further improvements in the degree of parallelism
The Context Changes II
● With respect to memory:
● Before: Multiplications slow / access to memory fast
● Now: “Memory Wall” memory very slow / multiplications very fast
● Processor performance:
● Before: improvement 2x/1.5 years
● Now:
– Improvement 3% per year
– Improvement 2x processors per chip/ ~ 2 years
● With respect to complexity:
● Before: design and verification of increasingly large cores is very expensive (the “Complexity Wall”)
– 100s of engineers for 3-5 years
– And caches are easy to design, but locality is limited
● Now:
– 2x cores/chip with each CMOS generation
– It doesn’t compromise clock frequency
Superscalar vs Multicore
● Typical architectures
(Multicore: interconnection through buses; the work on the different processors must be orchestrated and described.)
Trends in architecture
● Nowadays all computers are parallel
● A parallel computer is a system where some
processors or computers collaborate for the
resolution of a problem
● Only those where parallelism is visible to the programmer are considered parallel computers
● Even older computers include some kind of parallelism:
– There's parallelism when at least during some instants some
computing events happen at the same time
Trends in architecture
● During the last decades of the 20th century, performance improvement relied on the increase in ILP
● All current processors include ILP techniques (studied
during the course)
● Now we should outline:
● MIMD architectures
– Task parallelism
● Vector architectures & graphics processing units (GPU [Graphics Processing Unit], GPGPU [General-Purpose GPU])
– They exploit data parallelism
Trends in architecture
● Combination of multi-core, GPU and other specific
processors results in heterogeneous systems:
● Multi-cores deal with task parallelism
● GPUs deal with data parallelism (SIMD)
● Can be integrated in a single chip
– Technology has led to Systems-on-Chip
Hybrid MIMD
Example: the ARM big.LITTLE architecture
Specific processors
Neural Processing Unit (NPU)
Classes of Computers
● PMD (Personal Mobile Device):
● e.g. smartphones, tablet computers, wearables
● Clusters / Warehouse Scale Computers
● Used for “Software as a Service (SaaS)”
● Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks
Review: Computing performance
● Latency & Throughput:
● Throughput or bandwidth:
– Amount of work performed per unit of time
● Latency or response time (wall-clock time):
– Time elapsed between the start and the end of an event (the amount of time from the start of execution until it finishes)
● If the event is the execution of a program: running time
● It's not CPU time, which includes neither stalls nor idle times
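A toy illustration of the two metrics in Python (all numbers invented):

    # A server completes 1000 requests in 10 s of wall-clock time.
    requests, wall_time = 1000, 10.0
    throughput = requests / wall_time  # work per unit of time -> 100 req/s
    latency = 0.5                      # start-to-end time of one request, s
    print(throughput, latency)         # high throughput != low latency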
CPU performance
● Measured as the inverse of the CPU execution time
● CPU time has two components:
● CPU user time: devoted to the user program
● CPU system time: devoted to OS tasks related to the program
● CPU time equations:
– TCPU = CC × T = IC × CPI × T = IC × CPI / F
● CC: total number of clock cycles
● IC: number of instructions executed
● CPI: average number of clock cycles per instruction
● T: clock period
● F: clock frequency
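A small worked example of the equation; the instruction count, CPI, and frequency are made-up values:

    # T_CPU = IC * CPI / F, with illustrative values.
    ic = 2e9      # instructions executed
    cpi = 1.5     # average clock cycles per instruction
    freq = 3e9    # clock frequency in Hz
    t_cpu = ic * cpi / freq
    print(t_cpu)  # -> 1.0 second of CPU time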
Amdahl's law
● What is the global improvement when executing a certain task if a part of that task has been accelerated?
● Data:
– To: original time to perform the task
– F: fraction of the task that gets accelerated
– Gp: partial gain (acceleration rate)
(The solutions are a cloud of points that depend on the performance figures.)
● We want to compute Ti, the time of the improved version,
and the relationship between both of them
Exercise (worked in class):
● Processor A → 2 cores
● Processor B → 4 cores, 70% more transistors (power consumption matters)
● B's clock cycle is 20% longer, so B's frequency is lower than A's: Tclk,B = 1.2 × Tclk,A
● 70% of the runtime is distributed across the cores (with a synchronization point) and can benefit from more than one core; the rest runs on the system's master processor
● Question: what is the global gain of B with respect to A?
● Plain Amdahl's law is not enough here: two factors affect the time (B's clock frequency is different)
● Ta = CCa × Tclk,A, with CCa = (1 − 0.7) × CC + (0.7 / 2) × CC = 0.65 × CC
● Tb = CCb × Tclk,B, with CCb = (1 − 0.7) × CC + (0.7 / 4) × CC = 0.475 × CC
(CC: clock cycles of the sequential execution.)
● Gain = Ta / Tb = 0.65 / (0.475 × 1.2) ≈ 1.14 → B is about 14% faster
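A numeric check of the exercise (variable names are ad hoc):

    # Fraction f = 0.7 runs on the cores; B's clock period is 1.2x A's.
    f, cc = 0.7, 1.0                  # cc: cycles of the sequential run
    cc_a = (1 - f) * cc + f / 2 * cc  # A has 2 cores -> 0.65
    cc_b = (1 - f) * cc + f / 4 * cc  # B has 4 cores -> 0.475
    gain = cc_a / (cc_b * 1.2)        # B's cycle is 20% longer
    print(gain)                       # -> ~1.14, i.e. B is ~14% faster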
Amdahl's law
● How do we compare Ti versus To?
● Dividing To/Ti which represents the global gain
● Ti = F · To / Gp + (1 – F) · To
● Gg = To / Ti = 1 / (1-F + F / Gp)
● But:
● The law only applies if the accelerated and non-accelerated parts don't overlap
● There's a limit in the global gain, no matter what the partial
gain is: 1 / (1 – F)
● We should focus the improvement on the bottlenecks
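A minimal sketch of the law and its asymptote (f and the Gp values are arbitrary):

    def global_gain(f, gp):
        # Amdahl's law: Gg = 1 / ((1 - f) + f / gp)
        return 1.0 / ((1.0 - f) + f / gp)

    # With f = 0.7 the global gain can never exceed 1 / (1 - 0.7) = 3.33...
    for gp in (2, 4, 16, 1e6):
        print(gp, round(global_gain(0.7, gp), 3))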