stackoverflow: Purpose of memory alignment (original + translation)

Original: https://stackoverflow.com/questions/381244/purpose-of-memory-alignment
Translator: joeyhtml

The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
(Translator's note: as I understand it, this boils down to two constraints: an access must start at a prescribed address, and each access fetches one word.)

Speed

Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access -- on a processor that supports it -- the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
(Translator's note: on a 64-bit processor the word size is 8 bytes; also, some processors do not support unaligned access at all.)
Because of this, it can very easily be slower to read two bytes than four (translator's note: e.g., two bytes that straddle two aligned words take two reads, while four bytes sitting inside a single aligned word take only one). For example, say you have a struct in memory that looks like this:

struct mystruct {
    char c; // one byte
    int i; // four bytes
    short s; // two bytes
};

On a 32-bit processor it would most likely be aligned like shown here:
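The layout diagram that illustrates this in the original answer is not reproduced in this copy. As a stand-in, a minimal C sketch (assuming a typical ABI where int requires 4-byte alignment) that prints where the compiler actually places each member:

#include <stddef.h>
#include <stdio.h>

struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};

int main(void)
{
    /* On a typical 32-bit ABI this prints c=0 i=4 s=8 sizeof=12:
     * three padding bytes after 'c' and two after 's', so each member
     * starts at a multiple of its size. Exact numbers are
     * implementation-defined. */
    printf("c=%zu i=%zu s=%zu sizeof=%zu\n",
           offsetof(struct mystruct, c),
           offsetof(struct mystruct, i),
           offsetof(struct mystruct, s),
           sizeof(struct mystruct));
    return 0;
}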

The processor can read each of these members in one transaction.
Say you had a packed version of the struct (translator's note: i.e., an unpadded, unaligned version), maybe from the network where it was packed for transmission efficiency; it might look something like this:
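The original answer's diagram of the packed layout is likewise missing here. One way to reproduce such a layout, as a sketch (the packed attribute is a GCC/Clang extension; MSVC spells it #pragma pack(1)):

#include <stddef.h>
#include <stdio.h>

struct mystruct_packed {
    char c;
    int i;
    short s;
} __attribute__((packed)); /* GCC/Clang extension */

int main(void)
{
    /* Prints c=0 i=1 s=5 sizeof=7: no padding at all, so 'i' now
     * starts at offset 1, an address that is not 4-byte aligned. */
    printf("c=%zu i=%zu s=%zu sizeof=%zu\n",
           offsetof(struct mystruct_packed, c),
           offsetof(struct mystruct_packed, i),
           offsetof(struct mystruct_packed, s),
           sizeof(struct mystruct_packed));
    return 0;
}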

Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
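In C terms, on a little-endian machine, that fixup is a single aligned load plus a shift; note that the C shift operates on the numeric value, so it runs in the opposite direction from the byte-lane shift described above. A sketch with a hypothetical helper (load16_at_offset5 is not from the original answer):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fetch the 16-bit value stored at byte offset 5, i.e. inside the
 * aligned 32-bit word that starts at offset 4 (little-endian). */
static uint16_t load16_at_offset5(const uint8_t *mem)
{
    uint32_t w;
    memcpy(&w, mem + 4, 4);     /* one aligned 32-bit read */
    return (uint16_t)(w >> 8);  /* drop the low byte, keep the next 16 bits */
}

int main(void)
{
    _Alignas(uint32_t) uint8_t mem[8] = {0, 0, 0, 0, 0x00, 0xAD, 0xDE, 0x00};
    printf("0x%04x\n", (unsigned)load16_at_offset5(mem)); /* prints 0xdead */
    return 0;
}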
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
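The same recombination for the 32-bit case, again only as a little-endian C sketch of the two aligned transactions plus shift-and-OR that the answer describes (load32_unaligned is a hypothetical helper):

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Emulate a 32-bit load at an unaligned byte offset using only
 * aligned 32-bit loads, shifts and an OR (little-endian machine). */
static uint32_t load32_unaligned(const uint8_t *mem, size_t off)
{
    size_t word = off & ~(size_t)3;          /* aligned word below 'off'   */
    unsigned sh = (unsigned)(off & 3) * 8;

    uint32_t lo, hi;
    memcpy(&lo, mem + word, 4);              /* aligned transaction #1     */
    if (sh == 0)
        return lo;                           /* aligned: one load suffices */
    memcpy(&hi, mem + word + 4, 4);          /* aligned transaction #2     */
    return (lo >> sh) | (hi << (32 - sh));   /* splice the two words       */
}

int main(void)
{
    _Alignas(uint32_t) uint8_t mem[8] =
        {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};
    /* Bytes at offsets 1..4 are 11 22 33 44, read back as 0x44332211. */
    printf("0x%08" PRIx32 "\n", load32_unaligned(mem, 1));
    return 0;
}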

Range

For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states). Taking the 2 LSBs off of an address would give you a 4-byte alignment, also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
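A small sketch of that property, assuming (as on typical ABIs) that an int object is 4-byte aligned: the low two bits of any such address are always 00, so they carry no information and an address bus that counts in words can simply omit them.

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int x;                          /* 4-byte aligned on typical ABIs */
    uintptr_t addr = (uintptr_t)&x;

    assert((addr & 0x3) == 0);      /* the 2 LSBs are always 00 */

    /* An address bus that counts words instead of bytes drops those
     * two bits: same number of lines, 4x the reachable memory. */
    printf("byte address 0x%" PRIxPTR ", word index 0x%" PRIxPTR "\n",
           addr, addr >> 2);
    return 0;
}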
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.

Atomicity

The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
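As a rough illustration of what that guarantee buys, a C11 sketch: the counter below needs no mutex precisely because an aligned int is read and written in one indivisible transaction on mainstream hardware (whether a given atomic type is actually lock-free is implementation-defined, hence the query at the end).

#include <stdatomic.h>
#include <stdio.h>

static _Atomic int counter;   /* naturally aligned word */

void bump(void)
{
    /* No lock needed: aligned-word atomicity plus a fetch-and-add
     * instruction keep concurrent increments correct. */
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

int main(void)
{
    bump();
    printf("counter=%d lock_free=%d\n",
           atomic_load(&counter),
           (int)atomic_is_lock_free(&counter));
    return 0;
}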

Conclusion

The memory system of a processor is quite a bit more complex and involved than described here; a discussion of how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read about in this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out of, and between more and faster execution units, in a highly reliable way.
