Learning Go's Mechanics Through Memory Allocation

Date: 2020-06-22

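Introduction

In the previous post, I covered the basic scenarios of escape analysis. There are other scenarios I did not cover. To introduce them, I wrote a program specifically for debugging, and the way it allocates memory is rather surprising.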

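The Program

To learn more about the io package, I tried a quick project: find the string elvis in a byte stream and replace it with the capitalized version, Elvis.

The code lists two functions that solve this problem. This post focuses on the function algOne, because it uses the io package.

In the data below, the first block is the input and the second is the output expected after the input passes through algOne.

Listing 1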

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis

Here is the algOne function.

Listing 2

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }
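To try the function on a small input, the listing can be wrapped in a minimal harness. This is only a sketch: algOne is reproduced from Listing 2 without line numbers, while replaceStream and the sample string are hypothetical additions for illustration. The result is compared against bytes.ReplaceAll from the standard library as a cross-check.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// algOne is the function from Listing 2 with the line numbers removed.
func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
	input := bytes.NewBuffer(data)
	size := len(find)
	buf := make([]byte, size)
	end := size - 1

	// Prime the window with the first size-1 bytes.
	if n, err := io.ReadFull(input, buf[:end]); err != nil {
		output.Write(buf[:n])
		return
	}

	for {
		// Read one more byte into the last slot of the window.
		if _, err := io.ReadFull(input, buf[end:]); err != nil {
			output.Write(buf[:end])
			return
		}

		// On a match, emit the replacement and refill the window.
		if bytes.Compare(buf, find) == 0 {
			output.Write(repl)
			if n, err := io.ReadFull(input, buf[:end]); err != nil {
				output.Write(buf[:n])
				return
			}
			continue
		}

		// No match: emit the front byte and slide the window left.
		output.WriteByte(buf[0])
		copy(buf, buf[1:])
	}
}

// replaceStream is a hypothetical helper that runs algOne on a string.
func replaceStream(s string) string {
	var output bytes.Buffer
	algOne([]byte(s), []byte("elvis"), []byte("Elvis"), &output)
	return output.String()
}

func main() {
	in := "abcelvisaElvis"
	got := replaceStream(in)
	want := string(bytes.ReplaceAll([]byte(in), []byte("elvis"), []byte("Elvis")))
	fmt.Println(got, got == want)
}
```

The function keeps a window of len(find) bytes, so it never needs the whole stream in memory at once; that is the property the io-based design was chosen to explore.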

I want to know how well this function performs and how much pressure it puts on the heap. To find out, we need to run a benchmark.

Benchmarking

Below is the benchmark function that runs algOne to process the data stream.

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }

With this function in place, we can run go test with the -bench, -benchtime, and -benchmem options.

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op

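After running the benchmark, we can see that algOne allocates 2 values per operation, totaling 117 bytes. That is already quite good, but we need to know which lines of code cause these allocations. To find out, we need to generate profiling data for the benchmark.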
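The allocs/op column counts heap allocations per iteration. As a rough sketch, the same kind of count can be taken directly in code with testing.AllocsPerRun from the standard library. The names sink and allocsPerCall are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level, so anything stored in it has to live on the heap.
var sink []byte

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	escaping := allocsPerCall(func() {
		sink = make([]byte, 64) // stored in a global, so it escapes
	})
	fmt.Printf("allocs per call: %.0f\n", escaping)
}
```

This is handy for a quick check, but it does not say which line allocates; that is what the profile is for.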

Profiling

To generate profile data, we run the benchmark again, this time adding the -memprofile option.

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Once the benchmark finishes, two new files have been produced.

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

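The source code lives in a folder named memcpu. The algOne function is in stream.go and BenchmarkAlgorithmOne is in stream_test.go. The two generated files are mem.out and memcpu.test. The file mem.out contains the profile data. The file memcpu.test is the test binary, which is needed for symbol lookup when viewing the profile data.

With the profile data and the binary, we can run the pprof tool to study the profile.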

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

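When profiling memory and looking for low-hanging fruit, we want the -alloc_space option rather than the default -inuse_space. It shows every allocation that happened, regardless of whether the memory was still in use when the profile was taken.

At the pprof prompt, we can use the list command to inspect algOne. The command takes a regular expression and lists the functions whose names match it.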
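Outside of go test, a comparable profile can be captured from a running program with the runtime/pprof package. The sketch below is an assumption about how one might do that, not part of the original program: it writes the "allocs" profile, which holds the same data as the heap profile but makes pprof default to the alloc_space view. The names keep and allocProfileSize are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// keep holds on to the allocations so there is something to profile.
var keep [][]byte

// allocProfileSize allocates some memory, writes the "allocs" profile to an
// in-memory buffer, and returns the number of profile bytes written.
func allocProfileSize() (int, error) {
	for i := 0; i < 100; i++ {
		keep = append(keep, make([]byte, 1<<16))
	}
	runtime.GC() // the profile reflects the most recently completed GC

	var buf bytes.Buffer
	if err := pprof.Lookup("allocs").WriteTo(&buf, 0); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, err := allocProfileSize()
	fmt.Println(n > 0, err)
}
```

In a real program the buffer would normally be a file, which can then be opened with go tool pprof in the same way as mem.out.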

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

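Based on this profile, we know that the value input points to and the backing array of the buf slice are allocated on the heap. Since input is a pointer variable, what the profile is really saying is that the bytes.Buffer value it points to is allocated on the heap. So let's focus first on the input allocation and understand why it happens.

It would be natural to assume the allocation happens because bytes.NewBuffer creates the value and shares it back up the call stack to algOne. However, the value shows up in the flat column (the first column of the pprof output), which tells us the allocation is attributed to algOne itself: algOne shares the value in a way that makes it escape to the heap.

We know the flat column represents allocations made directly in the function because of what the list command shows for the Benchmark function, which calls algOne.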

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

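Since only the second column, cum, has a value, the Benchmark function does not allocate anything on the heap directly. All of the allocations come from the function calls made inside the loop. You can see that the allocation totals from the two list commands match (translator's note: 318.53MB + 16.50MB = 335.03MB).

At this point we still do not know why the bytes.Buffer value is allocated on the heap. This is where the -gcflags "-m -m" option of go build comes in. The profiler tells us which values escape to the heap; go build tells us why.

Compiler Reporting

We can ask the compiler to tell us why variables in the code escape to the heap.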

Listing 10

$ go build -gcflags "-m -m"
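
This command produces a lot of output. We need to look for lines containing stream.go:83, because stream.go is the file name and line 83 is where the bytes.Buffer value is constructed. Searching turns up the following 6 lines.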

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

The first line is very interesting.

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }
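
This line tells us that the bytes.Buffer value is not escaping because of a call to bytes.NewBuffer. bytes.NewBuffer is never actually called: the compiler inlined its body at the call site.

Line 83 contains the following code.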
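The effect of inlining on allocation can be observed in isolation. The sketch below uses a hypothetical point type and two hypothetical constructors that differ only in the //go:noinline directive. With inlining disabled, the callee has to allocate the value it returns on the heap. With inlining, the construction becomes part of the caller, and whether the value escapes depends on what the caller does with it. The exact counts depend on the compiler version, so none are stated here.

```go
package main

import (
	"fmt"
	"testing"
)

// point is a hypothetical 32-byte value used only for this illustration.
type point struct{ a, b, c, d int64 }

// newPoint is small enough that the compiler will normally inline it.
func newPoint(v int64) *point { return &point{a: v} }

// newPointNoInline is the same constructor with inlining disabled.
//
//go:noinline
func newPointNoInline(v int64) *point { return &point{a: v} }

// total gives the closures below an observable side effect.
var total int64

// inlinedAllocs counts allocations when the constructor can be inlined.
func inlinedAllocs() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPoint(1)
		total += p.a
	})
}

// noInlineAllocs counts allocations when the constructor is a real call.
func noInlineAllocs() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPointNoInline(1)
		total += p.a
	})
}

func main() {
	fmt.Println("inlined:    ", inlinedAllocs())
	fmt.Println("not inlined:", noInlineAllocs())
}
```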

Listing 13

83     input := bytes.NewBuffer(data)

Because the compiler chose to inline bytes.NewBuffer, the code above is effectively compiled as follows.

Listing 14

input := &bytes.Buffer{buf: data}
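
This means algOne constructs the bytes.Buffer value directly. So what causes the value input points to to be allocated on the heap? The answer is in the remaining five lines of the report.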

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93
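
These lines tell us that the code at line 93 causes the escape: the input variable is assigned to an interface value.

Interfaces

I do not remember assigning anything to an interface in this code. But a look at line 93 makes things clear.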

Listing 16

 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }

The call to io.ReadFull causes the interface assignment. If you look at the definition of io.ReadFull, you can see that its first parameter is an interface type.

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}
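
This shows that passing the address of the bytes.Buffer value to the function, which then stores it as an interface value, causes the value to escape to the heap. Now we can see the cost of using an interface: allocation on the heap and indirection (access would be faster if the value lived on the stack). If using an interface does not make the code better, it is best not to use one. I follow these guidelines when deciding whether to use an interface.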

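I use an interface when:

users of the API need to provide an implementation detail
the API has multiple implementations that each need to be maintained
parts of the API that can change over time have been identified and need decoupling

I do not use an interface:

for the sake of using an interface
to generalize an algorithm
when users can declare their own interfaces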

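Now we need to ask ourselves: does this algorithm really need io.ReadFull? The answer is no, because the bytes.Buffer type has its own set of methods, and using them keeps the value from being allocated on the heap.

We can now drop the use of the io package and call the Read method that input already has.

The code below removes the use of the io package. To keep the line numbers the same as before, the io import is kept with the blank identifier _, so the import stays in the list without being used.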
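The two ways of reading can be compared side by side. This is a sketch under assumptions: viaInterface, viaMethod, and the sample data are hypothetical, and the reported counts depend on the Go version, so none are stated here. In viaInterface the count may also include the read buffer, since it is passed through the interface as well.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"testing"
)

// data is sample input for this illustration only.
var data = []byte("abcelvisaElvis")

// viaInterface reads through io.ReadFull, whose first parameter is an io.Reader.
func viaInterface() (int, byte) {
	input := bytes.NewBuffer(data)
	var buf [4]byte
	n, _ := io.ReadFull(input, buf[:])
	return n, buf[0]
}

// viaMethod calls the Read method on the concrete *bytes.Buffer directly.
func viaMethod() (int, byte) {
	input := bytes.NewBuffer(data)
	var buf [4]byte
	n, _ := input.Read(buf[:])
	return n, buf[0]
}

func main() {
	fmt.Println("io.ReadFull:", testing.AllocsPerRun(1000, func() { viaInterface() }))
	fmt.Println("input.Read: ", testing.AllocsPerRun(1000, func() { viaMethod() }))
}
```

One difference to keep in mind: io.ReadFull keeps reading until the slice is full or an error occurs, while a single Read call may return fewer bytes than requested. That is why code that switches to Read needs to check the returned count, as the n < end test in the revised listing does.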

Listing 18

 12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

When we run the benchmark again, we can see that the bytes.Buffer value is no longer allocated on the heap.

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op
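
The output also shows a performance improvement of about 29%: the time went from 2570 ns/op to 1814 ns/op. With that solved, we can focus on the heap allocation of the backing array behind the buf slice. Profiling the new code should show what causes the remaining allocation.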

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
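
The only line still allocating is line 89, and the allocation is the slice's backing array.

Stack Frames

We need to know why the backing array of buf is allocated on the heap. Run go build again with -gcflags "-m -m" and search the output for stream.go:89.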

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89
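
The report says the backing array is too large for the stack. This message is very misleading. The backing array is not too large; the compiler simply does not know its size at compile time.

A value can be placed on the stack only if the compiler knows its size at compile time, because the size of every function's stack frame is calculated at compile time. If the compiler does not know the size of a value, it places the value on the heap.

To demonstrate this, let's temporarily hard-code the size of the slice to 5.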

Listing 22

89     buf := make([]byte, 5)

Running the benchmark again, all heap allocations are gone.

Listing 23

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

If you look at the compiler report again, you will see that nothing escapes.

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape
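
Obviously we cannot hard-code the size of the slice, so we will have to live with one heap allocation in this algorithm.

Allocations and Performance

With the three versions measured, we can compare the performance after each change.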
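The difference between a run-time size and a constant size can be checked in isolation. The sketch below uses hypothetical functions sumVariable and sumConstant and a hypothetical package-level size variable, so that the length is not a compile-time constant. Compilers change over time, and newer toolchains may keep small variable-sized slices on the stack, so the printed counts are not stated here.

```go
package main

import (
	"fmt"
	"testing"
)

// size is a package-level variable, so its value is not a compile-time constant.
var size = 5

// sumVariable sizes its scratch slice from a run-time value.
func sumVariable(n int) int {
	buf := make([]byte, n)
	for i := range buf {
		buf[i] = byte(i)
	}
	total := 0
	for _, b := range buf {
		total += int(b)
	}
	return total
}

// sumConstant sizes its scratch slice from a compile-time constant.
func sumConstant() int {
	buf := make([]byte, 5)
	for i := range buf {
		buf[i] = byte(i)
	}
	total := 0
	for _, b := range buf {
		total += int(b)
	}
	return total
}

func main() {
	fmt.Println("variable size:", testing.AllocsPerRun(1000, func() { sumVariable(size) }))
	fmt.Println("constant size:", testing.AllocsPerRun(1000, func() { sumConstant() }))
}
```

Building the same file with go build -gcflags "-m -m" shows the compiler's reasoning for each make call.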

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op
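
Removing the bytes.Buffer allocation improved performance by about 29%. Removing all allocations improved it by about 33% relative to the original code. These numbers show that heap allocations affect a program's performance.

Conclusion

Go has some amazing tooling that lets us understand the decisions the compiler makes during escape analysis. With this information, we can refactor code so that values that can stay on the stack do not end up on the heap. You do not need to write programs with zero heap allocations, but you should avoid allocations where it is reasonable to do so.

Never write code with performance as your first priority, because you do not want to be guessing about performance. Write code for correctness first. That means focusing on integrity, readability, and simplicity. Once you have a working program, determine whether it is fast enough. If it is not, use the tooling Go provides to find and fix the performance problems.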
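The percentages follow directly from the ns/op figures in Listing 25. A small sketch of the arithmetic, where speedup is a hypothetical helper name:

```go
package main

import "fmt"

// speedup returns the percentage reduction in time per operation.
func speedup(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	fmt.Printf("remove bytes.Buffer allocation: %.1f%%\n", speedup(2570, 1814))
	fmt.Printf("remove all allocations:         %.1f%%\n", speedup(2570, 1720))
	fmt.Printf("second change on its own:       %.1f%%\n", speedup(1814, 1720))
}
```

Both headline figures are measured against the original 2570 ns/op: about 29.4% and 33.1%. Taken on its own, the second change accounts for roughly 5.2% on top of the first.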