[Original] OpenStack Swift Source Code Analysis (2): Generating the Ring Files

    Continuing from the previous source-code walkthrough, this post focuses on the swift-ring-builder file. Its most complex, and also most important, operation is the rebalance method, which regenerates the ring file after you modify the builder file (for example by adding or removing devices) so that partitions are distributed evenly across the system (naturally, after a rebalance the system's services need to be restarted). The consistent hashing algorithm and the concepts of replicas, zones, and weights are all realized through it.
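To ground the consistent-hashing side before diving into rebalance: once a ring is built, an object path is mapped to a partition by hashing it and keeping the top part_power bits. Below is a minimal sketch of that lookup; the part power of 16 is an assumed value for illustration, and the real Ring.get_part also mixes a configured hash-path suffix into the path before hashing, which is omitted here.

```python
import hashlib
import struct

# Hypothetical parameters: a real ring stores part_power in the builder file.
PART_POWER = 16
PART_SHIFT = 32 - PART_POWER

def object_to_partition(account, container, obj):
    """Map an object path to a partition roughly the way Ring.get_part does:
    md5 the path, take the first 4 bytes as a big-endian int, and keep the
    top PART_POWER bits."""
    path = '/%s/%s/%s' % (account, container, obj)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return struct.unpack('>I', digest[:4])[0] >> PART_SHIFT

part = object_to_partition('AUTH_test', 'photos', 'cat.jpg')
assert 0 <= part < 2 ** PART_POWER
```

The same path always hashes to the same partition, which is what lets the ring act as a stable lookup table between rebalances.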

Source snippet:

The rebalance method in swift-ring-builder:

def rebalance():
        """
swift-ring-builder <builder_file> rebalance
    Attempts to rebalance the ring by reassigning partitions that haven't been
    recently reassigned.
        """
        devs_changed = builder.devs_changed  # whether the builder's devs have changed; defaults to False, set to True by add_dev, set_dev_weight, and remove_dev
        try:
            last_balance = builder.get_balance()  # returns the ring's current balance (skew), e.g. 0.83
            parts, balance = builder.rebalance()  # the main rebalance call; returns the number of reassigned partitions and the new balance
        except exceptions.RingBuilderError, e:
            print '-' * 79
            print ("An error has occurred during ring validation. Common\n"
                   "causes of failure are rings that are empty or do not\n"
                   "have enough devices to accommodate the replica count.\n"
                   "Original exception message:\n %s" % e.message
                   )
            print '-' * 79
            exit(EXIT_ERROR)
        if not parts:
            print 'No partitions could be reassigned.'
            print 'Either none need to be or none can be due to ' \
                  'min_part_hours [%s].' % builder.min_part_hours
            exit(EXIT_WARNING)
        if not devs_changed and abs(last_balance - balance) < 1:
            print 'Cowardly refusing to save rebalance as it did not change ' \
                  'at least 1%.'
            exit(EXIT_WARNING)
        try:
            builder.validate()  # sanity check to catch bugs: ensures each partition is assigned to a real device, no replica is placed twice, etc.
        except exceptions.RingValidationError, e:
            print '-' * 79
            print ("An error has occurred during ring validation. Common\n"
                   "causes of failure are rings that are empty or do not\n"
                   "have enough devices to accommodate the replica count.\n"
                   "Original exception message:\n %s" % e.message
                   )
            print '-' * 79
            exit(EXIT_ERROR)
        print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
              (parts, 100.0 * parts / builder.parts, balance)  # print the rebalance summary
        status = EXIT_SUCCESS
        if balance > 5:  # a balance above 5 prompts a note to push the ring and wait min_part_hours before rebalancing again
            print '-' * 79
            print 'NOTE: Balance of %.02f indicates you should push this ' % \
                  balance
            print '      ring, wait at least %d hours, and rebalance/repush.' \
                  % builder.min_part_hours
            print '-' * 79
            status = EXIT_WARNING
        ts = time()  # timestamp used to name the backup files
        builder.get_ring().save(  # save a timestamped backup of the new ring file
            pathjoin(backup_dir, '%d.' % ts + basename(ring_file)))
        pickle.dump(builder.to_dict(), open(pathjoin(backup_dir,
            '%d.' % ts + basename(argv[1])), 'wb'), protocol=2)
        builder.get_ring().save(ring_file)
        pickle.dump(builder.to_dict(), open(argv[1], 'wb'), protocol=2)
        exit(status)

 

    I have added some comments of my own above to aid understanding. The real work is done by the rebalance method in builder.py.
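The min_part_hours that appears in the warnings above is enforced by tracking, for every partition, how many hours have passed since it last moved; partitions moved too recently are skipped when gathering reassignment candidates. Here is a self-contained sketch of that gate; MovesTracker and its method names are illustrative stand-ins, not Swift's actual _last_part_moves code, though the byte-per-partition, capped-at-255 bookkeeping mirrors it.

```python
import array
import time

class MovesTracker:
    """Toy model of the per-partition move-age bookkeeping."""

    def __init__(self, parts, min_part_hours):
        self.min_part_hours = min_part_hours
        # hours since each partition was last moved, one byte each, capped at 255
        self.last_part_moves = array.array('B', (255,) * parts)
        self.epoch = int(time.time())

    def update(self, now=None):
        """Advance every partition's age by the hours elapsed since epoch."""
        now = int(time.time()) if now is None else now
        elapsed_hours = (now - self.epoch) // 3600
        for part in range(len(self.last_part_moves)):
            self.last_part_moves[part] = \
                min(self.last_part_moves[part] + elapsed_hours, 255)
        self.epoch = now

    def can_move(self, part):
        # a partition may move only if it has sat still for min_part_hours
        return self.last_part_moves[part] >= self.min_part_hours

    def record_move(self, part):
        self.last_part_moves[part] = 0

tracker = MovesTracker(parts=4, min_part_hours=1)
tracker.record_move(0)
assert not tracker.can_move(0)   # just moved, must wait
assert tracker.can_move(1)       # never moved, free to go
```

This is why a freshly rebalanced ring refuses to shuffle the same partitions again right away: each move resets that partition's clock.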

 The rebalance method in builder.py:

def rebalance(self):
        """
        Rebalance the ring.

        This is the main work function of the builder, as it will assign and
        reassign partitions to devices in the ring based on weights, distinct
        zones, recent reassignments, etc.

        The process doesn't always perfectly assign partitions (that'd take a
        lot more analysis and therefore a lot more time -- I had code that did
        that before). Because of this, it keeps rebalancing until the device
        skew (number of partitions a device wants compared to what it has) gets
        below 1% or doesn't change by more than 1% (only happens with ring that
        can't be balanced no matter what -- like with 3 zones of differing
        weights with replicas set to 3).

        :returns: (number_of_partitions_altered, resulting_balance)
        """
        self._ring = None  # invalidate the cached ring on this instance
        if self._last_part_moves_epoch is None:
            self._initial_balance()  # first rebalance: perform some one-time initialization, then balance
            self.devs_changed = False
            return self.parts, self.get_balance()
        retval = 0
        self._update_last_part_moves()  # update how many hours ago each partition was moved
        last_balance = 0
        while True:
            reassign_parts = self._gather_reassign_parts()  # returns the list of (part, replicas) pairs that need reassignment
            self._reassign_parts(reassign_parts)  # the actual reassignment work
            retval += len(reassign_parts)
            while self._remove_devs:
                self.devs[self._remove_devs.pop()['id']] = None  # drop devices marked for removal
            balance = self.get_balance()  # get the new balance
            if balance < 1 or abs(last_balance - balance) < 1 or \
                    retval == self.parts:
                break
            last_balance = balance
        self.devs_changed = False
        self.version += 1
        return retval, balance

    The program chooses its execution path based on whether _last_part_moves_epoch is None. If it is None (meaning this is the first rebalance), it calls _initial_balance() and returns the result; that method performs largely the same steps as the non-None path, plus some one-time initialization. The actual reassignment work is carried out by the _reassign_parts method.
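The get_balance() used in the loop above measures skew as the worst percentage deviation of any device's partition count from its weight-proportional share. A minimal sketch of that calculation, assuming every device has a non-zero weight (the real method also special-cases zero-weight devices):

```python
def get_balance(devs, parts, replicas):
    """Worst-case percentage deviation of any device from its fair share."""
    total_weight = sum(d['weight'] for d in devs)
    parts_per_weight = parts * replicas / total_weight
    balance = 0.0
    for dev in devs:
        desired = dev['weight'] * parts_per_weight   # fair share for this device
        dev_balance = abs(100.0 * dev['parts'] / desired - 100.0)
        balance = max(balance, dev_balance)
    return balance

devs = [{'weight': 1.0, 'parts': 100},
        {'weight': 1.0, 'parts': 92}]
# 64 parts * 3 replicas = 192 assignments over weight 2.0 -> 96 desired each;
# each device is off by 4/96, about 4.17
print('%.2f' % get_balance(devs, parts=64, replicas=3))  # prints 4.17
```

This explains the loop's exit condition: once no device deviates by a full percent, or an iteration stops improving the number by a percent, further shuffling is pointless.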

 The _reassign_parts method in builder.py, which does the actual partition placement:

def _reassign_parts(self, reassign_parts):
        """
        For an existing ring data set, partitions are reassigned similarly to
        the initial assignment. The devices are ordered by how many partitions
        they still want and kept in that order throughout the process. The
        gathered partitions are iterated through, assigning them to devices
        according to the "most wanted" while keeping the replicas as "far
        apart" as possible. Two different zones are considered the
        farthest-apart things, followed by different ip/port pairs within a
        zone; the least-far-apart things are different devices with the same
        ip/port pair in the same zone.

        If you want more replicas than devices, you won't get all your
        replicas.

        :param reassign_parts: An iterable of (part, replicas_to_replace)
                               pairs. replicas_to_replace is an iterable of the
                               replica (an int) to replace for that partition.
                               replicas_to_replace may be shared for multiple
                               partitions, so be sure you do not modify it.
        """
        for dev in self._iter_devs():
            dev['sort_key'] = self._sort_key_for(dev)  # compute a sort_key for every device
        # iterate over the usable devices, sorted by sort_key
        available_devs = \
            sorted((d for d in self._iter_devs() if d['weight']),
                   key=lambda x: x['sort_key'])

        tier2children = build_tier_tree(available_devs)  # build the tier tree of devices

        tier2devs = defaultdict(list)  # devices per tier
        tier2sort_key = defaultdict(list)  # sort keys per tier
        tiers_by_depth = defaultdict(set)  # tiers grouped by depth
        for dev in available_devs:  # index each device under every tier it belongs to
            for tier in tiers_for_dev(dev):
                tier2devs[tier].append(dev)  # <-- starts out sorted!
                tier2sort_key[tier].append(dev['sort_key'])
                tiers_by_depth[len(tier)].add(tier)

        for part, replace_replicas in reassign_parts:
            # Gather up what other tiers (zones, ip_ports, and devices) the
            # replicas not-to-be-moved are in for this part.
            other_replicas = defaultdict(lambda: 0)  # replica counts per tier (zone, ip:port, device)
            for replica in xrange(self.replicas):
                if replica not in replace_replicas:
                    dev = self.devs[self._replica2part2dev[replica][part]]
                    for tier in tiers_for_dev(dev):
                        other_replicas[tier] += 1  # count the replicas that are staying put

            def find_home_for_replica(tier=(), depth=1):
                # Order the tiers by how many replicas of this
                # partition they already have. Then, of the ones
                # with the smallest number of replicas, pick the
                # tier with the hungriest drive and then continue
                # searching in that subtree.
                #
                # There are other strategies we could use here,
                # such as hungriest-tier (i.e. biggest
                # sum-of-parts-wanted) or picking one at random.
                # However, hungriest-drive is what was used here
                # before, and it worked pretty well in practice.
                #
                # Note that this allocator will balance things as
                # evenly as possible at each level of the device
                # layout. If your layout is extremely unbalanced,
                # this may produce poor results.
                candidate_tiers = tier2children[tier]  # walk down level by level toward the tier with the fewest replicas of this part
                min_count = min(other_replicas[t] for t in candidate_tiers)
                candidate_tiers = [t for t in candidate_tiers
                                   if other_replicas[t] == min_count]
                candidate_tiers.sort(
                    key=lambda t: tier2sort_key[t][-1])

                if depth == max(tiers_by_depth.keys()):
                    return tier2devs[candidate_tiers[-1]][-1]

                return find_home_for_replica(tier=candidate_tiers[-1],
                                             depth=depth + 1)

            for replica in replace_replicas:  # place each replica that needs a new home and update the bookkeeping
                dev = find_home_for_replica()
                dev['parts_wanted'] -= 1
                dev['parts'] += 1
                old_sort_key = dev['sort_key']
                new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
                for tier in tiers_for_dev(dev):
                    other_replicas[tier] += 1

                    index = bisect.bisect_left(tier2sort_key[tier],
                                               old_sort_key)
                    tier2devs[tier].pop(index)
                    tier2sort_key[tier].pop(index)

                    new_index = bisect.bisect_left(tier2sort_key[tier],
                                                   new_sort_key)
                    tier2devs[tier].insert(new_index, dev)
                    tier2sort_key[tier].insert(new_index, new_sort_key)

                self._replica2part2dev[replica][part] = dev['id']  # this replica of the part now lives on dev['id']

        # Just to save memory and keep from accidental reuse.
        for dev in self._iter_devs():
            del dev['sort_key']

This function implements the reassignment. The key concept is the three-level tier structure: utils.py will, for a single dev or a collection of devs, return a dictionary organized into three tiers.
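The per-device tiers can be sketched as follows. This mirrors what tiers_for_dev in swift/common/ring/utils.py returned in this era of Swift (zone, then zone + ip:port, then zone + ip:port + device id), though the exact field handling is simplified here:

```python
def tiers_for_dev(dev):
    """Return the three nested tiers a device belongs to:
    (zone,), (zone, ip:port), (zone, ip:port, device id)."""
    t1 = dev['zone']
    t2 = '%s:%s' % (dev['ip'], dev['port'])
    t3 = dev['id']
    return ((t1,), (t1, t2), (t1, t2, t3))

dev = {'id': 0, 'zone': 1, 'ip': '192.168.1.1', 'port': 6000}
print(tiers_for_dev(dev))
# ((1,), (1, '192.168.1.1:6000'), (1, '192.168.1.1:6000', 0))
```

Every tier a device belongs to is a prefix of its deepest tier, which is what lets _reassign_parts keep replica counts per zone, per ip:port, and per device with a single other_replicas dictionary.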

The source gives us an example:

  Example:

    zone 1 -+---- 192.168.1.1:6000 -+---- device id 0
            |                       |
            |                       +---- device id 1
            |                       |
            |                       +---- device id 2
            |
            +---- 192.168.1.2:6000 -+---- device id 3
                                    |
                                    +---- device id 4
                                    |
                                    +---- device id 5

    zone 2 -+---- 192.168.2.1:6000 -+---- device id 6
            |                       |
            |                       +---- device id 7
            |                       |
            |                       +---- device id 8
            |
            +---- 192.168.2.2:6000 -+---- device id 9
                                    |
                                    +---- device id 10
                                    |
                                    +---- device id 11

    The tier tree would look like:

    {
      (): [(1,), (2,)],

      (1,): [(1, 192.168.1.1:6000),
             (1, 192.168.1.2:6000)],

      (2,): [(2, 192.168.2.1:6000),
             (2, 192.168.2.2:6000)],

      (1, 192.168.1.1:6000): [(1, 192.168.1.1:6000, 0),
                              (1, 192.168.1.1:6000, 1),
                              (1, 192.168.1.1:6000, 2)],

      (1, 192.168.1.2:6000): [(1, 192.168.1.2:6000, 3),
                              (1, 192.168.1.2:6000, 4),
                              (1, 192.168.1.2:6000, 5)],

      (2, 192.168.2.1:6000): [(2, 192.168.2.1:6000, 6),
                              (2, 192.168.2.1:6000, 7),
                              (2, 192.168.2.1:6000, 8)],

      (2, 192.168.2.2:6000): [(2, 192.168.2.2:6000, 9),
                              (2, 192.168.2.2:6000, 10),
                              (2, 192.168.2.2:6000, 11)],
    }


The devices are split into three tiers by zone, ip_port, and device_id; subsequent operations then work tier by tier over this structure (this is how the zone and replica-placement concepts are realized).
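The tier tree above can be reproduced with a short sketch of build_tier_tree. This is a simplified reimplementation for illustration, not the utils.py code itself, but it produces the same parent-to-children mapping with the empty tuple as the root:

```python
from collections import defaultdict

def tiers_for_dev(dev):
    # (zone,), (zone, ip:port), (zone, ip:port, device id)
    t2 = '%s:%s' % (dev['ip'], dev['port'])
    return ((dev['zone'],), (dev['zone'], t2), (dev['zone'], t2, dev['id']))

def build_tier_tree(devices):
    """Map every tier to the set of tiers one level below it."""
    tier2children = defaultdict(set)
    for dev in devices:
        t1, t2, t3 = tiers_for_dev(dev)
        tier2children[()].add(t1)   # root lists the zones
        tier2children[t1].add(t2)   # each zone lists its ip:ports
        tier2children[t2].add(t3)   # each ip:port lists its devices
    return tier2children

# two zones, two devices behind each ip:port
devs = [{'id': i, 'zone': 1 + i // 2, 'ip': '192.168.%d.1' % (1 + i // 2),
         'port': 6000} for i in range(4)]
tree = build_tier_tree(devs)
assert tree[()] == {(1,), (2,)}
assert tree[(1,)] == {(1, '192.168.1.1:6000')}
```

find_home_for_replica then just descends this mapping: at each depth it restricts itself to the children with the fewest existing replicas of the partition, which is what keeps replicas in different zones first and different ip:ports second.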


With that, a ring rebalance is complete. Finally the new builder file and ring file are saved; the ring file is saved from the generated builder via a method of the RingData class, which is fairly simple and not analyzed here.
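The persistence step boils down to pickling the builder's to_dict() output with protocol 2, once under a timestamped name in the backup directory and once at the target path, mirroring the pickle.dump calls in the rebalance code above. A minimal sketch with a toy stand-in for the builder state (the field names are illustrative):

```python
import pickle
import tempfile
from os.path import join

# Toy stand-in for builder.to_dict(); a real builder dict carries much more.
builder_state = {'parts': 262144, 'replicas': 3, 'devs': []}

backup_dir = tempfile.mkdtemp()
path = join(backup_dir, 'object.builder')

# protocol=2 matches the calls in the rebalance code and keeps the file
# readable by the Python 2 interpreters Swift ran on at the time
with open(path, 'wb') as fp:
    pickle.dump(builder_state, fp, protocol=2)

# round-trip: loading the file gives back the identical dict
with open(path, 'rb') as fp:
    assert pickle.load(fp) == builder_state
```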


This covers, in broad strokes, swift-ring-builder and the files under /swift/common/ring/; for the specifics of each function, consult the source. In the next article I will analyze swift-init, using the start method to explain the service startup flow.
