上一遍源码分析,关注swift-ring-bin文件,其中最为复杂,也是最为重要操做要数rebalance方法了,它是用来从新生成ring文件,再你修改builder文件后(例如增减设备)使系统中的partition分布平衡(固然,在rebalance后,须要从新启动系统的各个服务)。其中一致性的哈希算法,副本的概念,zone的概念,weight的概念都是经过它来实现的。python
源码片断:算法
swift-ring-builder rebalance方法。 swift
def rebalance(): """ swift-ring-builder <builder_file> rebalance Attempts to rebalance the ring by reassigning partitions that haven't been recently reassigned. """ devs_changed = builder.devs_changed #devs_changed表明builder中的devs是否改变,默认是Flase,当调用add_dev,set_dev_weight,remove_dev,会把devs_changed设置为True。 try: last_balance = builder.get_balance()#调用builder.get_balance方法,返回ring的banlance 也就是平衡度 例如0.83%。 parts, balance = builder.rebalance()#主要的重平衡方法,返回从新分配的partition的数目和新的balance。 except exceptions.RingBuilderError, e: print '-' * 79 print ("An error has occurred during ring validation. Common\n" "causes of failure are rings that are empty or do not\n" "have enough devices to accommodate the replica count.\n" "Original exception message:\n %s" % e.message ) print '-' * 79 exit(EXIT_ERROR) if not parts: print 'No partitions could be reassigned.' print 'Either none need to be or none can be due to ' \ 'min_part_hours [%s].' % builder.min_part_hours exit(EXIT_WARNING) if not devs_changed and abs(last_balance - balance) < 1: print 'Cowardly refusing to save rebalance as it did not change ' \ 'at least 1%.' exit(EXIT_WARNING) try: builder.validate()#安全功能方法,捕捉bugs,确保partition发配到真正的device上,不被分配两次等等一些功能。 except exceptions.RingValidationError, e: print '-' * 79 print ("An error has occurred during ring validation. Common\n" "causes of failure are rings that are empty or do not\n" "have enough devices to accommodate the replica count.\n" "Original exception message:\n %s" % e.message ) print '-' * 79 exit(EXIT_ERROR) print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \ (parts, 100.0 * parts / builder.parts, balance)#打印rebalance结果 status = EXIT_SUCCESS if balance > 5: #balnce大于5会提示,最小的系统平衡时间。 print '-' * 79 print 'NOTE: Balance of %.02f indicates you should push this ' % \ balance print ' ring, wait at least %d hours, and rebalance/repush.' \ % builder.min_part_hours print '-' * 79 status = EXIT_WARNING ts = time()#截取时间。 builder.get_ring().save( #保存新生成的builder ring文件 pathjoin(backup_dir, '%d.' % ts + basename(ring_file))) pickle.dump(builder.to_dict(), open(pathjoin(backup_dir, '%d.' % ts + basename(argv[1])), 'wb'), protocol=2) builder.get_ring().save(ring_file) pickle.dump(builder.to_dict(), open(argv[1], 'wb'), protocol=2) exit(status)
其中我加入了一些本身的注释,方便理解。其实是调用了builder.py中的rebalance方法。安全
builder.py 中的rebalance方法:app
def rebalance(self): """ Rebalance the ring. This is the main work function of the builder, as it will assign and reassign partitions to devices in the ring based on weights, distinct zones, recent reassignments, etc. The process doesn't always perfectly assign partitions (that'd take a lot more analysis and therefore a lot more time -- I had code that did that before). Because of this, it keeps rebalancing until the device skew (number of partitions a device wants compared to what it has) gets below 1% or doesn't change by more than 1% (only happens with ring that can't be balanced no matter what -- like with 3 zones of differing weights with replicas set to 3). :returns: (number_of_partitions_altered, resulting_balance) """ self._ring = None #令实例中的ring为空 if self._last_part_moves_epoch is None: self._initial_balance() #增长一些初始化设置的balance方法, self.devs_changed = False return self.parts, self.get_balance() retval = 0 self._update_last_part_moves()#更新part moved时间。 last_balance = 0 while True: reassign_parts = self._gather_reassign_parts()#返回一个list(part,replica)对,须要从新分配。 self._reassign_parts(reassign_parts) #从新分配的实际动做 retval += len(reassign_parts) while self._remove_devs: self.devs[self._remove_devs.pop()['id']] = None #删除相应的dev balance = self.get_balance()#获取新的平衡比 if balance < 1 or abs(last_balance - balance) < 1 or \ retval == self.parts: break last_balance = balance self.devs_changed = False self.version += 1 return retval, balance
程序会根据_last_part_moves_epoch是否为None来决定,程序执行的路线。若是为None(说明是第一次rebalance),程序会调用_initial_balance()方法,而后返回结果,其实它的操做跟_last_part_moves_epoch不为None时,进行的操做大致相同,只是_initial_balance会作一些初始化的操做。而真正执行rebalance操做动做的是_reassign_parts方法。dom
builder.py中的_reassign_parts分配part的动做方法。ide
def _reassign_parts(self, reassign_parts): """ For an existing ring data set, partitions are reassigned similarly to the initial assignment. The devices are ordered by how many partitions they still want and kept in that order throughout the process. The gathered partitions are iterated through, assigning them to devices according to the "most wanted" while keeping the replicas as "far apart" as possible. Two different zones are considered the farthest-apart things, followed by different ip/port pairs within a zone; the least-far-apart things are different devices with the same ip/port pair in the same zone. If you want more replicas than devices, you won't get all your replicas. :param reassign_parts: An iterable of (part, replicas_to_replace) pairs. replicas_to_replace is an iterable of the replica (an int) to replace for that partition. replicas_to_replace may be shared for multiple partitions, so be sure you do not modify it. """ for dev in self._iter_devs(): dev['sort_key'] = self._sort_key_for(dev)#设置每个dev的sort_key available_devs = \ #迭代出可用的devs根据sort_key排序 sorted((d for d in self._iter_devs() if d['weight']), key=lambda x: x['sort_key']) tier2children = build_tier_tree(available_devs)#生产层结构devs tier2devs = defaultdict(list)#devs层 tier2sort_key = defaultdict(list)#sort_key层 tiers_by_depth = defaultdict(set)#深度层 for dev in available_devs:#安装不一样方式分类排序。 for tier in tiers_for_dev(dev): tier2devs[tier].append(dev) # <-- starts out sorted! tier2sort_key[tier].append(dev['sort_key']) tiers_by_depth[len(tier)].add(tier) for part, replace_replicas in reassign_parts: # Gather up what other tiers (zones, ip_ports, and devices) the # replicas not-to-be-moved are in for this part. other_replicas = defaultdict(lambda: 0)#不一样的zone ip_port device_id标识 for replica in xrange(self.replicas): if replica not in replace_replicas: dev = self.devs[self._replica2part2dev[replica][part]] for tier in tiers_for_dev(dev): other_replicas[tier] += 1#不须要从新分配的会被+1 def find_home_for_replica(tier=(), depth=1): # Order the tiers by how many replicas of this # partition they already have. Then, of the ones # with the smallest number of replicas, pick the # tier with the hungriest drive and then continue # searching in that subtree. # # There are other strategies we could use here, # such as hungriest-tier (i.e. biggest # sum-of-parts-wanted) or picking one at random. # However, hungriest-drive is what was used here # before, and it worked pretty well in practice. # # Note that this allocator will balance things as # evenly as possible at each level of the device # layout. If your layout is extremely unbalanced, # this may produce poor results. candidate_tiers = tier2children[tier]#逐层的找最少的part min_count = min(other_replicas[t] for t in candidate_tiers) candidate_tiers = [t for t in candidate_tiers if other_replicas[t] == min_count] candidate_tiers.sort( key=lambda t: tier2sort_key[t][-1]) if depth == max(tiers_by_depth.keys()): return tier2devs[candidate_tiers[-1]][-1] return find_home_for_replica(tier=candidate_tiers[-1], depth=depth + 1) for replica in replace_replicas:#对于要分配的dev作相应的处理 dev = find_home_for_replica() dev['parts_wanted'] -= 1 dev['parts'] += 1 old_sort_key = dev['sort_key'] new_sort_key = dev['sort_key'] = self._sort_key_for(dev) for tier in tiers_for_dev(dev): other_replicas[tier] += 1 index = bisect.bisect_left(tier2sort_key[tier], old_sort_key) tier2devs[tier].pop(index) tier2sort_key[tier].pop(index) new_index = bisect.bisect_left(tier2sort_key[tier], new_sort_key) tier2devs[tier].insert(new_index, dev) tier2sort_key[tier].insert(new_index, new_sort_key) self._replica2part2dev[replica][part] = dev['id']#某个part的某个replica分配到dev['id'] # Just to save memory and keep from accidental reuse. for dev in self._iter_devs(): del dev['sort_key']
这个函数实现了从新分配的功能,其中重要的概念是三层结构,也就是utrls.py文件,会针对一个dev 或者一个devs,返回三层结构的字典。函数
源码中给咱们举了一个例子:源码分析
Example:ui
zone 1 -+---- 192.168.1.1:6000 -+---- device id 0
| |
| +---- device id 1
| |
| +---- device id 2
|
+---- 192.168.1.2:6000 -+---- device id 3
|
+---- device id 4
|
+---- device id 5
zone 2 -+---- 192.168.2.1:6000 -+---- device id 6
| |
| +---- device id 7
| |
| +---- device id 8
|
+---- 192.168.2.2:6000 -+---- device id 9
|
+---- device id 10
|
+---- device id 11
The tier tree would look like:
{
(): [(1,), (2,)],
(1,): [(1, 192.168.1.1:6000),
(1, 192.168.1.2:6000)],
(2,): [(1, 192.168.2.1:6000),
(1, 192.168.2.2:6000)],
(1, 192.168.1.1:6000): [(1, 192.168.1.1:6000, 0),
(1, 192.168.1.1:6000, 1),
(1, 192.168.1.1:6000, 2)],
(1, 192.168.1.2:6000): [(1, 192.168.1.2:6000, 3),
(1, 192.168.1.2:6000, 4),
(1, 192.168.1.2:6000, 5)],
(2, 192.168.2.1:6000): [(1, 192.168.2.1:6000, 6),
(1, 192.168.2.1:6000, 7),
(1, 192.168.2.1:6000, 8)],
(2, 192.168.2.2:6000): [(1, 192.168.2.2:6000, 9),
(1, 192.168.2.2:6000, 10),
(1, 192.168.2.2:6000, 11)],
}
经过zone,ip_port,device_id 分红三层,以后的操做会根据层次,进行相关的操做(这其中就实现了zone,副本等概念)。
这样一个ring rebalance操做就作好了,最后会保存新的 builder文件,和ring文件,ring文件时根据生产的builder文件调用了RingData类中的方法保存的比较简单,这里不作分析。
这样大致上就分析了swift-ring-builder, /swift/common/ring/下的文件,其中具体的函数具体的功能与实现,能够查看源码。下一篇文章我会分析一下swift-init,用经过start方法来讲明服务启动的流程。