"OpenStack VM Disk File Types and Storage Methods"
"Libvirt Live Migration and Pre-Copy Implementation Principles"
"OpenStack VM Cold/Live Migration: Practice and Flow Analysis"
With the groundwork laid by the articles above, we finally arrive at the code. Walking through the implementation will take us to the essence of how OpenStack migrates virtual machines.
NOTE: block_device_info does not hold only OpenStack block device (Volume) information; it holds the virtual machine's block device information, i.e. its disks, both image-backed and volume-backed. Missing this distinction makes the code easy to misread.
```
MariaDB [nova]> select device_name,destination_type,device_type,source_type,image_id
    -> from block_device_mapping where instance_uuid="1935fcf7-ba9b-437c-a7d3-5d54c6d0d6d3";
+-------------+------------------+-------------+-------------+--------------------------------------+
| device_name | destination_type | device_type | source_type | image_id                             |
+-------------+------------------+-------------+-------------+--------------------------------------+
| /dev/vda    | local            | disk        | image       | 0aff2888-47f8-4133-928a-9c54414b3afb |
+-------------+------------------+-------------+-------------+--------------------------------------+
```
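The note above can be made concrete: whether an instance is "booted from volume" can be read off the root device's `source_type` in its block-device mappings. Below is an illustrative re-implementation, not Nova's code (the real `_is_booted_from_volume` works on the assembled `block_device_info` structure and the root device's boot index; here we key on the device name for simplicity):

```python
def is_booted_from_volume(bdms):
    """Return True if the root device comes from a Cinder volume.

    Each mapping is a dict shaped like a row of nova's
    block_device_mapping table, e.g.
    {'device_name': '/dev/vda', 'source_type': 'image', ...}
    """
    for bdm in bdms:
        if bdm.get('device_name') == '/dev/vda':
            return bdm.get('source_type') == 'volume'
    return False

# The row shown above: /dev/vda is image-backed, so this instance
# carries a local root disk that a block migration must copy.
bdms = [{'device_name': '/dev/vda', 'destination_type': 'local',
         'device_type': 'disk', 'source_type': 'image',
         'image_id': '0aff2888-47f8-4133-928a-9c54414b3afb'}]
print(is_booted_from_volume(bdms))  # False
```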
```python
# nova/nova/api/openstack/compute/migrate_server.py
def _migrate(self, req, id, body):
    """Permit admins to migrate a server to a new host."""
    ...
    # Check whether the user is authorized to perform the migrate operation
    context.can(ms_policies.POLICY_ROOT % 'migrate')
    # Fetch the instance resource model object
    instance = common.get_instance(self.compute_api, context, id)
    try:
        # What actually gets called is the instance Resize API
        self.compute_api.resize(req.environ['nova.context'], instance)
    ...

# nova/nova/compute/api.py
def resize(self, context, instance, flavor_id=None, clean_shutdown=True,
           **extra_instance_updates):
    """Resize (ie, migrate) a running instance.

    If flavor_id is None, the process is considered a migration, keeping
    the original flavor_id. If flavor_id is not None, the instance should
    be migrated to a new host and resized to the new flavor_id.
    """
    # As the docstring says, whether this is a Migrate or a Resize
    # depends on whether a new Flavor was passed in
    ...
    # Get the instance's current Flavor
    current_instance_type = instance.get_flavor()
    # If flavor_id is not provided, only migrate the instance.
    if not flavor_id:
        LOG.debug("flavor_id is None. Assuming migration.",
                  instance=instance)
        # Guarantee the Flavor does not change across the migration
        new_instance_type = current_instance_type
    ...
    filter_properties = {'ignore_hosts': []}
    # The allow_resize_to_same_host option decides whether resizing to
    # the same compute node is allowed.
    # In practice, when migrating to the same compute node, nova-compute
    # raises UnableToMigrateToSelf and the scheduler keeps retrying until
    # it finds a suitable node or gives up, provided nova-scheduler has
    # RetryFilter enabled.
    if not CONF.allow_resize_to_same_host:
        filter_properties['ignore_hosts'].append(instance.host)
    ...
    scheduler_hint = {'filter_properties': filter_properties}
    self.compute_task_api.resize_instance(
        context, instance,
        extra_instance_updates, scheduler_hint=scheduler_hint,
        flavor=new_instance_type,
        reservations=quotas.reservations or [],
        clean_shutdown=clean_shutdown, request_spec=request_spec)

# nova/compute/manager.py
def resize_instance(self, context, instance, image,
                    reservations, migration, instance_type,
                    clean_shutdown):
    """Starts the migration of a running instance to another host."""
    ...
    # Get the instance's network info
    network_info = self.network_api.get_instance_nw_info(context, instance)
    ...
    # Get the instance's disk info
    bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
        context, instance.uuid)
    block_device_info = self._get_instance_block_device_info(
        context, instance, bdms=bdms)
    # Get the power-off timeout and retry interval
    timeout, retry_interval = self._get_power_off_values(
        context, instance, clean_shutdown)
    # Power off the instance and migrate its disk files
    disk_info = self.driver.migrate_disk_and_power_off(
        context, instance, migration.dest_host,
        instance_type, network_info,
        block_device_info,
        timeout, retry_interval)
    # Disconnect the instance's shared block devices
    self._terminate_volume_connections(context, instance, bdms)
    # Migrate the instance's network
    migration_p = obj_base.obj_to_primitive(migration)
    self.network_api.migrate_instance_start(context, instance, migration_p)
    ...
    # Update the instance's host record
    instance.host = migration.dest_compute
    instance.node = migration.dest_node
    instance.task_state = task_states.RESIZE_MIGRATED
    instance.save(expected_task_state=task_states.RESIZE_MIGRATING)
    ...

# nova/nova/virt/libvirt/driver.py
def migrate_disk_and_power_off(self, context, instance, dest,
                               flavor, network_info,
                               block_device_info=None,
                               timeout=0, retry_interval=0):
    # Get ephemeral disk info
    ephemerals = driver.block_device_info_get_ephemerals(block_device_info)

    # Checks if the migration needs a disk resize down.
    root_down = flavor.root_gb < instance.flavor.root_gb
    ephemeral_down = flavor.ephemeral_gb < eph_size
    # Check whether the instance is booted from a volume
    booted_from_volume = self._is_booted_from_volume(block_device_info)
    # Local disk files cannot be resized down
    if (root_down and not booted_from_volume) or ephemeral_down:
        reason = _("Unable to resize disk down.")
        raise exception.InstanceFaultRollback(
            exception.ResizeError(reason=reason))

    # Instances on a Cinder LVM backend that are not booted from a
    # volume cannot be migrated
    # NOTE(dgenin): Migration is not implemented for LVM backed instances.
    if CONF.libvirt.images_type == 'lvm' and not booted_from_volume:
        reason = _("Migration is not supported for LVM backed instances")
        raise exception.InstanceFaultRollback(
            exception.MigrationPreCheckError(reason=reason))

    # copy disks to destination
    # rename instance dir to +_resize at first for using
    # shared storage for instance dir (eg. NFS).
    inst_base = libvirt_utils.get_instance_path(instance)
    inst_base_resize = inst_base + "_resize"
    # Determine whether the storage is shared
    shared_storage = self._is_storage_shared_with(dest, inst_base)

    # try to create the directory on the remote compute node
    # if this fails we pass the exception up the stack so we can catch
    # failures here earlier
    if not shared_storage:
        try:
            # Non-shared storage: create the instance directory on the
            # destination host over SSH
            self._remotefs.create_dir(dest, inst_base)
        except processutils.ProcessExecutionError as e:
            reason = _("not able to execute ssh command: %s") % e
            raise exception.InstanceFaultRollback(
                exception.ResizeError(reason=reason))

    # Power off the instance
    self.power_off(instance, timeout, retry_interval)

    # Disconnect shared block devices
    block_device_mapping = driver.block_device_info_get_mapping(
        block_device_info)
    for vol in block_device_mapping:
        connection_info = vol['connection_info']
        disk_dev = vol['mount_device'].rpartition("/")[2]
        self._disconnect_volume(connection_info, disk_dev, instance)

    # Read the disk.info file, which records the file paths of the
    # Root Disk, Ephemeral Disk and Swap Disk
    disk_info_text = self.get_instance_disk_info(
        instance, block_device_info=block_device_info)
    disk_info = jsonutils.loads(disk_info_text)

    try:
        # Pre-delete (rename) the instance directory
        utils.execute('mv', inst_base, inst_base_resize)
        # if we are migrating the instance with shared storage then
        # create the directory. If it is a remote node the directory
        # has already been created
        if shared_storage:
            # Shared storage: treat the destination host as ourselves
            dest = None
            # Shared storage: create the instance directory directly on
            # the local filesystem
            utils.execute('mkdir', '-p', inst_base)
        ...
        active_flavor = instance.get_flavor()
        # Block-migrate the instance's local disk files
        for info in disk_info:
            # assume inst_base == dirname(info['path'])
            img_path = info['path']
            fname = os.path.basename(img_path)
            from_path = os.path.join(inst_base_resize, fname)
            ...
            # We will not copy over the swap disk here, and rely on
            # finish_migration/_create_image to re-create it for us.
            if not (fname == 'disk.swap' and
                    active_flavor.get('swap', 0) != flavor.get('swap', 0)):
                # Whether compression is enabled
                compression = info['type'] not in NO_COMPRESSION_TYPES
                # Non-shared storage: remote copy via scp
                # Shared storage: local copy via cp
                libvirt_utils.copy_image(from_path, img_path, host=dest,
                                         on_execute=on_execute,
                                         on_completion=on_completion,
                                         compression=compression)

        # Ensure disk.info is written to the new path to avoid disks being
        # reinspected and potentially changing format.
        # Copy the disk.info file
        src_disk_info_path = os.path.join(inst_base_resize, 'disk.info')
        if os.path.exists(src_disk_info_path):
            dst_disk_info_path = os.path.join(inst_base, 'disk.info')
            libvirt_utils.copy_image(src_disk_info_path,
                                     dst_disk_info_path,
                                     host=dest,
                                     on_execute=on_execute,
                                     on_completion=on_completion)
    except Exception:
        with excutils.save_and_reraise_exception():
            self._cleanup_remote_migration(dest, inst_base,
                                           inst_base_resize,
                                           shared_storage)

    return disk_info_text
```
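The cp-versus-scp decision inside `libvirt_utils.copy_image` can be boiled down to: no remote host means the destination path is visible on the local (shared) filesystem, so a plain `cp` suffices; a remote host means the disk file must be pushed over SSH. The sketch below only builds the command line and is a simplification (the real helper also supports rsync and optional compression):

```python
def build_copy_command(src, dest, host=None):
    """Mimic the cp-vs-scp decision described above (illustrative only)."""
    if host is None:
        # Shared storage: the destination path is visible locally.
        return ['cp', src, dest]
    # Non-shared storage: push the disk file to the remote node over SSH.
    return ['scp', src, '%s:%s' % (host, dest)]

print(build_copy_command('/var/lib/nova/instances/uuid_resize/disk',
                         '/var/lib/nova/instances/uuid/disk'))
print(build_copy_command('/var/lib/nova/instances/uuid_resize/disk',
                         '/var/lib/nova/instances/uuid/disk',
                         host='compute-1'))
```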
In "Libvirt Live Migration and Pre-Copy Implementation Principles" we covered how Libvirt live migration and KVM pre-copy live migration work. In short, the process divides into 3 stages:

1. Stage 1: copy all of the guest's RAM to the destination once.
2. Stage 2: iteratively copy the pages dirtied during the previous round, until an exit condition is met.
3. Stage 3: pause the guest, transfer the remaining dirty pages and device state, then resume it on the destination.
As you might expect, the most critical stage is Stage 2, i.e. the implementation of the exit condition, for which early Libvirt shipped several built-in heuristics.
The exit condition Nova chose is a dynamically configured max downtime. In each iteration of a Libvirt pre-copy live migration, libvirt recomputes the guest's newly dirtied memory and the time the iteration took in order to estimate the available bandwidth; from the bandwidth and the current iteration's dirty-page count it then estimates how long transferring the remaining data would take. That estimate is the downtime. If it falls within the administrator-configured live migration max downtime, iteration stops and Stage 3 begins.
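The exit test above reduces to simple arithmetic: estimated downtime ≈ remaining dirty data / observed bandwidth, and pre-copy may stop iterating once that estimate fits the configured budget. A back-of-the-envelope sketch (the numbers are invented for illustration):

```python
def estimated_downtime_ms(dirty_bytes, bandwidth_bytes_per_s):
    """Time needed to push the remaining dirty pages in one final pause."""
    return dirty_bytes / bandwidth_bytes_per_s * 1000.0

def should_stop_iterating(dirty_bytes, bandwidth_bytes_per_s, max_downtime_ms):
    """Pre-copy exit condition: the final pause fits the downtime budget."""
    return estimated_downtime_ms(dirty_bytes, bandwidth_bytes_per_s) <= max_downtime_ms

# 40 MiB of dirty pages over a ~1 GiB/s link needs ~39 ms of downtime,
# which fits a 500 ms budget; 4 GiB of dirty pages clearly does not.
print(should_stop_iterating(40 * 2**20, 2**30, 500))  # True
print(should_stop_iterating(4 * 2**30, 2**30, 500))   # False
```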
NOTE: Live Migration Max Downtime (the maximum live-migration downtime, in ms) is the tolerable duration for which the VM's data may stay static, i.e. the acceptable business-interruption window, usually small enough to ignore. It is set via the nova.conf option CONF.libvirt.live_migration_downtime.
Note that the dynamically configured downtime exit condition has a problem: if the VM stays busy and keeps producing new dirty memory, every iteration has a large amount of data to move, and the downtime estimate may never fall into the exit range. So be prepared: a live migration can be a long-running process. To address this, Libvirt introduced some newer mechanisms, Post-Copy among them.
Besides Pre-Copy mode, Libvirt also supports Post-Copy mode. The former requires all data to be copied before the VM switches over to the destination host; Post-Copy, by contrast, prioritizes switching to the destination as early as possible and copies memory afterwards. Post-Copy first transfers the VM's device state and a small portion (~10%) of the dirty memory to the destination, then switches the VM to run there. When the GuestOS touches a memory page that is not yet present, a remote page fault is triggered, which pulls that page from the source host. Obviously, Post-Copy has its own problems: if either host crashes, fails, or loses network connectivity, the whole VM is broken. Post-Copy is therefore not the recommended live-migration mode for critical workloads; whether it may be used is controlled by the nova.conf option live_migration_permit_post_copy.
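Nova decides when to flip a running pre-copy job into post-copy in `should_switch_to_postcopy` (in `nova/virt/libvirt/migration.py`); the core signal is that the remaining-data counter has stopped shrinking after the first full memory pass. The sketch below is a simplified restatement of that condition, not the exact upstream code:

```python
def should_switch_to_postcopy(memory_iteration, current_data_remaining,
                              previous_data_remaining, migration_status):
    """Switch once pre-copy stops making progress (simplified condition)."""
    if migration_status != 'running':
        # Already switched, aborting, or not yet transferring.
        return False
    if memory_iteration <= 1:
        # Let the first full RAM pass finish before judging progress.
        return False
    # Dirtying outpaces transfer: remaining data is no longer decreasing.
    return current_data_remaining >= previous_data_remaining

# Second iteration, remaining data did not shrink -> switch to post-copy.
print(should_switch_to_postcopy(2, 100, 100, 'running'))  # True
```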
On top of that, the Libvirt live-migration control model Nova uses is "direct client control", so Nova, acting as the Libvirt client, has to poll libvirtd for the data-transfer status and use it to steer the migration. Hence Nova also has to implement its own migration-monitoring machinery.
In short, Nova's implementation on top of Libvirt live migration comes down to two things: issuing the live-migration instruction to libvirtd, and monitoring and steering the data transfer (downtime tuning, timeouts, the post-copy switch).
```python
# nova/api/openstack/compute/migrate_server.py
def _migrate_live(self, req, id, body):
    """Permit admins to (live) migrate a server to a new host."""
    ...
    # Whether to perform a block migration
    block_migration = body["os-migrateLive"]["block_migration"]
    ...
    # Whether to execute asynchronously
    async = api_version_request.is_supported(req, min_version='2.34')
    ...
    # Whether to force the migration
    force = self._get_force_param_for_live_migration(body, host)
    ...
    # Whether disk over-commit is allowed
    disk_over_commit = body["os-migrateLive"]["disk_over_commit"]
    ...
    self.compute_api.live_migrate(context, instance, block_migration,
                                  disk_over_commit, host, force, async)
    ...

# nova/nova/compute/api.py
def live_migrate(self, context, instance, block_migration,
                 disk_over_commit, host_name, force=None, async=False):
    """Migrate a server lively to a new host."""
    ...
    # NOTE(sbauza): Force is a boolean by the new related API version
    if force is False and host_name:
        ...
        # Not forced: set the destination host info
        destination = objects.Destination(
            host=target.host,
            node=target.hypervisor_hostname
        )
        request_spec.requested_destination = destination
    ...
    self.compute_task_api.live_migrate_instance(
        context, instance, host_name,
        block_migration=block_migration,
        disk_over_commit=disk_over_commit,
        request_spec=request_spec, async=async)

# nova/nova/conductor/manager.py
def _live_migrate(self, context, instance, scheduler_hint,
                  block_migration, disk_over_commit, request_spec):
    # Get the destination host
    destination = scheduler_hint.get("host")
    ...
    task = self._build_live_migrate_task(context, instance, destination,
                                         block_migration, disk_over_commit,
                                         migration, request_spec)
    ...
    task.execute()
    ...

# nova/nova/conductor/tasks/live_migrate.py
class LiveMigrationTask(base.TaskBase):
    ...
    def _execute(self):
        # Check that the instance is running
        self._check_instance_is_active()
        # Check that the source host's service is up
        self._check_host_is_up(self.source)

        # A live migration does not have to name a destination host;
        # if none was given, let the scheduler pick one
        if not self.destination:
            self.destination = self._find_destination()
            self.migration.dest_compute = self.destination
            self.migration.save()
        else:
            # Check that destination and source are not the same host
            # Check that the destination host's service is up
            # Check that the destination host has enough free memory
            # Check that destination and source run the same Hypervisor
            # Check that the destination host can accept a live migration
            self._check_requested_destination()

        # TODO(johngarbutt) need to move complexity out of compute manager
        # TODO(johngarbutt) disk_over_commit?
        return self.compute_rpcapi.live_migration(
            self.context,
            host=self.source,
            instance=self.instance,
            dest=self.destination,
            block_migration=self.block_migration,
            migration=self.migration,
            migrate_data=self.migrate_data)

# nova/compute/manager.py
def live_migration(self, context, dest, instance, block_migration,
                   migration, migrate_data):
    ...
    # Set the migration status to 'queued'
    self._set_migration_status(migration, 'queued')

    def dispatch_live_migration(*args, **kwargs):
        with self._live_migration_semaphore:
            self._do_live_migration(*args, **kwargs)

    # Spawn a live-migration task onto the queue
    utils.spawn_n(dispatch_live_migration, context, dest, instance,
                  block_migration, migration, migrate_data)

def _do_live_migration(self, context, dest, instance, block_migration,
                       migration, migrate_data):
    ...
    # Set the migration status to 'preparing'
    self._set_migration_status(migration, 'preparing')

    got_migrate_data_object = isinstance(migrate_data,
                                         migrate_data_obj.LiveMigrateData)
    if not got_migrate_data_object:
        migrate_data = \
            migrate_data_obj.LiveMigrateData.detect_implementation(
                migrate_data)

    try:
        if ('block_migration' in migrate_data and
                migrate_data.block_migration):
            # Block migration: read the local disk file info recorded
            # in disk.info
            block_device_info = self._get_instance_block_device_info(
                context, instance)
            disk = self.driver.get_instance_disk_info(
                instance, block_device_info=block_device_info)
        else:
            disk = None

        # Ask the destination host to prepare for the live migration
        migrate_data = self.compute_rpcapi.pre_live_migration(
            context, instance, block_migration, disk, dest, migrate_data)
        ...
        # Set the migration status to 'running'
        self._set_migration_status(migration, 'running')
        ...
        self.driver.live_migration(context, instance, dest,
                                   self._post_live_migration,
                                   self._rollback_live_migration,
                                   block_migration, migrate_data)
    ...

# nova/nova/virt/libvirt/driver.py
def _live_migration(self, context, instance, dest, post_method,
                    recover_method, block_migration,
                    migrate_data):
    ...
    # nova.virt.libvirt.guest.Guest object
    guest = self._host.get_guest(instance)

    disk_paths = []
    device_names = []
    if migrate_data.block_migration:
        # Block migration: collect the local disk file paths.
        # Without block migration only memory data is transferred.
        # e.g. /var/lib/nova/instances/bf6824e9-1dac-466c-ab53-69f82d8adf73/disk
        disk_paths, device_names = self._live_migration_copy_disk_paths(
            context, instance, guest)

    # Spawn the live-migration worker function
    opthread = utils.spawn(self._live_migration_operation,
                           context, instance, dest,
                           block_migration,
                           migrate_data, guest,
                           device_names)
    ...
    # Monitor libvirtd's data-transfer progress
    self._live_migration_monitor(context, instance, guest, dest,
                                 post_method, recover_method,
                                 block_migration, migrate_data,
                                 finish_event, disk_paths)
    ...
```
```python
def _live_migration_operation(self, context, instance, dest,
                              block_migration, migrate_data, guest,
                              device_names):
    ...
    # Work out the live migration URI
    migrate_uri = None
    if ('target_connect_addr' in migrate_data and
            migrate_data.target_connect_addr is not None):
        dest = migrate_data.target_connect_addr
        if (migration_flags & libvirt.VIR_MIGRATE_TUNNELLED == 0):
            migrate_uri = self._migrate_uri(dest)

    # Get the GuestOS XML
    new_xml_str = None
    params = None
    if (self._host.is_migratable_xml_flag() and (
            listen_addrs or migrate_data.bdms)):
        new_xml_str = libvirt_migrate.get_updated_guest_xml(
            # TODO(sahid): It's not a really well idea to pass
            # the method _get_volume_config and we should to find
            # a way to avoid this in future.
            guest, migrate_data, self._get_volume_config)
    ...
    # Call the wrapper around libvirt.virDomain.migrate to issue the
    # Live Migration instruction to libvirtd
    guest.migrate(self._live_migration_uri(dest),
                  migrate_uri=migrate_uri,
                  flags=migration_flags,
                  params=params,
                  domain_xml=new_xml_str,
                  bandwidth=CONF.libvirt.live_migration_bandwidth)
    ...
```
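The URI logic above can be restated on its own: when VIR_MIGRATE_TUNNELLED is set, migration data rides libvirtd's own connection and no separate migrate URI is needed; otherwise a direct transport URI pointing at the destination is built. A hedged sketch (the flag value mirrors libvirt's constant, but is hard-coded here so the snippet stands alone; the `tcp://` scheme is one plausible direct transport):

```python
# Illustrative value; at runtime this constant comes from the libvirt module.
VIR_MIGRATE_TUNNELLED = 4

def build_migrate_uri(dest, migration_flags):
    """Return a direct migration URI, or None for tunnelled migration."""
    if migration_flags & VIR_MIGRATE_TUNNELLED:
        # Tunnelled: memory/disk data flows through the libvirtd connection.
        return None
    return 'tcp://%s' % dest

print(build_migrate_uri('192.168.0.8', 0))  # tcp://192.168.0.8
```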
The Libvirt Python client's migration entry point is libvirt.virDomain.migrate:
```
migrate(self, dconn, flags, dname, uri, bandwidth)
    method of libvirt.virDomain instance
    Migrate the domain object from its current host to the destination
    host given by dconn (a connection to the destination host).
```
The Nova Libvirt driver wraps libvirt.virDomain.migrate:
```python
# nova/virt/libvirt/guest.py
def migrate(self, destination, migrate_uri=None, params=None, flags=0,
            domain_xml=None, bandwidth=0):
    """Migrate guest object from its current host to the destination"""
    if domain_xml is None:
        self._domain.migrateToURI(
            destination, flags=flags, bandwidth=bandwidth)
    else:
        if params:
            ...
            if migrate_uri:
                # In migrateToURI3 this parameter is searched in
                # the `params` dict
                params['migrate_uri'] = migrate_uri
            params['bandwidth'] = bandwidth
            self._domain.migrateToURI3(
                destination, params=params, flags=flags)
        else:
            self._domain.migrateToURI2(
                destination, miguri=migrate_uri, dxml=domain_xml,
                flags=flags, bandwidth=bandwidth)
```
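The dispatch between the three libvirt APIs can be exercised without libvirt by substituting a fake domain object. This is a condensed restatement of the wrapper's branching above, with a stand-in class that simply records which API was invoked:

```python
class FakeDomain:
    """Stand-in for libvirt.virDomain that records which API was used."""
    def __init__(self):
        self.called = None
    def migrateToURI(self, dest, flags=0, bandwidth=0):
        self.called = 'migrateToURI'
    def migrateToURI2(self, dest, miguri=None, dxml=None, flags=0, bandwidth=0):
        self.called = 'migrateToURI2'
    def migrateToURI3(self, dest, params=None, flags=0):
        self.called = 'migrateToURI3'

def migrate(domain, destination, migrate_uri=None, params=None,
            flags=0, domain_xml=None, bandwidth=0):
    """Condensed version of the dispatch logic in guest.migrate above."""
    if domain_xml is None:
        domain.migrateToURI(destination, flags=flags, bandwidth=bandwidth)
    elif params:
        if migrate_uri:
            params['migrate_uri'] = migrate_uri
        params['bandwidth'] = bandwidth
        domain.migrateToURI3(destination, params=params, flags=flags)
    else:
        domain.migrateToURI2(destination, miguri=migrate_uri,
                             dxml=domain_xml, flags=flags,
                             bandwidth=bandwidth)

d = FakeDomain()
migrate(d, 'qemu+tcp://dest/system')
print(d.called)  # migrateToURI
```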
The details of a Libvirt migration are tuned through a set of flags.
These flags are defined by the nova.conf option live_migration_flag, e.g.
```
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED
```
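Internally, a comma-separated flag string like this ends up OR-ed into a single bitmask that is passed to the libvirt call. A minimal sketch of that conversion (the numeric values mirror libvirt's constants but are hard-coded here as assumptions so the snippet stands alone; the real code resolves names against the libvirt module):

```python
# Illustrative values; at runtime they come from the libvirt module.
FLAG_VALUES = {
    'VIR_MIGRATE_LIVE': 1,
    'VIR_MIGRATE_PEER2PEER': 2,
    'VIR_MIGRATE_TUNNELLED': 4,
    'VIR_MIGRATE_UNDEFINE_SOURCE': 16,
}

def parse_migration_flags(flag_string):
    """Fold a comma-separated flag-name string into one bitmask."""
    flags = 0
    for name in flag_string.split(','):
        flags |= FLAG_VALUES[name.strip()]
    return flags

print(parse_migration_flags(
    'VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, '
    'VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED'))  # 23
```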
```python
# nova/nova/virt/libvirt/driver.py
def _live_migration_monitor(self, context, instance, guest,
                            dest, post_method,
                            recover_method, block_migration,
                            migrate_data, finish_event,
                            disk_paths):
    # Total amount of data to live-migrate, RAM plus local disk files
    # data_gb: total GB of RAM and disk to transfer
    data_gb = self._live_migration_data_gb(instance, disk_paths)

    # e.g. downtime_steps = [(0, 46), (300, 47), (600, 48), (900, 51),
    #                        (1200, 57), (1500, 66), (1800, 84),
    #                        (2100, 117), (2400, 179), (2700, 291),
    #                        (3000, 500)]
    # downtime_steps is produced by an algorithm whose inputs are:
    #   data_gb
    #   CONF.libvirt.live_migration_downtime
    #   CONF.libvirt.live_migration_downtime_steps
    #   CONF.libvirt.live_migration_downtime_delay
    # Meaning of downtime_steps: each tuple is one Step, and the downtime
    # value is handed to libvirtd in `steps` installments as
    # (delay, downtime), i.e. (time until the next hand-off, downtime
    # value to hand off), ending with the administrator's configured
    # maximum. If the downtime libvirtd estimates for an iteration falls
    # within the currently handed-off downtime, the exit condition is met.
    # NOTE: the max downtime of each Step keeps increasing up to the
    # user-configured maximum tolerable downtime, because Nova keeps
    # probing for the smallest workable max downtime so that the exit
    # state is reached as early as possible.
    downtime_steps = list(self._migration_downtime_steps(data_gb))
    ...
    # Polling count
    n = 0
    # Monitoring start time
    start = time.time()
    progress_time = start
    # progress_watermark marks the remaining data volume seen by the
    # previous poll; as long as data keeps moving, the watermark keeps
    # falling
    progress_watermark = None
    # Whether Post-Copy mode is enabled
    is_post_copy_enabled = self._is_post_copy_enabled(migration_flags)
    while True:
        # Get the live-migration job info
        info = guest.get_job_info()
        ...
        elif info.type == libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
            # Migration is still running
            #
            # This is where we wire up calls to change live
            # migration status. eg change max downtime, cancel
            # the operation, change max bandwidth
            libvirt_migrate.run_tasks(guest, instance,
                                      self.active_migrations,
                                      on_migration_failure,
                                      migration,
                                      is_post_copy_enabled)

            now = time.time()
            elapsed = now - start

            if ((progress_watermark is None) or
                    (progress_watermark == 0) or
                    (progress_watermark > info.data_remaining)):
                progress_watermark = info.data_remaining
                progress_time = now

            # progress_timeout guards against a transfer stalled by a
            # misbehaving libvirtd; once the stall exceeds this timeout,
            # the migration is aborted
            progress_timeout = CONF.libvirt.live_migration_progress_timeout
            # completion_timeout guards against libvirtd staying in the
            # migrating state for too long, e.g. because of low bandwidth,
            # which could congest the management network. It counts from
            # the first poll; once exceeded without completion, the
            # migration is aborted
            completion_timeout = int(
                CONF.libvirt.live_migration_completion_timeout * data_gb)
            # Decide whether the migration should be aborted
            if libvirt_migrate.should_abort(instance, now, progress_time,
                                            progress_timeout, elapsed,
                                            completion_timeout,
                                            migration.status):
                try:
                    guest.abort_job()
                except libvirt.libvirtError as e:
                    LOG.warning(_LW("Failed to abort migration %s"),
                                e, instance=instance)
                    self._clear_empty_migration(instance)
                    raise

            # Decide whether to switch to Post-Copy mode
            if (is_post_copy_enabled and
                    libvirt_migrate.should_switch_to_postcopy(
                        info.memory_iteration, info.data_remaining,
                        previous_data_remaining, migration.status)):
                # Perform the Post-Copy switch
                libvirt_migrate.trigger_postcopy_switch(guest,
                                                        instance,
                                                        migration)
            previous_data_remaining = info.data_remaining

            # Iteratively hand over the max downtime Steps
            curdowntime = libvirt_migrate.update_downtime(
                guest, instance, curdowntime,
                downtime_steps, elapsed)

            if (n % 10) == 0:
                remaining = 100
                if info.memory_total != 0:
                    # Compute the remaining share of data to migrate
                    remaining = round(info.memory_remaining *
                                      100 / info.memory_total)
                libvirt_migrate.save_stats(instance, migration,
                                           info, remaining)

                # Log at info level every 60 polls,
                # at debug level every 10 polls
                lg = LOG.debug
                if (n % 60) == 0:
                    lg = LOG.info
                # Log elapsed seconds, remaining memory, and progress
                lg(_LI("Migration running for %(secs)d secs, "
                       "memory %(remaining)d%% remaining; "
                       "(bytes processed=%(processed_memory)d, "
                       "remaining=%(remaining_memory)d, "
                       "total=%(total_memory)d)"),
                   {"secs": n / 2, "remaining": remaining,
                    "processed_memory": info.memory_processed,
                    "remaining_memory": info.memory_remaining,
                    "total_memory": info.memory_total}, instance=instance)
                if info.data_remaining > progress_watermark:
                    lg(_LI("Data remaining %(remaining)d bytes, "
                           "low watermark %(watermark)d bytes "
                           "%(last)d seconds ago"),
                       {"remaining": info.data_remaining,
                        "watermark": progress_watermark,
                        "last": (now - progress_time)}, instance=instance)

            n = n + 1
        # Migration completed
        elif info.type == libvirt.VIR_DOMAIN_JOB_COMPLETED:
            # Migration is all done
            LOG.info(_LI("Migration operation has completed"),
                     instance=instance)
            post_method(context, instance, dest, block_migration,
                        migrate_data)
            break
        # Migration failed
        elif info.type == libvirt.VIR_DOMAIN_JOB_FAILED:
            # Migration did not succeed
            LOG.error(_LE("Migration operation has aborted"),
                      instance=instance)
            libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                              on_migration_failure)
            recover_method(context, instance, dest, block_migration,
                           migrate_data)
            break
        # Migration cancelled
        elif info.type == libvirt.VIR_DOMAIN_JOB_CANCELLED:
            # Migration was stopped by admin
            LOG.warning(_LW("Migration operation was cancelled"),
                        instance=instance)
            libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                              on_migration_failure)
            recover_method(context, instance, dest, block_migration,
                           migrate_data, migration_status='cancelled')
            break
        else:
            LOG.warning(_LW("Unexpected migration job type: %d"),
                        info.type, instance=instance)

        time.sleep(0.5)
    self._clear_empty_migration(instance)

def _live_migration_data_gb(self, instance, disk_paths):
    '''Calculate total amount of data to be transferred

    :param instance: the nova.objects.Instance being migrated
    :param disk_paths: list of disk paths that are being migrated
        with instance

    Calculates the total amount of data that needs to be transferred
    during the live migration. The actual amount copied will be larger
    than this, due to the guest OS continuing to dirty RAM while the
    migration is taking place. So this value represents the minimal
    data size possible.

    :returns: data size to be copied in GB
    '''
    ram_gb = instance.flavor.memory_mb * units.Mi / units.Gi
    if ram_gb < 2:
        ram_gb = 2

    disk_gb = 0
    for path in disk_paths:
        try:
            size = os.stat(path).st_size
            size_gb = (size / units.Gi)
            if size_gb < 2:
                size_gb = 2
            disk_gb += size_gb
        except OSError as e:
            LOG.warning(_LW("Unable to stat %(disk)s: %(ex)s"),
                        {'disk': path, 'ex': e})
            # Ignore error since we don't want to break
            # the migration monitoring thread operation

    # Return the combined RAM + disks data volume
    return ram_gb + disk_gb

def _migration_downtime_steps(data_gb):
    '''Calculate downtime value steps and time between increases.

    :param data_gb: total GB of RAM and disk to transfer

    This looks at the total downtime steps and upper bound
    downtime value and uses an exponential backoff. So initially
    max downtime is increased by small amounts, and as time goes
    by it is increased by ever larger amounts

    For example, with 10 steps, 30 second step delay, 3 GB
    of RAM and 400ms target maximum downtime, the downtime will
    be increased every 90 seconds in the following progression:

    -   0 seconds -> set downtime to  37ms
    -  90 seconds -> set downtime to  38ms
    - 180 seconds -> set downtime to  39ms
    - 270 seconds -> set downtime to  42ms
    - 360 seconds -> set downtime to  46ms
    - 450 seconds -> set downtime to  55ms
    - 540 seconds -> set downtime to  70ms
    - 630 seconds -> set downtime to  98ms
    - 720 seconds -> set downtime to 148ms
    - 810 seconds -> set downtime to 238ms
    - 900 seconds -> set downtime to 400ms

    This allows the guest a good chance to complete migration
    with a small downtime value.
    '''
    # Configuration options control the execution details of the
    # live migration
    downtime = CONF.libvirt.live_migration_downtime
    steps = CONF.libvirt.live_migration_downtime_steps
    delay = CONF.libvirt.live_migration_downtime_delay

    # TODO(hieulq): Need to move min/max value into the config option,
    # currently oslo_config will raise ValueError instead of setting
    # option value to its min/max.
    if downtime < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN:
        downtime = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN
    if steps < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN:
        steps = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN
    if delay < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN:
        delay = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN
    delay = int(delay * data_gb)

    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))

    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))

# nova/nova/virt/libvirt/migration.py
def update_downtime(guest, instance,
                    olddowntime,
                    downtime_steps, elapsed):
    """Update max downtime if needed

    :param guest: a nova.virt.libvirt.guest.Guest to set downtime for
    :param instance: a nova.objects.Instance
    :param olddowntime: current set downtime, or None
    :param downtime_steps: list of downtime steps
    :param elapsed: total time of migration in secs

    Determine if the maximum downtime needs to be increased
    based on the downtime steps. Each element in the downtime
    steps list should be a 2 element tuple. The first element
    contains a time marker and the second element contains
    the downtime value to set when the marker is hit.

    The guest object will be used to change the current
    downtime value on the instance.

    Any errors hit when updating downtime will be ignored

    :returns: the new downtime value
    """
    LOG.debug("Current %(dt)s elapsed %(elapsed)d steps %(steps)s",
              {"dt": olddowntime, "elapsed": elapsed,
               "steps": downtime_steps}, instance=instance)
    thisstep = None
    for step in downtime_steps:
        # elapsed is the migration time so far; if it exceeds this
        # step's delay marker, this step becomes the current step
        if elapsed > step[0]:
            thisstep = step

    if thisstep is None:
        LOG.debug("No current step", instance=instance)
        return olddowntime

    if thisstep[1] == olddowntime:
        LOG.debug("Downtime does not need to change",
                  instance=instance)
        return olddowntime

    LOG.info(_LI("Increasing downtime to %(downtime)d ms "
                 "after %(waittime)d sec elapsed time"),
             {"downtime": thisstep[1],
              "waittime": thisstep[0]},
             instance=instance)

    try:
        # Hand the current max downtime over to libvirtd
        guest.migrate_configure_max_downtime(thisstep[1])
    except libvirt.libvirtError as e:
        LOG.warning(_LW("Unable to increase max downtime to %(time)d"
                        "ms: %(e)s"),
                    {"time": thisstep[1], "e": e}, instance=instance)
    return thisstep[1]
```
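The docstring's example progression can be checked with a standalone copy of the step generator. This sketch drops the config lookups and min-value clamping and hard-codes the docstring's scenario as defaults; it is a verification aid, not Nova's code:

```python
def migration_downtime_steps(data_gb, downtime=400, steps=10, delay=30):
    """Standalone copy of the exponential-backoff step generator above."""
    delay = int(delay * data_gb)
    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))

# 3 GB to transfer, 400 ms target downtime -> a new (delay, downtime)
# tuple every 90 seconds, ramping 37ms ... 400ms as in the docstring.
print(list(migration_downtime_steps(3)))
```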
In "OpenStack VM Cold/Live Migration: Practice and Flow Analysis" we already migrated a VM with NUMA affinity and CPU pinning, and the VM kept those properties after the migration. Here we run a more extreme test: migrate a VM with NUMA affinity and dedicated CPU pinning to a destination host whose NUMA and CPU resources are already exhausted.
```
[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+------------------------------------------------------------------------+
| Field                                | Value                                                                  |
+--------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                                   |
| OS-EXT-AZ:availability_zone          | nova                                                                   |
| OS-EXT-SRV-ATTR:host                 | overcloud-ovscompute-1.localdomain                                     |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-ovscompute-1.localdomain                                     |
| OS-EXT-SRV-ATTR:instance_name        | instance-000000d6                                                      |
| OS-EXT-STS:power_state               | Running                                                                |
| OS-EXT-STS:task_state                | None                                                                   |
| OS-EXT-STS:vm_state                  | active                                                                 |
| OS-SRV-USG:launched_at               | 2019-03-20T10:45:55.000000                                             |
| OS-SRV-USG:terminated_at             | None                                                                   |
| accessIPv4                           |                                                                        |
| accessIPv6                           |                                                                        |
| addresses                            | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19   |
| config_drive                         |                                                                        |
| created                              | 2019-03-20T10:44:52Z                                                   |
| flavor                               | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788)                         |
| hostId                               | 9f1230901ddf3fe0e1a41e1c650a784c122b791f89fdf66a40cff3d6               |
| id                                   | a17ddcbf-d936-4c77-9ea6-2e684c41cc39                                   |
| image                                | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb)    |
| key_name                             | stack                                                                  |
| name                                 | VM1                                                                    |
| os-extended-volumes:volumes_attached | []                                                                     |
| progress                             | 0                                                                      |
| project_id                           | a6c78435075246f3aa5ab946b87086c5                                       |
| properties                           |                                                                        |
| security_groups                      | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'},  |
|                                      |  {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}]  |
| status                               | ACTIVE                                                                 |
| updated                              | 2019-03-20T10:45:56Z                                                   |
| user_id                              | 4fe574569664493bbd660abfe762a630                                       |
+--------------------------------------+------------------------------------------------------------------------+

[stack@undercloud (overcloudrc) ~]$ openstack server migrate --block-migration --live overcloud-ovscompute-0.localdomain --wait VM1
Complete

[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+------------------------------------------------------------------------+
| Field                                | Value                                                                  |
+--------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                                   |
| OS-EXT-AZ:availability_zone          | ovs                                                                    |
| OS-EXT-SRV-ATTR:host                 | overcloud-ovscompute-0.localdomain                                     |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-ovscompute-0.localdomain                                     |
| OS-EXT-SRV-ATTR:instance_name        | instance-000000d6                                                      |
| OS-EXT-STS:power_state               | Running                                                                |
| OS-EXT-STS:task_state                | None                                                                   |
| OS-EXT-STS:vm_state                  | active                                                                 |
| OS-SRV-USG:launched_at               | 2019-03-20T10:45:55.000000                                             |
| OS-SRV-USG:terminated_at             | None                                                                   |
| accessIPv4                           |                                                                        |
| accessIPv6                           |                                                                        |
| addresses                            | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19   |
| config_drive                         |                                                                        |
| created                              | 2019-03-20T10:44:52Z                                                   |
| flavor                               | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788)                         |
| hostId                               | 0f2ec590cd73fe0e9522f1ba715dae7a7d4b884e15aa8254defe85d0               |
| id                                   | a17ddcbf-d936-4c77-9ea6-2e684c41cc39                                   |
| image                                | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb)    |
| key_name                             | stack                                                                  |
| name                                 | VM1                                                                    |
| os-extended-volumes:volumes_attached | []                                                                     |
| progress                             | 0                                                                      |
| project_id                           | a6c78435075246f3aa5ab946b87086c5                                       |
| properties                           |                                                                        |
| security_groups                      | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'},  |
|                                      |  {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}]  |
| status                               | ACTIVE                                                                 |
| updated                              | 2019-03-20T10:51:47Z                                                   |
| user_id                              | 4fe574569664493bbd660abfe762a630                                       |
+--------------------------------------+------------------------------------------------------------------------+
```
The exception raised during the migration:
```
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager [req-566373ae-5282-4378-9678-d8d08e121cdb - - - - -] Error updating resources for node overcloud-ovscompute-0.localdomain.
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager Traceback (most recent call last):
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6590, in update_available_resource_for_node
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     rt.update_available_resource(context)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 536, in update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_available_resource(context, resources)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     return f(*args, **kwargs)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 896, in _update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instances(context, instances)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1393, in _update_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instance(context, instance)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1273, in _update_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     sign, is_periodic)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1119, in _update_usage
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.compute_node, usage, free)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1574, in get_host_numa_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     host_numa_topology, instance_numa_topology, free=free))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1447, in numa_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     newcell.pin_cpus(pinned_cpus)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 86, in pin_cpus
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.pinned_cpus))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [0, 1] must be a subset of free CPU set [8]
```
NUMA affinity and CPU pinning information after the migration:
```
# The migrated instance
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d6
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1

# The pre-existing instance
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d0
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1
   2: 2
   3: 3
   4: 4
   5: 5
   6: 6
   7: 7
```
Part of the migrated VM's XML:
```xml
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>IvyBridge</model>
  <topology sockets='1' cores='2' threads='1'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='arat'/>
  <feature policy='require' name='xsaveopt'/>
  <numa>
    <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
  </numa>
</cpu>
```
Conclusion: the VM migrates successfully and keeps its original NUMA and CPU properties. That is because the Dedicated CPU Policy is a Nova-level concept, yet the code walk-through above shows that Nova's live migration is entirely NUMA-non-aware. The Hypervisor layer cares even less about such parameters: it is completely faithful to the XML, so if the XML says use pCPUs 0 and 1, the Hypervisor will carry that out even when pCPUs 0 and 1 are already occupied by another VM. From Nova's perspective this is simply a bug; the community has described the problem and proposed the blueprint "NUMA-aware live migration".
As for SR-IOV, the Nova documentation states explicitly that live migration of SR-IOV instances is not supported. As I analyzed in "Using SR-IOV to Relieve the Neutron Network I/O Bottleneck", an SR-IOV VF device is, to a KVM guest, just an XML stanza, e.g.
```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='0x81' slot='0x10' function='0x2'/>
  </source>
</hostdev>
```
As long as a VF matching this stanza can be found on the destination compute node, the SR-IOV NIC could in principle be migrated. The catch is that, strictly speaking, the XML of a live-migrated VM should not be modified; in practice rewriting one VF stanza may well do no harm, provided there is a rollback plan for failed migrations and Nova becomes SR-IOV-aware. Writing this, I find myself hoping all the more that OpenStack Placement matures quickly, because Nova's "black box" management of resources like NUMA and SR-IOV is genuinely painful.
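Matching a VF on the destination then amounts to comparing PCI addresses pulled out of the XML. A sketch using the stanza above (illustrative only; real Nova tracks VFs through its PCI device manager rather than by parsing guest XML):

```python
import xml.etree.ElementTree as ET

HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='0x81' slot='0x10' function='0x2'/>
  </source>
</hostdev>
"""

def vf_address(hostdev_xml):
    """Extract the (bus, slot, function) triple a destination VF must match."""
    addr = ET.fromstring(hostdev_xml).find('./source/address')
    return addr.get('bus'), addr.get('slot'), addr.get('function')

print(vf_address(HOSTDEV_XML))  # ('0x81', '0x10', '0x2')
```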
Working through the implementation of OpenStack VM cold and live migration shows that Nova essentially wraps and orchestrates either traditional migration techniques or the migration features of the underlying hypervisor support software, so that cold and live migration can meet the demands of an enterprise cloud platform. The real technical value still lies in the underlying support layers, as with the other OpenStack projects.
https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/
http://www.javashuo.com/article/p-scpajatm-mr.html
https://docs.openstack.org/nova/pike/admin/configuring-migrations.html
https://docs.openstack.org/nova/pike/admin/live-migration-usage.html
https://blog.csdn.net/lemontree1945/article/details/79901874
https://www.ibm.com/developerworks/cn/linux/l-cn-mgrtvm1/index.html
https://blog.csdn.net/hawkerou/article/details/53482268