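As an administrator, you monitor Impala's resource usage and take action when necessary to keep Impala running smoothly and to avoid conflicts with other Hadoop components running on the same cluster. When you detect that a problem has occurred or is about to occur, you reconfigure Impala or other components, such as HDFS or even the cluster hardware itself, to resolve or avoid the problem.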
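As an administrator, you can perform installation, upgrade, and configuration tasks for Impala on all machines in the cluster. See Installing Cloudera Impala, Upgrading Impala, and Configuring Impala for details.
For additional security tasks performed by administrators, see Impala Security.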
You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run jobs from many Hadoop components. (Currently, there is no limit or throttling on the I/O for Impala queries.)
Impala uses the underlying Apache Hadoop YARN resource management framework, which allocates the required resources for each Impala query. Impala estimates the resources required by the query on each node of the cluster, and requests the resources from YARN. Requests from Impala to YARN go through an intermediary service, Llama (Low Latency Application Master). When the resource requests are granted, Impala starts the query, places all relevant execution threads into the CGroup containers, and sets up the memory limit on each node. If sufficient resources are not available, the Impala query waits until other jobs complete and the resources are freed.
While the waits for resources might make individual queries seem less responsive on a heavily loaded cluster, the resource management feature makes the overall performance of the cluster smoother and more predictable, without sudden spikes in utilization due to memory paging, saturated I/O channels, CPUs pegged at 100%, and so on.
To make resource usage easier to verify, the output of the EXPLAIN SQL statement now includes information about estimated memory usage, whether table and column statistics are available for each table, and the number of virtual cores that a query will use. You can get this information through the EXPLAIN statement without actually running the query. The extra information requires setting the query option EXPLAIN_LEVEL=verbose; see EXPLAIN Statement for details. The same extended information is shown at the start of the output from the PROFILE statement in impala-shell. The detailed profile information is only available after running the query. You can take appropriate actions (gathering statistics, adjusting query options) if you find that queries fail or run with suboptimal performance when resource management is enabled.
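The workflow described above might look roughly like the following impala-shell session. This is a minimal sketch: the table name sales_facts is hypothetical, and the exact layout of the plan and profile output depends on the Impala version in use.

```sql
-- Ask for the extended plan output (the query option named above).
SET EXPLAIN_LEVEL=verbose;

-- Show the plan, with resource estimates and statistics availability,
-- without running the query. sales_facts is a hypothetical table.
EXPLAIN SELECT COUNT(*) FROM sales_facts;

-- Run the query, then display the detailed profile for it.
SELECT COUNT(*) FROM sales_facts;
PROFILE;
```

If the plan reports that table or column statistics are missing, gathering statistics before rerunning the query is the kind of corrective action the paragraph above refers to.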
To enable resource management for Impala, first you set up the YARN and Llama services for your CDH cluster. Then you add startup options and customize resource management settings for the Impala services.
YARN is the general-purpose service that manages resources for many Hadoop components within a CDH cluster. Llama is a specialized service that acts as an intermediary between Impala and YARN, translating Impala resource requests to YARN and coordinating with Impala so that queries only begin executing when all needed resources have been granted by YARN.
For information about setting up the YARN and Llama services, see the instructions for YARN and Llama in the CDH 5 Installation Guide.
Before issuing SQL statements through the impala-shell interpreter, you can use the SET command to configure the following parameters related to resource management:
Setting the EXPLAIN_LEVEL option to verbose or 1 enables extra information in the output of the EXPLAIN command. Setting the option to normal or 0 suppresses the extra information. The extended information is especially useful during performance tuning, when you need to confirm whether table and column statistics are available for a query. The extended information also helps to check estimated resource usage when you use the resource management feature in CDH 5. See EXPLAIN Statement for details about the extended information and how to use it.
When resource management is not enabled, the MEM_LIMIT option defines the maximum amount of memory a query can allocate on each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query automatically. Memory limits are checked periodically during query processing, so the actual memory in use might briefly exceed the limit without the query being cancelled.
When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the query does not proceed until that much memory is available. The actual memory used by the query could be lower, since some queries use much less memory than others. With resource management, the MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced by YARN) and as a guarantee that that much memory will be available on each node while the query is being executed. When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the amount of memory needed on each node for each query, requests that much memory from YARN before starting the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT is exceeded and cancels the query itself.
Default: 0
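As an illustration of setting this option for a session, the following impala-shell fragment is a sketch only. The 2 GB figure is an arbitrary example value, given in bytes, and the table and column names are hypothetical.

```sql
-- Set the per-node memory limit for later queries in this session.
-- 2000000000 bytes (about 2 GB) is an arbitrary example value.
SET MEM_LIMIT=2000000000;

-- Hypothetical query. If it needs more than the limit on any node,
-- Impala cancels it, as described above.
SELECT region, COUNT(*) FROM sales_facts GROUP BY region;

-- Restore the default value of 0.
SET MEM_LIMIT=0;
```

With resource management enabled, the same setting also determines how much memory Impala requests from YARN on each node, so an unnecessarily large value can make the query wait longer for its reservation.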
Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5.
Default: 300000 (5 minutes)
The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5.
Default: 0 (use automatic estimates)
The YARN pool/queue name that queries should be submitted to. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. Specifies the name of the pool used by resource requests from Impala to the YARN resource management framework.
Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration file)
Currently, the beta versions of CDH 5 and Impala have the following limitations for resource management of Impala queries:
Currently, there are known bugs that could cause the maximum memory usage reported by the PROFILE command to be lower than the actual value.
Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.
To keep long-running queries or idle sessions from tying up cluster resources, you can set timeout intervals for both individual queries and entire sessions. Specify the following startup options for the impalad daemon:
For instructions on modifying impalad startup options, see Modifying Impala Startup Options.