DOM Tree Node Parsing

Date: 2019-11-09

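DOM is the official W3C standard for working with XML documents, and it is independent of platform and language. A DOM parser loads the entire XML file and assembles it into a tree of DOM nodes; the data defined in the file is then read by traversing the tree and looking up nodes. Because DOM parsing loads every node into memory, it is relatively resource-intensive, and because the whole tree must be built before any data can be read, it is also relatively slow. On the other hand, because the tree stays in memory it can be read repeatedly, and because the node tree model is easy to understand it is simple to work with. As for performance, someone has compared several commonly used parsing approaches: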

Unit: s (seconds). Source: http://www.cnblogs.com/hedalixin/archive/2011/12/04/2275453.html

Parser	100KB	1MB	10MB
DOM	0.146s	0.469s	5.876s
SAX	0.110s	0.328s	3.547s
JDOM	0.172s	0.756s	45.447s
DOM4J	0.161s	0.422s	5.103s
StAX Stream	0.093s	0.334s	3.553s
StAX Event	0.131s	0.359s	3.641s

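In a DOM tree every node is a Node, and different kinds of nodes have different types. The W3C defines the node types as follows:

Node type	Description	Children
Document	Represents the entire document (the root node of the DOM tree).	Element (at most one), ProcessingInstruction, Comment, DocumentType
DocumentFragment	Represents a lightweight Document object that holds a portion of a document.	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
DocumentType	Provides an interface to the entities defined for the document.	None
ProcessingInstruction	Represents a processing instruction.	None
EntityReference	Represents an entity reference.	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
Element	Represents an element.	Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
Attr	Represents an attribute.	Text, EntityReference
Text	Represents the textual content of an element or attribute.	None
CDATASection	Represents a CDATA section in the document (text that the parser does not parse as markup).	None
Comment	Represents a comment.	None
Entity	Represents an entity.	Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
Notation	Represents a notation declared in the DTD.	None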

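In Java, the DOM node interfaces inherit as follows: every node interface extends Node; Text and Comment extend CharacterData, which itself extends Node; and CDATASection extends Text.

Node interface

In the DOM, every node type extends the Node interface, which represents a single node in the DOM tree. Node defines methods for working with child nodes, but not every node type can have children. A Text node, for example, has none, so adding a child to a Text node (appendChild) throws a DOMException.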
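The rule about childless node types can be checked with a few lines of Java. This is a minimal sketch that assumes the DOM implementation bundled with the JDK (obtained through JAXP); the class name TextAppendChildDemo and the element name are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

// Hypothetical demo class, not part of any library.
class TextAppendChildDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the DOMException code raised when a child is appended to a Text node,
    // or -1 if no exception was thrown.
    static short appendToText() {
        Document doc = newDoc();
        Text text = doc.createTextNode("hello");
        try {
            text.appendChild(doc.createElement("child"));
            return -1;
        } catch (DOMException e) {
            return e.code;
        }
    }

    public static void main(String[] args) {
        System.out.println(appendToText() == DOMException.HIERARCHY_REQUEST_ERR);
    }
}
```

The exception code distinguishes this failure (a node placed where it is not allowed) from other DOMException causes such as a read-only node.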

NodeName, NodeValue, and Attributes properties

Node provides getNodeName(), getNodeValue(), and getAttributes() so that applications can read this information without casting to a specific subinterface each time. Not every node type has a node name, node value, or attributes, so these methods return null where the information does not apply. The values returned for each node type are:

Node type	getNodeName()	getNodeValue()	getAttributes()
Document	"#document"	null	null
DocumentFragment	"#document-fragment"	null	null
DocumentType	DocumentType.name	null	null
EntityReference	name of the referenced entity	null	null
Element	Element.tagName (qName)	null	NamedNodeMap
Attr	attribute name (Attr.name)	attribute value (Attr.value)	null
ProcessingInstruction	ProcessingInstruction.target	ProcessingInstruction.data	null
Comment	"#comment"	comment text (CharacterData.data)	null
Text	"#text"	node content (CharacterData.data)	null
CDATASection	"#cdata-section"	node content (CharacterData.data)	null
Entity	entity name	null	null
Notation	notation name	null	null
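A few rows of the table can be confirmed by parsing a small document. The sketch below assumes the JDK's default JAXP parser; the class NodeInfoDemo, its describe helper, and the sample XML are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class NodeInfoDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Formats a node as "nodeName|nodeValue|whether getAttributes() is non-null".
    static String describe(Node n) {
        return n.getNodeName() + "|" + n.getNodeValue() + "|" + (n.getAttributes() != null);
    }

    public static void main(String[] args) {
        Document doc = parse("<root id=\"r1\"><!--note-->text</root>");
        Element root = doc.getDocumentElement();
        System.out.println(describe(doc));                          // Document row
        System.out.println(describe(root));                         // Element row
        System.out.println(describe(root.getAttributeNode("id")));  // Attr row
        System.out.println(describe(root.getFirstChild()));         // Comment row
        System.out.println(describe(root.getLastChild()));          // Text row
    }
}
```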

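Node also provides setNodeValue(String nodeValue). For node types whose getNodeValue() is defined to be null, calling it has no effect; for node types with a non-null nodeValue, calling it on a read-only node throws a DOMException. Only Element nodes carry attributes, so only an Element's getAttributes() returns a NamedNodeMap. Node also provides hasAttributes() to test whether the current node has any attributes, and only Element nodes can return true.

The NamedNodeMap interface provides a name => Node map. Operations on the NamedNodeMap obtained from an Element change that element's attributes: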

public interface NamedNodeMap {

public Node getNamedItem(String name);

public Node setNamedItem(Node arg);

public Node removeNamedItem(String name);

public Node item(int index);

public int getLength();

public Node getNamedItemNS(String namespaceURI, String localName);

public Node setNamedItemNS(Node arg);

public Node removeNamedItemNS(String namespaceURI, String localName);

}
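To see that the map is live, the following sketch edits an element only through its NamedNodeMap and then inspects the element. It assumes the JDK's bundled DOM implementation; NamedNodeMapDemo and the attribute names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical demo class, not part of any library.
class NamedNodeMapDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Changes an element's attributes only through its attribute map.
    static Element editThroughMap() {
        Document doc = newDoc();
        Element item = doc.createElement("item");
        item.setAttribute("a", "1");

        NamedNodeMap attrs = item.getAttributes();
        attrs.removeNamedItem("a");          // removes attribute "a" from the element

        Attr b = doc.createAttribute("b");
        b.setValue("2");
        attrs.setNamedItem(b);               // adds attribute "b" to the element
        return item;
    }

    public static void main(String[] args) {
        Element item = editThroughMap();
        System.out.println(item.hasAttribute("a") + " " + item.getAttribute("b"));
    }
}
```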

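TextContent property (set, get)

Node also provides the TextContent property, which represents the text of the current node and all of its descendants as a string. Setting it removes all of the node's children and, if the new value is neither empty nor null, replaces them with a single Text node. Reading it returns no markup and performs no parsing, so the returned text includes all spaces, line breaks, and similar characters. The content of the property for each node type is:

Node type	TextContent
Element, Attr, Entity, EntityReference, DocumentFragment	The concatenation of the TextContent of every child node (excluding Comment and ProcessingInstruction nodes); the empty string if the node has no children
Text, CDATASection, Comment, ProcessingInstruction	NodeValue
Document, DocumentType, Notation	null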
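The three rows can be illustrated with a small hand-built tree. This is a sketch under the assumption that the JDK's default DOM implementation is in use; TextContentDemo and the node contents are invented for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class TextContentDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <p>Hello <b>big</b><!--hidden--> world</p> programmatically.
    static Element build() {
        Document doc = newDoc();
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("Hello "));
        Element b = doc.createElement("b");
        b.appendChild(doc.createTextNode("big"));
        p.appendChild(b);
        p.appendChild(doc.createComment("hidden"));
        p.appendChild(doc.createTextNode(" world"));
        return p;
    }

    public static void main(String[] args) {
        Element p = build();
        System.out.println(p.getTextContent());                     // comment text is skipped
        System.out.println(p.getOwnerDocument().getTextContent());  // a Document has no text content
        p.setTextContent("replaced");                               // children become one Text node
        System.out.println(p.getChildNodes().getLength());
    }
}
```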

NodeType property

The DOM defines a short constant for each node type:

NodeType	Named Constant	Node Value
Element	ELEMENT_NODE	1
Attr	ATTRIBUTE_NODE	2
Text	TEXT_NODE	3
CDATASection	CDATA_SECTION_NODE	4
EntityReference	ENTITY_REFERENCE_NODE	5
Entity	ENTITY_NODE	6
ProcessingInstruction	PROCESSING_INSTRUCTION_NODE	7
Comment	COMMENT_NODE	8
Document	DOCUMENT_NODE	9
DocumentType	DOCUMENT_TYPE_NODE	10
DocumentFragment	DOCUMENT_FRAGMENT_NODE	11
Notation	NOTATION_NODE	12

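Methods for traversing and searching the node tree

getParentNode(): returns the parent of the current node. Attr, Document, DocumentFragment, Entity, and Notation nodes have no parent; all other node types may have one.

getFirstChild(): returns the first child of the current node, or null if there is none.

getLastChild(): returns the last child of the current node, or null if there is none.

getNextSibling(): returns the node immediately following the current node, or null if there is none.

getPreviousSibling(): returns the node immediately preceding the current node, or null if there is none.

getOwnerDocument(): returns the Document associated with the current node. In a DOM tree the Document is normally the root, so any node in the tree can reach the root directly through this method. For a Document node, or a DocumentType that is not yet used with any Document, it returns null.

hasChildNodes(): reports whether the current node has any children.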
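A typical use of these methods, together with getNodeType(), is a recursive walk over the tree. The sketch below counts nodes of a given type using only getFirstChild() and getNextSibling(); the class TraversalDemo and the sample tree are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class TraversalDemo {
    // Builds <root><a>x</a><!--c--><b><c/></b></root>.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element a = doc.createElement("a");
            a.appendChild(doc.createTextNode("x"));
            root.appendChild(a);
            root.appendChild(doc.createComment("c"));
            Element b = doc.createElement("b");
            b.appendChild(doc.createElement("c"));
            root.appendChild(b);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes of the given type in the subtree rooted at node (inclusive).
    // Attributes are not children, so they are never visited.
    static int countType(Node node, short type) {
        int count = node.getNodeType() == type ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countType(c, type);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(countType(doc, Node.ELEMENT_NODE));
        System.out.println(countType(doc, Node.TEXT_NODE));
        System.out.println(countType(doc, Node.COMMENT_NODE));
    }
}
```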

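Methods that modify child nodes

appendChild(Node newChild): appends newChild after all existing children of this node. If newChild is already in the tree, it is first removed. If newChild is a DocumentFragment, all of the fragment's children are appended to the child list instead. Because a node that is already in the tree is removed before being added, newChild cannot be this node or one of its ancestors. The child types each node type accepts are fixed, so a child of an unsupported type cannot be added. In addition, a Document node can hold only one Element and one DocumentType.

removeChild(Node oldChild): removes the child oldChild from the current node and returns it.

replaceChild(Node newChild, Node oldChild): replaces the child oldChild with newChild and returns oldChild. If newChild is a DocumentFragment, all of the fragment's children are inserted at oldChild's position. If newChild is already in the tree, it is first removed.

insertBefore(Node newChild, Node refChild): inserts newChild before the existing child refChild. If refChild is null, the method behaves like appendChild() and adds newChild at the end of the child list. If newChild is a DocumentFragment, the fragment's children are inserted instead. If newChild is already in the tree, it is first removed.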
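The move semantics, the DocumentFragment behavior, and the ancestor check can be observed together in one run. This is a sketch assuming the JDK's bundled DOM implementation; ChildMutationDemo, childNames, and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class ChildMutationDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins the node names of parent's children, e.g. "x,y".
    static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    static String run() {
        Document doc = newDoc();
        Element root = (Element) doc.appendChild(doc.createElement("root"));
        Element a = (Element) root.appendChild(doc.createElement("a"));
        Element b = (Element) root.appendChild(doc.createElement("b"));
        Node item = a.appendChild(doc.createElement("item"));

        b.appendChild(item);                 // item is removed from a, then added to b

        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.createElement("x"));
        frag.appendChild(doc.createElement("y"));
        a.insertBefore(frag, null);          // null refChild: behaves like appendChild

        boolean hierarchyError = false;
        try {
            a.appendChild(root);             // root is an ancestor of a, so this is rejected
        } catch (DOMException e) {
            hierarchyError = e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
        return "a=" + childNames(a) + " b=" + childNames(b)
                + " fragEmpty=" + !frag.hasChildNodes() + " hierarchyError=" + hierarchyError;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```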

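Namespace support

The DOM has supported namespaces since Level 2. In XML, only Element and Attr nodes have a namespace, and an attribute (Attr) does not inherit the namespace of its Element by default: it must declare its namespace explicitly, otherwise it has none.

getNamespaceURI(): returns the namespace URI of the current node, or null if none is defined. This is not a value computed by looking up the namespace declarations in scope; it is simply the namespace URI given when the node was created. Node types other than Element and Attr have no namespace.

getPrefix()/setPrefix(String prefix): the namespace prefix. For nodes other than Element and Attr, the getter always returns null and setting the prefix has no effect.

getLocalName(): returns the local name of the current node, that is, the name without the namespace prefix.

getBaseURI(): returns the absolute base URI of the current node, or null if the implementation could not obtain an absolute URI.

lookupPrefix(String namespaceURI): looks up the prefix bound to namespaceURI, starting at the current node. Default namespace declarations are ignored.

lookupNamespaceURI(String prefix): looks up the namespace URI bound to prefix, starting at the current node. If prefix is null, the default namespace URI is returned.

isDefaultNamespace(String namespaceURI): reports whether namespaceURI is the default namespace.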
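The namespace methods are easiest to follow on a parsed document. The sketch below assumes the JDK's default JAXP parser with namespace awareness switched on (it is off by default); NamespaceDemo, the urn:example URIs, and the element names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class NamespaceDemo {
    static final String XML =
            "<a:root xmlns:a=\"urn:example:a\" xmlns=\"urn:example:default\">"
            + "<child a:k=\"v\" plain=\"p\"/></a:root>";

    // Parses XML with a namespace-aware parser and returns the <child> element.
    static Element child() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // required for the namespace methods to be useful
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getDocumentElement().getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element child = child();
        System.out.println(child.getNamespaceURI());     // inherited default namespace
        System.out.println(child.getLocalName() + " prefix=" + child.getPrefix());
        // An unprefixed attribute is in no namespace, even though its element is.
        System.out.println(child.getAttributeNode("plain").getNamespaceURI());
        System.out.println(child.getAttributeNodeNS("urn:example:a", "k").getPrefix());
        System.out.println(child.lookupNamespaceURI("a"));
        System.out.println(child.lookupPrefix("urn:example:a"));
        System.out.println(child.isDefaultNamespace("urn:example:default"));
    }
}
```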

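Miscellaneous

isSupported(String feature, String version): reports whether the DOM implementation supports the given feature and version.

getFeature(String feature, String version): returns an object that implements the given feature and version. The method is a little hard to grasp; here is one simple implementation for reference (NodeImpl):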

public Object getFeature(String feature, String version) {

return isSupported(feature, version) ? this : null;

}

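setUserData(String key, Object data, UserDataHandler handler)/getUserData(String key): attaches application-defined data to the node; the application can later retrieve it from the node with getUserData(String key). The UserDataHandler, if one is supplied, is called whenever the node is cloned (Node.cloneNode()), imported (Document.importNode()), renamed (Document.renameNode()), adopted from another document (Document.adoptNode()), or deleted (deletion notification is not reliable in Java):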

public interface UserDataHandler {

public static final short NODE_CLONED = 1;

public static final short NODE_IMPORTED = 2;

public static final short NODE_DELETED = 3;

public static final short NODE_RENAMED = 4;

public static final short NODE_ADOPTED = 5;

public void handle(short operation, String key, Object data, Node src, Node dst);

}
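As a rough illustration of the callback, the sketch below registers a handler and then clones the node. It assumes the JDK's bundled DOM implementation; UserDataDemo, the key "owner", and the value "alice" are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.UserDataHandler;

// Hypothetical demo class, not part of any library.
class UserDataDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "handlerSawClone|data on original|data on clone".
    static String run() {
        Document doc = newDoc();
        Element e = doc.createElement("e");

        final short[] seen = {0};
        // UserDataHandler has a single abstract method, so a lambda can implement it.
        UserDataHandler handler = (operation, key, data, src, dst) -> seen[0] = operation;
        e.setUserData("owner", "alice", handler);

        Node copy = e.cloneNode(false);      // triggers the handler with NODE_CLONED

        return (seen[0] == UserDataHandler.NODE_CLONED)
                + "|" + e.getUserData("owner") + "|" + copy.getUserData("owner");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The handler is the place to copy user data onto the clone if the application wants it there, since the DOM does not do so itself.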

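cloneNode(boolean deep): returns a copy of the node. The copy has no parent (getParentNode() returns null) and no user data. When an Element is cloned with deep set to false, its children are not copied, but all of its attributes and their values are, including attributes generated from defaults. Cloning an Attr copies its value and all of its children regardless of deep. Cloning an EntityReference automatically builds its subtree if the corresponding Entity is available, again regardless of deep. Cloning any other node type simply returns a copy of that node.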
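The difference between a shallow and a deep clone of an Element can be sketched as follows, assuming the JDK's default DOM implementation; CloneDemo and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class CloneDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "shallow id|shallow has children|shallow has no parent|deep child count".
    static String run() {
        Document doc = newDoc();
        Element root = (Element) doc.appendChild(doc.createElement("root"));
        Element book = (Element) root.appendChild(doc.createElement("book"));
        book.setAttribute("id", "1");
        book.appendChild(doc.createElement("title"));

        Element shallow = (Element) book.cloneNode(false);  // attributes only
        Element deep = (Element) book.cloneNode(true);      // attributes and subtree

        return shallow.getAttribute("id")
                + "|" + shallow.hasChildNodes()
                + "|" + (shallow.getParentNode() == null)
                + "|" + deep.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```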

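normalize(): when a DOM tree is built programmatically, it can end up in a non-normal form, for example with two adjacent Text nodes or with empty Text nodes. normalize() merges adjacent Text nodes and removes empty ones throughout the subtree.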
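A minimal sketch of the effect, assuming the JDK's bundled DOM implementation (NormalizeDemo and the text values are invented for the example):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class NormalizeDemo {
    // Builds a <p> element whose children are three Text nodes: "foo", "" and "bar".
    static Element build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode("foo"));
            p.appendChild(doc.createTextNode(""));
            p.appendChild(doc.createTextNode("bar"));
            return p;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = build();
        System.out.println(p.getChildNodes().getLength());   // three Text nodes before
        p.normalize();
        System.out.println(p.getChildNodes().getLength());   // one merged Text node after
        System.out.println(p.getFirstChild().getNodeValue());
    }
}
```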

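compareDocumentPosition(Node other): compares the position of other relative to the current node in document order. The returned short is a bitmask built from the following values: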

public static final short DOCUMENT_POSITION_DISCONNECTED = 0x01;

public static final short DOCUMENT_POSITION_PRECEDING = 0x02;

public static final short DOCUMENT_POSITION_FOLLOWING = 0x04;

public static final short DOCUMENT_POSITION_CONTAINS = 0x08;

public static final short DOCUMENT_POSITION_CONTAINED_BY = 0x10;

public static final short DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC = 0x20;
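Because the result is a bitmask, it should be tested with a bitwise AND rather than compared for equality. The sketch below assumes the JDK's default DOM implementation; PositionDemo and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class PositionDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the position of other, relative to ref, has the given flag set.
    static boolean has(Node ref, Node other, short flag) {
        return (ref.compareDocumentPosition(other) & flag) != 0;
    }

    static String run() {
        Document doc = newDoc();
        Element root = (Element) doc.appendChild(doc.createElement("root"));
        Node a = root.appendChild(doc.createElement("a"));
        Node b = root.appendChild(doc.createElement("b"));
        Node detached = doc.createElement("detached");   // never added to the tree

        return has(a, b, Node.DOCUMENT_POSITION_FOLLOWING)          // b comes after a
                + "|" + has(root, a, Node.DOCUMENT_POSITION_CONTAINED_BY) // a is inside root
                + "|" + has(a, root, Node.DOCUMENT_POSITION_CONTAINS)     // root contains a
                + "|" + has(a, detached, Node.DOCUMENT_POSITION_DISCONNECTED);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```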

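isSameNode(Node other): reports whether the two nodes are the same node, comparing references (including proxy references).

isEqualNode(Node other): reports whether the two nodes are equal, comparing content. Normalization affects the result, so normalize() is usually called on both nodes before comparing. Properties such as parentNode, ownerDocument, and userData do not affect the comparison; see the API documentation for details.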
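The following sketch contrasts the two checks and shows why normalizing first matters. It assumes the JDK's bundled DOM implementation; EqualityDemo and the node contents are made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class EqualityDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "same|equal before normalize|equal after normalize".
    static String run() {
        Document doc = newDoc();

        Element first = doc.createElement("p");
        first.setAttribute("k", "v");
        first.appendChild(doc.createTextNode("hi"));

        Element second = doc.createElement("p");
        second.setAttribute("k", "v");
        second.appendChild(doc.createTextNode("h"));   // same text, split across
        second.appendChild(doc.createTextNode("i"));   // two adjacent Text nodes

        boolean same = first.isSameNode(second);
        boolean equalBefore = first.isEqualNode(second);
        second.normalize();                            // merges "h" and "i" into "hi"
        boolean equalAfter = first.isEqualNode(second);
        return same + "|" + equalBefore + "|" + equalAfter;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```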

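Document interface

Document is the root node of the DOM tree. Because every other node type exists in the context of a Document, the Document interface also provides factory methods for creating the other node types.

Document-level operations

A Document can have only one DocumentType child and one Element child, so Document provides two methods that return these nodes directly. Its other children (ProcessingInstruction and Comment nodes) have to be read through the operations of the Node interface: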

public DocumentType getDoctype();

public Element getDocumentElement();

public DOMImplementation getImplementation();

The DOMImplementation interface provides operations that are independent of any particular DOM node:

public interface DOMImplementation {

public boolean hasFeature(String feature, String version);

public DocumentType createDocumentType(String qualifiedName,

String publicId, String systemId);

public Document createDocument(String namespaceURI, String qualifiedName,

DocumentType doctype);

public Object getFeature(String feature, String version);

}

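Every XML document can specify its encoding, its standalone attribute, and its XML version, and a Document also records whether strict error checking is enforced and the location (URI) of the document: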

public String getInputEncoding();

public String getXmlEncoding();

public boolean getXmlStandalone();

public void setXmlStandalone(boolean xmlStandalone);

public String getXmlVersion();

public void setXmlVersion(String xmlVersion);

public boolean getStrictErrorChecking();

public void setStrictErrorChecking(boolean strictErrorChecking);

public String getDocumentURI();

public void setDocumentURI(String documentURI);

Factory methods

Document provides factory methods for creating all of the other node types, including namespace-aware creation of Element and Attr nodes:

public Element createElement(String tagName);

public Element createElementNS(String namespaceURI, String qualifiedName);

public Attr createAttribute(String name);

public Attr createAttributeNS(String namespaceURI, String qualifiedName);

public DocumentFragment createDocumentFragment();

public Text createTextNode(String data);

public Comment createComment(String data);

public CDATASection createCDATASection(String data);

public ProcessingInstruction createProcessingInstruction(String target, String data);

public EntityReference createEntityReference(String name);

Finding Element nodes

1. By ID attribute: public Element getElementById(String elementId);

2. By tag name: public NodeList getElementsByTagName(String tagname);

3. By namespace URI and local tag name:

public NodeList getElementsByTagNameNS(String namespaceURI, String localName);

The NodeList interface provides the operations for iterating over the returned nodes:

public interface NodeList {

public Node item(int index);

public int getLength();

}
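A short sketch of the lookups follows. One point worth knowing is that getElementById() only matches attributes of type ID; an attribute that merely happens to be named "id" does not count unless a DTD or schema declares it as an ID, or it is marked with Element.setIdAttribute(). The example assumes the JDK's bundled DOM implementation, and FindElementsDemo and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of any library.
class FindElementsDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "item count|found before marking|found after marking".
    static String run() {
        Document doc = newDoc();
        Element root = (Element) doc.appendChild(doc.createElement("root"));
        Element first = (Element) root.appendChild(doc.createElement("item"));
        first.setAttribute("id", "x");
        root.appendChild(doc.createElement("item"));
        root.appendChild(doc.createElement("note"));

        NodeList items = doc.getElementsByTagName("item");
        int count = 0;
        for (int i = 0; i < items.getLength(); i++) {
            if (items.item(i) instanceof Element) count++;
        }

        boolean foundBefore = doc.getElementById("x") != null;  // "id" is not yet of type ID
        first.setIdAttribute("id", true);                       // mark it as an ID attribute
        boolean foundAfter = doc.getElementById("x") == first;
        return count + "|" + foundBefore + "|" + foundAfter;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```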

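Configuration and normalization

Like Node.normalize(), Document defines its own normalizeDocument(), which normalizes the DOM tree according to the current configuration (replacing EntityReference nodes with their expansion, merging adjacent Text nodes, and so on). If validation is configured, the tree is also validated during the operation. A Document is configured through a DOMConfiguration instance: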

public DOMConfiguration getDomConfig();

public interface DOMConfiguration {

public void setParameter(String name, Object value);

public Object getParameter(String name);

public boolean canSetParameter(String name, Object value);

public DOMStringList getParameterNames();

}

For example, to set the validate parameter:

DOMConfiguration docConfig = myDocument.getDomConfig();

docConfig.setParameter("validate", Boolean.TRUE);

DOMStringList provides the operations for iterating over a list of strings, much like NodeList:

public interface DOMStringList {

public String item(int index);

public int getLength();

public boolean contains(String str);

}

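Miscellaneous

public Node importNode(Node importedNode, boolean deep);

Imports a copy of a node that lives in another Document into the current Document, without changing the structure of the other Document's tree.

public Node adoptNode(Node source);

Moves a node that lives in another Document into the current Document, removing it from the other Document.

public Node renameNode(Node n, String namespaceURI, String qualifiedName);

Renames an Element or Attr node.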
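The difference between importing and adopting can be sketched with two documents. This example assumes both documents come from the JDK's bundled DOM implementation (adoptNode may return null when a node cannot be adopted, for example across implementations); ImportAdoptDemo and the element names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class ImportAdoptDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static String run() {
        Document source = newDoc();
        Element root = (Element) source.appendChild(source.createElement("root"));
        Element item = (Element) root.appendChild(source.createElement("item"));
        Document target = newDoc();

        // importNode: target receives a copy; the source tree is untouched.
        Node copy = target.importNode(item, true);
        boolean sourceKept = item.getParentNode() == root;
        boolean copyOwned = copy != item && copy.getOwnerDocument() == target;

        // adoptNode: the original node itself changes owner and leaves the source tree.
        target.adoptNode(item);
        boolean moved = item.getOwnerDocument() == target && !root.hasChildNodes();

        return "sourceKept=" + sourceKept + " copyOwned=" + copyOwned + " moved=" + moved;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In both cases the resulting node has no parent yet; it still has to be inserted into the target tree with appendChild() or insertBefore().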

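Element interface

An Element node represents a tag in the XML file, which makes it the most commonly used node type. Of all the node types, only Element can have attributes, so every attribute-specific operation is defined on the Element interface: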

public String getAttribute(String name);

public void setAttribute(String name, String value);

public void removeAttribute(String name);

public Attr getAttributeNode(String name);

public Attr setAttributeNode(Attr newAttr);

public Attr removeAttributeNode(Attr oldAttr);

public String getAttributeNS(String namespaceURI, String localName);

public void setAttributeNS(String namespaceURI, String qualifiedName, String value);

public void removeAttributeNS(String namespaceURI, String localName);

public Attr getAttributeNodeNS(String namespaceURI, String localName);

public Attr setAttributeNodeNS(Attr newAttr);

public boolean hasAttribute(String name);

public boolean hasAttributeNS(String namespaceURI, String localName);

public void setIdAttribute(String name, boolean isId);

public void setIdAttributeNS(String namespaceURI, String localName, boolean isId);

public void setIdAttributeNode(Attr idAttr, boolean isId);

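Element also provides two methods that find descendant Elements by tag name, and one that returns the tag name of the current Element. The tag name is the full name, including the namespace prefix: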

public NodeList getElementsByTagName(String name);

public NodeList getElementsByTagNameNS(String namespaceURI, String localName);

public String getTagName();
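A brief sketch of the string-based attribute methods and getTagName(), assuming the JDK's bundled DOM implementation; ElementAttrDemo, the urn:example:shop namespace, and the names used are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class ElementAttrDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static String run() {
        Document doc = newDoc();
        Element e = doc.createElementNS("urn:example:shop", "s:item");
        e.setAttribute("qty", "2");

        String qty = e.getAttribute("qty");
        String missing = e.getAttribute("missing");   // empty string, not null
        boolean hasMissing = e.hasAttribute("missing");
        e.removeAttribute("qty");
        boolean hasQty = e.hasAttribute("qty");

        // getTagName() keeps the prefix; getLocalName() drops it.
        return qty + "|" + missing + "|" + hasMissing + "|" + hasQty
                + "|" + e.getTagName() + "|" + e.getLocalName();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Since getAttribute() returns an empty string for an absent attribute, hasAttribute() is the way to tell "absent" apart from "present but empty".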

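Attr interface

An Attr node represents one attribute of an Element. Which attributes an Element may have is normally defined in the corresponding schema or DTD. Although Attr extends Node, an Attr is not a child of its Element and so is not considered part of the DOM tree; it exists only inside its Element, which is why its parentNode, previousSibling, and nextSibling are all null. An Attr can, however, have children of its own, which may be Text or EntityReference nodes.

For an Attr that has a default value defined in the schema, removing the attribute (or initializing the element) creates a new Attr whose specified flag is false and whose value is the schema default. When Document.normalizeDocument() is called, the value of every Attr whose specified flag is false is recomputed, and if the schema defines no default for it, the Attr is removed.

An attribute is a name/value pair. Once an Attr instance has been created its name cannot be changed (a new Attr must be created instead), whereas its value can. The specified property indicates whether the value was set by the user (true) or supplied as a default by the schema. The id property indicates whether this Attr is an ID attribute of its Element; the value of an ID attribute uniquely identifies an Element within a Document. Because every Attr belongs to an Element, its owner Element can be retrieved.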

public interface Attr extends Node {

public String getName();

public boolean getSpecified();

public String getValue();

public void setValue(String value);

public Element getOwnerElement();

public TypeInfo getSchemaTypeInfo();

public boolean isId();

}
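These properties can be read back from an attribute created through setAttribute(). The sketch assumes the JDK's bundled DOM implementation; AttrDemo and the attribute name are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class AttrDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static String run() {
        Document doc = newDoc();
        Element e = doc.createElement("e");
        e.setAttribute("k", "v");
        Attr k = e.getAttributeNode("k");

        boolean owned = k.getOwnerElement() == e;        // reachable via ownerElement...
        boolean noParent = k.getParentNode() == null;    // ...but not via parentNode
        boolean textChild = k.getFirstChild().getNodeType() == Node.TEXT_NODE;

        k.setValue("w");                                 // the value is mutable, the name is not
        return owned + "|" + noParent + "|" + k.getSpecified() + "|" + k.isId()
                + "|" + textChild + "|" + e.getAttribute("k");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```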

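DocumentType interface

Every Document has a doctype property, which holds the DTD information (location, file name, and so on) and also gives access to the collections of Entity and Notation nodes defined in the DTD. In other words, it wraps the following declaration in the XML file: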

<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">

Here:

name is beans

publicId is -//SPRING//DTD BEAN//EN

systemId is http://www.springframework.org/dtd/spring-beans.dtd

entities and notations are the collections of Entity and Notation nodes defined in the DTD

public interface DocumentType extends Node {

public String getName();

public NamedNodeMap getEntities();

public NamedNodeMap getNotations();

public String getPublicId();

public String getSystemId();

public String getInternalSubset();

}
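The same three values can be set and read back without parsing anything (and therefore without fetching the DTD) by creating the DocumentType through DOMImplementation. This is a sketch assuming the JDK's bundled DOM implementation; the class DocTypeDemo is hypothetical, while the identifiers are the ones from the declaration above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical demo class, not part of any library.
class DocTypeDemo {
    // Creates an empty <beans/> document that carries the DOCTYPE shown above.
    static Document build() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType(
                    "beans",
                    "-//SPRING//DTD BEAN//EN",
                    "http://www.springframework.org/dtd/spring-beans.dtd");
            // No namespace; the root element name matches the doctype name.
            return impl.createDocument(null, "beans", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentType dt = build().getDoctype();
        System.out.println(dt.getName());
        System.out.println(dt.getPublicId());
        System.out.println(dt.getSystemId());
    }
}
```

A DocumentType created this way has no parsed DTD behind it, so its entity and notation maps carry no declarations.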

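ProcessingInstruction interface

An XML file can also contain instructions that tell a processor how to handle the file correctly. The DOM models such an instruction with the ProcessingInstruction interface: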

<?xml-stylesheet href="show.css" type="text/css" ?>

ProcessingInstruction defines two additional properties, target and data:

target is xml-stylesheet

data is href="show.css" type="text/css"

public interface ProcessingInstruction extends Node {

public String getTarget();

public String getData();

public void setData(String data);

}

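CharacterData interface

CharacterData extends Node. It is the parent interface of all character-based nodes and defines the character operations they share: reading and setting the data, appending, inserting, deleting, replacing, and taking substrings.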

public interface CharacterData extends Node {

public String getData();

public void setData(String data);

public int getLength();

public String substringData(int offset, int count);

public void appendData(String arg);

public void insertData(int offset, String arg);

public void deleteData(int offset, int count);

public void replaceData(int offset, int count, String arg);

}
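The editing methods work on offsets into the node's data, much like operations on a string buffer. Here is a minimal sketch using a Text node (which is a CharacterData), assuming the JDK's bundled DOM implementation; CharacterDataDemo and the sample text are made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of any library.
class CharacterDataDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Applies each editing method in turn and records the data after every step.
    static String run() {
        CharacterData text = newDoc().createTextNode("Hello World");
        StringBuilder log = new StringBuilder();

        log.append(text.substringData(0, 5)).append('|');   // read 5 characters from offset 0
        text.appendData("!");
        log.append(text.getData()).append('|');
        text.insertData(5, ",");                            // insert at offset 5
        log.append(text.getData()).append('|');
        text.deleteData(0, 7);                              // delete 7 characters from offset 0
        log.append(text.getData()).append('|');
        text.replaceData(0, 5, "DOM");                      // replace 5 characters from offset 0
        log.append(text.getData()).append('|');
        log.append(text.getLength());
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```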

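Text interface

Text extends CharacterData and represents a text node. It usually appears as a child of an Element or Attr and has no children of its own. A Text node holds text such as the textual content of an Element or the value of an Attr. Special characters in the text (such as '<' and '>') must be escaped in the XML source. While manipulating a DOM tree, a user may insert several Text nodes next to each other; Node.normalize() merges adjacent Text nodes.

Besides the character data operations, Text provides a few additional ones:

splitText(): splits the Text node into two adjacent Text nodes; the newly created node becomes the next sibling of the original.

isElementContentWhitespace(): reports whether the Text node contains element content whitespace (often loosely called "ignorable whitespace"), that is, whitespace that appears inside element-only content. This is determined while the document is loaded, or when validation occurs during Document.normalizeDocument().

getWholeText(): when several Text nodes are adjacent, returns the concatenated text of all of them.

replaceWholeText(): replaces all adjacent Text nodes with a single node holding the new text (possibly the current node itself). If one of the nodes cannot be removed (for example, a node inside an EntityReference), a DOMException is thrown.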

public interface Text extends CharacterData {

public Text splitText(int offset);

public boolean isElementContentWhitespace();

public String getWholeText();

public Text replaceWholeText(String content);

}
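The split and whole-text operations can be sketched as follows, assuming the JDK's bundled DOM implementation; TextSplitDemo and the sample text are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical demo class, not part of any library.
class TextSplitDemo {
    static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static String run() {
        Document doc = newDoc();
        Element p = doc.createElement("p");
        Text head = (Text) p.appendChild(doc.createTextNode("HelloWorld"));

        Text tail = head.splitText(5);       // head keeps "Hello", tail gets "World"
        String afterSplit = head.getData() + "|" + tail.getData()
                + "|" + (head.getNextSibling() == tail)
                + "|" + p.getChildNodes().getLength()
                + "|" + head.getWholeText(); // text of both adjacent nodes

        head.replaceWholeText("Bye");        // both adjacent nodes collapse into one
        return afterSplit + "|" + p.getChildNodes().getLength() + "|" + p.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```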

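CDATASection interface

CDATASection extends Text and is similar to a Text node. The difference is that the content of a CDATASection is wrapped in <![CDATA[Content that need not be escaped]]>, so special characters in it do not have to be escaped. Unlike Text nodes, two adjacent CDATASection nodes are not merged by Node.normalize().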

public interface CDATASection extends Text {

}

Comment interface

Comment extends CharacterData and models a comment in the XML file. It holds only the comment text and defines no additional behavior:

public interface Comment extends CharacterData {

}

Entity interface

The Entity interface models an entity declaration, that is, a declaration like the following in a DTD:

<!ENTITY JENN SYSTEM "http://images.about.com/sites/guidepics/html.gif" NDATA gif>

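The Entity interface exposes systemId, publicId, notationName, and similar information; anything else appears in its child nodes. For example, an Entity may point to an external file or define its value directly (for the latter, I always seem to get null back; according to reports online this is a Xerces bug), as in these declarations: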

<!ENTITY name "cnblog">

<!ENTITY copyright SYSTEM "copyright.desc">

Entity also exposes some encoding and XML version information:

public interface Entity extends Node {

public String getPublicId();

public String getSystemId();

public String getNotationName();

public String getInputEncoding();

public String getXmlEncoding();

public String getXmlVersion();

}

EntityReference interface

An EntityReference node represents a reference to an Entity.

public interface EntityReference extends Node {

}

Notation interface

The Notation interface models a notation declaration in a DTD:

<!NOTATION gif SYSTEM "image/gif">

A Notation carries a name (nodeName), a systemId, and a publicId:

public interface Notation extends Node {

public String getPublicId();

public String getSystemId();

}

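DocumentFragment interface

DocumentFragment models a fragment of a DOM tree, so that a group of nodes can be handled as a single collection. When a fragment is inserted under a node in a Document, for example, what is actually inserted is all of the fragment's children. Treating part of a DOM tree as a unit could also be done with a Document, but in some implementations Document is a heavyweight object, whereas a DocumentFragment is guaranteed to be lightweight because it is not subject to the constraints that apply to a Document. That is the reason DocumentFragment exists. It is defined as follows: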
public interface DocumentFragment extends Node {

}