Pitfalls of the Java URL Class

Date: 2019-12-04

Background

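I have recently been building an RSS reader for my own use. One step fetches a JSON file containing the list of RSS feeds from a server, then downloads and parses the feeds listed in that file. The core code is as follows: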

class PresenterImpl(val context: Context, val activity: MainActivity) : IPresenter {
    private val URL_API = "https://vimerzhao.github.io/others/rssreader/RSS.json"

    override fun getRssResource(): RssSource {
        val gson = GsonBuilder().create()
        return gson.fromJson(getFromNet(URL_API), RssSource::class.java)
    }

    private fun getFromNet(url: String): String {
        val result = URL(url).readText()
        return result
    }

    ......
}

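This had always worked well until two days ago, when I bought the domain vimerzhao.top and redirected the old domain vimerzhao.github.io to it. The tool then stopped working, even though entering URL_API in a browser still returned the data.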

So why did URL.readText() fail to get the data?

Redirects are not followed

This can be tested with the code below:

import java.net.*;
import java.io.*;

public class TestRedirect {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/others/rssreader/RSS.json");
            URL url2 = new URL("http://vimerzhao.top/others/rssreader/RSS.json");
            read(url1);
            System.out.println("=--------------------------------=");
            read(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The result is as follows:

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>
=--------------------------------=
{"theme":"tech","author":"zhaoyu","email":"dutzhaoyu@gmail.com","version":"0.01","contents":[{"category":"综合版块","websites":[{"tag":"门户网站","url":["http://geek.csdn.net/admin/news_service/rss","http://blog.jobbole.com/feed/","http://feed.cnblogs.com/blog/sitehome/rss","https://segmentfault.com/feeds","http://www.codeceo.com/article/category/pick/feed"]},{"tag":"知名社区","url":["https://stackoverflow.com/feeds","https://www.v2ex.com/index.xml"]},{"tag":"官方博客","url":["https://www.blog.google/rss/","https://blog.jetbrains.com/feed/"]},{"tag":"我的博客-行业","url":["http://feed.williamlong.info/","https://www.liaoxuefeng.com/feed/articles"]},{"tag":"我的博客-学术","url":["http://www.norvig.com/rss-feed.xml"]}]},{"category":"编程语言","websites":[{"tag":"Kotlin","url":["https://kotliner.cn/api/rss/latest"]},{"tag":"Python","url":["https://www.python.org/dev/peps/peps.rss/"]},{"tag":"Java","url":["http://www.codeceo.com/article/category/develop/java/feed"]}]},{"category":"行业动态","websites":[{"tag":"Android","url":["http://www.codeceo.com/article/category/develop/android/feed"]}]},{"category":"乱七八遭","websites":[{"tag":"Linux-综合","url":["https://linux.cn/rss.xml","http://www.linuxidc.com/rssFeed.aspx","http://www.codeceo.com/article/tag/linux/feed"]},{"tag":"Linux-发行版","url":["https://blog.linuxmint.com/?feed=rss2","https://manjaro.github.io/feed.xml"]}]}]}

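The HTTP status code is 301, meaning a redirect occurred. In a browser this happens too quickly for us to ever see the 301 page. Note that URL.readText() is a Kotlin extension function that ultimately calls the openStream method of the URL class. Part of its source: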

.....
/**
 * Reads the entire content of this URL as a String using UTF-8 or the specified [charset].
 *
 * This method is not recommended on huge files.
 *
 * @param charset a character set to use.
 * @return a string with this URL entire content.
 */
@kotlin.internal.InlineOnly
public inline fun URL.readText(charset: Charset = Charsets.UTF_8): String = readBytes().toString(charset)

/**
 * Reads the entire content of the URL as byte array.
 *
 * This method is not recommended on huge files.
 *
 * @return a byte array with this URL entire content.
 */
public fun URL.readBytes(): ByteArray = openStream().use { it.readBytes() }

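So the test code above explains why URL.readText() failed.
Whether it is reasonable for URL not to follow this redirect, and why it does not, remains to be investigated. One likely factor: HttpURLConnection follows redirects by default, but not when the redirect changes protocol, and here the old address uses https while the new address used in the test (url2) uses http.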
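One possible workaround is to follow the redirect by hand. The sketch below is not from the original post: the class name FollowRedirects, the helper names, and the limit of 5 hops are illustrative assumptions. It turns automatic redirects off, reads the Location header on 3xx responses, and retries against the new address, which also works when the redirect switches between https and http.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FollowRedirects {
    // Arbitrary safety limit (an assumption) to avoid redirect loops.
    private static final int MAX_HOPS = 5;

    /** Returns true for the HTTP status codes that carry a Location header to follow. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location header, absolute or relative, against the current URL. */
    static URL resolve(URL current, String location) throws IOException {
        return new URL(current, location);
    }

    /** Fetches an http/https URL as text, following redirects manually. */
    static String fetch(URL url) throws IOException {
        for (int hop = 0; hop <= MAX_HOPS; hop++) {
            // The cast assumes an http or https URL.
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header");
                }
                url = resolve(url, location);
                continue;
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            }
        }
        throw new IOException("Too many redirects");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java FollowRedirects <url>");
            return;
        }
        System.out.println(fetch(new URL(args[0])));
    }
}
```

Following a redirect from https to http silently downgrades the connection, which is presumably why the JDK does not do it automatically, so an application may want to allow it only for hosts it trusts.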

The unstable `equals` method

First, look at the documentation of equals (URL (Java Platform SE 7)):

Compares this URL for equality with another object.
If the given object is not a URL then this method immediately returns false.
Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
Since hosts comparison requires name resolution, this operation is a blocking operation.
Note: The defined behavior for equals is known to be inconsistent with virtual hosting in HTTP.

Next, consider this code:

import java.net.*;
public class TestEquals {
    public static void main(String args[]) {
        try {
            // vimerzhao's blog home page
            URL url1 = new URL("https://vimerzhao.github.io/");
            // zhanglanqing's blog home page
            URL url2 = new URL("https://zhanglanqing.github.io/");
            // the domain that vimerzhao's blog home page now redirects to
            URL url3 = new URL("http://vimerzhao.top/");
            System.out.println(url1.equals(url2));
            System.out.println(url1.equals(url3));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What should the output be according to that definition? Running it prints:

true
false

You may have guessed right. But when I disconnected my computer from the network and ran it again, the result was:

false
false

Yet all three domain names actually resolve to the same IP address, as ping shows:

zhaoyu@Inspiron ~/Project $ ping vimezhao.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 396.692/396.692/396.692/0.000 ms
zhaoyu@Inspiron ~/Project $ ping zhanglanqing.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
rtt min/avg/max/mdev = 396.009/396.009/396.009/0.000 ms
zhaoyu@Inspiron ~/Project $ ping vimezhao.top
ping: unknown host vimezhao.top
zhaoyu@Inspiron ~/Project $ ping vimerzhao.top
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=409 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 409.978/409.978/409.978/0.000 ms

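Consider the connected case first. vimerzhao.github.io and zhanglanqing.github.io are my blog and a classmate's blog. Their content differs, but they resolve to the same IP and share the same protocol, port, and so on, so they compare equal. vimerzhao.github.io and vimerzhao.top point to the same blog, but one uses https and the other http, so the protocols differ and they compare unequal. I believe this runs against most people's intuition: URLs pointing to different blogs are equal, while URLs pointing to the same blog are not!
Now for the offline result. First, the source of URL: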

public boolean equals(Object obj) {
        if (!(obj instanceof URL))
            return false;
        URL u2 = (URL)obj;

        return handler.equals(this, u2);
    }

Then the source of the handler object (a URLStreamHandler):

protected boolean equals(URL u1, URL u2) {
        String ref1 = u1.getRef();
        String ref2 = u2.getRef();
        return (ref1 == ref2 || (ref1 != null && ref1.equals(ref2))) &&
               sameFile(u1, u2);
    }

The source of sameFile:

protected boolean sameFile(URL u1, URL u2) {
        // Compare the protocols.
        if (!((u1.getProtocol() == u2.getProtocol()) ||
              (u1.getProtocol() != null &&
               u1.getProtocol().equalsIgnoreCase(u2.getProtocol()))))
            return false;

        // Compare the files.
        if (!(u1.getFile() == u2.getFile() ||
              (u1.getFile() != null && u1.getFile().equals(u2.getFile()))))
            return false;

        // Compare the ports.
        int port1, port2;
        port1 = (u1.getPort() != -1) ? u1.getPort() : u1.handler.getDefaultPort();
        port2 = (u2.getPort() != -1) ? u2.getPort() : u2.handler.getDefaultPort();
        if (port1 != port2)
            return false;

        // Compare the hosts.
        if (!hostsEqual(u1, u2))
            return false;// reached when there is no network connection

        return true;
    }

Finally, the source of hostsEqual:

protected boolean hostsEqual(URL u1, URL u2) {
        InetAddress a1 = getHostAddress(u1);
        InetAddress a2 = getHostAddress(u2);
        // if we have internet address for both, compare them
        if (a1 != null && a2 != null) {
            return a1.equals(a2);
        // else, if both have host names, compare them
        } else if (u1.getHost() != null && u2.getHost() != null)
            return u1.getHost().equalsIgnoreCase(u2.getHost());
         else
            return u1.getHost() == null && u2.getHost() == null;
    }

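With a network connection, a1 and a2 are both non-null, so return a1.equals(a2) runs and returns true. Without a network, the second branch, return u1.getHost().equalsIgnoreCase(u2.getHost());, runs instead. The host of url1 (vimerzhao.github.io) and the host of url2 (zhanglanqing.github.io) clearly differ, so it returns false, which makes if (!hostsEqual(u1, u2)) true, and return false executes.
So the equals method of the URL class is not only counterintuitive but also inconsistent: it gives different results in different environments, which is very dangerous!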
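If the goal is simply to compare addresses, java.net.URI is a possible alternative, because its equals works on the parsed components only and never resolves host names. The sketch below is an illustration rather than the author's code; the class name CompareWithUri and the helper sameLocation are made up for this example.

```java
import java.net.URI;

public class CompareWithUri {
    /**
     * Compares two addresses as URIs. URI.equals compares scheme and host
     * case-insensitively and the path case-sensitively, with no DNS lookup,
     * so the result does not depend on network state.
     */
    static boolean sameLocation(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Different hosts that happen to share an IP address: not equal.
        System.out.println(sameLocation(
                "https://vimerzhao.github.io/", "https://zhanglanqing.github.io/"));
        // Same host written with different case: equal.
        System.out.println(sameLocation(
                "https://vimerzhao.github.io/", "https://VIMERZHAO.github.io/"));
        // Different scheme and host: not equal, online or offline.
        System.out.println(sameLocation(
                "https://vimerzhao.github.io/", "http://vimerzhao.top/"));
    }
}
```

This does not make vimerzhao.github.io and vimerzhao.top equal either, but the answer is at least the same with and without a network connection.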

The slow `equals` method

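In addition, equals is a slow operation, because with a network connection it has to perform DNS resolution. The same goes for hashCode(), which is used as the example here. The source of hashCode() in the URL class: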

public synchronized int hashCode() {
        if (hashCode != -1)
            return hashCode;

        hashCode = handler.hashCode(this);
        return hashCode;
    }

The hashCode() method of the handler object:

protected int hashCode(URL u) {
        int h = 0;

        // Generate the protocol part.
        String protocol = u.getProtocol();
        if (protocol != null)
            h += protocol.hashCode();

        // Generate the host part.
        InetAddress addr = getHostAddress(u);
        if (addr != null) {
            h += addr.hashCode();
        } else {
            String host = u.getHost();
            if (host != null)
                h += host.toLowerCase().hashCode();
        }

        // Generate the file part.
        String file = u.getFile();
        if (file != null)
            h += file.hashCode();

        // Generate the port part.
        if (u.getPort() == -1)
            h += getDefaultPort();
        else
            h += u.getPort();

        // Generate the ref part.
        String ref = u.getRef();
        if (ref != null)
            h += ref.hashCode();

        return h;
    }

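Here getHostAddress() can take a long time. So storing URL objects in a hash-based container is simply a disaster. The code below compares URL and URI when each is added to a HashSet 50 times: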

import java.net.*;
import java.util.*;

public class TestHash {
    public static void main(String args[]) {
        HashSet<URL> list1 = new HashSet<>();
        HashSet<URI> list2 = new HashSet<>();
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URI url2 = new URI("https://zhanglanqing.github.io/");
            long cur = System.currentTimeMillis();
            int cnt = 50;
            for (int i = 0; i < cnt; i++) {
                list1.add(url1);
            }
            System.out.println(System.currentTimeMillis() - cur);
            cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                list2.add(url2);
            }
            System.out.println(System.currentTimeMillis() - cur);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output:

271
0

So it is best not to use URL in hash-based containers.
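A common way to act on this advice is to key the container by URI and convert to URL only at the moment a connection is needed. The following is a minimal sketch under that assumption; the class name UriSetDemo and the method distinct are hypothetical names for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UriSetDemo {
    /** Deduplicates links using URI keys; hashing a URI never touches the network. */
    static Set<URI> distinct(List<String> links) {
        Set<URI> seen = new HashSet<>();
        for (String link : links) {
            seen.add(URI.create(link));
        }
        return seen;
    }

    public static void main(String[] args) throws MalformedURLException {
        Set<URI> feeds = distinct(Arrays.asList(
                "https://vimerzhao.github.io/",
                "https://vimerzhao.github.io/",
                "https://zhanglanqing.github.io/"));
        System.out.println(feeds.size());
        for (URI uri : feeds) {
            // Convert only where a URL is actually required, e.g. to open a stream.
            URL url = uri.toURL();
            System.out.println(url.getHost());
        }
    }
}
```

The two github.io hosts stay distinct in the set even though they resolve to the same IP address, which matches what most callers expect from a set of links.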

The effect of `TrailingSlash`

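The so-called TrailingSlash is the slash at the end of a domain name. For example, the browser shows vimerzhao.top, but copying and pasting it yields http://vimerzhao.top/. First, test with the code below: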

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URL url2 = new URL("https://vimerzhao.github.io");
            System.out.println(url1.equals(url2));
            outputInfo(url1);
            outputInfo(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void outputInfo(URL url) {
        System.out.println("------" + url.toString() + "----------");
        System.out.println(url.getRef());
        System.out.println(url.getFile());
        System.out.println(url.getHost());
        System.out.println("----------------");
    }
}

The result is as follows:

false
------https://vimerzhao.github.io/----------
null
/
vimerzhao.github.io
----------------
------https://vimerzhao.github.io----------
null

vimerzhao.github.io
----------------

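In fact, whether read with the earlier read() method or entered directly in the address bar, url1 and url2 return the same content. But a trailing / indicates a directory, while its absence indicates a file, so getFile() returns different values for the two and equals returns false. When typing in the address bar you would not even notice the TrailingSlash, and the returned content is the same, yet equals returns false. It is really hard to guard against!
This raises another question: if one is a file and the other a directory, why do both return the same result?
After some investigation, I found that if the request ends with /, the server looks for index.html in that directory. If it does not, taking vimerzhao.top/tags as an example, the server first looks for tags, and if that is not found it automatically appends a / and then looks for index.html in the tags directory (illustrated with a figure in the original post).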
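To keep the bare-domain case from tripping up comparisons, one option is to normalize an empty path to / before comparing. The sketch below is an assumption-laden illustration, not part of the original post: TrailingSlashNormalizer and withRootPath are invented names, and it works on URI rather than URL so that the comparison itself needs no DNS lookup. It deliberately leaves paths such as /tags alone, since only the server knows whether /tags and /tags/ are the same resource.

```java
import java.net.URI;

public class TrailingSlashNormalizer {
    /**
     * Returns the URI with an empty path replaced by "/", so that
     * "https://host" and "https://host/" compare equal. Any other URI
     * (opaque, relative, or with a non-empty path) is returned unchanged.
     */
    static URI withRootPath(URI uri) {
        if (uri.isOpaque() || uri.getScheme() == null
                || uri.getRawAuthority() == null || !uri.getRawPath().isEmpty()) {
            return uri;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append('/');
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        URI a = withRootPath(URI.create("https://vimerzhao.github.io/"));
        URI b = withRootPath(URI.create("https://vimerzhao.github.io"));
        System.out.println(a.equals(b));
        // A non-empty path is left untouched.
        System.out.println(withRootPath(URI.create("http://vimerzhao.top/tags")));
    }
}
```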

Here is an interesting test. Write the following two programs:

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithSlash = new URL("http://vimerzhao.top/tags/");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

import java.net.*;
import java.io.*;

public class TestWithoutTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithoutSlash = new URL("http://vimerzhao.top/tags");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithoutSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Test them with the following script:

#!/bin/bash
# {1..20} is bash brace expansion; append with >> so every run's timing is kept
for i in {1..20}; do
    java TestTrailingSlash >> out1
    java TestWithoutTrailingSlash >> out2
done

Putting the printed times into a table:

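As can be seen, the version with the trailing / is faster, because it skips the check for whether a tags file exists. The takeaway: it is best to add the trailing / to a URL!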
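The extra cost can be reproduced without the real site. The sketch below assumes, and this is only an assumption since the post does not show the server's responses, that the server answers the slashless path with a 301 redirect to the path ending in /. It starts a throwaway local server using the JDK's built-in com.sun.net.httpserver.HttpServer and counts how many requests a single openConnection() read triggers. The class name TrailingSlashRoundTrips and the method requestsNeeded are illustrative.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class TrailingSlashRoundTrips {
    /** Returns how many HTTP requests the client sends to fetch the given path. */
    static int requestsNeeded(String path) throws Exception {
        AtomicInteger hits = new AtomicInteger();
        // Port 0 lets the OS pick a free port on the loopback interface.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            hits.incrementAndGet();
            if (exchange.getRequestURI().getPath().equals("/tags")) {
                // Assumed server behavior: redirect the slashless path.
                exchange.getResponseHeaders().add("Location", "/tags/");
                exchange.sendResponseHeaders(301, -1);
            } else {
                byte[] body = "index".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + path);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) {
                    // Drain the response body.
                }
            }
            return hits.get();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("/tags/ -> " + requestsNeeded("/tags/") + " request(s)");
        System.out.println("/tags  -> " + requestsNeeded("/tags") + " request(s)");
    }
}
```

Because this redirect stays on the same protocol, HttpURLConnection follows it automatically, so under this assumption the slashless URL costs one additional round trip per read.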

That is all: a few pitfalls I discovered this weekend.
