Defending my website against XSS (cross-site scripting, with a focus on the rich-text editor case) and SQL injection

Date: 2019-11-18


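Yesterday this blog was hit by an XSS (cross-site scripting) attack and was compromised within three minutes. The attacker's method was actually very simple, with little technical skill involved. I can only marvel that I had no defenses at all before.

These are some of the records left in the database. In the end the attacker injected a script that pops up alert boxes in an infinite loop; I expect he will not be able to submit that script again.

Something like this: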

<html>
     <body onload='while(true){alert(1)}'>
     </body>
</html>

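I immediately realized how serious this was: it showed that my blog had a severe security problem. XSS can lead to users' cookies, and even the user information held in server sessions, being hijacked, with serious consequences, even though the attacker needs nothing more than a few unsophisticated scripts to pull it off.

The next day I spent some time learning how to defend against it, and looked into SQL injection along the way.

SQL injection comes from concatenating SQL statements, so user input needs to be passed as parameters. Because I use JPA rather than hand-built query strings, SQL concatenation is not an issue here, but it is still better to validate some of the user input. My blog system is not complicated: four tables in total, Article, User, Message and Comment.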
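To make "parameterized" concrete, here is a minimal sketch using plain JDBC. It is not the code this blog runs (the blog goes through JPA); the class name UserLookup, the table blog_user and the column username are assumptions made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper: checks whether a user exists, using a bound parameter. */
final class UserLookup {

    // The "?" placeholder is filled in by the driver, never by string concatenation.
    // Table and column names are assumptions for illustration only.
    static final String FIND_BY_NAME = "SELECT id FROM blog_user WHERE username = ?";

    /** Returns true if a row with the given user name exists. */
    static boolean userExists(Connection connection, String username) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(FIND_BY_NAME)) {
            statement.setString(1, username); // the input travels as data, not as SQL text
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next();
            }
        }
    }
}
```

With JPA the same idea is usually expressed with named or positional parameters bound through setParameter, instead of splicing the value into the JPQL string.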

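The only user-supplied values that reach database queries are the user name, the password and the article title. Values generated on the back end, such as the article date, can be ignored.

These three fields can be validated with a custom annotation.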

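import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import javax.validation.Constraint;
import javax.validation.ConstraintValidator;
import javax.validation.ConstraintValidatorContext;
import javax.validation.Payload;

/**
 * @ClassName: IsValidString
 * @Description: Custom annotation for validating parameters passed between the
 *               front end and the back end; checks for illegal characters
 * @author 无名
 * @date 2016-7-25 8:22:58 PM
 * @version 1.0
 */
@Target({ElementType.FIELD, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Constraint(validatedBy = IsValidString.ValidStringChecker.class)
@Documented
public @interface IsValidString {
    String message() default "The string is invalid.";

    Class<?>[] groups() default {};

    Class<? extends Payload>[] payload() default {};

    class ValidStringChecker implements ConstraintValidator<IsValidString, String> {
        @Override
        public void initialize(IsValidString arg0) {
        }

        @Override
        public boolean isValid(String strValue, ConstraintValidatorContext context) {
            // Put the validation logic here
            return true;
        }
    }
}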

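Once the custom annotation is defined, it is enough to put @IsValidString on the corresponding entity fields.

But since I have not yet worked out how to intercept the exception raised when the annotation's validation fails, I do the check in the controller class for now.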

    public static boolean contains_sqlinject_illegal_ch(String str_input) {
        //"[`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~！@#￥%……&*（）——+|{}【】‘；：”“’。，、？]"
        String regEx = "['=<>;\"]";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(str_input);
        // true if the input contains any of the blocked characters
        return m.find();
    }

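The blocked characters are ' " = < > ;

I figured these would be enough. Blocking < and > also takes care of XSS for these fields along the way.

XSS in general, however, still gives me a real headache. My blog uses the wangEditor web rich-text editor, and what it sends to the back end contains many legitimate HTML tags that express the article's formatting, so characters like < and > cannot simply be filtered across the board.

For example, type <html><body onload='while(true){alert(1)}'></body></html> into the editor and submit. What the back end receives is:

The escaped &lt; and &gt; in the middle are legitimate entities that the page displays as < and >, while the markup around them is the normal HTML that the editor generates to control formatting.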
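The escaping the editor performs here can also be done on the server for any text that should be shown literally. The following is a minimal sketch of that idea; the class HtmlEscaper is hypothetical, not part of wangEditor or of this blog's code, and it only covers text and quoted attribute values.

```java
/** Minimal HTML text escaper; a sketch, not a complete sanitizer. */
final class HtmlEscaper {

    /** Replaces the characters that are significant in HTML text and quoted attributes. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Escaped this way, the attack string is rendered as visible text instead of being parsed as markup. It does not help with rich text, where some tags have to stay live.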

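The problem is that if someone clicks the editor's "source code" button and types <html><body onload='while(true){alert(1)}'></body></html> in among the normal HTML tags the editor generates, the back end receives <html><body onload='while(true){alert(1)}'></body></html> completely unmodified: the < and > are not turned into &lt; and &gt;.

This is a headache. I kept wondering why the editor offers this pointless view-source feature at all, since it means < and > cannot be filtered uniformly.

Given this, all I could do was filter out a set of HTML tags I was sure were harmful, and as everyone knows, this kind of blacklist check is not secure enough. (2016-12-30: the function below is definitely inadequate and badly written. Later in this post it is thrown out and replaced by a whitelist check implemented with regular expressions.)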

    /*
     * Cross-site scripting (XSS) is a type of computer security vulnerability
     * typically found in web applications. XSS enables attackers to inject
     * client-side scripts into web pages viewed by other users. A cross-site
     * scripting vulnerability may be used by attackers to bypass access
     * controls such as the same-origin policy. Cross-site scripting carried out
     * on websites accounted for roughly 84% of all security vulnerabilities
     * documented by Symantec as of 2007. Their effect may range from a petty
     * nuisance to a significant security risk, depending on the sensitivity of
     * the data handled by the vulnerable site and the nature of any security
     * mitigation implemented by the site's owner.(From en.wikipedia.org)
     */
    public static boolean contains_xss_illegal_str(String str_input) {
        // Blacklist: reject input containing any of these tag prefixes, raw or URL-encoded.
        return str_input.contains("<html") || str_input.contains("<HTML")
                || str_input.contains("<body") || str_input.contains("<BODY")
                || str_input.contains("<script") || str_input.contains("<SCRIPT")
                || str_input.contains("<link") || str_input.contains("<LINK")
                || str_input.contains("%3Cscript") || str_input.contains("%3Chtml")
                || str_input.contains("%3Cbody") || str_input.contains("%3Clink")
                || str_input.contains("%3CSCRIPT") || str_input.contains("%3CHTML")
                || str_input.contains("%3CBODY") || str_input.contains("%3CLINK")
                || str_input.contains("<META") || str_input.contains("<meta")
                || str_input.contains("%3Cmeta") || str_input.contains("%3CMETA")
                || str_input.contains("<style") || str_input.contains("<STYLE")
                || str_input.contains("%3CSTYLE") || str_input.contains("%3Cstyle")
                || str_input.contains("<xml") || str_input.contains("<XML")
                || str_input.contains("%3Cxml") || str_input.contains("%3CXML");
    }
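To see why a substring blacklist like this is not enough, here is a small sketch. BlacklistDemo is a hypothetical, cut-down copy of the same idea (only four of the tokens), written just to show two inputs that slip through.

```java
import java.util.Arrays;
import java.util.List;

/** Demonstrates why substring blacklists miss trivial variations. */
final class BlacklistDemo {

    // A reduced copy of the blacklist idea: exact lower-case and upper-case substrings only.
    private static final List<String> BLOCKED =
            Arrays.asList("<script", "<SCRIPT", "<body", "<BODY");

    /** Returns true if the input contains one of the blocked substrings. */
    static boolean blockedByBlacklist(String input) {
        for (String token : BLOCKED) {
            if (input.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

Mixed-case input such as <ScRiPt> is not caught, and neither is <img src=x onerror=alert(1)>, which needs no blacklisted tag at all. Every new variation means another entry in the list.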

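I am considering removing the editor's view-source feature altogether.

Beyond that, I still need to study XSS defense systematically. I have started reading the book 《白帽子讲web安全》 and think it is a good one.

I will add to this post when I have new insights.

Addendum, 2016-12-30:

Today I read 《白帽子讲web安全》 and learned a lot. It covers the rich-text editor case: because a rich-text editor itself emits some ordinary HTML tags, a whitelist check is needed. Only tags known to be safe are allowed; everything other than the tags the editor uses is filtered out. This is the whitelist approach, and it is the sound one.

In the afternoon I also worked on a regular expression: <([^(a)(img)(div)(p)(span)(pre)(br)(code)(b)(u)(i)(strike)(font)(blockquote)(ul)(li)(ol)(table)(tr)(td)(/)][^>]*)> (night of 2016-12-30 to 2016-12-31: I found that this regex is wrong; see the further addenda below)

[^...] means "not".

The regex above is meant to match whenever the input contains a tag other than a, img, div, and so on.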

    /*
     * Cross-site scripting (XSS) is a type of computer security vulnerability
     * typically found in web applications. XSS enables attackers to inject
     * client-side scripts into web pages viewed by other users. A cross-site
     * scripting vulnerability may be used by attackers to bypass access
     * controls such as the same-origin policy. Cross-site scripting carried out
     * on websites accounted for roughly 84% of all security vulnerabilities
     * documented by Symantec as of 2007. Their effect may range from a petty
     * nuisance to a significant security risk, depending on the sensitivity of
     * the data handled by the vulnerable site and the nature of any security
     * mitigation implemented by the site's owner.(From en.wikipedia.org)
     */
    public static boolean contains_xss_illegal_str(String str_input) {
        final String REGULAR_EXPRESSION =
                "<([^(a)(img)(div)(p)(span)(pre)(br)(code)(b)(u)(i)(strike)(font)(blockquote)(ul)(li)(ol)(table)(tr)(td)(/)][^>]*)>";
        Pattern pattern = Pattern.compile(REGULAR_EXPRESSION);
        Matcher matcher = pattern.matcher(str_input);
        return matcher.find();
    }

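Addendum, night of 2016-12-30 to 2016-12-31:

Testing showed that the regular expression above does not work. I also found that this regex is very hard to write and takes real skill, at least for a beginner like me who is not even comfortable with basic regular expressions.

This kind of "not" cannot be expressed simply with the [^...] mentioned above, because that form cannot negate a whole string. For example, (a[^bc]d) means that the single character between a and d must not be b or c.

To negate a string, an expression of this form is needed: ^(?!.*helloworld).*$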
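The difference is easy to check in code. The sketch below (NegationDemo is a name made up for this example) compiles the first-attempt pattern and the lookahead pattern side by side. Because [^...] is a set of single characters, every letter of every whitelisted tag name ends up in the excluded set, including the s from span, so <script> is never flagged.

```java
import java.util.regex.Pattern;

final class NegationDemo {

    // The first attempt: the parentheses and letters are all just members of one
    // excluded-character set, not a list of alternative tag names.
    static final Pattern CHAR_CLASS = Pattern.compile(
            "<([^(a)(img)(div)(p)(span)(pre)(br)(code)(b)(u)(i)(strike)(font)"
            + "(blockquote)(ul)(li)(ol)(table)(tr)(td)(/)][^>]*)>");

    // Negative lookahead: matches a whole string only if it does not contain "helloworld".
    static final Pattern NOT_CONTAINING = Pattern.compile("^(?!.*helloworld).*$");

    /** True if the first-attempt pattern reports an illegal tag. */
    static boolean flaggedByCharClass(String html) {
        return CHAR_CLASS.matcher(html).find();
    }

    /** True if the text does not contain "helloworld". */
    static boolean lacksHelloworld(String text) {
        return NOT_CONTAINING.matcher(text).matches();
    }
}
```

flaggedByCharClass("<script>alert(1)</script>") returns false, while flaggedByCharClass("<html>") returns true only because h happens not to appear in any whitelisted name.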

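With that as a basis, the following regex expresses an HTML tag that is not <p>:

<((?!p)[^>])> Here [^>] means there is exactly one character between < and >, and (?!p) requires that character not to be p.

Written as <((?!p)[^>]*)>, it allows any number of characters, the first of which must not be p.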

    @Test
    public void test_Xss_check() {
        System.out.println("begin");
        String str_input = "<p>";
        final String REGULAR_EXPRESSION = "<((?!p)[^>])>";
        Pattern pattern = Pattern.compile(REGULAR_EXPRESSION);
        Matcher matcher = pattern.matcher(str_input);
        // "<p>" is rejected by the (?!p) lookahead, so "yes" is not printed.
        if (matcher.find()) {
            System.out.println("yes");
        }
    }

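So how do we match HTML tags that are neither AA nor BB?

<((?!p)(?!a)[^>]*)> matches HTML tags that start with neither p nor a!

What we actually need to match is HTML tags that are not <p>, not <ul>, not <li>, and so on, and that do not start with <a , <img , </ and so on. How should that be written?

Start with a simple example: <(((?!p )(?!a )[^>]*)((?!p)(?!a).))> matches HTML tags that are not <p xxx>, not <a xxxx>, not <p> and not <a>.

For example, the string <pasd> matches, while <p xxx> does not and <p> does not. One imprecision, however, is that <ppp> and <aaa> do not match either. There are other problems too; for instance, I do not know how to express "a tag that is not <table>".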
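These cases can be checked directly. The sketch below wraps the combined pattern in a hypothetical helper class, LookaheadTagDemo, and tests a whole tag at a time.

```java
import java.util.regex.Pattern;

final class LookaheadTagDemo {

    // The combined pattern from the text: two lookaheads guard the start of the tag,
    // and two more guard the last character before ">".
    private static final Pattern NOT_P_OR_A =
            Pattern.compile("<(((?!p )(?!a )[^>]*)((?!p)(?!a).))>");

    /** True if the whole string is a tag accepted by the pattern. */
    static boolean matches(String tag) {
        return NOT_P_OR_A.matcher(tag).matches();
    }
}
```

matches("<pasd>") is true, while matches("<p xxx>"), matches("<p>"), matches("<ppp>") and matches("<aaa>") are all false. The last two are the imprecision noted above: the pattern only looks at the final character, so any tag name ending in p or a is excluded.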

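In short, this regex felt very hard to write and beyond my ability. So in the end I decided to use a regex to pick out the HTML tags first, and then do the whitelist filtering in Java code.

The regex used to pick out HTML tags is <(?!a )(?!p )(?!img )(?!code )(?!span )(?!pre )(?!font )(?!/)[^>]*>. Its matches exclude <a xxx>, <img xx>, closing tags such as </...> and so on, because those are treated as legal by default. The tags it does find are stored in a List and then checked against the whitelist.

The code is as follows: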

    @Test
    public void test_Xss_check() {
        String str_input =
                "<a ss><script>sds<body><a></adsd><d/s><p dsd><pp><a><dsds>dsdas<font ds>" +
                "<fontdsdsd><font>das<oooioacc><pp sds><script><code ><br><code><ccc><abug>";
        System.out.println("String inputed:" + str_input);
        final String REGULAR_EXPRESSION =
                "<(?!a )(?!p )(?!img )(?!code )(?!span )(?!pre )(?!font )(?!/)[^>]*>";
        final Pattern PATTERN = Pattern.compile(REGULAR_EXPRESSION);
        final Matcher MATCHER = PATTERN.matcher(str_input);
        List<String> str_lst = new ArrayList<String>();
        while (MATCHER.find()) {
            str_lst.add(MATCHER.group());
        }
        final String  LEGAL_TAGS = "<a><img><div><p><span><pre><br><code>" +
                "<b><u><i><strike><font><blockquote><ul><li><ol><table><tr><td>";
        for (String str:str_lst) {
            if (!LEGAL_TAGS.contains(str)) {
                    System.out.println(str + " is illegal");
            }
        }
    }

The code above prints:

String inputed:<a ss><script>sds<body><a></adsd><d/s><p dsd><pp><a><dsds>dsdas<font ds><fontdsdsd><font>das<oooioacc><pp sds><script><code ><br><code><ccc><abug>
<script> is illegal
<body> is illegal
<d/s> is illegal
<pp> is illegal
<dsds> is illegal
<fontdsdsd> is illegal
<oooioacc> is illegal
<pp sds> is illegal
<script> is illegal
<ccc> is illegal
<abug> is illegal
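The same tag whitelist can be written a little more robustly by extracting just the tag name and looking it up in a set, instead of relying on a substring search in one long string. This is a sketch under my own assumptions (the class name TagWhitelist and the decision to lower-case names and to check closing tags too are mine), and it still says nothing about attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagWhitelist {

    // The same tag names as LEGAL_TAGS in the test above.
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList(
            "a", "img", "div", "p", "span", "pre", "br", "code", "b", "u", "i", "strike",
            "font", "blockquote", "ul", "li", "ol", "table", "tr", "td"));

    // Captures the name of every opening or closing tag: "<", optional "/", then the name.
    private static final Pattern TAG_NAME = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)");

    /** Returns the lower-cased names of all tags that are not on the allow-list. */
    static List<String> illegalTags(String html) {
        List<String> illegal = new ArrayList<>();
        Matcher matcher = TAG_NAME.matcher(html);
        while (matcher.find()) {
            String name = matcher.group(1).toLowerCase(Locale.ROOT);
            if (!ALLOWED.contains(name)) {
                illegal.add(name);
            }
        }
        return illegal;
    }
}
```

Because names are compared in lower case, <SCRIPT> and <ScRiPt> are reported the same way as <script>, and <pp sds> is reported as pp regardless of what follows the name.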

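2017-01-01

Happy New Year. Even so, I have to report new developments on this XSS whitelist check. Yesterday I deployed the check described above, and the script kiddie came back. As the text above shows, what I have done so far is restrict input to a limited set of HTML tags, without restricting tag attributes. The script kiddie exploited exactly that, for example by making a p tag absolutely positioned, pinning it to a chosen spot, setting its width and height, and so on.

On top of that, many tags accept onclick, onload and similar attributes.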
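A small sketch of what the tag-only check misses: the pattern below flags event-handler attributes (anything starting with "on") and style attributes inside a tag. The class AttributeCheck and the choice of which attributes count as risky are assumptions for illustration; regex matching on HTML is fragile, so treat this as a way to see the problem rather than as a fix.

```java
import java.util.regex.Pattern;

final class AttributeCheck {

    // Inside a tag: whitespace or "/" followed by an attribute whose name starts
    // with "on", or by a style attribute. Matching is case-insensitive.
    private static final Pattern RISKY_ATTRIBUTE = Pattern.compile(
            "<[^>]*[\\s/](on[a-z]+|style)\\s*=", Pattern.CASE_INSENSITIVE);

    /** True if any tag in the input carries an event handler or a style attribute. */
    static boolean hasRiskyAttribute(String html) {
        return RISKY_ATTRIBUTE.matcher(html).find();
    }
}
```

A perfectly whitelisted tag such as <p onclick="alert(1)"> or <p style="position:absolute"> is flagged here, while ordinary text like <p>online = fun</p> is not, since the match has to occur inside the angle brackets.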

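So the method described above is still not enough. But validating attributes as well felt like far too much trouble for me, so I went looking online for how other people handle it, and ended up finding jsoup, an open-source jar.

https://jsoup.org/download

After adding the jar, this is all it takes: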

articleContent = Jsoup.clean(articleContent, Whitelist.basicWithImages());

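Damn. If a ready-made wheel exists, use it as soon as you can. Building my own was too hard and cost me five days.

Finally, may every script monkey in the world have a terrible 2017!!!

2017-01-10

After adding the jsoup filter last time, I noticed problems when writing posts. Evidently some tags and attributes that should not be filtered were being stripped.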

articleContent = Jsoup.clean(articleContent,Whitelist.basicWithImages());

I felt this needed further work.

I set a breakpoint and debugged. As an example, I wrote the following HTML in a blog post:

<html>
     <body>
    <audio controls="controls" autoplay="autoplay" height="100" width="100">
            <source src="<%=basePath %>music/Breath and Life.mp3" type="audio/mp3" />
          <source src="<%=basePath %>music/Breath and Life.ogg" type="audio/ogg" />
          <embed height="100" width="100" src="<%=basePath %>music/Breath and Life.mp3" />
     </audio>
    <script type="text/javascript" src="<%=basePath %>js/global.js"></script>
    <script type="text/javascript" src="<%=basePath %>js/photos.js"></script>
    </body>
</html>

The string the rich-text editor sent to the back end was:

hello，日向blog<pre style="max-width:100%;overflow-x:auto;"><code class="html hljs xml"
codemark="1"><html>
<body>
<audio controls="controls" autoplay="autoplay" height="100" width="100">
<source src="<%=basePath %>music/Breath and Life.mp3" type="audio/mp3" />
<source src="<%=basePath %>music/Breath and Life.ogg" type="audio/ogg" />
<embed height="100" width="100" src="<%=basePath %>music/Breath and Life.mp3" />
</audio>
<script type="text/javascript" src="<%=basePath
%>js/global.js"></script>
<script type="text/javascript" src="<%=basePath
%>js/photos.js"></script>
</body>
</html></code></pre>

After jsoup filtering, the value was:

hello，日向blog
<pre><code><html>
<body>
<audio controls="controls" autoplay="autoplay" height="100" width="100">
<source src="<%=basePath %>music/Breath and Life.mp3" type="audio/mp3" />
<source src="<%=basePath %>music/Breath and Life.ogg" type="audio/ogg" />
<embed height="100" width="100" src="<%=basePath %>music/Breath and Life.mp3" />
</audio>
<script type="text/javascript" src="<%=basePath %>js/global.js"></script>
<script type="text/javascript" src="<%=basePath %>js/photos.js"></script>
</body>
</html></code></pre>

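Clearly the style attribute of the pre tag and the class attributes of the span and code tags were stripped, yet these attributes are harmless and necessary here. So the stock jsoup whitelist needs to be changed.

Looking at the code, jsoup's filtering is driven by the Whitelist.basicWithImages() argument, which is a whitelist.

Its source code: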

    /**
     <p>
     This whitelist allows a fuller range of text nodes: <code>a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li,
     ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul</code>, and appropriate attributes.
     </p>
     <p>
     Links (<code>a</code> elements) can point to <code>http, https, ftp, mailto</code>, and have an enforced
     <code>rel=nofollow</code> attribute.
     </p>
     <p>
     Does not allow images.
     </p>

     @return whitelist
     */
    public static Whitelist basic() {
        return new Whitelist()
                .addTags(
                        "a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em",
                        "i", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong", "sub",
                        "sup", "u", "ul")

                .addAttributes("a", "href")
                .addAttributes("blockquote", "cite")
                .addAttributes("q", "cite")

                .addProtocols("a", "href", "ftp", "http", "https", "mailto")
                .addProtocols("blockquote", "cite", "http", "https")
                .addProtocols("cite", "cite", "http", "https")

                .addEnforcedAttribute("a", "rel", "nofollow")
                ;

    }

    /**
     This whitelist allows the same text tags as {@link #basic}, and also allows <code>img</code> tags, with appropriate
     attributes, with <code>src</code> pointing to <code>http</code> or <code>https</code>.

     @return whitelist
     */
    public static Whitelist basicWithImages() {
        return basic()
                .addTags("img")
                .addAttributes("img", "align", "alt", "height", "src", "title", "width")
                .addProtocols("img", "src", "http", "https")
                ;
    }

After my modification it reads:

   public static Whitelist basic() {
       return new Whitelist()
               .addTags(
                       "a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em",
                       "i", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong", "sub",
                       "sup", "u", "ul")

               .addAttributes("a", "href")
               .addAttributes("blockquote", "cite")
               .addAttributes("q", "cite")
               .addAttributes("code", "class")
               .addAttributes("span", "class")
               .addAttributes("pre", "style")

               .addProtocols("a", "href", "ftp", "http", "https", "mailto")
               .addProtocols("blockquote", "cite", "http", "https")
               .addProtocols("cite", "cite", "http", "https")

               .addEnforcedAttribute("a", "rel", "nofollow")
               ;

   }

Three lines were added:

.addAttributes("span", "class")

.addAttributes("pre", "style")

.addAttributes("code", "class")
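The same result can probably be had without editing jsoup's source, since Whitelist.basicWithImages() returns a fresh Whitelist object and addAttributes returns that object for chaining. A minimal sketch, assuming a jsoup version that still ships org.jsoup.safety.Whitelist (newer releases rename the class to Safelist); ArticleSanitizer is a hypothetical wrapper, not code from this blog.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

final class ArticleSanitizer {

    // Start from the stock allow-list and add the three attributes the blog needs,
    // leaving the jsoup jar itself untouched.
    private static final Whitelist ARTICLE_WHITELIST = Whitelist.basicWithImages()
            .addAttributes("code", "class")
            .addAttributes("span", "class")
            .addAttributes("pre", "style");

    /** Returns the article HTML with everything outside the allow-list removed. */
    static String clean(String articleContent) {
        return Jsoup.clean(articleContent, ARTICLE_WHITELIST);
    }
}
```

Keeping the customization in application code means a jsoup upgrade does not silently undo it. One caution: allowing style on pre reopens a small part of the CSS positioning trick described in the 2017-01-01 entry, so it may be worth keeping that attribute off any tag that does not strictly need it.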