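Notes on 《Web安全之机器学习入门》 (Introduction to Machine Learning for Web Security): Chapter 7, Section 7.3, Detecting WebShells with Naive Bayes (Part 1)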

2022-01-30 23:58:31  Source: Internet

Tags: WebShell Web bigram file 7.3 webshell path files vectorizer

1. Source code modifications

(1) Error 1

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

Load ../data/PHP-WEBSHELL/xiaoma/1148d726e3bdec6db65db30c08a75f80.php
Traceback (most recent call last):
......
  t=load_file(file_path)
  for line in f:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

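The default encoding on this Windows machine is GBK, so open the file explicitly as UTF-8. Change the code to: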

def load_file(file_path):
    t=""
    with open(file_path,encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t

(2) Error 2:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

Load ../data/PHP-WEBSHELL/xiaoma/6b2548e859dd00dbf9e11487597b2c06.php
Traceback (most recent call last): 
    t=load_file(file_path)
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

If this error appears, the file is not valid UTF-8. Open it in an editor and re-save it ("Save As") with UTF-8 encoding.
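
Re-saving files by hand becomes tedious when many samples are affected. As an alternative, here is a minimal sketch that tries several encodings in turn and, if none fits, drops the undecodable bytes. The helper name `load_file_tolerant` and the candidate list `("utf-8", "gbk")` are assumptions for illustration; the original post does not state which encodings the samples use.

```python
def load_file_tolerant(file_path, encodings=("utf-8", "gbk")):
    """Read a file as one string, trying several encodings in turn.

    Hypothetical helper: the encoding list is an assumption about the samples.
    """
    with open(file_path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # No candidate encoding worked: keep what decodes, drop the rest.
        text = raw.decode("utf-8", errors="ignore")
    # Roughly mirrors the original load_file, which joins lines without newlines.
    return "".join(text.splitlines())
```

Swapping this in for `load_file` would avoid both errors above, at the cost of possibly mis-decoding a file whose real encoding is not in the list.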

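2. Dataset preparation: obtaining black and white samples

The dataset used in this section consists of black samples collected from the internet, that is, a collection of large webshells (大马) and small webshells (小马).

The xiaoma directory contains 54 small-webshell files with the .php suffix.

Opening one of them shows that its content is a one-line webshell (一句话木马).

The samples should include both black samples and white samples. Detection here is based on the text features of WebShells. As mentioned above, the WebShells collected from the internet serve as the black samples; the white samples are taken from the latest WordPress source code at the time of writing.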

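3. Sample vectorization

In this post the black and white samples are all files with the .php suffix, and they need to be converted into vectors. Each PHP file is treated as a single string and split into word-level 2-grams; iterating over the files yields a 2-gram vocabulary, and each PHP file is then vectorized over that vocabulary.

The approach for WebShell detection is as follows: tokenize the PHP webshell files into words (regex r'\b\w+\b'), apply the 2-gram algorithm to obtain the vocabulary, and count how often each vocabulary entry occurs in each file to obtain its feature vector. The normal PHP files are then vectorized the same way over the same vocabulary.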
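
To make the vectorization step concrete, here is a small sketch that applies the same `CountVectorizer` settings to two made-up strings. The two toy "files" are illustrative assumptions and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "files", each treated as one string (made-up content).
docs = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php echo 'hello world'; ?>",
]

vectorizer = CountVectorizer(
    ngram_range=(2, 2),        # word bigrams only
    token_pattern=r"\b\w+\b",  # every run of word characters is a token
    decode_error="ignore",
)
X = vectorizer.fit_transform(docs).toarray()

# Bigram vocabulary learned from both strings (lowercased by default).
print(sorted(vectorizer.vocabulary_))
# One row per file, one column per bigram.
print(X.shape)
```

Each row of `X` counts how often every bigram in the vocabulary occurs in the corresponding string, which is the feature vector described above.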

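(1) What are N-gram and 2-gram

N-gram is an important language model in NLP. Its basic idea is to slide a window of size N over the text, producing a sequence of fragments of length N. The window can run over bytes or over words; in this post it runs over words, so an n-gram is a sequence of n consecutive words. N=1 is called a unigram, N=2 a bigram, N=3 a trigram, and so on.

The model rests on the assumption that the N-th word depends only on the preceding N-1 words and on no other word, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be estimated directly from a corpus by counting how often N words occur together. The bigram (N=2) and trigram (N=3) models are the most commonly used.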
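
The counting described above can be sketched in a few lines of plain Python. The function name `word_ngrams` and the toy text are assumptions for illustration; the probability shown is the usual count-based estimate P(a | echo) = count("echo a") / count("echo").

```python
import re
from collections import Counter


def word_ngrams(text, n=2):
    """Split text into words and return its word-level n-grams as strings."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "echo a echo b echo a"  # toy text, not from the dataset
unigrams = Counter(word_ngrams(text, 1))
bigrams = Counter(word_ngrams(text, 2))

# Count-based estimate of P("a" follows "echo").
p_a_given_echo = bigrams["echo a"] / unigrams["echo"]
print(word_ngrams(text, 2))
print(p_a_given_echo)
```

Note that the classifier in this post does not use these conditional probabilities; it only uses the bigram counts as features.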

(2) Black samples

The code is as follows:

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

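Print the features:

# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
print(webshell_bigram_vectorizer.get_feature_names_out())

The result is the list of bigram strings that make up the vocabulary (output not shown).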

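Print the vocabulary:

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    print(vocabulary)

Its content is a dict that maps each bigram to its column index in the feature matrix (output not shown).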

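(3) White samples

The code is as follows. The vocabulary learned from the black samples is passed in, so the white samples are vectorized over the same columns: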

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)
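
The effect of the `vocabulary` argument can be seen on toy data. This is a sketch under assumed inputs (the strings are made up): bigrams that never occur in the black samples have no column, so a white sample made only of such bigrams becomes an all-zero vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

black = ["eval post cmd", "assert post cmd"]    # toy black samples
white = ["echo hello world", "eval post data"]  # toy white samples

black_vec = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\b\w+\b")
x_black = black_vec.fit_transform(black).toarray()

# Reuse the black-sample vocabulary so both matrices share the same columns.
white_vec = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\b\w+\b",
                            vocabulary=black_vec.vocabulary_)
# With a fixed vocabulary nothing is learned, so transform() is enough.
x_white = white_vec.transform(white).toarray()

print(x_black.shape, x_white.shape)
print(x_white)
```

With a fixed vocabulary, `fit_transform` and `transform` produce the same matrix; `transform` just makes it explicit that no new vocabulary is learned from the white samples.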

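(4) Building the training set

The black and white feature matrices and their labels are concatenated. The code is as follows: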

    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))
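
A quick shape check on toy arrays illustrates what the concatenation produces. The array values below are made up; only the pattern (black rows first with label 1, white rows after with label 0) follows the post.

```python
import numpy as np

x1 = np.array([[1, 0, 2], [0, 1, 0]])             # 2 toy black samples
x2 = np.array([[0, 0, 0], [1, 0, 0], [0, 0, 1]])  # 3 toy white samples
y1 = [1] * len(x1)
y2 = [0] * len(x2)

# Rows are stacked along axis 0, so the column count must match.
x = np.concatenate((x1, x2))
y = np.concatenate((y1, y2))

print(x.shape, y.shape)
print(y)
```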

5. Complete code:

The runtime environment is Python 3. Below is the source code, modified so that it runs correctly:

import os
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB


def load_file(file_path):
    t=""
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t


def load_files(path):
    files_list=[]
    for r, d, files in os.walk(path):
        for file in files:
            if file.endswith('.php'):
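                # r is the directory os.walk is currently visiting, so files in subdirectories resolve correctly
                file_path=os.path.join(r, file)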
                #print("Load %s" % file_path)
                t=load_file(file_path)
                files_list.append(t)
    return  files_list



if __name__ == '__main__':

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)
    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))

    clf = GaussianNB()
    # Use 3-fold cross-validation
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=3)
    print(scores)
    print(scores.mean())

6. Results (3-fold cross-validation)

[0.71153846 0.88235294 0.74509804]
0.7796631473102061

7. 10-fold cross-validation results

The code is as follows:

    # Use 10-fold cross-validation
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=10)
    print(scores)
    print(scores.mean())

The results are as follows:

[0.75       0.4375     0.625      0.6875     0.73333333 0.66666667
 0.73333333 0.53333333 0.46666667 0.53333333]
0.6166666666666666
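
The 10-fold mean is lower than the 3-fold mean, and the individual fold scores vary widely. With a small dataset each fold holds only a handful of files, so it can help to print the spread next to the mean. The sketch below does this on synthetic count data; the data, sizes, and Poisson rates are assumptions for illustration and have nothing to do with the real samples.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic count features standing in for bigram counts (made-up data).
rng = np.random.default_rng(0)
x_black = rng.poisson(lam=1.5, size=(40, 20))
x_white = rng.poisson(lam=0.5, size=(60, 20))
x = np.concatenate((x_black, x_white))
y = np.array([1] * 40 + [0] * 60)

for k in (3, 10):
    # For a classifier, an integer cv means stratified k-fold splitting.
    scores = cross_val_score(GaussianNB(), x, y, cv=k)
    print(k, scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between two cross-validation settings is meaningful.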

Source: https://blog.csdn.net/mooyuan/article/details/122756613
