软工作业2_论文查重-JZTXT

1.软工作业2_论文查重

软件工程	21计科34班
作业要求	个人项目
作业目标	按软件设计开发流程设计实现论文查重程序
作业GitHub地址	软工作业GitHub

开发环境：
编译语言：Java 17
IDE：Intellij IDEA 2022.3.3
项目构建工具：maven
单元测试：JUnit
性能分析工具：JProfiler 11.0.2

2.PSP表格

PSP2.1	Personal Software Process Stages	预估耗时（分钟）	实际耗时（分钟）
Planning	计划	50	100
· Estimate	· 估计这个任务需要多少时间	70	100
Development	开发	500	555
· Analysis	· 需求分析 (包括学习新技术)	340	400
· Design Spec	· 生成设计文档	70	20
· Design Review	· 设计复审	20	20
· Coding Standard	· 代码规范 (为目前的开发制定合适的规范)	20	20
· Design	· 具体设计	50	50
· Coding	· 具体编码	120	100
· Code Review	· 代码复审	10	10
· Test	· 测试（自我测试，修改代码，提交修改）	100	35
Reporting	报告	80	80
· Test Report	· 测试报告	20	40
· Size Measurement	· 计算工作量	40	20
· Postmortem & Process Improvement Plan	· 事后总结, 并提出过程改进计划	50	30
	· 合计	770	775

3.项目需求

题目：论文查重

描述如下：

设计一个论文查重算法，给出一个原文文件和一个在这份原文上经过了增删改的抄袭版论文的文件，在答案文件中输出其重复率。

原文示例：今天是星期天，天气晴，今天晚上我要去看电影。
抄袭版示例：今天是周天，天气晴朗，我晚上要去看电影。

要求输入输出采用文件输入输出，规范如下：

从命令行参数给出：论文原文的文件的绝对路径。
从命令行参数给出：抄袭版论文的文件的绝对路径。
从命令行参数给出：输出的答案文件的绝对路径。

4.计算模块接口的设计与实现

设计思路：

核心算法：

主要依靠SimHash和海明距离：文章分词并且将每个词都附上权重，然后将分词通过Hash算法计算出哈希值，将哈希值进行加权后把所有值相加，得到一个序列串，最后把这个序列串简化为1、0组成的序列，通过比较差异的位数就可以得到两串文本的差异，差异的位数，称之为“海明距离”，通常认为海明距离<3的是高度相似的文本

实现类：

Main：main 方法所在的类
SimHash：计算 SimHash 值与海明距离的类
Utils：读写 txt 文件的工具类

MainTest：测试类

SimHash类的实现：

simhash函数：

private BigInteger simHash() {

    int[] v = new int[this.hashbits];

    List<Term> termList = StandardTokenizer.segment(this.tokens);
    Map<String, Integer> weightOfNature = new HashMap<String, Integer>();
    weightOfNature.put("n", 2);
    Map<String, String> stopNatures = new HashMap<String, String>();
    stopNatures.put("w", ""); //
    int overCount = 5;
    Map<String, Integer> wordCount = new HashMap<String, Integer>();

    for (Term term : termList) {
        String word = term.word;

        String nature = term.nature.toString();
        //  过滤超频词
        if (wordCount.containsKey(word)) {
            int count = wordCount.get(word);
            if (count > overCount) {
                continue;
            }
            wordCount.put(word, count + 1);
        } else {
            wordCount.put(word, 1);
        }

        if (stopNatures.containsKey(nature)) {
            continue;
        }

        BigInteger t = this.hash(word);
        for (int i = 0; i < this.hashbits; i++) {
            BigInteger bitmask = new BigInteger("1").shiftLeft(i);
            int weight = 1;
            if (weightOfNature.containsKey(nature)) {
                weight = weightOfNature.get(nature);
            }
            if (t.and(bitmask).signum() != 0) {
                v[i] += weight;
            } else {
                v[i] -= weight;
            }
        }
    }
    BigInteger fingerprint = new BigInteger("0");
    for (int i = 0; i < this.hashbits; i++) {
        if (v[i] >= 0) {
            fingerprint = fingerprint.add(new BigInteger("1").shiftLeft(i));
        }
    }
    return fingerprint;
}

private BigInteger hash(String source) {
    if (source == null || source.length() == 0) {
        return new BigInteger("0");
    } else {
        while (source.length() < 3) {
            source = source + source.charAt(0);
        }
        char[] sourceArray = source.toCharArray();
        BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
        BigInteger m = new BigInteger("1000003");
        BigInteger mask = new BigInteger("2").pow(this.hashbits).subtract(new BigInteger("1"));
        for (char item : sourceArray) {
            BigInteger temp = BigInteger.valueOf((long) item);
            x = x.multiply(m).xor(temp).and(mask);
        }
        x = x.xor(new BigInteger(String.valueOf(source.length())));
        if (x.equals(new BigInteger("-1"))) {
            x = new BigInteger("-2");
        }
        return x;
    }
}

海明距离：

int hammingDistance(SimHash other) {
    BigInteger m = new BigInteger("1").shiftLeft(this.hashbits).subtract(
            new BigInteger("1"));
    BigInteger x = this.strSimHash.xor(other.strSimHash).and(m);
    int tot = 0;
    while (x.signum() != 0) {
        tot += 1;
        x = x.and(x.subtract(new BigInteger("1")));
    }
    return tot;
}


public double getSemblance(SimHash s2) {
    double i = (double) this.hammingDistance(s2);
    return 1 - i / this.hashbits;
}

5.计算模块单元测试展示与性能展示

测试结果：

调用最多的函数为simHash():

性能分析图：

方法调用情况：

6.计算模块部分异常处理说明

  try {
    if (str.length() < 200) throw new ShortStringException("文本过短,无法判断");
} catch (ShortStringException e) {
    e.printStackTrace();
    return null;
}

当文本长度过短的时候，无法获得关键字