First Programming Assignment

Software Engineering	Computer Science, Class 4
Assignment requirements	(https://edu.cnblogs.com/campus/gdgy/CSGrade21-34/homework/13023)
Assignment goal	Individual project

GitHub repository:

(https://github.com/abduwali66/3121005072)

PSP table

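PSP	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	40	90
Estimate	Estimate how much time this task will take	600	730
Development	Development	300	400
Analysis	Requirements analysis (including learning new technologies)	200	280
Design Spec	Write the design document	40	20
Design Review	Design review	40	20
Coding Standard	Coding standard (set suitable conventions for the current development)	10	5
Design	Detailed design	10	5
Coding	Coding	200	220
Code Review	Code review	20	10
Test	Testing (self-testing, fixing code, committing changes)	40	40
Reporting	Reporting	30	20
Test Report	Test report	20	10
Size Measurement	Measure the workload	5	5
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	5	5
Total	Total	1560	1860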

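Assignment requirements

Topic: paper plagiarism checking

The task is described as follows:

Design a plagiarism-checking algorithm for papers. Given a file containing the original text and a file containing a plagiarized version produced by adding to, deleting from, and modifying that original, write the duplication rate to an answer file.

Original example: 今天是星期天，天气晴，今天晚上我要去看电影。 (Today is Sunday, the weather is clear, and tonight I am going to see a movie.)
Plagiarized example: 今天是周天，天气晴朗，我晚上要去看电影。 (Today is Sunday, the weather is fine, and I am going to see a movie tonight.)
Input and output must both go through files, according to the following rules:

First command-line argument: the absolute path of the original paper file.
Second command-line argument: the absolute path of the plagiarized paper file.
Third command-line argument: the absolute path of the answer file to write.
A set of sample files is provided, handed out in class and uploaded to the class group. Usage: orig.txt is the original text; the other files, such as orig_add.txt, are plagiarized versions.

Note: the answer written to the answer file is a floating-point number, accurate to two decimal places.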
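
The two-decimal requirement above can be met with a format string. The sketch below is only an illustration: the class name AnswerWriter and its methods are hypothetical and are not taken from the project's repository.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class AnswerWriter {
    // Formats a similarity value with exactly two decimal places.
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    static String format(double similarity) {
        return String.format(Locale.ROOT, "%.2f", similarity);
    }

    // Writes the formatted value to the answer file, replacing any old content.
    static void write(Path answerPath, double similarity) throws IOException {
        Files.write(answerPath, format(similarity).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(format(87.456)); // prints 87.46
        // Hypothetical usage: pass an answer-file path as the first argument.
        if (args.length > 0) {
            write(Paths.get(args[0]), 87.456);
        }
    }
}
```

Formatting at the last moment, rather than rounding the double itself, keeps the full precision available for the calculation.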

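Interface design and implementation

File reading interface
Accepts the paths of the original paper, the paper to be checked, and the answer file from the command line.
Splits the original paper and the paper to be checked into sentences.
Passes the processed text to the calculation module.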
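
Before any reading starts, the three command-line arguments can be checked. The following is a minimal sketch under assumptions: the class ArgsCheck, the method validate, and the message texts are hypothetical and do not appear in the original project.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArgsCheck {
    // Returns null when the arguments are usable,
    // otherwise a message describing the problem.
    static String validate(String[] args) {
        if (args.length != 3) {
            return "usage: <original file> <checked file> <answer file>";
        }
        // Only the two input files have to exist; the answer file is created later.
        for (int i = 0; i < 2; i++) {
            Path p = Paths.get(args[i]);
            if (!Files.isReadable(p)) {
                return "cannot read input file: " + p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        // From here the real program would split both files into sentences
        // and hand the two arrays to the calculation module.
    }
}
```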

Core algorithm

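import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Splits the file at paperPath into sentences. Unused trailing slots stay null.
// JudgeType is defined elsewhere in the project.
private static String[] TxtToArray(String paperPath) {
    String[] sentenceArray = new String[2000];
    // Assumption: the input files are UTF-8; the original relied on the platform default charset.
    // try-with-resources closes the reader even when reading fails.
    try (Reader reader = new InputStreamReader(new FileInputStream(paperPath), StandardCharsets.UTF_8)) {
        int tempchar;
        int n = 0;
        StringBuilder sentence = new StringBuilder();
        while ((tempchar = reader.read()) != -1) {
            switch (JudgeType(tempchar)) {
                case 1: // end of sentence: keep it only if longer than 5 characters
                    if (sentence.length() > 5 && n < sentenceArray.length) sentenceArray[n++] = sentence.toString();
                    sentence.setLength(0);
                    break;
                case 2: // content character: append to the current sentence
                    sentence.append((char) tempchar);
                    break;
                default: // any other character is skipped
                    break;
            }
        }
        // Keep the last sentence when the file does not end with a delimiter.
        if (sentence.length() > 5 && n < sentenceArray.length) sentenceArray[n] = sentence.toString();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sentenceArray;
}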

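How it works:
The file is read one character at a time, and the type of each character decides whether it extends the current sentence or ends it.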
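
The post does not show JudgeType itself. As a rough illustration of character classification, here is a hypothetical stand-in: the name judgeType, the return codes, and the choice of delimiter characters are all assumptions and may differ from the author's rules.

```java
import java.util.ArrayList;
import java.util.List;

public class CharTypeSketch {
    // Hypothetical stand-in for JudgeType:
    // 1 = sentence delimiter, 2 = character kept in the sentence, 0 = ignored.
    static int judgeType(int c) {
        // Delimiters: U+3002 ideographic full stop, U+FF01, U+FF1F, U+FF1B
        // (fullwidth ! ? ;), their ASCII counterparts, and line breaks.
        if ("\u3002\uFF01\uFF1F\uFF1B.!?;\n\r".indexOf(c) >= 0) return 1;
        // Letters, digits and CJK ideographs are kept.
        if (Character.isLetterOrDigit(c)) return 2;
        // Commas, spaces and other symbols are dropped.
        return 0;
    }

    // Splits text into sentences using judgeType.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            int type = judgeType(ch);
            if (type == 1) {
                if (cur.length() > 0) out.add(cur.toString());
                cur.setLength(0);
            } else if (type == 2) {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello, world. Second one!")); // prints [Helloworld, Secondone]
    }
}
```

Dropping commas and spaces means that purely cosmetic edits to punctuation do not change the sentences that are later compared.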

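Calculation module interface
Receives the two processed sentence arrays from the reading interface.
Compares each sentence of the original against every sentence of the paper being checked and keeps the highest similarity.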
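
Read together with the listing that follows, these steps amount to a length-weighted average of the best match found for each original sentence. The symbols are notation introduced here, not the author's: s_i are the sentences of the original, t_j the sentences of the paper being checked, |s_i| is the length of s_i in characters, sim is the sentence similarity, and R is the reported rate in percent.

```latex
R = 100 \times \frac{\sum_{i} |s_i| \, \max_{j} \operatorname{sim}(s_i, t_j)}{\sum_{i} |s_i|}
```

Weighting by |s_i| means a long copied sentence contributes more to the result than a short one.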

// sentencePercentage, wordNum, similarityPercentage, originalArray and addArray
// are declared earlier in the enclosing method.
for (String doc1 : originalArray) {
    sentencePercentage = 0;
    if (doc1 == null) break; // null marks the end of the filled slots
    wordNum += doc1.length();
    for (String doc2 : addArray) {
        if (doc2 == null) break;
        // Maps each character to {count in doc1, count in doc2}.
        Map<Character, int[]> algMap = new HashMap<>();
        for (int i = 0; i < doc1.length(); i++) {
            algMap.computeIfAbsent(doc1.charAt(i), k -> new int[2])[0]++;
        }
        for (int i = 0; i < doc2.length(); i++) {
            algMap.computeIfAbsent(doc2.charAt(i), k -> new int[2])[1]++;
        }
        double sqdoc1 = 0;     // squared norm of doc1's count vector
        double sqdoc2 = 0;     // squared norm of doc2's count vector
        double dotProduct = 0; // numerator of the cosine formula
        for (int[] c : algMap.values()) {
            dotProduct += c[0] * c[1];
            sqdoc1 += c[0] * c[0];
            sqdoc2 += c[1] * c[1];
        }
        double similarPercentage = dotProduct / Math.sqrt(sqdoc1 * sqdoc2);
        // Keep the best match for this original sentence.
        if (similarPercentage > sentencePercentage)
            sentencePercentage = similarPercentage;
    }
    // Weight each sentence's best score by its length.
    similarityPercentage += (sentencePercentage * doc1.length());
}
similarityPercentage = similarityPercentage / wordNum * 100;

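How it works:

Sentence similarity is measured with cosine similarity, computed on the character-frequency vectors of the two sentences.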
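
For two count vectors A and B, cosine similarity is A·B / (|A| |B|): 1 when the character distributions are identical and 0 when the sentences share no character. The standalone sketch below shows the calculation for a single pair of strings. The class and method names are hypothetical, and the guard for empty input is an addition not present in the original listing.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {
    // Counts how often each character occurs in s.
    static Map<Character, Integer> counts(String s) {
        Map<Character, Integer> m = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            m.merge(s.charAt(i), 1, Integer::sum);
        }
        return m;
    }

    // Cosine of the angle between the two character-count vectors.
    // Returns 0 when either string is empty, to avoid dividing by zero.
    static double cosine(String a, String b) {
        Map<Character, Integer> ca = counts(a);
        Map<Character, Integer> cb = counts(b);
        double dot = 0, sqA = 0, sqB = 0;
        for (Map.Entry<Character, Integer> e : ca.entrySet()) {
            int x = e.getValue();
            sqA += x * x;
            dot += x * cb.getOrDefault(e.getKey(), 0);
        }
        for (int y : cb.values()) {
            sqB += y * y;
        }
        if (sqA == 0 || sqB == 0) return 0;
        return dot / Math.sqrt(sqA * sqB);
    }

    public static void main(String[] args) {
        // "aab" -> {a:2, b:1}, "ab" -> {a:1, b:1}: 3 / sqrt(5 * 2)
        System.out.println(cosine("aab", "ab"));  // about 0.9487
        System.out.println(cosine("abc", "xyz")); // prints 0.0
    }
}
```

Because only character counts are compared, reordering the words of a sentence does not lower the score, which suits text that has been lightly rearranged.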

Performance analysis

CPU load:

Memory consumption: