12-JZTXT

Scene text is ubiquitous in our daily life, and it conveys valuable information. However, various kinds of private information, such as ID numbers, telephone numbers, license plate numbers, and home addresses (Inai et al. 2014), may easily be exposed in natural scene images. Such private information can easily be collected automatically by machines engaged in fraud, marketing, or other illegal activities. Therefore, a method that can conceal text in the wild would be beneficial.

场景文本在我们的日常生活中无处不在，它传达着有价值的信息。然而，各种私人信息，如身份证号码、电话号码、汽车号码和家庭地址（Inai et al. 2014）可能很容易暴露在自然场景图像中。这些重要的私人信息可以很容易地被从事欺诈、营销或其他非法活动的机器自动收集。因此，一种可以在野外隐藏文本的方法将是有益的。
Textual information in a captured scene plays an important role in scene interpretation and decision making.

捕捉场景中的文本信息在场景的解释和决策中起着重要的作用。

Text is widely present in different design and scene images. It contains important contextual information for the readers. However, if any alteration is required in the text present in an image, it becomes extremely difficult for several reasons.

文本广泛存在于不同的设计和场景图像中。它包含了对读者很重要的上下文信息。然而，如果需要对图像中出现的文本进行任何修改，由于几个原因，这就会变得极其困难。

Text manipulation technologies have caused serious concern in recent years; however, corresponding tampering detection methods have not been well explored. In this paper, we introduce a new task, named Tampered Scene Text Detection (TSTD), to localize text instances and recognize their texture authenticity in an end-to-end manner. Unlike the general scene text detection (STD) task, TSTD additionally requires fine-grained classification: tampered and real-world texts share a semantic space (text position and geometric structure) but have different local textures.

文本操纵技术近年来引起了严重的担忧，但相应的篡改检测方法尚未得到很好的探索。在本文中，我们引入了一个新的任务，称为篡改场景文本检测（TSTD），来定位文本实例和识别纹理的真实性。与一般的场景文本检测（STD）任务不同，TSTD进一步引入了细粒度分类，即被篡改的文本和真实世界的文本共享一个语义空间（文本位置和几何结构），但具有不同的局部纹理。

As an important medium for information transmission, scene text contains a large amount of important and sensitive information [33,31,3,22]. With the development of text manipulation technologies [34,38,21], computers can automatically tamper with important and sensitive content to produce fake information, which can be used for fraud, marketing, or other illegal purposes. In contrast, tampered text detection remains largely unexplored. To fill this research gap, we propose a new task named Tampered Scene Text Detection (TSTD) in this paper.

场景文本作为信息传递的重要媒介，包含大量重要敏感信息[33,31,3,22]。随着文本操纵技术[34,38,21]的发展，计算机可以自动将重要和敏感的内容篡改成虚假信息，被用于欺诈、营销或其他非法目的。相比之下，对于被篡改的文本检测字段的方法目前是空白的。为了填补这一研究空白，本文提出了一种新的篡改场景文本检测（TSTD）。
Following the observation from the face forgery detection task [12,17] that tampering detection approaches should not focus only on the tampered class, the TSTD task needs to locate all the texts in a scene image and determine whether each text has been tampered with (shown in Fig. 1).

基于人脸伪造检测任务[12,17]，篡改检测方法不应该只关注被篡改的类，TSTD任务需要定位场景图像中的所有文本，并确定文本是否被篡改（如图1所示）

TSTD task has two main challenges:

1) Fine-grained perception. Tampered and real-world texts share a semantic space (text position and geometric structure) but differ in local texture. As shown in Fig. 1, both tampered and real-world texts appear in the same positions (e.g. boards, buses, etc.) and have identical geometric structure (e.g. horizontal/oriented posture and text shape), while tampered texts have different local textures (e.g. smoothness) from real-world ones. Thus, TSTD methods need to maximize the discrimination of class-specific texture features while maintaining semantic invariance.
2) The limited number of high-quality annotated tampered text images. At present, the results of existing text manipulation methods [34,38] are still a long way from practical application. In general, manual refinement in post-processing is needed to improve visual quality, which inevitably incurs a high human cost. Thus, it is necessary to construct a TSTD method with low data dependency.

TSTD任务有两个主要的挑战：

1)细粒度的知觉。被篡改的文本和真实世界的文本共享一个语义空间（文本位置和几何结构），但存在局部纹理差异。如图1所示，篡改文本和真实文本都存在于同一位置（如板、总线等）并且具有相同的几何结构（如水平/面向方向的姿态和文本形状），而被篡改的文本包含不同于真实世界的局部纹理（如平滑度）。因此，TSTD方法需要在保持语义不变性的同时，最大限度地区分特定于类的纹理特征。

2)高质量的标注篡改文本图像的大小有限。目前，现有的文本操作方法[34,38]的研究结果距离实际应用还有很长的路要走。一般来说，后处理中的人工细化是可视化改进的必要条件，这不可避免地会导致大量的人力成本。因此，如何构建一个具有低数据依赖性的TSTD方法是必要的。

In this paper, we propose a Separating Segmentation while Sharing Regression (S3R) modification strategy to construct TSTD approaches based on existing STD ones. The S3R strategy builds on the observation that tampered and real-world texts differ only in local texture but share the same global semantics (text position and geometric structure).

在本文中，我们提出了一种分离分割而共享回归（S3R）修改策略，基于现有的STD方法来构建TSTD方法。S3R策略遵循篡改和真实单词文本只有局部纹理差异，但包含相同的全局语义（文本位置和几何结构）的现象。

However, in real-world scenarios, the doctored image may be further processed or transmitted over a channel with unknown distortion, which dramatically degrades forgery detection performance. In this work, we take a first step towards designing a document image forgery localization method that is robust to image blending.

然而，在现实场景中，篡改后的图像可以通过未知失真的信道进行进一步处理或传输，这大大降低了伪造检测性能。在这项工作中，我们向设计一个鲁棒的文档图像伪造定位对抗图像混合的第一步。
A digital document image is the digital representation of a hard-copy document. Such images are available in electronic form on various media and networks and can be easily accessed by users. Recent years have seen widespread use of digital document images in areas such as office automation, e-commerce, and e-government. However, the content of a document can be easily manipulated with image editing tools or deep-learning-based techniques. Forgery operations on document images, such as splicing, copy-pasting, and object removal, may create a crisis of shared trust. Multiple methods have been proposed for verifying the authenticity of images.

数字文档图像是硬拷贝文档的数字表示。它们以电子形式在各种媒体和网络上提供，用户可以很容易地访问。近年来，数字文档图像在办公自动化、电子商务和电子政务等领域被普遍使用。然而，通过使用图像编辑工具或利用深度学习的技术，可以很容易地操作文档的内容。剪接、复制粘贴和对象删除伪造操作以更改文档映像可能会创建共享信任的危机。目前已经提出了多种方法来验证图像的真实性

Nowadays, image forensics is gradually moving from lab research to real-world applications. The majority of existing image forensic algorithms have been designed and validated only for ideal, controlled laboratory environments, without sufficient consideration of robustness in practical environments. It is often assumed that the tampered image has not undergone any pre-processing or post-processing. However, forged images in real scenarios often undergo post-processing to mask the traces of manipulation.

如今，图像取证逐渐从实验室研究转向现实世界的应用。大多数现有的图像取证算法只在理想的受控实验室环境中设计和验证，而没有充分考虑在实际环境中的鲁棒性。通常假定被篡改的图像没有经过任何预处理或后处理。然而，真实场景中的伪造图像往往会经过后处理，以掩盖对操作的跟踪。
The S3R strategy successfully maintains the semantic invariance and explicitly guides the class-specific texture feature learning between tampered and real-world texts. Furthermore, a parallel-branch feature extractor is constructed for the feature representation capability enhancement and data-dependency reduction.

S3R策略成功地保持了语义不变性，并明确地指导了被篡改文本和真实文本之间的类特定的纹理特征学习。此外，还构造了一个并行分支特征提取器，以提高特征表示能力和降低数据依赖性。

In this paper, we focus only on word-level tampered text detection and propose a corresponding word-level detection method. Character-level and line-level tampering cases are not covered in this paper.

在本文中，我们只关注单词级的篡改文本检测，并提出了相对单词级的检测方法。本文不包括字符级和线级的篡改案例。《 Detecting Tampered Scene Text in the Wild》

Thus, the end-to-end tampered text detection approaches need to be explored.

因此，需要探索端到端被篡改的文本检测方法。

The scene text editing task aims to tamper with text content in scene images in an end-to-end manner. As deep learning has become the most promising machine learning tool [42,2,4,13], scene text editing has improved remarkably in recent years.

场景文本编辑任务的目的是端到端篡改场景图像中的文本内容。随着深度学习成为最有前途的机器学习工具[42,2,4,13]，场景文本编辑近年来取得了显著的进步。

ETW [34] splits the text editing process into three sub-networks: a text conversion network, a background inpainting network, and a fusion network. The text conversion network learns to transform the style of the input text image. Then, the background inpainting network erases the text content in the source image and reconstructs the background texture. Finally, the fusion network aggregates the style-transferred text image and the text-erased image to generate the final tampered sample. Building on ETW [34], SwapText [38] introduces TPS to handle cases of severe geometric distortion. To edit specific characters in text images, STEFANN [21] proposes a character-level text editing network.

ETW [34]将文本编辑过程分为三个子网络：文本转换网络、背景绘制网络和融合网络。文本传输网络学习转换输入文本图像的样式。然后，背景内画网络擦除源图像中的文本内容，重建背景纹理。最后，融合网络将样式传输的文本图像和文本擦除的图像进行聚合，生成最终的篡改样本。基于ETW [34]，SwapText [38]引入了TPS来处理严重的几何失真情况。为了编辑文本图像中的特定字符，STEFANN [21]提出了一个字符级的文本编辑网络。34-Editing text in the wild

Though text manipulation technologies have been well developed in recent years, methods for tampered text detection are almost nonexistent. The works in [18,11,24,1] regard tampered text detection as a pure classification task, and the detection (localization) process is not included in their approaches.

虽然近年来文本操作技术得到了很好的发展，但被篡改的文本检测领域的方法几乎是空白的。[18,11,24,1]将篡改文本检测任务视为纯粹的分类任务，检测过程不包括在他们的方法中。18-Forged text detection in video, scene, and document images,1- Document forgery detection using printer source identification—a text-independent approach
Image manipulations have been shown to leave high-frequency traces [41,17]. However, such varied high-frequency information is difficult to capture in the RGB domain. Thus, the network needs a large number of tampered images to converge well on tampered textures, resulting in high data dependency. To this end, we introduce a parallel-branch feature extractor to capture the high-frequency information in the frequency domain. By aggregating the RGB and high-frequency features, our parallel-branch feature extractor can easily capture the high-frequency traces to assist prediction, which results in low data dependency.

图像处理被证明会留下高频痕迹[41,17]。然而，这种不同的高频信息很难在RGB域中被捕获。因此，网络需要大量的被篡改的图像才能更好地转换被篡改的纹理，从而产生较高的数据一致性。为此，我们引入了一种并行分支特征提取器来捕获频域内的高频信息。通过聚合RGB和高频特征，我们的并行分支特征提取器可以很容易地捕获高频跟踪，以帮助预测，并导致较低的数据依赖性。
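
The idea of isolating high-frequency traces in the frequency domain can be illustrated with a minimal sketch. It assumes a DCT-based high-pass filter on a small grayscale block; the notes do not state which transform or cutoff the frequency branch actually uses, so the function names (`dct_1d`, `idct_1d`, `transform_2d`, `high_pass`) and the `cutoff` parameter are hypothetical and this is not the paper's implementation.

```python
import math

def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(signal[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        out.append(scale * total)
    return out

def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out

def transform_2d(block, fn):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [fn(list(row)) for row in block]
    cols = [fn(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def high_pass(block, cutoff=2):
    """Zero the DCT coefficients with u + v < cutoff, then transform back.

    The cutoff value is an assumption chosen for illustration only.
    """
    coeffs = transform_2d(block, dct_1d)
    for u in range(len(coeffs)):
        for v in range(len(coeffs[0])):
            if u + v < cutoff:
                coeffs[u][v] = 0.0
    return transform_2d(coeffs, idct_1d)
```

On a perfectly flat block the filtered output is zero up to floating-point error, while a block containing a sharp vertical edge keeps a clearly non-zero response. That is the kind of content-independent signal a frequency branch could feed alongside RGB features.
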
A new word-level TSTD dataset named Tampered-IC13 is proposed (shown in Fig. 5). Tampered-IC13 is generated by tampering with the text in ICDAR2013 [10], the most well-known scene text detection benchmark. To the best of our knowledge, this is the first word-level TSTD dataset, and it should greatly promote the development of the TSTD task.

提出了一种新的词级篡改TSTD数据集TamperedIC13（如图5所示）。篡改-IC13是通过篡改最著名的场景文本检测基准测试ICDAR2013 [10]中的文本而生成的。据我们所知，这是第一个世界级的TSTD数据集，它将极大地促进TSTD任务的开发。

A parallel-branch feature extractor is constructed to capture characteristics in both the RGB and frequency domains.

构造了一个并行分支特征提取器来捕获RGB和频域的特征

To better understand how frequency information helps the network's prediction, we visualize the features extracted by the frequency branch and the RGB branch. We normalize each map to [0, 1] for better visualization. As shown in Fig. 4, unlike the RGB branch, which mainly focuses on the text content in the RGB domain, the frequency branch effectively captures the high-frequency characteristics in the outline areas. By fusing the features captured by both the frequency and RGB branches, the parallel-branch feature extractor is able to learn distinguishing features between tampered and real-world texts.

为了更好地理解频率信息如何帮助网络进行预测，我们将从频率分支和RGB分支中提取的特征进行可视化。为了更好地可视化，我们将每个映射规范化为[0,1]。如图4所示，与RGB分支主要关注RGB域中的文本内容不同，频率分支有效地捕获了轮廓区域中的高频特征。通过融合从频率和RGB分支捕获的特征，并行分支特征提取器能够学习被篡改文本和真实文本之间的区分特征。
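
The per-map normalization to [0, 1] mentioned above is presumably plain min-max scaling. A minimal sketch under that assumption follows; the function name `normalize_map` and the handling of constant maps are illustrative choices, not taken from the paper.

```python
def normalize_map(feature_map):
    """Min-max scale a 2-D map (list of lists) to the range [0, 1]."""
    values = [v for row in feature_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant map has no contrast; return all zeros by convention.
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]
```

Normalizing each map independently makes maps from the two branches comparable on screen, but it discards their absolute magnitudes, so the visualizations show where each branch responds rather than how strongly.
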

The difficulty of the TSTD task lies in how the network can better distinguish the tampered class from the real class, which puts higher requirements on the data generation process to ensure the texture consistency and background integrity in the tampered region.

TSTD任务的难点在于网络如何能够更好地区分被篡改的类和真实的类，这就对数据生成过程提出了更高的要求，以确保被篡改区域的纹理一致性和背景完整性。

Thus, accurate detection on Tampered-IC13 is quite challenging, and detection results on Tampered-IC13 reflect the performance of TSTD detectors well.

因此，对篡改ic13的准确检测是一个很有挑战性的挑战，对篡改ic13的检测结果可以很好地反映TSTD探测器的性能。

Based on the above analyses, the separated structure and the representation suppression between the two branches together promote class-specific texture learning and raise detection performance to a new level.

基于以上分析，两个分支之间的分离结构和表示抑制共同促进了类特定的纹理学习，并将检测性能提高到一个新的水平。

We attribute this improvement to the fact that high-frequency information effectively helps the network learn distinguishing features between tampered and real-world texts.

我们总结了一个令人印象深刻的改进，即高频信息有效地帮助网络学习被篡改的文本和真实世界的文本之间的区别特征。

"Our experimental results demonstrate that our approach achieves high accuracy and robustness in detecting and locating image tampering."
"We conducted extensive experiments to evaluate the robustness of our approach to various types of image tampering, including copy-paste, modification, text overlay, and watermark removal."
"Our experiments show that our approach is effective and feasible on different datasets, and can be applied to various real-world scenarios."
"Compared to other existing methods, our approach achieves superior performance in detecting and locating image tampering while maintaining high efficiency."
"We also evaluated the robustness of our approach against different scales, angles, and lighting conditions, and the results demonstrate that our approach is robust under various scenarios."
"Our approach is able to effectively detect and locate hidden tampering, such as local modifications or subtle text overlays."
"Through practical testing in real-world scenarios, we show that our approach is feasible and practical for various applications, and can provide reliable detection and localization results."

“我们的实验结果表明，我们的方法在检测和定位图像篡改方面实现了高精度和鲁棒性。”

“我们进行了广泛的实验，以评估我们的方法对各种图像篡改的稳健性，包括复制粘贴、修改、文本覆盖和水印删除。”

“我们的实验表明，我们的方法在不同的数据集上是有效和可行的，并且可以应用于各种真实世界的场景。”

与现有的其他方法相比，我们的方法在检测和定位图像篡改方面取得了优越的性能，同时保持了高效率

“我们还评估了我们的方法在不同尺度、角度和光照条件下的稳健性，结果表明我们的方法在各种情况下都是稳健的。”

“我们的方法能够有效地检测和定位隐藏的篡改，比如局部修改或细微的文本覆盖。”

“通过在真实场景中的实际测试，我们表明我们的方法对于各种应用是可行和实用的，并且能够提供可靠的检测和定位结果。”

Our experiments show that our approach is robust against different scales, angles, and lighting conditions, and can effectively detect and locate image tampering even when the tampered regions are small or obscure. Specifically, we conducted experiments on datasets with images of varying sizes and orientations, and our approach achieved high detection and localization rates across all of them. Furthermore, we tested our approach on images with varying levels of brightness and contrast, and the results show that our approach is able to detect and locate tampering even in challenging lighting conditions. These results demonstrate the robustness of our approach and its potential for practical applications in the real world.

实验结果表明，该方法对不同的尺度、角度和光照条件具有较强的鲁棒性，能够有效地检测和定位图像篡改，即使篡改区域很小或模糊。具体来说，我们对不同大小和方向的图像数据集进行了实验，我们的方法在所有图像上实现了高的检测和定位率。此外，我们在不同亮度和对比度的图像上测试了我们的方法，结果表明我们的方法能够检测和定位篡改，即使在具有挑战性的光照条件下。这些结果证明了我们的方法的鲁棒性和它的潜力在实际应用中的真实世界。

Designing the decoder network structure for text image tamper detection and localization requires consideration of several factors, including the size of the input image, the complexity of the tampering, and the desired output format. Here are some general steps that you can follow:

1. Determine the input size and encoding method: Depending on the size of the input image, you may need to downsample or compress it to a more manageable size before feeding it into the decoder network. You may also need to encode the input image using a suitable encoding method, such as JPEG or PNG.
2. Define the decoder architecture: The decoder architecture should be designed to reconstruct the original image from the encoded input while also identifying any tampered regions. A common approach is to use a deep convolutional neural network (CNN) with skip connections, which can efficiently extract features from the input image and reconstruct the output image.
3. Incorporate attention mechanisms: Text image tampering often involves subtle changes to specific regions of the image, such as text or logos. To effectively detect and localize these changes, it may be necessary to incorporate attention mechanisms into the decoder architecture. Attention mechanisms can help the network focus on specific regions of the image that are likely to contain tampering.
4. Define the output format: The output format of the decoder network should be designed to clearly indicate the location of any tampered regions. This can be achieved through various methods, such as image segmentation or bounding box detection. The output format should also be compatible with downstream applications, such as image editing software or forensic analysis tools.
5. Train and evaluate the network: Once the decoder architecture and output format have been defined, the network should be trained on a suitable dataset of text image tampering examples. The network should be evaluated on a separate test dataset to ensure that it can effectively detect and localize tampering across a range of scenarios.

By following these steps, you can design a decoder network structure that is effective at detecting and localizing text image tampering.

设计用于文本图像篡改检测和定位的解码器网络结构需要考虑几个因素，包括输入图像的大小、篡改的复杂性和所需的输出格式。以下是一些你可以遵循的一般步骤:

1.确定输入大小和编码方法: 根据输入图像的大小，在将其输入解码器网络之前，可能需要缩小采样或将其压缩到一个更易于管理的大小。您可能还需要使用合适的编码方法(如 JPEG 或 PNG)对输入图像进行编码。
2.定义解码器体系结构: 解码器体系结构应设计为从编码输入重建原始图像，同时识别任何被篡改的区域。一种常见的方法是使用带跳跃连接的深度卷积神经网络(CNN) ，它可以有效地从输入图像中提取特征并重建输出图像。
3.合并注意机制: 文本图像篡改通常涉及图像特定区域的微妙变化，如文本或标识。为了有效地检测和定位这些变化，可能有必要将注意机制纳入解码器体系结构。注意机制可以帮助网络专注于图像中可能包含篡改的特定区域。
4.定义输出格式: 解码器网络的输出格式应设计为清楚地指出任何被篡改的区域的位置。这可以通过各种方法来实现，例如图像分割或边界盒检测。输出格式还应与下游应用程序兼容，如图像编辑软件或法证分析工具。
5.训练和评估网络: 一旦解码器结构和输出格式已经确定，网络应该训练在一个合适的文本图像篡改例子的数据集。应该在单独的测试数据集上对网络进行评估，以确保它能够有效地检测和定位各种情况下的篡改。

通过以下步骤，您可以设计一个能够有效检测和定位文本图像篡改的解码器网络结构。
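
The output-format step above mentions both segmentation and bounding boxes. One way to relate the two is to derive boxes from a binary tamper mask by connected-component labeling. The sketch below assumes a mask given as a list of lists of 0/1 values and 4-connectivity; `mask_to_boxes` is a hypothetical helper, not part of any particular library.

```python
from collections import deque

def mask_to_boxes(mask):
    """Return inclusive (top, left, bottom, right) boxes, one per 4-connected region of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the region that starts at (r, c).
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Predicting a mask and deriving boxes afterwards keeps pixel-level evidence available for forensic review, whereas predicting boxes directly is simpler for downstream tools that only need regions.
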

Document image manipulation detection is a challenging task that aims to identify and locate forged regions in images containing mostly text, such as identity documents, certificates, receipts, etc. Some of the specific problems or challenges in this area are:

The forged regions often look very similar to the original image, as both are usually black text on a white background.
The number and location of text boxes on a document vary widely, making it difficult to apply fixed-size convolutional filters.
The manipulation methods are diverse and evolving, such as splicing, inpainting, warping, etc. This requires the detection models to be robust and generalizable to unseen manipulations.

文档图像处理检测是一项具有挑战性的任务，其目的是识别和定位伪造的图像中大部分文本，如身份证件，证书，收据等。这方面的一些具体问题或挑战是:

伪造的区域通常看起来与原始图像非常相似，因为两者通常都是白色背景上的黑色文字。

文件上文本框的数量和位置差别很大，因此难以采用固定大小的卷积过滤器。

操作方法多种多样，不断演变，如拼接、修补、翘曲等。这要求检测模型是健壮的，并且可以推广到看不见的操作。

Another challenge is the detection of document manipulations that are designed to be imperceptible to the human eye. This can include subtle alterations to the text or layout of a document that may not be noticeable to a human observer but can still have a significant impact on the document's meaning or authenticity. Additionally, document manipulations may be targeted specifically at evading detection by existing forensic methods, such as by exploiting weaknesses in the algorithms used for image analysis or by using advanced image processing techniques to hide or obscure the evidence of tampering.

另一个挑战是检测设计为人眼无法察觉的文档操作。这可能包括对文档文本或布局的细微改动，人类观察者可能不会注意到这些改动，但仍会对文档的含义或真实性产生重大影响。此外，文件操纵可能专门针对逃避现有取证方法的检测，例如利用图像分析算法中的弱点或使用高级图像处理技术隐藏或掩盖篡改证据。
This may involve using more sophisticated attention mechanisms, such as adversarial attention, that can effectively identify and capture subtle artifacts and irregularities in manipulated images.

这可能涉及使用更复杂的注意力机制，例如对抗性注意力，可以有效地识别和捕获操纵图像中的细微伪影和不规则现象。
This could include developing datasets that cover a wider range of manipulation types and levels of realism, and using more advanced validation techniques, such as cross-validation and adversarial training, to improve the accuracy and robustness of detection models.

这可能包括开发涵盖更广泛的操作类型和真实水平的数据集，并使用更先进的验证技术（例如交叉验证和对抗训练）来提高检测模型的准确性和稳健性。
First, they need to identify the types of manipulations that are most common or pose the greatest risk in the real world. This could include common techniques such as copy-pasting, splicing, or text overlay, as well as more advanced techniques like deepfakes or GAN-generated content.

首先，他们需要确定在现实世界中最常见或构成最大风险的操作类型。这可能包括复制粘贴、拼接或文本叠加等常见技术，以及深度伪造或 GAN 生成内容等更高级的技术。
Adversarial attention is a more advanced form of attention mechanism that has shown promising results in various computer vision tasks, including image manipulation detection. It is based on the concept of adversarial training, where the attention mechanism is trained to distinguish between genuine and manipulated images, while the manipulator tries to generate images that can evade detection by the attention mechanism.

In the context of text image manipulation detection, adversarial attention can be used to identify and capture subtle artifacts and irregularities in manipulated images. For example, the attention mechanism can be trained to focus on regions of the image that contain artifacts such as inconsistent textures, inconsistent lighting, or inconsistent color tones. The manipulator can then try to generate images that are more consistent in these regions, making it more difficult for the attention mechanism to detect the manipulation.

One approach to implementing adversarial attention is through the use of generative adversarial networks (GANs), which consist of two deep neural networks - a generator and a discriminator. The generator tries to generate images that can deceive the discriminator, while the discriminator tries to distinguish between genuine and manipulated images. By iteratively training the generator and discriminator, the attention mechanism can be trained to better identify and capture subtle artifacts and irregularities in manipulated images.

In summary, adversarial attention is a more sophisticated attention mechanism that can effectively identify and capture subtle artifacts and irregularities in manipulated images, and can be implemented through the use of GANs in the context of text image manipulation detection.

对抗性注意力是一种更高级的注意力机制形式，已在各种计算机视觉任务（包括图像处理检测）中显示出令人鼓舞的结果。它基于对抗训练的概念，其中注意力机制被训练以区分真实图像和操纵图像，而操纵者试图生成可以逃避注意力机制检测的图像。

在文本图像操纵检测的背景下，对抗性注意力可用于识别和捕获操纵图像中的细微伪影和不规则性。例如，可以训练注意力机制以关注图像中包含伪像的区域，例如不一致的纹理、不一致的照明或不一致的色调。然后，操纵器可以尝试生成在这些区域中更加一致的图像，从而使注意力机制更难以检测到操纵。

实现对抗性注意力的一种方法是使用生成对抗网络 (GAN)，它由两个深度神经网络——一个生成器和一个鉴别器组成。生成器试图生成可以欺骗鉴别器的图像，而鉴别器则试图区分真实图像和经过处理的图像。通过迭代训练生成器和判别器，可以训练注意力机制以更好地识别和捕获操纵图像中的细微伪影和不规则性。

总之，对抗性注意力是一种更复杂的注意力机制，可以有效地识别和捕获被操纵图像中的细微伪影和不规则性，并且可以通过在文本图像操纵检测的上下文中使用 GAN 来实现。
To develop datasets that cover a wider range of text manipulation types and levels of realism, researchers can consider the following strategies:
1. Collect real-world text images: Collecting a diverse set of real-world text images can help capture the various types of text manipulation that are commonly encountered in the wild. This can include text swap, text removal, text splicing, text copy-move, and other types of manipulation.
2. Generate synthetic data: Synthetic data can be generated using various techniques such as Generative Adversarial Networks (GANs) and image editing software. Synthetic data can be manipulated to create a variety of realistic text manipulations, which can be used to train machine learning models to detect such manipulations.
3. Crowd-sourcing: Crowd-sourcing platforms can be used to create a dataset with a wider range of text manipulation types and levels of realism. Researchers can ask participants to manipulate text images using a set of predefined techniques, such as text swap, text removal, and text splicing, and then collect the manipulated images for use in the dataset.
By utilizing these strategies, researchers can create datasets that cover a wider range of text manipulation types and levels of realism, which can help improve the effectiveness and robustness of text manipulation detection algorithms.

要开发涵盖更广泛的文本操作类型和真实程度的数据集，研究人员可以考虑以下策略：

收集真实世界的文本图像：收集各种真实世界的文本图像可以帮助捕获在野外常见的各种类型的文本操作。这可以包括文本交换、文本删除、文本拼接、文本复制移动和其他类型的操作。

生成合成数据：可以使用各种技术生成合成数据，例如生成对抗网络 (GAN) 和图像编辑软件。可以对合成数据进行操作以创建各种逼真的文本操作，这些操作可用于训练机器学习模型以检测此类操作。

众包：众包平台可用于创建具有更广泛的文本操作类型和真实度级别的数据集。研究人员可以要求参与者使用一组预定义的技术来处理文本图像，例如文本交换、文本删除和文本拼接，然后收集处理过的图像以用于数据集。

通过利用这些策略，研究人员可以创建涵盖更广泛的文本操作类型和真实程度的数据集，这有助于提高文本操作检测算法的有效性和稳健性。
The forgery document image dataset constructed using different blending approaches can help in developing more robust and effective methods for detecting a wide range of text image manipulations. The three blending techniques used in the dataset are:
1. Naive Blending: It involves blending two images using simple averaging or addition of pixel values, resulting in a visible seam between the two images.
2. Poisson Blending: It involves solving a Poisson equation to blend two images seamlessly, resulting in a smoother and more natural-looking composite image.
3. Deep Image Blending: It involves using a deep neural network to blend two images seamlessly, learning the blending function from a large set of training examples.
By using these blending techniques, various types of text image manipulations such as text splicing, copy-move, and removal can be simulated in the dataset. The dataset can also include variations in the level of realism of the manipulated images, such as varying levels of blur, noise, and compression artifacts.

The availability of such a diverse and realistic dataset can help in training and evaluating more sophisticated deep learning models for text image manipulation detection. These models can then be used to detect text manipulation in real-world scenarios, such as in identifying doctored images in social media or detecting fraudulent documents in legal or financial applications.

使用不同混合方法构建的伪造文档图像数据集可以帮助开发更强大和有效的方法来检测各种文本图像操作。数据集中使用的三种混合技术是：

朴素混合：它涉及使用简单的平均或添加像素值来混合两个图像，从而在两个图像之间产生可见的接缝。

泊松混合：它涉及求解泊松方程以无缝混合两个图像，从而产生更平滑、更自然的合成图像。

深度图像混合：它涉及使用深度神经网络将两个图像无缝混合，从大量训练示例中学习混合功能。

通过使用这些混合技术，可以在数据集中模拟各种类型的文本图像操作，例如文本拼接、复制移动和删除。数据集还可以包括处理过的图像的真实感级别的变化，例如不同级别的模糊、噪声和压缩伪影。这种多样化且真实的数据集的可用性有助于训练和评估更复杂的文本图像操纵检测深度学习模型。然后，这些模型可用于检测现实场景中的文本操纵，例如识别社交媒体中的篡改图像或检测法律或金融应用程序中的欺诈文件。
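
The difference between naive and Poisson blending can be shown on a 1-D signal, where the Poisson equation reduces to matching second differences. This is a minimal sketch under stated assumptions: the naive variant is a direct paste rather than averaging, the solver is plain Gauss-Seidel iteration, and the names `naive_paste` and `poisson_blend_1d` are illustrative. Real 2-D implementations solve a sparse linear system instead.

```python
def naive_paste(target, source, start, end):
    """Copy source[start..end] (inclusive) straight into the target."""
    out = list(target)
    out[start:end + 1] = source[start:end + 1]
    return out

def poisson_blend_1d(target, source, start, end, iterations=500):
    """Blend source into target on [start, end] by matching second differences.

    Solves f[i-1] - 2 f[i] + f[i+1] = s[i-1] - 2 s[i] + s[i+1] inside the
    region, with f fixed to the target outside it, using Gauss-Seidel sweeps.
    Requires 1 <= start <= end <= len(target) - 2.
    """
    f = [float(v) for v in target]
    for _ in range(iterations):
        for i in range(start, end + 1):
            laplacian = source[i - 1] - 2 * source[i] + source[i + 1]
            f[i] = (f[i - 1] + f[i + 1] - laplacian) / 2.0
    return f
```

With a flat target and a brighter source containing a small bump, the naive paste leaves a large jump at the region boundary, while the Poisson result keeps the bump's shape but shifts it to the target's level, so the seam largely disappears. This is why blended forgeries are harder to localize from boundary artifacts alone.
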
In order to develop datasets that cover a wider range of text manipulation types and levels of realism, researchers can use a variety of techniques to simulate different types of text manipulations. For example, they can create synthetic datasets that simulate various types of text manipulation, or they can collect real-world data that contains instances of text manipulation.

Some common types of text manipulation that can be included in these datasets include text swap, text removal, text splicing, and text copy-move. Text swap involves replacing one word or phrase with another, while text removal involves removing a word or phrase from the text. Text splicing involves combining two different pieces of text to create a new, false message, while text copy-move involves copying and pasting a section of text from one location to another.

To ensure that the dataset is representative of real-world text manipulation, researchers can also use different post-processing techniques to mask any visual traces of forgery. These techniques may include boundary blurring, contrast modification, and scaling, among others.

By developing datasets that cover a wider range of text manipulation types and levels of realism, researchers can better train and test their text manipulation detection models, leading to more effective and robust detection methods.

为了开发涵盖更广泛的文本操作类型和真实程度的数据集，研究人员可以使用多种技术来模拟不同类型的文本操作。例如，他们可以创建模拟各种类型的文本操作的合成数据集，或者他们可以收集包含文本操作实例的真实数据。

这些数据集中可以包含的一些常见文本操作类型包括文本交换、文本删除、文本拼接和文本复制移动。文本交换涉及将一个词或短语替换为另一个，而文本删除涉及从文本中删除一个词或短语。文本拼接涉及将两段不同的文本组合起来以创建新的虚假消息，而文本复制移动涉及将一段文本从一个位置复制并粘贴到另一个位置。

为了确保数据集代表真实世界的文本操作，研究人员还可以使用不同的后处理技术来掩盖任何视觉伪造痕迹。这些技术可能包括边界模糊、对比度修改和缩放等。通过开发涵盖更广泛的文本操作类型和真实程度的数据集，研究人员可以更好地训练和测试他们的文本操作检测模型，从而产生更有效和更强大的检测方法。
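
For synthetic dataset construction, each manipulated image usually needs a matching ground-truth mask. The sketch below illustrates this for the copy-move case described above, assuming a grayscale image stored as a list of lists and a rectangular region; the function `copy_move` and its parameter names are hypothetical.

```python
def copy_move(image, src_top, src_left, height, width, dst_top, dst_left):
    """Copy a rectangle to a new location; return (tampered image, mask).

    The mask marks the pasted (destination) pixels with 1. Pixels are read
    from the untouched input, so overlapping regions are handled correctly.
    """
    tampered = [list(row) for row in image]
    mask = [[0] * len(image[0]) for _ in image]
    for dy in range(height):
        for dx in range(width):
            tampered[dst_top + dy][dst_left + dx] = image[src_top + dy][src_left + dx]
            mask[dst_top + dy][dst_left + dx] = 1
    return tampered, mask
```

Post-processing such as boundary blurring or contrast changes would be applied to the tampered image afterwards, while the mask stays unchanged so that it still marks the manipulated region.
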
However, these general image manipulation detection approaches based on convolutional neural networks (CNNs) are not well suited to tampering detection in certificate documents. The convolutions they use tend to extract image content features, whereas in most certificate documents the tampered regions are only weakly correlated with image content. This is why existing manipulation detection algorithms are less effective when applied directly to certificate documents.

然而，这些基于卷积神经网络（CNN）的通用图像篡改检测方法并不适用于证书文件类型篡改检测，因为使用的卷积方法倾向于提取图像内容特征，而大多数证书文件类型被篡改的区域与相关性较弱图像内容，这就是为什么现有的图像处理检测算法在直接用于证书文档类型处理时效果较差的原因。

A key idea of image manipulation detection is that there are specific local structural relationships between pixels that are independent of the image content, and that manipulation operations change these local relationships. Therefore, a feature extractor for manipulation detection in certificate document images must learn the relationships between pixels and their local neighborhoods while suppressing the image content, to avoid learning content-related features.

图像操纵检测的一个关键思想在于像素之间存在与图像内容无关的特定局部结构关系，图像操纵操作会改变这些局部关系。因此，类证书文档图像操作检测特征提取器必须在抑制图像内容的同时学习像素与其局部域之间的关系，以避免学习与内容相关的特征。
In addition, manipulation of certificate document images consists mainly of text splicing and copy-paste, and these operations leave edge inconsistencies between the real region and the manipulated region. Therefore, the edge features of the image are an important clue for detecting manipulation in certificate document images.

另外，证件类图像处理操作主要是文本拼接和复制粘贴，这种文本处理操作的边缘会出现真实区域和被处理区域的边缘不一致。因此，图像的边缘特征是证件类图像处理的重要线索。
In this work, we propose ASGC-Net, a manipulation detection network for certificate document images based on a spatial attention mechanism. To better localize textual tampering cues, we also propose a novel spatially constrained convolution that effectively suppresses image content and adaptively learns manipulation detection features by capturing the differences between the neighborhood and the center of the convolution window. To increase the network's ability to capture tampering cues at multiple image scales, we add multilayer cross-scale connections inspired by FPN [4]. In experiments, the algorithm located manipulated regions of certificate documents more accurately than general-purpose manipulation detection algorithms.

在这项工作中，我们提出了一种基于空间注意机制的证书文档类型图像的操纵检测网络 ASGC-Net。为了实现一个能够更好地定位文本篡改线索的网络，我们还提出了一种新的空间约束卷积，它可以有效地抑制图像内容，并通过捕获邻域和卷积空间中心之间的不同特征来自适应地学习操作检测特性。为了提高网络在多尺度图像上捕获篡改线索的能力，我们添加了受 FPN [4] 网络启发的多层跨尺度连接。在实验中，发现该算法比通用篡改检测算法更准确地定位证书文档的篡改区域。
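
The intuition behind comparing a pixel with its neighborhood can be sketched with a fixed residual filter. This is only an illustration: the weights here are fixed to a uniform neighbor average rather than learned, and `neighbor_residual` is a hypothetical helper, not the ASGC-Net layer itself.

```python
def neighbor_residual(image):
    """Mean of the 8 neighbors minus the center pixel, for interior pixels.

    Border pixels are left at 0.0. Flat regions and linear ramps produce a
    zero response, so smooth image content is suppressed.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbors = [image[y + dy][x + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                         if (dy, dx) != (0, 0)]
            out[y][x] = sum(neighbors) / 8.0 - image[y][x]
    return out
```

The response is non-zero only where a pixel disagrees with its surroundings, for example along an abrupt splice boundary, which matches the claim that edge inconsistency is a stronger tampering cue than image content.
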
A full-scale cross-layer connection structure is used to better exploit the textual tampering cues and the spatially accurate detail in the shallow feature maps. However, the extracted shallow feature maps contain a lot of redundant information that is irrelevant to manipulation detection. A spatial attention structure is therefore added in the sampling layer so that, during training, the model suppresses the parts that are irrelevant to manipulation detection while strengthening the learning of the features that are relevant to it. Since certificate manipulation is mainly textual, and ordinary convolution tends to learn image content features rather than the tampering cues of textual content, spatial gradient convolution is proposed to capture the local edge-tampering features of certificate documents.

尽管为了更好地利用浅层特征图的文本篡改线索和空间准确的细节信息，实现了全尺寸跨层连接结构。但是提取出来的浅层特征图有很多冗余信息，与操作检测线索无关，因此需要在采样层加入空间注意力结构，让模型学习到与图像无关的部分通过抑制模型来进行篡改检测，同时在训练模型时加强对与图像篡改检测相关的特征的学习。由于证书篡改检测的主要内容是文本篡改，普通卷积更倾向于学习图像的内容特征，而学习文本内容的篡改线索则比较困难，因此提出空间梯度卷积来捕获局部边缘篡改证书文件的特点。
ASGC-Net uses a full-scale cross-layer connection structure to better exploit the textual tampering cues and spatially accurate detail in the shallow feature maps. However, the extracted shallow feature maps contain a lot of redundant information that is irrelevant to tampering detection. A spatial attention structure is therefore added to the downsampling module so that, during training, the model suppresses the parts that are unimportant for image manipulation detection and strengthens the learning of features that are relevant to it.

由于ASGC-Net采用全尺度跨层连接结构，更好地利用浅层特征图的文本篡改线索和空间准确的细节信息，但提取的浅层特征图有大量与篡改无关的冗余信息检测线索，需要在下采样模块中添加空间注意结构，以允许模型以一种抑制模型学习图像篡改线索的方式进行训练，空间注意结构需要添加到下采样模块中，因此在训练模型时，可以抑制模型中对图像篡改检测不重要的部分，同时增加对图像篡改检测相关特征的学习。
To evaluate the proposed algorithm on certificate-like document images, we use the certificate document manipulation detection dataset released in 2021 by the Ali Tianchi competition [9]. The dataset consists of manipulated images from real business scenarios (qualifications, documents, screenshots, facade images) and covers attacks such as copy and paste, splicing, and pasting. It contains a total of 4000 manipulated images with labels. Unlike previous image manipulation datasets, which focus on natural-content images, this dataset uses a large number of forged document-type images. As shown in Table I, we randomly divide the labeled dataset into 2000 training images and 2000 test images to evaluate the performance of the algorithm.

为了评估我们提出的类证书文档图像篡改检测算法的性能，我们将使用阿里天池竞赛 [9] 于 2021 年发布的类证书文档图像篡改检测数据集。该数据集的图像类型由真实业务场景（资质、文档、截图、门面图像）的操纵图像组成，包含复制粘贴、拼接、粘贴等攻击操纵。该数据集共有 4000 个带有标记数据的处理图像。与以前专注于自然内容图像的图像处理数据集不同，该数据集使用了大量伪造的文档类型图像。如表一所示，我们将标记数据集随机分为 2000 个训练集和 2000 个测试集，用于评估算法的性能。
The comparative experimental results under the JPEG compression attack are shown in Figure 8(b). When the quality factor drops from 100 to 50, the F1-Score of the other deep learning detection methods drops sharply, while the performance of ASGC-Net remains stable. ASGC-Net is therefore more robust against JPEG compression attacks. The localization results under the resize attack are shown in Figure 8(c). The accuracy of every algorithm except ASGC-Net decreases greatly as the compression ratio increases. This is mainly because the full-scale cross-layer connection structure of ASGC-Net combines feature map information from various scales well, so it is more robust under resize attacks.

JPEG压缩攻击的对比实验结果如图8(b)所示。我们发现当质量因子从 100 下降到 50 时，其他深度学习检测方法的 F1-Score 急剧下降，而 ASGC-Net 的性能保持稳定。可以得出结论，ASGC-Net 对 JPEG 压缩攻击表现出更好的鲁棒性。调整大小攻击下的定位结果如图8（c）所示。可以发现，除了ASGC-Net之外，其他算法的准确率随着压缩比的增加而大幅下降，这主要是由于ASGC-Net的全面跨层。连接结构可以很好地结合各种尺度的特征图信息，因此在调整大小攻击下表现出更好的鲁棒性。
We propose an attention-based image tampering detection algorithm for certificate documents. To better extract local image details, this paper proposes a full-scale cross-layer connection structure, which captures both fine-grained and coarse-grained tampering cues at full scale. Adding a spatial attention structure to each upsampling layer allows the model to focus on learning features related to certificate document tampering detection during training while suppressing the learning of parts unrelated to it. Through a novel spatial gradient convolution that captures the difference features between the convolutional spatial neighborhood and its center, the network effectively suppresses image content and adaptively learns manipulation detection features. The experimental results show that the proposed algorithm is effective in locating the manipulated area of certificate document images.

我们提出了一种基于注意力的证书文档图像篡改检测算法。为了更好地提取图像的局部细节，本文提出了一种全尺度跨层连接结构，它可以全尺度捕获细粒度和粗粒度的篡改线索。并且在每个上采样层加入空间注意力结构，可以让模型在训练时专注于学习与证件篡改检测相关的特征，同时抑制模型去学习与图像篡改检测无关的部分。该网络可以通过新颖的空间梯度卷积捕获卷积空间邻域与中心之间的差异特征，从而有效地抑制图像内容并自适应地学习操作检测特征。实验结果表明，本文提出的算法在定位证件图像被篡改区域方面是有效的。
With this in mind, we produce a new MSM30K dataset, and we expect it to pose new challenges and promote more in-depth and comprehensive research in the field of image forensics.

考虑到这一点，我们制作了一个新的 MSM30K 数据集，我们预计它将带来新的挑战，并促进在图像取证领域进行更深入和全面的研究。
Through the above two collection methods, we finally produced a dataset of 30,000 tampered images of real-life scenes, with 10,000 images per tampering type. It covers a rich set of real-life scenes organized into 7 common super-categories (portraits, landscape photography, human documentary, photojournalism, commercial photography, ecological photography, and special photography) and 32 sub-categories.

通过上述两种收集方法，我们最终产生了一个30,000篡改图像的现实生活场景的数据集。每种篡改类型包括10,000张图像。它包含了丰富的现实生活场景集，包括人像摄影、风景摄影、人类纪录片、新闻摄影、商业摄影、生态摄影、特殊摄影等常见的7个超级类别和32个子类别。
The NIST and CASIA datasets are currently recognized as large-scale tampering datasets and test benchmarks, so in this article we analyze the complexity of all three datasets (NIST, CASIA, and ours) in detail. We plot the normalized object size in Figure 5(a). Compared with NIST and CASIA, our proposed dataset has a broader size range and includes an abundance of small objects. The detection of small targets is generally both a focus and a difficulty of target detection. We plot the distance from object center to image center in Figure 5(b). Compared with NIST and CASIA, the dataset proposed in this article has more complex objects near the image margins, which increases the difficulty of detection. The MSM30K dataset also has more complex detection targets, in line with the complex detection scenarios of real life.

NIST 数据集和 CASIA 数据集目前被认为是大规模篡改数据集和测试数据集，因此本文对这三个数据集的复杂性进行了详细的分析。我们在图5(a)中绘制了规范化的对象大小。通过与美国国家标准与技术研究院(nIST)和 CASIA 进行比较，我们可以发现，我们提出的数据集具有更广泛的大小范围，包括大量小尺寸的探测对象。一般来说，小目标图像的检测是目标检测的重点和难点。我们在图5(b)中将目标中心绘制到图像中心，与 NIST 和 CASIA 相比，本文提出的数据集具有更复杂的边缘目标，增加了检测的难度。同时，MSM30K 数据集具有更复杂的检测目标，符合现实生活中复杂的检测场景。
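
The two statistics plotted in Figure 5 can be computed from a ground-truth mask. The notes do not give the exact definitions, so this minimal sketch assumes that normalized size is the tampered area divided by the image area, and that the center distance is the Euclidean distance between the mask centroid and the image center, divided by the half-diagonal. The function name `object_stats` is hypothetical.

```python
import math


def object_stats(mask):
    """Return (normalized_size, normalized_center_distance) for a binary mask.

    mask: 2D list of 0/1 values containing at least one nonzero pixel.
    """
    h, w = len(mask), len(mask[0])
    ys, xs = [], []
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                ys.append(y)
                xs.append(x)
    if not ys:
        raise ValueError("mask contains no tampered pixels")
    size = len(ys) / (h * w)                      # assumed: area ratio
    cy, cx = sum(ys) / len(ys), sum(xs) / len(xs)  # mask centroid
    icy, icx = (h - 1) / 2, (w - 1) / 2            # image center (pixel indices)
    half_diag = math.hypot(icy, icx)
    dist = math.hypot(cy - icy, cx - icx) / half_diag if half_diag else 0.0
    return size, dist


corner = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
middle = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
corner_size, corner_dist = object_stats(corner)
middle_size, middle_dist = object_stats(middle)
```

A small object in the corner gives a small size and a distance close to 1, which is the kind of margin object the paragraph counts as harder to detect.
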
Different types of tampering have different characteristics, so detecting multiple types of tampering is more complicated than detecting a single type. In the field of image forensics, we need a general detection framework that performs well on all three tampering methods common in daily life.

不同类型的篡改有不同的特点，因此多种类型的篡改的检测要比单一的篡改检测复杂得多。在图像取证领域，我们需要一个通用的检测框架，它对日常生活中的三种篡改方法都有很好的效果。
In deep networks, low-level features are generally deemed lacking in semantic information but rich in keeping geometric details, which is the opposite for high-level features. Therefore, feature fusion plays a crucial role in combining both semantic and geometric information.

在深层网络中，低层特征通常被认为缺乏语义信息，但却富于保持几何细节，而高层特征则恰恰相反。因此，特征融合在语义信息和几何信息的融合中起着至关重要的作用。
Existing literature [1, 2] has shown that attention mechanisms can effectively eliminate interference from irrelevant features. Similarly, the attention mechanism of RFBA module can automatically adjust the weights of five Bconvd modules {Bconvd i, i = 1, 3, 5, 7, 9} to learn robust cues.

现有的文献[1,2]已经表明，注意机制可以有效地消除来自不相关特征的干扰。同样，RFBA 模块的注意机制可以自动调整五个 Bconvd 模块的权重{ Bconvd i，i = 1,3,5,7,9}来学习鲁棒线索。
The loss function of this article is composed of two main parts: a weighted intersection-over-union (IoU) loss ($L^{W}_{IoU}$) for the global restriction and a weighted binary cross-entropy (BCE) loss ($L^{W}_{BCE}$) for the local (pixel-level) restriction. The IoU loss used in this article increases the weight of hard pixels to highlight their importance. Similarly, the BCE loss pays more attention to hard pixels. So the total loss function is:

本文的损失函数主要由两部分组成: 全局约束和局部(像素级)约束的加权交叉-过并(IoU)损失(LW IoU)和二进制交叉熵(BCE)损失(LW BCE)。本文中使用的 IU 丢失增加了硬像素的权重，以突出它们的重要性。同样，BCE 损失更加注重硬像素。总损失函数是:
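
The equation itself is missing from these notes. As a minimal sketch of how such a loss is commonly assembled, the following assumes that the two terms are simply added and that the per-pixel weights, which emphasize hard pixels, are supplied by the caller. The function names, the `eps` clipping, and the `+ 1.0` smoothing in the IoU term are illustrative assumptions, not the authors' exact formulation.

```python
import math


def weighted_bce(pred, target, weight, eps=1e-7):
    """Weighted binary cross-entropy over flat lists of probabilities and labels."""
    total = 0.0
    for p, g, w in zip(pred, target, weight):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -w * (g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / sum(weight)


def weighted_iou_loss(pred, target, weight):
    """Weighted soft IoU loss: 1 - weighted intersection / weighted union."""
    inter = sum(w * p * g for p, g, w in zip(pred, target, weight))
    union = sum(w * (p + g) for p, g, w in zip(pred, target, weight)) - inter
    return 1.0 - (inter + 1.0) / (union + 1.0)


def total_loss(pred, target, weight):
    """Assumed combination: global IoU term plus pixel-level BCE term."""
    return weighted_iou_loss(pred, target, weight) + weighted_bce(pred, target, weight)


target = [1, 0, 1, 0]
pred = [0.9, 0.1, 0.2, 0.1]            # third pixel is badly predicted
uniform = total_loss(pred, target, [1.0, 1.0, 1.0, 1.0])
emphasized = total_loss(pred, target, [1.0, 1.0, 5.0, 1.0])
```

Raising the weight of the badly predicted pixel increases both terms, which is the sense in which the weighted losses pay more attention to hard pixels.
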
Image manipulation detection differs from semantic segmentation because it pays more attention to tampering artifacts than to image content, which suggests that richer features need to be learned. Using semantic segmentation networks (e.g., Mask R-CNN, RGB-N, MSRCNN) to detect tampered regions would therefore produce a high false positive rate. The emergence of image manipulation software has made it possible to blend the tampered area almost perfectly with the image background. A major challenge is that the intrinsic similarities between such foreground objects and their background surroundings make the features extracted by a deep model indistinguishable.

图像处理检测不同于语义分割，因为它更关注篡改伪影，而不是图像内容，这表明需要学习更丰富的特征，因此使用语义分割网络(如 Mask-Rcnn，RGB-N，MSRCNN)来检测篡改区域会产生很高的假阳性。图像处理软件的出现使得将篡改区域与图像背景完美地混合成为可能。一个主要的挑战是，这些前景物体和背景环境之间的内在相似性使得深度模型提取的特征难以区分。
It can automatically adjust the size of the receptive field according to the shape of the tampered target, and it is also highly resistant to external compression, noise, and scaling interference.

它可以根据被篡改目标的形状自动调整感知场的大小，对外部压缩、噪声和尺度干扰有很强的抵抗能力
These networks are very effective in resisting noise and JPEG compression interference. As the image is scaled down, the main features of the image are reduced and it becomes challenging to extract the features of the tampered target.

这些网络在抵抗噪声和 JPEG 压缩干扰方面非常有效。随着图像的缩放，图像的主要特征被减少，提取篡改目标的特征变得非常困难。
Intelligent retouching software is developing ever more rapidly, and with the addition of deep learning technology it is becoming increasingly convenient to operate. If criminals apply such technology to maliciously tamper with images and spread them on the Internet, the impact on our daily life will be serious. In this article, we first put forward a new dataset, MSM30K, built through careful manual collection in response to the lack of datasets for intelligent-software editing scenarios. We then propose a unified network model (ESRNet) for the three tampering methods common in daily life. It comprises four main modules: an efficient feature pyramid network (EFPN), a residual receptive field block with attention (RFBA), hierarchical decoding identification (HDI), and cascaded group-reversal attention (GRA) blocks. Compared with 13 current state-of-the-art methods, ESRNet performs well in both objective evaluation metrics and subjective evaluation. It locates tampered image regions quickly and accurately, is more robust than other state-of-the-art methods, and plays a significant role in the field of information security.

目前，智能修饰软件的发展越来越快，特别是随着深度学习技术的加入，修饰软件的操作越来越方便。如果罪犯利用智能技术恶意篡改图像，并将篡改后的图像传播到互联网上，将对我们的日常生活造成严重影响。本文首先针对智能软件操作场景中数据集不足的问题，提出了一种新的数据集，并将其命名为 MSM30K。并针对日常生活中的三种篡改方法，提出了统一的网络模型 ESRNet。它主要包括四个主要模块: 高效特征金字塔网络(EFPN)、有注意的剩余接受域模块(RFBA)、分层译码识别(HDI)、级联群反转注意模块(GRA)。通过与现有的13种最新方法的比较，ESRNet 在客观评价指标和主观评价方面具有良好的性能。它可以快速、准确地定位图像篡改区域。它在信息安全领域中发挥着重要作用，与其他最新的信息安全方法相比，具有很强的鲁棒性。【结论，仿写】
Image splicing, also known as image composition, is the most common form of image forgery. The general approach is to insert fragments of one image into another, usually with the aim of deceiving the viewer. Splicing detection is a much more difficult task than copy-move forgery detection. Manipulating an image by splicing is as tricky as detecting splice forgery. The basic concept behind the various splicing detection techniques is to find regions that are inconsistent with the rest of the image's features. In splicing, regions are often re-sampled, double compressed, and blurred to create a faked image. This has inspired researchers to develop different techniques to detect image splicing. In general, the invisible subtle alterations induced by splicing operations can be traced back through physics-based and statistics-based approaches. Several approaches have been proposed to detect image splicing based on the abnormal transients at splicing boundaries. A method for image splicing detection based on a natural image model was introduced in Reference [1]. This model uses statistical features extracted from the image and from 2D arrays generated by applying a multi-size block discrete cosine transform (MBDCT) to the image. The statistical features include moments of characteristic functions of wavelet sub-bands and Markov transition probability matrices. The method in Reference [1] was improved by capturing intra-block and inter-block correlations using DCT coefficients. The original Markov model obtained using a discrete wavelet transform (DWT) was used to extract additional features. Then, the cross-domain features were used to train a support vector machine (SVM) classifier. These traditional methods that provide localization capabilities often rely on heavy, time-consuming pre- and/or post-processing. Besides machine learning techniques, there are numerous state-of-the-art methods for splicing detection that use deep learning. Ying Zhang et al. [44] performed image region forgery detection using a deep learning approach in which a stacked autoencoder model extracts features in the first stage and the contextual information from each patch is then integrated for accurate detection.

图像拼接，也称为图像合成，是最常见的图像伪造形式。一般的操作方法是将一幅图像的片段插入到另一幅图像中进行拼接，这通常是为了欺骗观看者。拼接是一个比复制移动伪造检测更困难的任务。通过拼接来操作图像就像检测拼接伪造一样棘手。各种拼接检测技术的基本概念是寻找与图像特征不一致的区域。在拼接中，区域经常被重新采样、双重压缩和模糊以创建伪造图像。这激发了研究人员开发不同的技术来检测图像拼接。一般来说，由拼接操作引起的不可见的细微变化可以通过基于物理学和基于统计学的方法来追溯。目前已经提出了几种基于拼接边界处的异常瞬变来检测图像拼接的方法。在参考文献[1]中介绍了一种基于自然图像模型的图像拼接检测方法。该模型使用从图像和应用生成的二维阵列中提取的统计特征

Object removal is one of the most commonly used image manipulation operations. Image manipulation is no longer rocket science for non-professionals, and image tampering is not limited to applications on smartphones and computers. To make matters worse, such edits can be done online without downloading software or signing in, so fake images can be spread for illegal purposes anytime and anywhere. Users can easily remove unwanted areas and repair damaged images, and with the development of automatic coloring technology [33], the output photos can easily deceive viewers, leaving them unable to distinguish the real image from the tampered one. In particular, the recent application of deep learning to image restoration [2, 40, 42] can restore removed areas quickly and with high quality, which makes removal-based tampering harder to detect. So far, few scholars have studied object-removal manipulation.

从图像中删除对象是最常用的操作操作之一。图像操作对于非专业人士来说已经不再是火箭科学，图像篡改也不局限于智能手机和计算机的操作应用。更糟糕的是，它们可以在网上完成，而不需要下载和登录，所以假图像可以在任何时间用于非法目的传播。用户可以很容易地去除不感兴趣的区域，修复损坏的图像，随着自动着色技术[33]的发展，输出的照片可以很容易地欺骗观看者，使他们无法区分真实的图像和被篡改的图像。特别是最近深度学习在图像恢复[2,40,42]中的应用，可以快速、高质量地恢复被删除的区域，增加了检测图像去除和篡改的难度。到目前为止，学者对图像去除操作模式的研究很少。

Digital document images are commonly employed in many e-government and e-commerce systems to prove user identity and qualification. Due to the popularity of image editing tools, the trustworthiness of important document images is sometimes in question. Some key information of a document image can be edited by different attacks [1], e.g., text insertion via imitation, splicing, and copy-move. Moreover, the recent development of deep learning-based techniques has also posed real threats to the authenticity of digital images [2], including document images.

许多电子政务和电子商务系统普遍采用数字文档图像来证明用户身份和资格。由于图像编辑工具的流行，重要文档图像的可信度有时会受到质疑。文档图像的一些关键信息可以通过不同的攻击[1]进行编辑，例如，通过模仿、拼接和复制移动进行文本插入。此外，最近基于深度学习的技术的发展也对数字图像 [2] 和文档图像的真实性构成了真正的威胁。
Creating a high-quality forgery text image dataset is a critical first step in developing effective text tampering detection methods. By carefully selecting and preparing the images in our dataset, we can ensure that our approach is both accurate and robust, even when faced with complex and sophisticated tampering techniques.

创建高质量的伪造文本图像数据集是开发有效的文本篡改检测方法的关键的第一步。通过仔细选择和准备我们的数据集中的图像，我们可以确保我们的方法是准确和健壮的，即使面临复杂和复杂的篡改技术。
By including text swap, splicing, and removal in the dataset, the authors cover a wide range of possible text forgeries, increasing the generalization of the proposed approach.

通过在数据集中包含文本交换、拼接和删除，作者覆盖了广泛的可能的文本伪造，增加了所提方法的普遍性。
The difficulty of the TSTD task lies in how the network can better distinguish the tampered class from the real class, which puts higher requirements on the data generation process to ensure the texture consistency and background integrity in the tampered region.

TSTD 任务的难点在于网络如何更好地区分篡改类和真实类，这对数据生成过程提出了更高的要求，以保证篡改区域的纹理一致性和背景完整性。
This work proposes a new encoder-decoder network for improving the robustness of document image forgery localization against image blending. The network uses a pretrained EfficientNet-B3 to extract powerful features and a novel feature fusion module that pays more attention to the position of the useful pixels when fusing low-level and high-level features. An attention module is also added to combine global contextual information with local feature maps to reduce noise response and focus on forgery features. Additionally, the ASPP module is used to avoid the loss of feature information. The authors constructed a forgery document dataset preprocessed by image blending techniques to enhance the robustness of the proposed approach. The experiments show that the proposed network outperforms many other image forgery localization algorithms. The authors suggest that this approach can be extended to a wide range of application scenarios. The authors intend to further improve the generalization of their detection and localization models by applying more data augmentation techniques and using more realistic document images. They also plan to improve the accuracy of detecting small forgery regions by introducing more loss terms. Overall, this work presents a novel approach to improving the robustness of document image forgery localization against image blending. The proposed network and dataset can serve as a valuable resource for researchers working in the field of document image forgery detection and localization.

这项工作提出了一种新的编码器-解码器网络，用于提高文档图像伪造定位对图像混合的鲁棒性。该网络使用预训练的 EfficientNet-B3 来提取强大的特征和新颖的特征融合模块，该模块在融合低级和高级特征时更加关注有用像素的位置。还添加了一个注意力模块，将全局上下文信息与局部特征图相结合，以减少噪声响应并专注于伪造特征。此外，ASPP 模块用于避免特征信息丢失。作者构建了一个通过图像混合技术预处理的伪造文档数据集，以增强所提出方法的鲁棒性。实验表明，所提出的网络优于许多其他图像伪造定位算法。作者建议这种方法可以扩展到更广泛的应用场景。作者打算通过应用更多数据增强技术和使用更逼真的文档图像来进一步改进其检测和定位模型的泛化。他们还计划通过引入更多损失项来提高检测小伪造区域的准确性。总的来说，这项工作提出了一种新方法来提高文档图像伪造定位对图像混合的鲁棒性。所提出的网络和数据集可以作为在文档图像伪造检测和定位领域工作的研究人员的宝贵资源。
As a result, we propose a SAN that assures that the added locations and objects are image‐specific, enabling models trained on the SAN‐generated data set to achieve increased generalization without fine‐tuning.

因此，我们提出了一个SAN，以确保添加的位置和对象是特定于图像的，使在SAN生成的数据集上训练的模型能够在没有微调的情况下实现增加的泛化。

It should be emphasized that, because the training set contains only tampered images, not all images synthesized by SAN are as visually plausible as real images. In the first and second rows of Figure 4, we show visually reasonable and visually unreasonable data generated by the SAN, respectively. When SAN has difficulty finding a hiding location near the image center, it chooses a location near the periphery of the image to insert objects for hiding.

需要强调的是，由于训练集只包含被篡改的图像，所以并不是所有由SAN合成的图像，就像真实的图像一样，都包含视觉上的合理性。在图4的第一行和第二行中，我们分别展示了由SAN生成的视觉上合理和不合理的数据。当SAN难以在图像中心内找到一个隐藏的位置时，它会选择一个靠近图像外围的位置来插入要隐藏的对象。

Our analysis concludes that although some detailed information is lost by reducing the image size, this also reduces the impact of attacks such as image compression and noise introduction. Training on smaller images leads to a model that focuses more on the global information of the image. Therefore, HDU‐Net 2 trained at image size 128 × 192 is more robust than HDU‐Net 2 trained at image size 256 × 384.

我们的分析得出结论，虽然减少图像大小会导致一些详细的信息丢失，但这也减少了攻击的影响，如图像压缩和噪声引入。训练较小尺寸的图像可以得到一个更关注图像的全局信息的模型。因此，图像大小为128×192训练的HDU-Net 2比图像大小为256×384训练的HDU-Net 2更具鲁棒性。

We combine U‐Net [47] and the dense block of DenseNet [48] to compose DU‐Net. Then, by converting RGB space to other color spaces, more attribute differences between tampered and nontampered areas are extracted to better locate the tampered area in the image. We choose four available color spaces, RGB, steganalysis rich model (SRM), V, and A, and combine them with DU‐Net to compose HDU‐Net. 【HDU‐Net】

我们结合U-Net47和DenseNet48的密集块来组成DU-Net。然后，通过将RGB空间转换为其他颜色空间，提取篡改区域的属性差异，更好地定位图像中的篡改区域。我们选择了四个可用的颜色空间RGB、步进分析富模型（SRM）、V和A，最后与DU-Net组合来组成HDU-Net。
The comparative experiment results under different attacks are shown in Figure 7. Ordinates in Figure 7 represent the F1 score. From all subfigures in Figure 7, we can clearly see that the proposed method has the best localization performance under different attacks. Figure 7a is the result under JPEG compression. It can be observed that the slopes of all lines are very small, which indicates that these approaches are robust against JPEG compression with quality factors varying from 60 to 100. Figure 7b exhibits the performance under Gaussian noise. With the increase of standard deviation, the F1 scores of the proposed MFAN and EncNet [46] drop down gradually, and EncNet [46] degrades more rapidly than the proposed method.

不同攻击下的比较实验结果如图7所示。图7中的纵坐标表示F1的分数。从图7中的所有子图中，我们可以清楚地看出，所提出的方法在不同的攻击下具有最好的定位性能。图7a是在JPEG压缩条件下的结果。可以观察到，所有线的斜率都很小，这表明这些方法对JPEG压缩是稳健的，质量因子从60到100不等。图7b展示了在高斯噪声条件下的性能。随着标准差的增加，所提出的MFAN和EncNet [46]的F1分数逐渐下降，并且EncNet [46]的降解速度比所提出的方法更快。
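
The F1 score on the vertical axis of these robustness plots is, in localization work, usually computed per pixel between the predicted mask and the ground-truth mask. The following is a minimal sketch of that computation, assuming binary masks given as nested lists. The function name `pixel_f1` and the convention of returning 0.0 when there are no true positives are assumptions.

```python
def pixel_f1(pred_mask, gt_mask):
    """Pixel-level F1 between two binary masks of identical shape."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1      # forged pixel correctly flagged
            elif p and not g:
                fp += 1      # pristine pixel wrongly flagged
            elif g and not p:
                fn += 1      # forged pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gt = [[1, 1], [0, 0]]
half_right = pixel_f1([[1, 0], [1, 0]], gt)
perfect = pixel_f1(gt, gt)
```

A robustness curve such as those in Figure 7 would repeat this measurement on the same test images after each attack strength and plot the average.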

Figure 7c is the experiment result under resize. All approaches except for NOI [43] show small slopes. The localization result under median blur is demonstrated in Figure 7d. The performances of RRU-Net [36], EncNet [46] and the proposed method get worse when the kernel size becomes larger, among which RRU-Net [36] is less sensitive.

图7c为调整大小下的实验结果。除NOI [43]外，所有的方法都显示出较小的坡度。中值模糊下的定位结果如图7d所示。当核大小增大时，RRU-Net [36]、EncNet [46]和所提方法的性能会下降，其中RRU-Net [36]的敏感性较低。

As shown in Figure 8, the first and second rows show the defense capacities of the various methods under JPEG compression and noise addition attacks, respectively. The four columns, from left to right, correspond to the attack experiments executed on each of the four data sets. For the JPEG compression attack, when the quality factor decreases from 90% to 50%, the performance of most of the compared detection methods drops dramatically. In contrast, HDU‐Net remains more stable, and its performance is superior to the others. For the noise addition attack, on three of the data sets (CASIA v1.0, Realistic, and In‐The‐Wild), all methods except HDU‐Net tend toward detection failure. Overall, the two attack experiments on the four data sets demonstrate the robustness and effectiveness of HDU‐Net.

如图8所示，第一行和第二行分别表示各种方法在JPEG压缩攻击和噪声附加攻击下的防御能力。从左到右的四列表示对四个数据集上执行的攻击实验。对于JPEG压缩攻击，当JPEG压缩的质量因子从90%下降到50%时，大多数比较检测方法的性能都会显著下降。相比之下，HDU-Net保持更好的稳定，其性能优于其他公司。对于噪声添加攻击，在三个数据集（CASIA v1.0、现实和狂野）上，除HDU-Net外，所有方法都倾向于检测状态失败。总体而言，通过对四个数据集进行两次攻击实验，证明了HDU-Net的鲁棒性和有效性。

Even though Dice loss reduces false positives and improves region overlap, it falls short for small forged regions. As the regions become very small, |G| becomes small. If the model then predicts nothing at all, i.e., |P| → 0, then |P ∩ G| → 0 and the total loss decreases. Thus, Dice loss alone is ineffective in these instances.

即使骰子丢失改善了假阳性和区域重叠，但它缺乏小的锻造区域。这是因为当这些区域变得非常小时，|-G-|就会变得很小。所以，如果模型根本没有预测任何东西，即，如果|P|→0，那么|P∩G|→0，总损失减少。因此，在这些情况下，掷骰子的损失本身是无效的。
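
A small numerical check makes the argument concrete. The notes do not give the exact Dice formula, so this sketch assumes the common smoothed form, 1 − (2|P ∩ G| + s) / (|P| + |G| + s), with a hypothetical smoothing constant s = 1. Without smoothing, an empty prediction would score a loss of 1 regardless of |G|.

```python
def dice_loss(pred, target, smooth=1.0):
    """Smoothed Dice loss over flat lists; `smooth` is an assumed constant."""
    inter = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


empty_pred = [0.0] * 100                 # the model predicts nothing
small_gt = [1.0] + [0.0] * 99            # 1 forged pixel
large_gt = [1.0] * 50 + [0.0] * 50       # 50 forged pixels

loss_small = dice_loss(empty_pred, small_gt)
loss_large = dice_loss(empty_pred, large_gt)
```

Under this assumption, missing the single-pixel region costs 0.5, while missing the 50-pixel region costs about 0.98, so the penalty for predicting nothing shrinks as the forged region shrinks.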

First, we discuss in detail the effect of the different loss terms in Equation (23). Five parameter combinations, (λ1=1, λ2=0), (λ1=0.9, λ2=0.1), (λ1=0.7, λ2=0.3), (λ1=0.5, λ2=0.5) and (λ1=0.3, λ2=0.7), were tested, and the results are shown in Table IV. An interesting phenomenon can be seen in these results: when λ1 > λ2 (λ2 > 0), the performance of the network improves continuously and is best at (λ1=0.7, λ2=0.3). However, when λ1 < λ2, that is, when the weight of Ledge exceeds that of Lsem, the performance of the network drops slightly.

首先，我们在方程（23）中详细讨论了不同损失项的影响。对五种参数组合（λ1=1，λ2=0）、（λ1=0.9，λ2=0.1）、（λ1=0.7，λ2=0.3）、（λ1=0.5，λ2=0.5）和（λ1=0.3，λ2=0.7）进行了测试，观察实验结果，如表IV所示。从这些结果中可以发现一个有趣的现象，当λ1 > λ2（λ2 > 0）时，网络的性能不断得到改善，并在（λ1=0.7，λ2=0.3）时获得最佳的性能。但是，当λ1 < λ2，即Ledge的权重大于Lsem时，网络的性能略有下降。

Therefore, when our refinement is added in the BFI module, the false positives caused by non-manipulated regions can be reduced significantly. In addition, we observed that, compared to spatial-wise refinement, channel-wise refinement achieves slightly better performance, because the features for channel-wise refinement are simply adjusted channel by channel. Also, our SRBFI outperformed CBAM [45] on all three standard datasets. Due to the serial connection of its attention modules, spatial attention may not be focused on the tampered regions once the channel attention drifts away. In contrast, our network can utilize global information to progressively learn and selectively strengthen useful features while suppressing useless ones. Subsequently, we tested the ablation influence of the MAE module and show the results in Table VII.

因此，我们的网络被添加到BFI模块中，由非操纵区域引起的误报可以显著地最小化。此外，我们很容易地观察到，与空间上的重新细化相比，信道细化可以获得稍微更好的性能，因为信道细化的特性只是逐信道进行调整。此外，我们的SRBFI在所有三个标准数据集上都优于CBAM [45]。由于注意模块的串行连接，一旦通道注意漂移，空间注意可能不会集中在被篡改的区域上。相应地，我们的网络可以利用全局信息来逐步学习和选择性地加强有用的特征，同时抑制无用的特征。随后，我们测试了MAE模块的烧蚀影响，结果如表七所示。
After the MAE module is added, the overall performance of the network improves significantly and is superior to that obtained with the Sobel operator. The MAE module can adaptively extract edge features from manipulated image regions at different scales, which enhances the robustness of the network and avoids false positives caused by edge feature redundancy. In addition, Table VI demonstrates the effectiveness of the DSRD module. We compared different upsampling methods, such as nearest neighbor interpolation upsampling [59] and deconvolution upsampling [60]. Deconvolution upsampling performs the sampling operation with learnable parameters and accordingly produces more accurate prediction results, especially when it is applied in the decoder. In general, the overall performance benefits from the reconstructed DSRD module.

在添加了MAE模块后，网络的整体性能可以显著提高，并显示出优于Sobel运营商的性能。实际上，MAE模块可以自适应地从不同尺度的图像操纵区域中提取边缘特征，其优点是增强了网络的鲁棒性，避免了边缘特征冗余造成的误报。此外，表六还展示了DSRD模块的有效性。我们比较了不同的上采样方法，如最近邻插值上采样[59]和反褶积上采样[60]。反褶积上采样通过参数可学习的方式进行采样操作，从而产生更准确的预测结果，特别是当它应用于解码器时。一般来说，重构的DSRD模块

To gain more insight, we visualized the localization results at different stages using the aforementioned experimental setup. The corresponding results are shown in Figure 12. It can be seen that, during network training, the SRBFI module significantly reduced false positives from non-manipulated regions, and the MAE module strengthened the localization of edge pixels in manipulated regions. With this enhanced pixel localization capability, our method achieves significantly superior performance in image manipulation localization.

为了更深入地了解，我们使用上述实验设置，评估了不同阶段的可视化定位结果。相应的结果如图12所示。可以看出，在我们的网络训练过程中，SRBFI模块显著减少了非操纵区域的误报，MAE模块也增强了操纵区域边缘像素的定位能力。随着像素定位能力的增强，我们的方法在图像处理定位方面具有显著的优越性能。

Due to a serial connection of the attention module, the spatial attention may not remain focused on the tampered regions once the channel attention drifts away. As a solution, our network is designed to utilize global information for progressively learning and selectively strengthening useful features, while suppressing irrelevant ones.

由于注意模块的串行连接，一旦通道注意力离开，空间注意力可能不会继续集中在被干扰的区域。作为一种解决方案，我们的网络旨在利用全球信息逐步学习和有选择地加强有用的功能，同时抑制不相关的功能
Image blending is an essential step in producing realistic image tampering. Due to differences in camera response and illumination, copying and pasting overlapping areas may leave visible seams between the images. We use image blending to ensure that the transformed pixels conform to the target domain and remain consistent with it. Image blending can hide boundaries and reduce color differences between images. As a result, blending makes it more challenging to distinguish between authentic and forged document images.
The tampered and real-world texts share a semantic space (text position and geometric structure) but have local texture differences. As shown in Fig. 1, both tampered and real-world texts exist in the same position and have identical geometric structure, while tampered texts contain different local textures (e.g., smoothness) than real-world ones. Thus, text image manipulation detection methods need to maximize the discrimination of class-specific texture features while maintaining the semantic invariance.
CBAM inferred attention weights sequentially along two dimensions, spatial and channel. By multiplying the attention with the original feature map, weights are assigned to the feature map so that the features complete adaptive adjustment.
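
The sequential weighting can be sketched in a few lines. This is a deliberately simplified stand-in, not CBAM itself: the real module feeds average- and max-pooled descriptors through a shared MLP for the channel weights and through a convolution for the spatial weights, whereas this sketch assumes a plain mean followed by a sigmoid in both steps. The function name `simplified_cbam` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simplified_cbam(features):
    """Apply channel attention, then spatial attention, by multiplication.

    features: list of C channels, each an H x W nested list of floats.
    """
    # Step 1: one weight per channel, multiplied into that channel.
    refined = []
    for channel in features:
        count = sum(len(row) for row in channel)
        mean = sum(sum(row) for row in channel) / count
        w_c = sigmoid(mean)
        refined.append([[w_c * v for v in row] for row in channel])

    # Step 2: one weight per spatial position, shared by all channels.
    c, h, w = len(refined), len(refined[0]), len(refined[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            w_s = sigmoid(sum(refined[k][y][x] for k in range(c)) / c)
            for k in range(c):
                out[k][y][x] = w_s * refined[k][y][x]
    return out


single = simplified_cbam([[[2.0]]])
```

Because the spatial weights are computed from the channel-refined features, an error in the first step propagates into the second, which is the serial dependency these notes discuss in connection with CBAM.
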
While all of these methods are very promising, CNNs in their current form tend to learn only features related to image content. However, most images undergo unpredictable changes caused by content manipulations or geometric distortions such as lossy compression, noise addition, resizing, and/or filtering, both before and after the possible alteration.

虽然所有这些方法都很有前途，但CNN目前的形式往往只学习与图像内容相关的特征。然而，在可能的更改之前和之后，大多数图像可能会经历由内容操作或几何扭曲引起的不可预测的变化，如有损压缩、噪声、调整大小和/或过滤。

It is therefore essential that tampered-image detection algorithms be robust to these manipulations. In this paper, motivated by the fact that lossy compression is the most relevant type of image post-processing, we propose a robust framework that contributes to improving camera model identification and image tampering detection. Our experiments first demonstrate the importance of taking lossy compression into account and then highlight the performance of our proposal.

因此，被篡改的图像检测算法必须考虑到面对这些操作的鲁棒性。基于有损压缩是最相关的图像后处理类型，本文提出了一个鲁棒的框架，有助于改进相机模型识别和图像篡改检测。我们的实验将首先证明考虑有损压缩的重要性，然后强调我们的建议的性能。

The final output M is a binary mask, where black parts indicate patches belonging to the pristine region and white ones indicate forged patches. If no (or just a few) forged pixels are detected, the image is considered as pristine.

最终的输出M是一个二进制掩码，其中黑色部分表示属于原始区域的补丁，白色部分表示伪造的补丁。如果没有检测到（或只是几个）伪造的像素，则认为图像是原始的。
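
The image-level decision described above can be sketched as a simple count over the mask. The notes do not say how many forged pixels count as "just a few", so the `min_forged_fraction` threshold here is a hypothetical parameter, as is the function name `classify_image`.

```python
def classify_image(mask, min_forged_fraction=0.01):
    """Label an image from its binary mask M (1 = forged/white, 0 = pristine/black).

    The image is called forged only if the forged fraction exceeds the
    assumed threshold; otherwise it is considered pristine.
    """
    total = sum(len(row) for row in mask)
    forged = sum(1 for row in mask for v in row if v)
    return "forged" if forged / total > min_forged_fraction else "pristine"


clean = [[0] * 10 for _ in range(10)]
speck = [row[:] for row in clean]
speck[0][0] = 1                      # a single isolated detection
patch = [row[:] for row in clean]
for i in range(5):
    patch[0][i] = 1                  # a small forged strip

labels = [classify_image(clean), classify_image(speck), classify_image(patch)]
```

With the assumed 1% threshold, the single isolated pixel is treated as noise and the image stays pristine, while the five-pixel strip is enough to flag it.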

Finally, to evaluate the influence of compression, all images from the chosen datasets are compressed with different quality factors (FQ): 90%, 80%, and 70%. The CNNs trained on data compressed at these FQ values are named CNN90, CNN80, and CNN70, respectively, and the CNN trained on mixed compressed data is named CNNm.