
Explore how modern image inpainting algorithms are evolving in 2025–2026. From classical patch-based methods and LaMa to MAT, Stable Diffusion, and the latest research papers — discover what’s actually working across photo editing, cultural heritage, medical imaging, and generative AI.
- What Is Image Inpainting?
- Why Image Inpainting Matters More Than Ever
- Types of Image Inpainting Tasks
- Classical Inpainting Methods
- LaMa — Large Mask Inpainting
- DeepFill v1 and v2
- MAT — Mask-Aware Transformer
- Stable Diffusion Inpainting
- ControlNet for Inpainting
- Algorithm Comparison: 2025–2026
- Top 5 SOTA Inpainting Datasets
- Recent Research in Image Inpainting — What the Papers Are Saying in 2025–2026
- MEGA-NET: Attention Fusion Meets Multi-Scale Guidance
- CSMC: A Mathematical Approach to Inpainting via Matrix Completion
- Spa-IHANet: Sparse Transformer Meets Hybrid Attention in a Two-Stage GAN
- InDiTE-Diff: When Your Reference Image Is Also Broken
- What These Papers Tell Us About Where Inpainting Is Heading
- Real-World Applications
- Challenges in 2025–2026
- Final Thoughts
- FAQ
Image inpainting has been around for centuries.
Before computers existed, museum conservators painstakingly filled in damaged sections of paintings by hand — matching colors, brushstrokes, and textures so carefully that the restoration was invisible to the naked eye.
The goal was always the same:
Make the image whole again without anyone knowing it was ever broken.
Today, AI can do a version of that in seconds.
Modern image inpainting algorithms can fill in missing regions of a photograph, remove unwanted objects, restore damaged historical images, and generate visually plausible content that blends seamlessly with the surrounding pixels.
What used to take skilled human artists hours or days now happens in a single forward pass through a neural network.
But the problem is harder than it looks from the outside.
Filling in missing pixels isn’t just about picking the right color. It requires understanding the structure of the scene, the texture of the surfaces, the semantic content of what should be there, and the visual coherence of the whole image.
Getting all of that right simultaneously — especially for large missing regions — is one of the genuinely hard problems in computer vision.
In 2025–2026, that problem is being attacked from multiple directions at once.
What Is Image Inpainting?
At its core, image inpainting is the task of reconstructing missing or damaged regions of an image in a way that looks natural and believable.
The input is typically:
- a damaged or incomplete image
- a mask that indicates which regions need to be filled
The output is a completed image where the masked regions have been filled with visually plausible content.
What makes this genuinely hard is that the problem is ill-posed — there isn’t a single correct answer. For any given masked region, there are multiple valid completions that would look natural.
Input: Damaged Image + Mask → Inpainting Model → Completed Image
A model has to learn not just what’s likely in general, but what’s likely given the specific surrounding context of that particular image.
The challenge is making that completed image look like it was always there.
Why Image Inpainting Matters More Than Ever
Image inpainting has quietly become one of the most practically important tasks in computer vision.
A few years ago, it was mostly used for photo touch-ups and academic benchmarks. Today the applications have expanded dramatically as the underlying technology improved.
In 2025–2026, image inpainting is actively being used in:
- photo editing and object removal
- cultural heritage and document restoration
- medical imaging and scan reconstruction
- film and video post-production
- autonomous systems and robotics
- generative AI and content creation
- satellite and aerial imagery reconstruction
The demand for faster, higher-quality, and more controllable inpainting is driving serious research investment — and 2025–2026 is producing some genuinely interesting results.
Types of Image Inpainting Tasks
Before getting into algorithms, it helps to understand what you’re actually trying to solve.

Object Removal Inpainting The most common everyday use case. Remove a person, object, or unwanted element from a photo and fill in the background behind it. The model needs to understand what the background would look like if the object had never been there.
Structural Inpainting Restoring physically damaged images — torn photographs, water-stained documents, scratched film frames. The missing regions are determined by actual physical damage, not intentional cropping.
Face Completion Reconstructing missing facial features — a partially occluded eye, a masked mouth, a corrupted section of a portrait. Faces are the hardest test of inpainting quality because human visual perception is extremely sensitive to facial structure.
Scene Completion Filling in large missing regions of complex scenes — a missing building facade, an occluded street, a corrupted landscape photograph. Requires strong semantic understanding of what should be there.
Reference-Guided Inpainting Using a second image of the same scene as a reference to guide reconstruction — particularly useful when restoring historical images that have corresponding reference photographs, even if those references are from a different time period.
Most real-world systems have to handle multiple task types simultaneously, which is part of what makes image inpainting genuinely hard to get right across the board.
Classical Inpainting Methods
To understand where the field is in 2025–2026, it helps to understand where it started.
Before deep learning, inpainting was done through two main approaches.
Diffusion-Based Methods
Diffusion-based methods propagated information from the known pixels into the unknown region by solving partial differential equations. They worked reasonably well for small gaps and smooth regions — cracks in plaster, small scratches in photographs.
But they produced blurry, structurally incorrect results for larger holes because they had no concept of what the image was actually depicting. They could smooth and blend, but they couldn’t generate.
Exemplar-Based Methods
Exemplar-based methods searched the rest of the image for patches similar to the area around the missing region and copied texture from those patches into the hole.
This worked better for repeating textures — grass, brick walls, fabric — but fell apart completely when the missing region required semantic understanding.
If a face was missing, exemplar methods couldn’t understand that a face should be there. They could only look for visually similar patches — which in a portrait might be a collar, a background, anything that happens to share local statistics with the boundary of the missing region.
Both classical approaches share the same fundamental limitation:
They don’t understand what an image is about.
They operate on pixels and local statistics, not semantics. They can propagate texture, but they can’t invent content.
Deep learning changed that entirely.
LaMa — Large Mask Inpainting
Published 2021 — still widely deployed in 2025–2026
LaMa — Large Mask inpainting — was a genuine breakthrough when it was released, and it remains one of the most practically useful inpainting models available four years later.
That kind of longevity in a fast-moving field means the core ideas were genuinely right.
The Core Idea
The key insight behind LaMa is that global context is critical for large-region inpainting, and standard convolutional networks — even with attention mechanisms — struggle to capture long-range dependencies efficiently.
LaMa’s solution is Fast Fourier Convolutions (FFC).
FFC operates in the frequency domain rather than the spatial domain. Each convolution operation has a receptive field that covers the entire image — not just a local neighborhood. This gives the model awareness of global image structure from the very beginning of processing, not just after many layers of stacked convolutions.
The result is that LaMa handles large irregular masks particularly well — the kind of masks that cover significant portions of an image with complex shapes. While other models produced blurry or structurally incoherent fills in these regions, LaMa generated sharp, globally coherent completions.
Why It’s Still Relevant in 2025–2026
LaMa remains a strong baseline and is widely used in production systems including Adobe Firefly and various mobile photo editing applications. The FFC architecture it introduced has also influenced many subsequent models — including MEGA-NET (discussed in the Research section), which extends FFC with additional attention mechanisms.
Strengths: Large mask handling, global coherence, efficient architecture
Weaknesses: Can struggle with fine texture details, less powerful than
diffusion models for complex semantic generation
Best for: Object removal, background completion, large irregular masks
DeepFill v1 and v2
Yu et al., 2018 and 2019 — foundational architecture still in use
DeepFill introduced two ideas that became foundational in image inpainting and are still used today in various forms.
DeepFill v1 — Coarse-to-Fine and Contextual Attention
DeepFill v1 introduced the coarse-to-fine two-stage architecture for inpainting. The first stage generates a rough initial completion — getting overall structure and semantics roughly right. The second stage refines that completion, improving texture quality and fine details.
This separation of concerns — handle structure first, then texture — became a widely adopted pattern in subsequent architectures.
v1 also introduced contextual attention — a mechanism that allows the model to explicitly borrow texture patches from elsewhere in the image and paste them (with learned transformations) into the missing region. This made texture in filled regions much more coherent with the surrounding image.
DeepFill v2 — Gated Convolutions
DeepFill v2 replaced standard convolutions with gated convolutions — a learnable masking mechanism where each convolutional feature map has a corresponding learned gate that decides how much each spatial location should contribute, based on whether it’s in the valid or masked region.
This was a significant improvement over partial convolutions (which used fixed binary masks) because the gates are learned from data rather than fixed by rule.
Gated convolutions became the standard building block for inpainting networks for several years after v2 was published — and you’ll still find them inside many modern architectures, including Spa-IHANet discussed in the Research section.
Strengths: Gated convolutions, contextual attention, strong texture quality
Weaknesses: Older architecture, outperformed by transformer and diffusion approaches
Best for: General-purpose inpainting, face completion, structured scenes
MAT — Mask-Aware Transformer
Li et al., CVPR 2022 — transformer-native inpainting
MAT brought Transformer architecture to image inpainting in a meaningful way — not just bolting attention onto a CNN backbone, but redesigning the inpainting pipeline around Transformers from the ground up.
The Problem It Solved
Standard Vision Transformers (ViT) treat all image patches equally, regardless of whether they’re valid pixels or masked regions. For inpainting, that’s a serious problem — the model needs to explicitly know which regions are missing and treat them fundamentally differently during feature extraction and attention computation.
MAT introduces a mask-aware token design where masked and unmasked tokens are handled differently throughout the Transformer. The model learns to focus attention on valid regions while generating content for masked regions, rather than blindly attending to everything including corrupted areas.
Results and Relevance
The results — particularly on high-resolution face completion using CelebA-HQ — were notably strong, showing that Transformer-based architectures could match and exceed GAN-based approaches on perceptual quality metrics.
MAT also introduced a style manipulation module that allows explicit control over the style of the generated content, which is useful in creative applications where you want some agency over how the filled region looks.
Strengths: High-quality completion, mask-aware attention, style control
Weaknesses: Computationally expensive, requires significant GPU memory
Best for: High-resolution face completion, creative editing applications
Stable Diffusion Inpainting
2022–present — actively developed through 2025–2026
Stable Diffusion changed what people expected from image inpainting.
Before diffusion models, inpainting was primarily a fill-in-the-gap task. You defined the missing region and the model filled it in. The results were often good but sometimes generic, and you had limited control over what got generated.
Stable Diffusion inpainting added text guidance to the entire process.
You can now describe what you want to appear in the missing region in plain English, and the model generates content that matches both the visual context of the surrounding image and your text description.
Example: "a wooden bench under an oak tree, afternoon sunlight, photorealistic"
How It Works
The model uses a masked diffusion process — it only adds noise to the masked regions during training, so at inference time it only generates new content where the mask indicates, while preserving the original pixels everywhere else.
In 2025–2026, Stable Diffusion inpainting (and its successors in the SDXL and SD 3.x lines) represent the state of the art for creative and semantic inpainting tasks. The combination of visual context understanding and language-guided generation makes it capable of results that earlier architectures simply could not produce.
The Honest Tradeoff
Diffusion models require multiple denoising steps, making them significantly slower than single-pass architectures like LaMa. For real-time or high-throughput applications, inference speed remains a practical limitation.
Strengths: Text-guided generation, exceptional quality, semantic understanding
Weaknesses: Slow inference, high compute requirements, less predictable outputs
Best for: Creative editing, object replacement with semantic control, complex scene generation
ControlNet for Inpainting
Zhang et al., 2023 — structural control for diffusion inpainting
ControlNet didn’t invent inpainting, but it significantly expanded what you could do with diffusion-based inpainting by adding structural control to the generation process.
The Problem It Addresses
When you use Stable Diffusion inpainting, you can describe what you want in text — but you have limited control over the exact structure, pose, edges, or depth of what gets generated. The results are semantically appropriate but structurally unpredictable.
ControlNet adds conditioning signals — edge maps, depth maps, pose skeletons, segmentation masks — that give you explicit control over the structure of the generated content while still leveraging the generative power of the underlying diffusion model.
For inpainting specifically, this means you can:
- specify that a filled region should contain a person in a specific pose
- control the edge structure of generated architecture
- ensure depth consistency between the filled region and the rest of the scene
- use segmentation maps to guide semantic region filling
In practice, ControlNet combined with Stable Diffusion inpainting has become the combination of choice for production visual effects work — where you need both semantic quality and structural precision.
Strengths: Structural control, pose/edge/depth conditioning, production-quality outputs
Weaknesses: Requires conditioning signals, added complexity, slow inference
Best for: VFX production, controlled object replacement, architecture and scene editing
Algorithm Comparison: 2025–2026
| Algorithm | Architecture | Mask Size | Speed | Semantic Control | Best Use Case |
|---|---|---|---|---|---|
| Diffusion-based (Classical) | PDE | Small | Fast | None | Crack and scratch repair |
| Exemplar-based (Classical) | Patch matching | Small-Medium | Fast | None | Repeating texture regions |
| LaMa | FFC + GAN | Large | Fast | None | Object removal, large masks |
| DeepFill v2 | Gated Conv + GAN | Medium | Fast | None | General inpainting |
| MAT | Transformer + GAN | Large | Moderate | Style | High-res face completion |
| Stable Diffusion | Diffusion Model | Any | Slow | Text | Creative editing |
| ControlNet + SD | Diffusion + Control | Any | Slow | Text + Structure | VFX, controlled replacement |
Top 5 SOTA Inpainting Datasets
Before any algorithm gets published, it has to prove itself on data.
And in image inpainting, the choice of dataset matters more than people often realize. Different datasets test completely different things — face reconstruction, scene completion, texture synthesis, urban restoration.
A model that performs brilliantly on faces might fall apart on complex street scenes.
Here’s a look at the benchmark datasets that show up most consistently across image inpainting research in 2025–2026, what each one actually tests, and why researchers keep coming back to them.
1. CelebA-HQ
30,000 high-quality celebrity face images at 1024×1024 resolution
Faces are the gold standard test for image inpainting — and for good reason.
The human visual system is extraordinarily sensitive to facial structure and appearance. A subtle error in eye symmetry, skin texture, or facial geometry that would go unnoticed in a landscape image is immediately obvious when it’s a face.
If your inpainting model can produce convincing face completions, it’s doing something right.
CelebA-HQ allows researchers to test inpainting on both small facial region reconstruction — eyes, mouth, specific features — and large-region completion covering half a face or more. It appears in virtually every major inpainting paper, including MEGA-NET, Spa-IHANet, and MAT.
Standard split: 28,000 training / 2,000 testing
Resolution: 1024×1024
Primary test: Facial structure and texture coherence
2. Places2
10M+ scene images across 365 scene categories
Places2 tests something fundamentally different from CelebA-HQ. Where face datasets test structure and detail precision, Places2 tests semantic diversity.
Your model has to understand what a beach looks like, what a kitchen looks like, what a forest looks like — and generate appropriate content for each of 365 scene categories.
It’s also much more challenging in terms of texture variety. The visual patterns across 365 scene categories cover almost the full range of natural image textures, from highly regular (tile floors, brick walls) to completely irregular (forest undergrowth, ocean waves).
For inpainting research, a strong result on Places2 typically means the model has learned genuinely general representations — not just memorized patterns for specific domains.
Standard split: ~34,500 training / 2,000 testing (subset)
Resolution: Variable, typically 256×256 for training
Primary test: Semantic diversity and scene-level coherence
3. Paris StreetView — ParisStreetView-RandomMasks
22,601 urban street-view images with random irregular masks
Paris StreetView occupies an interesting middle ground between faces and general scenes. It has consistent structural properties — buildings have windows, facades have regular patterns, streets have consistent perspective — but it’s not as tightly constrained as face data.
This makes it a good test of structural coherence in architectural contexts. Can the model understand that a building facade has a repeating window pattern and fill in a missing section appropriately? Can it maintain perspective consistency across a large masked region?
The ParisStreetView-RandomMasks dataset extends the original Paris StreetView with synthetically generated irregular random masks and pre-generated corrupted images — ready for direct use in supervised inpainting training without additional preprocessing.
Dataset at a glance:
- 22,601 urban street-view images
- Random irregular free-form masks (varying thickness and complexity)
- Pre-generated corrupted images ready for training
- Includes
train.txt,val.txt, andannotations.csv - Total size: 12.7 GB
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
- DOI:
10.5281/zenodo.20233925
Download — same dataset, choose your preferred platform:
| Platform | Link | Best For |
|---|---|---|
| Zenodo | Download from Zenodo → | Academic citation, DOI reference, direct file download |
| Kaggle | Download from Kaggle → | Kaggle notebook integration, API access |
| Hugging Face | Download from Hugging Face → | HF datasets library, streaming support |
Cite as: Kannan, Wisen. (2026). ParisStreetView-RandomMasks: Large-Scale Urban Image Inpainting Dataset with Random Irregular Masks (Version v1.0). Zenodo. https://doi.org/10.5281/zenodo.20233925
Images: 22,601 street-view scenes
Masks: Irregular free-form random strokes
Resolution: Variable
Primary test: Structural coherence in architectural urban scenes
4. FFHQ — Flickr-Faces-HQ
70,000 high-quality human face images at 1024×1024 resolution
FFHQ (Flickr-Faces-HQ) is the larger, more diverse companion to CelebA-HQ in face-based inpainting benchmarks. Where CelebA-HQ focuses on celebrity faces with relatively consistent photography conditions, FFHQ includes a much wider range of ages, ethnicities, lighting conditions, accessories, and image styles — sourced from Flickr rather than controlled photography sessions.
This diversity makes FFHQ a harder and more realistic test of facial inpainting. A model that performs well on FFHQ has demonstrated the ability to generalize across the genuine variety of human faces, not just the specific distribution of professional celebrity photographs.
In 2025–2026, FFHQ is increasingly being used alongside or as an alternative to CelebA-HQ in papers that want to demonstrate generalization across face diversity.
Total images: 70,000
Resolution: 1024×1024
Primary test: Diverse facial structure, varying lighting and demographics
License: Creative Commons BY 2.0 (Flickr images)
5. TAMP-Street Dataset
1,362 time-variant urban street scene pairs with irregular masks
TAMP-Street is the newest and most practically motivated dataset in this list — assembled specifically for the Time-vAriant iMage inPainting (TAMP) task introduced by Xing et al. (IEEE TIP, 2026).
What makes it fundamentally different from every other dataset here is that it provides pairs of images taken at significantly different times of the same urban scene — one year apart in Pittsburgh — capturing genuine temporal change. Same location, different season, different objects, different appearance.
This forces the model to deal with realistic inpainting challenges that conventional datasets completely ignore: reference images that don’t perfectly match the target, content that has changed, and regions where no reference exists at all because both images are damaged in the same area.
Total pairs: 1,362 image pairs (816 training / 256 validation / 290 testing)
Mask ratios: 20%–60% at 10% intervals
Base data: VL-CMU-CD (one year of Pittsburgh urban change documentation)
Primary test: Time-variant reference-guided inpainting
Download: https://drive.google.com/drive/folders/1frK37q4CZ2tDUc5_x65N2wOXEd2oJcL
Common Mask Types Used Across Datasets
The dataset is only half the story. The masks used to simulate damage matter just as much.

Irregular masks — random freeform shapes that simulate real-world damage patterns. Introduced by Liu et al. (partial convolutions paper, 2018) and now standard across most benchmarks. Typically split by mask ratio from 0%–60%.
Large square masks — center or randomly placed rectangular masks that test completion of large structured regions. Often used to stress-test architectural reasoning.
Object segmentation masks — masks defined by object boundaries, simulating object removal. Tests semantic-level completion rather than just texture synthesis.
Most recent papers test across multiple mask types and mask ratios simultaneously — since a model that excels at small irregular masks may fail completely on large structured ones.
Recent Research in Image Inpainting — What the Papers Are Saying in 2025–2026
The academic community hasn’t slowed down.
In 2025–2026, several papers are pushing the field in genuinely interesting directions — from attention-based architectures tackling large missing regions, to mathematical approaches rethinking the problem from first principles, to entirely new problem formulations that nobody had seriously addressed before.
Here’s a look at the notable papers making the rounds right now.
MEGA-NET: Attention Fusion Meets Multi-Scale Guidance
Liu et al., International Journal of Machine Learning and Cybernetics, April 2026 Read Paper →
Large missing regions have always been the hardest problem in image inpainting. Fill too aggressively and you lose local detail. Focus too much on local features and global coherence falls apart.
MEGA-NET addresses this tension directly.
The model is a single-stage architecture that combines three key components working together:
Fast Fourier Convolution (FFC) module — dramatically expands the receptive field and captures long-range dependencies across the image, building on the foundational idea from LaMa but extending it further with additional architectural components.
Global Channel-Spatial Attention (GCSA) — recalibrates features adaptively, making sure local detail doesn’t get lost in the pursuit of global context. The channel attention component decides what to emphasize, while the spatial attention decides where.
Efficient Upsampling Convolution Block (EUCB) — replaces standard convolutional upsampling to preserve structural and textural information during decoding. Standard upsampling operations are a surprisingly common source of information loss in inpainting networks.
The model also uses multi-scale discriminators during adversarial training, which lets it balance local detail and global structure simultaneously at different scales.
Tested on CelebA-HQ and Paris StreetView benchmarks, MEGA-NET outperforms current state-of-the-art methods — particularly in complex scenes with large missing regions where most existing approaches struggle to maintain visual coherence.
What makes this paper interesting beyond the benchmark numbers is the architectural philosophy. Instead of stacking more layers or increasing model size, it focuses on what information is preserved and where — which tends to produce cleaner, more natural results in practice.
Key takeaway: Single-stage architecture combining FFC + attention + efficient
upsampling for large-region inpainting. Strong results on CelebA-HQ and
Paris StreetView.
CSMC: A Mathematical Approach to Inpainting via Matrix Completion
Krajewska & Niewiadomska-Szynkiewicz, Machine Learning (Springer), March 2026 Read Paper → (Open Access)
Most inpainting research in recent years has gone all-in on deep learning. This paper takes a different and genuinely interesting direction — treating image inpainting as a matrix completion problem.
The core idea is that an image can be represented as a matrix, and missing pixels are simply missing entries. If that matrix has a low-rank structure (which natural images generally do), you can recover those missing entries through mathematical optimization rather than neural network training.
The authors introduce Columns Selected Matrix Completion (CSMC) — a two-stage method designed specifically for matrices where one dimension is much larger than the other, a common situation in real image data.
Stage 1: Select a random subset of columns and complete them using nuclear norm minimization — a convex optimization technique with strong theoretical recovery guarantees.
Stage 2: Use the completed columns to reconstruct the full matrix by solving a least squares problem — computationally cheap and easily parallelizable.
The result is a method that achieves reconstruction quality on par with leading convex optimization approaches, at significantly reduced computational cost. The paper also provides formal theoretical guarantees — establishing the probability of exact recovery under standard incoherence conditions.
Why does this matter for inpainting?
Because it represents an alternative path that doesn’t require massive training datasets, doesn’t need GPU clusters, and comes with mathematical guarantees instead of empirical benchmarks. For domains where data is scarce or interpretability matters — medical imaging, scientific data restoration — that’s a meaningful advantage.
Key takeaway: Non-deep-learning approach to inpainting via randomized matrix
completion. Theoretically grounded, computationally efficient, validated on
real-world image datasets. Open Access.
Spa-IHANet: Sparse Transformer Meets Hybrid Attention in a Two-Stage GAN
Published in Computers & Graphics, 2026 View Paper →
⚠️ Note: This paper is not open access. The following summary is based on the abstract and partial information available publicly. Some details may be incomplete or subject to revision upon full access to the paper. The information provided here may be inaccurate — please verify directly with the original source.
Spa-IHANet takes a two-stage GAN approach — a coarse pass followed by fine-detail refinement — which is a common structural choice in inpainting, but the components it uses at each stage are notably different from existing work.
In the coarse stage, the model uses a Sparse Self-attention Transformer (Spa-former) — designed to model global context efficiently without the quadratic complexity that makes standard transformers expensive at high resolutions. This stage produces a preliminary inpainted image that handles overall structure and layout.
In the fine stage, two custom attention modules take over:
- HAEGC (Hybrid Attention Enhanced Gated Convolution) — strengthens local feature representation by combining attention mechanisms with gated convolution, building on the gated convolution concept from DeepFill v2
- MAM (Multi-level Attention Mechanism) — captures critical information across multiple feature levels to improve detail consistency and semantic coherence in the final output
Tested across three standard datasets — CelebA-HQ, Places2, and Paris Street View — Spa-IHANet reportedly outperforms both traditional and recent state-of-the-art inpainting networks on visual quality metrics and computational efficiency.
The key contribution is the combination of sparse attention for global efficiency with hybrid attention for local detail fidelity inside a unified GAN framework — addressing the computational cost that’s often the bottleneck when deploying transformer-based inpainting in practice.
Key takeaway: Two-stage GAN with sparse transformer for global context + hybrid
attention for fine detail. Targets both visual quality and computational efficiency.
⚠️ Based on publicly available abstract information only.
Full paper requires institutional or paid access via ScienceDirect.
InDiTE-Diff: When Your Reference Image Is Also Broken
Xing et al., IEEE Transactions on Image Processing, Vol. 35, April 2026 Read Paper → (Open Access — CC BY-NC-ND 4.0)
Most image inpainting research starts with a comfortable assumption: if you’re using a reference image to guide restoration, that reference image is clean, intact, and taken at roughly the same time as the damaged one.
That assumption breaks down constantly in the real world.
Think about trying to restore a damaged historical photograph by referring to another image of the same scene — captured in a different season, with different objects, different lighting, possibly also damaged. That’s the problem this paper actually tackles head-on.
The authors call it TAMP — Time-vAriant iMage inPainting — and it covers two distinct real-world situations:
- tvRefInpaint — the reference image is intact but was captured at a significantly different time, so objects and appearance may have changed due to seasonal shifts, construction, or natural scene evolution
- tvDuoInpaint — both the target and reference images are damaged, meaning there may be entire regions with no usable reference at all
Both situations fall completely outside what existing reference-guided inpainting methods were designed to handle. The authors tested the best current methods — including LeftRefill, which uses a powerful text-to-image foundation model — on time-variant images and found they all failed. Not because they’re bad models, but because the semantic correspondences are disrupted by temporal change in ways those models weren’t designed to handle.
The Solution — InDiTE
The paper proposes the Interactive Distribution Transition Estimation (InDiTE) module, built on one key insight:
Even though two time-variant images of the same scene may look very different, they share the same underlying geometric structure.
Buildings are in the same place. Roads follow the same layout. The large-scale geometry is stable even when surface appearance changes dramatically.
InDiTE treats the two images as samples from different but geometrically related data distributions and learns to estimate the transition between them — filtering out semantically inconsistent content while preserving and using geometrically consistent content.
The architecture uses three main components:
Siamese Backbone — a parameter-shared network with U-Net skip connections that processes both images in parallel. By sharing weights, it learns symmetric representations of both the target and reference regardless of which is which.
Semantic Predictive Filtering (SPF) — a learned filter that estimates which content from the reference is semantically consistent with the target and should be used for complementation. Instead of blindly copying pixels between images, it learns to filter what’s relevant.
Complement Head + Confidence Head — the complement head produces the final complemented image. The confidence head produces a pixel-wise confidence map indicating which complemented regions are reliable and which need further processing by a generative model.
InDiTE-Diff then takes the complemented images and confidence masks and feeds them into a diffusion model (DDNM) for final generation. High-confidence regions are preserved. Low-confidence regions — where semantic complementation was uncertain — are regenerated by the diffusion prior.
Cross-reference during diffusion sampling ensures the generated content for both images stays consistent with each other at a low-frequency level.
Results on TAMP-Street
The newly assembled TAMP-Street dataset — built from one year of urban street scene changes in Pittsburgh — provides a realistic evaluation ground.
InDiTE-Diff consistently outperforms all three baselines (TransFill, TransRef, LeftRefill) across all mask ratios under both settings. The improvement is most pronounced at large mask ratios (50%–60%) in the tvDuoInpaint setting — exactly the hardest case. PSNR improvements of up to 5.2% over LeftRefill were observed at the 50%–60% mask ratio.
A particularly compelling finding: InDiTE is plug-and-play. When used to pre-process images before feeding them into existing inpainting methods, it consistently improves their performance too — suggesting the complementation module provides value independent of the specific downstream inpainting model.
Key takeaway: First serious treatment of time-variant reference-guided inpainting.
InDiTE learns semantic distribution transition between time-variant images,
combined with diffusion for final generation. New TAMP-Street dataset included.
Open Access (CC BY-NC-ND 4.0).
What These Papers Tell Us About Where Inpainting Is Heading
Reading across these 2025–2026 papers together, a few clear patterns emerge.
Attention is everywhere — but efficiency is now the constraint. MEGA-NET, Spa-IHANet, and most other recent architectures heavily use attention mechanisms. The research frontier has shifted from “does attention help?” (it clearly does) to “how do we make attention work at scale without the computational cost destroying everything?” Sparse attention, efficient upsampling, and multi-scale discriminators are all answers to that same question.
Single-stage vs. two-stage architectures are still contested. MEGA-NET argues for single-stage simplicity with carefully designed modules. Spa-IHANet argues for two-stage separation of concerns. Both have strong results. The debate isn’t settled.
Mathematical approaches haven’t disappeared. CSMC is a reminder that not every inpainting problem needs a neural network. For structured data, small datasets, or applications that require theoretical guarantees, matrix completion methods remain competitive and often faster.
The problem definition itself is expanding. InDiTE-Diff tackles a problem — time-variant reference-guided inpainting — that simply wasn’t addressed in the literature before. As inpainting moves from controlled lab settings to real-world deployment, the scenarios being studied are becoming more realistic and more demanding.
Benchmark datasets are stabilizing. CelebA-HQ, Places2, and Paris Street View appear in nearly every recent paper. That consistency makes cross-paper comparisons more meaningful — and signals that the field has matured enough to have agreed-upon evaluation standards.
Real-World Applications
Photo editing software LaMa and similar models run under the hood of tools like Adobe Firefly, Canva’s background remover, and various mobile photo editing apps. Most users never know the algorithm name — they just click “remove object” and it works.
Cultural heritage preservation Museums, libraries, and archives are using inpainting to digitally restore damaged photographs, deteriorated manuscripts, and aged artworks. The ability to restore historical images while preserving authenticity is transforming archival work at scale.
Medical imaging Filling in missing or corrupted regions in MRI and CT scans, handling motion artifacts, and reconstructing incomplete imaging data. The CSMC matrix completion approach is particularly relevant here — its theoretical guarantees matter in medical contexts where you need to understand the limits of what’s been reconstructed.
Film and television post-production Wire removal, rig removal, set extension, object replacement — the VFX pipeline is increasingly incorporating AI inpainting to automate work that previously required frame-by-frame manual editing.
Satellite and aerial imaging Reconstructing cloud-obscured regions in satellite imagery, filling in missing data from sensor failures, and completing partially scanned aerial photographs. The time-variant problem addressed by InDiTE-Diff is directly relevant here — satellite images of the same location taken months apart face exactly the object discrepancy and appearance mismatch challenges the paper tackles.
Autonomous driving Handling missing or corrupted visual data in perception systems, filling in occluded regions in camera feeds, and improving the robustness of visual navigation under adverse conditions.
Challenges in 2025–2026
Even with all the progress, several genuinely hard problems remain.
Semantic coherence at scale Generating semantically appropriate content for very large missing regions — 50% or more of the image area — remains difficult. The model has to essentially hallucinate a significant portion of an image while staying coherent with what remains. Results are improving but still fall short of human performance on complex scenes.
Inference speed vs. quality tradeoff Diffusion models produce the best results but are too slow for real-time applications. Faster architectures sacrifice some quality. Finding the right balance for different deployment scenarios is an active area of optimization research.
Evaluating what looks good Standard metrics like PSNR and SSIM don’t always align with human perception of inpainting quality. A fill that scores well on PSNR might look obviously wrong to a human eye. LPIPS (perceptual similarity) helps but isn’t perfect. Better evaluation metrics remain an open problem.
Time-variant and multi-source inpainting InDiTE-Diff opened up the time-variant problem, but the field is still in early stages here. Handling multiple reference images, very large temporal gaps, and extreme appearance changes all remain challenging.
Domain-specific adaptation A model trained on natural photographs may struggle on medical scans, satellite imagery, or technical documents. Fine-tuning for specific domains requires labeled data that isn’t always available.
Preventing harmful hallucination In high-stakes applications — medical imaging, forensic reconstruction, archival restoration — generating plausible-but-incorrect content can be actively harmful. Knowing when to say “I’m not sure what should be here” is something current models don’t do well.
Final Thoughts
Image inpainting has come an extraordinarily long way.
From manual museum restoration, to PDE-based diffusion, to exemplar patching, to GANs, to Transformers, to diffusion models — each generation of approaches produced results that the previous generation simply couldn’t achieve.
And in 2025–2026, the pace of improvement hasn’t slowed.
What’s changing is the nature of the problems being solved.
Early inpainting research focused on making small holes look natural. Then it moved to large irregular masks. Then to semantic understanding and text guidance. Now it’s tackling problems like time-variant restoration — where the reference itself is unreliable — problems directly motivated by real-world deployment challenges rather than controlled benchmark settings.
That shift toward real-world problem formulation is healthy for the field.
The algorithms are getting better. The datasets are getting more realistic. The applications are getting more demanding. And the gap between what AI inpainting can do in a lab and what it can do in production is closing faster than most people expected.
For developers and researchers working in this space: the fundamentals still matter.
Understanding what LaMa does with frequency-domain convolutions, why gated convolutions were an improvement over partial convolutions, and how diffusion models approach the generation problem will serve you better than just knowing which model to call.
The tools are more powerful than ever.
Using them well still requires understanding what’s happening underneath.
FAQ
What is image inpainting?
Image inpainting is the process of reconstructing missing or damaged regions in an image while preserving visual consistency and realism.
What are classical image inpainting techniques?
Classical techniques include PDE-based diffusion methods and patch-based texture synthesis approaches.
Why are diffusion models important for image inpainting?
Diffusion models generate highly realistic textures and semantic structures, making them extremely effective for modern AI-based image editing and restoration.
What is transformer-based image inpainting?
Transformer-based inpainting uses attention mechanisms and long-range contextual modeling to reconstruct complex image regions more effectively.
Where is image inpainting used?
Image inpainting is used in photo restoration, object removal, medical imaging, satellite analysis, video editing, augmented reality, and generative AI systems.