Click2Mask: Local Editing with Dynamic Mask Generation

The Hebrew University of Jerusalem

Click2Mask: Given an image, a Click (blue dot), and a prompt describing an object to add, a Mask is generated dynamically, simultaneously with the generation of the object, throughout the diffusion process.

Abstract

Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.

Method

Click2Mask aims to simplify local image editing by requiring only a single point of reference (such as a mouse click) for adding objects described in the prompt.

In contrast, current methods either rely on existing objects or segments and do not allow free-form editing, or they require painstaking user effort to create masks or to specify input images, target images, or edit locations. Our goal is to enable free-form editing where the manipulated area is not well-defined in advance, by providing only a Click. Our dynamically evolving mask approach can also be integrated as a mask generation/fine-tuning component within other methods that internally employ a mask.


We utilize Blended Latent Diffusion (BLD) as our image-editing backbone (green block), which performs diffusion steps while blending input background latents with text-guided foreground latents. The pink block is our mask evolution process: throughout the diffusion steps, we evolve the mask dynamically, using a masked CLIP-based semantic loss to optimize the mask itself. For a more detailed explanation, please refer to the paper.
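To make the idea concrete, below is a minimal PyTorch-style sketch of how a mask seeded at the Click could be blended and optimized alongside the diffusion steps. This is our own illustration under stated assumptions, not the authors' implementation: denoise_step, decode_latents, and masked_clip_loss are hypothetical callables standing in for the diffusion model's denoising step, the latent decoder, and a masked CLIP-based semantic loss (e.g., one built on Alpha-CLIP).

# Illustrative sketch only (not the authors' code) of dynamic mask evolution
# inside a blended latent diffusion loop. The callables passed in are assumed
# placeholders for the corresponding components of the real pipeline.
import torch

def click_to_initial_mask(click_xy, height, width, radius=8):
    """Soft disk around the clicked pixel, used to seed the mask (an assumption)."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    dist2 = (xs - click_xy[0]) ** 2 + (ys - click_xy[1]) ** 2
    return (dist2 <= radius ** 2).float()

def edit_with_dynamic_mask(
    z_bg_per_step,      # list of background latents, one per diffusion step
    prompt_embedding,   # text embedding of the object prompt
    click_xy,           # (x, y) of the user's Click
    denoise_step,       # hypothetical: callable(z, prompt_embedding, t) -> z_fg
    decode_latents,     # hypothetical: callable(z) -> image tensor
    masked_clip_loss,   # hypothetical: callable(image, mask, prompt_embedding) -> scalar
    mask_lr=0.05,
):
    h, w = z_bg_per_step[0].shape[-2:]
    # Parameterize the mask by logits so it stays in [0, 1] after a sigmoid.
    init = click_to_initial_mask(click_xy, h, w).clamp(1e-3, 1 - 1e-3)
    mask_logits = torch.logit(init).requires_grad_(True)
    optimizer = torch.optim.Adam([mask_logits], lr=mask_lr)

    z = torch.randn_like(z_bg_per_step[0])  # initial foreground latent
    for t, z_bg in enumerate(z_bg_per_step):
        mask = torch.sigmoid(mask_logits)
        # Text-guided denoising of the foreground latent.
        z_fg = denoise_step(z, prompt_embedding, t)
        # Blended Latent Diffusion: keep the background outside the mask.
        z = mask * z_fg + (1.0 - mask) * z_bg

        # Mask evolution: score the masked region of the current prediction
        # against the prompt, and take a gradient step on the mask itself.
        image_pred = decode_latents(z)
        loss = masked_clip_loss(image_pred, mask, prompt_embedding)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        z = z.detach()  # do not backpropagate through future diffusion steps

    return decode_latents(z), torch.sigmoid(mask_logits).detach()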

Results

Output Examples

Click2Mask output examples.
Each example includes an input image with a Click (blue dot), followed by outputs corresponding to the prompts below.

Qualitative Comparisons with SoTA Methods

A qualitative comparison of the SoTA methods Emu Edit, MagicBrush, and InstructPix2Pix with our model Click2Mask.
Upper prompts were given to baselines, and lower (shorter) ones to Click2Mask.
Inputs contain the Click (blue dot) given to Click2Mask.

"have a knife laying between the orange and apple"

"a knife"



"add a fruit stand to the right of the image"

"a fruit stand"



"add fringe to the pink lampshade"

"fringe"



"add a tennis ball on top of the racket"

"a tennis ball"



"add Jupiter to the sky"

"Jupiter"



"add a piece of fried chicken to the plate"

"piece of fried chicken"

Comparison Study with SoTA Methods

We compare our method with the mask-free SoTA methods Emu Edit and MagicBrush. Due to the closed-source nature of Emu Edit, we rely on the Emu Edit Benchmark. We focused on the "add" category, sampling 100 items for evaluation. Each sample was pre-processed by removing location instructions and using a click instead to direct the edit. As shown below, Click2Mask outperformed Emu Edit and MagicBrush in both a user study (involving 149 participants) and automatic metrics. For further details, please refer to the paper.

User study results and automatic metric results.

Generated Masks

Examples of Masks generated by our method.
Each example shows a prompt and a Clicked point (blue dot) on the input image, followed by the generated Mask (purple overlay), and Click2Mask output.

Evaluating Edited Regions in Maskless Methods

To evaluate the edited region in mask-free methods (such as Emu Edit, MagicBrush, InstructPix2Pix, and others), and thus enable comparisons, we propose the "Edited Alpha-CLIP" procedure: we extract a mask that closely delineates the edited region, and use Alpha-CLIP to evaluate the similarity between the masked area and the prompt. For further details, see the paper.
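As a rough illustration (ours, not the paper's code), the sketch below derives an edited-region mask by differencing the input and output images and then scores the masked region against the prompt. The thresholding and dilation choices are assumptions, and alpha_clip_score is a hypothetical placeholder for an Alpha-CLIP-based similarity function, not a real library call.

# Sketch of an "Edited Alpha-CLIP"-style evaluation under our own assumptions.
import numpy as np
from PIL import Image
from scipy.ndimage import binary_dilation

def extract_edited_mask(input_path, output_path, threshold=0.1, dilate_iters=5):
    """Binary mask of pixels that changed noticeably between input and output."""
    before = np.asarray(Image.open(input_path).convert("RGB"), dtype=np.float32) / 255.0
    after = np.asarray(Image.open(output_path).convert("RGB"), dtype=np.float32) / 255.0
    diff = np.abs(after - before).mean(axis=-1)  # per-pixel change in [0, 1]
    mask = diff > threshold
    if dilate_iters > 0:
        # Light dilation to close small holes (an assumption; the paper may differ).
        mask = binary_dilation(mask, iterations=dilate_iters)
    return mask.astype(np.float32)

def edited_alpha_clip_score(input_path, output_path, prompt, alpha_clip_score):
    """Score the edited region of the output image against the prompt.

    `alpha_clip_score(image, mask, prompt)` is assumed to return a CLIP-style
    image-text similarity with the mask supplied as the alpha channel
    (e.g., via Alpha-CLIP); it is a hypothetical placeholder here.
    """
    mask = extract_edited_mask(input_path, output_path)
    image = np.asarray(Image.open(output_path).convert("RGB"))
    return alpha_clip_score(image, mask, prompt)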

Examples of mask extractions can be found below. The upper prompt was given to Emu Edit and MagicBrush, and the lower one to Click2Mask.
For each method, the left image is the output, and the right image shows the extracted mask (green overlay).

"Add a sandcastle to the right of the dog"

"A sandcastle"

BibTeX


@misc{regev2024click2masklocaleditingdynamic,
  title={Click2Mask: Local Editing with Dynamic Mask Generation}, 
  author={Omer Regev and Omri Avrahami and Dani Lischinski},
  year={2024},
  eprint={2409.08272},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.08272}, 
}