Learning by Planning: Language-Guided Global Image Editing
Jing Shi
Ning Xu
Yihang Xu
Trung Bui
Franck Dernoncourt
Chenliang Xu
University of Rochester
Adobe Research
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

Overview of our task and method. Language-Guided Global Image Editing: given the input image and the request, we predict a sequence of actions that edits the image progressively, generating a series of intermediate images. The final edited image is our output, which should accord with the request. Operation Planning: given the input image and the target image, we plan a sequence of actions that transforms the input into the target. Since the target image is our only supervision, we obtain the planned action sequences via operation planning as pseudo ground truth, which is used to train our text-to-operation network (T2ONet).
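The progressive editing above can be sketched as applying a short sequence of global, differentiable operations and keeping every intermediate image. The operation names and parameter conventions below are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

# Hedged sketch of three global, differentiable editing operations
# (images are float arrays in [0, 1]; parameter ranges are assumptions).
def adjust_brightness(img, p):
    """Shift all pixel values by p."""
    return np.clip(img + p, 0.0, 1.0)

def adjust_contrast(img, p):
    """Scale deviation from the global mean by (1 + p)."""
    mean = img.mean()
    return np.clip(mean + (1.0 + p) * (img - mean), 0.0, 1.0)

def adjust_saturation(img, p):
    """Scale deviation from per-pixel grayscale by (1 + p)."""
    gray = img.mean(axis=-1, keepdims=True)
    return np.clip(gray + (1.0 + p) * (img - gray), 0.0, 1.0)

OPS = {"brightness": adjust_brightness,
       "contrast": adjust_contrast,
       "saturation": adjust_saturation}

def apply_sequence(img, actions):
    """Apply (operation, parameter) pairs in order; return all intermediates."""
    intermediates = [img]
    for name, p in actions:
        img = OPS[name](img, p)
        intermediates.append(img)
    return intermediates

# Example: brighten, then boost saturation slightly.
img = np.full((4, 4, 3), 0.5)
steps = apply_sequence(img, [("brightness", 0.1), ("saturation", 0.2)])
```

Because each operation is differentiable in its parameter, gradients can flow from the final image back through the whole sequence.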

Download Dataset



GitHub Repo


Recently, language-guided global image editing has drawn increasing attention, with growing application potential. However, previous GAN-based methods are not only confined to domain-specific, low-resolution data but also lack interpretability. To overcome these difficulties, we develop a text-to-operation model that maps a vague editing request in natural language to a series of editing operations, e.g., changes to contrast, brightness, and saturation. Each operation is interpretable and differentiable. Furthermore, the only supervision in the task is the target image, which is insufficient for stable training of sequential decisions. Hence, we propose a novel operation planning algorithm that generates possible editing sequences from the target image as pseudo ground truth. Comparison experiments on the newly collected MA5k-Req dataset and the GIER dataset show the advantages of our method.
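The planning idea can be illustrated with a greedy search: at each step, try every (operation, parameter) pair on a small grid and keep the one that most reduces the distance to the target image. This is a simplified sketch of operation planning, not the paper's exact algorithm; the single brightness operation and parameter grid are assumptions for brevity.

```python
import numpy as np

def brightness(img, p):
    return np.clip(img + p, 0.0, 1.0)

OPS = {"brightness": brightness}      # the paper uses a larger operation set
PARAMS = [-0.2, -0.1, 0.1, 0.2]       # illustrative parameter grid

def plan_operations(src, tgt, max_steps=3, tol=1e-6):
    """Greedy operation planning: repeatedly apply the (op, param) pair that
    most reduces L2 distance to the target; stop when no step helps."""
    seq, cur = [], src
    for _ in range(max_steps):
        base = float(np.mean((cur - tgt) ** 2))
        best = None
        for name, fn in OPS.items():
            for p in PARAMS:
                cand = fn(cur, p)
                err = float(np.mean((cand - tgt) ** 2))
                if err < base - tol:
                    base, best = err, (name, p, cand)
        if best is None:
            break
        name, p, cur = best
        seq.append((name, p))
    return seq, cur

src = np.full((2, 2, 3), 0.3)
tgt = np.full((2, 2, 3), 0.6)
seq, recovered = plan_operations(src, tgt)
```

The recovered sequence (here, two brightness steps) plays the role of pseudo ground truth for training the text-to-operation network.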

Related Work

We maintain a curated paper list for language-guided image editing here.

5-Minute presentation video

Intermediate actions and images visualization

We present several examples of the operation planning process and the image editing process.
Operation Planning
Language-Guided Image Editing

Visual comparison with other methods

Visual comparison of our method T2ONet with other methods on MA5k-Req (left) and GIER (right).

Methodological Advantages and Extensions

(1) Resolution Independent.

Compared with the GAN-based methods GeNeVa and Pix2pixAug, although all methods perform the correct editing, ours introduces no pixel distortion and is independent of image resolution. This is because our editing operations are resolution-independent.

(2) Generate multiple possible outputs.

Visualization of diversified outputs given the same input and request, produced by sampling the operation parameters at inference time.
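Sampling-based diversity can be sketched as drawing the operation parameter from a predicted distribution rather than using a point estimate. Assuming (for illustration) a Gaussian over a single brightness parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_edits(img, mu, sigma, n_samples=3):
    """Produce diverse brightness edits by sampling the parameter from
    N(mu, sigma) instead of using the mean. Illustrative assumption: the
    model predicts a distribution over each operation's parameters."""
    outs = []
    for _ in range(n_samples):
        p = rng.normal(mu, sigma)
        outs.append(np.clip(img + p, 0.0, 1.0))
    return outs

img = np.full((2, 2, 3), 0.4)
variants = sample_edits(img, mu=0.15, sigma=0.05)
```

Each sample yields a slightly different edit that still follows the request's direction (here, brightening).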

(3) Extension: Planning for local editing.

By adding a segmentation model to obtain object masks and adding an "inpainting" operation, the operation planning algorithm can be extended to local editing. The recovered output is the planning result, which closely matches the target image.
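The local extension can be sketched by restricting a global operation to a binary object mask. This is a minimal illustration, not the paper's implementation; in practice the masks would come from a segmentation model, and a separate inpainting operation would handle removed regions.

```python
import numpy as np

def brightness(img, p):
    return np.clip(img + p, 0.0, 1.0)

def apply_local(img, mask, op, p):
    """Apply a global operation only where mask == 1, leaving the rest
    of the image untouched (illustrative sketch of local editing)."""
    edited = op(img, p)
    return np.where(mask[..., None] > 0, edited, img)

img = np.full((2, 2, 3), 0.4)
mask = np.array([[1, 0], [0, 0]])   # edit only the top-left pixel
out = apply_local(img, mask, brightness, 0.3)
```

With masked operations in the search space, the same greedy planning procedure can recover localized edit sequences.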


MA5k-Req Image & Annotation

Check its README for the data structure.

Paper & Appendix

Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu
Learning by Planning: Language-Guided Global Image Editing
In CVPR, 2021.


@inproceedings{shi2021learning,
  title={Learning by Planning: Language-Guided Global Image Editing},
  author={Shi, Jing and Xu, Ning and Xu, Yihang and Bui, Trung and Dernoncourt, Franck and Xu, Chenliang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}


This work was supported in part by an Adobe research gift and NSF grants 1813709, 1741472, and 1909912. The article solely reflects the opinions and conclusions of its authors and not the funding agencies. The template of this webpage is borrowed from Richard Zhang.


For further questions and suggestions, please contact Jing Shi (j.shi@rochester.edu).