Author: "Dekel, Tali" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Dekel, Tali"' showing total 152 results

Start Over Author "Dekel, Tali"

152 results on '"Dekel, Tali"'

1. DynVFX: Augmenting Real Videos with Dynamic Content

Author: Yatim, Danah, Fridman, Rafail, Bar-Tal, Omer, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion., Comment: Project page: https://dynvfx.github.io
Published: 2025

2. TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Author: Garibi, Daniel, Yadin, Shahar, Paiss, Roni, Tov, Omer, Zada, Shiran, Ephrat, Ariel, Michaeli, Tomer, Mosseri, Inbar, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide-range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. project's webpage in https://token-verse.github.io/
Published: 2025

3. What Makes for a Good Stereoscopic Image?

Author: Tamir, Netanel Y., Amir, Shir, Itzhaky, Ranel, Atia, Noam, Sundaram, Shobhita, Fu, Stephanie, Sokolovsky, Ron, Isola, Phillip, Dekel, Tali, Zhang, Richard, and Farber, Miriam
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: With rapid advancements in virtual reality (VR) headsets, effectively measuring stereoscopic quality of experience (SQoE) has become essential for delivering immersive and comfortable 3D experiences. However, most existing stereo metrics focus on isolated aspects of the viewing experience such as visual discomfort or image quality, and have traditionally faced data limitations. To address these gaps, we present SCOPE (Stereoscopic COntent Preference Evaluation), a new dataset comprised of real and synthetic stereoscopic images featuring a wide range of common perceptual distortions and artifacts. The dataset is labeled with preference annotations collected on a VR headset, with our findings indicating a notable degree of consistency in user preferences across different headsets. Additionally, we present iSQoE, a new model for stereo quality of experience assessment trained on our dataset. We show that iSQoE aligns better with human preferences than existing methods when comparing mono-to-stereo conversion methods.
Published: 2024

4. What's in the Image? A Deep-Dive into the Vision of Vision Language Models

Author: Kaduri, Omri, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image"), is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally.(iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.
Published: 2024

5. Generative Omnimatte: Learning to Decompose Video into Layers

Author: Lee, Yao-Chih, Lu, Erika, Rumbley, Sarah, Geyer, Michal, Huang, Jia-Bin, Dekel, Tali, and Cole, Forrester
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more., Comment: Project page: https://gen-omnimatte.github.io/
Published: 2024

6. VidPanos: Generative Panoramic Videos from Casual Panning Videos

Author: Ma, Jingwei, Lu, Erika, Paiss, Roni, Zada, Shiran, Holynski, Aleksander, Dekel, Tali, Curless, Brian, Rubinstein, Michael, and Cole, Forrester
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, I.3.3, I.4
Abstract: Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera's field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features., Comment: Project page at https://vidpanos.github.io/. To appear at SIGGRAPH Asia 2024 (conference track)
Published: 2024

7. Still-Moving: Customized Video Generation without Customized Video Data

Author: Chefer, Hila, Zada, Shiran, Paiss, Roni, Ephrat, Ariel, Tov, Omer, Rubinstein, Michael, Wolf, Lior, Dekel, Tali, Michaeli, Tomer, and Mosseri, Inbar
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight $\textit{Spatial Adapters}$ that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on $\textit{"frozen videos"}$ (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel $\textit{Motion Adapter}$ module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model., Comment: Webpage: https://still-moving.github.io/ | Video: https://www.youtube.com/watch?v=U7UuV_VIjnA
Published: 2024

8. DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video

Author: Tumanyan, Narek, Singer, Assaf, Bagon, Shai, Dekel, Tali, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

9. DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Author: Tumanyan, Narek, Singer, Assaf, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video, with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adopts DINO's features to fit to the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses, and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions., Comment: Accepted to ECCV 2024. Project page: https://dino-tracker.github.io/
Published: 2024

10. Lumiere: A Space-Time Diffusion Model for Video Generation

Author: Bar-Tal, Omer, Chefer, Hila, Tov, Omer, Herrmann, Charles, Paiss, Roni, Zada, Shiran, Ephrat, Ariel, Hur, Junhwa, Liu, Guanghui, Raj, Amit, Li, Yuanzhen, Rubinstein, Michael, Michaeli, Tomer, Wang, Oliver, Sun, Deqing, Dekel, Tali, and Mosseri, Inbar
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation., Comment: Webpage: https://lumiere-video.github.io/ | Video: https://www.youtube.com/watch?v=wxLr02Dz2Sc
Published: 2024

11. Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Author: Yatim, Danah, Fridman, Rafail, Bar-Tal, Omer, Kasten, Yoni, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits., Comment: Project page: https://diffusion-motion-transfer.github.io/
Published: 2023

12. Disentangling Structure and Appearance in ViT Feature Space

Author: Tumanyan, Narek, Bar-Tal, Omer, Amir, Shir, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io., Comment: Accepted to ACM Transactions on Graphics. arXiv admin note: substantial text overlap with arXiv:2201.00424
Published: 2023
Full Text: View/download PDF

13. State of the Art on Diffusion Models for Visual Computing

Author: Po, Ryan, Yifan, Wang, Golyanik, Vladislav, Aberman, Kfir, Barron, Jonathan T., Bermano, Amit H., Chan, Eric Ryan, Dekel, Tali, Holynski, Aleksander, Kanazawa, Angjoo, Liu, C. Karen, Liu, Lingjie, Mildenhall, Ben, Nießner, Matthias, Ommer, Björn, Theobalt, Christian, Wonka, Peter, and Wetzstein, Gordon
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: The field of visual computing is rapidly advancing due to the emergence of generative artificial intelligence (AI), which unlocks unprecedented capabilities for the generation, editing, and reconstruction of images, videos, and 3D scenes. In these domains, diffusion models are the generative AI architecture of choice. Within the last year alone, the literature on diffusion-based tools and applications has seen exponential growth and relevant papers are published across the computer graphics, computer vision, and AI communities with new works appearing daily on arXiv. This rapid growth of the field makes it difficult to keep up with all recent developments. The goal of this state-of-the-art report (STAR) is to introduce the basic mathematical concepts of diffusion models, implementation details and design choices of the popular Stable Diffusion model, as well as overview important aspects of these generative AI tools, including personalization, conditioning, inversion, among others. Moreover, we give a comprehensive overview of the rapidly growing literature on diffusion-based generation and editing, categorized by the type of generated medium, including 2D images, videos, 3D objects, locomotion, and 4D scenes. Finally, we discuss available datasets, metrics, open challenges, and social implications. This STAR provides an intuitive starting point to explore this exciting topic for researchers, artists, and practitioners alike.
Published: 2023

14. TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Author: Geyer, Michal, Bar-Tal, Omer, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
Published: 2023

15. DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Author: Fu, Stephanie, Tamir, Netanel, Sundaram, Shobhita, Chai, Lucy, Zhang, Richard, Dekel, Tali, and Isola, Phillip
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks., Comment: Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim
Published: 2023

16. Teaching CLIP to Count to Ten

Author: Paiss, Roni, Ephrat, Ariel, Tov, Omer, Zada, Shiran, Mosseri, Inbar, Irani, Michal, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent well-documented limitation - they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" - a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
Published: 2023

17. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Author: Bar-Tal, Omer, Yariv, Lior, Lipman, Yaron, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io
Published: 2023

18. Neural Congealing: Aligning Images to a Joint Semantic Atlas

Author: Ofri-Amar, Dolev, Geyer, Michal, Kasten, Yoni, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present Neural Congealing -- a zero-shot self-supervised framework for detecting and jointly aligning semantically-common content across a given set of images. Our approach harnesses the power of pre-trained DINO-ViT features to learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of DINO-ViT features in the input set, and (ii) dense mappings from the unified atlas to each of the input images. We derive a new robust self-supervised framework that optimizes the atlas representation and mappings per image set, requiring only a few real-world images as input without any additional input information (e.g., segmentation masks). Notably, we design our losses and training paradigm to account only for the shared content under severe variations in appearance, pose, background clutter or other distracting objects. We demonstrate results on a plethora of challenging image sets including sets of mixed domains (e.g., aligning images depicting sculpture and artwork of cats), sets depicting related yet different object categories (e.g., dogs and tigers), or domains for which large-scale training data is scarce (e.g., coffee mugs). We thoroughly evaluate our method and show that our test-time optimization approach performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets., Comment: Project page: https://neural-congealing.github.io/
Published: 2023

19. SceneScape: Text-Driven Consistent Scene Generation

Author: Fridman, Rafail, Abecasis, Amit, Kasten, Yoni, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for text-driven perpetual view generation -- synthesizing long-term videos of various scenes solely, given an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically-plausible scenes, we deploy an online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is progressively constructed along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles., Comment: Project page: https://scenescape.github.io/
Published: 2023

20. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Author: Tumanyan, Narek, Geyer, Michal, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable for both real or generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing of the class and appearance of objects in a given image, and modifications of global qualities such as lighting and color.
Published: 2022

21. Imagic: Text-Based Real Image Editing with Diffusion Models

Author: Kawar, Bahjat, Zada, Shiran, Lang, Oran, Tov, Omer, Chang, Huiwen, Dekel, Tali, Mosseri, Inbar, and Irani, Michal
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework., Comment: Project page: https://imagic-editing.github.io/
Published: 2022

22. Diverse Video Generation from a Single Video

Author: Haim, Niv, Feinstein, Ben, Granot, Niv, Shocher, Assaf, Bagon, Shai, Dekel, Tali, and Irani, Michal
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: GANs are able to perform generation and manipulation tasks, trained on a single video. However, these single video GANs require unreasonable amount of time to train on a single video, rendering them almost impractical. In this paper we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. We revive classical space-time patches-nearest-neighbors approaches and adapt them to a scalable unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is disproportionately faster (runtime reduced from several days to seconds). Our approach is easily scaled to Full-HD videos. We also use the same framework to demonstrate video analogies and spatio-temporal retargeting. These observations show that classical approaches significantly outperform heavy deep learning machinery for these tasks. This sets a new baseline for single-video generation and manipulation tasks, and no less important -- makes diverse generation from a single video practically possible for the first time., Comment: AI for Content Creation Workshop @ CVPR 2022
Published: 2022

23. Text2LIVE: Text-Driven Layered Image and Video Editing

Author: Bar-Tal, Omer, Ofri-Amar, Dolev, Fridman, Rafail, Kasten, Yoni, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for zero-shot, text-driven appearance manipulation in natural images and videos. Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator using an internal dataset of training examples, extracted from a single input (image or video and target text prompt), while leveraging an external pre-trained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the original input. This allows us to constrain the generation process and maintain high fidelity to the original input via novel text-driven losses that are applied directly to the edit layer. Our method neither relies on a pre-trained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes., Comment: Project page: https://text2live.github.io
Published: 2022

24. Self-Distilled StyleGAN: Towards Generation from Internet Photos

Author: Mokady, Ron, Yarom, Michal, Tov, Omer, Lang, Oran, Cohen-Or, Daniel, Dekel, Tali, Irani, Michal, and Mosseri, Inbar
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated. In this paper, we show how StyleGAN can be adapted to work on raw uncurated images collected from the Internet. Such image collections impose two main challenges to StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we proposed a StyleGAN-based self-distillation approach, which consists of two main components: (i) A generative-based self-filtering of the dataset to eliminate outlier images, in order to generate an adequate training set, and (ii) Perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN's "truncation trick" in the image synthesis process. The presented technique enables the generation of high-quality images, while minimizing the loss in diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach to new challenging and diverse domains collected from the Internet. New datasets and pre-trained models are available at https://self-distilled-stylegan.github.io/ .
Published: 2022

25. Splicing ViT Features for Semantic Appearance Transfer

Author: Tumanyan, Narek, Bar-Tal, Omer, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework - a pivotal component in tackling this task - our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high-resolution results, e.g., work in HD. We demonstrate high quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose and appearance.
Published: 2022

26. Deep ViT Features as Dense Visual Descriptors

Author: Amir, Shir, Gandelsman, Yossi, Bagon, Shai, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extractedfrom a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information, at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories, and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training nor data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available in dino-vit-features.github.io., Comment: Revised version - high res figures
Published: 2021

27. Layered Neural Atlases for Consistent Video Editing

Author: Kasten, Yoni, Ofri, Dolev, Wang, Oliver, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases, each providing a unified representation of the appearance of an object (or background) over the video. For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases, giving us a consistent parameterization of the video, along with an associated alpha (opacity) value. Importantly, we design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain, with minimal manual work required. Edits applied to a single 2D atlas (or input video frame) are automatically and consistently mapped back to the original video frames, while preserving occlusions, deformation, and other complex scene effects such as shadows and reflections. Our method employs a coordinate-based Multilayer Perceptron (MLP) representation for mappings, atlases, and alphas, which are jointly optimized on a per-video basis, using a combination of video reconstruction and regularization losses. By operating purely in 2D, our method does not require any prior 3D knowledge about scene geometry or camera poses, and can handle complex dynamic real world videos. We demonstrate various video editing applications, including texture mapping, video style transfer, image-to-video texture transfer, and segmentation/labeling propagation, all automatically produced by editing a single 2D atlas image.
Published: 2021

28. Diverse Generation from a Single Video Made Possible

Author: Haim, Niv, Feinstein, Ben, Granot, Niv, Shocher, Assaf, Bagon, Shai, Dekel, Tali, and Irani, Michal
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: GANs are able to perform generation and manipulation tasks, trained on a single video. However, these single video GANs require unreasonable amount of time to train on a single video, rendering them almost impractical. In this paper we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. We revive classical space-time patches-nearest-neighbors approaches and adapt them to a scalable unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is disproportionately faster (runtime reduced from several days to seconds). Other than diverse video generation, we demonstrate other applications using the same framework, including video analogies and spatio-temporal retargeting. Our proposed approach is easily scaled to Full-HD videos. These observations show that the classical approaches, if adapted correctly, significantly outperform heavy deep learning machinery for these tasks. This sets a new baseline for single-video generation and manipulation tasks, and no less important -- makes diverse generation from a single video practically possible for the first time.
Published: 2021

29. Consistent Depth of Moving Objects in Video

Author: Zhang, Zhoutong, Cole, Forrester, Tucker, Richard, Freeman, William T., and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We seek a geometrically and temporally consistent solution to this underconstrained problem: the depth predictions of corresponding points across frames should induce plausible, smooth motion in 3D. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction MLP over the entire input video. By recursively unrolling the scene-flow prediction MLP over varying time steps, we compute both short-range scene flow to impose local smooth motion priors directly in 3D, and long-range scene flow to impose multi-view consistency constraints with wide baselines. We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars), as well as camera motion. Our depth maps give rise to a number of depth-and-motion aware video editing effects such as object and lighting insertion., Comment: Published at SIGGRAPH 2021
Published: 2021
Full Text: View/download PDF

30. Omnimatte: Associating Objects and Their Effects in Video

Author: Lu, Erika, Cole, Forrester, Dekel, Tali, Zisserman, Andrew, Freeman, William T., and Rubinstein, Michael
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc -- are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject -- an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic -- it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject., Comment: CVPR 2021 Oral. Project webpage: https://omnimatte.github.io/. Added references
Published: 2021

31. Layered Neural Rendering for Retiming People in Video

Author: Lu, Erika, Cole, Forrester, Dekel, Tali, Xie, Weidi, Zisserman, Andrew, Salesin, David, Freeman, William T., and Rubinstein, Michael
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate -- e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running., Comment: In SIGGRAPH Asia 2020. Project webpage: https://retiming.github.io/. Added references
Published: 2020

32. SpeedNet: Learning the Speediness in Videos

Author: Benaim, Sagie, Ephrat, Ariel, Lang, Oran, Mosseri, Inbar, Freeman, William T., Rubinstein, Michael, Irani, Michal, and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We wish to automatically predict the "speediness" of moving objects in videos---whether they move faster, at, or slower than their "natural" speed. The core component in our approach is SpeedNet---a novel deep network trained to detect if a video is playing at normal rate, or if it is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet for generating time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical to videos that are sped up uniformly., Comment: Accepted to CVPR 2020 (oral). Project webpage: http://speednet-cvpr20.github.io
Published: 2020

33. Semantic Pyramid for Image Generation

Author: Shocher, Assaf, Gandelsman, Yossi, Mosseri, Inbar, Yarom, Michal, Irani, Michal, Freeman, William T., and Dekel, Tali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid -- a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.
Published: 2020

34. On the Effectiveness of ViT Features as Local Semantic Descriptors

Author: Amir, Shir, Gandelsman, Yossi, Bagon, Shai, Dekel, Tali, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Karlinsky, Leonid, editor, Michaeli, Tomer, editor, and Nishino, Ko, editor
Published: 2023
Full Text: View/download PDF

35. Speech2Face: Learning the Face Behind a Voice

Author: Oh, Tae-Hyun, Dekel, Tali, Kim, Changil, Mosseri, Inbar, Freeman, William T., Rubinstein, Michael, and Matusik, Wojciech
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how--and in what manner--our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers., Comment: To appear in CVPR2019. Project page: http://speech2face.github.io
Published: 2019

36. SinGAN: Learning a Generative Model from a Single Natural Image

Author: Shaham, Tamar Rott, Dekel, Tali, and Michaeli, Tomer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks., Comment: ICCV 2019
Published: 2019

37. Learning the Depths of Moving People by Watching Frozen People

Author: Li, Zhengqi, Dekel, Tali, Cole, Forrester, Tucker, Richard, Snavely, Noah, Liu, Ce, and Freeman, William T.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth., Comment: CVPR 2019 (Oral)
Published: 2019

38. Diverse Generation from a Single Video Made Possible

Author: Haim, Niv, Feinstein, Ben, Granot, Niv, Shocher, Assaf, Bagon, Shai, Dekel, Tali, Irani, Michal, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Avidan, Shai, editor, Brostow, Gabriel, editor, Cissé, Moustapha, editor, Farinella, Giovanni Maria, editor, and Hassner, Tal, editor
Published: 2022
Full Text: View/download PDF

39. MoSculp: Interactive Visualization of Shape and Time

Author: Zhang, Xiuming, Dekel, Tali, Xue, Tianfan, Owens, Andrew, He, Qiurui, Wu, Jiajun, Mueller, Stefanie, and Freeman, William T.
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present a system that allows users to visualize complex human motion via 3D motion sculptures---a representation that conveys the 3D structure swept by a human body as it moves through space. Given an input video, our system computes the motion sculptures and provides a user interface for rendering it in different styles, including the options to insert the sculpture back into the original video, render it in a synthetic scene or physically print it. To provide this end-to-end workflow, we introduce an algorithm that estimates that human's 3D geometry over time from a set of 2D images and develop a 3D-aware image-based rendering approach that embeds the sculpture back into the scene. By automating the process, our system takes motion sculpture creation out of the realm of professional artists, and makes it applicable to a wide range of existing video material. By providing viewers with 3D information, motion sculptures reveal space-time motion information that is difficult to perceive with the naked eye, and allow viewers to interpret how different parts of the object interact over time. We validate the effectiveness of this approach with user studies, finding that our motion sculpture visualizations are significantly more informative about motion than existing stroboscopic and space-time visualization methods., Comment: UIST 2018. Project page: http://mosculp.csail.mit.edu/
Published: 2018
Full Text: View/download PDF

40. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Author: Ephrat, Ariel, Mosseri, Inbar, Lang, Oran, Dekel, Tali, Wilson, Kevin, Hassidim, Avinatan, Freeman, William T., and Rubinstein, Michael
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest)., Comment: Accepted to SIGGRAPH 2018. Project webpage: https://looking-to-listen.github.io
Published: 2018
Full Text: View/download PDF

41. Smart, Sparse Contours to Represent and Edit Images

Author: Dekel, Tali, Gan, Chuang, Krishnan, Dilip, Liu, Ce, and Freeman, William T.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of reconstructing an image from information stored at contour locations. We show that high-quality reconstructions with high fidelity to the source image can be obtained from sparse input, e.g., comprising less than $6\%$ of image pixels. This is a significant improvement over existing contour-based reconstruction methods that require much denser input to capture subtle texture information and to ensure image quality. Our model, based on generative adversarial networks, synthesizes texture and details in regions where no input information is provided. The semantic knowledge encoded into our model and the sparsity of the input allows to use contours as an intuitive interface for semantically-aware image manipulation: local edits in contour domain translate to long-range and coherent changes in pixel space. We can perform complex structural changes such as changing facial expression by simple edits of contours. Our experiments demonstrate that humans as well as a face recognition system mostly cannot distinguish between our reconstructions and the source images., Comment: Accepted to CVPR'18; Project page: contour2im.github.io
Published: 2017

42. Best-Buddies Similarity - Robust Template Matching using Mutual Nearest Neighbors

Author: Oron, Shaul, Dekel, Tali, Xue, Tianfan, Freeman, William T., and Avidan, Shai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs)--pairs of points in source and target sets, where each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.
Published: 2016

43. Text2LIVE: Text-Driven Layered Image and Video Editing

Author: Bar-Tal, Omer, primary, Ofri-Amar, Dolev, additional, Fridman, Rafail, additional, Kasten, Yoni, additional, and Dekel, Tali, additional
Published: 2022
Full Text: View/download PDF

44. Disentangling Structure and Appearance in ViT Feature Space

Author: Tumanyan, Narek, primary, Bar-Tal, Omer, additional, Amir, Shir, additional, Bagon, Shai, additional, and Dekel, Tali, additional
Published: 2023
Full Text: View/download PDF

45. Detecting moving regions in CrowdCam images

Author: Dafni, Adi, Moses, Yael, Avidan, Shai, and Dekel, Tali
Published: 2017
Full Text: View/download PDF

46. Neural Congealing: Aligning Images to a Joint Semantic Atlas

Author: Ofri-Amar, Dolev, primary, Geyer, Michal, additional, Kasten, Yoni, additional, and Dekel, Tali, additional
Published: 2023
Full Text: View/download PDF

47. Imagic: Text-Based Real Image Editing with Diffusion Models

Author: Kawar, Bahjat, primary, Zada, Shiran, additional, Lang, Oran, additional, Tov, Omer, additional, Chang, Huiwen, additional, Dekel, Tali, additional, Mosseri, Inbar, additional, and Irani, Michal, additional
Published: 2023
Full Text: View/download PDF

48. Disentangling Structure and Appearance in ViT Feature Space.

Author: Tumanyan, Narek, Bar-Tal, Omer, Amir, Shir, Bagon, Shai, and Dekel, Tali
Subjects: TRANSFORMER models, FEEDFORWARD neural networks, TRANSFER of training
Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feedforward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

49. Unveiling Unexpected Training Data in Internet Video.

Author: DEKEL, TALI and SNAVELY, NOAH
Subjects: *STREAMING video & television, *MACHINE learning, *SUPERVISED learning, *DATA mining, *DEPTH maps (Digital image processing)
Abstract: The article examines the data mining of internet video in the training of machine learning systems. Alleged shortcomings of unsupervised learning based on internet videos are identified and the author calls for the adoption of a "self-supervised" approach. The example of computing depth maps based on real estate videos found on the YouTube platform using a parallax scene format called MPI is proffered along with an example of speech isolation and speaker identification.
Published: 2021
Full Text: View/download PDF

50. Self-Distilled StyleGAN: Towards Generation from Internet Photos

Author: Mokady, Ron, primary, Tov, Omer, additional, Yarom, Michal, additional, Lang, Oran, additional, Mosseri, Inbar, additional, Dekel, Tali, additional, Cohen-Or, Daniel, additional, and Irani, Michal, additional
Published: 2022
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

152 results on '"Dekel, Tali"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources