Author: "Ullman, Shimon" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Ullman, Shimon"' showing total 645 results

1. Towards Multimodal In-Context Learning for Vision & Language Models

Author: Doveh, Sivan, Perek, Shaked, Mirza, M. Jehanzeb, Lin, Wei, Alfassy, Amit, Arbelle, Assaf, Ullman, Shimon, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modalities primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (e.g., image captioning, question answering, etc.), still little emphasis has been put on transferring one of the core LLM capabilities, In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task with a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that the state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of present VLMs, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines across a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.
Published: 2024

2. Biologically-Motivated Learning Model for Instructed Visual Processing

Author: Abel, Roy and Ullman, Shimon
Subjects: Computer Science - Artificial Intelligence
Abstract: As part of understanding how the brain learns, ongoing work seeks to combine biological knowledge and current artificial intelligence (AI) modeling in an attempt to find an efficient biologically plausible learning scheme. Current models of biologically plausible learning often use a cortical-like combination of bottom-up (BU) and top-down (TD) processing, where the TD part carries feedback signals used for learning. However, in the visual cortex, the TD pathway plays a second major role of visual attention, by guiding the visual process to locations and tasks of interest. A biological model should therefore combine the two tasks, and learn to guide the visual process. We introduce a model that uses a cortical-like combination of BU and TD processing that naturally integrates the two major functions of the TD stream. The integrated model is obtained by an appropriate connectivity pattern between the BU and TD streams, a novel processing cycle that uses the TD part twice, and the use of 'Counter-Hebb' learning that operates across the streams. We show that the 'Counter-Hebb' mechanism can provide an exact backpropagation synaptic modification. We further demonstrate the model's ability to guide the visual stream to perform a task of interest, achieving competitive performance compared with AI models on standard multi-task learning benchmarks. The successful combination of learning and visual guidance could provide a new view on combining BU and TD processing in human vision, and suggests possible directions for both biologically plausible models and artificial instructed models, such as vision-language models (VLMs).
Published: 2023

3. Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Author: Doveh, Sivan, Arbelle, Assaf, Harary, Sivan, Herzig, Roei, Kim, Donghyun, Cascante-bonilla, Paola, Alfassy, Amit, Panda, Rameswar, Giryes, Raja, Feris, Rogerio, Ullman, Shimon, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called 'object bias' - their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these 'compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for finetuning and pre-training the VL model: (i) the caption quality, or in other words 'image-alignment', of the texts; and (ii) the 'density' of the captions in the sense of mentioning all the details appearing on the image. We propose a fine-tuning approach for automatically treating these factors leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to ~27% over the base model, up to ~20% over the strongest baseline, and by 6.7% on average.
Published: 2023

4. Teaching Structured Vision&Language Concepts to Vision&Language Models

Author: Doveh, Sivan, Arbelle, Assaf, Harary, Sivan, Panda, Rameswar, Herzig, Roei, Schwartz, Eli, Kim, Donghyun, Giryes, Raja, Feris, Rogerio, Ullman, Shimon, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model.
Published: 2022

5. A model for full local image interpretation

Author: Ben-Yosef, Guy, Assif, Liav, Harari, Daniel, and Ullman, Shimon
Subjects: Computer Science - Artificial Intelligence, Quantitative Biology - Neurons and Cognition
Abstract: We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models.
Comment: Published in the Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci), 2015
Published: 2021

6. Image interpretation by iterative bottom-up top-down processing

Author: Ullman, Shimon, Assif, Liav, Strugatski, Alona, Vatashsky, Ben-Zion, Levy, Hila, Netanyahu, Aviv, and Yaari, Adam
Subjects: Computer Science - Computer Vision and Pattern Recognition, Quantitative Biology - Neurons and Cognition
Abstract: Scene understanding requires the extraction and representation of scene components together with their properties and inter-relations. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (counter-streams structure). The model constructs a scene representation by the iterative use of three components. The first model component is a BU stream that extracts selected scene elements, properties and relations. The second component (cognitive augmentation) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the TD stream, in the form of a TD instruction, instructing the model what task to perform next. The TD stream then guides the BU visual stream to perform the selected task in the next cycle. During this process, the visual representations extracted from the image can be combined with relevant non-visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world. We describe how a sequence of TD-instructions is used to extract from the scene structures of interest, including an algorithm to automatically select the next TD-instruction in the sequence. The extraction process is shown to have favorable properties in terms of combinatorial generalization, generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of the human vision, and suggest directions for using the BU-TD scheme for integrating visual and cognitive components in the process of scene understanding.
Published: 2021

7. Detector-Free Weakly Supervised Grounding by Separation

Author: Arbelle, Assaf, Doveh, Sivan, Alfassy, Amit, Shtok, Joseph, Lev, Guy, Schwartz, Eli, Kuehne, Hilde, Levi, Hila Barak, Sattigeri, Prasanna, Panda, Rameswar, Chen, Chun-Fu, Bronstein, Alex, Saenko, Kate, Ullman, Shimon, Giryes, Raja, Feris, Rogerio, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to 8.5% over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above 7%) over the detector-based approaches for WSG.
Published: 2021

8. What can human minimal videos tell us about dynamic recognition models?

Author: Ben-Yosef, Guy, Kreiman, Gabriel, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Quantitative Biology - Neurons and Cognition
Abstract: In human vision objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that human visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art deep networks for dynamic visual recognition cannot replicate human behavior in these configurations. This gap between humans and machines points to critical mechanisms in human dynamic vision that are lacking in current models.
Comment: Published as a workshop paper at Bridging AI and Cognitive Science (ICLR 2020). Extended paper was published at Cognition
Published: 2021

9. What takes the brain so long: Object recognition at the level of minimal images develops for up to seconds of presentation time

Author: Benoni, Hanna, Harari, Daniel, and Ullman, Shimon
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Rich empirical evidence has shown that visual object recognition in the brain is fast and effortless, with relevant brain signals reported to start as early as 80 ms. Here we study the time trajectory of the recognition process at the level of minimal recognizable images (termed MIRC). These are images that can be recognized reliably, but in which a minute change of the image (reduction by either size or resolution) has a drastic effect on recognition. Subjects were assigned to one of nine exposure conditions: 200, 500, 1000, 2000 ms with or without masking, as well as unlimited time. The subjects were not limited in time to respond after presentation. The results show that in the masked conditions, recognition rates develop gradually over an extended period, e.g. average of 18% for 200 ms exposure and 45% for 500 ms, increasing significantly with longer exposure even above 2 secs. When presented for unlimited time (until response), MIRC recognition rates were equivalent to the rates of full-object images presented for 50 ms followed by masking. What takes the brain so long to recognize such images? We discuss why processes involving eye-movements, perceptual decision-making and pattern completion are unlikely explanations. Alternatively, we hypothesize that MIRC recognition requires an extended top-down process complementing the feed-forward phase.
Comment: 7 pages, 2 figures, 1 table
Published: 2020

10. Multi-Task Learning by a Top-Down Control Network

Author: Levi, Hila and Ullman, Shimon
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: As the range of tasks performed by a general vision system expands, executing multiple tasks accurately and efficiently in a single network has become an important and still open problem. Recent computer vision approaches address this problem by branching networks, or by a channel-wise modulation of the network feature-maps with task specific vectors. We present a novel architecture that uses a dedicated top-down control network to modify the activation of all the units in the main recognition network in a manner that depends on the selected task, image content, and spatial location. We show the effectiveness of our scheme by achieving significantly better results than alternative state-of-the-art approaches on four datasets. We further demonstrate our advantages in terms of task selectivity, scaling the number of tasks and interpretability.
Published: 2020

11. Efficient Coarse-to-Fine Non-Local Module for the Detection of Small Objects

Author: Levi, Hila and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: An image is not just a collection of objects, but rather a graph where each object is related to other objects through spatial and semantic relations. Using relational reasoning modules, such as the non-local module (Wang et al., 2017), can therefore improve object detection. Current schemes apply such dedicated modules either to a specific layer of the bottom-up stream, or between already-detected objects. We show that the relational process can be better modeled in a coarse-to-fine manner and present a novel framework, applying a non-local module sequentially to increasing resolution feature maps along the top-down stream. In this way, information can naturally be passed from larger objects to smaller related ones. Applying the module to fine feature maps further allows the information to pass between the small objects themselves, exploiting repetitions of instances of the same class. In practice, due to the expensive memory utilization of the non-local module, it is infeasible to apply the module as currently used to high-resolution feature maps. We redesigned the non-local module, improved it in terms of memory and number of operations, allowing it to be placed anywhere along the network. We further incorporated relative spatial information into the module, in a manner that can be incorporated into our efficient implementation. We show the effectiveness of our scheme by improving the results of detecting small objects on COCO by 1-2 AP points over Faster and Mask RCNN and by 1 AP over using a non-local module on the bottom-up stream.
Published: 2018

12. VQA with no questions-answers training

Author: Vatashsky, Ben-Zion and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Methods for teaching machines to answer visual questions have made significant progress in recent years, but current methods still lack important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answers and handling new domains without explicit examples. We propose a novel method that consists of two main parts: generating a question graph representation, and an answering procedure, guided by the abstract structure of the question graph to invoke an extendable set of visual estimators. Training is performed for the language part and the visual part on their own, but unlike existing schemes, the method does not require any training using images with associated questions and answers. This approach is able to handle novel domains (extended question types and new object classes, properties and relations) as long as corresponding visual estimators are available. In addition, it can provide explanations to its answers and suggest alternatives when questions are not grounded in the image. We demonstrate that this approach achieves both high performance and domain extensibility without any questions-answers training.
Comment: Accepted to CVPR 2020
Published: 2018

13. Understand, Compose and Respond - Answering Visual Questions by a Composition of Abstract Procedures

Author: Vatashsky, Ben Zion and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: An image related question defines a specific visual task that is required in order to produce an appropriate answer. The answer may depend on a minor detail in the image and require complex reasoning and use of prior knowledge. When humans perform this task, they are able to do it in a flexible and robust manner, integrating modularly any novel visual capability with diverse options for various elaborations of the task. In contrast, current approaches to solve this problem by a machine are based on casting the problem as an end-to-end learning problem, which lacks such abilities. We present a different approach, inspired by the aforementioned human capabilities. The approach is based on the compositional structure of the question. The underlying idea is that a question has an abstract representation based on its structure, which is compositional in nature. The question can consequently be answered by a composition of procedures corresponding to its substructures. The basic elements of the representation are logical patterns, which are put together to represent the question. These patterns include a parametric representation for object classes, properties and relations. Each basic pattern is mapped into a basic procedure that includes meaningful visual tasks, and the patterns are composed to produce the overall answering procedure. The UnCoRd (Understand Compose and Respond) system, based on this approach, integrates existing detection and classification schemes for a set of object classes, properties and relations. These schemes are incorporated in a modular manner, providing elaborated answers and corrections for negative answers. In addition, an external knowledge base is queried for required common-knowledge. We performed a qualitative analysis of the system, which demonstrates its representation capabilities and provide suggestions for future developments.
Published: 2018

14. Discovery and usage of joint attention in images

Author: Harari, Daniel, Tenenbaum, Joshua B., and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics, Quantitative Biology - Neurons and Cognition
Abstract: Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target, is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we deal with the extraction of joint attention events, and the use of such events for image descriptions. The work makes two novel contributions. First, our extraction algorithm is the first which identifies joint visual attention in single static images. It computes 3D gaze direction, identifies the gaze target by combining gaze direction with a 3D depth map computed for the image, and identifies the common gaze target. Second, we use a human study to demonstrate the sensitivity of humans to joint attention, suggesting that the detection of such a configuration in an image can be useful for understanding the image, including the goals of the agents and their joint activity, and therefore can contribute to image captioning and related tasks.
Comment: 6 pages, 3 figures
Published: 2018

15. Large Field and High Resolution: Detecting Needle in Haystack

Author: Gorodissky, Hadar, Harari, Daniel, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The growing use of convolutional neural networks (CNN) for a broad range of visual tasks, including tasks involving fine details, raises the problem of applying such networks to a large field of view, since the amount of computations increases significantly with the number of pixels. To deal effectively with this difficulty, we develop and compare methods of using CNNs for the task of small target localization in natural images, given a limited "budget" of samples to form an image. Inspired in part by human vision, we develop and compare variable sampling schemes, with peak resolution at the center and decreasing resolution with eccentricity, applied iteratively by re-centering the image at the previous predicted target location. The results indicate that variable resolution models significantly outperform constant resolution models. Surprisingly, variable resolution models and in particular multi-channel models, outperform the optimal, "budget-free" full-resolution model, using only 5% of the samples.
Comment: 15 pages, 7 figures
Published: 2018

16. Cakewalk Sampling

Author: Patish, Uri and Ullman, Shimon
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We study the task of finding good local optima in combinatorial optimization problems. Although combinatorial optimization is NP-hard in general, locally optimal solutions are frequently used in practice. Local search methods however typically converge to a limited set of optima that depend on their initialization. Sampling methods on the other hand can access any valid solution, and thus can be used either directly or alongside methods of the former type as a way for finding good local optima. Since the effectiveness of this strategy depends on the sampling distribution, we derive a robust learning algorithm that adapts sampling distributions towards good local optima of arbitrary objective functions. As a first use case, we empirically study the efficiency in which sampling methods can recover locally maximal cliques in undirected graphs. Not only do we show how our adaptive sampler outperforms related methods, we also show how it can even approach the performance of established clique algorithms. As a second use case, we consider how greedy algorithms can be combined with our adaptive sampler, and we demonstrate how this leads to superior performance in k-medoid clustering. Together, these findings suggest that our adaptive sampler can provide an effective strategy to combinatorial optimization problems that arise in practice.
Comment: Accepted as a conference paper by AAAI-2020 (oral presentation)
Published: 2018

17. A model for interpreting social interactions in local image regions

Author: Ben-Yosef, Guy, Yachin, Alon, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding social interactions (such as 'hug' or 'fight') is a basic and important capacity of the human visual system, but a challenging and still open problem for modeling. In this work we study visual recognition of social interactions, based on small but recognizable local regions. The approach is based on two novel key components: (i) A given social interaction can be recognized reliably from reduced images (called 'minimal images'). (ii) The recognition of a social interaction depends on identifying components and relations within the minimal image (termed 'interpretation'). We show psychophysics data for minimal images and modeling results for their interpretation. We discuss the integration of minimal configurations in recognizing social interactions in a detailed, high-resolution image.
Comment: In AAAI spring symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence, Palo Alto, 2017
Published: 2017

18. Oculo-retinal dynamics can explain the perception of minimal recognizable configurations

Author: Gruber, Liron Zipora, Ullman, Shimon, and Ahissar, Ehud
Published: 2021

19. Structured learning and detailed interpretation of minimal object images

Author: Ben-Yosef, Guy, Assif, Liav, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We model the process of human full interpretation of object images, namely the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. We model interpretation by a structured learning framework, in which there are primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of minimal configurations, namely reduced local regions that are minimal in the sense that further reduction will turn them unrecognizable and uninterpretable. We show experimental results of our model, and results of predicting and testing relations that were useful to the model via transformed minimal images.
Comment: Accepted to Workshop on Mutual Benefits of Cognitive and Computer Vision, at the International Conference on Computer Vision. Venice, Italy, 2017
Published: 2017

20. Visual Cortex Models for Object Recognition

Author: Poggio, Tomaso, Ullman, Shimon, Deguchi, Koichiro, Section editor, and Ikeuchi, Katsushi, editor
Published: 2021

21. Machine Recognition of Objects

Author: Poggio, Tomaso, Ullman, Shimon, Deguchi, Koichiro, Section editor, and Ikeuchi, Katsushi, editor
Published: 2021

22. Measuring and modeling the perception of natural and unconstrained gaze in humans and machines

Author: Harari, Daniel, Gao, Tao, Kanwisher, Nancy, Tenenbaum, Joshua, and Ullman, Shimon
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sources of information for the underlying perceptual processes? These questions pose a challenge from both empirical and computational perspectives, due to the complexity of the visual input in real-life situations. Here we measure empirically human accuracy in perceiving the gaze direction of others in lifelike scenes, and study computationally the sources of information and representations underlying this cognitive capacity. We show that humans perform better in face-to-face conditions compared with recorded conditions, and that this advantage is not due to the availability of input dynamics. We further show that humans are still performing well when only the eyes-region is visible, rather than the whole face. We develop a computational model, which replicates the pattern of human performance, including the finding that the eyes-region contains on its own, the required information for estimating both head orientation and direction of gaze. Consistent with neurophysiological findings on task-specific face regions in the brain, the learned computational representations reproduce perceptual effects such as the Wollaston illusion, when trained to estimate direction of gaze, but not when trained to recognize objects or faces.
Comment: Daniel Harari and Tao Gao contributed equally to this work
Published: 2016

23. Discovering containment: from infants to machines

Author: Ullman, Shimon, Dorfman, Nimrod, and Harari, Daniel
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Current artificial learning systems can recognize thousands of visual categories, or play Go at a champion's level, but cannot explain infant learning, in particular the ability to learn complex concepts without guidance, in a specific order. A notable example is the category of 'containers' and the notion of containment, one of the earliest spatial relations to be learned, starting already at 2.5 months, and preceding other common relations (e.g., support). Such spontaneous unsupervised learning stands in contrast with current highly successful computational models, which learn in a supervised manner, that is, by using large data sets of labeled examples. How can meaningful concepts be learned without guidance, and what determines the trajectory of infant learning, making some notions appear consistently earlier than others?
Published: 2016
Full Text: View/download PDF

24. Action Classification via Concepts and Attributes

Author: Rosenfeld, Amir and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Classes in natural images tend to follow long tail distributions. This is problematic when there are insufficient training examples for rare classes. This effect is emphasized in compound classes, involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts which are readily available. We detect the presence of prominent concepts in images and use them to infer the target labels instead of using visual features directly, combining tools from vision and natural-language processing. We validate our method on the recently introduced HICO dataset reaching a mAP of 31.54% and on the Stanford-40 Actions dataset, where the proposed method outperforms that obtained by direct visual features, obtaining an accuracy of 83.12%. Moreover, the method provides for each class a semantically meaningful list of keywords and relevant image regions relating it to its constituent concepts.
Published: 2016

25. Human Pose Estimation using Deep Consensus Voting

Author: Lifshitz, Ita, Fetaya, Ethan, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: In this paper we consider the problem of human pose estimation from a single still image. We propose a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net. The voting scheme allows us to utilize information from the whole image, rather than rely on a sparse set of keypoint locations. Using dense, multi-target votes, not only produces good keypoint predictions, but also enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting. This differs from most previous methods where joint probabilities are learned from relative keypoint locations and are independent of the image. We finally combine the keypoints votes and joint probabilities in order to identify the optimal pose configuration. We show our competitive performance on the MPII Human Pose and Leeds Sports Pose datasets.
Published: 2016

26. Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Author: Berzak, Yevgeni, Barbu, Andrei, Harari, Daniel, Katz, Boris, and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing to disambiguate sentences in a unified fashion across the different ambiguity types., Comment: EMNLP 2015
Published: 2016

27. Visual Concept Recognition and Localization via Iterative Introspection

Author: Rosenfeld, Amir and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Convolutional neural networks have been shown to develop internal representations, which correspond closely to semantically meaningful objects and parts, although trained solely on class labels. Class Activation Mapping (CAM) is a recent method that makes it possible to easily highlight the image regions contributing to a network's classification decision. We build upon these two developments to enable a network to re-examine informative image regions, which we term introspection. We propose a weakly-supervised iterative scheme, which shifts its center of attention to increasingly discriminative regions as it progresses, by alternating stages of classification and introspection. We evaluate our method and show its effectiveness over a range of several datasets, where we obtain competitive or state-of-the-art results: on Stanford-40 Actions, we set a new state of the art of 81.74%. On FGVC-Aircraft and the Stanford Dogs dataset, we show consistent improvements over baselines, some of which include significantly more supervision.
Published: 2016

28. Face-space Action Recognition by Face-Object Interactions

Author: Rosenfeld, Amir and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations. However, there are still many cases in which performance remains far from that of humans. In this paper, we approach the problem by learning explicitly, and then integrating three components of transitive actions: (1) the human body part relevant to the action (2) the object being acted upon and (3) the specific form of interaction between the person and the object. The process uses class-specific features and relations not used in the past for action recognition and which use inherently two cycles in the process unlike most standard approaches. We focus on face-related actions (FRA), a subset of actions that includes several currently challenging categories. We present an average relative improvement of 52% over the state of the art. We also make a new benchmark publicly available., Comment: our more recent work on a related topic is described in a separate paper: http://arxiv.org/abs/1511.03814
Published: 2016

29. Hand-Object Interaction and Precise Localization in Transitive Action Recognition

Author: Rosenfeld, Amir and Ullman, Shimon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor pose is similar, and recognition therefore depends on details of the grasp and the object, which may be largely occluded. In this paper we demonstrate how recognition is improved by obtaining precise localization of the action-object and consequently extracting details of the object shape together with the actor-object interaction. To obtain exact localization of the action object and its interaction with the actor, we employ a coarse-to-fine approach which combines semantic segmentation and contextual features, in successive stages. We focus on (but are not limited to) face-related actions, a set of actions that includes several currently challenging categories. We present an average relative improvement of 35% over the state of the art and validate through experimentation the effectiveness of our approach., Comment: Minor changes: title and abstract
Published: 2015

30. Learning Local Invariant Mahalanobis Distances

Author: Fetaya, Ethan and Ullman, Shimon
Subjects: Computer Science - Learning, Statistics - Machine Learning
Abstract: For many tasks and data types, there are natural transformations to which the data should be invariant or insensitive. For instance, in visual recognition, natural images should be insensitive to rotation and translation. This requirement and its implications have been important in many machine learning applications, and tolerance for image transformations was primarily achieved by using robust feature vectors. In this paper we propose a novel and computationally efficient way to learn a local Mahalanobis metric per datum, and show how we can learn a local invariant metric to any transformation in order to improve performance.
Published: 2015

31. When Computer Vision Gazes at Cognition

Author: Gao, Tao, Harari, Daniel, Tenenbaum, Joshua, and Ullman, Shimon
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about human ability to discriminate a third person gaze directed towards objects that are further away, especially in unconstrained cases where the looker can move her head and eyes freely. In this paper we address this question by jointly exploring human psychophysics and a cognitively motivated computer vision model, which can detect the 3D direction of gaze from 2D face images. The synthesis of behavioral study and computer vision yields several interesting discoveries. (1) Human accuracy of discriminating targets 8°-10° of visual angle apart is around 40% in a free looking gaze task; (2) The ability to interpret gaze of different lookers varies dramatically; (3) This variance can be captured by the computational model; (4) Humans outperform the current model significantly. These results collectively show that the acuity of human joint attention is indeed highly impressive, given the computational challenge of the natural looking task. Moreover, the gap between human and model performance, as well as the variability of gaze interpretation across different lookers, require further understanding of the underlying mechanisms utilized by humans for this challenging task., Comment: Tao Gao and Daniel Harari contributed equally to this work
Published: 2014

32. Minimal videos: Trade-off between spatial and temporal information in human and machine vision

Author: Ben-Yosef, Guy, Kreiman, Gabriel, and Ullman, Shimon
Published: 2020
Full Text: View/download PDF

33. Graph Approximation and Clustering on a Budget

Author: Fetaya, Ethan, Shamir, Ohad, and Ullman, Shimon
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: We consider the problem of learning from a similarity matrix (such as spectral clustering and low dimensional embedding), when computing pairwise similarities is costly, and only a limited number of entries can be observed. We provide a theoretical analysis using standard notions of graph approximation, significantly generalizing previous results (which focused on spectral clustering with two clusters). We also propose a new algorithmic approach based on adaptive sampling, which experimentally matches or improves on previous methods, while being considerably more general and computationally cheaper.
Published: 2014

34. View-tuned and view-invariant face encoding in IT cortex is explained by selected natural image fragments

Author: Nam, Yunjun, Sato, Takayuki, Uchida, Go, Malakhova, Ekaterina, Ullman, Shimon, and Tanifuji, Manabu
Published: 2021
Full Text: View/download PDF

35. Author Correction: View-tuned and view-invariant face encoding in IT cortex is explained by selected natural image fragments

Author: Nam, Yunjun, Sato, Takayuki, Uchida, Go, Malakhova, Ekaterina, Ullman, Shimon, and Tanifuji, Manabu
Published: 2021
Full Text: View/download PDF

36. A model for full local image interpretation

Author: Ben-Yosef, Guy, Assif, Liav, Harari, Daniel, and Ullman, Shimon
Subjects: Image understanding, visual object interpretation, objects and parts recognition, top-down processing
Abstract: We describe a computational model of humans' ability to provide a detailed interpretation of a scene's components. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models.
Published: 2015

37. A model for discovering ‘containment’ relations

Author: Ullman, Shimon, Dorfman, Nimrod, and Harari, Daniel
Published: 2019
Full Text: View/download PDF

38. Human-like scene interpretation by a guided counterstream processing

Author: Ullman, Shimon, primary, Assif, Liav, additional, Strugatski, Alona, additional, Vatashsky, Ben-Zion, additional, Levi, Hila, additional, Netanyahu, Aviv, additional, and Yaari, Adam, additional
Published: 2023
Full Text: View/download PDF

39. Visual Concept Recognition and Localization via Iterative Introspection

Author: Rosenfeld, Amir, Ullman, Shimon, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Lai, Shang-Hong, editor, Lepetit, Vincent, editor, Nishino, Ko, editor, and Sato, Yoichi, editor
Published: 2017
Full Text: View/download PDF

40. Full interpretation of minimal images

Author: Ben-Yosef, Guy, Assif, Liav, and Ullman, Shimon
Published: 2018
Full Text: View/download PDF

41. Learning to perceive coherent objects

Author: Dorfman, Nimrod, Harari, Daniel, and Ullman, Shimon
Published: 2013

42. Minimal Nativism: How does cognitive development get off the ground?

Author: Ullman, Tomer, Tenenbaum, Josh, Goodman, Noah, Ullman, Shimon, and Spelke, Elizabeth
Published: 2013

43. Workshop on Modeling the Perception of Intentions

Author: Tversky, Barbara, Ullman, Shimon, Baldwin, Dare, Pollick, Frank E., Tenenbaum, Joshua, Gao, Tao, Pantelis, Peter, and Pautler, David
Published: 2012

44. Teaching Structured Vision & Language Concepts to Vision & Language Models

Author: Doveh, Sivan, primary, Arbelle, Assaf, additional, Harary, Sivan, additional, Schwartz, Eli, additional, Herzig, Roei, additional, Giryes, Raja, additional, Feris, Rogerio, additional, Panda, Rameswar, additional, Ullman, Shimon, additional, and Karlinsky, Leonid, additional
Published: 2023
Full Text: View/download PDF

45. Recognition, Categorization, and the Emergence of Meaning

Author: Ullman, Shimon
Published: 2008

46. Attention Based Multi-Label Classification of Diabetic Retinopathy from Optical Coherence Tomography

Author: Segev, Dan, primary, Basri, Ronen, additional, Batash, Tomer, additional, Chowers, Itay, additional, Harari, Daniel, additional, Lender, Rivkah, additional, Levi, Jaime, additional, Shwartz, Yahel, additional, Tiosano, Liran, additional, Ullman, Shimon, additional, and Galun, Meirav, additional
Published: 2023
Full Text: View/download PDF

47. Top-Down Network Combines Back-Propagation with Attention

Author: Abel, Roy and Ullman, Shimon
Abstract: Cortical processing, in vision and other domains, combines bottom-up (BU) with extensive top-down (TD) processing. Two primary goals attributed to TD processing are learning and directing attention. These two roles are accomplished in current network models through distinct mechanisms. Attention guidance is often implemented by extending the model's architecture, while learning is typically accomplished by an external learning algorithm such as back-propagation. In the current work, we present an integration of the two functions above, which appear unrelated, using a single unified mechanism inspired by the human brain. We propose a novel symmetric bottom-up top-down network structure that can integrate conventional bottom-up networks with a symmetric top-down counterpart, allowing each network to recurrently guide and influence the other. For example, during multi-task learning, the same top-down network is being used for both learning, via propagating feedback signals, and at the same time also for top-down attention, by guiding the bottom-up network to perform a selected task. In contrast with standard models, no external back-propagation is used for learning. Instead, we propose a 'Counter-Hebb' learning, which adjusts the weights of both the bottom-up and top-down networks simultaneously. We show that our method achieves competitive performance on standard multi-task learning benchmarks. Yet, unlike existing methods, we rely on single-task architectures and optimizers, without any task-specific parameters. The results, which show how attention-guided multi-tasks can be combined efficiently with internal learning in a unified TD process, suggest a possible model for combining BU and TD processing in human vision.
Published: 2023

48. Efficient Rehearsal Free Zero Forgetting Continual Learning using Adaptive Weight Modulation

Author: Sverdlov, Yonatan and Ullman, Shimon
Abstract: Artificial neural networks encounter a notable challenge known as continual learning, which involves acquiring knowledge of multiple tasks over an extended period. This challenge arises due to the tendency of previously learned weights to be adjusted to suit the objectives of new tasks, resulting in a phenomenon called catastrophic forgetting. Most approaches to this problem seek a balance between maximizing performance on the new tasks and minimizing the forgetting of previous tasks. In contrast, our approach attempts to maximize the performance of the new task, while ensuring zero forgetting. This is accomplished by creating task-specific modulation parameters for each task. Only these are learnable parameters during the learning of consecutive tasks. Through comprehensive experimental evaluations, our model demonstrates superior performance in acquiring and retaining novel tasks that pose difficulties for other multi-task models. This emphasizes the efficacy of our approach in preventing catastrophic forgetting while accommodating the acquisition of new tasks.
Published: 2023

49. Atoms of recognition in human and computer vision

Author: Ullman, Shimon, Assif, Liav, Fetaya, Ethan, and Harari, Daniel
Published: 2016

50. Visual Cortex Models for Object Recognition

Author: Poggio, Tomaso, Ullman, Shimon, and Ikeuchi, Katsushi, editor
Published: 2014
Full Text: View/download PDF
