24,742 results for "Shahbaz, A."
Search Results
2. Evaluating Optimal Safe Flows Decomposition for RNA Assembly
- Author
Ahmed, Bashar, Rana, Siddharth Singh, Ujjwal, and Khan, Shahbaz
- Subjects
Computer Science - Data Structures and Algorithms
- Abstract
In bioinformatics, flow decomposition in directed acyclic graphs is a prominent model for the RNA assembly problem. However, it admits multiple solutions, of which exactly one correctly represents the underlying transcripts. The problem was addressed by the Safe and Complete framework~[RECOMB16], which reports all parts of the solution that are present in every possible solution. Khan et al.~[RECOMB22] first studied flow decomposition in the safe and complete framework. Their algorithm showed superior performance ($\approx20\%$) over the popular heuristic (greedy-width) on sufficiently complex graphs for a unified metric of precision and coverage (F-score). They presented the solution in multiple representations using simple but suboptimal algorithms, which were later optimized by Khan and Tomescu~[ESA22], who also presented an optimal representation. In this paper, we evaluate the practical significance of the optimal algorithms by Khan and Tomescu~[ESA22]. Our work highlights the significance of the theoretically optimal algorithms, which improve time (up to $60-70\%$) and memory (up to $76-85\%$), and of the optimal representations, which improve output size (up to $135-170\%$) significantly. However, the impact of the optimal algorithms was limited by a large number of extremely short safe paths. We propose heuristics to improve these representations further, yielding additional gains in time (up to $10\%$) and output size ($10-25\%$). In absolute terms, however, these improvements amounted to only a few seconds on the real datasets involved, owing to the small size of the graphs. We thus generated large random graphs to demonstrate the scalability of the above results. The older algorithms [RECOMB22] were not practical on moderately large graphs ($\geq 1M$ nodes), while the optimal algorithms [ESA22] scaled linearly to much larger graphs ($\geq 100M$ nodes).
- Published
- 2024
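The greedy-width heuristic benchmarked in the entry above is simple to state: repeatedly peel off a source-to-sink path of maximum bottleneck weight until the flow is exhausted. The sketch below is our illustrative reconstruction (assuming integer flows with conservation on a DAG with nodes `0..n-1`; all names are ours, not the papers' code):

```python
from collections import defaultdict

def topo_order(n, edges):
    """Kahn's algorithm over the support of the flow."""
    indeg = [0] * n
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = [u for u in range(n) if indeg[u] == 0]
    order = []
    while queue:
        u = queue.pop()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def greedy_width(n, flow, source, sink):
    """Decompose a DAG flow {(u, v): value} into weighted source-sink paths."""
    flow = {e: f for e, f in flow.items() if f > 0}
    paths = []
    while flow:
        adj = defaultdict(list)
        for (u, v), f in flow.items():
            adj[u].append((v, f))
        # Widest-path (max-bottleneck) DP in topological order.
        width = {u: 0 for u in range(n)}
        width[source] = float("inf")
        pred = {}
        for u in topo_order(n, flow):
            for v, f in adj[u]:
                w = min(width[u], f)
                if w > width[v]:
                    width[v], pred[v] = w, u
        # Peel the widest path off the remaining flow.
        w, path, v = width[sink], [sink], sink
        while v != source:
            v = pred[v]
            path.append(v)
        path.reverse()
        paths.append((w, path))
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= w
            if flow[(u, v)] == 0:
                del flow[(u, v)]
    return paths
```

Under flow conservation every iteration finds a positive-width path, so the loop terminates; the papers' point is precisely that the paths this greedy choice produces need not coincide with the safe (always-correct) parts of the decomposition.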
3. Efficient Localized Adaptation of Neural Weather Forecasting: A Case Study in the MENA Region
- Author
Munir, Muhammad Akhtar, Khan, Fahad Shahbaz, and Khan, Salman
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Physics - Atmospheric and Oceanic Physics
- Abstract
Accurate weather and climate modeling is critical for both scientific advancement and safeguarding communities against environmental risks. Traditional approaches rely heavily on Numerical Weather Prediction (NWP) models, which simulate energy and matter flow across Earth's systems. However, heavy computational requirements and low efficiency restrict the suitability of NWP, leading to a pressing need for enhanced modeling techniques. Neural network-based models have emerged as promising alternatives, leveraging data-driven approaches to forecast atmospheric variables. In this work, we focus on limited-area modeling and train our model specifically for localized region-level downstream tasks. As a case study, we consider the MENA region due to its unique climatic challenges, where accurate localized weather forecasting is crucial for managing water resources and agriculture, and for mitigating the impacts of extreme weather events. This targeted approach allows us to tailor the model's capabilities to the unique conditions of the region of interest. Our study aims to validate the effectiveness of integrating parameter-efficient fine-tuning (PEFT) methodologies, specifically Low-Rank Adaptation (LoRA) and its variants, to enhance forecast accuracy, as well as training speed, computational resource utilization, and memory efficiency in weather and climate modeling for specific regions., Comment: Our codebase and pre-trained models can be accessed at: [this url](https://github.com/akhtarvision/weather-regional)
- Published
- 2024
4. iSeg: An Iterative Refinement-based Framework for Training-free Segmentation
- Author
Sun, Lin, Cao, Jiale, Xie, Jin, Khan, Fahad Shahbaz, and Pang, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Stable diffusion has demonstrated strong image synthesis ability from given text descriptions, suggesting that it contains strong semantic cues for grouping objects. Inspired by this, researchers have explored employing stable diffusion for training-free segmentation. Most existing approaches either simply employ the cross-attention map or refine it with the self-attention map to generate segmentation masks. We believe that iterative refinement with the self-attention map would lead to better results. However, we empirically demonstrate that such refinement is sub-optimal, likely because the self-attention map contains irrelevant global information that hampers accurately refining the cross-attention map over multiple iterations. To address this, we propose an iterative refinement framework for training-free segmentation, named iSeg, featuring an entropy-reduced self-attention module which utilizes a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, our iSeg stably improves the refined cross-attention map with iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks reveal the merits of the proposed contributions, leading to promising performance on diverse segmentation tasks. For unsupervised semantic segmentation on Cityscapes, our iSeg achieves an absolute gain of 3.8% in terms of mIoU compared to the best existing training-free approach in the literature. Moreover, our proposed iSeg can support segmentation with different kinds of images and interactions., Comment: Project Page: https://linsun449.github.io/iSeg/ Code: https://github.com/linsun449/iseg.code
- Published
- 2024
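The entropy-reduction step this abstract describes can be pictured with a short sketch: treat the attention logits as free variables and take a few gradient steps that lower the row-wise entropy of the softmax map. This is our illustration of the idea, not iSeg's code; the step count and learning rate are placeholder values.

```python
import torch

def reduce_attention_entropy(logits, steps=10, lr=0.1):
    """logits: (N, N) pre-softmax self-attention scores."""
    z = logits.clone().detach().requires_grad_(True)
    for _ in range(steps):
        attn = torch.softmax(z, dim=-1)
        # Row-wise Shannon entropy; eps avoids log(0).
        entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1).mean()
        grad, = torch.autograd.grad(entropy, z)
        with torch.no_grad():
            z -= lr * grad
    return torch.softmax(z, dim=-1).detach()
```

Lower row entropy concentrates each row's mass on a few positions, which matches the suppression of weak global responses that the abstract motivates.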
5. CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
- Author
Li, Long, Liu, Nian, Zhang, Dingwen, Li, Zhongyu, Khan, Salman, Anwer, Rao, Cholakkal, Hisham, Han, Junwei, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still fall short of sufficient inter-image association modeling, because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations, which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploy deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e. semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results on three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings., Comment: There is an error. In Sec 4.1, the number of images in some dataset is incorrect and needs to be revised
- Published
- 2024
6. Explanation Space: A New Perspective into Time Series Interpretability
- Author
Rezaei, Shahbaz and Liu, Xin
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract
Human-understandable explanation of deep learning models is necessary for many critical and sensitive applications. Unlike image or tabular data, where the importance of each input feature (for the classifier's decision) can be directly projected into the input, distinguishing features of time series (e.g. dominant frequency) are often hard to manifest in the time domain for a user to easily understand. Moreover, most explanation methods require a baseline value as an indication of the absence of any feature. However, the notion of lack of feature, which is often defined as black pixels for vision tasks or zero/mean values for tabular data, is not well-defined in time series. Despite the adoption of explainable AI (XAI) methods from the tabular and vision domains into the time series domain, these differences limit the application of these XAI methods in practice. In this paper, we propose a simple yet effective method that allows a model originally trained on the time domain to be interpreted in other explanation spaces using existing methods. We suggest four explanation spaces, each of which can potentially alleviate these issues in certain types of time series. Our method can be readily adopted in existing platforms without any change to trained models or XAI methods. The code is available at https://github.com/shrezaei/TS-X-spaces.
- Published
- 2024
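One way to read the core idea in this entry is as a change of variables: re-express a trained time-domain classifier as a function of an alternative input space, so that any off-the-shelf attribution method explains coefficients of that space instead of raw time steps. The wrapper below is our own minimal sketch of that pattern using the rFFT as the example space; the authors' four actual spaces live in their repository.

```python
import torch

class FrequencySpaceWrapper(torch.nn.Module):
    """Expose a frozen time-domain model as a function of rFFT coefficients."""

    def __init__(self, time_model, length):
        super().__init__()
        self.time_model = time_model  # trained on raw series of size `length`
        self.length = length

    def forward(self, z):
        # z: (batch, length//2 + 1) complex spectrum. Inverting the transform
        # inside forward() keeps the map differentiable end-to-end, so
        # gradient-based XAI attributes importance to frequencies, not samples.
        x = torch.fft.irfft(z, n=self.length)
        return self.time_model(x)
```

An attribution method that cannot handle complex inputs can be fed stacked real and imaginary parts instead; either way the trained model itself is untouched, matching the abstract's claim that no retraining is needed.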
7. Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification
- Author
Kan, Ziwen, Rezaei, Shahbaz, and Liu, Xin
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
- Abstract
The popularity of deep learning methods in the time series domain boosts interest in interpretability studies, including counterfactual (CF) methods. CF methods identify minimal changes in instances to alter the model predictions. Despite extensive research, no existing work benchmarks CF methods in the time series domain. Additionally, the results reported in the literature are inconclusive due to the limited number of datasets and inadequate metrics. In this work, we redesign quantitative metrics to accurately capture desirable characteristics in CFs. We specifically redesign the metrics for sparsity and plausibility and introduce a new metric for consistency. Combined with validity, generation time, and proximity, we form a comprehensive metric set. We systematically benchmark 6 different CF methods on 20 univariate datasets and 10 multivariate datasets with 3 different classifiers. Results indicate that the performance of CF methods varies across metrics and among different models. Finally, we provide case studies and a guideline for practical usage., Comment: 15 pages, 27 figures
- Published
- 2024
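For readers new to counterfactual evaluation, two of the simpler quantities in such a metric set can be written down directly. The definitions below are generic textbook formulations with our own naming; they are not the redesigned metrics this paper proposes.

```python
import numpy as np

def cf_validity(predict, x_cf, target):
    """1.0 if the counterfactual actually flips the model to `target`."""
    return float(predict(x_cf[None])[0] == target)

def cf_sparsity(x, x_cf, tol=1e-6):
    """Fraction of time steps the counterfactual leaves unchanged.

    x, x_cf: arrays of shape (T,) or (T, C); higher means sparser edits.
    """
    diff = np.abs(x - x_cf) > tol
    changed = diff.any(axis=-1) if diff.ndim > 1 else diff
    return 1.0 - changed.mean()
```

Proximity is typically an L1/L2 distance on the same pair and generation time is wall-clock; the paper's contribution is redesigning sparsity and plausibility, and adding consistency, so that such a set benchmarks reliably.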
8. A GTC spectroscopic study of three spider pulsar companions: line-based temperatures, a new face-on redback, and improved mass constraints
- Author
Simpson, Jordan A., Linares, Manuel, Casares, Jorge, Shahbaz, Tariq, Sen, Bidisha, and Camilo, Fernando
- Subjects
Astrophysics - High Energy Astrophysical Phenomena, Astrophysics - Solar and Stellar Astrophysics
- Abstract
We present GTC-OSIRIS phase-resolved optical spectroscopy of three compact binary MSPs, or `spiders': PSR J1048+2339, PSR J1810+1744, and (for the first time) PSR J1908+2105. For the companion in each system, the temperature is traced throughout its orbit, and radial velocities are measured. The radial velocities are found to vary with the absorption features used when measuring them, resulting in a lower radial velocity curve semi-amplitude measured from the day side of two of the systems when compared to the night: for J1048 ($K_\mathrm{day} = 344 \pm 4$ km s$^{-1}$, $K_\mathrm{night} = 372 \pm 3$ km s$^{-1}$) and, tentatively, for J1810 ($K_\mathrm{day} = 448 \pm 19$ km s$^{-1}$, $K_\mathrm{night} = 491 \pm 32$ km s$^{-1}$). With existing inclination constraints, this gives the neutron star (NS) and companion masses $M_\mathrm{NS} = 1.50 - 2.04$ $M_\odot$ and $M_2 = 0.32 - 0.40$ $M_\odot$ for J1048, and $M_\mathrm{NS} > 1.7$ $M_\odot$ and $M_2 = 0.05 - 0.08$ $M_\odot$ for J1810. For J1908, we find an upper limit of $K_2 < 32$ km s$^{-1}$, which constrains its mass ratio $q = M_2 / M_\mathrm{NS} > 0.55$ and inclination $i < 6.0^\circ$, revealing the previously misunderstood system to be the highest mass ratio, lowest inclination redback yet. This raises questions for the origins of its substantial radio eclipses. Additionally, we find evidence of asymmetric heating in J1048 and J1810, and signs of metal enrichment in J1908. We also explore the impact of inclination on spectroscopic temperatures, and demonstrate that the temperature measured at quadrature ($\phi = 0.25, 0.75$) is essentially independent of inclination, and thus can provide additional constraints on photometric modelling., Comment: Submitted to MNRAS. 18 pages, 19 figures
- Published
- 2024
9. A Single Channel-Based Neonatal Sleep-Wake Classification using Hjorth Parameters and Improved Gradient Boosting
- Author
Arslan, Muhammad, Mubeen, Muhammad, Abbasi, Saadullah Farooq, Khan, Muhammad Shahbaz, Boulila, Wadii, and Ahmad, Jawad
- Subjects
Computer Science - Machine Learning, Electrical Engineering and Systems Science - Signal Processing
- Abstract
Sleep plays a crucial role in neonatal development. Monitoring the sleep patterns of neonates in a Neonatal Intensive Care Unit (NICU) is imperative for understanding the maturation process. While polysomnography (PSG) is considered the best practice for sleep classification, its expense and reliance on human annotation pose challenges. Existing research often relies on multichannel EEG signals; however, concerns arise regarding the vulnerability of neonates and the potential impact on their sleep quality. This paper introduces a novel approach to neonatal sleep stage classification using a gradient boosting algorithm with Hjorth features extracted from a single EEG channel. The gradient boosting parameters are fine-tuned using random search cross-validation (randomsearchCV), achieving an accuracy of 82.35% for neonatal sleep-wake classification. Validation is conducted through 5-fold cross-validation. The proposed algorithm not only enhances existing neonatal sleep algorithms but also opens avenues for broader applications., Comment: 8 pages, 5 figures, 3 tables, International Polydisciplinary Conference on Artificial Intelligence and New Technologies
- Published
- 2024
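The Hjorth features this entry relies on have standard closed-form definitions, so a faithful sketch is short; the classifier setup below it is illustrative only (the hyperparameter grid is our guess), with scikit-learn's RandomizedSearchCV standing in for the abstract's random search cross-validation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

def hjorth_parameters(x):
    """Activity, mobility, complexity of a 1-D signal (one EEG-channel epoch)."""
    dx, ddx = np.diff(x), np.diff(np.diff(x))
    activity = np.var(x)                                 # signal power
    mobility = np.sqrt(np.var(dx) / np.var(x))           # mean-frequency proxy
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility  # bandwidth proxy
    return activity, mobility, complexity

# Illustrative tuning setup; the paper's grid and budget may differ.
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={"n_estimators": [100, 200, 400],
                         "learning_rate": [0.01, 0.05, 0.1],
                         "max_depth": [2, 3, 4]},
    n_iter=10, cv=5)
# search.fit(features, labels)  # features: Hjorth triplets per epoch
```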
10. BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning
- Author
Hanif, Asif, Shamshad, Fahad, Awais, Muhammad, Naseer, Muzammal, Khan, Fahad Shahbaz, Nandakumar, Karthik, Khan, Salman, and Anwer, Rao Muhammad
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Medical foundation models are gaining prominence in the medical community for their ability to derive general representations from extensive collections of medical image-text pairs. Recent research indicates that these models are susceptible to backdoor attacks, which allow them to classify clean images accurately but fail when specific triggers are introduced. However, traditional backdoor attacks necessitate a considerable amount of additional data to maliciously pre-train a model. This requirement is often impractical in medical imaging applications due to the usual scarcity of data. Inspired by the latest developments in learnable prompts, this work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase. By incorporating learnable prompts within the text encoder and introducing an imperceptible learnable noise trigger to the input images, we exploit the full capabilities of the medical foundation models (Med-FM). Our method, BAPLe, requires only a minimal subset of data to adjust the noise trigger and the text prompts for downstream tasks, enabling the creation of an effective backdoor attack. Through extensive experiments with four medical foundation models, each pre-trained on different modalities and evaluated across six downstream datasets, we demonstrate the efficacy of our approach. BAPLe achieves a high backdoor success rate across all models and datasets, outperforming the baseline backdoor attack methods. Our work highlights the vulnerability of Med-FMs towards backdoor attacks and strives to promote the safe adoption of Med-FMs before their deployment in real-world applications. Code is available at https://asif-hanif.github.io/baple/., Comment: MICCAI 2024
- Published
- 2024
11. Connecting Dreams with Visual Brainstorming Instruction
- Author
Sun, Yasheng, Li, Bohan, Zhuge, Mingchen, Fan, Deng-Ping, Khan, Salman, Khan, Fahad Shahbaz, and Koike, Hideki
- Subjects
Computer Science - Human-Computer Interaction
- Abstract
Recent breakthroughs in understanding the human brain have revealed its impressive ability to efficiently process and interpret human thoughts, opening up possibilities for intervening in brain signals. In this paper, we aim to develop a straightforward framework that uses other modalities, such as natural language, to translate the original dreamland. We present DreamConnect, which employs a dual-stream diffusion framework to manipulate visually stimulated brain signals. By integrating an asynchronous diffusion strategy, our framework establishes an effective interface with human dreams, progressively refining their final imagery synthesis. Through extensive experiments, we demonstrate our method's ability to accurately instruct human brain signals with high fidelity. Our project will be publicly available on https://github.com/Sys-Nexus/DreamConnect
- Published
- 2024
12. The mass of the white dwarf in YY Dra (=DO Dra): Dynamical measurement and comparative study with X-ray estimates
- Author
Álvarez-Hernández, Ayoze, Torres, Manuel A. P., Shahbaz, Tariq, Rodríguez-Gil, Pablo, Gazeas, Kosmas D., Sánchez-Sierras, Javier, Jonker, Peter G., Corral-Santana, Jesús M., Acosta-Pulido, Jose A., and Hakala, Pasi
- Subjects
Astrophysics - Solar and Stellar Astrophysics, Astrophysics - High Energy Astrophysical Phenomena
- Abstract
We present a dynamical study of the intermediate polar cataclysmic variable YY Dra based on time-series observations in the $K$ band, where the donor star is known to be the major flux contributor. We covered the $3.97$-h orbital cycle with 44 spectra taken between $2020$ and $2022$ and two epochs of photometry observed in 2021 March and May. One of the light curves was simultaneously obtained with spectroscopy to better account for the effects of irradiation of the donor star and the presence of accretion light. From the spectroscopy, we derived the radial velocity curve of the donor star metallic absorption lines, constrained its spectral type to M0.5$-$M3.5 with no measurable changes in the effective temperature between the irradiated and non-irradiated hemispheres of the star, and measured its projected rotational velocity $v_\mathrm{rot} \sin i = 103 \pm 2 \, \mathrm{km}\,\mathrm{s}^{-1}$. Through simultaneous modelling of the radial velocity and light curves, we derived values for the radial velocity semi-amplitude of the donor star, $K_2 = 188^{+1}_{-2} \, \mathrm{km} \, \mathrm{s}^{-1}$, the donor to white dwarf mass ratio, $q=M_2/M_1 = 0.62 \pm 0.02$, and the orbital inclination, $i={42^{\circ}}^{+2^{\circ}}_{-1^{\circ}}$. These binary parameters yield dynamical masses of $M_{1} = 0.99^{+0.10}_{-0.09} \, \mathrm{M}_{\odot}$ and $M_2 = 0.62^{+0.07}_{-0.06} \, \mathrm{M}_{\odot}$ ($68$ per cent confidence level). As found for the intermediate polars GK Per and XY Ari, the white dwarf dynamical mass in YY Dra significantly differs from several estimates obtained by modelling the X-ray spectral continuum., Comment: 12 pages, 7 figures, 6 tables, accepted for publication in A&A
- Published
- 2024
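The dynamical masses quoted in this entry follow from the standard binary mass function, $f(M_1) = P K_2^3 / (2\pi G) = M_1 \sin^3 i \,/\, (1+q)^2$, and the abstract's numbers reproduce it. A quick numerical check (constants rounded, so the last digit may wobble):

```python
import numpy as np

G, M_SUN = 6.674e-11, 1.989e30       # SI units; solar mass in kg

P = 3.97 * 3600                      # orbital period [s]
K2 = 188e3                           # donor semi-amplitude [m/s]
q = 0.62                             # mass ratio M2 / M1
i = np.radians(42.0)                 # orbital inclination

f = P * K2**3 / (2 * np.pi * G)      # mass function [kg]
M1 = f * (1 + q)**2 / np.sin(i)**3   # white dwarf mass
print(M1 / M_SUN, q * M1 / M_SUN)    # ~1.0 and ~0.62, matching the abstract
```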
13. Learning Camouflaged Object Detection from Noisy Pseudo Label
- Author
Zhang, Jin, Zhang, Ruiheng, Shi, Yanjiao, Cao, Zhe, Liu, Nian, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
Existing Camouflaged Object Detection (COD) methods rely heavily on large-scale pixel-annotated training sets, which are both time-consuming and labor-intensive. Although weakly supervised methods offer higher annotation efficiency, their performance lags far behind due to the unclear visual demarcations between foreground and background in camouflaged images. In this paper, we explore the potential of using boxes as prompts in camouflaged scenes and introduce the first weakly semi-supervised COD method, aiming for budget-efficient and high-precision camouflaged object segmentation with an extremely limited number of fully labeled images. Critically, learning from such a limited set inevitably generates pseudo labels with serious noisy pixels. To address this, we propose a noise correction loss that facilitates the model's learning of correct pixels in the early learning stage, and corrects the error risk gradients dominated by noisy pixels in the memorization stage, ultimately achieving accurate segmentation of camouflaged objects from noisy labels. When using only 20% of fully labeled data, our method shows superior performance over the state-of-the-art methods., Comment: Accepted by ECCV2024
- Published
- 2024
14. GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model
- Author
Shaker, Abdelrahman, Wasim, Syed Talal, Khan, Salman, Gall, Juergen, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recent advancements in state-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity. However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. To address this, we introduce a Modulated Group Mamba layer which divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection and instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more parameter-efficient than the best existing Mamba design of the same model size. Our code and models are available at: https://github.com/Amshaker/GroupMamba., Comment: Preprint. Our code and models are available at: https://github.com/Amshaker/GroupMamba
- Published
- 2024
15. Black hole X-ray binary A0620$\unicode{x2013}$00 in quiescence: hints of Faraday rotation of near-infrared and optical polarization?
- Author
Kravtsov, Vadim, Veledina, Alexandra, Berdyugin, Andrei V., Tsygankov, Sergey, Shahbaz, Tariq, Torres, Manuel A. P., Jermak, Helen, McCall, Callum, Kajava, Jari J. E., Piirola, Vilppu, Sakanoi, Takeshi, Kagitani, Masato, Berdyugina, Svetlana V., and Poutanen, Juri
- Subjects
Astrophysics - High Energy Astrophysical Phenomena, Astrophysics - Solar and Stellar Astrophysics
- Abstract
We present simultaneous high-precision optical polarimetric and near-infrared (NIR) to ultraviolet (UV) photometric observations of the low-mass black hole X-ray binary A0620$\unicode{x2013}$00 in the quiescent state. Subtracting interstellar polarization, estimated from a sample of field stars, we derive the intrinsic polarization of A0620$\unicode{x2013}$00. We show that the intrinsic polarization degree (PD) is variable with the orbital period with an amplitude of $\sim0.3\%$, at least in the $R$ band, where the signal-to-noise ratio of our observations is the best. It implies that some fraction of the optical polarization is produced by scattering of stellar radiation off the matter that follows the black hole in its orbital motion. In addition, we see a rotation of the orbit-averaged intrinsic polarization angle (PA) with wavelength, from $164\deg$ in the $R$ to $180\deg$ in the $B$ band. All of the above, combined with the historical NIR to optical polarimetric observations, shows the complex behavior of the average intrinsic polarization of A0620$\unicode{x2013}$00, with the PA rotating continuously from the infrared to the blue band by $\sim56\deg$ in total, while the PD $\sim1\%$ remains nearly constant over the entire spectral range. The spectral dependence of the PA can be described by Faraday rotation with a rotation measure of RM=$-0.2$ rad $\mu$m$^{-2}$, implying a magnetic field of a few Gauss in the plasma surrounding the black hole accretion disk. However, our preferred interpretation for the peculiar wavelength dependence is the interplay between two polarized components with different PAs. Polarimetric measurements in the UV range can help distinguish between these scenarios., Comment: 8 pages, 10 figures, submitted
- Published
- 2024
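The Faraday interpretation in this entry is the usual $\lambda^2$ law, $\mathrm{PA}(\lambda) = \mathrm{PA}_0 + \mathrm{RM}\,\lambda^2$. A quick numerical illustration with the quoted RM, anchored at the B-band PA (the band effective wavelengths below are nominal values we assume, not the paper's calibration):

```python
import numpy as np

RM = -0.2                          # rad / micron^2, as quoted in the abstract
pa_B = np.radians(180.0)           # measured position angle in the B band
lam = np.array([0.44, 0.64, 2.2])  # B, R, K effective wavelengths [micron]

pa = pa_B + RM * (lam**2 - lam[0]**2)  # PA(lambda) = PA_B + RM*(l^2 - l_B^2)
print(np.degrees(pa))              # ~[180.0, 177.5, 126.8] degrees
```

With these assumed wavelengths, the predicted blue-to-infrared rotation ($\sim53\deg$) is close to the quoted $\sim56\deg$, but the R-band prediction ($\sim177\deg$) overshoots the measured $164\deg$, consistent with the authors preferring the two-component interpretation over a single Faraday screen.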
16. Biogenic synthesis of Ag-doped TiO2 photocatalyst using Citrus paradisi extract for solar-triggered degradation of methylene blue
- Author
Memon, Muddassir Ali, Akhtar, M Wasim, Shahbaz, Raja, Gabol, Nasir M, Khuhawar, Muhammad Yar, and Khan, M Yasir
- Published
- 2024
17. Fast X-ray/IR observations of the black hole transient Swift~J1753.5--0127: from an IR lead to a very long jet lag
- Author
Ulgiati, Alberto, Vincentelli, Federico Maria, Casella, Piergiorgio, Veledina, Alexandra, Maccarone, Thomas, Russell, David, Uttley, Phil, Ambrosino, Filippo, Baglio, Maria Cristina, Imbrogno, Matteo, Melandri, Andrea, Motta, Sara Elisa, O'Brien, Kiran, Sanna, Andrea, Shahbaz, Tariq, Altamirano, Diego, Fender, Rob, Maitra, Dipankar, and Malzac, Julien
- Subjects
Astrophysics - High Energy Astrophysical Phenomena
- Abstract
We report on two epochs of simultaneous near-infrared (IR) and X-ray observations, with sub-second time resolution, of the low mass X-ray binary black hole candidate Swift J1753.5--0127 during its long 2005--2016 outburst. Data were collected strictly simultaneously with VLT/ISAAC (K$_{S}$ band, 2.2 $\mu m$) and RXTE (2-15 keV) or \textit{XMM-Newton} (0.7-10 keV). A clear correlation between the X-ray and the IR variable emission is found during both epochs, but with very different properties. In the first epoch, the near-IR variability leads the X-ray by $\sim 130 \, ms$. This is the opposite of what is usually observed in similar systems. The correlation is more complex in the second epoch, with both anti-correlations and correlations at negative and positive lags. Frequency-resolved Fourier analysis allows us to identify two main components in the complex structure of the phase lags: the first component, characterised by a near-IR lag of a few seconds at low frequencies, is consistent with a combination of disc reprocessing and a magnetised hot flow; the second component is identified at high frequencies by a near-IR lag of $\approx$0.7 s. Given the similarities of this second component with the well-known constant optical/near-IR jet lag observed in other black hole transients, we tentatively interpret this feature as a signature of a longer-than-usual jet lag. We discuss the possible implications of measuring such a long jet lag in a radio-quiet black hole transient., Comment: 10 pages, 7 figures, accepted for publication in A&A
- Published
- 2024
18. Rapid Mid-Infrared Spectral-Timing with JWST. I. The prototypical black hole X-ray Binary GRS 1915+105 during a MIR-bright and X-ray-obscured state
- Author
Gandhi, P., Borowski, E. S., Byrom, J., Hynes, R. I., Maccarone, T. J., Shaw, A. W., Adegoke, O. K., Altamirano, D., Baglio, M. C., Bhargava, Y., Britt, C. T., Buckley, D. A. H., Buisson, D. J. K., Casella, P., Segura, N. Castro, Charles, P. A., Corral-Santana, J. M., Dhillon, V. S., Fender, R., Gúrpide, A., Heinke, C. O., Igl, A. B., Knigge, C., Markoff, S., Mastroserio, G., McCollough, M. L., Middleton, M., Miller, J. M., Miller-Jones, J. C. A., Motta, S. E., Paice, J. A., Pawar, D. D., Plotkin, R. M., Pradhan, P., Ressler, M. E., Russell, D. M., Russell, T. D., Santos-Sanz, P., Shahbaz, T., Sivakoff, G. R., Steeghs, D., Tetarenko, A. J., Tomsick, J. A., Vincentelli, F. M., George, M., Gurwell, M., and Rao, R.
- Subjects
Astrophysics - High Energy Astrophysical Phenomena, Astrophysics - Solar and Stellar Astrophysics
- Abstract
We present mid-infrared (MIR) spectral-timing measurements of the prototypical Galactic microquasar GRS 1915+105. The source was observed with the Mid-Infrared Instrument (MIRI) onboard JWST in June 2023 at a MIR luminosity L(MIR)~10^{36} erg/s exceeding past IR levels by about a factor of 10. By contrast, the X-ray flux is much fainter than the historical average, in the source's now-persistent 'obscured' state. The MIRI low-resolution spectrum shows a plethora of emission lines, the strongest of which are consistent with recombination in the hydrogen Pfund (Pf) series and higher. Low amplitude (~1%) but highly significant peak-to-peak photometric variability is found on timescales of ~1,000 s. The brightest Pf(6-5) emission line lags the continuum. Though difficult to constrain accurately, this lag is commensurate with light-travel timescales across the outer accretion disc or with expected recombination timescales inferred from emission line diagnostics. Using the emission line as a bolometric indicator suggests a moderate (~5-30% Eddington) intrinsic accretion rate. Multiwavelength monitoring shows that JWST caught the source close in-time to unprecedentedly bright MIR and radio long-term flaring. Assuming a thermal bremsstrahlung origin for the MIRI continuum suggests an unsustainably high mass-loss rate during this time unless the wind remains bound, though other possible origins cannot be ruled out. PAH features previously detected with Spitzer are now less clear in the MIRI data, arguing for possible destruction of dust in the interim. These results provide a preview of new parameter space for exploring MIR spectral-timing in XRBs and other variable cosmic sources on rapid timescales., Comment: Dedicated to the memory of our colleague, Tomaso Belloni. Submitted 2024 June 21; Comments welcome
- Published
- 2024
19. Open-Vocabulary Temporal Action Localization using Multimodal Guidance
- Author
Gupta, Akshita, Arora, Aditya, Narayan, Sanath, Khan, Salman, Khan, Fahad Shahbaz, and Taylor, Graham W.
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism to learn the alignment between class representations and frame-level video features, facilitating the multimodal guided features. Third, we propose a two-stage training strategy which includes training with a larger vocabulary dataset and finetuning to downstream data to generalize to novel categories. OVFormer extends existing TAL methods to open-vocabulary settings. Comprehensive evaluations on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our method. Code and pretrained models will be publicly released.
- Published
- 2024
20. VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
- Author
Bharadwaj, Rohit, Gani, Hanan, Naseer, Muzammal, Khan, Fahad Shahbaz, and Khan, Salman
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The recent developments in Large Multi-modal Video Models (Video-LMMs) have significantly enhanced our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which are critical to their deployment in practical scenarios, e.g., towards identifying deepfakes, manipulated video content, traffic accidents and crimes. In this paper, we introduce VANE-Bench, a benchmark designed to assess the proficiency of Video-LMMs in detecting and localizing anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models, encompassing a variety of subtle anomalies and inconsistencies grouped into five categories: unnatural transformations, unnatural appearance, pass-through, disappearance and sudden appearance. Additionally, our benchmark features real-world samples from existing anomaly detection datasets, focusing on crime-related irregularities, atypical pedestrian behavior, and unusual events. The task is structured as a visual question-answering challenge to gauge the models' ability to accurately detect and localize the anomalies within the videos. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies. In conclusion, our research offers significant insights into the current capabilities of Video-LMMs in the realm of anomaly detection, highlighting the importance of our work in evaluating and improving these models for real-world applications. Our code and data are available at https://hananshafi.github.io/vane-benchmark/, Comment: Data: https://huggingface.co/datasets/rohit901/VANE-Bench
- Published
- 2024
21. Towards Evaluating the Robustness of Visual State Space Models
- Author
Malik, Hashmat Shadab, Shamshad, Fahad, Naseer, Muzammal, Nandakumar, Karthik, Khan, Fahad Shahbaz, and Khan, Salman
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.
- Published
- 2024
22. On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models
- Author
Malik, Hashmat Shadab, Saeed, Numan, Hanif, Asif, Naseer, Muzammal, Yaqub, Mohammad, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Volumetric medical segmentation models have achieved significant success on organ and tumor-based segmentation tasks in recent years. However, their vulnerability to adversarial attacks remains largely unexplored, raising serious concerns regarding the real-world deployment of tools employing such models in the healthcare sector. This underscores the importance of investigating the robustness of existing models. In this context, our work aims to empirically examine the adversarial robustness of current volumetric segmentation architectures, encompassing Convolutional, Transformer, and Mamba-based models. We extend this investigation across four volumetric segmentation datasets, evaluating robustness under both white box and black box adversarial attacks. Overall, we observe that while both pixel and frequency-based attacks perform reasonably well under the \emph{white box} setting, the latter performs significantly better under transfer-based black box attacks. Across our experiments, we observe that transformer-based models show higher robustness than convolution-based models, with Mamba-based models being the most vulnerable. Additionally, we show that large-scale training of volumetric segmentation models improves the model's robustness against adversarial attacks. The code and robust models are available at https://github.com/HashmatShadab/Robustness-of-Volumetric-Medical-Segmentation-Models., Comment: Accepted at British Machine Vision Conference 2024
- Published
- 2024
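As a concrete reference point for the white-box, pixel-space attacks this study evaluates, a single-step FGSM against a segmentation network fits in a few lines. This is the textbook attack, sketched under our own assumptions about tensor shapes; the paper's exact attacks, budgets, and losses may differ.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One signed-gradient step; x: (B, C, D, H, W) volume, y: voxel labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # voxel-wise cross-entropy
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()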
23. Optimal k-centers of a graph: a control-theoretic approach
- Author
Shahbaz, Karim, Belur, Madhu N., Bhawal, Chayan, and Pal, Debasattam
- Subjects
Mathematics - Combinatorics, Electrical Engineering and Systems Science - Systems and Control
- Abstract
In a network consisting of n nodes, our goal is to identify the most central k nodes with respect to the proposed definitions of centrality. Depending on the specific application, there exist several metrics for quantifying k-centrality, and the subset of the best k nodes naturally varies with the chosen metric. In this paper, we propose two metrics and establish connections to a well-studied metric from the literature (specifically for stochastic matrices). We prove that these three notions match for path graphs. We then list a few more control-theoretic notions and compare these various notions on a general randomly generated graph. Our first metric involves maximizing the shift in the smallest eigenvalue of the Laplacian matrix. This shift can be interpreted as an improvement in the time constant when the RC circuit experiences leakage at certain k capacitors. The second metric focuses on minimizing the Perron root of a principal sub-matrix of a stochastic matrix, an idea proposed and interpreted in the literature as manufacturing consent. The third explores minimizing the Perron root of a perturbed (now super-stochastic) matrix, which can be seen as minimizing the impact of added stubbornness. It is important to emphasize that we consider applications (for example, facility location) where the notion of central ports is such that the set of the best k ports does not necessarily contain the set of the best k-1 ports. We apply our k-port selection metric to various network structures. Notably, we prove the equivalence of the three definitions for a path graph and extend the concept of central port linkage beyond Fiedler vectors to other eigenvectors associated with path graphs.
- Published
- 2024
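The first metric in this entry has a direct computational reading: add leakage at a candidate node set S and score S by the smallest eigenvalue of the resulting (grounded) Laplacian. The brute-force sketch below illustrates it on small graphs; the leakage conductance g and the exhaustive search are our simplifications, not the paper's method.

```python
import itertools
import numpy as np

def best_k_centers(A, k, g=1.0):
    """Pick the k nodes whose leakage maximizes lambda_min(L + g * E_S)."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian of adjacency A

    def lam_min(S):
        leak = np.zeros(n)
        leak[list(S)] = g                    # leakage at the chosen capacitors
        return np.linalg.eigvalsh(L + np.diag(leak))[0]

    return max(itertools.combinations(range(n), k), key=lam_min)
```

A larger smallest eigenvalue means faster decay of the leaky RC dynamics, which is the time-constant interpretation the abstract gives.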
24. Multi-Granularity Language-Guided Multi-Object Tracking
- Author
Li, Yuhao, Naseer, Muzammal, Cao, Jiale, Zhu, Yu, Sun, Jinqiu, Zhang, Yanning, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Most existing multi-object tracking methods typically learn visual tracking features by maximizing dissimilarities of different instances and minimizing similarities of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging, especially in the case of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby aiding in improving the robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are utilized to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2\% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at \url{https://github.com/WesLee88524/LG-MOT}.
- Published
- 2024
25. Engineering Semi-streaming DFS algorithms
- Author
Bhagavan, Kancharla Nikhilesh, Vardhan, Macharla Sri, Chowdary, Madamanchi Ashok, and Khan, Shahbaz
- Subjects
Computer Science - Data Structures and Algorithms
- Abstract
Depth first search (DFS) is a fundamental graph problem with a wide range of applications. For a graph $G=(V,E)$ having $n$ vertices and $m$ edges, the DFS tree can be computed in $O(m+n)$ time using $O(m)$ space, where $m=O(n^2)$. In the streaming environment, most graph problems are studied in the semi-streaming model, where several passes (preferably one) are allowed over the input and $O(nk)$ local space is available for some $k=o(n)$. Trivially, using $O(m)$ space, DFS can be computed in one pass, and using $O(n)$ space, it can be computed in $O(n)$ passes. Khan and Mehta [STACS19] presented several algorithms allowing trade-offs between space and passes, where $O(nk)$ space results in $O(n/k)$ passes. They also showed empirically that their algorithm requires only a few passes in practice, even for $O(n)$ space. Chang et al. [STACS20] presented an alternate proof for the same and also presented an $O(\sqrt{n})$-pass algorithm requiring $O(n~poly\log n)$ space with a finer trade-off between space and passes. However, their algorithm uses complex black-box algorithms, making it impractical. We perform an experimental analysis of the practical semi-streaming DFS algorithms. Our analysis ranges from real graphs to random graphs (uniform and power-law). We also present several heuristics to improve the state-of-the-art algorithms and study their impact. Our heuristics improve the state of the art by $40-90\%$, achieving the optimal one pass in almost $40-50\%$ of cases (improved from zero). On random graphs, they improve it by $30-90\%$, again requiring the optimal one pass for even very small values of $k$. Overall, our heuristics significantly improved the relatively complex state-of-the-art algorithm, requiring merely two passes in the worst case for random graphs. Additionally, our heuristics made the relatively simpler algorithm practically usable even for very small space bounds, which was impractical earlier.
- Published
- 2024
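The trivial $O(n)$-space, $O(n)$-pass baseline mentioned in this entry is easy to make concrete: each pass over the edge stream extends the current root-to-leaf DFS path from its deepest extendable vertex, implicitly backtracking past vertices with no unvisited neighbours (safe, since the visited set only grows, so such vertices stay childless). A sketch, with `edge_stream` a callable that replays the stream on each pass; all names are ours, and the STACS19/STACS20 trade-off algorithms are far more involved:

```python
def streaming_dfs(n, edge_stream, root=0):
    """DFS tree (parent pointers) of root's component in <= n stream passes."""
    parent = {root: None}
    stack = [root]                        # current root-to-leaf DFS path
    while stack:
        depth = {u: d for d, u in enumerate(stack)}
        best = None                       # deepest (path vertex, unvisited nbr)
        for u, v in edge_stream():        # one full pass, O(n) working memory
            for a, b in ((u, v), (v, u)):
                if a in depth and b not in parent and \
                        (best is None or depth[a] > depth[best[0]]):
                    best = (a, b)
        if best is None:                  # no path vertex extendable: done
            break
        u, v = best
        del stack[depth[u] + 1:]          # backtrack past exhausted vertices
        stack.append(v)
        parent[v] = u
    return parent

# Usage: edges = [(0, 1), (1, 2), (0, 3)]
# tree = streaming_dfs(4, lambda: iter(edges))
```

Each pass adds exactly one vertex to the tree, giving the $O(n)$ pass bound; the paper's heuristics aim to do far better than one vertex per pass in practice.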
26. Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
- Author
Boudjoghra, Mohamed El Amine, Dai, Angela, Lahoud, Jean, Cholakkal, Hisham, Anwer, Rao Muhammad, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.
- Published
- 2024
27. Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging
- Author
Dong, Jiahua, Yin, Hui, Li, Hongliu, Li, Wenbo, Zhang, Yulun, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Deep unfolding methods have made impressive progress in restoring 3D hyperspectral images (HSIs) from 2D measurements through convolution neural networks or Transformers in spectral compressive imaging. However, they cannot efficiently capture long-range dependencies using global receptive fields, which significantly limits their performance in HSI reconstruction. Moreover, these methods may suffer from local context neglect if we directly utilize Mamba to unfold a 2D feature map as a 1D sequence for modeling global long-range dependencies. To address these challenges, we propose a novel Dual Hyperspectral Mamba (DHM) to explore both global long-range dependencies and local contexts for efficient HSI reconstruction. After learning informative parameters to estimate degradation patterns of the CASSI system, we use them to scale the linear projection and offer noise level for the denoiser (i.e., our proposed DHM). Specifically, our DHM consists of multiple dual hyperspectral S4 blocks (DHSBs) to restore original HSIs. Particularly, each DHSB contains a global hyperspectral S4 block (GHSB) to model long-range dependencies across the entire high-resolution HSIs using global receptive fields, and a local hyperspectral S4 block (LHSB) to address local context neglect by establishing structured state-space sequence (S4) models within local windows. Experiments verify the benefits of our DHM for HSI reconstruction. The source codes and models will be available at https://github.com/JiahuaDong/DHM., Comment: 13 pages, 6 figures
- Published
- 2024
28. Dynamic FMR and magneto-optical response of hydrogenated FCC phase Fe25Pd75 thin films and micro patterned devices
- Author
Khan, Shahbaz, Sarkar, Satyajit, Lawler, Nicolas B., Akbar, Ali, Anwar, Muhammad Sabieh, Martyniuk, Mariusz, Iyer, K. Swaminathan, and Kostylev, Mikhail
- Subjects
Condensed Matter - Materials Science, Physics - Applied Physics
- Abstract
In this work, we investigate the effects of H2 on the physical properties of Fe25Pd75. Broadband ferromagnetic resonance (FMR) spectroscopy revealed a significant FMR peak shift induced by H2 absorption for the FCC-phase Fe25Pd75. The peak shifted towards higher applied fields, which is contrary to what was previously observed for CoPd alloys. Additionally, we conducted structural and magneto-optical Kerr ellipsometric studies on the Fe25Pd75 film and performed density functional theory calculations to explore the electronic and magnetic properties in both the hydrogenated and dehydrogenated states. In the final part of this study, we deposited a Fe25Pd75 layer on top of a microscopic coplanar transmission line and investigated the FMR response of the layer while driven by a microwave current in the coplanar line. We observed a large-amplitude FMR response upon hydrogen absorption and desorption when cycling between pure N2 and a mixture of 3% H2 + 97% N2.
- Published
- 2024
29. How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
- Author
Khattak, Muhammad Uzair, Naeem, Muhammad Ferjad, Hassan, Jameel, Naseer, Muzammal, Tombari, Federico, Khan, Fahad Shahbaz, and Khan, Salman
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/., Comment: Technical report
- Published
- 2024
30. Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
- Author
Hou, Wenjin, Chen, Shiming, Chen, Shuhuang, Hong, Ziming, Wang, Yan, Feng, Xuetao, Khan, Salman, Khan, Fahad Shahbaz, and You, Xinge
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Generative Zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods rely on the conditions of Gaussian noise and the predefined semantic prototype, which limit the generator to being optimized only on specific seen classes rather than characterizing each visual instance, resulting in poor generalization (\textit{e.g.}, overfitting to seen classes). To address this issue, we propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) to boost the generator to learn accurate semantic-visual mapping by fully exploiting visual-augmented knowledge in the semantic conditions. In detail, VADS consists of two modules: (1) a Visual-aware Domain Knowledge Learning module (VDKL) that learns the local bias and global prior of the visual features (referred to as domain visual knowledge), which replace pure Gaussian noise to provide richer prior noise information; (2) a Vision-Oriented Semantic Updation module (VOSU) that updates the semantic prototype according to the visual representations of the samples. Ultimately, we concatenate their output as a dynamic semantic prototype, which serves as the condition of the generator. Extensive experiments demonstrate that our VADS achieves superior CZSL and GZSL performance on three prominent datasets and outperforms other state-of-the-art methods with average gains of 6.4\%, 5.9\% and 4.2\% on SUN, CUB and AWA2, respectively.
- Published
- 2024
31. Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels
- Author
Dharmasiri, Amaya, Naseer, Muzammal, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework to simultaneously process 2D views and 3D point clouds and generate joint pseudo labels to train a classifier and guide cross-modal feature alignment. Thereby we demonstrate that 2D vision-language models such as CLIP can be used to complement 3D representation learning to improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with both image and point cloud modalities learning from each other's rich representations., Comment: To be published in Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024
- Published
- 2024
32. Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Author
Chen, Shiming, Hou, Wenjin, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT), which fails to learn matched visual-semantic correspondences for representing semantic-related visual features, as it lacks the guidance of semantic information, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. The extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Codes are available at: https://github.com/shiming-chen/ZSLViT., Comment: Accepted to CVPR'24
- Published
- 2024
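One plausible reading of the semantic-guided token attention and low-correspondence token fusion in entry 32, sketched below; the dot-product relevance score, the keep ratio, and mean-pooling the discarded tokens into a single token are guesses at the mechanism, not the published design.

import torch

def semantic_guided_tokens(tokens, semantic, keep_ratio=0.7):
    # tokens: (B, N, D) visual tokens; semantic: (B, D) semantic embedding.
    scores = torch.einsum('bnd,bd->bn', tokens, semantic)  # semantic relevance
    k = int(tokens.shape[1] * keep_ratio)
    idx = scores.topk(k, dim=1).indices
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    # Fuse the low-correspondence tokens into one token instead of dropping them.
    low = torch.ones_like(scores, dtype=torch.bool).scatter(1, idx, False)
    fused = (tokens * low.unsqueeze(-1)).sum(1, keepdim=True)
    fused = fused / low.sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)
    return torch.cat([keep, fused], dim=1)   # (B, k+1, D)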
33. Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration
- Author
-
Dudhane, Akshay, Thawakar, Omkar, Zamir, Syed Waqas, Khan, Salman, Khan, Fahad Shahbaz, and Yang, Ming-Hsuan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation. The requirement to tackle multiple degradations with the same model can lead to high-complexity designs with a fixed configuration that lack adaptability to more efficient alternatives. We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks. Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment with a single round of training. This seamless switching is enabled by our weight-sharing mechanism, which forms the core of our architecture and facilitates the reuse of initialized module weights. Further, to establish robust weight initialization, we introduce a dynamic pre-training strategy that trains the variants of the proposed DyNet concurrently, thereby achieving a 50% reduction in GPU hours. To tackle the unavailability of the large-scale datasets required for pre-training, we curate a high-quality, high-resolution image dataset named Million-IRD, containing 2M image samples. We validate our DyNet for image denoising, deraining, and dehazing in the all-in-one setting, achieving state-of-the-art results with a 31.34% reduction in GFlops and a 56.75% reduction in parameters compared to baseline models. The source code and trained models are available at https://github.com/akshaydudhane16/DyNet.
- Published
- 2024
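The weight-sharing behind DyNet's variant switching (entry 33 above) can be pictured as a single pool of initialized blocks traversed to different depths, so one round of training serves both variants. Depth truncation is only one way to realize the sharing; the sketch is illustrative, not the paper's architecture.

import torch.nn as nn

class SharedBlockStack(nn.Module):
    def __init__(self, dim=64, max_depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
            for _ in range(max_depth))

    def forward(self, x, depth):
        # depth=max_depth -> bulkier variant; smaller depth -> lightweight
        # variant. Both paths reuse the same underlying weights.
        for block in self.blocks[:depth]:
            x = x + block(x)    # residual refinement
        return x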
34. Language Guided Domain Generalized Medical Image Segmentation
- Author
-
Kunhimon, Shahina, Naseer, Muzammal, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Single-source domain generalization (SDG) holds promise for more reliable and consistent image segmentation across real-world clinical settings, particularly in the medical domain, where data privacy and acquisition cost constraints often limit the availability of diverse datasets. Depending solely on visual features hampers the model's capacity to adapt effectively to various domains, primarily because of the spurious correlations and domain-specific characteristics embedded within the image features. Incorporating text features alongside visual features is a potential solution to enhance the model's understanding of the data, as it goes beyond pixel-level information to provide valuable context. Textual cues describing the anatomical structures, their appearances, and their variations across imaging modalities can guide the model in domain adaptation, ultimately contributing to more robust and consistent segmentation. In this paper, we propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism, guided by the text encoder features, to learn a more robust feature representation. We assess the effectiveness of our text-guided contrastive feature alignment technique in various scenarios, including cross-modality, cross-sequence, and cross-site settings for different segmentation tasks. Our approach achieves favorable performance against existing methods in the literature. Our code and model weights are available at https://github.com/ShahinaKK/LG_SDG.git., Comment: Accepted at ISBI2024
- Published
- 2024
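The text-guided contrastive alignment in entry 34 can take the generic form of a symmetric InfoNCE loss between segmentation-encoder features and frozen text-encoder features; the temperature and the symmetric formulation below are standard choices, not details taken from the paper.

import torch
import torch.nn.functional as F

def text_guided_contrastive(img_feats, txt_feats, temperature=0.07):
    # img_feats: (B, D) image encoder features; txt_feats: (B, D) features of
    # the matching textual descriptions from a frozen text encoder.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.shape[0], device=img.device)
    # Pull each image toward its own description and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2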
35. Particulate and gaseous air pollutants exceed WHO guideline values and have the potential to damage human health in Faisalabad, Metropolitan, Pakistan.
- Author
-
Zeeshan, Nukshab, Murtaza, Ghulam, Ahmad, Hamaad, Awan, Abdul, Shahbaz, Muhammad, and Freer-Smith, Peter
- Subjects
Air quality index ,CO ,Heavy metals ,Human health ,NO2 ,Particulates ,SO2 ,Pakistan ,Humans ,Air Pollutants ,Environmental Monitoring ,Particulate Matter ,Air Pollution ,Seasons ,World Health Organization ,Sulfur Dioxide ,Cities ,Nitrogen Dioxide ,Environmental Exposure ,Carbon Monoxide - Abstract
First-ever measurements of particulate matter (PM2.5, PM10, and TSP) along with gaseous pollutants (CO, NO2, and SO2) were performed from June 2019 to April 2020 in Faisalabad, Metropolitan, Pakistan, to assess their seasonal variations: Summer 2019, Autumn 2019, Winter 2019-2020, and Spring 2020. Pollutant measurements were carried out at 30 locations spaced at 3-km intervals from the Sitara Chemical Industry in District Faisalabad to Bhianwala, Sargodha Road, Tehsil Lalian, District Chiniot. ArcGIS 10.8 was used to interpolate pollutant concentrations using the inverse distance weighting method. PM2.5, PM10, and TSP concentrations were highest in summer and lowest in autumn or winter. CO, NO2, and SO2 concentrations were highest in summer or spring and lowest in winter. Seasonal average NO2 and SO2 concentrations exceeded the WHO annual air quality guideline values. For all four seasons, some sites had better air quality than others; even at these cleaner sites the air quality index (AQI) was unhealthy for sensitive groups, and the worst sites showed very critical AQI (> 500). Dust-bound carbon content was higher in spring (64 mg g-1) and lower in autumn (55 mg g-1), while sulfur content was higher in summer (1.17 mg g-1) and lower in winter (1.08 mg g-1). Venous blood analysis of 20 individuals showed cadmium and lead concentrations higher than WHO permissible limits; individuals exposed to direct roadside pollution for longer periods because of their occupation tended to show higher blood Pb and Cd concentrations. It is concluded that air quality along the roadside is extremely poor and potentially damaging to the health of exposed workers.
- Published
- 2024
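The spatial maps in entry 35 were interpolated with inverse distance weighting; for reference, the estimator in its bare-bones form is below (the power p=2 is the common default, and the station coordinates and values are invented for illustration).

import numpy as np

def idw(points, values, query, p=2.0):
    # Estimate a pollutant value at `query` from measurements at `points`.
    d = np.linalg.norm(points - query, axis=1)
    if np.any(d == 0):               # query coincides with a station
        return float(values[np.argmin(d)])
    w = 1.0 / d**p                   # nearer stations dominate
    return float(np.sum(w * values) / np.sum(w))

stations = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # grid points (km)
pm25 = np.array([180.0, 95.0, 120.0])                      # measured PM2.5
print(idw(stations, pm25, np.array([1.0, 1.0])))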
36. Efficient Video Object Segmentation via Modulated Cross-Attention Memory
- Author
-
Shaker, Abdelrahman, Wasim, Syed Talal, Danelljan, Martin, Khan, Salman, Yang, Ming-Hsuan, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions, leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x while significantly reducing the GPU memory by 87%, with comparable segmentation performance on short and long video datasets. Notably, on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS., Comment: WACV 2025
- Published
- 2024
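The property claimed for the MCA memory in entry 36 is that each frame attends to a memory of fixed size instead of an ever-growing bank. The sketch below shows only that constant-size cross-attention read, with illustrative dimensions; MAVOS's actual modulation and memory-update scheme is not reproduced here.

import torch
import torch.nn as nn

class CrossAttentionMemory(nn.Module):
    def __init__(self, dim=256, mem_slots=64, heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, mem_slots, dim))  # fixed size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (B, N, D) features of the current frame. Cost per frame
        # is O(N * mem_slots) regardless of how long the video is.
        mem = self.memory.expand(frame_tokens.shape[0], -1, -1)
        out, _ = self.attn(query=frame_tokens, key=mem, value=mem)
        return frame_tokens + out   # memory-conditioned frame features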
37. ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
- Author
-
Noman, Mubashir, Fiaz, Mustansar, Cholakkal, Hisham, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformer-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in the intersection over union metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance. Project url: https://github.com/techmn/elgcnet., Comment: accepted at IEEE TGRS
- Published
- 2024
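My reading of the pooled-transpose (PT) attention in entry 37, sketched below: pooling first shrinks the token set for robust feature extraction, then attention is computed over channels (a C x C affinity) rather than over positions, so the cost does not grow quadratically with image resolution. The pool size and exact layout are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PTAttention(nn.Module):
    def __init__(self, dim=64, pooled=8):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.pooled = pooled

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = F.adaptive_avg_pool2d(x, self.pooled)           # pooling step
        t = t.flatten(2).transpose(1, 2)                    # (B, P*P, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        # Transposed attention: channel-to-channel affinities, not token-to-token.
        attn = torch.softmax(q.transpose(1, 2) @ k / (q.shape[1] ** 0.5), dim=-1)
        out = (v @ attn.transpose(1, 2)).transpose(1, 2)    # (B, C, P*P)
        return out.reshape(B, C, self.pooled, self.pooled)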
38. Composed Video Retrieval via Enriched Context and Discriminative Embeddings
- Author
-
Thawakar, Omkar, Naseer, Muzammal, Anwer, Rao Muhammad, Khan, Salman, Felsberg, Michael, Shah, Mubarak, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Composed video retrieval (CoVR) is a challenging problem in computer vision that integrates modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos, and it represents the target video using a visual embedding only. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information, and learns discriminative embeddings of vision-only, text-only and vision-text for better alignment, so as to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at \url{https://github.com/OmkarThawakar/composed-video-retrieval}, Comment: CVPR-2024
- Published
- 2024
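For reference, the recall@K=1 score reported in entry 38 is conventionally computed as below; this is a generic retrieval-evaluation sketch, not the authors' code.

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, target_emb, k=1):
    # query_emb: (N, D) fused query embeddings (visual query + modification
    # text); target_emb: (N, D) embeddings of the ground-truth target videos.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(target_emb, dim=-1).t()
    topk = sims.topk(k, dim=1).indices
    hits = (topk == torch.arange(sims.shape[0], device=sims.device).unsqueeze(1))
    return hits.any(dim=1).float().mean().item()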
39. VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
- Author
-
Mahmood, Ahmad, Vayani, Ashmal, Naseer, Muzammal, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach that extends the utility of LLMs to video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. By presenting LLMs with pairs of instructions and their corresponding high-level programs, we harness their contextual learning capabilities to generate executable visual programs for video understanding. To enhance the programs' accuracy and robustness, we implement two important strategies. Firstly, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs that use unsupported functions. Secondly, motivated by recent works on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation and multi-video QA, illustrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.
- Published
- 2024
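The feedback-driven self-refinement in entry 39 boils down to a generate-execute-repair loop; schematically (llm and execute_program below are hypothetical stand-ins, not a real API):

def refine_program(instruction, llm, execute_program, max_rounds=3):
    program = llm(f"Write a video-understanding program for: {instruction}")
    for _ in range(max_rounds):
        ok, error = execute_program(program)
        if ok:                # program ran using only supported functions
            return program
        # Feedback generation: ask the model to repair the failing program.
        program = llm(f"Fix this program.\nError: {error}\nProgram:\n{program}")
    return program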
40. AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
- Author
-
Cui, Yuning, Zamir, Syed Waqas, Khan, Salman, Knoll, Alois, Shah, Mubarak, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model, without requiring prior information about the input degradation type. However, these methods operate purely in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR., Comment: 28 pages, 15 figures
- Published
- 2024
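The frequency mining in entry 40 rests on separating low- from high-frequency image content; one standard way to perform such a split, using a hard FFT low-pass mask, is shown below (the fixed cutoff is an assumption, whereas the paper decouples the spectra adaptively).

import torch

def split_frequencies(img, cutoff=0.1):
    # img: (B, C, H, W). Build a low-pass mask of relative radius `cutoff`.
    B, C, H, W = img.shape
    fy = torch.fft.fftfreq(H).abs().view(H, 1)
    fx = torch.fft.fftfreq(W).abs().view(1, W)
    low_mask = ((fy**2 + fx**2).sqrt() <= cutoff).to(img.dtype)
    spec = torch.fft.fft2(img)
    low = torch.fft.ifft2(spec * low_mask).real   # smooth content (e.g., haze)
    high = img - low                              # edges, rain streaks, noise
    return low, high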
41. Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning
- Author
-
Watawana, Hasindri, Ranasinghe, Kanchana, Mahmood, Tariq, Naseer, Muzammal, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Self-supervised representation learning has been highly promising for histopathology image analysis, with numerous approaches leveraging the patient-slide-patch hierarchy to learn better representations. In this paper, we explore how combining domain-specific natural language information with such hierarchical visual representations can benefit rich representation learning for medical image tasks. Building on automated language description generation for features visible in histopathology images, we present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS), for histopathology images. We explore contrastive objectives and granular language-description-based text alignment at multiple hierarchies to inject language modality information into the visual representations. Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets. Our framework also provides better interpretability through its language-aligned representation space. Code is available at https://github.com/Hasindri/HLSS., Comment: 13 pages and 5 figures
- Published
- 2024
42. Optical properties of Y dwarfs observed with the Gran Telescopio Canarias
- Author
-
Martín, Eduardo L., Zhang, Jerry J. -Y., Lanchas, Honorio, Lodieu, Nicolas, Shahbaz, Tarik, and Pavlenko, Yakiv V.
- Subjects
Astrophysics - Solar and Stellar Astrophysics ,Astrophysics - Earth and Planetary Astrophysics - Abstract
Observations of five Y dwarfs with three optical and near-infrared instruments at the 10.4 m Gran Telescopio Canarias are reported. Deep images of the five targets and a low-resolution far-red optical spectrum for one of the targets were obtained. One of the Y dwarfs, WISE J173835+273258 (Y0), was clearly detected in the optical (z- and i-bands) and another, WISE J182831+265037 (Y2), was detected only in the z-band. We measured the colours of our targets and found that the z-J and i-z colours of the Y dwarfs are bluer than those of mid- and late-T dwarfs. This optical blueing has been predicted by models, but our data indicate that it is sharper and happens at temperatures about 150 K warmer than expected. Likely, the culprit is the K I resonance doublet, which weakens more abruptly in the T- to Y-type transition than expected. We show that the alkali resonance lines (Cs I and K I) are weaker in Y dwarfs than in T dwarfs; the far-red optical spectrum of WISE J173835+273258 is similar to that of late-T dwarfs, but with stronger methane and water features; and we noted the appearance of new absorption features that we propose could be due to hydrogen sulphide. The optical properties of Y dwarfs presented here pose new challenges to understanding grain sedimentation in extremely cool objects. The weakening of the very broad K I resonance doublet due to condensation in dust grains is more abrupt than theoretically anticipated. Consequently, the observed blueing of the z-J and i-z colours of Y dwarfs with respect to T dwarfs is more pronounced than predicted by models and could boost the ability of upcoming deep large-area optical surveys to detect extremely cool objects., Comment: accepted for publication in Astronomy and Astrophysics
- Published
- 2024
43. Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
- Author
-
Noman, Mubashir, Naseer, Muzammal, Cholakkal, Hisham, Anwer, Rao Muhammad, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data. Different from standard natural image datasets, remote sensing data are acquired from various sensor technologies and exhibit a diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in the remote sensing imagery or restrict themselves to using only a single type of data modality. In this paper, we re-visit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities. Our proposed approach, named SatMAE++, performs multi-scale pre-training and utilizes convolution-based upsampling blocks to reconstruct the image at higher scales, making it extensible to include more scales. Compared to existing works, the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery. Extensive experiments on six datasets reveal the merits of the proposed contributions, leading to state-of-the-art performance on all datasets. SatMAE++ achieves a mean average precision (mAP) gain of 2.5\% for the multi-label classification task on the BigEarthNet dataset. Our code and pre-trained models are available at \url{https://github.com/techmn/satmae_pp}., Comment: Accepted at CVPR 2024
- Published
- 2024
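Entry 43 attributes the higher-scale reconstructions to convolution-based upsampling blocks; one plausible form of such a block is below (the transposed-convolution choice and channel handling are illustrative assumptions). Stacking these blocks lets the decoder emit reconstructions at 2x, 4x, and beyond, which is what makes the objective extensible to more scales.

import torch.nn as nn

class UpsampleBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),  # 2x
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GELU())

    def forward(self, x):    # x: (B, in_ch, H, W) -> (B, out_ch, 2H, 2W)
        return self.up(x)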
44. ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes
- Author
-
Malik, Hashmat Shadab, Huzaifa, Muhammad, Naseer, Muzammal, Khan, Salman, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either lack control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks. Code: https://github.com/Muhammad-Huzaifaa/ObjectCompose.git
- Published
- 2024
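Background changes of the kind described in entry 44 can be approximated with an off-the-shelf inpainting diffusion model: keep the object pixels fixed and regenerate only the background from a text prompt. A sketch with the diffusers library follows; the model id, file names, and white-on-object mask convention are assumptions, and the paper's full pipeline additionally optimizes latents and text embeddings for the adversarial variants.

import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageOps

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16).to("cuda")

image = Image.open("object.png").convert("RGB").resize((512, 512))
object_mask = Image.open("object_mask.png").convert("L").resize((512, 512))
background_mask = ImageOps.invert(object_mask)  # white = region to repaint

result = pipe(prompt="a snowy mountain road at dusk",
              image=image, mask_image=background_mask).images[0]
result.save("object_new_background.png")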
45. Effectiveness Assessment of Recent Large Vision-Language Models
- Author
-
Jiang, Yao, Yan, Xinyu, Ji, Ge-Peng, Fu, Keren, Sun, Meijun, Xiong, Huan, Fan, Deng-Ping, and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the effectiveness of these models in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications., Comment: Accepted by Visual Intelligence
- Published
- 2024
46. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
- Author
-
Thawakar, Omkar, Vayani, Ashmal, Khan, Salman, Cholakkal, Hisham, Anwer, Rao M., Felsberg, Michael, Baldwin, Tim, Xing, Eric P., and Khan, Fahad Shahbaz
- Subjects
Computer Science - Computation and Language - Abstract
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is a SLM design that initiates from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives to not only bridge the gap in open-source SLMs but also ensures full transparency, where complete training data pipeline, training code, model weights, and over 300 checkpoints along with evaluation codes is available at : https://github.com/mbzuai-oryx/MobiLlama., Comment: Code available at : https://github.com/mbzuai-oryx/MobiLlama
- Published
- 2024
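Entry 46 credits MobiLlama's cost reduction to a careful parameter-sharing scheme. One concrete form of such sharing, a single FFN reused by every decoder block so that its parameters are counted only once, is sketched below; layer sizes are illustrative and this should be read as an instance of the idea, not the paper's exact design.

import torch.nn as nn

class SharedFFNDecoder(nn.Module):
    def __init__(self, dim=1024, layers=22, heads=16):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers))
        self.norm1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))
        self.norm2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))
        # One FFN instance shared by all blocks: a large slice of the usual
        # per-layer parameters disappears from the count.
        self.shared_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, attn_mask=None):
        for attn, n1, n2 in zip(self.attn, self.norm1, self.norm2):
            h = n1(x)
            x = x + attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
            x = x + self.shared_ffn(n2(x))
        return x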
47. Semi-supervised Open-World Object Detection
- Author
-
Mullappilly, Sahal Shaji, Gehlot, Abhishek Singh, Anwer, Rao Muhammad, Khan, Fahad Shahbaz, and Cholakkal, Hisham
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then incrementally learns the unknown objects when they are introduced with labels in subsequent tasks. However, the current OWOD formulation relies heavily on an external human oracle for knowledge input during the incremental learning stages. Such run-time reliance makes the formulation less realistic for real-world deployment. To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner. We demonstrate that the performance of the state-of-the-art OWOD detector deteriorates dramatically in the proposed SS-OWOD setting. Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images, leveraging the large amount of unlabeled data alongside the few labeled samples. We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information. We demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations. Our experiments on 4 datasets, including MS COCO, PASCAL, Objects365 and DOTA, demonstrate the effectiveness of our approach. Our source code, models and splits are available here - https://github.com/sahalshajim/SS-OWFormer, Comment: Accepted to AAAI 2024 (Main Track)
- Published
- 2024
48. Practical algorithms for Hierarchical overlap graphs
- Author
-
Talera, Saumya, Bansal, Parth, Khan, Shabnam, and Khan, Shahbaz
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
Genome assembly is a prominent problem studied in bioinformatics, which computes the source string using a set of its overlapping substrings. Classically, genome assembly uses assembly graphs built from this set of substrings to compute the source string efficiently, with a tradeoff between scalability and avoiding information loss. The scalable de Bruijn graphs come at the price of losing crucial overlap information, while the complete overlap information is stored in overlap graphs using quadratic space. Hierarchical overlap graphs (HOG) [IPL20] overcome these limitations, avoiding information loss despite using linear space. After a series of suboptimal improvements, Khan and Park et al. simultaneously presented two optimal algorithms [CPM2021], of which only the former was seemingly practical. We empirically analyze all the practical algorithms for computing HOG on real and random datasets, where the optimal algorithm [CPM2021] outperforms the previous algorithms as expected, though at the expense of extra memory. However, it uses a non-intuitive approach and non-trivial data structures. We present arguably the most intuitive algorithm, using only elementary arrays, which is also optimal. Our algorithm empirically outperforms all the other algorithms in both time and memory, highlighting its significance in both theory and practice. We further explore the applications of hierarchical overlap graphs in solving various forms of suffix-prefix queries on a set of strings. Loukides et al. [CPM2023] recently presented state-of-the-art algorithms for these queries. However, those algorithms require complex black-box data structures and are seemingly impractical. Our algorithms, despite not matching the state-of-the-art algorithms theoretically, answer the different queries in 0.01-100 milliseconds on datasets of around a billion characters.
- Published
- 2024
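The object underlying entry 48 is the set of maximal suffix-prefix overlaps among the input strings, which the HOG stores in linear space. A brute-force reference (quadratic per pair, so nothing like the optimal algorithms discussed, but useful to fix the definition):

def max_overlap(s, t):
    # Length of the longest suffix of s that is also a prefix of t.
    for k in range(min(len(s), len(t)), 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

reads = ["AGGTC", "GTCAA", "CAAGG"]
for s in reads:
    for t in reads:
        if s != t:
            print(s, "->", t, max_overlap(s, t))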
49. BiMediX: Bilingual Medical Mixture of Experts LLM
- Author
-
Pieri, Sara, Mullappilly, Sahal Shaji, Khan, Fahad Shahbaz, Anwer, Rao Muhammad, Khan, Salman, Baldwin, Timothy, and Cholakkal, Hisham
- Subjects
Computer Science - Computation and Language - Abstract
In this paper, we introduce BiMediX, the first bilingual medical mixture-of-experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set covering 1.3 million diverse medical interactions, resulting in over 632 million healthcare-specialized tokens for instruction tuning. Our BiMed1.3M dataset includes 250k synthesized multi-turn doctor-patient chats and maintains a 1:2 Arabic-to-English ratio. Our model outperforms the state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%, respectively, computed across multiple medical evaluation benchmarks in English, while operating at 8 times faster inference. Moreover, our BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10% on our Arabic medical benchmark and 15% on bilingual evaluations across multiple datasets. Our project page with source code and trained model is available at https://github.com/mbzuai-oryx/BiMediX .
- Published
- 2024
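BiMediX (entry 49) is a mixture-of-experts LLM; the routing step at the core of any such architecture, in its simplest top-k gated form, is sketched below (a textbook illustration, not the authors' router or expert configuration).

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                         # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)
        topv, topi = gates.topk(self.k, dim=-1)   # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e
                if sel.any():                     # gate-weighted expert outputs
                    out[sel] += topv[sel, slot, None] * expert(x[sel])
        return out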
50. TL;DR Progress: Multi-faceted Literature Exploration in Text Summarization
- Author
-
Syed, Shahbaz, Al-Khatib, Khalid, and Potthast, Martin
- Subjects
Computer Science - Computation and Language - Abstract
This paper presents TL;DR Progress, a new tool for exploring the literature on neural text summarization. It organizes 514 papers based on a comprehensive annotation scheme for text summarization approaches and enables fine-grained, faceted search. Each paper was manually annotated to capture aspects such as evaluation metrics, quality dimensions, learning paradigms, challenges addressed, datasets, and document domains. In addition, a succinct indicative summary is provided for each paper, consisting of automatically extracted contextual factors, issues, and proposed solutions. The tool is available online at https://www.tldr-progress.de, with a demo video at https://youtu.be/uCVRGFvXUj8, Comment: EACL 2024 System Demonstration
- Published
- 2024