The ability to understand visual concepts and to replicate and compose them from images is a central goal of computer vision. Recent advances in text-to-image (T2I) models have led to the generation of high-definition, realistic images by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited, qualitative measures of visual understanding. To quantify the ability of T2I models to learn and synthesize novel visual concepts, we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts, 5K unique concept compositions, and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I models and concepts contained in ground-truth images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning concepts and preserving compositionality that existing approaches struggle to overcome.
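Concept Confidence Deviation (CCD) can be pictured as the drop in an oracle concept classifier's confidence when it scores generated images instead of ground-truth images of the same concept. The sketch below is a minimal, hedged illustration of that idea in PyTorch; the `oracle` model, tensor shapes, and function names are assumptions made for exposition, not the released evaluation code.

```python
import torch
import torch.nn.functional as F

def concept_confidence(oracle: torch.nn.Module,
                       images: torch.Tensor,
                       concept_id: int) -> torch.Tensor:
    """Mean softmax confidence the oracle assigns to the target concept."""
    with torch.no_grad():
        logits = oracle(images)                           # (N, num_concepts)
        probs = F.softmax(logits, dim=-1)[:, concept_id]  # confidence per image
    return probs.mean()

def concept_confidence_deviation(oracle, real_images, generated_images, concept_id) -> float:
    """Deviation of oracle confidence on generated vs. ground-truth images.
    Values near zero suggest the generator preserved the concept."""
    conf_real = concept_confidence(oracle, real_images, concept_id)
    conf_gen = concept_confidence(oracle, generated_images, concept_id)
    return (conf_real - conf_gen).item()
```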
@inproceedings{patel2023conceptbed,
  title={ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models},
  author={Patel, Maitreya and Gokhale, Tejas and Baral, Chitta and Yang, Yezhou},
  booktitle={arXiv (pre-print)},
  year={2023},
  url={https://conceptbed.github.io/},
  demo={https://huggingface.co/spaces/mpatel57/ConceptBed},
}
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Traditional fake detection mechanisms, although providing some mitigation, fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user’s unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. We rigorously scrutinize our method’s secrecy under two distinct scenarios: one where a malicious user attempts to detect the fingerprint, and another where a user possesses a comprehensive understanding of our method. We also evaluate the robustness of our approach against various image post-processing manipulations typically executed by end-users. Through extensive evaluation of the Stable Diffusion models, our method presents a promising and novel avenue for accountable model distribution and responsible use.
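The core mechanism, modulating generator weights with a user-specific fingerprint, can be sketched as follows: a fingerprint vector is mapped to per-channel scales that multiply a convolution layer's weights, so every image produced under that fingerprint carries an identifier that attribution can later recover, as the abstract above describes. The layer shapes, mapping network, and class names below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FingerprintModulatedConv(nn.Module):
    """Convolution whose weights are scaled by a user fingerprint (sketch)."""

    def __init__(self, in_ch: int, out_ch: int, fingerprint_dim: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # Maps the fingerprint to one multiplicative scale per input channel.
        self.mapper = nn.Linear(fingerprint_dim, in_ch)

    def forward(self, x: torch.Tensor, fingerprint: torch.Tensor) -> torch.Tensor:
        # One fingerprint (one user) per forward pass in this simplified sketch.
        scale = 1.0 + self.mapper(fingerprint).view(1, -1, 1, 1)  # (1, in_ch, 1, 1)
        w = self.conv.weight * scale        # broadcasts over (out_ch, in_ch, k, k)
        return F.conv2d(x, w, self.conv.bias, padding=self.conv.padding)
```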
@inproceedings{kim2023wouaf,
  title={WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models},
  author={Kim*, Changhoon and Min*, Kyle and Patel, Maitreya and Cheng, Sheng and Yang, Yezhou},
  booktitle={arXiv (pre-print)},
  year={2023},
  url={https://wouaf.vercel.app/},
  demo={https://huggingface.co/spaces/mpatel57/WOUAF-Text-to-Image},
}
2022
CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering
Videos often capture objects, their motion, and the interactions between different objects. Although real-world objects have physical properties associated with them, many of these properties (such as mass and coefficient of friction) are not captured directly by the imaging pipeline. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce a new video question answering task for reasoning about the implicit physical properties of objects in a scene, from videos. For this task, we introduce a dataset – CRIPP-VQA, which contains videos of objects in motion, annotated with hypothetical/counterfactual questions about the effect of actions (such as removing, adding, or replacing objects), questions about planning (choosing actions to perform in order to reach a particular goal), as well as descriptive questions about the visible properties of objects. We benchmark the performance of existing video question answering models on two test settings of CRIPP-VQA: i.i.d. and an out-of-distribution setting which contains objects with values of mass, coefficient of friction, and initial velocities that are not seen in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties (the focus of prior work) of objects.
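To give a feel for the annotation structure described above, here is a hypothetical instance with one question of each type; the field names and values are invented for illustration and do not reflect the released dataset schema.

```python
# Hypothetical CRIPP-VQA-style record: a simulated video paired with
# descriptive, counterfactual, and planning questions (illustrative only).
example = {
    "video": "collision_scene_0042.mp4",
    "split": "iid",  # the OOD split uses mass / friction / velocity values unseen in training
    "questions": [
        {"type": "descriptive",
         "question": "How many objects are moving when the video starts?",
         "answer": "2"},
        {"type": "counterfactual",
         "question": "If the red sphere were removed, would the green cube be hit?",
         "answer": "no"},
        {"type": "planning",
         "question": "Which object should be removed so that the blue cylinder stays at rest?",
         "answer": "red sphere"},
    ],
}
```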
@inproceedings{patel2022cripp,
  title={{CRIPP-VQA}: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering},
  author={Patel, Maitreya and Gokhale, Tejas and Baral, Chitta and Yang, Yezhou},
  booktitle={EMNLP, Main Conference},
  year={2022},
  url={https://maitreyapatel.com/CRIPP-VQA/},
}
Benchmarking generalization via in-context instructions on 1,600+ language tasks
How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress in this goal, we introduce Natural-Instructions v2, a benchmark of 1,600+ diverse language tasks and their expert-written instructions. It covers 70+ distinct task types, such as tagging, in-filling, and rewriting. These tasks are collected with contributions of NLP practitioners in the community and through an iterative peer review process to ensure their quality. With this large and diverse collection of tasks, we are able to rigorously benchmark cross-task generalization of models – training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances, and model sizes. Based on these insights, we introduce Tk-Instruct, an encoder-decoder Transformer that is trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples) which outperforms existing larger models on our benchmark. We hope this benchmark facilitates future progress toward more general-purpose language understanding models.
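The in-context instruction setup described above boils down to prompting a model with a plain-language task definition, optionally followed by k demonstration pairs, and then the new instance to solve. The helper below is a minimal sketch of that prompt construction; the exact wording and field labels are assumptions, not the benchmark's released template.

```python
def build_instruction_prompt(definition: str,
                             demonstrations: list[tuple[str, str]],
                             new_input: str) -> str:
    """Compose a task definition, k-shot examples, and a new instance (sketch)."""
    parts = [f"Definition: {definition}"]
    for i, (inp, out) in enumerate(demonstrations, start=1):
        # k-shot in-context demonstrations of the task.
        parts.append(f"Positive Example {i}\nInput: {inp}\nOutput: {out}")
    parts.append(f"Now solve the following instance.\nInput: {new_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_instruction_prompt(
    definition="Given a sentence, label its sentiment as positive or negative.",
    demonstrations=[("I loved the movie.", "positive")],
    new_input="The plot was dull and predictable.",
)
```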
@inproceedings{wang2022benchmarking,
  title={Benchmarking generalization via in-context instructions on 1,600+ language tasks},
  author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and others},
  booktitle={EMNLP, Main Conference},
  year={2022},
  url={https://instructions.apps.allenai.org},
}
2020
MSpeC-Net: Multi-Domain Speech Conversion Network
Harshit Malaviya, Jui Shah, Maitreya Patel, Jalansh Munshi, and Hemant A Patil
In 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
@inproceedings{malaviya2020mspec,
  title={{MSpeC-Net}: Multi-Domain Speech Conversion Network},
  author={Malaviya, Harshit and Shah, Jui and Patel, Maitreya and Munshi, Jalansh and Patil, Hemant A},
  booktitle={45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7764--7768},
  year={2020},
  organization={IEEE},
}
CinC-GAN for Effective F0 prediction for Whisper-to-Normal Speech Conversion
Maitreya Patel, Mirali Purohit, Jui Shah, and Hemant A Patil
In 28th European Signal Processing Conference (EUSIPCO) 2020
@inproceedings{patel2020cinc,
  title={{CinC-GAN} for Effective F0 prediction for Whisper-to-Normal Speech Conversion},
  author={Patel, Maitreya and Purohit, Mirali and Shah, Jui and Patil, Hemant A},
  booktitle={28th European Signal Processing Conference (EUSIPCO)},
  year={2020},
  organization={IEEE},
}
Weak Speech Supervision: A case study of Dysarthria Severity Classification
@inproceedings{purohit2020weak,
  title={Weak Speech Supervision: A case study of Dysarthria Severity Classification},
  author={Purohit, Mirali and Parmar, Mihir and Patel, Maitreya and Malaviya, Harshit and Patil, Hemant A},
  booktitle={28th European Signal Processing Conference (EUSIPCO)},
  year={2020},
  organization={IEEE},
}
2019
Novel adaptive generative adversarial network for voice conversion
@inproceedings{patel2019novel,
  title={Novel adaptive generative adversarial network for voice conversion},
  author={Patel, Maitreya and Parmar, Mihir and Doshi, Savan and Shah, Nirmesh J and Patil, Hemant A},
  booktitle={11th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1273--1281},
  year={2019},
  organization={IEEE},
}
Effectiveness of cross-domain architectures for whisper-to-normal speech conversion
@inproceedings{parmar2019effectiveness,
  title={Effectiveness of cross-domain architectures for whisper-to-normal speech conversion},
  author={Parmar, Mihir and Doshi, Savan and Shah, Nirmesh J and Patel, Maitreya and Patil, Hemant A},
  booktitle={27th European Signal Processing Conference (EUSIPCO)},
  year={2019},
  organization={IEEE},
}