Text-conditioned training and generation require the Hugging Face t5-base text encoder. By default, configs load it from ${OMG_MODELS_ROOT}/t5-base-local. Download t5 ...
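As a rough illustration of the default path described above, the following sketch expands `${OMG_MODELS_ROOT}/t5-base-local` from the environment and checks whether the directory looks like a saved Hugging Face model. The helper names `resolve_t5_dir` and `has_local_t5` are hypothetical, and the `config.json` check is an assumption about how the encoder was saved, not something the project states. The sketch does not download anything.

```python
import os
from pathlib import Path


def resolve_t5_dir(template: str = "${OMG_MODELS_ROOT}/t5-base-local") -> Path:
    """Expand environment variables in a config-style path template.

    Hypothetical helper: the project's own config loader may resolve
    the path differently.
    """
    if "OMG_MODELS_ROOT" not in os.environ:
        raise RuntimeError("OMG_MODELS_ROOT is not set")
    return Path(os.path.expandvars(template))


def has_local_t5(model_dir: Path) -> bool:
    """Heuristic check (an assumption): a saved Hugging Face model
    directory normally contains a config.json file."""
    return (model_dir / "config.json").is_file()


if __name__ == "__main__":
    t5_dir = resolve_t5_dir()
    status = "found" if has_local_t5(t5_dir) else "missing"
    print(f"{t5_dir}: {status}")
```

Running the script with `OMG_MODELS_ROOT` set reports whether the expected directory is present, which can be a quick sanity check before starting text-conditioned training or generation.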
ViT-Up is an implicit feature upsampler for Vision Transformers that predicts backbone-aligned features at arbitrary continuous image coordinates. Pretrained through self-supervised feature ...