Virtual try-on technology represents a significant advancement in the digital fashion industry, bridging the gap between online shopping and the physical act of trying on clothes. It allows consumers to visualize how garments look on their bodies without a physical fitting, leveraging digital imaging and artificial intelligence to superimpose clothing items onto the user’s digital avatar or image in a realistic manner. Its potential uses in the fashion industry are extensive, including online shopping, personalized recommendations, and virtual fashion shows, while the benefits for consumers include convenience, an enhanced shopping experience, and a reduced need for physical trials, leading to fewer returns and greater satisfaction.
In today’s blog, we will test and evaluate the results from StableVITON, an advanced framework for image-based virtual try-on. Building upon the capabilities of pre-trained diffusion models, it promises to offer high-quality, realistic clothing simulations on arbitrary person images. We’re particularly interested in evaluating if this technology is ready to become a final product that customers would find beneficial.
StableVITON distinguishes itself from existing virtual try-on solutions with several innovative features:
It analyzes and learns the relationships between clothing items and body shape within the hidden (latent) representation space of the diffusion model. This leads to more accurate virtual clothing transfer onto different body images.
Zero cross-attention blocks are the core elements of the architecture that allow StableVITON to maintain clothing details while still using the power of the pre-trained diffusion model for image generation. These blocks preserve clothing details by learning semantic correspondences between the garment and the person, while leveraging the inherent knowledge of the pre-trained model in the image warping process, resulting in high-fidelity images.
A novel attention total variation loss is proposed to achieve sharper attention maps, leading to crisper preservation of garment details such as patterns and textures.
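As a hedged illustration of the underlying idea (our own simplified PyTorch sketch, not the authors’ exact formulation), one can compute the attention-weighted center of each person-feature query over the clothing feature grid and apply a total-variation penalty to that center map, encouraging well-localized, spatially coherent attention:

import torch


def attention_center_tv(attn, person_hw, cloth_hw):
    # Hedged sketch of a total-variation-style regularizer on cross-attention
    # maps; it illustrates the idea behind the attention total variation loss
    # but is not the authors' exact formulation.
    #
    # attn: (B, Hp*Wp, Hc*Wc) attention of person-feature queries over
    #       clothing-feature keys, with each row summing to 1.
    ph, pw = person_hw
    ch, cw = cloth_hw
    b = attn.shape[0]

    # Normalized (x, y) coordinate of every clothing key position: (Hc*Wc, 2).
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, ch), torch.linspace(0, 1, cw), indexing="ij"
    )
    coords = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=-1).to(attn)

    # Attention-weighted center of mass for every query, laid out on the
    # person feature grid: (B, Hp, Wp, 2).
    centers = (attn @ coords).reshape(b, ph, pw, 2)

    # Penalize abrupt jumps between neighboring query positions so the learned
    # garment-to-body correspondence stays sharp and spatially coherent.
    tv_h = (centers[:, 1:, :, :] - centers[:, :-1, :, :]).abs().mean()
    tv_w = (centers[:, :, 1:, :] - centers[:, :, :-1, :]).abs().mean()
    return tv_h + tv_w


# Toy usage with random attention weights (all shapes are assumptions).
attn = torch.softmax(torch.randn(1, 16 * 12, 16 * 12), dim=-1)
print(attention_center_tv(attn, person_hw=(16, 12), cloth_hw=(16, 12)))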
StableVITON employs three conditions—agnostic map, agnostic mask, and dense pose—alongside the clothing feature map as inputs to the model’s attention mechanism, ensuring detailed and accurate alignment of clothing on the person’s image.
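To make these ideas concrete, here is a minimal PyTorch sketch (our own simplification, not the authors’ code) of how the inputs might be combined: the spatial conditions are concatenated with the noisy latent along the channel axis, while the clothing features enter through a cross-attention block whose output projection is zero-initialized so that the pre-trained diffusion model is undisturbed at the start of training. All channel counts and feature sizes are illustrative assumptions:

import torch
import torch.nn as nn


class ZeroCrossAttention(nn.Module):
    # Cross-attention from person (UNet) features to clothing features.
    # The output projection starts at zero, so the block initially adds
    # nothing to the pre-trained UNet and only gradually learns the
    # garment-to-body correspondence.
    def __init__(self, dim, cloth_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=cloth_dim, vdim=cloth_dim, batch_first=True
        )
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)  # zero-initialized output projection
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, person_tokens, cloth_tokens):
        out, _ = self.attn(person_tokens, cloth_tokens, cloth_tokens)
        return person_tokens + self.proj_out(out)  # residual; identity at initialization


# Spatial conditions concatenated with the noisy latent along the channel axis
# (channel counts and the 1/8-scale latent resolution of a 768x1024 input are
# illustrative assumptions).
B = 1
noisy_latent  = torch.randn(B, 4, 128, 96)   # diffusion latent being denoised
agnostic_map  = torch.randn(B, 4, 128, 96)   # person with the clothing region removed
agnostic_mask = torch.randn(B, 1, 128, 96)   # mask of the removed region
dense_pose    = torch.randn(B, 4, 128, 96)   # DensePose condition
unet_input = torch.cat([noisy_latent, agnostic_map, agnostic_mask, dense_pose], dim=1)
print(unet_input.shape)  # torch.Size([1, 13, 128, 96])

# Clothing features attend into the UNet at a deeper, lower-resolution level.
person_tokens = torch.randn(B, 16 * 12, 320)  # flattened UNet feature map
cloth_tokens  = torch.randn(B, 16 * 12, 320)  # flattened clothing feature map
block = ZeroCrossAttention(dim=320, cloth_dim=320)
print(block(person_tokens, cloth_tokens).shape)  # torch.Size([1, 192, 320])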
In contrast to other virtual try-on solutions that may struggle with preserving fine clothing details or require extensive manual tuning, StableVITON claims to offer an end-to-end solution that preserves clothing details and generates high-quality images even in the presence of complex backgrounds.
The main goal of StableVITON is to make image-based virtual try-on realistic: given an arbitrary person image and a clothing image, it aims to generate a highly accurate image in which the garment is naturally fitted to the person while its details are faithfully preserved.
While specific system requirements are not formally documented for StableVITON, our testing environment was equipped with robust hardware to ensure optimal performance. This included an NVIDIA RTX 3090 GPU, a 12th Gen Intel® Core™ i5-12400 CPU, and 64GB of RAM. It’s recommended to use a similarly capable setup or consult the documentation for minimum requirements.
The process to install StableVITON is straightforward; we followed the step-by-step setup instructions provided in the project repository.
During our installation, we encountered no significant issues.
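As a quick sanity check after setting up the environment (a generic check, not part of the repository’s instructions), we can confirm that PyTorch sees the GPU:

import torch

# Verify the CUDA setup before running inference.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3090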
The lack of comprehensive documentation in the repository posed a significant challenge. While the authors outlined the expected input data structure for inference, they did not provide clear instructions or identify the specific dependencies required to replicate their results. References to VITON-HD and DensePose were made for mask generation and pose estimation, respectively, but the exact models needed for accurate replication were omitted.
Despite these challenges, we were able to proceed with inference. Using the DensePose project from Facebook’s Detectron2, we generated the necessary DensePose estimations, and the OOTDiffusion repository provided the inference script needed to generate the agnostic masks.
The authors referred to the zalando-hd-resized dataset as a template for structuring custom data. Since this dataset was provided in a separate repository (linked here), we decided to use the cloth images it already contains and test them against our own person images.
The following images and masks were essential for inference: the person image, its DensePose map, the agnostic image, the agnostic mask, the cloth image, and the cloth mask.
We manually reframed the original images to center the subject and resized them to 768×1024 with a 3:4 aspect ratio. This standardization was crucial as the model requires uniform image sizes, including the clothing images.
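We did this reframing by hand, but for reference a minimal Pillow-based sketch of equivalent automated preprocessing (not the exact procedure we used; file names are placeholders) could look like this:

from PIL import Image

TARGET_W, TARGET_H = 768, 1024  # 3:4 aspect ratio expected by the model


def center_crop_resize(path_in, path_out):
    # Center-crop an image to a 3:4 aspect ratio, then resize to 768x1024.
    img = Image.open(path_in).convert("RGB")
    w, h = img.size
    target_ratio = TARGET_W / TARGET_H  # 0.75

    if w / h > target_ratio:
        # Too wide: trim the sides around the center.
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:
        # Too tall: trim the top and bottom around the center.
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))

    img.resize((TARGET_W, TARGET_H), Image.LANCZOS).save(path_out)


center_crop_resize("raw/human_image_1.jpg", "test/image/human_image_1.jpg")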
To generate the necessary agnostic images and masks, we employed the “run_ootd” script from OOTDiffusion. Modifications were made to this script to output images in the specified dimensions of 768×1024.
For pose estimation, the ‘densepose_rcnn_R_50_FPN_s1x’ model was run in ‘dp_segm’ mode within the DensePose project. Using the ‘apply_net’ script (linked here), we obtained the DensePose results. A minor script modification was necessary to ensure the DensePose output was rendered against a black background.
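For completeness, a similar effect can be approximated as a post-processing step instead of a script change. The sketch below is purely illustrative (file names are placeholders): it blanks every pixel that the DensePose visualization left unchanged from the source photo, leaving the DensePose rendering on a black background:

import numpy as np
from PIL import Image

# Illustration only: the dp_segm output draws the DensePose segmentation over
# the original photo, so any pixel left identical to the source image belongs
# to the background. Painting those pixels black approximates the effect of
# our script modification. File names are placeholders.
original = np.asarray(Image.open("test/image/human_image_1.jpg").convert("RGB"))
rendered = np.asarray(Image.open("densepose_raw/human_image_1.png").convert("RGB"))

background = np.all(rendered == original, axis=-1)  # pixels untouched by the overlay
result = rendered.copy()
result[background] = 0
Image.fromarray(result).save("test/image-densepose/human_image_1.jpg")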
The structure was provided as follows in the project repo, but an additional file named “test_pairs.txt” was required to run the inference. The final structure of the directory should look like this:
test
|-- image
|-- image-densepose
|-- agnostic
|-- agnostic-mask
|-- cloth
|-- cloth_mask
|-- test_pairs.txt
In the “test” directory, we created a text file named “test_pairs.txt”. This file specifies the image pairs for virtual try-on. Each line should list the filenames of the human image and the corresponding cloth image, separated by a space.
Example:
human_image_1.jpg cloth_image_1.jpg
human_image_2.jpg cloth_image_1.jpg
human_image_1.jpg cloth_image_2.jpg
human_image_2.jpg cloth_image_2.jpg
…..
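Rather than writing this file by hand for every combination, a small helper script can enumerate all person/cloth pairs. The sketch below is our own convenience script and assumes the directory layout shown above:

import os
from itertools import product

test_dir = "test"
humans = sorted(os.listdir(os.path.join(test_dir, "image")))
cloths = sorted(os.listdir(os.path.join(test_dir, "cloth")))

# Write one "<human image> <cloth image>" pair per line.
with open(os.path.join(test_dir, "test_pairs.txt"), "w") as f:
    for human, cloth in product(humans, cloths):
        f.write(f"{human} {cloth}\n")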
Following the repository’s instructions, we successfully generated results using the ‘unpaired’ mode. Each generation took approximately 9 seconds. Our test comprised 12 generations in total, drawn from combinations of 6 different clothing items and 3 input images.
Here are our observations:
StableVITON delivers impressive garment fit on the virtual model. Notably, it retains details like hair and skin at the garment’s edges, showcasing its ability to handle complex image regions. Additionally, the clothing exhibits realistic wrinkles and folds, adding a layer of realism.
Interestingly, when comparing the generated images for the crop top and t-shirt, StableVITON appears to understand the inherent style of each garment. The virtual try-on reflects a tighter fit for the crop top and the dress while generating a looser drape for the t-shirt, demonstrating the model’s ability to adapt to different clothing types.
We observed inconsistencies in the color of a single garment across different generations. For example, the gray t-shirt exhibited variations in shade, at times appearing lighter gray or even black.
While StableVITON retains the basic texture of the clothing, it struggles to fully reproduce finer details, particularly in the case of text or graphics printed on the garment.
We consistently encountered problems with the generation of faces, beards, and occasionally glasses by the pre-trained diffusion model. These facial features often appeared distorted or malformed, rendering many results unusable. This significant limitation indicates that the framework, in its current form, is not suitable for deployment in a customer-facing product.
Since the final image is produced by the pre-trained diffusion model, we encountered occasional issues with inconsistent skin tone when generating arms. Additionally, finer details like watches or tattoos on the original person’s image were always lost during the generation process.
StableVITON demonstrates considerable promise in the realm of virtual try-on. Its ability to accurately transfer garment fit, generate compelling textures, and adapt to different clothing styles showcases significant technical advancement. However, the noted inconsistencies and shortcomings highlight critical areas for improvement before this technology can deliver reliable results in a customer-facing product.
Specifically, the frequent distortion of facial features, the loss of finer garment details, and the color instability detract from the overall realism and trustworthiness of the virtual try-on experience. For customers, these flaws undermine confidence in making purchasing decisions based on the generated images.
While StableVITON marks a noteworthy step forward, addressing these limitations is crucial before it can be considered a market-ready solution.