CycleGAN's Other Failure Mode: Perfect Scores, No Translation

Sathammai Sathappan

TL;DR: CycleGANs are notorious for being unstable, and mode collapse is the failure everyone expects. While exploring that, I ran into a different degenerate solution, where the translated image ≈ the original image, caused by the strictness of the cycle-consistency loss. My explanation was the low-resolution dataset: the discriminator could not learn fine details, which let the generator get away with minimal work. So the next dataset I chose was richer and of higher resolution. It produced much more creative translations while still maintaining cycle consistency. The takeaway: a well-processed dataset gives CycleGAN sandbox conditions, in which it not only produces creative translations but is also robust to noise.

CycleGAN In Fashion Domain:

CycleGANs are primarily used in medical imaging and art/style transfer and remain largely under-explored in fashion. I wanted to explore cross-cultural translations, so I curated an Indian and Western bridal couture dataset. But to first understand CycleGAN's behavior in a more controlled setting, I ran experiments on the Fashion-MNIST dataset. I performed both qualitative and quantitative evaluations, with the quantitative metrics spanning four categories: pixel-level fidelity, structural consistency, perceptual realism, and color accuracy.

Fashion-MNIST:

Fashion-MNIST is a benchmark dataset for image classification. Its images are low-resolution (28x28) grayscale pictures of fashion items in ten classes. I chose sandals as domain-A and sneakers as domain-B.

Following are the training dynamics:

The loss curves for domain-B (sneakers) show an adversarial dynamic from early on (notice how the G_B and D_B curves are roughly complementary), whereas domain-A (sandals) shows it mostly in the second half of training. The likely reason is intuitive: sneakers are visually richer, so their discriminator learned quickly, whereas sandals are simpler, so their discriminator was slower to learn.

The cycle-consistency and identity losses decrease steadily, and by the end of training the model had nearly perfected both the cycle-consistency and identity metrics.

Following are the identity metrics averaged over 100 images per domain:

Domain	MSE	SSIM	PSNR (dB)
Sandals (A)	0.0001	0.994	41.19
Sneakers (B)	0.0002	0.991	40.41

Following are the cycle-consistency metrics averaged over 100 images per domain:

Translation Direction	MSE	SSIM	PSNR (dB)
Sandals → Sneakers (A→B)	0.0004	0.975	35.54
Sneakers → Sandals (B→A)	0.0004	0.978	34.51
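
As a rough illustration of how the MSE and PSNR columns relate (not the exact evaluation code from the notebooks), here is a minimal NumPy sketch. It assumes images scaled to [0, 1]; the function names `mse` and `psnr`, and the random arrays standing in for an image and its reconstruction, are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images scaled to [0, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

# Toy stand-in for a 28x28 grayscale image and a slightly noisy reconstruction.
rng = np.random.default_rng(0)
original = rng.random((28, 28))
reconstructed = np.clip(original + rng.normal(0.0, 0.02, original.shape), 0.0, 1.0)

print(f"MSE={mse(original, reconstructed):.4f}  PSNR={psnr(original, reconstructed):.2f} dB")
```

On a [0, 1] scale, an MSE of 0.0004 corresponds to 10·log10(1/0.0004) ≈ 33.98 dB, which is in the same range as the cycle-consistency rows above. Averaging PSNR per image generally gives a slightly different number than converting the averaged MSE, so the two columns need not match exactly.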

So the model learned to reconstruct the original image from the translated one, and it does so by preserving the original structure. The cost of that preservation? Almost no translation.

After qualitative observations, I roughly divided the translations into three categories:

A) Successful translations:

B) Mid-level transformations:

C) No translation:

Notice how, even in the successful and partial translations, the original structure dominates and the additions are fainter, like faint wedges being added to pencil and block heels.

These qualitative observations, combined with the near-perfect quantitative metrics, led me to believe that on low-resolution datasets the discriminator cannot learn strong features, so the generator, which is already heavily constrained by the cycle-consistency loss, learns that minimal changes are enough to fool it. This leaves highly distinctive and unusual elements unaltered.
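
To make the argument concrete, here is a small sketch (my own toy setup, not the training code) showing why near-perfect cycle and identity scores say nothing about translation quality. It assumes L1 cycle and identity losses as in the standard CycleGAN formulation; `G` maps A to B and `F` maps B to A, and the "generators" are plain Python functions rather than networks.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two image batches."""
    return float(np.mean(np.abs(a - b)))

def cycle_loss(G, F, x_a, x_b):
    """L1 cycle loss: F(G(a)) should return a, and G(F(b)) should return b."""
    return l1(F(G(x_a)), x_a) + l1(G(F(x_b)), x_b)

def identity_loss(G, F, x_a, x_b):
    """L1 identity loss: G should leave B images alone, F should leave A images alone."""
    return l1(G(x_b), x_b) + l1(F(x_a), x_a)

rng = np.random.default_rng(0)
sandals = rng.random((8, 28, 28))   # hypothetical domain-A batch
sneakers = rng.random((8, 28, 28))  # hypothetical domain-B batch

# Degenerate "translator" 1: copy the input unchanged.
def copy_input(x):
    return x

print(cycle_loss(copy_input, copy_input, sandals, sneakers))     # 0.0
print(identity_loss(copy_input, copy_input, sandals, sneakers))  # 0.0

# Degenerate "translator" 2: add a faint fixed pattern, and remove it on the way back.
faint = 0.05 * rng.random((28, 28))

def add_faint(x):
    return x + faint

def remove_faint(x):
    return x - faint

print(cycle_loss(add_faint, remove_faint, sandals, sneakers))
print(identity_loss(add_faint, remove_faint, sandals, sneakers))
```

Both degenerate pairs score essentially perfectly on the cycle term, and the second pays only a small identity penalty. The adversarial term is the only part of the objective that can reject them, which is why a weak discriminator matters so much here.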

So my hypothesis was that a richer dataset would overcome this problem, which led me to construct a stronger second dataset.

Bridal Couture Dataset:

The goal was to explore culturally rich fashion translations in a high-resolution setup, and since no such dataset was readily available, I curated one myself. I chose Indian traditional bridal attire as domain-A and Western bridal attire as domain-B. The Indian bridal wear was mostly semi-couture pieces with rich texture and intricate detailing, and the Western bridal wear was semi-couture pieces with some ready-to-wear, which is less intricate than the Indian set but still good enough for exploring style translations.

I scraped the images from the web using several tools, including Selenium and Playwright, among others. I got around 5600 images for the Indian domain and 5000 images for the Western domain. To improve the focus on the gowns, I blurred the backgrounds using rembg. To maintain privacy, and to prevent the model from misusing ethnicity artifacts in translations, I blurred the faces using retinaface.
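
A minimal sketch of the background-blur step, assuming a binary foreground mask is already available (for example from a segmentation tool such as rembg; the mask extraction itself is not shown). The helper names `box_blur` and `blur_background` and the simple box filter are illustrative choices, not necessarily what the notebooks use. The same compositing idea applies to face blurring, with the mask marking the region to blur instead of the region to keep.

```python
import numpy as np

def box_blur(img, radius):
    """Blur an HxWxC float image with a (2r+1)x(2r+1) box filter, edge-padded."""
    k = 2 * radius + 1
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def blur_background(img, fg_mask, radius=5):
    """Keep pixels where fg_mask is 1 and replace the rest with a blurred copy."""
    img = np.asarray(img, dtype=np.float64)
    alpha = np.asarray(fg_mask, dtype=np.float64)[..., None]  # HxWx1 for broadcasting
    return alpha * img + (1.0 - alpha) * box_blur(img, radius)

# Toy example: a random 32x32 RGB "photo" with the gown assumed to be in the centre.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0

out = blur_background(img, mask, radius=3)
print(out.shape)
```

If the mask misses part of the garment, that part is blurred along with the background.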

To balance the datasets, I manually selected a subset of the Western dataset and augmented it. The final Western total was 5900, slightly more than the Indian dataset. This was deliberate: I wanted to give the model more exposure to the Western domain. Since the Western dataset was simpler, I thought this would give the model equitable learning opportunities across the two domains.

Following are the training dynamics:

The discriminator losses for both domains show minimal oscillation, indicating that they stabilized early. The generator losses oscillate more, suggesting active adaptation to the intricate detailing.

Following are the cycle consistency metrics averaged over 100 images per domain:

Metric	A→B→A Cycle	B→A→B Cycle
MSE	0.001475	0.000882
PSNR (dB)	28.65	30.84
SSIM	0.8983	0.9454
LPIPS	0.13898	0.10481
HOG	0.9293	0.9463
MS-SSIM	0.9168	0.9692

Following are the identity metrics averaged over 100 images per domain:

Metric	Identity A (G_B)	Identity B (G_A)
MSE	0.0005	0.00005
PSNR (dB)	33.28	34.15
SSIM	0.9641	0.9767
LPIPS	0.0377	0.0466

The metrics are not as high as they were for Fashion-MNIST, but they are still very decent considering the intricacy of the bridal datasets. Most metrics for domain-B (Western bridal) are better than for domain-A (Indian bridal), with identity LPIPS as the exception. This was expected, as both feature preservation and reconstruction are relatively easier for the simpler domain.

A few successful translations:

Notice how the translations, especially Western to Indian, include additional texture patterns (resembling embroidery) and ornament detailing, whereas Indian to Western suppresses those patterns and details. This suggests that the two translation functions are near inverses of each other, indicating that the model learned well.

The key here is that, at a slight cost in the cycle-consistency and identity metrics, the model produced much better translations, and the main reason is the discriminator. Since the dataset was of higher resolution, the discriminator could learn the intricate details better, which made it harder to fool. This pushed the generator to go beyond basic translations in order to fool the discriminator, even at the cost of the cycle-consistency metrics.

Some more metrics:

To do justice to the datasets, I also computed additional metrics: LPIPS, color and HSV histograms, edge and HOG-based comparisons, MS-SSIM, and ∆E (Delta-E).

Following are the edge and structural metrics averaged over 100 images:

Metric	A→B Translation	B→A Translation
Edge MSE	0.001568	0.001462
Edge PSNR (dB)	28.68	29.11
Edge SSIM	0.6539	0.6779
Laplacian MSE	469.39	582.83
Laplacian PSNR (dB)	27.35	23.77
Laplacian SSIM	0.6616	0.6026
HOG Similarity	0.8213	0.8325
MS-SSIM	0.4228	0.6034

Most of the edge-based metrics (Edge MSE, PSNR and SSIM, HOG similarity, MS-SSIM) are better in the Western to Indian direction, because existing features are mostly preserved. The Laplacian metrics are the exception and favor Indian to Western. The reverse direction, from maximalism to minimalism, is tougher since elaborate visual elements need to be removed, causing a drop in the edge metrics.
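
For readers unfamiliar with Laplacian-based comparisons, here is a rough sketch of the idea: filter both images with a second-derivative kernel, then compare the responses. It assumes grayscale inputs in [0, 1] and a 4-neighbour kernel, so its scale will not match the table above. The names `laplacian` and `laplacian_mse` are hypothetical.

```python
import numpy as np

def laplacian(gray):
    """4-neighbour Laplacian response of an HxW image, with edge padding."""
    g = np.pad(np.asarray(gray, dtype=np.float64), 1, mode="edge")
    return (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
            - 4.0 * g[1:-1, 1:-1])

def laplacian_mse(src, translated):
    """MSE between the Laplacian responses of a source image and its translation."""
    return float(np.mean((laplacian(src) - laplacian(translated)) ** 2))

rng = np.random.default_rng(0)
plain = np.full((64, 64), 0.5)                    # smooth, minimal fabric
embroidered = plain + 0.1 * rng.random((64, 64))  # same fabric with fine texture added

print(laplacian_mse(plain, plain))        # 0.0
print(laplacian_mse(plain, embroidered))  # larger: fine texture changes the edge response
```

Because the Laplacian responds strongly to fine detail, adding or removing embroidery-like texture moves this metric much more than a smooth change in shading would.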

Following are the color metrics averaged over 100 images:

Metric	A→B	B→A	A→B→A	B→A→B
Delta-E	19.27	15.98	5.35	3.92

For color, Delta-E is higher for the Indian to Western translation and lower for Western to Indian. Isn't that counter-intuitive? Not really: the model trained for only 161 epochs (stopped midway because of compute constraints) and is still refining its color mappings, so the translations into the Indian domain have softer, less sharply defined colors than the translations into the Western domain, which are predominantly white and let the model be more confident. For cycle reconstruction, Delta-E is higher for the Indian domain, as it has a wider color range, so even subtle deviations in color are amplified.
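
The Delta-E variant is not specified above, so the sketch below assumes the simplest one, CIE76, which is the Euclidean distance between two colors in CIELAB space. It also assumes the images have already been converted to CIELAB (the conversion from RGB is not shown); `delta_e_cie76` and `mean_delta_e` are hypothetical helper names.

```python
import numpy as np

def delta_e_cie76(lab1, lab2):
    """Per-pixel CIE76 colour difference between two HxWx3 CIELAB images."""
    diff = np.asarray(lab1, dtype=np.float64) - np.asarray(lab2, dtype=np.float64)
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def mean_delta_e(lab1, lab2):
    """Average Delta-E over all pixels, giving one number per image pair."""
    return float(np.mean(delta_e_cie76(lab1, lab2)))

# Two 2x2 CIELAB "images": an off-white gown and a slightly tinted copy.
source = np.tile(np.array([95.0, 0.0, 0.0]), (2, 2, 1))
tinted = np.tile(np.array([92.0, 4.0, 0.0]), (2, 2, 1))

print(mean_delta_e(source, tinted))  # 5.0
```

Under this definition, a translation that recolors a saturated garment to near-white moves every pixel a long way in CIELAB, so a large Delta-E between source and translation can reflect a decisive color change rather than an error.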

Key Takeaway:

The cycle-consistency constraint was introduced to address the under-constrained mapping problem: with an adversarial loss alone, many mappings can match the target distribution, including one that maps every input to the same output. Cycle consistency forces the translation to be reversible, which rules out many such degenerate mappings. This is why it's given such high importance.
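
For reference, the standard CycleGAN objective can be written as below. The notation is my own assumption for this post: G maps domain A to B, F maps B to A, D_A and D_B are the two discriminators, and the weights λ_cyc and λ_id are hyperparameters whose values are not stated here.

```latex
\begin{align}
\mathcal{L}_{\text{cyc}}(G, F) &= \mathbb{E}_{a \sim A}\big[\lVert F(G(a)) - a \rVert_1\big]
  + \mathbb{E}_{b \sim B}\big[\lVert G(F(b)) - b \rVert_1\big] \\
\mathcal{L}_{\text{id}}(G, F) &= \mathbb{E}_{b \sim B}\big[\lVert G(b) - b \rVert_1\big]
  + \mathbb{E}_{a \sim A}\big[\lVert F(a) - a \rVert_1\big] \\
\mathcal{L}(G, F, D_A, D_B) &= \mathcal{L}_{\text{GAN}}(G, D_B) + \mathcal{L}_{\text{GAN}}(F, D_A)
  + \lambda_{\text{cyc}}\,\mathcal{L}_{\text{cyc}}(G, F)
  + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}(G, F)
\end{align}
```

The reconstruction terms are weighted sums added to the adversarial terms, so how much the generator is willing to change an image depends on how much adversarial pressure the discriminators can exert relative to those weights.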

In a low-resolution dataset, however, the discriminator cannot learn the variety within each domain as effectively. The generator exploits this weakness: it learns that minimal transformations do the job and also keep the cycle loss low. This highlights a critical limitation: when the dataset is too simple (or low-resolution, which blurs out many details), the cycle-consistency loss dominates and prevents creative or semantically rich translations. Thus cycle consistency, which was originally introduced to enforce uniqueness, paradoxically becomes a liability.

In contrast, the Bridal Couture dataset had complex, high-resolution images with distinct elements, which fundamentally changed the model's behavior. Another thing that surprised me: in the Western to Indian direction, the model produced visually rich translations. This direction naturally posed a higher risk of mode collapse, since features had to be introduced, and yet the model produced visually realistic outputs.

A strong discriminator forced the generator to find a balance between preserving content and introducing style variations — a significant step towards stable and expressive unpaired translations.

Dataset quality is always considered important, and these findings underscore that for CycleGAN in particular. Its performance on the second dataset shows how it behaves when given the right sandbox: it learns mappings that are not only accurate but also meaningfully creative.

A few things to note:

1) Due to rembg's imperfect segmentation, a small percentage (≈2%) of skirts got blurred. I kept them in the dataset to test robustness. To my surprise, not only did the model not collapse, it also tried to produce meaningful translations for the blurred gowns. The same model, which somewhat collapsed on Fashion-MNIST, managed to produce meaningful translations on a complex dataset even with noise: the result of a good sandbox.

2) I initially planned to run 300-400 epochs, but training stopped at 161 due to compute constraints, so only 10% of the translations were visually exceptional. The remaining 90%, though less striking, accurately reflected the data's natural variability and segmentation noise. So the model managed to learn the underlying domain distribution instead of just memorizing idealized features.

Repo

The notebooks and detailed reports are available at: https://github.com/Sat1509/Fashion-CycleGAN