Author manuscript; available in PMC: 2022 Oct 5.
Published in final edited form as: Proc SPIE Int Soc Opt Eng. 2022 Apr 4;12032:120321Q. doi: 10.1117/12.2610655

Structure-aware Unsupervised Tagged-to-Cine MRI Synthesis with Self Disentanglement

Xiaofeng Liu a, Fangxu Xing a, Jerry L Prince b, Maureen Stone c, Georges El Fakhri a, Jonghye Woo a
PMCID: PMC9533681  NIHMSID: NIHMS1776956  PMID: 36203947

Abstract

Cycle reconstruction regularized adversarial training—e.g., CycleGAN, DiscoGAN, and DualGAN—has been widely used for image style transfer with unpaired training data. Several recent works, however, have shown that local distortions are frequent, and structural consistency cannot be guaranteed. Targeting this issue, prior works usually relied on additional segmentation or consistent feature extraction steps that are task-specific. To counter this, this work aims to learn a general add-on structural feature extractor, by explicitly enforcing the structural alignment between an input and its synthesized image. Specifically, we propose a novel input-output image patches self-training scheme to achieve a disentanglement of underlying anatomical structures and imaging modalities. The translator and structure encoder are updated, following an alternating training protocol. In addition, the information w.r.t. imaging modality can be eliminated with an asymmetric adversarial game. We train, validate, and test our network on 1,768, 416, and 1,560 unpaired subject-independent slices of tagged and cine magnetic resonance imaging from a total of twenty healthy subjects, respectively, demonstrating superior performance over competing methods.

Keywords: Tagged MRI, Image synthesis, Unsupervised image translation, Anatomical disentanglement

1. INTRODUCTION

Cross-modality image-to-image translation plays a vital role in medical image analysis. For example, tagged-to-cine magnetic resonance (MR) imaging (MRI) synthesis can potentially be used to reduce the extra cine MRI acquisition time and cost, without interfering with subsequent tasks, such as motion analyses.1,2 To this end, previous works usually relied on paired cine and tagged MRI for training.1 However, because of imaging artifacts, or because patients may not tolerate remaining in the scanner long enough, one of the modalities (e.g., cine MRI) may be missing. In such circumstances, it is necessary to relax the requirement of paired training data.

A typical solution for unpaired image-to-image translation would be to use the cycle constraint of the bi-directional mapping, e.g., CycleGAN,3 DiscoGAN,4 and DualGAN.5 Although prior works yielded visually realistic results by means of a generative adversarial network (GAN) loss,6-10 structural consistency was not enforced, easily resulting in local distortions in the synthesized images. We note that, in the tagged-to-cine MRI synthesis task, preserving the anatomical structure is essential for subsequent tissue segmentation and motion analysis. In related developments, a segmentation task was added for co-training.11 However, this requires additional segmentation labels to train an anatomical segmentation network.12-15 In addition, Yang et al.16 enforced the consistency of the MIND feature between the input MR and output CT data. However, that work cannot be applied to our tagged-to-cine MRI synthesis, since the contour patterns cannot be the same between tagged and cine MRI due to tag patterns.

In this work, to address the aforementioned challenges, we propose to achieve a structure-aware translator, by learning general structural features, compared with task-specific or hand-crafted features, such as the contour-based MIND feature.16 This is achieved by an explicit disentanglement of the anatomical structures (e.g., tongue shape) and imaging modalities via a novel input-output image patches self-training protocol. Our framework can be simply added on top of the conventional cycle-constrained GANs for structure-preserving synthesis.

2. RELATED WORK

Tagged MRI has been a crucial imaging modality for measuring tissue deformation in moving organs.17 Since it has intrinsically low anatomical resolution, an additional matching set of cine MRI is typically acquired for subsequent tissue segmentation, which adds extra scanning time and cost. With the development of GAN-based image generation methods,18-21 recent methods1,2 proposed to synthesize cine MRI from acquired tagged MRI. In these methods, precisely co-registered paired tagged and cine MRI datasets were required for training.

Cycle reconstruction is an important technology for the unpaired image style translation task.3-5 However, local structures can hardly be constrained, which leads to significant distortions.16 To enforce structural consistency, a recent work16 incorporated the MIND texture feature to extract an additional supervision signal for MR-to-CT translation. In addition, a deformation-invariant CycleGAN22 was proposed to alleviate large nonlinear deformations. Similarly, in that work, a structural dissimilarity loss23 was enforced to preserve local structural consistency. Though low-level texture features, e.g., the MIND feature, work efficiently for MR-to-CT translation, these features cannot be used in our tagged-to-cine MRI translation task, due to the additional tag patterns present in tagged MRI.

3. METHODS

3.1. MRI Data Acquisition

All participants spoke a word, "a souk," in time with a periodic metronome-like sound during the MRI scan. MRI scanning was carried out on a Siemens 3.0T TIM Trio system with a 12-channel head coil and a 4-channel neck coil, using a segmented gradient echo sequence.24 The field of view was 240×240 mm, the in-plane resolution was 1.87×1.87 mm, and the slice thickness was 6 mm. The image sequence was acquired at a rate of 26 fps. We note that both cine and tagged MRI are in the same coordinate space. The detailed data collection protocol can be found in the prior work.1

3.2. Our Proposed Network

In an unpaired setting, we have a group of tagged MR images $\{x_t\}$ and a group of cine MR images $\{x_c\}$. The basic framework of CycleGAN3 is illustrated in Fig. 1 (left), which has two bi-directional U-Net-based translators, $G_{TC}$: Tagged MRI→Cine MRI and $G_{CT}$: Cine MRI→Tagged MRI, and two discriminators, $D_T$ and $D_C$, for tagged and cine MRI, respectively. In addition, the cycle constraint minimizes the image-level reconstruction loss $\mathcal{L}_1^I$. Note that we only focus on tagged-to-cine MRI synthesis in this work. Specifically, the optimization objectives can be formulated as:

$$\mathcal{L}_1^I = \left\| G_{CT}(G_{TC}(x_t)) - x_t \right\|_1, \tag{1}$$
$$\mathcal{L}_{D_C} = \mathbb{E}_{x_c}[\log D_C(x_c)] + \mathbb{E}_{x_t}[\log(1 - D_C(G_{TC}(x_t)))], \tag{2}$$
$$\mathcal{L}_{D_T} = \mathbb{E}_{x_t}[\log D_T(x_t)] + \mathbb{E}_{x_c}[\log(1 - D_T(G_{CT}(x_c)))], \tag{3}$$

which are trained in a round-based manner in each iteration.
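The objectives in Eqs. (1)-(3) can be sketched in PyTorch as follows; the translator and discriminator modules here are hypothetical stand-ins for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

def cycle_loss(g_tc: nn.Module, g_ct: nn.Module, x_t: torch.Tensor) -> torch.Tensor:
    """Eq. (1): image-level L1 reconstruction of the tagged input."""
    return (g_ct(g_tc(x_t)) - x_t).abs().mean()

def disc_loss(d, real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """Eqs. (2)/(3): the discriminator maximizes log D(real) + log(1 - D(fake)),
    so we minimize the negation; d maps an image batch to probabilities in (0, 1)."""
    eps = 1e-8  # numerical guard for the log
    return -(torch.log(d(real) + eps).mean()
             + torch.log(1.0 - d(fake.detach()) + eps).mean())
```

In a full training loop, `disc_loss` would be applied once per discriminator ($D_T$ and $D_C$) in each round-based iteration.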

Figure 1.

Figure 1.

Illustration of our framework for the tagged-to-cine MRI synthesis, which consists of a synthesis module and a structure extraction module.

A recent study11 demonstrated that with a bijective geometric transformation $T$ and its inverse $T^{-1}$, the cycle constraint can still be satisfied by the translators $\tilde{G}_{TC} = T \circ G_{TC}$ and $\tilde{G}_{CT} = G_{CT} \circ T^{-1}$. Therefore, geometric distortions are not penalized at the training stage. In addition, Yang et al.16 extracted the MIND features $M(x_t)$ and $M(G_{TC}(x_t))$ of $x_t$ and $G_{TC}(x_t)$ with a manually defined extractor $M$, and minimized their reconstruction loss $\| M(x_t) - M(G_{TC}(x_t)) \|_1$. However, since the MIND feature mainly focuses on boundaries, it is not straightforward to apply it to tagged MRI due to tag patterns. To improve upon the prior work,16 we propose to learn a general structure feature extractor $f$ as an alternative to $M$. Our $f$ can be learned with a novel input-output image patches self-training scheme to achieve a disentanglement of the structures and imaging modalities.

Inspired by recent self-supervised learning,25 we simply split both the input $x_t$ and the output $G_{TC}(x_t)$ into a 3 × 3 grid of patches. In each forward pass, we randomly choose two patches from the resulting batch of 18 patches as our input. Accordingly, there are four possible combinations:

  (a) From the same position of the tagged MRI $x_t$ and the generated cine MRI $G_{TC}(x_t)$.

  (b) Both from the tagged MRI $x_t$, at different positions.

  (c) Both from the generated cine MRI $G_{TC}(x_t)$, at different positions.

  (d) From different modalities and different positions.

Based on these four cases, we have two assumptions: (1) the patches in case (a) should have consistent anatomical structures, and (2) the patch pairs in these cases span three modality combinations, i.e., both from tagged MRI, both from cine MRI, or one from tagged and one from cine MRI.
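The patch splitting and pair sampling described above can be sketched as follows; the grid size matches the text, while the label encoding and names are assumptions for illustration.

```python
import random
import torch

def to_patches(img: torch.Tensor, grid: int = 3):
    """Split a (C, H, W) image into grid x grid non-overlapping patches."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    return [img[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            for i in range(grid) for j in range(grid)]

def sample_pair(x_t: torch.Tensor, x_syn: torch.Tensor):
    """Draw two of the 18 patches from a tagged input and its synthesized
    cine output; the returned label encodes the modality combination
    (0: both tagged, 1: one of each, 2: both synthesized)."""
    pool = ([(p, 0) for p in to_patches(x_t)]
            + [(p, 1) for p in to_patches(x_syn)])
    (p1, m1), (p2, m2) = random.sample(pool, 2)
    return p1, p2, m1 + m2
```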

To remove the modality information from the embedding of $f$, we adopt a classifier $C$ that identifies the modality combination of the concatenated features of the two patches. $C$ is trained to minimize a 3-class cross-entropy loss $\mathcal{L}_{CE}$ of modality combination classification, serving as an adversarial classifier in the manner of Pix2Pix.26 There are three possible combinations, i.e., $\{f(G_{TC}(x_t)), f(x_t)\}$, $\{f(G_{TC}(x_t)), f(G_{TC}(x_t))\}$, and $\{f(x_t), f(x_t)\}$. We denote the ground-truth combination as a one-hot vector $y \in \{0, 1\}^3$. The combination classification loss can be formulated as:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{3} y_i \log C_i(\cdot), \tag{4}$$

where $C_i(\cdot)$ is the classifier prediction for the $i$-th class. Our classifier $C$ and feature extractor $f$ play an asymmetric adversarial game to encourage $f$ to eliminate the modality information.27 Rather than maximizing the cross-entropy loss, $f$ minimizes the KL divergence between its softmax prediction and a uniform distribution $\mathcal{U}$. Specifically, we minimize the following loss:

$$\mathcal{L}_{KL} = D_{KL}(C(\cdot) \,\|\, \mathcal{U}). \tag{5}$$

We note that the modality and position labels of the two sampled patches are known, and can therefore be used for supervised training.
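The asymmetric game of Eqs. (4)-(5) can be sketched as below; tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def classifier_ce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (4): 3-class cross-entropy minimized by the classifier C;
    `logits` has shape (N, 3) and `target` holds class indices."""
    return F.cross_entropy(logits, target)

def kl_to_uniform(logits: torch.Tensor) -> torch.Tensor:
    """Eq. (5): D_KL(C(.) || U), minimized by f instead of maximizing the
    cross-entropy; for k classes, D_KL(p || U) = sum_i p_i log p_i + log k."""
    log_p = F.log_softmax(logits, dim=1)
    k = logits.shape[1]
    return (log_p.exp() * log_p).sum(dim=1).mean() + torch.log(torch.tensor(float(k)))
```

When `kl_to_uniform` reaches zero, the classifier output is uniform over the three combinations, i.e., the patch embeddings no longer carry modality information.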

Other than removing the modality information from the embedding of $f$, it is necessary to preserve the anatomical information. In case (a), we expect the $\mathcal{L}_1^f$ loss between the embeddings of the two patches to be minimized:

$$\mathcal{L}_1^f = \left\| f(G_{TC}(x_t)) - f(x_t) \right\|_1. \tag{6}$$

Similar to self-training,2 we adopt an alternating training scheme, which first fixes the structure extraction module to update $G_{TC}$ with the $\mathcal{L}_1^f$ loss, and then fixes $G_{TC}$ to update the structure extraction module with the same loss.

After several iterations, the structure extraction module can be expected to embed the anatomical information and filter out the modality information, and the CycleGAN can be well regularized by the $\mathcal{L}_1^f$ loss, i.e., the structural consistency constraint. We use $f$ in place of the manually defined $M$ in Yang et al.16 to achieve translation with better structural consistency. In testing, only the trained CycleGAN part is used for translation.
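A minimal sketch of the alternating protocol, assuming generic `nn.Module`s for the translator and the structure extractor; the optimizers and per-batch update granularity are illustrative.

```python
import torch
import torch.nn as nn

def l1_f(f: nn.Module, g_tc: nn.Module, x_t: torch.Tensor) -> torch.Tensor:
    """Eq. (6): L1 distance between structure embeddings of input and output."""
    return (f(g_tc(x_t)) - f(x_t)).abs().mean()

def alternating_step(f, g_tc, opt_g, opt_f, x_t):
    # Round 1: freeze the structure extractor f, update the translator G_TC.
    for p in f.parameters():
        p.requires_grad_(False)
    opt_g.zero_grad()
    l1_f(f, g_tc, x_t).backward()
    opt_g.step()
    # Round 2: freeze G_TC, update the structure extraction module.
    for p in f.parameters():
        p.requires_grad_(True)
    for p in g_tc.parameters():
        p.requires_grad_(False)
    opt_f.zero_grad()
    l1_f(f, g_tc, x_t).backward()
    opt_f.step()
    for p in g_tc.parameters():
        p.requires_grad_(True)
```

In the full scheme, the $\mathcal{L}_{CE}$/$\mathcal{L}_{KL}$ terms of Sec. 3.2 would be added in round 2, which keeps $f$ from collapsing to a trivial constant embedding that minimizes $\mathcal{L}_1^f$ alone.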

4. RESULTS

During training, we used 1,500 unpaired tagged MR images with horizontal tag patterns and 1,500 cine MR images from a total of ten subjects. In addition, two subjects (416 slice pairs) and eight subjects (1,560 slice pairs) were used for hyper-parameter validation and evaluation, respectively.

For a fair comparison, we resized the tagged and cine MR images to 256×256 and adopted the $G_{TC}$ and $G_{CT}$ backbones from CycleGAN.3 For the structure encoder $f$, we used five fully convolutional layers, which produce features of size 32 × 32 × 128. The classifier $C$ has two convolutional layers and two fully connected layers with a three-class output. We used the PyTorch deep learning toolbox for our implementation, and an NVIDIA V100 GPU for training, which took about 6 hours for 200 epochs.
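The described structure encoder can be sketched as follows; the text fixes only the input size (256×256), the layer count (five fully convolutional layers), and the output size (32 × 32 × 128), so the kernel sizes, strides, and intermediate channel widths below are assumptions.

```python
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):
    """Five fully convolutional layers: 1 x 256 x 256 -> 128 x 32 x 32."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 128]
        layers = []
        for i in range(5):
            stride = 2 if i < 3 else 1  # three stride-2 layers: 256 -> 32
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, stride, 1),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers[:-1])  # no activation on the last layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```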

The synthesis results using a GAN without the cycle constraint, CycleGAN,3 and our proposed method are shown in Fig. 2. We can see that CycleGAN produced realistic images, while failing to achieve structural consistency. By contrast, our method was able to keep the position and shape of the tongue consistent between the two modalities. The resulting images were expected to have realistic appearances and to be structurally consistent with their corresponding paired ground truth $x_c$. For our quantitative evaluation, we used four evaluation metrics: mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and inception score (IS).1 Table 1 lists numerical comparisons for the eight testing subjects. The proposed framework outperformed the other two comparison methods w.r.t. SSIM, PSNR, and IS. Of note, none of the compared methods uses the L1 minimization objective of the paired method.1

Figure 2.

Figure 2.

Comparison of different unpaired tagged-to-cine MR generation methods, including the vanilla GAN (i.e., the half direction of CycleGAN), CycleGAN,3 and our proposed method. * indicates the first attempt at unpaired tagged-to-cine MR image synthesis.

Table 1.

Numerical comparisons of three methods in testing across 1,560 slice pairs. The best results are in bold.

Methods L1 ↓ SSIM ↑ PSNR ↑ IS ↑
GAN 183.8±0.2 0.8863±0.0015 25.84±0.07 7.19±0.14
CycleGAN 174.2±0.3 0.9014±0.0017 27.33±0.06 8.86±0.13
Proposed 168.5±0.3 0.9237±0.0014 29.52±0.08 9.91±0.16

5. CONCLUSION

In this work, we proposed a novel input-output image patches self-training scheme to achieve a disentanglement of the anatomical structures and imaging modalities. The structure extraction module and the tagged-to-cine MRI translator $G_{TC}$ were trained with an alternating training protocol. Our translator was able to achieve structure-aware translation, as demonstrated by the tagged-to-cine MRI synthesis task. Both qualitative and quantitative evaluation results showed that our framework outperformed CycleGAN for unpaired training. Our approach can be a simple add-on module to CycleGAN or other cycle-constrained translators, e.g., DiscoGAN and DualGAN. Additionally, our method can be applied to other modality synthesis tasks, including MRI-to-CT synthesis, which we leave for future work.

ACKNOWLEDGMENTS

This work was supported by NIH R01DC014717, R01DC018511, and R01CA133015.

REFERENCES

  • [1].Liu X, Xing F, Prince JL, Carass A, Stone M, El Fakhri G, and Woo J, “Dual-cycle constrained bijective vae-gan for tagged-to-cine magnetic resonance image synthesis,” in [2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)], 1448–1452, IEEE; (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Liu X, Xing F, Stone M, Zhuo J, Reese T, Prince JL, El Fakhri G, and Woo J, “Generative self-training for cross-domain unsupervised tagged-to-cine mri synthesis,” in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 138–148, Springer; (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Zhu J-Y, Park T, Isola P, and Efros AA, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in [Proceedings of the IEEE international conference on computer vision], 2223–2232 (2017). [Google Scholar]
  • [4].Kim T, Cha M, Kim H, Lee JK, and Kim J, “Learning to discover cross-domain relations with generative adversarial networks,” in [International Conference on Machine Learning], 1857–1865, PMLR; (2017). [Google Scholar]
  • [5].Yi Z, Zhang H, Tan P, and Gong M, “Dualgan: Unsupervised dual learning for image-to-image translation,” in [Proceedings of the IEEE international conference on computer vision], 2849–2857 (2017). [Google Scholar]
  • [6].Liu X, Guo Z, Li S, Xing F, You J, Kuo C-CJ, El Fakhri G, and Woo J, “Adversarial unsupervised domain adaptation with conditional and label shift: Infer, align and iterate,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision], 10367–10376 (2021). [Google Scholar]
  • [7].Liu X, Li S, Ge Y, Ye P, You J, and Lu J, “Recursively conditional gaussian for ordinal unsupervised domain adaptation,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision], 764–773 (2021). [Google Scholar]
  • [8].Liu X, Xing F, El Fakhri G, and Woo J, “A unified conditional disentanglement framework for multimodal brain mr image translation,” in [2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)], 10–14, IEEE; (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Liu X, Liu X, Hu B, Ji W, Xing F, Lu J, You J, Kuo C-CJ, Fakhri GE, and Woo J, “Subtype-aware unsupervised domain adaptation for medical diagnosis,” AAAI (2021). [Google Scholar]
  • [10].He G, Liu X, Fan F, and You J, “Classification-aware semi-supervised domain adaptation,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops], 964–965 (2020). [Google Scholar]
  • [11].Zhang Z, Yang L, and Zheng Y, “Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network,” in [Proceedings of the IEEE conference on computer vision and pattern Recognition], 9242–9251 (2018). [Google Scholar]
  • [12].Liu X, Han Y, Bai S, Ge Y, Wang T, Han X, Li S, You J, and Lu J, “Importance-aware semantic segmentation in self-driving with discrete wasserstein training,” in [Proceedings of the AAAI Conference on Artificial Intelligence], 34(07), 11629–11636 (2020). [Google Scholar]
  • [13].Liu X, Ji W, You J, Fakhri GE, and Woo J, “Severity-aware semantic segmentation with reinforced wasserstein training,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], 12566–12575 (2020). [Google Scholar]
  • [14].Liu X, Lu Y, Liu X, Bai S, Li S, and You J, “Wasserstein loss with alternative reinforcement learning for severity-aware semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems (2020).
  • [15].Liu X, Xing F, Gaggin HK, Wang W, Kuo C-CJ, Fakhri GE, and Woo J, “Segmentation of cardiac structures via successive subspace learning with saab transform from cine mri,” arXiv preprint arXiv:2107.10718 (2021). [DOI] [PubMed] [Google Scholar]
  • [16].Yang H, Sun J, Carass A, Zhao C, Lee J, Prince JL, and Xu Z, “Unsupervised mr-to-ct synthesis using structure-constrained cyclegan,” IEEE transactions on medical imaging 39(12), 4249–4261 (2020). [DOI] [PubMed] [Google Scholar]
  • [17].Xing F, Liu X, Stone M, Wedeen VJ, Prince JL, El Fakhri G, and Woo J, “Measuring strain in diffusion-weighted data using tagged magnetic resonance imaging,” in [SPIE Medical Imaging 2022: Image Processing], (2022). [DOI] [PMC free article] [PubMed]
  • [18].Liu X, Kumar BV, Jia P, and You J, “Hard negative generation for identity-disentangled facial expression recognition,” Pattern Recognition 88, 1–12 (2019). [Google Scholar]
  • [19].Liu X, Che T, Lu Y, Yang C, Li S, and You J, “Auto3d: Novel view synthesis through unsupervisely learned variational viewpoint and global 3d representation,” in [Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16], 52–71, Springer; (2020). [Google Scholar]
  • [20].Liu X, Jin L, Han X, and You J, “Mutual information regularized identity-aware facial expression recognition in compressed video,” Pattern Recognition, 108105 (2021). [Google Scholar]
  • [21].He G, Liu X, Fan F, and You J, “Image2audio: Facilitating semi-supervised audio emotion recognition with facial expression image,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops], 912–913 (2020). [Google Scholar]
  • [22].Wang C, Macnaught G, Papanastasiou G, MacGillivray T, and Newby D, “Unsupervised learning for cross-domain medical image synthesis using deformation invariant cycle consistency networks,” in [International Workshop on Simulation and Synthesis in Medical Imaging], 52–60, Springer; (2018). [Google Scholar]
  • [23].Xiang L, Li Y, Lin W, Wang Q, and Shen D, “Unpaired deep cross-modality synthesis with fast training,” in [Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support], 155–164, Springer; (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [24].Xing F, Woo J, Lee J, Murano EZ, Stone M, and Prince JL, “Analysis of 3-D tongue motion from tagged and cine magnetic resonance images,” Journal of Speech, Language, and Hearing Research 59(3), 468–479 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Chen T, Zhai X, Ritter M, Lucic M, and Houlsby N, “Self-supervised gans via auxiliary rotation loss,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], 12154–12163 (2019). [Google Scholar]
  • [26].Isola P, Zhu J-Y, Zhou T, and Efros AA, “Image-to-image translation with conditional adversarial networks,” in [Proceedings of the IEEE conference on computer vision and pattern recognition], 1125–1134 (2017). [Google Scholar]
  • [27].Liu X, Chao Y, You JJ, Kuo C-CJ, and Vijayakumar B, “Mutual information regularized feature-level frankenstein for discriminative recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2021). [DOI] [PubMed]
