Jian Lin1, Chengze Li1*, Haoyun Qin2,3,4, Kwun Wang Chan1, Yanghua Jin3, Hanyuan Liu1, Stephen Chun Wang Choy1, Xueting Liu1
1Saint Francis University 2University of Pennsylvania 3Spellbrush 4Shitagaki Lab
*Corresponding author
Conditionally accepted to appear in ACM SIGGRAPH 2026 Conference Proceedings.
We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Our approach decomposes a single image into fully inpainted, semantically distinct layers with inferred drawing orders — up to 23 layers including hair, face, eyes, clothing, accessories, and more.
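Conceptually, such a decomposition is an ordered stack of inpainted RGBA layers. A minimal sketch of that idea (all names hypothetical; this is not the repository's actual data model):

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """One semantic layer of the decomposition (hypothetical sketch)."""
    name: str        # e.g. "hair_front", "eyes", "jacket"
    rgba: bytes      # inpainted RGBA pixels, fully filled even where occluded
    draw_order: int  # inferred stacking position: higher draws on top

def composite_order(layers):
    """Return layer names back-to-front, following the inferred drawing order."""
    return [l.name for l in sorted(layers, key=lambda l: l.draw_order)]

stack = [
    Layer("face", b"", 1),
    Layer("hair_front", b"", 3),
    Layer("eyes", b"", 2),
]
print(composite_order(stack))  # back-to-front: ['face', 'eyes', 'hair_front']
```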
trailer.mp4 (project trailer video)
```bash
# 1. Create environment
conda create -n see_through python=3.12 -y
conda activate see_through

# 2. Install PyTorch (CUDA 12.8)
pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 \
    --index-url https://download.pytorch.org/whl/cu128

# 3. Install dependencies (includes common utilities and annotators)
pip install -r requirements.txt

# 4. Create assets symlink (you can also copy assets to the root if you prefer)
ln -sf common/assets assets
```

Optional annotator tiers (install as needed):
| Tier | Command | What it adds |
|---|---|---|
| Body parsing | `pip install --no-build-isolation -r requirements-inference-annotators.txt` | detectron2 for body attribute tagging |
| SAM2 | `pip install --no-build-isolation -r requirements-inference-sam2.txt` | SAM2 for language-guided segmentation |
| Instance seg | `pip install -r requirements-inference-mmdet.txt` | mmcv/mmdet for anime instance segmentation |
Note: Always run scripts from the repository root as the working directory.
| Script | Purpose |
|---|---|
| `inference/scripts/inference_psd.py` | Main pipeline — end-to-end layer decomposition → PSD output |
| `inference/scripts/syn_data.py` | Synthetic training data generation utilities |
For the other inference and data-parsing scripts, refer to the codebase and check the docstrings for details.
| Notebook | Description |
|---|---|
| `inference/demo/bodypartseg_sam.ipynb` | Interactive body part segmentation demo with visualization (19 parts) |
For the definition of the complete body tags, refer to `scrap_model.py`.
We have prepared a Hugging Face Space with ZeroGPU; if you register with Hugging Face, you should be able to run 1-2 PSD extractions per day (approximately 2-3 minutes each, at 1280 resolution).
(Copyright Tohoku Zunko Project).
`inference_psd.py` runs the full See-through pipeline: it applies the LayerDiff 3D model for transparent layer generation and the fine-tuned Marigold model for pseudo-depth inference, then stratifies the character into up to 23 semantic layers and exports a layered PSD file. Note that the head and body are separated in two consecutive stages, which may take longer than the original model described in the paper.
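One plausible way to turn per-layer pseudo-depth into a stacking order is to sort layers far-to-near, so nearer layers draw later (on top). This is a simplified sketch under an assumed depth convention, not the paper's actual stratification algorithm:

```python
def drawing_order_from_depth(layer_depths):
    """Sort layer names far-to-near so nearer layers are drawn on top.

    layer_depths: dict mapping layer name -> mean pseudo-depth, where a
    larger value means farther from the camera (an assumed convention).
    """
    return sorted(layer_depths, key=layer_depths.get, reverse=True)

depths = {"hair_front": 0.2, "face": 0.5, "hair_back": 0.9}
print(drawing_order_from_depth(depths))  # ['hair_back', 'face', 'hair_front']
```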
```bash
# Decompose a single image into a layered PSD
python inference/scripts/inference_psd.py \
    --srcp assets/test_image.png \
    --save_to_psd

# Process a directory of images
python inference/scripts/inference_psd.py \
    --srcp path/to/image_folder/ \
    --save_to_psd
```

Output is saved to `workspace/layerdiff_output/` by default. Each result includes:
- A layered `.psd` file with semantically separated layers
- Intermediate depth maps and segmentation masks
Note: This uses our most recent model with 23-layer body part separation (V3).
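If you want a quick, dependency-free sanity check that the export is a valid PSD, the 26-byte file header can be read with the standard library (a sketch; the exact output filename depends on your input):

```python
import struct

def check_psd_header(path):
    """Read the 26-byte PSD file header and return basic info.

    PSD files start with the '8BPS' signature, a 2-byte version (1 for PSD),
    6 reserved bytes, then channel count, height, width, depth, and color mode.
    """
    with open(path, "rb") as f:
        header = f.read(26)
    sig, version, _reserved, channels, height, width, depth, _mode = struct.unpack(
        ">4sH6sHIIHH", header
    )
    if sig != b"8BPS":
        raise ValueError("not a PSD file")
    return {"version": version, "channels": channels,
            "width": width, "height": height, "depth": depth}

# Example (assumes a previous run produced this file; adjust the name):
# info = check_psd_header("workspace/layerdiff_output/test_image.psd")
```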
Once you have finished the layer splitting, you can further process the PSD with `inference/scripts/heuristic_partseg.py` for depth-based or left-right stratification.
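As an illustration of the left-right idea, a layer's mask can be partitioned at its horizontal centroid. This is a simplified stdlib sketch of the concept, not the script's actual heuristic:

```python
def split_left_right(mask):
    """Split a binary mask (list of rows) at its centroid column.

    Returns (left_mask, right_mask), each with the same shape as the input.
    Mirrors the seg_wlr idea in spirit only (hypothetical simplification).
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return mask, mask
    cx = sum(c for _, c in coords) / len(coords)  # centroid column
    left = [[v if c <= cx else 0 for c, v in enumerate(row)] for row in mask]
    right = [[v if c > cx else 0 for c, v in enumerate(row)] for row in mask]
    return left, right

mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
left, right = split_left_right(mask)
print(left)   # [[1, 1, 0, 0], [1, 1, 0, 0]]
print(right)  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```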
```bash
# Split based on depth
python inference/scripts/heuristic_partseg.py seg_wdepth \
    --srcp workspace/test_samples_output/PV_0047_A0020.psd \
    --target_tags handwear

# Left-right split
python inference/scripts/heuristic_partseg.py seg_wlr \
    --srcp workspace/test_samples_output/PV_0047_A0020_wdepth.psd \
    --target_tags handwear-1
```

The default pipeline runs at bf16 precision and requires approximately 12-16 GB of VRAM at 1280 resolution.
12 GB GPUs: Enable group offload to reduce peak VRAM to ~10 GB at 1280 resolution:

```bash
python inference/scripts/inference_psd.py \
    --srcp assets/test_image.png \
    --save_to_psd \
    --group_offload
```

8 GB GPUs: Use the NF4 quantized pipeline, which uses 4-bit quantized model weights. This achieves ~8 GB peak VRAM at 1280 resolution, and VRAM can be reduced further by lowering the resolution together with group offload:
```bash
# Install bitsandbytes (one-time)
pip install -r requirements-inference-bnb.txt

# Run with NF4 quantization (default: group_offload on, depth resolution 720)
python inference/scripts/inference_psd_quantized.py \
    --srcp assets/test_image.png \
    --save_to_psd

# For even lower VRAM, reduce layerdiff resolution to 1024
python inference/scripts/inference_psd_quantized.py \
    --srcp assets/test_image.png \
    --save_to_psd \
    --resolution 1024
```

The quantized models are hosted on HuggingFace and downloaded automatically on first run. Quality is close to the full-precision model (PSNR ~30 dB, SSIM ~0.96 vs the bf16 baseline).
Note: Group offload trades speed for VRAM savings (roughly 1.5x slower). NF4 quantization has minimal speed overhead but reduces model weight memory.
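The weight-memory saving from NF4 is easy to estimate, since NF4 stores roughly 4 bits per parameter versus bf16's 16 (illustrative arithmetic with a hypothetical parameter count; real peak VRAM also includes activations, caches, and quantization scales):

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate weight memory in GiB for a model with n_params parameters."""
    return n_params * bits_per_param / 8 / 1024**3

n = 2_000_000_000  # hypothetical parameter count, not the actual model size
bf16 = weight_memory_gib(n, 16)
nf4 = weight_memory_gib(n, 4)  # NF4: 4 bits per weight, a 4x reduction
print(f"bf16: {bf16:.2f} GiB, nf4: {nf4:.2f} GiB")  # bf16: 3.73 GiB, nf4: 0.93 GiB
```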
We provide a separate repository for preparing the dataset used to train the Live2D parsing model. Please refer to CubismPartExtr for how to download the sample model files and set up your workspace folder.
After that, refer to `README_datapipeline.md` for instructions on running the data-parsing scripts to prepare the dataset for inspection and training.
Once you have prepared your data, you can proceed to the user interfaces. Refer to the UI Readme for instructions on launching the UI.
We currently require the `workspace/datasets/` folder to be located at the repository root to launch the UI, as it contains the sample data for demonstration. We will work on making this more flexible in the future.
We recommend installing the `mmdet` tier dependencies to ensure the UI can launch successfully.
We have our training scripts ready, but we are still working on the documentation. We will release them no later than 2026/04/12. Please stay tuned!
We welcome community contributions and third-party integrations!
If you build tools, extensions, or workflows on top of this project, please let us know by opening an issue or pull request — we would be happy to feature your work here.
- ComfyUI-See-through by @jtydhr88 — Integration for ComfyUI, with node-based workflow and in-browser PSD export. Thank you for the amazing work!
We are also looking for i18n (translation) help for this project; your contributions would be highly appreciated.
We don't think so — at least, not yet.
While we produce 2.5D layer decompositions from a single image, the full Image-to-Live2D pipeline requires significantly more:
- Finer artistic decomposition. Live2D models demand layers designed with specific deformation behaviors in mind. Our automatic decomposition prioritizes semantic correctness, but a Live2D artist would make different artistic choices about how to split layers for natural-looking motion.
- Rigging. After decomposition, a Live2D model needs a deformation mesh, physics parameters, and motion curves — this rigging process is arguably the most critical (and labor-intensive) step, and it is not covered in this project.
- Artistic intent. Professional Live2D works are crafted holistically: the layer structure, inpainting style, and rigging are designed together. Automating one step in isolation cannot replicate this.
That said, we believe our decomposition can serve as a useful starting point for Live2D artists by eliminating some of the most tedious parts of the workflow, such as manual segmentation and occluded-region inpainting.
2026-04-02
- Multiple memory optimizations; added suggestions for low-VRAM users (group offload, NF4 quantization).
This work is funded and substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/FDS11/E02/23).
We would like to thank the following people for their help and support:
- Dingkun Yan and Xinrui Wang for their inspiration and support on the project.
- USTC Student ACG Club "LEO" for kindly providing the sample Live2D model files used for the demonstrations in the paper.
This is an open-source research project. We thank the authors of the following projects that made this work possible:
- LayerDiffuse — Transparent image layer diffusion (Lvmin Zhang is always a legend)
- Marigold — Diffusion-based monocular depth estimation
- Segment Anything (SAM) — Foundation model for segmentation
- Grounding DINO — Open-set object detection
- LaMa — Large mask inpainting
- AnimeInstanceSegmentation — Anime-specific instance segmentation
If you find this work useful, please cite:
```bibtex
@article{lin2026seethrough,
  title={See-through: Single-image Layer Decomposition for Anime Characters},
  author={Lin, Jian and Li, Chengze and Qin, Haoyun and Chan, Kwun Wang and Jin, Yanghua and Liu, Hanyuan and Choy, Stephen Chun Wang and Liu, Xueting},
  journal={arXiv preprint arXiv:2602.03749},
  year={2026}
}
```