Taming Transformers for High-Resolution Image Synthesis