Generating test data for an image-reconstructing Transformer/Perceiver IO

I am working on a problem where I need to reconstruct an image in a Pix2Pix-like manner. The data has some attributes that would make a Transformer/Perceiver IO favourable in my opinion (e.g. the y axis carries location information, but a row is not necessarily adjacent to the row above it, so convolutions assume a neighbourhood structure that is not present in the data).

I am trying to build a custom Keras version of Perceiver IO at the moment. I have put many hours into it, but it simply doesn't work. That's why I tried to scale the problem down: I generate random values in a 4×4 matrix and use them as both input and output (including a positional encoding). This doesn't work either, and I am wondering whether it is the wrong approach. I know the attention mechanism is invariant to its input order, but positional encoding should solve this problem. Shouldn't it be possible to map random inputs to the same outputs? Or am I missing something? What would a suitable testing dataset look like in my case?
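For reference, the toy setup I describe (random 4×4 values used as both input and target, with a fixed sinusoidal positional encoding concatenated onto each flattened token) looks roughly like this; the function name, the number of encoding channels, and the way the encoding is attached are just illustrative choices, not from my actual model:

```python
import numpy as np

def make_identity_dataset(n_samples=1000, size=4, seed=0):
    """Toy identity-mapping dataset: each sample is a flattened
    size x size matrix of random values; the target is the same
    matrix. (Illustrative helper, not the real pipeline.)"""
    rng = np.random.default_rng(seed)
    # (n_samples, 16, 1): one scalar value per "pixel" token
    x = rng.uniform(0.0, 1.0, size=(n_samples, size * size, 1))

    # Fixed sinusoidal positional encoding over the 16 positions,
    # with 4 channels, concatenated as extra token features.
    n_pe = 4
    pos = np.arange(size * size)[:, None]        # (16, 1)
    dims = np.arange(n_pe)[None, :]              # (1, 4)
    angles = pos / (10000.0 ** (dims / n_pe))
    pe = np.zeros((size * size, n_pe))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])

    pe_batch = np.repeat(pe[None, :, :], n_samples, axis=0)
    x_with_pe = np.concatenate([x, pe_batch], axis=-1)  # (n, 16, 5)
    y = x  # reconstruct the input values themselves
    return x_with_pe, y

x, y = make_identity_dataset()
```

Every sample here shares the same positional-encoding channels, so in principle the model only has to learn to copy the first feature channel through to the output.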

Source: Python-3x Questions