Demonstration results for multi-modal instruction. The first row shows the visual stimuli, and the second row shows our intermediate reconstructions. The manipulation results via the instruction ...