True, but macro-op fusion isn't Conroe's only trick. Conroe has 4 decoders, and with macro-op fusion it can sometimes get a 5th instruction through in the same cycle (per core).
AMD64 chips can only execute up to 3 instructions in parallel on each core.
Of course a lot depends on the optimizations used when compiling. Many programs don't even keep more than 1 instruction going at a time, which is why Hyper-Threading worked on P4s: a second thread used the otherwise idle execution resources to make better use of the width of the processor.
Anyway, in 32-bit mode Conroe can execute 1-5 instructions at once, while AMD64 chips can execute 1-3. Assuming that virtually no 64-bit instructions work with macro-op fusion, that still leaves 1-4 instructions for Conroe.
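To put rough numbers on that, here is a toy model of the front end. It is only a sketch and nowhere near cycle-accurate. The assumptions are mine: only a cmp/test followed by a conditional jump can fuse, at most one fused pair gets through per cycle, and every other instruction takes exactly one slot. The names `decode_cycles`, `is_cond_jump` and `loop_body` are made up for the illustration.

```python
# Toy model: how many cycles does a front end need to get a stream through?
# Assumptions (not real hardware rules): cmp/test + conditional jump can
# fuse into one slot, and at most one fused pair is handled per cycle.

FUSIBLE_FIRST = {"cmp", "test"}


def is_cond_jump(op):
    # Hypothetical check: any mnemonic starting with "j" except plain "jmp".
    return op.startswith("j") and op != "jmp"


def decode_cycles(stream, width, fusion=False):
    """Count cycles needed for `stream` with `width` slots per cycle."""
    cycles = 0
    i = 0
    while i < len(stream):
        slots = width
        fused_this_cycle = False
        while slots > 0 and i < len(stream):
            can_fuse = (
                fusion
                and not fused_this_cycle
                and stream[i] in FUSIBLE_FIRST
                and i + 1 < len(stream)
                and is_cond_jump(stream[i + 1])
            )
            if can_fuse:
                i += 2  # the fused pair occupies a single slot
                fused_this_cycle = True
            else:
                i += 1
            slots -= 1
        cycles += 1
    return cycles


# A typical small loop body: some work, then compare and branch.
loop_body = ["mov", "add", "add", "cmp", "jne"]

print(decode_cycles(loop_body, width=4, fusion=True))   # 4-wide with fusion -> 1
print(decode_cycles(loop_body, width=4, fusion=False))  # 4-wide, no fusion -> 2
print(decode_cycles(loop_body, width=3, fusion=False))  # 3-wide -> 2
```

So in this model the 5-instruction loop body fits in one cycle only when the cmp/jne pair fuses, which is the "1-5 versus 1-4" difference in a nutshell.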
And there were other improvements in Conroe as well, like the full 128-bit internal buses, allowing 128-bit SSE instructions to be executed in a single clock cycle.
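As a back-of-the-envelope check on the SSE point, assuming a narrower unit simply has to split a wide operation into datapath-sized pieces (a simplification, and `passes_needed` is just a name I picked):

```python
def passes_needed(op_bits, datapath_bits):
    """Number of passes to push an op of `op_bits` through a datapath
    that is `datapath_bits` wide (ceiling division)."""
    return -(-op_bits // datapath_bits)


print(passes_needed(128, 64))   # 128-bit SSE op on a 64-bit path -> 2
print(passes_needed(128, 128))  # same op on a full 128-bit path -> 1
```

In other words, under that assumption the wider path halves the work per 128-bit SSE instruction.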
Intel and AMD cross-license their technologies with each other, so naturally Intel could use the 64-bit extensions, and likewise AMD is able to use Intel's SSE (1/2/3/4/n). If Intel hadn't been trying to sell their IA-64 architecture, I'm sure they would have built a 64-bit x86 a long time ago. After all, they had no problem going from the 16-bit 286 to the true 32-bit 386.
Still, AMD added 8 new general purpose registers (R8-R15) that are only available in 64-bit mode (present in the Intel version as well), which is a nice bonus. I wouldn't be surprised if a pure 'Intel' design would have simply extended the original 8086 registers to 64 bits, without adding the extra general purpose registers.