Ulysses Unbound: Experiments in Communication–Computation Overlap
As video diffusion models scale, sequence lengths get uncomfortably large. A practical answer is context parallelism, and within that family, Ulysses is the canonical approach. The core idea is simple: full sequence, sharded heads. It maps well to modern GPU clusters built for high-throughput all-to-all communication, while still letting you run dense attention at large context lengths.
Ulysses follows a straightforward execution flow:
- QKV projection: each rank computes local Q, K, and V.
- Pre-attention all-to-all: ranks exchange shards so each GPU holds the full sequence for its local subset of heads.
- Attention compute: SDPA runs on gathered Q/K/V.
- Post-attention path: a second all-to-all returns outputs to the sequence-sharded layout, then the output projection is applied.
To make the flow concrete, we start with the pre-attention chunk in baseline Ulysses. Each rank runs three local GEMMs (Q, K, V), then performs three all-to-all exchanges before SDPA. This chunk stage is the main communication hotspot and the optimization target for the rest of the post.
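The layout change performed by the pre-attention all-to-all can be sketched in a single process. This is a numpy shape simulation, not a distributed implementation: the list `pre` stands in for per-rank input shards, and the slicing mimics what the collective moves. All shapes are illustrative.

```python
import numpy as np

# Single-process sketch of the Ulysses pre-attention layout change.
# Before the all-to-all each rank holds (B, S_local, H, D): its sequence
# shard, all heads. After the all-to-all each rank holds
# (B, S_global, H_local, D): the full sequence, a head shard.
P = 4                                  # world size
B, S_global, H, D = 1, 16, 8, 4
S_local, H_local = S_global // P, H // P

x = np.arange(B * S_global * H * D, dtype=np.float32).reshape(B, S_global, H, D)

# Per-rank input shards: sequence-sharded, all heads.
pre = [x[:, r * S_local:(r + 1) * S_local] for r in range(P)]
assert pre[0].shape == (B, S_local, H, D)

# The all-to-all re-shards: rank r sends head-group g of its sequence
# slice to rank g, and receives its own head group from every rank.
post = []
for r in range(P):
    pieces = [pre[src][:, :, r * H_local:(r + 1) * H_local] for src in range(P)]
    post.append(np.concatenate(pieces, axis=1))   # stitch the full sequence
assert post[0].shape == (B, S_global, H_local, D)

# Every element of the global tensor lands on exactly one rank.
recon = np.concatenate(post, axis=2)
assert np.array_equal(recon, x)
```

In the real path this reshuffle happens three times (once each for Q, K, and V), which is why it dominates the chunk's communication cost.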
Benchmark setup. We run a synthetic Ulysses pre-attention benchmark on 8 B200 GPUs (bf16, B=1, H=40, D=128) with fixed GPU clocks and the cuDNN backend for attention compute. Each configuration uses 15 warmup iterations and 40 timed iterations, repeated 3 times; we report max-rank median latency. We report both pre-attention chunk latency (QKV projection + pre-SDPA communication) and end-to-end step latency. Strong scaling keeps global sequence fixed as GPU count grows, while weak scaling keeps local sequence fixed so global sequence grows with world size.
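The timing protocol above can be sketched as a small harness. This is a single-process wall-clock sketch; on GPU you would synchronize the device around each timed region and gather per-rank medians before taking the max, which is elided here. `time_step` is an illustrative name.

```python
import statistics
import time

def time_step(step_fn, warmup=15, iters=40, repeats=3):
    """Median-of-repeats timing mirroring the protocol in the text:
    warmup iterations, then timed iterations, repeated several times.
    Returns the median over repeats of the per-repeat median latency."""
    medians = []
    for _ in range(repeats):
        for _ in range(warmup):
            step_fn()                     # warm caches / JIT / clocks
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            step_fn()
            samples.append(time.perf_counter() - t0)
        medians.append(statistics.median(samples))
    return statistics.median(medians)

latency = time_step(lambda: sum(range(1000)))
assert latency > 0.0
```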
Async Ulysses
Baseline Ulysses leaves performance on the table because the pre-attention path is mostly serialized: each rank computes Q, K, and V, then runs three all-to-all exchanges, then enters SDPA.
Async Ulysses, introduced by ByteDance as part of VeOmni, rests on a simple observation: the Q/K/V branches are independent before attention. That lets us overlap the communication of one branch with the compute of the next, without changing the math.
The resulting schedule is:
- Compute Q, launch Q all-to-all
- Compute K while Q all-to-all is in flight, then launch K all-to-all
- Compute V while K all-to-all is in flight, then launch V all-to-all
- Wait/restore gathered tensors, then run SDPA
The resulting overlap is shown in the diagram below.
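The schedule can also be sketched as a single-process simulation: a background thread stands in for an in-flight all-to-all while the main thread runs the next projection. The sleeps are placeholders for kernel durations, and the names are illustrative.

```python
import threading
import time

log = []

def comm(name):
    # Stand-in for an async all-to-all: runs off the main thread.
    time.sleep(0.2)
    log.append(f"{name}_a2a_done")

def compute(name):
    # Stand-in for a projection GEMM on the main thread.
    time.sleep(0.05)
    log.append(f"{name}_proj_done")

handles = []
for branch in ("q", "k", "v"):
    compute(branch)                              # compute this branch...
    t = threading.Thread(target=comm, args=(branch,))
    t.start()                                    # ...launch its all-to-all
    handles.append(t)                            # ...and move on immediately

for t in handles:                                # wait/restore before SDPA
    t.join()
log.append("sdpa_start")

# K's projection finished while Q's all-to-all was still in flight.
assert log.index("k_proj_done") < log.index("q_a2a_done")
```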
There are two practical ways to implement this overlap. The first is the classic NCCL route: launch collectives with async_op=True, carry Work handles, and place explicit waits later. The second is PyTorch functional collectives (torch.distributed._functional_collectives), which return AsyncCollectiveTensor values so launch/wait behavior is expressed directly in the tensor flow. We use the functional path here because it is newer, composes better with torch.compile, and makes the overlap schedule cleaner to write and reason about.
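Hedged sketches of the two routes, with group plumbing simplified and equal split sizes assumed; `exchange_classic` and `exchange_functional` are illustrative names (imports are kept inside the functions so the sketch stands alone):

```python
def exchange_classic(t, group):
    """Classic route: async_op=True returns a Work handle; the caller
    must wait on it before reading the output buffer."""
    import torch
    import torch.distributed as dist
    out = torch.empty_like(t)
    work = dist.all_to_all_single(out, t, group=group, async_op=True)
    return out, work          # caller: work.wait() before using out

def exchange_functional(t, group):
    """Functional route: the collective returns an AsyncCollectiveTensor,
    so the wait is carried by the tensor itself (implicit on first use,
    or explicit via funcol.wait_tensor)."""
    import torch.distributed._functional_collectives as funcol
    # None split sizes mean an even split across the group.
    return funcol.all_to_all_single(t, None, None, group)
```

With the functional route, the schedule above falls out naturally: launch all three exchanges, keep computing, and let the waits happen where the gathered tensors are first consumed.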
Async Ulysses does what we want: chunk latency drops by about 23–25% at 2/4/8 GPUs, while end-to-end improves by ~3%. The overlap removes pre-attention serialization, but SDPA and output still dominate total step time.
Async Ulysses with Symmetric Memory
There is a subtle limitation in our current Async Ulysses path. Even though we launch collectives asynchronously on separate streams, they can still run on the same underlying SM resources. NCCL collectives are implemented as GPU kernels that issue loads/stores and reduction work, so they compete with GEMMs for SM cycles. In practice, timeline overlap does not always translate into throughput overlap: communication and computation can overlap in time while still throttling each other.
To reduce this contention, we keep Async Ulysses unchanged and change only the transport layer. Routing pre-attention transfers through the Copy Engine separates data movement from SM compute, which leaves more uninterrupted SM capacity for GEMMs and attention kernels. We use PyTorch Symmetric Memory to access that path while keeping the same attention math and execution schedule.
Async Ulysses with Symmetric Memory (async_symm) is better at 2 and 4 GPUs, but regresses at 8 GPUs. When payloads are larger, copy-engine transport reduces SM contention and improves effective overlap; at 8 GPUs payloads are smaller under strong scaling, so Symmetric Memory's fixed overheads (buffering/signaling/bookkeeping) are harder to amortize.
Fused QKV Projections
Async Ulysses proved overlap works. A natural follow-up to overlap is fusion: reduce how many kernels and collectives the pre-attention path launches. Instead of running separate Q/K/V projections and separate communication exchanges, each rank builds one packed local weight shard (Q|K|V rows for its local heads) and calls torch.ops.symm_mem.fused_all_gather_matmul. That single op performs sequence all-gather and local-head matmul together, returning a packed (B, S_global, 3*H_local*D) output. We then split the packed output into Q/K/V and reshape back to gathered local-head tensors for SDPA.
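A sketch of the fused path, assuming the experimental `torch.ops.symm_mem.fused_all_gather_matmul` signature (`A_shard, [Bs], gather_dim, group_name`) and a column-packed Q|K|V weight; exact packing orientation and the wrapper name `fused_qkv` are assumptions for illustration (the import stays inside the function so the sketch stands alone):

```python
def fused_qkv(x_local, w_qkv_packed, group_name):
    """One fused op replaces the sequence all-gather plus three GEMMs.

    x_local:      (B, S_local, model_dim) sequence shard
    w_qkv_packed: (model_dim, 3 * H_local * D) packed Q|K|V columns for
                  this rank's local heads
    group_name:   process group registered for symmetric memory
    """
    import torch
    B, S_local, model_dim = x_local.shape
    a = x_local.reshape(B * S_local, model_dim)     # op expects a 2D A shard
    # Gathers `a` along dim 0 across ranks and multiplies by the packed weight.
    a_full, (qkv,) = torch.ops.symm_mem.fused_all_gather_matmul(
        a, [w_qkv_packed], gather_dim=0, group_name=group_name
    )
    qkv = qkv.reshape(B, -1, w_qkv_packed.shape[1])  # (B, S_global, 3*H_local*D)
    q, k, v = qkv.chunk(3, dim=-1)                   # unpack for SDPA
    return q, k, v
```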
This is the strongest chunk optimization at 2 and 4 GPUs (-37.3%, -33.4%) and gives the best end-to-end gains there (-5.0%, -4.8%). At 8 GPUs the benefit mostly vanishes (-4.6% chunk, -0.3% total): messages are already small under strong scaling, so fixed fusion overheads take over.
Weak Scaling
Strong scaling can hide communication costs because local sequence shrinks with more GPUs. So we also run weak scaling: keep local sequence fixed and let global sequence grow.
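The arithmetic behind the two regimes is worth making explicit. The 16K weak-scaling local length matches the text; the 128K strong-scaling global length is an illustrative choice.

```python
def seq_lengths(world_sizes, strong_global=128_000, weak_local=16_000):
    """Per-GPU and global sequence lengths under the two scaling regimes."""
    rows = []
    for p in world_sizes:
        rows.append({
            "gpus": p,
            "strong_local": strong_global // p,  # global fixed, local shrinks
            "weak_global": weak_local * p,       # local fixed, global grows
        })
    return rows

for row in seq_lengths([2, 4, 8]):
    print(row)
```

Under strong scaling, per-rank messages shrink as GPUs are added, which is exactly why fixed overheads start to dominate at 8 GPUs; under weak scaling, message sizes stay constant while total traffic grows.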
For weak scaling at local-seq 16K, Fused QKV is best at 2 and 4 GPUs, while Async Ulysses scales more smoothly and is best at 8 GPUs. Async Ulysses with Symmetric Memory is competitive at lower scale, then degrades faster.
Takeaway: overlap is the most robust high-scale default; fusion is strongest in lower/mid-scale regimes.
Conclusion
- These results come from an 8xB200 node, where all-to-all is already very strong; on slower interconnects, there is likely more communication slack to hide.
- More broadly, the stack is moving toward tighter comm-comp fusion: device-initiated symmetric-memory/NVSHMEM-style communication, Triton kernels, and persistent execution. Kraken is a good cookbook-style reference for that direction.
- Since no single strategy wins every regime, a lightweight runtime policy could choose between packing, overlap, and fusion based on sequence length, world size, and hardware/interconnect characteristics.
- Communication-heavy workloads (for example, MoE routing) should benefit even more than this dense-attention setup.