In my first two articles, I wrote about an AI that audits my infrastructure, then an AI I build applications with. This time the topic is more down to earth, but it is one of the projects I am proudest of: migrating an entire financial infrastructure's virtualization, from a proprietary hypervisor to Proxmox, in production, without ever stopping the trading floor.

Here is why, how, and above all the traps, because that is where the real work hides.

Why touch a hypervisor that works?

That was the first question I got. Virtualization was running, it was stable. Why take the risk? Three reasons.

First, cost. The proprietary licenses cost me tens of thousands of francs every year. A Proxmox subscription is a few hundred euros. Over time, that stops being a budget line and becomes an argument.

Second, independence. When your virtualization layer belongs to a single vendor, the vendor decides the prices, the pace, the features, and when your version reaches end of support. In a small team, that dependency is a silent risk.

The third is the most important, and it is the thread of this blog: open source is text. A Proxmox config is files. So it is versionable in Git, readable by an AI, auditable, reproducible. A proprietary hypervisor is a black box that neither Git nor my AI agent can really work with. Moving to Proxmox was not just a change of software: it made my infrastructure legible to my tools.

The setup: two sites, a handful of servers, about twenty machines

To follow the rest, you need the setting.

On one side, the main data center: three recent servers (dual socket, plenty of RAM each) running a proprietary hypervisor, carrying most of the critical machines, including the domain controllers, the virtualized firewalls, and the market-data servers.

On the other side, a second site, with an aging Proxmox cluster on end-of-life hardware, hosting a dozen internal services: monitoring, Git repositories, portals, application building blocks. So the goal was not only to leave the proprietary vendor. It was also to consolidate both worlds onto one fresh Proxmox cluster, and keep only a minimal node on the second site as a recovery site. Two projects in one, and always the same rule: nothing stops.

First step: clean house

The best migration is the one you do not have to do.

Before touching anything, I went through the inventory carefully. Out of the twenty-odd machines, several had been powered off for months, obsolete, or replaced by something else: an old management server, templates, a directory connector migrated elsewhere, decommissioned application servers. Every machine removed before the migration is one less to migrate, test, and document. The scope shrank before the first switchover. It is the least technical step and the most profitable.

The constraint: the trading floor never stops

No question of shutting everything down over a weekend and praying. On a trading infrastructure, the big bang is forbidden.

So the method was a rolling migration. I consolidate the machines onto fewer hosts to free one completely. I reinstall that host as Proxmox. I create the cluster. I migrate the machines one by one into the new world, then free the next host, and start again.

The scary question is capacity: can a single host carry the others while you drain them? I did the math before, not during. Eight active machines consolidated came to about 80 GB of RAM out of 128 available, with headroom on the cores. It fit. You do not consolidate blind: you consolidate when the numbers say it holds. At every moment, the service ran in parallel on the old or the new platform. The desk saw nothing.
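The check itself is simple arithmetic, and it is worth writing down before draining anything. Here is a minimal sketch: the per-machine RAM figures and the 16 GB reserved for the host are hypothetical values chosen to add up to the roughly 80 GB mentioned above, not the real inventory.

```python
from dataclasses import dataclass


@dataclass
class Host:
    ram_gb: int
    reserved_gb: int = 16  # assumption: RAM kept free for the hypervisor itself


def fits(host, vm_ram_gb):
    """Return (ok, total_gb, budget_gb) for a list of VM RAM sizes in GB."""
    total = sum(vm_ram_gb)
    budget = host.ram_gb - host.reserved_gb
    return total <= budget, total, budget


# Hypothetical sizes for the eight active machines (sum: 80 GB).
vms = [16, 16, 12, 12, 8, 8, 4, 4]

ok, total, budget = fits(Host(ram_gb=128), vms)
print(f"{total} GB needed, {budget} GB usable, fits: {ok}")
```

The same function answers the opposite question: with these numbers, a 64 GB host would not hold the load, and it is better to learn that on paper.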

Figure: rolling migration from Hyper-V to Proxmox, live. Free a host, reinstall it as Proxmox, move the machines over, then start again. The service runs in parallel, without interruption.

Three ways to move a machine

Moving a VM from one hypervisor to another is not a single operation. Depending on the machine, I used three approaches.

Conversion. For classic Windows servers, I shut the machine down, convert its virtual disk to Proxmox's format (qemu-img convert), recreate a KVM VM with a VirtIO controller and network card, then install the VirtIO drivers at boot so Windows sees its disk and network. Without those drivers, Windows sees neither disk nor network card: that is the classic trap. For a critical trading server, the operation happens outside market hours, and only if its high-availability peer is healthy.
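For the disk conversion step, here is a minimal sketch of how the qemu-img convert call can be assembled. The VHDX source format, the qcow2 target and both paths are assumptions for illustration. The function only builds the command, so it can be reviewed before anything touches a disk.

```python
import shlex


def qemu_img_convert_cmd(src, dst, src_format="vhdx", dst_format="qcow2"):
    """Build the qemu-img command line; nothing is executed here."""
    return [
        "qemu-img", "convert",
        "-p",              # show progress
        "-f", src_format,  # format of the source disk
        "-O", dst_format,  # format of the converted disk
        src, dst,
    ]


# Hypothetical paths: adapt to where the exported disk actually lives.
cmd = qemu_img_convert_cmd("/mnt/export/srv-app01.vhdx", "/var/tmp/srv-app01.qcow2")
print(shlex.join(cmd))
```

Once the command looks right, it can be passed to subprocess.run(cmd, check=True) on a machine where qemu-img is installed, with the source VM powered off.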

Container rebuild. For Linux services, I did not convert: I rebuilt. A lightweight LXC container, the application reinstalled cleanly, the data reimported. It is more work than copying a disk, but you gain density, you leave behind years of accumulated cruft, and you end up with a clean, documented, reproducible machine.

The domain controller case. That one, you never convert. Cloning Active Directory by copying a disk risks a USN rollback: the directory reverts to an earlier state and silently corrupts replication, sometimes weeks later. The right method is longer but it is the only clean one: stand up a fresh controller, promote it, let the directory replicate, check there are no errors, transfer the FSMO roles, then demote the old one and take over its address. No disk copy anywhere in the chain.

The trap I did not see coming: shared storage

Here is the mistake that cost me the most time, and the lesson I keep most.

My first two Proxmox nodes pointed at the same volume on the storage array, over multipath iSCSI. On paper, that was shared storage: live migration between nodes should work. Except it did not. The volume was formatted as LVM-thin, and Proxmox does not treat LVM-thin as shared storage: each node keeps its own view of the thin-pool metadata, locally. Both nodes see the same physical disk but each believes it is the sole owner. As soon as you attempt a live migration, the metadata desynchronizes and the operation aborts with a transaction error.

Figure: one LUN over iSCSI with LVM-thin (node 1 and node 2 each keep local metadata: two isolated views, no live migration) versus one NFS share (node 1 and node 2 share a unified view: live migration OK).
The same disk, two outcomes. With LVM-thin, each node believes it is the sole owner of the metadata: live migration impossible. With NFS, the sharing is real: the machine moves live.
Shared storage is not a checkbox. The type of sharing matters as much as the sharing itself.

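So I moved everything to shared NFS, with VM disks in qcow2: there, live migration finally works for the VMs, and containers move between nodes with a short restart (Proxmox migrates LXC containers in restart mode, not live). The iSCSI LVM was decommissioned once the switchover was done.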
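To avoid relearning this the hard way, the rule can be kept as a small lookup. This is a sketch based on my reading of the Proxmox storage documentation, not an official API: the type names follow those used in the storage configuration, and the function name is invented for the example.

```python
# Storage types Proxmox can treat as shared between nodes (my reading of the docs).
SHARED = {"nfs", "cifs", "cephfs", "rbd", "iscsi"}

# Thick LVM is shareable only when it sits on a shared LUN (iSCSI or FC).
CONDITIONAL = {"lvm"}

# Local by design: each node has its own view.
LOCAL_ONLY = {"lvmthin", "dir", "zfspool"}


def shared_capable(storage_type):
    """Return 'yes', 'conditional', 'no' or 'unknown' for a storage type."""
    if storage_type in SHARED:
        return "yes"
    if storage_type in CONDITIONAL:
        return "conditional"
    if storage_type in LOCAL_ONLY:
        return "no"
    return "unknown"


for kind in ("lvmthin", "lvm", "nfs", "rbd"):
    print(f"{kind}: {shared_capable(kind)}")
```

The point of the "conditional" value is the trap described above: seeing the same LUN from two nodes is necessary, but the volume manager on top decides whether the sharing is real.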

And the real target, eventually, is distributed storage (Ceph) across three nodes: several NVMe disks per server pooled and replicated three times, over a dedicated 25 Gbit/s network. No more single array as a point of failure, and machine migration between nodes that no longer depends on it. That is the next step.
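"Replicated three times" has a direct cost in usable space, which is worth computing before ordering disks. Here is a rough sketch. The disk count and size are hypothetical, since the final hardware is not decided here, and the calculation ignores Ceph's own overhead and the free space a cluster needs to rebalance.

```python
def usable_capacity_gb(nodes, disks_per_node, disk_gb, replicas=3):
    """Raw capacity divided by the replication factor, in whole GB."""
    raw = nodes * disks_per_node * disk_gb
    return raw // replicas


# Hypothetical layout: three nodes, four 3840 GB NVMe disks each.
raw_gb = 3 * 4 * 3840
usable_gb = usable_capacity_gb(nodes=3, disks_per_node=4, disk_gb=3840)
print(f"raw: {raw_gb} GB, usable with 3 replicas: {usable_gb} GB")
```

With three nodes and three replicas, the usable space is simply one node's worth of disks: every byte lives on all three servers.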

Figure: Proxmox + Ceph cluster. Three nodes linked over 25 GbE, plus a DR node on site 2 fed by asynchronous replication.
The target: distributed storage (Ceph) spread across three nodes, with no single array as a point of failure, and a recovery node on the second site.

Migrating thirteen services in two days

For the second site's services, no disk conversion: I used backup as the migration vehicle. The Proxmox backup server already produces an image of each container. Migration becomes trivial: back up on the old site, copy the image to the new cluster, restore onto the shared storage, and the container comes back, identical, elsewhere. In two days, thirteen services made the move: monitoring, metrics, logs, SIEM, Git repositories, a CI runner, a proxy, market-data building blocks.
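The backup, copy, restore sequence can be sketched as three commands per container. This assumes plain vzdump archives moved with rsync rather than a restore from a Proxmox Backup Server datastore. The container ID, archive name, directory, host name and storage name are all placeholders. The function only builds the commands, so the plan can be reviewed before anything runs.

```python
def migration_commands(ctid, archive, dump_dir, target_host, target_storage):
    """Return the three commands, in order: back up, copy, restore."""
    archive_path = f"{dump_dir}/{archive}"
    return [
        # 1. On the old node: back up the container to a local directory.
        ["vzdump", str(ctid), "--mode", "snapshot",
         "--compress", "zstd", "--dumpdir", dump_dir],
        # 2. Copy the archive to the new cluster.
        ["rsync", "-a", "--progress", archive_path, f"{target_host}:{dump_dir}/"],
        # 3. On the new node: restore onto the shared storage.
        ["pct", "restore", str(ctid), archive_path, "--storage", target_storage],
    ]


# Placeholder values; the real archive name is whatever vzdump produced in step 1.
steps = migration_commands(
    ctid=105,
    archive="vzdump-lxc-105-example.tar.zst",
    dump_dir="/var/tmp/migrate",
    target_host="pve-new-1",
    target_storage="nfs-shared",
)
for step in steps:
    print(" ".join(step))
```

Keeping the same container ID on both sides is a choice made for the example. It only works if that ID is free on the new cluster.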

One trap was still waiting: a container restored in privileged mode from an unprivileged backup ends up with overly restrictive permissions on its /etc. Services then refuse to read their own configuration and crash at startup, with no clear message. The fix is one command, if you know to look for it.

The small miseries (that nobody documents)

A migration plan fits on one page. What does not fit on one page is the list of small things that break. A few lived examples:

  • The network bond (LACP). The link refused to forward traffic: the switch's MAC address table stayed empty even though the LACP negotiation was reported as healthy. The fix is counter-intuitive: remove the bond, validate on a single link, add it back, then reboot the host so the negotiation happens cleanly at boot.
  • Two-factor blocking the cluster. Two-factor authentication on the admin account made adding a node fail, with an unhelpful error. I had to disable it temporarily, join the node, then re-enable it.
  • The Linux containers. The package manager no longer resolving DNS in a privileged container, system services failing because they demand namespaces that are unavailable, an SSH login stretched by twenty-five seconds by a session service spinning in the void.
  • The small wins, too. Enabling memory deduplication across machines, tuning the system's tendency to swap: details that, stacked together, fit more machines on the same iron.
A migration is 10% plan and 90% small miseries. Nobody documents them, so I document them.

Everything points to the old address

Migrating a machine often changes its IP address. And you then discover how many other systems had that address hard-coded somewhere.

After the switchover, a series of alerts went red. Nothing actually broken: it was the monitoring still querying the old addresses. The dashboard's data source pointed to the metrics server's old IP, so every alert fell into error. A firewall rule only allowed a flow from the old IP. The monitoring targets had not followed the move.

The lesson: a migration is not done when the machine reboots elsewhere. It is done when everything that references it has been updated: the network, DNS, monitoring, security rules, backups. That is the invisible half of the work.
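One way to shrink that invisible half is to search every configuration repository for the old address before the switchover. Here is a minimal sketch. The file names and addresses in the demo are invented, and the scan only covers text files under one directory, so anything stored in a database or a web UI still has to be checked by hand.

```python
import re
import tempfile
from pathlib import Path


def find_ip_references(root, old_ip):
    """List (file name, line number, line) for each line mentioning old_ip."""
    # Digit guards so that 10.0.0.5 does not match 10.0.0.50 or 110.0.0.5.
    pattern = re.compile(r"(?<!\d)" + re.escape(old_ip) + r"(?!\d)")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, number, line.strip()))
    return hits


# Demo on throwaway files standing in for a config repository.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "datasource.yml").write_text("url: http://10.0.0.5:9090\n")
    Path(tmp, "firewall.rules").write_text("allow from 10.0.0.50\nallow from 10.0.0.5\n")
    demo_hits = find_ip_references(tmp, "10.0.0.5")

for name, number, line in demo_hits:
    print(f"{name}:{number}: {line}")
```

Run against the Git checkout of the monitoring and firewall configuration, the list of hits becomes the checklist of references to update.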

The incident, because honesty matters

A few weeks later, a storage volume filled to the brim. In cascade, about fifteen machines went into I/O error. Disorienting symptom: the Windows VMs answered ping but all their ports were dead, a half-woken zombie; the Linux containers flipped to read-only.

Recovery followed a precise order, and the order matters: grow the volume first, then restart the VMs, then reboot the containers in batches, then repair the damaged permissions along the way. Trying to restart the machines before fixing the storage is driving straight back into the wall.

The lesson, obvious in hindsight: a storage volume must have headroom and automatic growth, plus automatic pruning of old snapshots. I knew it in theory. Not on that volume. You never learn these lessons on other people's volumes.
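Both halves of that lesson fit in a few lines. This is a sketch with assumed thresholds (80% warning, 90% critical) and an assumed retention of seven snapshots. It only decides what to alert on and what to prune. The actual deletion depends on the storage in use and is left out.

```python
import shutil


def usage_level(used_bytes, total_bytes, warn=80, crit=90):
    """Classify a volume by integer percentage used."""
    percent = used_bytes * 100 // total_bytes
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


def check_path(path):
    """Classify the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage_level(usage.used, usage.total)


def snapshots_to_prune(names, keep=7):
    """Return the oldest snapshots beyond `keep`.

    Assumes names sort chronologically, for example ISO dates.
    """
    ordered = sorted(names)
    if keep <= 0:
        return ordered
    return ordered[:-keep]


print(usage_level(95, 100))
print(snapshots_to_prune(["2024-01-01", "2024-01-03", "2024-01-02"], keep=2))
```

Wired into the monitoring stack, the warning level is what buys the time to grow the volume before fifteen machines discover the problem first.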

The outcome

Today, about twenty machines run on the new open-source cluster, split between lightweight containers and VMs. The proprietary licenses are gone from the budget, and one node stays on the second site as backup. But the real gain is not cost.

The real gain is that my virtualization layer became text: versioned in Git, watched by an open-source stack I built alongside it, backed up with immutable snapshots that are my first line of defense against ransomware, and above all legible to my tools, including my AI agent. A small team now owns its virtualization, instead of renting it and waiting on a vendor's support when something breaks at 9am on a trading day.

It is the same thread as my previous articles. Open source and automation are not an engineer's whim. For a small team in a demanding environment, they make the difference between enduring your infrastructure and mastering it.

And if a vendor tells you that its black box is the only serious option for critical production: three servers, a trading floor that never blinked, and a licensing budget divided by a hundred say otherwise.