I'm a Head of IT at a financial company in Switzerland. Multi-site, Cisco switches, Palo Alto firewalls, data center, critical trading systems, telephony, the daily reality of an on-prem infrastructure in a regulated environment.

Since early 2026, I've been using Claude AI (Anthropic) as a co-pilot to manage this infrastructure. Not a chatbot I ask questions to. An agent that connects via SSH to my equipment, reads configs, audits, documents.

Here's what I've learned in 3 months. The results, the limits, and why I think every IT manager should be looking into this now.

The context: a small team facing an infrastructure that never sleeps

If you manage IT infrastructure with a small team, you know the reality:

  • Dozens of network devices spread across multiple sites
  • Firewalls, VPNs, storage, virtualization
  • Critical systems that tolerate zero downtime
  • And a task list that grows faster than you can handle it

Documentation falls behind. Audits get postponed. Risks accumulate silently. Not from incompetence, from lack of time.

That's where AI entered my operations.

The trigger: "What if AI could read my configs?"

I started using Claude Code (Anthropic's CLI) to document code. Then I realized the agent could connect via SSH, read running configs, and analyze them.

I created a dedicated read-only user on all my equipment:

  • Cisco switches (NX-OS, IOS, IOS-XE): show commands only
  • Palo Alto firewalls (Panorama): superreader role
  • Storage, servers: read-only CLI access

No write access. No configure terminal. No modification possible. Read-only, audit-only.
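On Cisco IOS/IOS-XE, this kind of least-privilege account can be built with a role-based CLI view. A hedged sketch of what that looks like; the view and user names are placeholders, and exact syntax should be verified against your IOS release:

```
! Role-based CLI view limited to show commands
! (requires aaa new-model; placeholder names and secrets)
aaa new-model
!
parser view AUDIT-RO
 secret <view-secret>
 commands exec include all show
 commands exec include exit
!
! Bind a dedicated audit user to the restricted view
username ai-audit view AUDIT-RO secret <user-secret>
```

On Palo Alto, the built-in superreader role achieves the same thing without custom configuration.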

And then everything changed.

Result #1: Full switch fleet audit, over a hundred findings in one day

The first real test was a security audit across all my switches.

The AI connected via SSH to each device, pulled the running config, and analyzed it against a standardized checklist:

  • Password encryption
  • ACLs on management lines
  • Insecure protocols active (HTTP server, Telnet, CDP)
  • Spanning Tree protection (BPDU Guard, Root Guard)
  • Unused ports left open
  • NTP, logging, login banners
  • Firmware and known CVEs

Result: over a hundred findings. Several critical. Normally, this audit would have taken me 2 to 3 weeks of fragmented work. With AI, it was done in a day, with a structured report by device, by severity, with recommendations.
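The mechanics of such a checklist are simple enough to sketch. Here is a minimal, illustrative Python version; the rules and patterns are examples I wrote for this post, not the actual prompt or an exhaustive hardening baseline:

```python
# Toy config-audit checklist: each rule flags a finding when its
# check matches the running config. Patterns are simplified examples.
RULES = [
    ("critical", "HTTP server enabled", lambda cfg: "ip http server" in cfg),
    ("critical", "Telnet allowed on vty", lambda cfg: "transport input telnet" in cfg),
    ("high", "Passwords stored unencrypted", lambda cfg: "service password-encryption" not in cfg),
    ("medium", "No ACL on vty lines", lambda cfg: "access-class" not in cfg),
]

def audit(config: str) -> list[tuple[str, str]]:
    """Return (severity, finding) pairs for one running config."""
    return [(sev, msg) for sev, msg, check in RULES if check(config)]

# Example: a config with HTTP and Telnet on, encryption enabled, no ACL
sample = """\
ip http server
service password-encryption
line vty 0 4
 transport input telnet
"""
findings = audit(sample)
```

The real value is not the rules themselves, it is that the agent applies them uniformly to every device and returns the results already sorted by severity.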

The critical findings were things I knew somewhere in my head, but had never formalized:

  • A switch with HTTP server still active
  • Management lines without ACL on a remote site
  • Firmware with known unpatched CVEs

The report forced me to deal with them. To document is to make visible.

Result #2: Complete infrastructure documentation, from zero to a structured repo

Before AI, my documentation was scattered: notes here, emails there, manually saved configs, diagrams in my head.

In a few months, I built a single Git repo with structured documentation generated from actual configs:

  • Complete inventory: every switch, every port, every VLAN
  • Network topology: WAN diagrams in ASCII art
  • Per-device analysis: detailed documentation of each config (core, distribution, edge, DMZ)
  • Security: VPN tunnels inventoried, firewall rules documented, access policies
  • Per site: each remote office has its own section

Each file was generated by the AI after reading the production config directly, then validated by me. This isn't theoretical documentation, it's the exact reflection of what's running in prod.
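To make the generation step concrete, here is a toy sketch of the idea: parse an IOS-style config and emit a Markdown inventory table. The parsing is deliberately naive and the function name is mine; it only illustrates the config-to-documentation pipeline:

```python
import re

def interfaces_to_markdown(config: str, hostname: str) -> str:
    """Extract interface names and descriptions from an IOS-style
    config and render a Markdown inventory table (toy example)."""
    rows = []
    current = None
    for line in config.splitlines():
        m = re.match(r"interface (\S+)", line)
        if m:
            current = [m.group(1), ""]
            rows.append(current)
        elif current and line.strip().startswith("description "):
            current[1] = line.strip()[len("description "):]
    out = [f"# {hostname}", "", "| Interface | Description |", "|---|---|"]
    out += [f"| {name} | {desc} |" for name, desc in rows]
    return "\n".join(out)

sample = """\
interface GigabitEthernet0/1
 description Uplink to core
interface GigabitEthernet0/2
"""
doc = interfaces_to_markdown(sample, "sw-edge-01")
```

Because the output is Markdown, it diffs cleanly in Git: a port description change in prod shows up as a one-line change in the repo.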

When a vendor asks a question, when an auditor requests an inventory, when I need to justify a budget, everything is there, up to date, in Markdown, versioned in Git.

Result #3: Storage audit, findings divided by 3

My storage system (SAN array, HA, iSCSI + CIFS) had never been deeply audited since deployment.

The AI analyzed the complete configuration and produced a report with dozens of findings:

  • Incomplete high availability configuration
  • SNMP alerts not routed
  • Volumes without snapshot policy
  • Service accounts with excessive privileges

After fixing the critical points, a second audit divided the number of findings by 3. All remaining findings were informational or low priority.

Without AI, this audit would probably never have been done. Not from negligence, from lack of time.

Result #4: Migrating monitoring to an open source stack

Monitoring is where AI had the most structural impact.

I already had a complete monitoring stack, but proprietary. The problem with proprietary solutions is that AI can't really work with them: closed APIs, limited documentation, opaque formats. The agent hits walls on interfaces it can neither read nor automate.

This worked out well. I've been an open source expert for years and had been looking for a reason to migrate. AI gave me the push I was missing.

In a few weeks, I deployed a 100% open source stack with the help of AI:

  • Prometheus: metrics collection (ICMP, SNMP switches/firewalls/storage)
  • Grafana: network, security, storage, compute dashboards
  • Loki: centralized log aggregation
  • Wazuh SIEM: intrusion detection and compliance
  • Alerts: latency, disks, CPU, VPN tunnels down, expiring certificates

The difference with the old stack: everything is in text files, everything is versionable in Git, everything is automatable. AI can read a Prometheus config, propose a new alert, generate a Grafana dashboard in JSON, fix a Wazuh rule. With a proprietary solution, none of that is possible.
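As an example of what "everything in text files" buys you, here is the shape of a Prometheus alerting rule for a down VPN tunnel. The `probe_success` metric comes from the blackbox exporter; the job and label names and the threshold are illustrative, not my production values:

```yaml
# Illustrative alerting rule file (labels and thresholds are examples)
groups:
  - name: network
    rules:
      - alert: VpnTunnelDown
        expr: probe_success{job="blackbox-icmp", target_class="vpn-peer"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "VPN peer {{ $labels.instance }} unreachable for 5 minutes"
```

A rule like this lives in Git, gets reviewed like code, and is exactly the kind of file an AI agent can read, critique, and extend.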

Each component was installed, configured, tested, documented. Every new machine added to the infrastructure now follows a mandatory checklist: hardening, fail2ban, SIEM agent, log collection, metrics, dashboard.

An unmonitored machine is an invisible machine. AI helped me systematize this principle, and open source gave it the means to do so.

What AI can NOT do

After 3 months of intensive use, here are the real limits. And they matter.

It doesn't make business decisions

AI can tell me a firmware has CVEs. It can't decide whether the risk of updating in production (downtime on critical systems) is acceptable on a Tuesday at 2pm. That's my job.

It doesn't replace experience

When it analyzes a config and flags a "problem," you need human context to know whether it's a real problem or a deliberate architectural choice. Disabling certain advanced security features can be a deliberate decision, not an oversight. AI can't know that unless you tell it.

It doesn't handle human relationships

Negotiating with a vendor, convincing management to invest in infrastructure, managing a team under pressure, that remains 100% human.

It can be wrong

It can misinterpret a config, confuse a legacy artifact with an active problem, or suggest a command that doesn't work on a specific OS version. Human review is non-negotiable.

It needs guardrails

I configured read-only access everywhere. No writing, no modification. If AI could execute configuration commands on a production switch, the risk would be unacceptable. Trust is built incrementally.

The real change: from reactive to proactive

The biggest impact of AI on my operations isn't speed. It's the shift in posture.

Before, I spent most of my time in reactive mode: incidents, requests, emergencies. Documentation, audits, optimization, all of that was perpetually "for later."

Now, with an AI agent that can read, analyze and document in parallel with my daily work, I can finally do that foundational work:

  • Regular audits instead of "when I have time" audits (never)
  • Up-to-date documentation instead of scattered notes
  • Proactive monitoring instead of discovering problems when things break
  • Informed decisions instead of undocumented intuitions

For a small IT team managing multi-site infrastructure in a demanding environment, it's a force multiplier.

How to get started (practical guide)

If you're an IT manager, sysadmin, or infrastructure engineer and want to explore this approach, here's the method I recommend:

1. Start with read-only

Create a dedicated user with minimal rights on your equipment. show commands only on switches. Read-only role on your firewalls. SNMP read-only. Build trust before expanding access.
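The read-only principle can also be enforced on the agent side with a command allowlist. A minimal sketch; the function and the prefix list are mine, written for illustration, not a Claude Code feature:

```python
# Read-only command verbs accepted by the wrapper
# (illustrative list, not exhaustive)
ALLOWED_PREFIXES = ("show ", "ping ", "traceroute ")

def is_safe_command(command: str) -> bool:
    """Allow only explicit read-only commands; reject anything
    multi-line or chained with a separator."""
    cmd = command.strip().lower()
    if "\n" in cmd or ";" in cmd:
        return False
    return cmd.startswith(ALLOWED_PREFIXES)
```

Belt and suspenders: the device-side role already blocks writes, and the wrapper refuses to even send a command that is not explicitly read-only.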

2. Start with documentation

This is the safest and most immediately useful use case. Have your running configs analyzed and generate structured documentation. You'll discover things you had forgotten.

3. Move to audits

Once documentation is in place, use AI to systematically audit: security, compliance, vendor best practices. Formalize what you know intuitively.

4. Deploy monitoring

Use AI to deploy and configure a monitoring stack. Prometheus, Grafana, Wazuh, the tools are open source and mature. AI greatly accelerates the configuration and alert tuning phase.

5. Stay in control

AI is a tool, not an autonomous colleague. Review everything. Validate everything. Version everything in Git. If you don't understand what AI proposes, don't apply it.

The tools I use

For those who want to reproduce this approach:

  • Claude Code (Anthropic): the CLI that lets AI connect via SSH, read files, execute commands. It's the core of the setup.
  • Git (Gitea self-hosted): everything is versioned. Documentation, redacted configs, audit reports.
  • Prometheus + Grafana + Loki: open source monitoring, deployed on Linux containers.
  • Wazuh: open source SIEM for detection and compliance.

Everything runs on self-hosted infrastructure. No data in the cloud, no SaaS dependency for critical operations. That's a choice.
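"Redacted configs" implies a redaction pass before anything lands in Git. A hedged sketch of that pass; the patterns are examples for Cisco-style configs, and a real list would be longer and platform-specific:

```python
import re

# Patterns capturing secret material in Cisco-style configs
# (illustrative; extend per platform before trusting it)
SECRET_PATTERNS = [
    (re.compile(r"(secret \d+ )\S+"), r"\1<REDACTED>"),
    (re.compile(r"(password \d* ?)\S+"), r"\1<REDACTED>"),
    (re.compile(r"(snmp-server community )\S+"), r"\1<REDACTED>"),
]

def redact(config: str) -> str:
    """Replace password hashes and community strings before the
    config is committed to the documentation repo."""
    for pattern, repl in SECRET_PATTERNS:
        config = pattern.sub(repl, config)
    return config

conf = "enable secret 5 $1$abc$XyZ\nsnmp-server community public RO"
red = redact(conf)
```

The same principle applies to the AI itself: the agent works on redacted copies wherever secrets are not needed for the analysis.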

Conclusion

In 3 months, AI allowed me to:

  • Audit my entire network fleet and discover invisible critical findings
  • Document the entire infrastructure from actual configs, in a structured Git repo
  • Deploy a complete monitoring stack (metrics, logs, SIEM, alerts)
  • Audit my storage and divide findings by 3
  • Systematize security with mandatory checklists for every new machine

All while continuing to manage the daily operations of a demanding multi-site infrastructure.

AI doesn't replace the IT manager. It gives them back the time to do their real job: anticipate, structure, secure.

If you're waiting for AI to be perfect before starting, you'll wait too long. Start small, read-only, on a concrete use case. The results speak for themselves.