
Building MDR Automation Tools: Lessons from the Trenches


Janusz Czeropski
Security Engineer & Developer

When I joined Trend Micro as a Security Engineer, I quickly realized that Managed Detection and Response (MDR) at enterprise scale is a completely different beast than what most people imagine. It's not just about having the best detection rules or the shiniest SIEM; it's about efficiency, speed, and making the impossible manageable.

We're talking about thousands of events per hour, hundreds of customers, and a team that needs to respond to genuine threats in minutes, not hours. That's where automation comes in, and that's where I've spent the last few years building tools that actually make a difference.

The Reality of MDR Operations

Let me paint you a picture: It's 2 AM, and an analyst gets an alert. Something suspicious in a customer's environment: maybe lateral movement, maybe just a misconfigured service. The analyst needs to:

  • Pull context from multiple sources (SIEM, EDR, threat intel)
  • Correlate events across different timestamps
  • Enrich data with asset information
  • Determine if it's a false positive
  • If real, escalate and contain
  • Document everything for compliance

Doing this manually for every alert? That's a recipe for burnout and missed threats.

Why I Started Building Internal Tools

The commercial tools we used were powerful, but they weren't built for our specific workflow. Every team has unique processes, unique data sources, and unique pain points. Off-the-shelf solutions get you 80% of the way there, but that last 20% is where you either thrive or drown.

I remember the first tool I built: a simple Python script that automated the "first five minutes" of investigation. It would:

from concurrent.futures import ThreadPoolExecutor

def rapid_triage(alert_id):
    """
    Automated first-pass triage for MDR alerts
    Saves analysts 5-10 minutes per alert
    """
    # Fetch alert from SIEM
    alert = siem_api.get_alert(alert_id)
    
    # Parallel enrichment (saves time)
    with ThreadPoolExecutor(max_workers=5) as executor:
        threat_intel = executor.submit(check_threat_intel, alert.ioc)
        asset_context = executor.submit(get_asset_info, alert.hostname)
        related_events = executor.submit(find_related_events, alert)
        user_context = executor.submit(get_user_risk_score, alert.user)
        historical = executor.submit(check_historical_patterns, alert)
    
    # Compile enrichment report
    context = {
        'threat_score': threat_intel.result(),
        'asset': asset_context.result(),
        'related': related_events.result(),
        'user': user_context.result(),
        'history': historical.result()
    }
    
    # Auto-classify based on rules
    classification = classify_alert(context)
    
    # Generate analyst-friendly summary
    return create_investigation_summary(alert, context, classification)

That script alone saved our team hundreds of hours in the first month. But more importantly, it meant analysts could focus on thinking rather than clicking.
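The `classify_alert` step deserves a word: it was deliberately simple, just ordered rules over the enrichment context. A minimal sketch of that idea (the rule names, fields, and thresholds here are illustrative, not the production ruleset):

```python
# Sketch of a rule-based classifier like the classify_alert step above.
# Fields and thresholds are illustrative assumptions, not the real rules.

def classify_alert(context: dict) -> str:
    """Return a coarse classification from the enrichment context."""
    # Known-bad indicator from threat intel: escalate immediately
    if context.get('threat_score', 0) >= 80:
        return 'escalate'
    # Critical asset plus any related activity deserves a human look
    if context.get('asset', {}).get('criticality') == 'high' and context.get('related'):
        return 'investigate'
    # Seen many times before and always benign: likely noise
    if context.get('history', {}).get('benign_count', 0) >= 10:
        return 'probable_false_positive'
    # Default: hand to an analyst with the enrichment attached
    return 'investigate'
```

The point is that rules stay explainable: an analyst can read the classification and immediately see why the tool made that call.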

The Tools That Actually Moved the Needle

Over time, I built and maintained a suite of internal tools. Here are the ones that made the biggest impact:

1. Alert Correlation Engine

We were drowning in alerts that were related but scattered across different systems. I built a correlation engine that:

  • Grouped alerts by attack pattern (reconnaissance → exploitation → lateral movement)
  • Identified multi-stage attacks automatically
  • Reduced alert volume by 60% through intelligent deduplication

The secret? Graph databases. Modeling security events as nodes and relationships made pattern matching trivial:

# Neo4j query to find attack chains
query = """
MATCH path = (recon:Alert {type: 'reconnaissance'})-[:FOLLOWED_BY*1..5]->(exploit:Alert {type: 'exploitation'})
WHERE recon.timestamp < exploit.timestamp
AND duration.between(recon.timestamp, exploit.timestamp).minutes < 60
AND recon.source_ip = exploit.source_ip
RETURN recon, exploit, path
"""

2. Automated Playbook Executor

Repetitive response actions (isolate host, block IP, reset password) were eating up analyst time. I created a playbook system that:

  • Defined response workflows in YAML
  • Executed multi-step actions across different platforms (EDR, firewall, Active Directory)
  • Required human approval for destructive actions
  • Logged every step for audit trails

Example playbook:

name: Malware Containment - Tier 1
trigger: malware_detected
steps:
  - action: isolate_host
    target: ${alert.hostname}
    approval: auto
    
  - action: snapshot_memory
    target: ${alert.hostname}
    approval: auto
    
  - action: block_hash
    target: ${alert.file_hash}
    platforms: [edr, firewall]
    approval: auto
    
  - action: notify_customer
    template: malware_contained
    approval: auto
    
  - action: create_ticket
    priority: high
    approval: auto
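The executor itself is the less glamorous half. A minimal sketch of how a playbook like the one above gets run (the real system parsed YAML; here the playbook arrives already parsed into a dict, and the action handlers are stand-ins):

```python
# Minimal playbook executor sketch. Structure and handler names are
# illustrative; the production version loaded YAML and called real APIs.

def resolve(value: str, alert: dict) -> str:
    """Substitute ${alert.<field>} placeholders with alert values."""
    for key, val in alert.items():
        value = value.replace('${alert.%s}' % key, str(val))
    return value

def run_playbook(playbook: dict, alert: dict, handlers: dict,
                 approve=lambda step: False):
    """Run steps in order; non-auto steps are gated on human approval."""
    executed = []
    for step in playbook['steps']:
        if step.get('approval') != 'auto' and not approve(step):
            continue  # human declined (or no approver): skip, don't fail
        target = resolve(step.get('target', ''), alert)
        handlers[step['action']](target, step)
        executed.append((step['action'], target))
    return executed
```

Two design choices matter here: declined steps are skipped rather than aborting the whole playbook, and every executed step is returned so it can be written to the audit trail.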

3. Threat Intelligence Aggregator

We subscribed to multiple threat intel feeds. Manually cross-referencing IOCs was a nightmare. I built an aggregator that:

  • Pulled from 10+ sources (commercial and open-source)
  • Deduplicated and scored IOCs by confidence
  • Automatically enriched SIEM events
  • Provided a single API endpoint for lookups

Performance was critical: lookups needed to be sub-100 ms so they wouldn't bottleneck investigations.

# Note: functools.lru_cache can't wrap a coroutine directly (it would
# cache the coroutine object, which can only be awaited once), so the
# cache sits in front of the async call instead.
_ioc_cache: dict = {}

async def lookup_ioc(ioc: str, ioc_type: str) -> ThreatIntel:
    """Cached threat intel lookup"""
    key = (ioc, ioc_type)
    if key in _ioc_cache:
        return _ioc_cache[key]
    tasks = [
        query_virustotal(ioc),
        query_alienvault(ioc),
        query_abuseipdb(ioc),
        query_internal_db(ioc)
    ]
    results = await asyncio.gather(*tasks)
    intel = aggregate_and_score(results)
    _ioc_cache[key] = intel
    return intel
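The `aggregate_and_score` step is where deduplication and confidence scoring happen. A sketch of the weighted-scoring idea, assuming each feed returns a 0-100 verdict plus a source name (the feed names and weights here are illustrative):

```python
# Sketch of confidence-weighted IOC scoring across feeds.
# Feed names and weights are illustrative assumptions.
FEED_WEIGHTS = {'virustotal': 1.0, 'alienvault': 0.8,
                'abuseipdb': 0.7, 'internal': 1.2}

def aggregate_and_score(results: list) -> dict:
    """Combine per-feed verdicts into one weighted score (0-100)."""
    weighted = total = 0.0
    sources = []
    for r in results:
        if r is None:  # feed timed out or had no data on this IOC
            continue
        w = FEED_WEIGHTS.get(r['source'], 0.5)  # unknown feeds count less
        weighted += w * r['malicious_score']    # each feed scores 0-100
        total += w
        sources.append(r['source'])
    score = round(weighted / total) if total else 0
    return {'score': score, 'sources': sources}
```

Weighting by feed rather than averaging blindly matters: our internal database had far fewer false positives than open-source feeds, so its verdicts counted for more.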

Lessons Learned

Lesson 1: Automate the Boring, Not the Thinking

Early on, I tried to build an "AI" that would auto-close false positives. It was a disaster. The model couldn't capture the nuance that a human analyst could spot in seconds.

What worked: Automate data gathering, enrichment, and repetitive tasks. Let humans make the final call.

Lesson 2: Observability Is Non-Negotiable

When your automation is making decisions in production, you must know what it's doing. Every tool I built had:

  • Detailed logging (structured JSON logs to a central system)
  • Metrics dashboards (how many alerts processed, success rates, execution times)
  • Alerting on anomalies (if the tool starts failing, page someone)
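The structured-logging part needs nothing exotic. A sketch using only the standard library (the real setup shipped these lines to a central system; the field names are illustrative):

```python
# Structured JSON logging with only the standard library.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            # Extra context (alert_id, action, duration_ms) rides along
            **getattr(record, 'fields', {}),
        }
        return json.dumps(payload)

logger = logging.getLogger('rapid_triage')
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('alert processed',
            extra={'fields': {'alert_id': 'A-1042', 'duration_ms': 87}})
```

One JSON object per line means the central log system can index every field without parsing free text, which is exactly what you want at 3 AM.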

Tools without observability are black boxes. Black boxes in security are terrifying.

Lesson 3: Start Small, Iterate Fast

My first automation was a 100-line Python script. It solved one small problem. Then I added features based on feedback. Then I refactored. Then I scaled.

The tools that failed were the ones where I tried to build the "perfect solution" from day one. Perfect is the enemy of shipped.

Lesson 4: Make It Easy to Disable

In security, automation can go wrong. Maybe a bug in your code. Maybe an edge case you didn't consider. Always have a kill switch.

Every tool I shipped had a feature flag:

if not config.get('automation.rapid_triage.enabled', False):
    logger.warning("Rapid triage automation is disabled")
    return manual_triage(alert_id)

Lesson 5: Documentation > Code

I spent as much time writing runbooks and docs as I did writing code. Why? Because:

  • Teammates need to understand how it works
  • On-call needs to troubleshoot at 3 AM
  • Future me will forget why I made that weird architectural choice

Good docs mean your tool actually gets used.

The Architecture Philosophy

All my tools followed a similar pattern:

┌─────────────┐
│   Trigger   │  (Alert, Scheduled Job, API Call)
└──────┬──────┘
       │
       v
┌─────────────┐
│ Validation  │  (Schema check, rate limits)
└──────┬──────┘
       │
       v
┌─────────────┐
│  Enrichment │  (Pull context from multiple sources)
└──────┬──────┘
       │
       v
┌─────────────┐
│   Logic     │  (Decision making, orchestration)
└──────┬──────┘
       │
       v
┌─────────────┐
│   Action    │  (Execute, log, notify)
└──────┬──────┘
       │
       v
┌─────────────┐
│   Audit     │  (Store in DB, metrics, alerting)
└─────────────┘

Simple, testable, debuggable.
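In code, that pipeline maps naturally onto a chain of small functions that each take and return a context dict. A sketch (the stage bodies here are placeholders, not the production logic):

```python
# The pipeline above as composed stages. Each stage takes and returns
# a context dict, so every stage is individually testable.

def validate(ctx: dict) -> dict:
    if 'alert_id' not in ctx:
        raise ValueError('missing alert_id')  # schema check
    return ctx

def enrich(ctx: dict) -> dict:
    ctx['context'] = {'asset': 'placeholder'}  # pull from real sources here
    return ctx

def decide(ctx: dict) -> dict:
    ctx['decision'] = 'investigate'  # decision making / orchestration
    return ctx

def act(ctx: dict) -> dict:
    ctx['actions'] = ['notify_analyst']  # execute, log, notify
    return ctx

def audit(ctx: dict) -> dict:
    ctx['audited'] = True  # persist to DB, emit metrics
    return ctx

PIPELINE = [validate, enrich, decide, act, audit]

def handle(trigger: dict) -> dict:
    ctx = dict(trigger)
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```

Because every stage shares one signature, you can unit-test stages in isolation, insert a new one without touching the others, and log the context between stages when debugging.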

The Tech Stack

For those curious:

  • Languages: Python (primary), Go (for high-perf services)
  • Storage: PostgreSQL (events, audit logs), Redis (caching, job queues), Neo4j (relationship modeling)
  • Orchestration: Celery for task queues, Airflow for complex workflows
  • APIs: FastAPI for HTTP services, gRPC for internal microservices
  • Deployment: Docker, Kubernetes, GitLab CI/CD
  • Monitoring: Prometheus + Grafana, ELK stack for logs

The Impact

After two years of building and iterating:

  • 60% reduction in mean time to triage (MTTT)
  • 40% reduction in false positive escalations
  • 80% of Tier 1 alerts fully automated (human-in-the-loop for approval)
  • Analyst satisfaction up (they spend time on interesting problems, not clicking through UIs)

But the real metric? We started catching threats we would have missed before. Automation gave us speed, and speed meant we could respond before attackers pivoted.

Advice for Building Your Own Tools

If you're thinking about building internal security tools, here's what I'd recommend:

  • Talk to your users (analysts, SOC managers) before writing a line of code
  • Solve one problem really well before tackling the next
  • Measure everything: you can't improve what you don't measure
  • Put security first: your tools have access to sensitive data, so treat them like production systems
  • Use open source when possible: standing on the shoulders of giants saves time

Wrapping Up

Building automation tools for MDR isn't about replacing analysts; it's about amplifying them. It's about giving them superpowers so they can focus on what humans do best: critical thinking, pattern recognition, and creative problem-solving.

If you're in a security role and thinking "I could automate this," you probably should. Start small. Ship fast. Iterate. And always, always keep the analyst experience in mind.

Because at the end of the day, tools are only as good as the people using them.

---

Janusz Czeropski is a Security Engineer at Trend Micro, where he builds internal tools and automation for MDR operations. When he's not automating alerts, he's probably self-hosting something or playing around with infrastructure. You can find him on GitHub or LinkedIn.

Tagged with:

#MDR #automation #Python #security-tools #SOC