Objective
Build an interpretable machine learning model that predicts whether a network device is vulnerable from its software version string, and that exposes clear decision logic security teams can understand and act on.
Business Value
For Network Security Teams:
- Proactive Risk Assessment: Identify vulnerable devices before security scanners detect active threats
- Patch Prioritization: Focus limited maintenance windows on devices with highest vulnerability risk
- Transparent Decision Making: Clear decision tree logic shows exactly why a device is flagged as vulnerable
- Continuous Monitoring: Real-time vulnerability scoring integrated with network inventory systems
For Enterprise IT:
- Asset Management: Automated vulnerability assessment integrated with CMDB and inventory systems
- Compliance Support: Document clear criteria for device vulnerability classification
- Resource Optimization: Prioritize patching efforts based on data-driven risk assessment
- Cost Reduction: Reduce manual vulnerability assessment overhead through automation
Core Libraries
- pandas & numpy: Data processing and numerical computations for version string analysis
- scikit-learn: DecisionTreeClassifier for interpretable vulnerability prediction
- matplotlib & seaborn: Decision tree visualization and confusion matrix analysis
- re (regex): Version string parsing to extract major, minor, and patch numbers
- synthetic data generation: Custom functions to create realistic network device datasets
Dataset
Source: Synthetically Generated Network Device Data
- Device Types: Cisco routers, Juniper firewalls, and Arista switches with realistic version patterns
- Version Strings: Complex version formats (e.g., "15.1(4)M", "20.4R3") typical of network equipment
- Vulnerability Rules: Predefined logic based on version age and device type patterns
- Scale: Multiple devices per version with noise injection for realistic class distribution
Key Features:
- Device type classification (router, firewall, switch)
- Version parsing: major version, minor version, patch level
- Vulnerability labeling based on version age and known patterns
- Realistic noise injection to simulate real-world uncertainty
Step-by-Step Guide
1. Environment Setup and Data Generation
pip install pandas numpy scikit-learn matplotlib seaborn
Generate synthetic network device data with realistic software version patterns and vulnerability rules.
2. Synthetic Data Generation
# Define device types and version patterns
devices = {
    'CISCO_ROUTER': ['15.1(4)M', '15.2(1)T', '15.5(3)S', '16.1.1', '16.3.2'],
    'JUNIPER_FIREWALL': ['18.4R1', '19.2R2', '20.1R1', '20.4R3', '21.2R1'],
    'ARISTA_SWITCH': ['4.20.6M', '4.21.5F', '4.22.1F', '4.23.0F', '4.25.1M'],
}

# Apply vulnerability rules based on version patterns
def is_vulnerable(row):
    version = row['software_version']
    if 'CISCO' in row['device_type'] and ('15.1' in version or '15.2' in version):
        return 1
    return 0
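The snippet above does not show how the labeled table is assembled or how noise is injected. One possible sketch follows. The `build_dataset` helper, the `devices_per_version` and `noise_rate` parameters, and the `is_vulnerable` column name are assumptions made for illustration; none of them come from the original project.

```python
import numpy as np
import pandas as pd

# Version lists copied from the step above.
devices = {
    'CISCO_ROUTER': ['15.1(4)M', '15.2(1)T', '15.5(3)S', '16.1.1', '16.3.2'],
    'JUNIPER_FIREWALL': ['18.4R1', '19.2R2', '20.1R1', '20.4R3', '21.2R1'],
    'ARISTA_SWITCH': ['4.20.6M', '4.21.5F', '4.22.1F', '4.23.0F', '4.25.1M'],
}

def is_vulnerable(row):
    """Rule from the guide: Cisco devices on 15.1 or 15.2 are vulnerable."""
    version = row['software_version']
    if 'CISCO' in row['device_type'] and ('15.1' in version or '15.2' in version):
        return 1
    return 0

def build_dataset(devices_per_version=20, noise_rate=0.05, seed=42):
    """Hypothetical helper: repeat each version, label it, then flip some labels."""
    rng = np.random.default_rng(seed)
    rows = [
        {'device_type': device_type, 'software_version': version}
        for device_type, versions in devices.items()
        for version in versions
        for _ in range(devices_per_version)
    ]
    df = pd.DataFrame(rows)
    df['is_vulnerable'] = df.apply(is_vulnerable, axis=1)
    # Noise injection (assumed scheme): flip each label with probability noise_rate.
    flip = rng.random(len(df)) < noise_rate
    df.loc[flip, 'is_vulnerable'] = 1 - df.loc[flip, 'is_vulnerable']
    return df

df = build_dataset()
print(df['is_vulnerable'].value_counts())
```

With only the single Cisco rule shown in the guide, 2 of the 15 versions are labeled vulnerable before noise, so the classes are imbalanced. That is worth keeping in mind when reading accuracy figures.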
3. Version String Feature Engineering
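import re
import pandas as pd

# Parse complex version strings into numerical features
def parse_version(version):
    # Extract numbers from patterns like 15.1(4)M -> [15, 1, 4]
    parts = [int(p) for p in re.findall(r'\d+', version)]
    # Pad short versions with zeros so every row has three components
    while len(parts) < 3:
        parts.append(0)
    return parts[:3]

# Assumes df has a 'software_version' column, as used in is_vulnerable
version_features = df['software_version'].apply(parse_version)
df[['v_major', 'v_minor', 'v_patch']] = pd.DataFrame(version_features.tolist(), index=df.index)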
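To make the parsing behavior concrete, here is a small standalone check of `parse_version` on version strings from the dataset. The input `'18.4'` is a made-up short version added only to show the zero padding.

```python
import re

def parse_version(version):
    """Return the first three integers in a version string, zero-padded."""
    parts = [int(p) for p in re.findall(r'\d+', version)]
    parts += [0] * (3 - len(parts))  # no-op when three or more numbers were found
    return parts[:3]

for v in ['15.1(4)M', '20.4R3', '16.1.1', '4.20.6M', '18.4']:
    print(v, '->', parse_version(v))
# 15.1(4)M -> [15, 1, 4]
# 20.4R3 -> [20, 4, 3]
# 16.1.1 -> [16, 1, 1]
# 4.20.6M -> [4, 20, 6]
# 18.4 -> [18, 4, 0]
```

Note that the release-train letters (M, T, S, R, F) are discarded, so two versions that differ only by letter map to the same features. If those letters carry vulnerability information for a vendor, they would need a separate feature.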
4. Model Training with Interpretability Focus
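from sklearn.tree import DecisionTreeClassifier

# Decision Tree with limited depth for interpretability
model = DecisionTreeClassifier(
    random_state=42,
    max_depth=4,  # Keep tree interpretable
)
# X_train and y_train are the training split of the engineered features and labels
model.fit(X_train, y_train)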
5. Vulnerability Assessment Evaluation
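from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)  # predictions on the held-out test split
# Focus on vulnerability detection performance
print(classification_report(y_test, y_pred, target_names=['Not Vulnerable', 'Vulnerable']))
# Analyze missed vulnerabilities (false negatives), counted in cm[1, 0]
cm = confusion_matrix(y_test, y_pred)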
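Since missed vulnerabilities are the costly error here, it can help to pull the false negatives out of the confusion matrix explicitly. The sketch below uses small made-up label arrays, not results from this project, to show how the four cells unpack and how to locate the missed devices.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only (not project results): 1 = vulnerable.
y_test = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields (tn, fp, fn, tp).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(f"Missed vulnerable devices (false negatives): {fn}")
print(f"False alarms (false positives): {fp}")

# Positions in the test set where a vulnerable device was predicted as safe.
missed_idx = np.flatnonzero((y_test == 1) & (y_pred == 0))
print("Indices to review:", missed_idx.tolist())
```

In practice the indices would be mapped back to device rows so the security team can review each missed case.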
6. Decision Tree Visualization
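import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the learned decision rules
plt.figure(figsize=(16, 8))
plot_tree(
    model,
    feature_names=list(X.columns),  # a plain list is accepted across scikit-learn versions
    class_names=['Not Vulnerable', 'Vulnerable'],
    filled=True,
    rounded=True,
)
plt.show()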
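A plotted tree is hard to paste into a ticket or an audit log. As a complement, scikit-learn's `export_text` renders the same rules as indented text. The sketch below fits a tree on a tiny illustrative table (the parsed Cisco versions from this guide, labeled by the 15.1/15.2 rule), not on the full synthetic dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative training set: parsed Cisco versions from the guide,
# labeled with the 15.1/15.2 rule. Real training data would be larger.
X = pd.DataFrame(
    [[15, 1, 4], [15, 2, 1], [15, 5, 3], [16, 1, 1], [16, 3, 2]],
    columns=['v_major', 'v_minor', 'v_patch'],
)
y = [1, 1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=42, max_depth=4).fit(X, y)

# export_text renders the fitted tree as nested threshold rules.
rules = export_text(model, feature_names=list(X.columns))
print(rules)
```

Each leaf line ends in a predicted class, so a flagged device can be traced to the exact thresholds that produced the decision.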
Success Criteria
Primary Metrics:
- Recall for Vulnerable Class: >90% (miss as few vulnerable devices as possible)
- Precision for Vulnerable Class: >80% (minimize false alarms)
- F1-Score: >0.85 for balanced performance
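The three targets above can be checked together after each training run. The helper below is a hypothetical sketch: the function name and its parameters are assumptions, and the label arrays are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def meets_primary_criteria(y_true, y_pred, min_recall=0.90,
                           min_precision=0.80, min_f1=0.85):
    """Hypothetical helper: compare vulnerable-class metrics with the targets."""
    scores = {
        'recall': recall_score(y_true, y_pred, pos_label=1),
        'precision': precision_score(y_true, y_pred, pos_label=1),
        'f1': f1_score(y_true, y_pred, pos_label=1),
    }
    # The targets are strict inequalities (>90%, >80%, >0.85).
    passed = bool(
        scores['recall'] > min_recall
        and scores['precision'] > min_precision
        and scores['f1'] > min_f1
    )
    return passed, scores

# Illustrative labels only: 10 vulnerable devices, all caught, plus 2 false alarms.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 10 + [1] * 2 + [0] * 8
passed, scores = meets_primary_criteria(y_true, y_pred)
print(passed, scores)
```

In this made-up case recall is 1.0 and precision is 10/12, so all three thresholds are met.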
Secondary Metrics:
- Decision Tree Interpretability: Clear, understandable decision rules with max depth ≤ 5
- Feature Importance: Logical version-based splitting criteria
- Processing Speed: Fast inference for real-time inventory assessment
Business Impact:
- Deploy in network asset management systems
- Integrate with patch management workflows
- Provide clear audit trail for vulnerability decisions
Next Steps & Extensions
Immediate Improvements
- Real Data Integration: Connect with vulnerability databases (CVE, NVD)
- Multi-vendor Support: Expand device type coverage and version parsing
- Confidence Scoring: Add prediction probability for risk prioritization
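For the confidence scoring item, a decision tree already exposes class probabilities through `predict_proba`: each device receives the fraction of vulnerable training samples in the leaf it falls into. The sketch below uses made-up data with noisy labels so that some leaves are impure. The `risk` column name is an assumption.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: parsed versions with noisy labels, so leaves are not all pure.
X = pd.DataFrame({
    'v_major': [15, 15, 15, 15, 16, 16, 16, 16],
    'v_minor': [1, 1, 2, 5, 1, 1, 3, 3],
})
y = [1, 1, 1, 0, 0, 1, 0, 0]

model = DecisionTreeClassifier(random_state=42, max_depth=1).fit(X, y)

# Column order of predict_proba follows model.classes_.
vuln_col = list(model.classes_).index(1)
risk = model.predict_proba(X)[:, vuln_col]

# Rank devices so the highest estimated risk comes first.
ranked = X.assign(risk=risk).sort_values('risk', ascending=False)
print(ranked)
```

Because a shallow tree has few leaves, these scores take only a handful of distinct values. They are suited to coarse prioritization rather than fine-grained ranking.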
Advanced Techniques
- Ensemble Methods: Combine multiple decision trees for improved accuracy
- Time Series Analysis: Incorporate patch release timelines and vulnerability disclosure dates
- Active Learning: Update model with security team feedback on predictions
Production Deployment
- API Integration: REST API for real-time vulnerability assessment
- CMDB Integration: Automatic device inventory vulnerability scoring
- Alert Systems: Automated notifications for high-risk device detection
Domain Expansion
- CVE Mapping: Direct integration with Common Vulnerabilities and Exposures database
- Risk Scoring: Multi-factor risk assessment including network exposure and criticality
- Patch Planning: Automated maintenance window planning based on vulnerability predictions
This project demonstrates practical application of interpretable machine learning for cybersecurity asset management, providing both accurate vulnerability prediction and transparent decision logic for security operations teams.