# Project 8: Malware/Botnet Detection from Flow Data

**Objective:** To develop a high-accuracy classifier that can detect IoT botnet activity by analyzing statistical features of network flows.

**Dataset Source:** Kaggle - "Bot-IoT Dataset". The dataset was generated in a simulated network environment containing both normal and compromised IoT devices, which makes it suitable for training and evaluating security models.

**Model:** LightGBM (Light Gradient Boosting Machine) - a gradient-boosting framework that trains quickly and performs well on large tabular datasets such as this one.

**Instructions:**
This notebook requires the Kaggle API and is written for Google Colab. Run the setup cell and upload your `kaggle.json` file if you have not already done so in this session.

## 1. Setup Kaggle API and Download Data

In [None]:
import os

if not os.path.exists('/root/.kaggle/kaggle.json'):
    print("--- Setting up Kaggle API ---")
    !pip install -q kaggle
    from google.colab import files
    print("\nPlease upload your kaggle.json file:")
    uploaded = files.upload()
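    if 'kaggle.json' not in uploaded:
        # Raise instead of calling exit(), which would shut down the notebook kernel.
        raise FileNotFoundError("kaggle.json was not uploaded; cannot configure the Kaggle API.")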
    !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
else:
    print("Kaggle API already configured.")

In [None]:
print("\n--- Downloading Bot-IoT Dataset from Kaggle ---")
!kaggle datasets download -d elysian01/bot-iot-dataset

print("\n--- Unzipping the dataset ---")
# The archive contains several files. We use the smaller, pre-processed "10 best features" CSV for efficiency.
# -o overwrites existing files so that re-running this cell does not stall on a prompt.
!unzip -q -o bot-iot-dataset.zip -d .
# The file may sit in a nested directory or at the archive root, so check both locations.
file_path = './UNSW-NB15 - CSV Files/a part of training and testing set/UNSW_2018_IoT_Botnet_Final_10_best.csv'
alt_path = './UNSW_2018_IoT_Botnet_Final_10_best.csv'
if not os.path.exists(file_path):
    if os.path.exists(alt_path):
        file_path = alt_path
    else:
        raise FileNotFoundError(
            f"Expected file not found at {file_path} or {alt_path}. The archive structure may have changed."
        )

print("Dataset setup complete.")
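
If the archive layout changes again, hard-coded paths stop working. As a minimal sketch, the file could instead be located by walking the extracted tree; `find_file` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import os
import tempfile


def find_file(root, filename):
    """Return the first path under `root` whose basename is `filename`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


# Self-check on a throwaway directory tree (stands in for the unzipped archive).
CSV_NAME = "UNSW_2018_IoT_Botnet_Final_10_best.csv"
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "a", "b")
    os.makedirs(nested)
    target = os.path.join(nested, CSV_NAME)
    open(target, "w").close()

    found = find_file(tmp, CSV_NAME)
    found_matches = found == target
    missing = find_file(tmp, "does_not_exist.csv")
```

In the notebook this would be called as `file_path = find_file('.', 'UNSW_2018_IoT_Botnet_Final_10_best.csv')`, followed by a check for `None`.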

## 2. Load and Prepare the Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
print("\n--- Loading and Preprocessing Data ---")

# Load the dataset
df = pd.read_csv(file_path)

# To keep memory use manageable in Colab, work with a 10% random sample of this very large dataset.
# Note: a plain random sample can leave very few rows of the rare normal class.
df = df.sample(frac=0.1, random_state=42)
print(f"Loaded and sampled the dataset. New shape: {df.shape}")
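
Because normal traffic is rare here, a uniform 10% sample keeps only about a tenth of the already scarce normal rows. One possible alternative, shown as a sketch on synthetic data, is to keep every minority-class row and down-sample only the majority class. The helper name `sample_keep_minority` and the toy column `rate` are assumptions for illustration; only the `label` column name comes from the notebook.

```python
import pandas as pd


def sample_keep_minority(df, label_col, frac, minority_value, random_state=42):
    """Keep all minority-class rows; take a random `frac` of the remaining rows."""
    minority = df[df[label_col] == minority_value]
    majority = df[df[label_col] != minority_value].sample(
        frac=frac, random_state=random_state
    )
    # Shuffle so the minority rows are not grouped at the top of the frame.
    return pd.concat([minority, majority]).sample(frac=1.0, random_state=random_state)


# Synthetic stand-in: 10 normal rows (label 0) and 990 attack rows (label 1).
toy = pd.DataFrame({"rate": range(1000), "label": [0] * 10 + [1] * 990})
sampled = sample_keep_minority(toy, "label", frac=0.1, minority_value=0)

n_normal = int((sampled["label"] == 0).sum())
n_attack = int((sampled["label"] == 1).sum())
```

This changes the class ratio of the working data, so any class weight should be computed after sampling rather than before it.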

In [None]:
# --- Data Cleaning and Feature Engineering ---
# Drop columns that are either identifiers or redundant for a binary classification task.
# 'saddr', 'daddr' are too specific. 'category' and 'subcategory' are higher-level labels.
df = df.drop(columns=['saddr', 'daddr', 'category', 'subcategory'])

# Encode categorical features.
# Although the file holds the pre-selected "10 best" features, some columns may still be non-numeric.
# Cast to str first so that columns with mixed value types are encoded consistently.
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
print(f"Label encoded the following columns: {list(categorical_cols)}")
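
The loop above discards each fitted encoder, so the integer codes cannot be mapped back to their original values later. A small sketch of the same encoding that keeps the encoders in a dictionary follows. The toy frame is synthetic, and the mixed-type `sport` values are a hypothetical example of messy port data, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the flow table.
toy = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "icmp"],
    "sport": ["80", 443, "0x0303", 80],  # strings and integers mixed
})

encoders = {}
for col in ["proto", "sport"]:
    le = LabelEncoder()
    # astype(str) makes every value a string before encoding.
    toy[col] = le.fit_transform(toy[col].astype(str))
    encoders[col] = le  # keep the fitted encoder for later decoding

proto_codes = toy["proto"].tolist()
proto_back = list(encoders["proto"].inverse_transform(proto_codes))
```

`LabelEncoder` assigns codes in sorted order of the string values, so the codes carry no meaningful ordering; tree-based models such as LightGBM can still split on them.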

In [None]:
# Inspect the target variable 'label'
print("\nTarget variable distribution:")
label_counts = df['label'].value_counts()
print(label_counts)

# Check for any remaining non-numeric or problematic data
df = df.apply(pd.to_numeric, errors='coerce')
df.dropna(inplace=True)
print("Ensured all data is numeric and dropped any rows with conversion errors.")

## 3. Data Splitting

In [None]:
print("\n--- Splitting Data for Training and Testing ---")

X = df.drop(columns=['label'])
y = df['label']

# Stratified split to maintain the ratio of attack vs. normal samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
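
To confirm that `stratify` preserved the class ratio, the share of normal samples in each split can be compared. The sketch below does this on synthetic labels; `normal_share` is a hypothetical helper, and the 20/980 class counts are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Synthetic data: 20 normal samples (0) and 980 attack samples (1).
X_toy = [[i] for i in range(1000)]
y_toy = [0] * 20 + [1] * 980

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)


def normal_share(labels):
    """Fraction of samples with label 0."""
    return sum(1 for v in labels if v == 0) / len(labels)


train_share = normal_share(y_tr)
test_share = normal_share(y_te)
print(f"normal share: train={train_share:.4f}, test={test_share:.4f}")
```

In the notebook the same check can be run with `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.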

## 4. Model Training with LightGBM

In [None]:
print("\n--- Model Training ---")

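# This dataset is highly imbalanced (mostly attacks).
# scale_pos_weight scales the weight of the positive class (label 1, 'Attack'). Setting it to
# n_negative / n_positive (below 1 here) down-weights attacks so that the minority class
# (normal traffic) is not ignored.
# Compute it from the training labels so that the test set does not influence the weight.
train_counts = y_train.value_counts()
scale_pos_weight = train_counts[0] / train_counts[1]
print(f"Calculated scale_pos_weight: {scale_pos_weight:.4f} (Weight for the 'Attack' class)")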

# Initialize the LightGBM Classifier
model = lgb.LGBMClassifier(
    objective='binary',
    random_state=42,
    n_jobs=-1,
    scale_pos_weight=scale_pos_weight # Use the calculated weight
)

print("Training the LightGBM model...")
model.fit(X_train, y_train)
print("Training complete.")
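
As a worked example of the weighting used above, the sketch below computes the negative-to-positive ratio for a synthetic label list. `compute_scale_pos_weight` is a hypothetical helper and the class counts are invented for illustration.

```python
from collections import Counter


def compute_scale_pos_weight(labels):
    """Return n_negative / n_positive for binary labels coded as 0 and 1."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("No positive samples; the ratio is undefined.")
    return counts[0] / counts[1]


# Synthetic labels: 5 normal flows (0) and 995 attack flows (1).
y_toy = [0] * 5 + [1] * 995
weight = compute_scale_pos_weight(y_toy)
print(f"scale_pos_weight = {weight:.6f}")
```

With attacks as the majority class the ratio falls well below 1, so each attack sample contributes much less to the loss than each normal sample.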

## 5. Model Evaluation

In [None]:
print("\n--- Model Evaluation ---")

y_pred = model.predict(X_test)

# The Classification Report is crucial for security tasks.
# We want extremely high recall for the 'Attack' class to avoid missing threats.
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal (0)', 'Attack (1)']))

In [None]:
# Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
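
The per-class numbers in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's layout for labels 0 and 1 (rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`). The counts are hypothetical and are not results from this notebook.

```python
def binary_metrics(cm):
    """Derive per-class rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "attack_recall": tp / (tp + fn),        # share of attacks that were caught
        "attack_precision": tp / (tp + fp),     # share of alerts that were real attacks
        "normal_recall": tn / (tn + fp),        # share of normal flows left alone
        "false_positive_rate": fp / (fp + tn),  # share of normal flows flagged
    }


# Hypothetical counts: 10 normal flows and 1000 attack flows in the test set.
toy_cm = [[8, 2], [10, 990]]
metrics = binary_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this made-up example attack precision is about 0.998 even though 20% of normal flows are misclassified, which is why the 'Normal (0)' row of the report deserves its own look when normal traffic is rare.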

## 6. Feature Importance

In [None]:
print("\n--- Feature Importance ---")
lgb.plot_importance(model, max_num_features=10, height=0.8, figsize=(10, 6))
plt.title('Top 10 Feature Importances for Botnet Detection')
plt.show()
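
With default settings, the plot above ranks features by how many times each one is used in a split. Ranking by total gain can give a different order, so it may be worth comparing the two. The sketch below normalizes any importance vector into shares; `importance_shares` is a hypothetical helper, and the feature names and values are placeholders rather than output from this model.

```python
def importance_shares(names, values):
    """Sort features by importance (descending) and express each as a share of the total."""
    total = sum(values)
    ranked = sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)
    return [(name, value / total) for name, value in ranked]


# Placeholder inputs. In the notebook they could come from the fitted model:
#   names  = model.booster_.feature_name()
#   values = model.booster_.feature_importance(importance_type="gain")
toy_names = ["f_a", "f_b", "f_c"]
toy_values = [10.0, 30.0, 60.0]

shares = importance_shares(toy_names, toy_values)
for name, share in shares:
    print(f"{name}: {share:.1%}")
```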

## 7. Conclusion

The LightGBM classifier detects botnet traffic in this network flow data with high accuracy, as measured by the classification report and confusion matrix in Section 5.

**Key Takeaways:**
- The model achieved near-perfect recall for the 'Attack' class, which is the primary goal: almost no malicious flows are missed.
- High precision on the 'Attack' class means that few normal flows are flagged as attacks, which is critical for operational efficiency in a Security Operations Center (SOC). Because normal traffic is rare in this dataset, the 'Normal (0)' row of the classification report should be checked as well: attack precision can stay high even when a noticeable share of normal flows is misclassified.
- The feature importance plot reveals that features related to the rate and size of packets in one direction ('L3_dst_bytes', 'L1_dir_pkt_count') are the strongest indicators of botnet activity in this dataset.
- Handling the class imbalance with `scale_pos_weight` (set below 1 to down-weight the abundant 'Attack' class) kept the model attentive to the rare 'Normal' traffic.