Research & Results

Comprehensive analysis of our AI-powered phishing detection system

Model Accuracy: 98.59%
URLs Analyzed: 450K+
Features Extracted: 28

Model Performance Analysis

Visual representation of our model's performance metrics and comparisons

Accuracy Analysis

This graph shows the model's accuracy across training epochs and how performance improved as training progressed.

Performance Metrics

Breakdown of the model's precision, recall, and F1-score, which together show how it trades off false positives against missed phishing URLs.

Parameter Comparison

Comparative analysis of different model parameters and their impact on overall performance, helping identify the optimal configuration.

Abstract

PhishShield is a machine-learning approach to detecting phishing URLs. Starting from a collection of 450K+ URLs, refined to 186,262 labeled samples, we built a feature extraction system that analyzes 28 characteristics across URL structure, page content, and security indicators. Our XGBoost model achieves 98.59% accuracy and a 99.85% AUC-ROC score. PhishShield ships as a Chrome extension built on Manifest V3 that provides real-time URL monitoring, visual threat alerts, and scan history tracking, showing that the model can be integrated into a practical browser-based security tool.

Data Collection Journey

Initial Dataset

The project started with 11,000 URLs from Kaggle, each with 30 associated features.

Source Expansion

PhishTank: ~65,000 live phishing URLs
OpenPhish: ~500 live phishing URLs
UCI Dataset: 110,000 labeled instances
Additional Kaggle Datasets: ~34,000 legitimate URLs
UCI Repository: 134,000 legitimate URLs

Data Cleaning

Initial cleaning reduced roughly 450,000 URLs to 245,000 unique URLs. Further refinement left 186,262 samples: 87,667 legitimate and 98,595 phishing.

Feature Extraction

URL-Based Features

URL Length Analysis
Domain Length
TLD Length
Letter Ratio in URL
Digit Ratio in URL
Special Character Ratio
Abnormal URL Patterns
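
A minimal sketch of how several of the URL-based features above could be computed with the Python standard library. The function name, the dictionary keys, and the exact definitions (for example, treating the last dot-separated label as the TLD and dividing by the full URL length) are assumptions for illustration, not PhishShield's actual extraction code.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a few lexical URL features (illustrative definitions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: the TLD is the last dot-separated label of the host.
    # This ignores multi-label public suffixes such as "co.uk".
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    n = len(url) or 1  # avoid division by zero on empty input
    letters = sum(c.isalpha() for c in url)
    digits = sum(c.isdigit() for c in url)
    specials = sum(not c.isalnum() for c in url)
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "tld_length": len(tld),
        "letter_ratio": letters / n,
        "digit_ratio": digits / n,
        "special_char_ratio": specials / n,
    }
```

For example, `url_features("http://example.com/a1")` reports a URL length of 21, a domain length of 11, and a TLD length of 3.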

Page Content Features

Largest Line Length
Number of Images
JavaScript Files Count
CSS Files Count
Self-Referencing Links
External References
Iframe Presence
Pop-Up Windows
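
Some of the page content features above can be approximated by counting tags in the raw HTML. The sketch below uses Python's built-in `html.parser`; the class name, the feature keys, and the counting rules (only external scripts, only `<link rel="stylesheet">` tags) are assumptions, and the real pipeline may define these features differently.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count a few tags that appear in the page-content feature list."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.js_files = 0
        self.css_files = 0
        self.iframes = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img":
            self.images += 1
        elif tag == "script" and attr.get("src"):
            self.js_files += 1  # external scripts only, inline ones are skipped
        elif tag == "link" and "stylesheet" in (attr.get("rel") or "").lower():
            self.css_files += 1
        elif tag == "iframe":
            self.iframes += 1


def page_content_features(html: str) -> dict:
    """Return tag counts plus the length of the longest source line."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return {
        "largest_line_length": max((len(line) for line in html.splitlines()), default=0),
        "num_images": parser.images,
        "num_js_files": parser.js_files,
        "num_css_files": parser.css_files,
        "has_iframe": parser.iframes > 0,
    }
```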

Security Features

HTTPS Status
Obfuscation Detection
Title Tag Presence
Meta Description
Submit Button Analysis
Password Field Detection
Domain IP Address
Right-Click Disabled Check
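
Two of the security features above, HTTPS status and whether the domain is a raw IP address, can be derived from the URL alone. The helper names below are hypothetical; this is a sketch of one plausible definition, not the project's implementation.

```python
import ipaddress
from urllib.parse import urlsplit


def uses_https(url: str) -> bool:
    """True when the URL scheme is https."""
    return urlsplit(url).scheme.lower() == "https"


def host_is_ip_address(url: str) -> bool:
    """True when the host is a literal IPv4 or IPv6 address rather than a name."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False
```

Phishing pages are often served from bare IP addresses, so `host_is_ip_address("http://192.168.0.1/login")` returning `True` would count as a warning signal under this definition.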

Data Preprocessing Journey

Our preprocessing pipeline has three stages: initial cleaning, feature engineering, and data balancing.

Initial Cleaning

Removal of duplicate URLs
Handling of missing values
Standardization of URL formats
Verification of URL accessibility
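
The first three cleaning steps can be sketched as a normalize-then-deduplicate pass. The normalization rules here (lowercase scheme and host, default path of `/`, fragment dropped) are assumptions chosen for illustration; the accessibility check is omitted because it requires network access.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # host is case-insensitive
        parts.path or "/",     # treat an empty path as the root
        parts.query,
        "",                    # drop the fragment, it is never sent to the server
    ))


def deduplicate(urls):
    """Drop missing values and duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if not url or not url.strip():
            continue  # missing value
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With these rules, `"HTTP://Example.com"`, `"http://example.com/"`, and `"http://example.com/#top"` all collapse to the single entry `"http://example.com/"`.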

Feature Engineering

URL structure analysis
Domain-based feature extraction
HTML content parsing
Security certificate validation

Data Balancing

Class distribution analysis
Undersampling of majority class
SMOTE for minority class
Validation split preparation
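
As an illustration of the undersampling step, the sketch below randomly drops rows from larger classes until every class matches the smallest one. The function name and the fixed seed are assumptions. SMOTE, which synthesizes new minority samples by interpolating between neighbours, is not shown and would normally come from a dedicated library.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest one."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(rows) for rows in by_class.values())
    balanced = []
    for label, rows in by_class.items():
        for sample in rng.sample(rows, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous runs
    return balanced
```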

Model Development

Model Performance Comparison

Best Performer

XGBoost

Accuracy: 98.59%
AUC-ROC: 99.85%
Precision: 99.31%
Recall: 98.55%
F1 Score: 98.93%
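
As a consistency check, the F1 score is the harmonic mean of precision and recall, so the reported value can be recomputed from the two figures above. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.9931, 0.9855)
print(f"{f1:.4f}")  # 0.9893
```

The result, 98.93%, agrees with the reported F1 score.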

Stacking Ensemble

Accuracy: 98.57%
Enhanced Robustness
Balanced Performance
Multi-Model Integration
Reduced Variance

AutoGluon

Accuracy: 98.43%
AUC-ROC: 99.82%
Automated Optimization
Quick Deployment
Self-Tuning

Chrome Extension Implementation

Key Features

Real-time URL monitoring and threat detection
Built on Chrome's Manifest V3 standard
Service worker-based architecture
Local storage for scan history (last 10 scans)
Visual alerts for detected threats
Integration with FastAPI backend
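
The "last 10 scans" limit is a bounded history: once the list is full, each new entry pushes out the oldest one. The sketch below shows that logic in Python for illustration only. The extension itself runs JavaScript and would persist the list in browser storage, and the names `MAX_HISTORY` and `record_scan` are hypothetical.

```python
from collections import deque

MAX_HISTORY = 10  # matches the "last 10 scans" limit


def record_scan(history: deque, url: str, verdict: str) -> None:
    """Add the newest scan at the front; the deque discards the oldest when full."""
    history.appendleft({"url": url, "verdict": verdict})


history = deque(maxlen=MAX_HISTORY)
for i in range(12):
    record_scan(history, f"https://example.com/{i}", "safe")

# Only the 10 most recent scans remain, newest first.
print(len(history), history[0]["url"])
```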

Protected

Extension is active and monitoring

Real-time protection active

ML-powered threat detection

Future Scope

Multi-Modal Analysis

Integrating visual and textual content analysis with URL features for enhanced detection.

Cross-Language Support

Expanding detection capabilities to handle multiple languages and regional variations.

Mobile & IoT Protection

Optimizing detection for mobile browsers and IoT devices.

Federated Learning

Implementing privacy-preserving collaborative learning across devices.