Phishing URL Detection with Python and ML

Phishing is a form of fraudulent attack in which the attacker tries to obtain sensitive information by posing as a reputable source. In a typical phishing attack, a victim opens a compromised link that poses as a credible website. The victim is then asked to enter their credentials, but since the site is “fake,” the sensitive information is routed to the hacker and the victim gets “hacked.”

Phishing is popular since it is a low effort, high reward attack. Most modern web browsers, antivirus software and email clients are pretty good at detecting phishing websites at the source, helping to prevent attacks. To understand how they work, this blog post will walk you through a tutorial that shows you how to build your own phishing URL detector using Python and machine learning:

  1. Identify the criteria that can recognize fake URLs
  2. Build a decision tree that can iterate through the criteria
  3. Train our model to recognize fake vs real URLs
  4. Evaluate our model to see how it performs
  5. Check for false positives/negatives

Get Started: Install ML Tools With This Ready-To-Use Python Environment

To follow along with the code in this Python phishing detection tutorial, you’ll need to have a recent version of Python installed, along with all the packages used in this post. The quickest way to get up and running is to install the Phishing URL Detection runtime for Windows or Linux, which contains a version of Python and all the packages you’ll need. 

Required Packages

In order to download the ready-to-use phishing detection Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy and it unlocks the ActiveState Platform’s many benefits for you!

Supported Platforms

For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Phishing-URL-Detection"

For Linux users, run the following to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Phishing-URL-Detection

1 — How to Identify A Fraudulent URL

A fraudulent or phishing domain is a URL that looks suspicious for a variety of reasons. Most commonly, the URL: 

  • Is misspelled
  • Points to the wrong top-level domain
  • Combines a valid URL with a fraudulent one
  • Is incredibly long 
  • Is just an IP address
  • Has a low PageRank
  • Has a young domain age
  • Ranks poorly on the Alexa Top 1 Million Sites

All these are characteristics of a phishing URL that can help us distinguish it from a valid URL. These characteristics can be converted into machine learning feature sets such as numbers, labels and booleans.
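
As a rough illustration of that conversion (this is not part of the UCI dataset used below, and the function name and exact checks are just assumptions for demonstration), here is a minimal sketch of how a few of these characteristics could be pulled out of a raw URL string using only the Python standard library:

import re
from urllib.parse import urlparse

def extract_basic_features(url):
    """Toy example: turn a raw URL into a few numeric/boolean features."""
    host = urlparse(url).netloc
    return {
        # Very long URLs are a common phishing trait
        "url_length": len(url),
        # The host is a bare IP address instead of a domain name
        "has_ip_address": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host)),
        # An '@' symbol can hide the real destination
        "has_at_symbol": "@" in url,
        # A '-' in the domain is often used to mimic legitimate brands
        "has_prefix_suffix": "-" in host,
        # Many nested subdomains, e.g. paypal.com.secure-login.example.com
        "subdomain_count": max(host.count(".") - 1, 0),
    }

print(extract_basic_features("http://192.168.0.1/login@secure-paypal.com"))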

The University of California, Irvine put together a dataset identifying fraudulent versus valid URLs. Feature sets are divided into four main categories:

  1. Address Bar-Based Features – these are features extracted from the URL itself, like URL length >54 characters, or whether it contains an IP address, uses a URL shortening service like TinyURL or Bitly, or employs redirection. Additional features may also include:

    • Adding a prefix or suffix separated by (-) to the domain
    • Having sub-domain and multi-sub-domains
    • Existence of HTTPS
    • Domain registration age
    • Favicon loading from a different domain
    • Using a non-standard port
  2. Abnormal Features – these may include: 
    • Images in the body loaded from a different URL
    • Minimal use of meta tags
    • The use of a Server Form Handler (SFH)
    • Submitting information to email
    • An abnormal URL
  3. HTML and JavaScript-Based Features – these can include things like: 
    • Website forwarding 
    • Status bar customization, typically using JavaScript to display a fake URL 
    • Disabling the ability to right-click so users can’t view page source code
    • Using pop-up windows
    • iFrame redirection
  4. Domain-Based Features – these can include:
    • Unusually young domains
    • Suspicious DNS record
    • Low volume of website traffic
    • PageRank, where 95% of phishing webpages have no PageRank
    • Whether the site has been indexed by Google

2 — Building A Decision Tree

Given all the criteria that can help us identify phishing URLs, we can use a machine learning algorithm, such as a decision tree classifier, to help us decide whether a URL is valid or not. 

First, let’s download the UC Irvine dataset and explore its contents. The feature list contains:

  • having_IP_Address  { -1,1 }
  • URL_Length   { 1,0,-1 }
  • Shortining_Service { 1,-1 }
  • having_At_Symbol   { 1,-1 }
  • double_slash_redirecting { -1,1 }
  • Prefix_Suffix  { -1,1 }
  • having_Sub_Domain  { -1,0,1 }
  • SSLfinal_State  { -1,1,0 }
  • Domain_registeration_length { -1,1 }
  • Favicon { 1,-1 }
  • port { 1,-1 }
  • HTTPS_token { -1,1 }
  • Request_URL  { 1,-1 }
  • URL_of_Anchor { -1,0,1 }
  • Links_in_tags { 1,-1,0 }
  • SFH  { -1,1,0 }
  • Submitting_to_email { -1,1 }
  • Abnormal_URL { -1,1 }
  • Redirect  { 0,1 }
  • on_mouseover  { 1,-1 }
  • RightClick  { 1,-1 }
  • popUpWidnow  { 1,-1 }
  • Iframe { 1,-1 }
  • age_of_domain  { -1,1 }
  • DNSRecord   { -1,1 }
  • web_traffic  { -1,0,1 }
  • Page_Rank { -1,1 }
  • Google_Index { 1,-1 }
  • Links_pointing_to_page { 1,0,-1 }
  • Statistical_report { -1,1 }

And finally, the Result designates whether the URL is valid or not:

  • Result  { -1,1 }

Where -1 denotes an invalid URL and 1 is a valid URL.
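
To make the encoding concrete, here is a hedged sketch of how a couple of these features might be mapped to the dataset’s {-1, 0, 1} convention. The thresholds follow the dataset’s published feature descriptions, but treat them as assumptions rather than the exact extraction code used to build the CSV:

import re
from urllib.parse import urlparse

def encode_url_length(url):
    # Per the dataset description: short URLs look legitimate (1),
    # mid-length URLs are suspicious (0), very long ones phishing (-1)
    if len(url) < 54:
        return 1
    elif len(url) <= 75:
        return 0
    return -1

def encode_having_ip_address(url):
    # -1 (phishing indicator) when the host is a bare IP address
    host = urlparse(url).netloc
    return -1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host) else 1

print(encode_url_length("http://example.com"), encode_having_ip_address("http://10.0.0.1/login"))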

Now let’s jump into the code. First, we load the required modules:

# To perform operations on dataset
import pandas as pd
import numpy as np
# Machine learning model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Visualization
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import export_graphviz

Next, we read the dataset and set up the output file paths:

df = pd.read_csv('.../dataset.csv')
dot_file = '.../tree.dot'
confusion_matrix_file = '.../confusion_matrix.png'

And then print the first few rows:

print(df.head())
-1  1  1.1  1.2  -1.1  -1.2  -1.3  -1.4  -1.5  1.3  1.4  -1.6  1.5  -1.7  1.6     ...    -1.9  -1.10  0  1.7  1.8  1.9  1.10  -1.11  -1.12  -1.13  -1.14  1.11  1.12  -1.15  -1.16
0   1  1    1    1     1    -1     0     1    -1    1    1    -1    1     0   -1  ...       1      1  0    1    1    1     1     -1     -1      0     -1     1     1      1     -1
1   1  0    1    1     1    -1    -1    -1    -1    1    1    -1    1     0   -1  ...      -1     -1  0    1    1    1     1      1     -1      1     -1     1     0     -1     -1
2   1  0    1    1     1    -1    -1    -1     1    1    1    -1   -1     0    0  ...       1      1  0    1    1    1     1     -1     -1      1     -1     1    -1      1     -1
3   1  0   -1    1     1    -1     1     1    -1    1    1     1    1     0    0  ...       1      1  0   -1    1   -1     1     -1     -1      0     -1     1     1      1      1
4  -1  0   -1    1    -1    -1     1     1    -1    1    1    -1    1     0    0  ...      -1     -1  0    1    1    1     1      1      1      1     -1     1    -1     -1      1

The printout shows the first 5 rows of the dataset, which has 31 columns: one for each of the attributes we discussed in the section above, plus the Result column.
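
Note that the column headers in the printout above look like data values (-1, 1, 1.1, …), which suggests this copy of the CSV has no header row, so pandas promoted the first data row to column names. If that’s the case for your copy too, you can pass explicit column names when loading. The names list below is just the feature list from the previous section; adjust it to match your file:

# Only needed if your dataset.csv has no header row
feature_names = [
    'having_IP_Address', 'URL_Length', 'Shortining_Service', 'having_At_Symbol',
    'double_slash_redirecting', 'Prefix_Suffix', 'having_Sub_Domain', 'SSLfinal_State',
    'Domain_registeration_length', 'Favicon', 'port', 'HTTPS_token', 'Request_URL',
    'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL',
    'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe', 'age_of_domain',
    'DNSRecord', 'web_traffic', 'Page_Rank', 'Google_Index', 'Links_pointing_to_page',
    'Statistical_report', 'Result',
]
df = pd.read_csv('.../dataset.csv', header=None, names=feature_names)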

3 — Train the Model

As always, the first step in training a machine learning model is to split the dataset into testing and training data:

X = df.iloc[:, :-1]
y = df.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
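
By default, train_test_split holds out 25% of the rows for testing. If you want to control the split size, or keep the proportion of phishing vs. valid URLs the same in both sets, you can be explicit (a small optional tweak, not part of the original code):

# Optional: explicit test size and class-balanced (stratified) split
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(Xtrain.shape, Xtest.shape)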

Since the dataset consists of categorical features (each encoded as -1, 0 or 1), classification algorithms such as a decision tree, random forest or logistic regression are a natural fit. In this case, I chose to work with a decision tree, because it’s straightforward, quick to train, and easy to interpret.

model = DecisionTreeClassifier()
model.fit(Xtrain, ytrain)
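
Once the model is fit, a quick sanity check (not in the original post) is to look at which features the tree relies on most; scikit-learn exposes this through the feature_importances_ attribute:

# Rank features by how much they contribute to the tree's splits
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))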

4 — Evaluate the Model

Now that the model is trained, let’s see how well it does on the test data:

ypred = model.predict(Xtest)
print(metrics.classification_report(ytest, ypred))
print("\n\nAccuracy Score:", metrics.accuracy_score(ytest, ypred).round(2)*100, "%")

We used the model to predict Xtest data. Now let’s compare the results to ytest and see how well we did:

             precision    recall  f1-score   support
         -1       0.95      0.95      0.95      1176
          1       0.96      0.96      0.96      1588
  micro avg       0.96      0.96      0.96      2764
  macro avg       0.96      0.96      0.96      2764
weighted avg       0.96      0.96      0.96      2764
Accuracy Score: 96.0 %

Not bad! We made no modifications to the data and achieved an accuracy score of 96%. From here, you can dive deeper into the data and see if there’s any transformation that can be done to further improve the accuracy of prediction.
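
One low-effort place to start is tuning the tree itself. An unconstrained decision tree can overfit, so a small grid search over its depth and leaf size (a sketch with arbitrary parameter ranges, not part of the original post) can sometimes buy an extra point or two of accuracy:

from sklearn.model_selection import GridSearchCV

# Try a few tree sizes with 5-fold cross-validation on the training set
param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(Xtrain, ytrain)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_ * 100, 2), "%")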

5 — Identify False Positives & False Negatives

The results of any decision tree evaluation are likely to contain both false positives (URLs that are actually valid, but that our model indicates are not), as well as false negatives (URLs that are actually bad, but our model indicates are fine). To help resolve these instances, let’s draw out a confusion matrix (a table with 4 different combinations of predicted and actual values) for our results. The matrix will help us identify:

  • True Positives
  • True Negatives
  • False Positives (Type 1 Error)
  • False Negatives (Type 2 Error)

mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.savefig(confusion_matrix_file)

As you can see, the number of false positives and false negatives are pretty low compared to our true positives and negatives, so we can be pretty sure of our results.
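
If you want the raw counts behind the heatmap, you can unpack the matrix directly. To match the definitions above (a “positive” being a URL flagged as phishing), order the labels so that -1 (phishing) is treated as the positive class:

# Order labels as [valid, phishing] so the familiar tn/fp/fn/tp unpacking
# treats "flagged as phishing" as the positive prediction
tn, fp, fn, tp = confusion_matrix(ytest, ypred, labels=[1, -1]).ravel()
print("True positives (phishing caught):    ", tp)
print("True negatives (valid passed):       ", tn)
print("False positives (valid flagged bad): ", fp)
print("False negatives (phishing missed):   ", fn)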

To see how the decision tree panned out in making these decisions, we can visualize it using scikit-learn’s export_graphviz and the Graphviz dot command-line tool.

export_graphviz(model, out_file=dot_file, feature_names=X.columns.values)
>> dot -Tpng tree.dot -o tree.png

We use export_graphviz to create a dot file of the decision tree, which is a text file that lets us visualize the actual bifurcations in decisions. Then, using the command line tool dot we convert the text file to a PNG image which shows our final “tree” of decisions (open it in a new tab to view the details):

Decision Tree
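
If you don’t have the Graphviz dot tool installed, recent versions of scikit-learn can also render the tree directly with matplotlib via plot_tree, which skips the separate conversion step (the output file name below is just an example):

from sklearn.tree import plot_tree

# Render the (large) tree straight to a PNG without Graphviz
plt.figure(figsize=(40, 20))
plot_tree(model, feature_names=X.columns.values, filled=True, fontsize=6)
plt.savefig('tree_sklearn.png', dpi=150)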

Phishing URL Detection with Python: Summary

These days, when everyone is working from home, there’s a lot less opportunity to just casually ask your office colleagues if they’ve received a suspicious email like the one you just got. And attackers know it, driving a 300% increase in cybercrime since the start of the pandemic. It’s always good practice to check every link before you click on it, but of course, busy employees can get careless.

This blog post showed you how, given a set of criteria that can typically identify phishing URLs, you can build and train a simple decision tree model to evaluate any given URL, and indicate whether it is actually valid or not with 96% accuracy. Now, if only it was as easy as this to prevent people from clicking fraudulent links in the first place!

Related Reads

Top 5 Cybersecurity Tools for a Work-from-Home World

Using Python for CyberSecurity Testing
