Dataset Access

Access the world's largest open-source dataset of phishing, smishing, and scam messages. It is available to researchers, developers, and organizations committed to fighting digital deception.

Access Options

Choose the access method that best fits your needs. All data is released under Creative Commons CC0 (Public Domain).

GitHub Dump

Full PostgreSQL database dump with all anonymized submissions, updated weekly.

Complete dataset export
SQL format for easy import
No authentication required
Weekly updates

Download from GitHub

Best for: Researchers, bulk analysis, offline use

REST API

Programmatic access to query and filter the dataset in real time.

Real-time data access
Advanced filtering & search
OpenAPI documentation
Rate limits: 1000 req/hour

View API Docs

Best for: Applications, integrations, dynamic queries
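
A limit of 1000 requests per hour works out to one request every 3.6 seconds on average. The sketch below shows one way a client could pace itself to stay under that cap. The `Throttle` class and its parameters are illustrative and not part of any Sting9 library, and how the API actually enforces the limit (fixed window, sliding window, response headers) is an assumption left open here.

```python
import time


class Throttle:
    """Spaces calls evenly so at most max_calls happen per period."""

    def __init__(self, max_calls=1000, period_seconds=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two consecutive calls, in seconds.
        self.min_interval = period_seconds / max_calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` immediately before each API request keeps a long-running script at or below the hourly budget without tracking a request counter.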

Partner Access

Enhanced access for academic institutions, security companies, and verified researchers.

Bulk submission API
Higher rate limits
Early access to new models
Dedicated support

Become a Partner

Best for: Organizations, bulk contributors, researchers

Data Format & Schema

Core Tables: Submissions

The main submissions table stores all phishing, smishing, and scam messages with anonymized content and metadata:

Field	Type	Description
submission_id	UUID	Primary key, unique identifier
message_type	VARCHAR(20)	email, sms, whatsapp, telegram, signal, other
threat_level	VARCHAR(20)	low, medium, high, critical, unknown
attack_type	VARCHAR(50)	phishing, spear-phishing, smishing, BEC, romance_scam, etc.
subject_text	TEXT	Email subject or SMS preview (PII redacted)
body_text	TEXT	Message content (PII redacted)
raw_headers	JSONB	Email headers with PII removed
detected_language	VARCHAR(10)	ISO 639-1 language code (e.g., 'en', 'es', 'fr')
detected_country	VARCHAR(2)	ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB')
sender_domain	VARCHAR(255)	Sender domain (preserved for pattern analysis)
sender_domain_hash	BYTEA	SHA-256 hash of full sender email
claimed_sender_name	VARCHAR(255)	Display name from sender (may be spoofed)
message_timestamp	TIMESTAMPTZ	When message was originally sent
submission_timestamp	TIMESTAMPTZ	When submitted to Sting9
verified	BOOLEAN	Manually verified as malicious
confidence_score	DECIMAL(3,2)	ML model confidence (0.00-1.00)
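
When loading rows from the dump or the API, it can help to check them against the value sets listed in the table above. The following is a minimal sketch; the function name `validate_submission` is hypothetical, and it assumes that rows arrive as dictionaries keyed by column name and that `confidence_score` and `detected_country` may be missing (nullability is not stated in the schema above).

```python
from decimal import Decimal

# Allowed values copied from the Submissions table description.
MESSAGE_TYPES = {"email", "sms", "whatsapp", "telegram", "signal", "other"}
THREAT_LEVELS = {"low", "medium", "high", "critical", "unknown"}


def validate_submission(row):
    """Return a list of problems found in one row; empty means it looks valid."""
    problems = []
    if row.get("message_type") not in MESSAGE_TYPES:
        problems.append("unexpected message_type: %r" % row.get("message_type"))
    if row.get("threat_level") not in THREAT_LEVELS:
        problems.append("unexpected threat_level: %r" % row.get("threat_level"))

    # confidence_score is DECIMAL(3,2) and documented as 0.00-1.00.
    score = row.get("confidence_score")
    if score is not None and not (Decimal("0") <= Decimal(str(score)) <= Decimal("1")):
        problems.append("confidence_score out of range: %r" % score)

    # detected_country is a two-letter uppercase code such as 'US'.
    country = row.get("detected_country")
    if country is not None and not (len(country) == 2 and country.isalpha()
                                    and country.isupper()):
        problems.append("unexpected detected_country: %r" % country)
    return problems
```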

URLs Table

Extracted and analyzed URLs with threat intelligence:

Field	Description
url_id	UUID primary key
full_url	Complete URL (defanged)
domain	Domain name
threat_status	safe, suspicious, malicious
is_shortened	URL shortener detection
blocklist_sources	Which blocklists flagged it
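
Because `full_url` is stored defanged, analysis code often needs to convert between the defanged and original forms. The exact defanging convention used in the dataset is not documented here; the sketch below assumes the common `hxxp` scheme and `[.]` dot replacement, and both helper names are illustrative.

```python
def defang(url):
    """Rewrite a URL so it is no longer clickable (assumed hxxp / [.] style)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: only neutralize the dots.
        return url.replace(".", "[.]")
    if scheme.lower().startswith("http"):
        scheme = "hxxp" + scheme[4:]
    return scheme + sep + rest.replace(".", "[.]")


def refang(url):
    """Reverse defang(); only use the result inside an isolated analysis setup."""
    restored = url.replace("[.]", ".")
    if restored.lower().startswith("hxxp"):
        restored = "http" + restored[4:]
    return restored
```

Keeping values defanged until the moment they are needed reduces the chance of a live malicious link ending up in logs, notebooks, or reports.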

Attachments Table

Metadata about file attachments (no file content stored):

Field	Description
attachment_id	UUID primary key
filename	Original filename
file_hash	SHA-256 hash
mime_type	File type
is_executable	Executable file detection
malware_scan_result	clean, suspicious, malicious
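
Since only metadata is stored, `file_hash` is the way to match a file you already hold against the dataset. Below is a minimal sketch for computing a SHA-256 digest of a local file with Python's standard library. It assumes `file_hash` can be compared as a lowercase hexadecimal string; the stored encoding is not stated above, so adjust the comparison if it differs.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large attachments.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```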

Email Details Table

Extended metadata for email messages:

Field	Description
spf_result	SPF validation (pass/fail)
dkim_result	DKIM signature validation
dmarc_result	DMARC validation
uses_url_shorteners	Contains bit.ly, tinyurl, etc.
uses_homoglyphs	Lookalike characters (paypa1.com)
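
To illustrate what a lookalike-character check such as `uses_homoglyphs` might involve, here is a simplified sketch. It is not the detection logic used to populate the dataset; the character map, the function name `impersonated_brand`, and the brand list passed in are all illustrative assumptions, and a real detector would cover far more substitutions.

```python
# Map a few commonly abused lookalike characters to the letter they imitate.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "l", "3": "e", "5": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
})


def impersonated_brand(domain, known_brands):
    """Return the brand a domain label imitates via lookalikes, or None."""
    for label in domain.lower().split("."):
        normalized = label.translate(LOOKALIKES)
        # Only flag labels that changed: an exact brand match is not a lookalike.
        if normalized != label and normalized in known_brands:
            return normalized
    return None
```

For example, `impersonated_brand("paypa1.com", {"paypal"})` flags the domain because the digit `1` stands in for the letter `l`, while the genuine `paypal.com` is left alone.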

SMS Details Table

Extended metadata for SMS/text messages:

Field	Description
message_length	Character count
sender_type	shortcode, longcode, alphanumeric
contains_url	Message includes URL
urgency_keywords	Contains urgent/immediate language
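
The SMS fields above are simple enough to approximate from the message text alone, which is handy for comparing your own samples against the dataset. This is a rough sketch, not the pipeline that produced the stored values: the keyword list and URL pattern are assumptions, and `urgency_keywords` is treated as a boolean here even though its stored type is not specified.

```python
import re

# Matches plain and defanged (hxxp) URLs, plus bare www. links.
URL_PATTERN = re.compile(r"(?:https?|hxxps?)://\S+|\bwww\.\S+", re.IGNORECASE)

# Illustrative phrases only; a real list would be longer and multilingual.
URGENCY_WORDS = ("urgent", "immediately", "act now", "expires",
                 "suspended", "verify now")


def sms_features(text):
    """Derive approximate SMS detail fields from a message body."""
    lowered = text.lower()
    return {
        "message_length": len(text),
        "contains_url": bool(URL_PATTERN.search(text)),
        "urgency_keywords": any(word in lowered for word in URGENCY_WORDS),
    }
```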

Privacy Guarantee

All personal information (email addresses, phone numbers, names, addresses, etc.) is automatically detected and redacted using advanced PII detection algorithms before being added to the dataset. URLs are defanged to prevent accidental clicks.
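
To give a feel for what redaction looks like in practice, here is a deliberately simple sketch using regular expressions. It is not Sting9's redaction pipeline, whose algorithms are not described here; the patterns, placeholder tokens, and the function name `redact` are assumptions, and real PII detection also has to handle names, postal addresses, and many more formats.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional +, then digits mixed with common separators.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)  # emails first, so digits in them are gone
    text = PHONE.sub("[PHONE]", text)
    return text
```

If you plan to contribute messages, running a pass like this locally before submission adds a second layer of protection on top of the automated redaction.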

Usage Examples

Python with pandas

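import pandas as pd
import requests

# Option 1: query the API for English-language email submissions
response = requests.get(
    'https://api.sting9.org/v1/submissions',
    params={
        'message_type': 'email',
        'language': 'en',
        'limit': 1000
    },
    timeout=30
)
response.raise_for_status()  # fail early on HTTP errors
df = pd.DataFrame(response.json())

# Option 2: load from a local PostgreSQL database restored from the GitHub dump
# (passing a connection URL to read_sql requires SQLAlchemy and a PostgreSQL driver)
df = pd.read_sql(
    'SELECT * FROM submissions',
    'postgresql://localhost/sting9'
)

# Count submissions per attack type (the schema column is attack_type)
print(df['attack_type'].value_counts())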

JavaScript / Node.js

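// Using the fetch API (top-level await requires an ES module)
const response = await fetch(
  'https://api.sting9.org/v1/submissions',
  {
    method: 'GET',
    headers: {
      // Accept states the response format we want;
      // Content-Type is not needed on a GET request with no body
      'Accept': 'application/json'
    }
  }
);

if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const data = await response.json();

// Filter phishing emails (the schema column is attack_type)
const phishingEmails = data.filter(
  item => item.attack_type === 'phishing' &&
          item.message_type === 'email'
);

console.log(
  `Found ${phishingEmails.length} phishing emails`
);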

License & Terms of Use

Creative Commons CC0 (Public Domain)

The Sting9 dataset is released under Creative Commons Zero (CC0), dedicating it to the public domain. This means:

You can copy, modify, and distribute the dataset without asking permission
You can use it for commercial or non-commercial purposes
No attribution is required (though appreciated!)
No warranty or liability is provided

Ethical Use Guidelines

While the data is public domain, we encourage ethical use:

Do not attempt to re-identify anonymized individuals
Do not use the data to create or improve malicious tools
Do use it to improve security and protect people
Do share your findings and improvements with the community

Suggested Citation: Sting9 Research Initiative. (2025). Sting9 Phishing and Scam Message Dataset. Retrieved from https://sting9.org/dataset

Ready to Get Started?

Download the dataset and start building better detection systems today.

Download from GitHub
View API Documentation