Access the world's largest open-source dataset of phishing, smishing, and scam messages. Available for researchers, developers, and organizations committed to fighting digital deception.
Choose the access method that best fits your needs. All data is released under Creative Commons CC0 (Public Domain).
Full PostgreSQL database dump with all anonymized submissions, updated weekly.
Best for: Researchers, bulk analysis, offline use
Programmatic access to query and filter the dataset in real time.
Best for: Applications, integrations, dynamic queries
Enhanced access for academic institutions, security companies, and verified researchers.
Best for: Organizations, bulk contributors, researchers
The main submissions table stores all phishing, smishing, and scam messages with anonymized content and metadata:
| Field | Type | Description |
|---|---|---|
| submission_id | UUID | Primary key, unique identifier |
| message_type | VARCHAR(20) | email, sms, whatsapp, telegram, signal, other |
| threat_level | VARCHAR(20) | low, medium, high, critical, unknown |
| attack_type | VARCHAR(50) | phishing, spear-phishing, smishing, BEC, romance_scam, etc. |
| subject_text | TEXT | Email subject or SMS preview (PII redacted) |
| body_text | TEXT | Message content (PII redacted) |
| raw_headers | JSONB | Email headers with PII removed |
| detected_language | VARCHAR(10) | ISO 639-1 language code (e.g., 'en', 'es', 'fr') |
| detected_country | VARCHAR(2) | ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB') |
| sender_domain | VARCHAR(255) | Sender domain (preserved for pattern analysis) |
| sender_domain_hash | BYTEA | SHA-256 hash of full sender email |
| claimed_sender_name | VARCHAR(255) | Display name from sender (may be spoofed) |
| message_timestamp | TIMESTAMPTZ | When message was originally sent |
| submission_timestamp | TIMESTAMPTZ | When submitted to Sting9 |
| verified | BOOLEAN | Manually verified as malicious |
| confidence_score | DECIMAL(3,2) | ML model confidence (0.00-1.00) |
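As a rough illustration of how the constraints in the table could be checked on the client side, the sketch below validates three of the fields. `SubmissionCheck` and its `problems` method are hypothetical names introduced here, not part of any official Sting9 client; the allowed values are copied from the table, and everything else is an assumption.

```python
from dataclasses import dataclass
from decimal import Decimal

# Allowed values copied from the field table above.
MESSAGE_TYPES = {"email", "sms", "whatsapp", "telegram", "signal", "other"}
THREAT_LEVELS = {"low", "medium", "high", "critical", "unknown"}


@dataclass
class SubmissionCheck:
    """Hypothetical helper for illustration; not an official Sting9 type."""

    message_type: str
    threat_level: str
    confidence_score: Decimal

    def problems(self) -> list[str]:
        """Return human-readable descriptions of any constraint violations."""
        found = []
        if self.message_type not in MESSAGE_TYPES:
            found.append(f"unknown message_type: {self.message_type!r}")
        if self.threat_level not in THREAT_LEVELS:
            found.append(f"unknown threat_level: {self.threat_level!r}")
        # DECIMAL(3,2) with a documented range of 0.00 to 1.00.
        if not (Decimal("0.00") <= self.confidence_score <= Decimal("1.00")):
            found.append(f"confidence_score out of range: {self.confidence_score}")
        return found


ok = SubmissionCheck("email", "high", Decimal("0.97"))
bad = SubmissionCheck("fax", "high", Decimal("1.50"))
print(ok.problems())
print(bad.problems())
```

A check like this may be useful when merging rows from the database dump with rows fetched from the API, since it flags values that fall outside the documented sets before analysis begins.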
Extracted and analyzed URLs with threat intelligence:
Metadata about file attachments (no file content stored):
Extended metadata for email messages:
Extended metadata for SMS/text messages:
All personal information (email addresses, phone numbers, names, addresses, etc.) is automatically detected and redacted using advanced PII detection algorithms before being added to the dataset. URLs are defanged to prevent accidental clicks.
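To make the idea of redaction and defanging concrete, here is a minimal sketch of the two steps for email addresses and HTTP(S) URLs only. It is not Sting9's actual pipeline: the regular expressions, the `[EMAIL]` placeholder, the `hxxp` and `[.]` defanging convention, and the function names `defang_url` and `sanitize` are all assumptions chosen for illustration.

```python
import re

# Deliberately simple patterns; real PII detection covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def defang_url(url: str) -> str:
    """Rewrite a URL so it is no longer clickable (http -> hxxp, '.' -> '[.]' in the host)."""
    scheme, rest = url.split("://", 1)
    scheme = scheme.lower().replace("http", "hxxp")
    host, sep, path = rest.partition("/")
    return f"{scheme}://{host.replace('.', '[.]')}{sep}{path}"


def sanitize(text: str) -> str:
    """Redact email addresses first, then defang any remaining URLs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return URL_RE.sub(lambda m: defang_url(m.group(0)), text)


sample = "Contact jane.doe@example.com or visit http://evil.example.com/login now"
print(sanitize(sample))
```

Redacting emails before defanging keeps the email pattern from having to recognize the bracketed dots that defanging introduces.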
```python
import pandas as pd
import requests

# Using the API
response = requests.get(
    'https://api.sting9.org/v1/submissions',
    params={
        'message_type': 'email',
        'language': 'en',
        'limit': 1000
    }
)
response.raise_for_status()  # stop early on HTTP errors
df = pd.DataFrame(response.json())

# Or load from the GitHub dump, restored into a local PostgreSQL database
# (passing a connection URL to read_sql requires SQLAlchemy and a PostgreSQL driver)
df = pd.read_sql(
    'SELECT * FROM submissions',
    'postgresql://localhost/sting9'
)

# Analyze attack types
df['attack_type'].value_counts()
```

```javascript
// Using fetch API (top-level await requires an ES module or an async function)
const response = await fetch(
  'https://api.sting9.org/v1/submissions',
  {
    method: 'GET',
    headers: {
      'Accept': 'application/json'
    }
  }
);
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}
const data = await response.json();

// Filter phishing emails
const phishingEmails = data.filter(
  item => item.attack_type === 'phishing' &&
    item.message_type === 'email'
);
console.log(
  `Found ${phishingEmails.length} phishing emails`
);
```

The Sting9 dataset is released under Creative Commons Zero (CC0), dedicating it to the public domain. This means:
While the data is public domain, we encourage ethical use:
Suggested Citation: Sting9 Research Initiative. (2025). Sting9 Phishing and Scam Message Dataset. Retrieved from https://sting9.org/dataset
Download the dataset and start building better detection systems today.