Dataset Access

Access the world's largest open-source dataset of phishing, smishing, and scam messages. Available for researchers, developers, and organizations committed to fighting digital deception.

Access Options

Choose the access method that best fits your needs. All data is released under Creative Commons CC0 (Public Domain).

GitHub Dump

Full PostgreSQL database dump with all anonymized submissions, updated weekly.

  • Complete dataset export
  • SQL format for easy import
  • No authentication required
  • Weekly updates
Download from GitHub

Best for: Researchers, bulk analysis, offline use

REST API

Programmatic access to query and filter the dataset in real-time.

  • Real-time data access
  • Advanced filtering & search
  • OpenAPI documentation
  • Rate limits: 1000 req/hour
View API Docs

Best for: Applications, integrations, dynamic queries
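
The documented limit of 1000 requests per hour averages out to one request every 3.6 seconds. The sketch below spaces calls on the client side, assuming the limit behaves like a simple hourly budget (this page does not say how it is enforced). `Throttle` is a hypothetical helper, not part of any Sting9 client library.

```python
import time


class Throttle:
    """Space calls so that at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls=1000, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls  # 3.6 s for 1000 req/hour
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call made yet

    def wait(self):
        """Block until the next call is allowed, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` immediately before each API request keeps a long-running job under the limit without tracking request counts.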

Partner Access

Enhanced access for academic institutions, security companies, and verified researchers.

  • Bulk submission API
  • Higher rate limits
  • Early access to new models
  • Dedicated support
Become a Partner

Best for: Organizations, bulk contributors, researchers

Data Format & Schema

Core Tables: Submissions

The main submissions table stores all phishing, smishing, and scam messages with anonymized content and metadata:

Field                  Type           Description
submission_id          UUID           Primary key, unique identifier
message_type           VARCHAR(20)    email, sms, whatsapp, telegram, signal, other
threat_level           VARCHAR(20)    low, medium, high, critical, unknown
attack_type            VARCHAR(50)    phishing, spear-phishing, smishing, BEC, romance_scam, etc.
subject_text           TEXT           Email subject or SMS preview (PII redacted)
body_text              TEXT           Message content (PII redacted)
raw_headers            JSONB          Email headers with PII removed
detected_language      VARCHAR(10)    ISO 639-1 language code (e.g., 'en', 'es', 'fr')
detected_country       VARCHAR(2)     ISO 3166-1 country code (e.g., 'US', 'GB')
sender_domain          VARCHAR(255)   Sender domain (preserved for pattern analysis)
sender_domain_hash     BYTEA          SHA-256 hash of full sender email
claimed_sender_name    VARCHAR(255)   Display name from sender (may be spoofed)
message_timestamp      TIMESTAMPTZ    When message was originally sent
submission_timestamp   TIMESTAMPTZ    When submitted to Sting9
verified               BOOLEAN        Manually verified as malicious
confidence_score       DECIMAL(3,2)   ML model confidence (0.00-1.00)
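
The enumerated values and the score range in the table can be checked after loading rows. This is a minimal sketch, assuming each row arrives as a Python dict keyed by column name and that `message_type` and `threat_level` are always populated (nullability is not documented here). `validate_submission` is a hypothetical helper.

```python
# Allowed values copied from the submissions table description.
ALLOWED_VALUES = {
    "message_type": {"email", "sms", "whatsapp", "telegram", "signal", "other"},
    "threat_level": {"low", "medium", "high", "critical", "unknown"},
}


def validate_submission(row):
    """Return a list of problems found in one submissions row (a dict)."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = row.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # confidence_score is DECIMAL(3,2); a missing score is tolerated here.
    score = row.get("confidence_score")
    if score is not None and not (0.0 <= float(score) <= 1.0):
        problems.append(f"confidence_score: {score!r} outside 0.00-1.00")
    return problems
```

`attack_type` is left unchecked because the table lists its values as open-ended ("etc.").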

URLs Table

Extracted and analyzed URLs with threat intelligence:

  • url_id: UUID primary key
  • full_url: Complete URL (defanged)
  • domain: Domain name
  • threat_status: safe, suspicious, malicious
  • is_shortened: URL shortener detection
  • blocklist_sources: Which blocklists flagged it
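
Because `full_url` is stored defanged, analysis code usually needs to convert in both directions. The sketch below assumes the common `hxxp` and `[.]` convention; the exact defanging format used in the dataset is not specified on this page, so check a few rows before relying on it.

```python
def defang(url):
    """Make a URL non-clickable: http -> hxxp, '.' -> '[.]'."""
    scheme, sep, rest = url.partition("://")
    if not sep:  # no scheme present, only neutralize the dots
        return url.replace(".", "[.]")
    if scheme.lower().startswith("http"):
        scheme = "hxxp" + scheme[4:]  # keeps a trailing 's' for https
    return scheme + sep + rest.replace(".", "[.]")


def refang(url):
    """Reverse defang(); use only inside a sandboxed analysis environment."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        return url.replace("[.]", ".")
    if scheme.lower().startswith("hxxp"):
        scheme = "http" + scheme[4:]
    return scheme + sep + rest.replace("[.]", ".")
```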

Attachments Table

Metadata about file attachments (no file content stored):

  • attachment_id: UUID primary key
  • filename: Original filename
  • file_hash: SHA-256 hash
  • mime_type: File type
  • is_executable: Executable file detection
  • malware_scan_result: clean, suspicious, malicious
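
Since only hashes are stored, matching a local sample against the dataset means computing its SHA-256 digest yourself. A minimal sketch follows; it returns a lowercase hex string, and whether `file_hash` is stored as hex text or raw bytes is an assumption to verify against the dump.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large attachments.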

Email Details Table

Extended metadata for email messages:

  • spf_result: SPF validation (pass/fail)
  • dkim_result: DKIM signature validation
  • dmarc_result: DMARC validation
  • uses_url_shorteners: Contains bit.ly, tinyurl, etc.
  • uses_homoglyphs: Lookalike characters (paypa1.com)
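
To illustrate the kind of signal `uses_homoglyphs` captures, here is a deliberately naive lookalike check. It is not how Sting9 computes the flag; it handles only a few ASCII digit-for-letter swaps, whereas real homoglyph detection also has to cover Unicode confusables. The substitution map and `looks_like` are illustrative assumptions.

```python
# Digits commonly swapped for letters in lookalike domains (not exhaustive).
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})


def looks_like(domain, known_domains):
    """Return the known domain that `domain` imitates, or None."""
    domain = domain.lower()
    if domain in known_domains:
        return None  # an exact match is the real domain, not a lookalike
    normalized = domain.translate(SUBSTITUTIONS)
    return normalized if normalized in known_domains else None
```

With `known_domains = {"paypal.com"}`, the example from the list above, `paypa1.com`, normalizes to `paypal.com` and is reported as an imitation.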

SMS Details Table

Extended metadata for SMS/text messages:

  • message_length: Character count
  • sender_type: shortcode, longcode, alphanumeric
  • contains_url: Message includes URL
  • urgency_keywords: Contains urgent/immediate language
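
Three of these fields can be approximated directly from message text, which is handy for comparing your own samples with the dataset. This sketch treats `urgency_keywords` as a boolean and uses a made-up keyword list; both are assumptions, and the dataset's actual extraction rules are not documented here.

```python
import re

# Matches plain and defanged schemes (http, https, hxxp, hxxps) or a www. prefix.
URL_PATTERN = re.compile(r"(?:h(?:tt|xx)ps?://|www\.)\S+", re.IGNORECASE)

# Illustrative keyword list, not the one used by Sting9.
URGENCY_WORDS = {"urgent", "immediately", "now", "expires", "suspended", "verify"}


def sms_features(text):
    """Derive message_length, contains_url and urgency_keywords from SMS text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "message_length": len(text),
        "contains_url": bool(URL_PATTERN.search(text)),
        "urgency_keywords": bool(words & URGENCY_WORDS),
    }
```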

Privacy Guarantee

All personal information (email addresses, phone numbers, names, addresses, etc.) is automatically detected and redacted by PII detection algorithms before a message is added to the dataset. URLs are defanged to prevent accidental clicks.
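
As a rough picture of what pattern-based redaction looks like, the sketch below replaces email addresses and phone-like numbers with placeholders. It is not Sting9's redaction pipeline: the regular expressions and the `[EMAIL]` and `[PHONE]` tokens are assumptions, and names or postal addresses would need more than regular expressions to detect.

```python
import re

# Simplified patterns; real-world PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

A check like this is also useful on your own data before submitting it, as a second line of defense.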

Usage Examples

Python with pandas

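import pandas as pd
import requests

# Option 1: query the REST API
response = requests.get(
    'https://api.sting9.org/v1/submissions',
    params={
        'message_type': 'email',
        'language': 'en',
        'limit': 1000
    },
    timeout=30
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
df = pd.DataFrame(response.json())  # assumes the body is a JSON array of records

# Option 2: load from a local import of the GitHub dump
# (a connection URI requires SQLAlchemy and a PostgreSQL driver)
df = pd.read_sql(
    'SELECT * FROM submissions',
    'postgresql://localhost/sting9'
)

# Count submissions per attack type (column name from the submissions schema)
df['attack_type'].value_counts()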

JavaScript / Node.js

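// Using the fetch API (top-level await requires an ES module)
const response = await fetch(
  'https://api.sting9.org/v1/submissions',
  {
    method: 'GET',
    headers: {
      'Accept': 'application/json'
    }
  }
);
if (!response.ok) {
  throw new Error(`Request failed: ${response.status}`);
}

// Assumes the body is a JSON array of records
const data = await response.json();

// Filter phishing emails (attack_type is the column name in the submissions schema)
const phishingEmails = data.filter(
  item => item.attack_type === 'phishing' &&
          item.message_type === 'email'
);

console.log(
  `Found ${phishingEmails.length} phishing emails`
);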

License & Terms of Use

Creative Commons CC0 (Public Domain)

The Sting9 dataset is released under Creative Commons Zero (CC0), dedicating it to the public domain. This means:

  • You can copy, modify, and distribute the dataset without asking permission
  • You can use it for commercial or non-commercial purposes
  • No attribution is required (though appreciated!)
  • No warranty or liability is provided

Ethical Use Guidelines

While the data is public domain, we encourage ethical use:

  • Do not attempt to re-identify anonymized individuals
  • Do not use the data to create or improve malicious tools
  • Do use it to improve security and protect people
  • Do share your findings and improvements with the community

Suggested Citation: Sting9 Research Initiative. (2025). Sting9 Phishing and Scam Message Dataset. Retrieved from https://sting9.org/dataset

Ready to Get Started?

Download the dataset and start building better detection systems today.