Automated downloader of annual report filings (56-1 One Report / แบบแสดงรายการข้อมูลประจำปี) for all companies listed on the Stock Exchange of Thailand (SET).
- Overview
- Architecture
- Directory Structure
- Prerequisites
- Quick Start
- Usage Reference
- Naming Conventions
- Data Sources
- Disclaimer
SET-Scrape downloads all publicly available annual report ZIP packages for every company listed on the SET main board, organised by the SET's own industry > sector > symbol hierarchy.
Coverage: 703 stocks across 8 industries and 28 sectors, years BE 2564 onwards (CE 2021 onwards).
Download history is tracked in stats.csv. The downloaded report folders are not
committed to this repository — only the scripts and metadata are versioned.
The pipeline is split into two scripts:
scrape_set.py (Python)
1. Headless Chromium loads the SET homepage once to obtain session cookies
2. GET /api/set/index/list -> industry / sector hierarchy
3. GET /api/set/index/{ind}/composition -> stock list per industry
4. Per symbol:
GET /api/set/company/{sym}/report/one (56-1 One Report, newer)
GET /api/set/company/{sym}/report/form56 (Form 56-1, older)
5. Writes all_entries.json (incremental, safe to interrupt and resume)
|
| all_entries.json
v
download_all.ps1 (PowerShell)
- Reads all_entries.json
- Skips already-downloaded entries (fully idempotent)
- Downloads ZIP -> extracts -> moves to correct industry/sector path
- Appends results to stats.csv
- Logs failures to failed_entries.txt
download_fin_insur.ps1 is a legacy script with hardcoded URLs for the FINCIAL
industry (BANK / FIN / INSUR). It is kept for reference but superseded by the
two-script pipeline above.
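The "download ZIP -> extract -> move" step of the pipeline can be illustrated in Python, although download_all.ps1 itself is PowerShell. This is a minimal sketch, not the script's actual logic: the function names `report_folder_name` and `extract_report` are hypothetical, and the rule that a ZIP with exactly one file gets the SINGLE_ prefix is an assumption inferred from the naming conventions.

```python
import io
import zipfile
from pathlib import Path


def report_folder_name(symbol: str, be_year: int, file_count: int) -> str:
    # Assumption: exactly one file in the ZIP means a SINGLE_ folder.
    prefix = "SINGLE_" if file_count == 1 else ""
    return f"{prefix}{symbol}_report_{be_year}"


def extract_report(zip_bytes: bytes, symbol_dir: Path, symbol: str, be_year: int) -> Path:
    """Extract an already-downloaded ZIP under symbol_dir and return the folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Count real files only; directory entries do not affect the name.
        files = [m for m in zf.infolist() if not m.is_dir()]
        target = symbol_dir / report_folder_name(symbol, be_year, len(files))
        if target.exists():
            # Idempotent: a previous run already produced this folder.
            return target
        target.mkdir(parents=True)
        zf.extractall(target)
    return target
```

Calling `extract_report` a second time with the same arguments returns the existing folder without extracting again, which mirrors the skip behaviour described for the downloader.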
SET-Scrape/
├── scrape_set.py <- Step 1: scrape ZIP URLs -> all_entries.json
├── download_all.ps1 <- Step 2: download ZIPs -> folder tree
├── download_fin_insur.ps1 <- Legacy: hardcoded FINCIAL URLs
├── stats.csv <- Download log (Industry, Sector, Symbol, Year, Single)
├── all_entries.json <- Generated; gitignored
├── failed_entries.txt <- Failed downloads; gitignored
│
├── AGRO/
│ ├── AGRI/ (Agribusiness)
│ └── FOOD/ (Food and Beverage)
├── CONSUMP/
│ ├── FASHION/
│ ├── HELTH/
│ └── HOME/
├── FINCIAL/
│ ├── BANK/
│ ├── FIN/
│ └── INSUR/
├── INDUS/
│ ├── AUTO/
│ ├── ENERG/ (*)
│ ├── PAPER/
│ └── ...
├── PROPCON/
├── RESOURC/
│ ├── ENERG/
│ └── MINE/
├── SERVICE/
│ ├── COMM/
│ ├── HELTH/
│ ├── MEDIA/
│ ├── PROF/
│ ├── TOURISM/
│ └── TRANS/
└── TECH/
  ├── ETRON/
  ├── ICT/
  └── SEMCON/
Each leaf: {SYMBOL}/[SINGLE_]{SYMBOL}_report_{BE_YEAR}/
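To make the leaf layout concrete, the following sketch builds both possible folder paths for one report and checks whether either already exists. It is an illustration in Python of the kind of check the downloader's skip logic implies; the helper names `candidate_folders` and `already_downloaded` are hypothetical and do not come from the scripts.

```python
from pathlib import Path


def candidate_folders(root, industry: str, sector: str, symbol: str, be_year: int) -> list[Path]:
    """Return the multi-file and SINGLE_ folder paths for one report."""
    base = Path(root) / industry / sector / symbol
    name = f"{symbol}_report_{be_year}"
    return [base / name, base / f"SINGLE_{name}"]


def already_downloaded(root, industry: str, sector: str, symbol: str, be_year: int) -> bool:
    """True if either folder variant exists, so the entry can be skipped."""
    return any(p.is_dir() for p in candidate_folders(root, industry, sector, symbol, be_year))
```

For example, `candidate_folders("data", "FINCIAL", "BANK", "XYZ", 2567)` yields `data/FINCIAL/BANK/XYZ/XYZ_report_2567` and `data/FINCIAL/BANK/XYZ/SINGLE_XYZ_report_2567`, where `XYZ` and `data` are placeholders.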
| Requirement | Notes |
|---|---|
| Python 3.10+ | |
| PowerShell 5.1+ | Windows built-in |
| requests | pip install requests |
| playwright | pip install playwright |
| Chromium browser | python -m playwright install chromium |
pip install requests playwright
python -m playwright install chromium
# Step 1: Discover all ZIP URLs for all 703 SET stocks (~5-10 min)
python scrape_set.py
# Step 2: Download everything (skips already-downloaded entries)
.\download_all.ps1

Both steps are idempotent: re-running is safe and skips completed work.
If the scraper is interrupted, re-run it with --resume; the downloader always skips
existing folders automatically.
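One way a resumable scraper can work is to reload the partial output, skip symbols that already have entries, and rewrite the file atomically after each symbol. The sketch below shows that pattern only; it is not taken from scrape_set.py. The schema of all_entries.json is not documented here, so the list-of-objects layout and the `"symbol"` key are assumptions, as are the function names.

```python
import json
import os
from pathlib import Path


def load_entries(path: Path) -> list[dict]:
    """Load previously scraped entries, or an empty list on a first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_entries(path: Path, entries: list[dict]) -> None:
    """Write to a temp file, then swap it in, so an interrupt cannot truncate the output."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def pending_symbols(all_symbols: list[str], entries: list[dict]) -> list[str]:
    """Symbols with no entry yet (assumes each entry carries a 'symbol' key)."""
    done = {e["symbol"] for e in entries}
    return [s for s in all_symbols if s not in done]
```

Under this scheme a symbol with no filings leaves no entry and would be retried on every resume, so a real implementation might record scraped symbols separately.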
python scrape_set.py # full run, all 703 stocks
python scrape_set.py --symbol PTT,ADVANC # test with specific symbols
python scrape_set.py --resume # skip already-scraped symbols
python scrape_set.py --delay 0.5 # slow down requests (seconds)
python scrape_set.py --stocks stocks.csv # manual stock list fallback
python scrape_set.py --output custom.json # custom output path
.\download_all.ps1 # full run
.\download_all.ps1 -TestSymbols "PTT,GFPT" # test with specific symbols
.\download_all.ps1 -JsonFile custom.json # custom input file
.\download_all.ps1 -Delay 1 # sleep 1s between downloads
.\download_all.ps1 -Root "D:\MyData" # custom output root

| Pattern | Meaning |
|---|---|
| {SYMBOL}_report_{YYYY}/ | Multi-file report folder (e.g. One Report + structure diagram) |
| SINGLE_{SYMBOL}_report_{YYYY}/ | Single-file report folder (file kept as-is) |
Years use the Thai Buddhist Era (BE). Subtract 543 for Common Era (CE). Example: BE 2567 = CE 2024.
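The year conversion and the folder patterns above can be combined into a small parser. This is a sketch under the assumption that folder names follow the two patterns exactly; the function names and the regular expression are illustrative and not part of the project's scripts.

```python
import re

BE_OFFSET = 543  # Buddhist Era year minus 543 gives the Common Era year

# Matches {SYMBOL}_report_{YYYY} with an optional SINGLE_ prefix.
FOLDER_RE = re.compile(r"^(?P<single>SINGLE_)?(?P<symbol>.+)_report_(?P<be>\d{4})$")


def be_to_ce(be_year: int) -> int:
    return be_year - BE_OFFSET


def ce_to_be(ce_year: int) -> int:
    return ce_year + BE_OFFSET


def parse_folder(name: str):
    """Return (symbol, be_year, ce_year, is_single), or None if the name does not match."""
    m = FOLDER_RE.match(name)
    if m is None:
        return None
    be_year = int(m.group("be"))
    return m.group("symbol"), be_year, be_to_ce(be_year), m.group("single") is not None
```

For instance, `parse_folder("SINGLE_PTT_report_2567")` returns `("PTT", 2567, 2024, True)`.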
Report file types found inside the folders:
| Filename pattern | Type | Period |
|---|---|---|
| ONE... / E1N... | 56-1 One Report (annual report + registration statement combined) | BE 2565+ |
| F...T21, F...T22 | Traditional Form 56-1 (annual registration statement only) | Before BE 2565 |
| STRUCTURE... | Shareholding structure diagram | Varies |
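A rough classifier for these file types might look like the sketch below. The table gives only abbreviated patterns, so the matching rules here are assumptions: ONE, E1N and STRUCTURE are treated as case-insensitive prefixes, and "F...T21" / "F...T22" is read as a name that starts with F and whose stem ends in T21 or T22. The function name and the sample filenames in the usage note are hypothetical.

```python
from pathlib import PurePath


def classify_report_file(filename: str) -> str:
    """Guess the report type from a filename, using assumed prefix/suffix rules."""
    stem = PurePath(filename).stem.upper()
    if stem.startswith(("ONE", "E1N")):
        return "56-1 One Report"
    if stem.startswith("STRUCTURE"):
        return "Shareholding structure diagram"
    # Checked last so that other names beginning with F are not misread.
    if stem.startswith("F") and stem.endswith(("T21", "T22")):
        return "Form 56-1"
    return "unknown"
```

With these rules, a made-up name such as `F0001T21.pdf` is classified as "Form 56-1" and `readme.txt` as "unknown"; real filenames should be checked against the rules before relying on them.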
| Data | Endpoint |
|---|---|
| Industry / sector hierarchy | https://www.set.or.th/api/set/index/list?type=industry&market=set |
| Stocks per industry | https://www.set.or.th/api/set/index/{INDUSTRY}/composition |
| 56-1 One Report list | https://www.set.or.th/api/set/company/{SYMBOL}/report/one |
| Form 56-1 list | https://www.set.or.th/api/set/company/{SYMBOL}/report/form56 |
| ZIP files | https://weblink.set.or.th/dat/f56/*.zip |
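The endpoint templates in the table can be expanded with a few small helpers. This sketch only builds URL strings and does not fetch anything, since the architecture notes indicate that session cookies from the headless browser step are needed first. The function names are hypothetical, and percent-encoding the path segment is a precaution assumed here rather than documented API behaviour.

```python
from urllib.parse import quote

API_BASE = "https://www.set.or.th/api/set"


def index_list_url() -> str:
    """Industry / sector hierarchy endpoint."""
    return f"{API_BASE}/index/list?type=industry&market=set"


def composition_url(industry: str) -> str:
    """Stock list for one industry code, e.g. AGRO."""
    return f"{API_BASE}/index/{quote(industry, safe='')}/composition"


def report_urls(symbol: str) -> dict[str, str]:
    """Both report-list endpoints for one symbol."""
    sym = quote(symbol, safe="")  # assumption: special characters are percent-encoded
    return {
        "one": f"{API_BASE}/company/{sym}/report/one",
        "form56": f"{API_BASE}/company/{sym}/report/form56",
    }
```

For example, `report_urls("PTT")["one"]` gives `https://www.set.or.th/api/set/company/PTT/report/one`.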
This project is for personal research and educational use only. Downloaded documents are subject to the SET Terms of Use. Do not use this tool for commercial redistribution of the downloaded documents. The SET data is provided for informational purposes; the authors make no warranty as to its accuracy or completeness.