panthbs/SET-Scrape

SET-Scrape

Automated downloader of annual report filings (56-1 One Report / แบบแสดงรายการข้อมูลประจำปี) for all SET-listed companies on the Stock Exchange of Thailand.


Overview

SET-Scrape downloads all publicly available annual report ZIP packages for every company listed on the SET main board, organised by the SET's own industry > sector > symbol hierarchy.

Coverage: 703 stocks across 8 industries and 28 sectors, years BE 2564 onwards (CE 2021 onwards).

Download history is tracked in stats.csv. The downloaded report folders are not committed to this repository — only the scripts and metadata are versioned.


Architecture

The pipeline is split into two scripts:

scrape_set.py  (Python)
  1. Headless Chromium loads SET homepage once to obtain session cookies
  2. GET /api/set/index/list          -> industry / sector hierarchy
  3. GET /api/set/index/{ind}/composition -> stock list per industry
  4. Per symbol:
       GET /api/set/company/{sym}/report/one     (56-1 One Report, newer)
       GET /api/set/company/{sym}/report/form56  (Form 56-1, older)
  5. Writes all_entries.json  (incremental, safe to interrupt and resume)
          |
          | all_entries.json
          v
download_all.ps1  (PowerShell)
  - Reads all_entries.json
  - Skips already-downloaded entries (fully idempotent)
  - Downloads ZIP -> extracts -> moves to correct industry/sector path
  - Appends results to stats.csv
  - Logs failures to failed_entries.txt
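The "skip, extract, move" behaviour of the download step can be sketched in Python. This is only an illustration, not the logic of download_all.ps1: the function names `already_downloaded` and `extract_report` are hypothetical, and it assumes the SINGLE_ prefix is chosen when the ZIP holds exactly one file.

```python
import io
import zipfile
from pathlib import Path


def already_downloaded(symbol_dir: Path, symbol: str, be_year: int) -> bool:
    """True if either folder variant for this symbol/year already exists."""
    base = f"{symbol}_report_{be_year}"
    return (symbol_dir / base).exists() or (symbol_dir / f"SINGLE_{base}").exists()


def extract_report(zip_bytes: bytes, symbol_dir: Path, symbol: str, be_year: int) -> Path:
    """Extract a downloaded ZIP into {symbol_dir}/[SINGLE_]{symbol}_report_{be_year}/."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Count real files only; directory entries do not matter here.
        files = [m for m in zf.infolist() if not m.is_dir()]
        # Assumption: one file in the archive means a "single-file" report.
        prefix = "SINGLE_" if len(files) == 1 else ""
        target = symbol_dir / f"{prefix}{symbol}_report_{be_year}"
        target.mkdir(parents=True, exist_ok=True)
        zf.extractall(target)
    return target
```

A caller would check `already_downloaded(...)` before fetching the ZIP, so that a re-run performs no network requests for completed entries.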

download_fin_insur.ps1 is a legacy script with hardcoded URLs for the FINCIAL industry (BANK / FIN / INSUR). It is kept for reference but superseded by the two-script pipeline above.


Directory Structure

SET-Scrape/
├── scrape_set.py            <- Step 1: scrape ZIP URLs -> all_entries.json
├── download_all.ps1         <- Step 2: download ZIPs -> folder tree
├── download_fin_insur.ps1   <- Legacy: hardcoded FINCIAL URLs
├── stats.csv                <- Download log (Industry, Sector, Symbol, Year, Single)
├── all_entries.json         <- Generated; gitignored
├── failed_entries.txt       <- Failed downloads; gitignored
│
├── AGRO/
│   ├── AGRI/   (Agribusiness)
│   └── FOOD/   (Food and Beverage)
├── CONSUMP/
│   ├── FASHION/
│   ├── HELTH/
│   └── HOME/
├── FINCIAL/
│   ├── BANK/
│   ├── FIN/
│   └── INSUR/
├── INDUS/
│   ├── AUTO/
│   ├── ENERG/ (*)
│   ├── PAPER/
│   └── ...
├── PROPCON/
├── RESOURC/
│   ├── ENERG/
│   └── MINE/
├── SERVICE/
│   ├── COMM/
│   ├── HELTH/
│   ├── MEDIA/
│   ├── PROF/
│   ├── TOURISM/
│   └── TRANS/
└── TECH/
    ├── ETRON/
    ├── ICT/
    └── SEMCON/

Each leaf:  {SYMBOL}/[SINGLE_]{SYMBOL}_report_{BE_YEAR}/
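For scripts that walk the folder tree, a leaf folder name can be parsed back into its parts. The sketch below assumes the pattern shown above with a four-digit BE year; `parse_leaf` and `Leaf` are hypothetical names, not part of the repository.

```python
import re
from typing import NamedTuple, Optional

# [SINGLE_]{SYMBOL}_report_{BE_YEAR}, with a four-digit year assumed.
LEAF_RE = re.compile(r"^(?P<single>SINGLE_)?(?P<symbol>.+)_report_(?P<year>\d{4})$")


class Leaf(NamedTuple):
    symbol: str
    be_year: int
    single: bool


def parse_leaf(name: str) -> Optional[Leaf]:
    """Parse a leaf folder name; return None if it does not match the pattern."""
    m = LEAF_RE.match(name.rstrip("/"))
    if m is None:
        return None
    return Leaf(m["symbol"], int(m["year"]), m["single"] is not None)
```

The symbol group is deliberately permissive (`.+`) because ticker symbols may contain characters other than letters.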

Prerequisites

| Requirement | Notes |
| --- | --- |
| Python 3.10+ | |
| PowerShell 5.1+ | Windows built-in |
| requests | pip install requests |
| playwright | pip install playwright |
| Chromium browser | python -m playwright install chromium |

pip install requests playwright
python -m playwright install chromium

Quick Start

# Step 1 — Discover all ZIP URLs for all 703 SET stocks (~5-10 min)
python scrape_set.py

# Step 2 — Download everything (skips already-downloaded entries)
.\download_all.ps1

Both steps are safe to re-run. If the scraper is interrupted, re-run it with --resume to skip symbols that have already been scraped. The downloader needs no flag: it always skips folders that already exist.
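One way such a resume step could work is sketched below. The schema of all_entries.json is not documented here, so the sketch assumes a JSON list of objects that each carry a "symbol" key; all function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_entries(path: Path) -> list:
    """Return previously scraped entries, or an empty list on a first run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_entries(path: Path, entries: list) -> None:
    """Write to a temp file, then replace, so an interrupt never leaves half a file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, ensure_ascii=False, indent=2)
    os.replace(tmp, path)


def pending_symbols(all_symbols: list, entries: list) -> list:
    """Symbols that have no entry yet (assumes each entry has a 'symbol' key)."""
    done = {e["symbol"] for e in entries}
    return [s for s in all_symbols if s not in done]
```

Under this scheme a symbol with no reports at all would be scraped again on every resume, since it never produces an entry; tracking visited symbols separately would avoid that.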


Usage Reference

scrape_set.py

python scrape_set.py                         # full run, all 703 stocks
python scrape_set.py --symbol PTT,ADVANC     # test with specific symbols
python scrape_set.py --resume                # skip already-scraped symbols
python scrape_set.py --delay 0.5             # slow down requests (seconds)
python scrape_set.py --stocks stocks.csv     # manual stock list fallback
python scrape_set.py --output custom.json    # custom output path

download_all.ps1

.\download_all.ps1                               # full run
.\download_all.ps1 -TestSymbols "PTT,GFPT"      # test with specific symbols
.\download_all.ps1 -JsonFile custom.json         # custom input file
.\download_all.ps1 -Delay 1                      # sleep 1s between downloads
.\download_all.ps1 -Root "D:\MyData"             # custom output root

Naming Conventions

| Pattern | Meaning |
| --- | --- |
| {SYMBOL}_report_{YYYY}/ | Multi-file report folder (e.g. One Report + structure diagram) |
| SINGLE_{SYMBOL}_report_{YYYY}/ | Single-file report folder (file kept as-is) |

Years use the Thai Buddhist Era (BE). Subtract 543 for Common Era (CE). Example: BE 2567 = CE 2024.
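The year conversion is a fixed offset, which a small helper makes explicit. The function names are illustrative only.

```python
BE_OFFSET = 543  # Buddhist Era year minus Common Era year


def be_to_ce(be_year: int) -> int:
    """Convert a Thai Buddhist Era year to a Common Era year."""
    return be_year - BE_OFFSET


def ce_to_be(ce_year: int) -> int:
    """Convert a Common Era year to a Thai Buddhist Era year."""
    return ce_year + BE_OFFSET


print(be_to_ce(2567))  # 2024
```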

Report file types found inside the folders:

| Filename pattern | Type | Period |
| --- | --- | --- |
| ONE... / E1N... | 56-1 One Report (annual report + registration statement combined) | BE 2565+ |
| F...T21, F...T22 | Traditional Form 56-1 (annual registration statement only) | Before BE 2565 |
| STRUCTURE... | Shareholding structure diagram | Varies |
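A rough classifier for these patterns might look like the sketch below. The "..." parts of the patterns are not specified, so this reads them loosely: a name that starts with ONE, E1N or STRUCTURE, or that starts with F and whose stem ends in T21 or T22. That reading is an assumption, and real filenames may need different rules.

```python
from pathlib import PurePath


def classify_report_file(filename: str) -> str:
    """Best-effort guess of the report type from a filename (assumed patterns)."""
    stem = PurePath(filename).stem.upper()
    if stem.startswith(("ONE", "E1N")):
        return "56-1 One Report"
    if stem.startswith("STRUCTURE"):
        return "Shareholding structure diagram"
    # Assumption: "F...T21" means "starts with F, stem ends with T21".
    if stem.startswith("F") and stem.endswith(("T21", "T22")):
        return "Form 56-1"
    return "unknown"
```

Anything that matches none of the rules is reported as "unknown" rather than guessed.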

Data Sources

| Data | Endpoint |
| --- | --- |
| Industry / sector hierarchy | https://www.set.or.th/api/set/index/list?type=industry&market=set |
| Stocks per industry | https://www.set.or.th/api/set/index/{INDUSTRY}/composition |
| 56-1 One Report list | https://www.set.or.th/api/set/company/{SYMBOL}/report/one |
| Form 56-1 list | https://www.set.or.th/api/set/company/{SYMBOL}/report/form56 |
| ZIP files | https://weblink.set.or.th/dat/f56/*.zip |
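The endpoint templates above can be filled in with a few helpers. This sketch only builds URL strings and makes no requests; the function names are hypothetical, and percent-encoding the symbol (for tickers that contain characters such as "&") is an assumption about what the API expects.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.set.or.th/api/set"


def index_list_url() -> str:
    """URL for the industry / sector hierarchy."""
    return f"{BASE}/index/list?" + urlencode({"type": "industry", "market": "set"})


def composition_url(industry: str) -> str:
    """URL for the stock list of one industry, e.g. 'FINCIAL'."""
    return f"{BASE}/index/{quote(industry, safe='')}/composition"


def report_urls(symbol: str) -> dict:
    """URLs for both report lists of one symbol."""
    sym = quote(symbol, safe="")  # assumption: symbols are percent-encoded
    return {
        "one": f"{BASE}/company/{sym}/report/one",
        "form56": f"{BASE}/company/{sym}/report/form56",
    }
```

As the Architecture section notes, the real scraper first obtains session cookies through a headless browser, so requesting these URLs without those cookies may fail.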

Disclaimer

This project is for personal research and educational use only. Downloaded documents are subject to the SET Terms of Use. Do not use this tool for commercial redistribution of the downloaded documents. The SET data is provided for informational purposes; the authors make no warranty as to its accuracy or completeness.
