Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
290 changes: 290 additions & 0 deletions agent-workspace/domain-skills/ke-com/scraping.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,290 @@
# 贝壳找房 (ke.com) — Data Extraction

Field-tested against ke.com on 2026-04-28.
No authentication required for listing search.
All listing page requests work via `http_get` without a browser.

---

## TL;DR

贝壳找房 (`{city}.ke.com`) returns full HTML for the **first page** of any
search via `http_get`. Subsequent pages and individual property detail pages
are protected by CAPTCHA and require a browser session.

**What you can do with `http_get` (no browser):**
- Scrape the first 30 listings of any city-wide search (二手房 or 租房)
- Extract: title, URL, price, floor/year/layout/area/orientation, community
name, tags (满五年/地铁 etc.), follower count
- Switch cities by changing the subdomain (`bj`, `sh`, `gz`, `sz`, etc.)

**What requires a browser (`goto` + `js`):**
- District/neighborhood filtering (e.g. `/ershoufang/chaoyangqu/`) — redirects
to login page via `http_get`
- Page 2 and beyond (302 redirect → CAPTCHA on direct `http_get`)
- Individual property detail pages (CAPTCHA page returned)

---

## Approach 1: 二手房 Listing Search

`GET https://{city}.ke.com/ershoufang/`

Returns the first 30 listings for a city. District/neighborhood filtering
requires a logged-in browser session — `http_get` on `/ershoufang/{district}/`
redirects to the login page.

```python
from helpers import http_get
import re

# City subdomain codes (confirmed working):
# bj=北京, sh=上海, gz=广州, sz=深圳, cd=成都, wh=武汉, hz=杭州, nj=南京

def ke_search_ershoufang(city="bj"):
"""Search 二手房 (resale housing) listings on ke.com.

Args:
city: City subdomain, e.g. 'bj' (北京), 'sh' (上海), 'gz' (广州),
'sz' (深圳), 'cd' (成都), 'hz' (杭州), 'wh' (武汉), 'nj' (南京)

Returns up to 30 listings from the first page. Each listing contains:
title, url, total_price, unit_price, info (floor/year/layout/area/orientation),
community, tags, followers.

Note: district/neighborhood filtering requires a logged-in browser session.
"""
url = f"https://{city}.ke.com/ershoufang/"
html = http_get(url)

ul = re.search(r'<ul class="sellListContent"[^>]*>(.*?)</ul>', html, re.DOTALL)
if not ul:
return []

listings = []
for li in re.findall(r'<li class="clear">(.*?)</li>', ul.group(1), re.DOTALL):
title = re.search(r'title="([^"]+)"', li)
href = re.search(r'href="(https://[a-z]+\.ke\.com/ershoufang/\d+\.html)"', li)
total = re.search(r'class="totalPrice[^"]*"[^>]*>.*?<span[^>]*>\s*(\d+)\s*</span>.*?<i>万</i>', li, re.DOTALL)
unit = re.search(r'<span>(\d[\d,]+元/平)</span>', li)
info_m = re.search(r'class="houseInfo"[^>]*>(.*?)</div>', li, re.DOTALL)
pos_m = re.search(r'class="positionInfo"[^>]*>(.*?)</div>', li, re.DOTALL)
tags = re.findall(r'class="(?:taxfree|five|subway|matching)[^"]*"[^>]*>\s*([^<]+?)\s*<', li)
follow = re.search(r'(\d+)人关注', li)

info = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', info_m.group(1))).strip() if info_m else None
pos = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', pos_m.group(1))).strip() if pos_m else None

if not title:
continue
listings.append({
"title": title.group(1),
"url": href.group(1) if href else None,
"total_price": f"{total.group(1)}万" if total else None, # e.g. "890万"
"unit_price": unit.group(1) if unit else None, # e.g. "61,720元/平"
"info": info, # e.g. "高楼层 (共24层) | 2002年 | 3室2厅 | 144.2平米 | 西南"
"community": pos, # e.g. "盛和家园"
"tags": [t.strip() for t in tags], # e.g. ["满五年", "地铁"]
"followers": int(follow.group(1)) if follow else None,
})
return listings

listings = ke_search_ershoufang(city="bj")
# [
# {
# "title": "盛和家园 满五年唯一 不临街 观景房 电梯刷卡进入",
# "url": "https://bj.ke.com/ershoufang/101133688738.html",
# "total_price": "890万",
# "unit_price": "61,720元/平",
# "info": "高楼层 (共24层) | 2002年 | 3室2厅 | 144.2平米 | 西南",
# "community": "盛和家园",
# "tags": ["满五年"],
# "followers": 221,
# },
# ... # up to 30 listings
# ]
```

---

## Approach 2: 租房 Listing Search

`GET https://{city}.ke.com/zufang/`

```python
from helpers import http_get
import re

def ke_search_zufang(city="bj"):
"""Search 租房 (rental) listings on ke.com.

Args:
city: City subdomain, e.g. 'bj', 'sh', 'gz'

Returns up to 30 rental listings from the first page.
Each listing's fields are extracted from its own HTML chunk keyed by
data-house_code, so price and description always correspond to the
correct listing with no index misalignment.
"""
html = http_get(f"https://{city}.ke.com/zufang/")

# Split on each listing boundary so that price/desc are extracted from
# the same chunk as href/title — avoids silent misalignment from
# independent global regex scans.
chunks = re.split(r'(?=data-house_code=")', html)
listings = []
for chunk in chunks:
if not chunk.startswith('data-house_code="'):
continue
href = re.search(r'href="(/zufang/[A-Z]{2}[\w]+\.html)"', chunk)
title = re.search(r'title="([^"]+)"', chunk)
# Price lives in: <span class="content__list--item-price"><em>6300</em> 元/月</span>
price = re.search(r'class="content__list--item-price"><em>(\d+)</em>', chunk)
desc = re.search(
r'class="content__list--item--des"[^>]*>(.*?)</p>',
chunk, re.DOTALL
)
if not href or not title:
continue

desc_clean = district = area_name = community = None
if desc:
desc_clean = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', '|', desc.group(1))).strip()
parts = [p for p in (x.strip() for x in desc_clean.split('|'))
if p and p not in ('/', '-')]
district = parts[0] if parts else None # e.g. "海淀区"
area_name = parts[1] if len(parts) > 1 else None # e.g. "马甸"
community = parts[2] if len(parts) > 2 else None # e.g. "月季园"

listings.append({
"title": title.group(1).strip(),
"url": f"https://{city}.ke.com{href.group(1)}",
"price": f"{price.group(1)}元/月" if price else None,
"district": district,
"area_name": area_name,
"community": community,
"desc": desc_clean,
})
return listings

rentals = ke_search_zufang(city="bj")
# [
# {
# "title": "整租·月季园 2室1厅 南/北",
# "url": "https://bj.ke.com/zufang/BJ2143777429737439232.html",
# "price": "6300元/月",
# "district": "海淀区",
# "area_name": "马甸",
# "community": "月季园",
# "desc": "海淀区 | 马甸 | 月季园 / 57.41㎡ / 南 北 / 2室1厅1卫 / 中楼层 (21层)",
# },
# ...
# ]
```

---

## Approach 3: Property Detail Page (Browser Required)

Individual property pages (`/ershoufang/{id}.html`, `/zufang/{id}.html`) return
a CAPTCHA page via `http_get`. Use the browser to load them.

```python
from helpers import goto, wait_for_load, wait, js

def ke_get_detail(property_url):
"""Fetch full property details from a ke.com listing page.

Requires browser. property_url comes from ke_search_ershoufang() or
ke_search_zufang(). CSS selectors verified against ke.com detail pages.
"""
goto(property_url)
wait_for_load()
wait(2)

return js("""
({
title: document.querySelector('.mainInfo')?.innerText?.trim()
|| document.querySelector('h1.title')?.innerText?.trim(),
total_price: document.querySelector('.total')?.innerText?.trim(),
unit_price: document.querySelector('.unitPriceValue')?.innerText?.trim(),
area: document.querySelector('.area .mainInfo')?.innerText?.trim(),
layout: document.querySelector('.room .mainInfo')?.innerText?.trim(),
floor: document.querySelector('.floor .mainInfo')?.innerText?.trim(),
orientation: document.querySelector('.toward .mainInfo')?.innerText?.trim(),
decoration: document.querySelector('.decoration .mainInfo')?.innerText?.trim(),
community: document.querySelector('.communityName .info')?.innerText?.trim(),
district: document.querySelector('.areaName .info')?.innerText?.trim(),
description: document.querySelector('.seller-desc')?.innerText?.trim(),
tags: Array.from(document.querySelectorAll('.tag-list .content'))
.map(e => e.innerText.trim()).filter(Boolean),
})
""")
```

---

## URL Reference

### City Subdomains

| 城市 | 子域名 | 二手房 URL |
|------|--------|-----------|
| 北京 | `bj` | `https://bj.ke.com/ershoufang/` |
| 上海 | `sh` | `https://sh.ke.com/ershoufang/` |
| 广州 | `gz` | `https://gz.ke.com/ershoufang/` |
| 深圳 | `sz` | `https://sz.ke.com/ershoufang/` |
| 成都 | `cd` | `https://cd.ke.com/ershoufang/` |
| 杭州 | `hz` | `https://hz.ke.com/ershoufang/` |
| 武汉 | `wh` | `https://wh.ke.com/ershoufang/` |
| 南京 | `nj` | `https://nj.ke.com/ershoufang/` |

### URL Pattern

```
# 二手房 (resale) — city-wide first page only via http_get
https://{city}.ke.com/ershoufang/

# 租房 (rental) — city-wide first page only via http_get
https://{city}.ke.com/zufang/

# 新房 (new development)
https://{city}.ke.com/loupan/

# District filter (browser required — http_get redirects to login)
https://{city}.ke.com/ershoufang/{district}/

# Pagination (browser required — http_get returns CAPTCHA)
https://{city}.ke.com/ershoufang/pg{n}/
https://{city}.ke.com/ershoufang/{district}/pg{n}/

# Property detail (browser required — http_get returns CAPTCHA)
https://{city}.ke.com/ershoufang/{house_id}.html
https://{city}.ke.com/zufang/{house_id}.html
```

- **Common Beijing District Slugs** (browser required for district filtering):
`chaoyangqu` 朝阳 / `haidianqu` 海淀 / `xichengqu` 西城 / `dongchengqu` 东城 /
`fengtaiqu` 丰台 / `shijingshanqu` 石景山 / `changpingqu` 昌平 / `tongzhouqu` 通州

---

## Gotchas

- **Only city-wide first page is accessible via `http_get`** — district filters
(e.g. `/ershoufang/chaoyangqu/`) redirect to a login page. Page 2+ redirects
to a CAPTCHA page. Both require a logged-in browser session.
- **Detail pages always return CAPTCHA via `http_get`** — even for the very
first request. Always use `goto()` + `js()` for detail pages.
- **Rental pages (`/zufang/`) follow a 302 on first hit** — `http_get` follows
redirects automatically; the final response (141KB+) contains full listing HTML.
- **City subdomains differ from link-on-ke.com** — `bj.ke.com` works;
`www.ke.com` redirects to the homepage without listing data.
- **`info` field format for 二手房**: pipe-separated string, e.g.
`"高楼层 (共24层) | 2002年 | 3室2厅 | 144.2平米 | 西南"`. Split on `|` and
strip whitespace to extract individual attributes.
- **Rental `desc` field format**: pipe/slash-separated, e.g.
`"海淀区 | 马甸 | 月季园 / 57.41㎡ / 南 北 / 2室1厅1卫 / 中楼层"`.
- **30 listings per page** — confirmed for both 二手房 and 租房.
- **`total_price` is in 万元** (10,000 RMB), e.g. `"890万"` = 8,900,000 RMB.
`unit_price` is 元/平方米, e.g. `"61,720元/平"`.