Feature/pipeline database crawl by HanlinKong · Pull Request #2 · Shannon4Science/impacthub

HanlinKong · 2026-05-24T11:57:42Z

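Summary

This PR improves the advisor database crawl pipeline. The focus is on adapting to the college and advisor pages of Zhejiang University and Fudan University so that advisor lists and advisor details are crawled more completely and more reliably.

Main changes:

Add logic for a manually curated college seed list for priority schools.
Add Zhejiang University page adapters that support asynchronously loaded sections, personal homepages, and the long-bio pages of the cybersecurity school (网安学院).
Add Fudan University page adapters, including the new entry point for 计算与智能创新学院 and the aggregated entry point for 智能机器人与先进制造创新学院.
Prevent a Stage 3 re-run of the advisor list from overwriting detailed bios that Stage 4 has already extracted with the LLM.
Improve list-page filtering so that the current page, #, JavaScript links, and retired-faculty sections are no longer misidentified as advisor homepages.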
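
The list-page filtering rule in the last item can be sketched roughly as follows. This is a minimal illustration, not the code in advisor_crawler_service.py: the function name `is_candidate_advisor_link` and the `RETIRED_KEYWORDS` tuple are hypothetical, and the real filter may check more conditions.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical keyword list; the PR only says retired-faculty sections are skipped.
RETIRED_KEYWORDS = ("退休",)


def is_candidate_advisor_link(href: str, anchor_text: str, page_url: str) -> bool:
    """Return False for links that cannot be an advisor homepage."""
    href = (href or "").strip()
    # Empty hrefs and pure fragments ("#", "#top") point nowhere useful.
    if not href or href.startswith("#"):
        return False
    # JavaScript pseudo-links such as "javascript:void(0)".
    if href.lower().startswith("javascript:"):
        return False
    # Links that resolve back to the list page itself (ignoring fragments).
    absolute, _ = urldefrag(urljoin(page_url, href))
    current, _ = urldefrag(page_url)
    if absolute.rstrip("/") == current.rstrip("/"):
        return False
    # Retired-faculty sections, detected here by anchor text only.
    if any(keyword in (anchor_text or "") for keyword in RETIRED_KEYWORDS):
        return False
    return True
```

Resolving the href against the page URL before comparing means relative links such as list.htm#top are also recognized as the current page.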

Affected areas

pipeline
backend/app/services/advisor_crawler_service.py
pipeline/data/advisor_college_seeds.json
backend/data/impacthub.db

Verification

Commands run:

python3 -m compileall backend/app pipeline
git diff --check

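Real crawls were also run against Zhejiang University and Fudan University to verify the changes, including:

Zhejiang University asynchronous pages and advisor detail pages
Zhejiang University cybersecurity school (网安学院) long-bio pages
Fudan University 智能机器人与先进制造创新学院 aggregated advisor crawl
Fudan University 计算与智能创新学院 advisor list sync
Stage 4 LLM detail extraction for some Zhejiang University advisors
Fudan University advisor detail pages: the logic is written, but the pages have not been crawled yet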

gemini-code-assist · 2026-05-24T11:57:46Z

Important

Installation incomplete: to start using Gemini Code Assist, please ask the organization owner(s) to visit the Gemini Code Assist Admin Console and sign the Terms of Services.

SJTU CS/AI scope is covered by 5 ex-SEIEE schools whose Webplus CMS hides faculty behind a POST /active/ajax_teacher_list.html JSON endpoint instead of serving anchors statically, plus the AI school (soai.sjtu.edu.cn) which ships faculty as static <a> but lives at a different URL pattern.

- pipeline/data/advisor_college_seeds.json: add 上海交通大学 with 5 colleges (计算机, 人工智能, 集成电路, 自动化与感知, 电气工程) and manually-confirmed faculty_list_url entries.
- backend/app/services/advisor_crawler_service.py:
  * _extract_sjtu_via_ajax: paginated POST to /active/ajax_teacher_list.html, parses cat_id/cat_code from inline JS, harvests <a href=".../<cat>/<slug>.html"> teacher links across pages until empty.
  * _extract_soai_static_advisors: matches /facultydetails/ anchors that the generic heuristic over-filters.
  * Dispatch in _crawl_one_college_advisors merges both with the generic result; SJTU CMS hosts additionally drop generic noise that looks like /index_links_* admin redirects (工会, 资实处, 校友会, etc.).
- frontend: add @tanstack/react-query dependency required by the new AdvisorDetailPage / RecommendationPage on this branch.

Result: 5 SJTU colleges, 957 stubs in DB, bio coverage 937/957 (98%). Fudan reached 495/495 (100%) by simply running stage 4 with --school 复旦.

Note: DB change (88MB → 131MB) intentionally excluded because the file now exceeds GitHub's 100MB limit. Keep the local impacthub.db for queries; teammates who need the crawled rows can regenerate by re-running stages 2-4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
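
The "harvest teacher links across pages until empty" loop described for _extract_sjtu_via_ajax might look roughly like the sketch below. It is an illustration under assumptions, not the actual implementation: `harvest_paginated_links` and `fetch_page` are hypothetical names, `fetch_page` stands in for the POST to /active/ajax_teacher_list.html (the request parameters are not shown in this PR, so they are left out), and the link regex is a simplification.

```python
import re
from typing import Callable, Dict, List

# Simplified pattern for <a href="....html"> teacher links (an assumption).
TEACHER_LINK_RE = re.compile(r'<a[^>]+href="([^"]+\.html)"', re.IGNORECASE)


def harvest_paginated_links(
    fetch_page: Callable[[int], str], max_pages: int = 200
) -> List[str]:
    """Collect teacher links page by page until a page adds nothing new.

    fetch_page(n) is a caller-supplied function returning the HTML fragment
    for page n; in the real crawler it would wrap the AJAX POST.
    """
    seen: Dict[str, None] = {}  # dict keeps insertion order and deduplicates
    for page in range(1, max_pages + 1):
        links = TEACHER_LINK_RE.findall(fetch_page(page))
        new_links = [link for link in links if link not in seen]
        # Stop on an empty page, or on a page that only repeats earlier links
        # (some endpoints keep returning the last page instead of nothing).
        if not new_links:
            break
        for link in new_links:
            seen[link] = None
    return list(seen)
```

The max_pages cap is only a safety net for endpoints that never return an empty page.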

Tsinghua's Webplus skin renders teacher cards as <a title="姓名" href> (name only in the title attr, so the colleague's heuristic scanning <a> inner text finds nothing). Some THU colleges further split faculty across institute / rank sub-pages with no anchor-text leading from the main page.

- _extract_thu_via_title_anchor: matches <a title="..."> teacher cards, guarded by a 15+ candidate threshold to ignore pages whose only title anchors are nav (两院院士 / 国际交流 / 概述 / 数学 / …). Covers collegeai 人工智能 (83) and iiis 交叉信息 (74) cleanly.
- _llm_extract_advisors + extract_advisor_list LLM fallback: when heuristic and per-school adapters all return 0, ask gpt-5-mini to parse the cleaned HTML. Recovers sic 集成电路 (24 stubs) and all 16 ee 电子工程 sub-pages.
- _find_thu_faculty_sub_links: discovers sibling/child faculty pages by anchor-text rank (教授/副教授/讲师) or institute label (研究所/中心), constrained to share base path's first N-1 segments. Covers au 自动化系's 8 research-institute partition (112 stubs) and ee's 7×3 institute×rank cross-references (78 stubs).
- pipeline/data/advisor_college_seeds.json: add 清华大学 with 8 CS/AI colleges (人工智能 / 软件 / 集成电路 / 自动化 / 电子工程 / 信息科学技术 / 交叉信息 / 高等研究院); ee uses the dlyxtyjs/js2.htm hub page since the colleague's "在职教师" link only points to one institute's sub-page.

Result: THU CS/AI scope reached 632 stubs, 568 bio (90%). 3 already-done colleges from earlier crawls (CS 127, network/space 54, statistics 19) unchanged. ee bio at 25/78; the rest are web.ee.tsinghua.edu.cn 503 rate-limit failures, retryable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
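
The title-attribute extraction with its candidate threshold could be sketched as below. This is a hedged illustration built on the standard-library HTML parser, not the body of _extract_thu_via_title_anchor; the names `extract_title_anchor_cards` and `min_candidates` are hypothetical, and only the 15-candidate guard comes from the commit message.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _TitleAnchorParser(HTMLParser):
    """Collects (title, href) pairs from <a title="..." href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.cards: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        title = (attr.get("title") or "").strip()
        href = (attr.get("href") or "").strip()
        if title and href:
            self.cards.append((title, href))


def extract_title_anchor_cards(
    html: str, min_candidates: int = 15
) -> List[Tuple[str, str]]:
    """Return title-anchor cards, or [] when there are too few to trust.

    A page with only a handful of title anchors is most likely showing
    navigation entries, so it is treated as having no teacher cards.
    """
    parser = _TitleAnchorParser()
    parser.feed(html)
    if len(parser.cards) < min_candidates:
        return []
    return parser.cards
```

Returning an empty list below the threshold lets the caller fall through to the next strategy, such as the LLM fallback described in the same commit.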

Tsinghua 电子工程系 (Department of Electronic Engineering) stage 4 bio coverage is stuck at 25/78: the web.ee.tsinghua.edu.cn subdomain (where 53 of the 78 teachers' personal pages live) has been returning HTTP 503 across the board since 2026-05-30. Retried once with --school 清华 --college 电子工程系: 0 ok / 52 fail / 52 rate-skipped.

The crawler itself is correct; once the subdomain comes back, re-running stage 4 should fill the remaining 53 bios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DB has grown past GitHub's 100MB inline blob limit (152MB now, after adding Fudan/SJTU/THU crawled rows). The previous 3 commits skipped the db change to keep pushes working; from this commit on it goes through LFS so future stage 2/3/4 runs can be tracked normally.

Anyone cloning the branch should have git-lfs installed (`git lfs install`) so the db file materializes on checkout instead of landing as a 134-byte pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
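
A quick way to tell whether a checkout received the real database or only the LFS pointer is to look at the first bytes of the file. The sketch below is a hypothetical helper, not part of this PR; it relies on two known formats: a Git LFS pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, and a SQLite database starts with the bytes `SQLite format 3` followed by a NUL byte.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
SQLITE_MAGIC = b"SQLite format 3\x00"


def classify_db_header(head: bytes) -> str:
    """Classify a file by its leading bytes: 'lfs-pointer', 'sqlite', or 'unknown'."""
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if head.startswith(SQLITE_MAGIC):
        return "sqlite"
    return "unknown"


def classify_db_file(path: str) -> str:
    """Read just enough of the file to classify it (hypothetical helper)."""
    with Path(path).open("rb") as handle:
        return classify_db_header(handle.read(64))
```

If classify_db_file("backend/data/impacthub.db") reports "lfs-pointer", running `git lfs install` followed by `git lfs pull` should replace the pointer with the actual file.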

HanlinKong added 16 commits May 16, 2026 11:56

feat(advisor): migrate recruitment and recommendation features

dab851f

feat(pipeline): integrate xhs recruitment crawl

2af2da3

Merge branch 'main' into feature/migrate-advisor-features

a8f7813

# Conflicts:
#	backend/app/main.py
#	backend/data/impacthub.db
#	frontend/src/App.tsx
#	frontend/src/lib/api.ts

fix(db): register models during init

a5b7165

chore(db): remove acceptance crawl artifacts

f6e61df

fix: improve advisor crawl json parsing

2b7492c

feat: preserve richer advisor profile details

a4a09ab

fix: tune advisor bio extraction prompt

cb2abf4

feat: add zju advisor seed pipeline

59040a8

feat: adapt zju icsr advisor details

48dabc8

fix: preserve zju icsr long advisor bios

4070e1c

fix: run zju icsr advisor bios through llm

bd268ed

feat(pipeline): add fudan advisor college seeds

ccecb83

feat(pipeline): adapt fudan advisor crawling

13926f2

feat(pipeline): adapt fudan ciram advisor sources

e1b0efb

fix(pipeline): sync fudan ai advisor sources

ab8325b

LHL3341 and others added 6 commits May 28, 2026 12:56

feat(pipeline): merge advisor crawl datasets

5ff353a

feat(pipeline): adapt pku advisor crawl

9b43835

HanlinKong wants to merge 22 commits into main from feature/pipeline-database-crawl