Feature/pipeline database crawl #2

Open
HanlinKong wants to merge 22 commits into main from feature/pipeline-database-crawl

@HanlinKong

Collaborator

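Summary

This PR improves the advisor database crawling pipeline. The focus is on adapting to the college and advisor pages of Zhejiang University and Fudan University so that advisor lists and advisor details are crawled more completely and more reliably.

Main changes:

  • Add manually curated college seed list logic for priority schools.
  • Add Zhejiang University page adapters that support asynchronously loaded sections, personal homepages, and the long-form bio pages of the cybersecurity school (网安学院).
  • Add Fudan University page adapters, including the new entry point for the School of Computing and Intelligence Innovation (计算与智能创新学院) and the aggregated entry point for the School of Intelligent Robotics and Advanced Manufacturing Innovation (智能机器人与先进制造创新学院).
  • Prevent a Stage 3 re-run of the advisor list from overwriting detailed bios that the LLM already extracted in Stage 4.
  • Improve list-page filtering so that the current page, #, JavaScript links, and retired-faculty sections are no longer misidentified as advisor homepages.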
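
The list-page filtering described in the last bullet could look roughly like the sketch below. It is illustrative only and is not taken from advisor_crawler_service.py: the function name `is_candidate_advisor_link` and the `RETIRED_MARKERS` keyword list are assumptions made for this example.

```python
from urllib.parse import urldefrag, urljoin

# Assumed keyword list; the PR only states that retired-faculty sections are skipped.
RETIRED_MARKERS = ("退休", "离退休", "荣休")


def is_candidate_advisor_link(page_url: str, href: str, anchor_text: str = "") -> bool:
    """Return False for links that cannot be an advisor homepage."""
    href = (href or "").strip()
    # Empty hrefs and pure fragments ("#", "#top") stay on the current page.
    if not href or href.startswith("#"):
        return False
    # javascript: pseudo-links carry no navigable target.
    if href.lower().startswith("javascript:"):
        return False
    # Resolve relative links, then drop fragments before comparing with the page itself.
    target, _ = urldefrag(urljoin(page_url, href))
    current, _ = urldefrag(page_url)
    if target.rstrip("/") == current.rstrip("/"):
        return False
    # Skip retired-faculty sections, judged by the anchor text.
    if any(marker in anchor_text for marker in RETIRED_MARKERS):
        return False
    return True
```

A caller would apply this check to every anchor on a faculty list page before treating the link as a possible advisor homepage.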

Scope of impact

  • pipeline
  • backend/app/services/advisor_crawler_service.py
  • pipeline/data/advisor_college_seeds.json
  • backend/data/impacthub.db

Verification

Commands run:

python3 -m compileall backend/app pipeline
git diff --check

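In addition, real crawls were run against Zhejiang University and Fudan University to verify the changes, covering:

  • Zhejiang University asynchronous pages and advisor detail pages
  • Zhejiang University cybersecurity school (网安学院) long-form bios
  • Fudan University aggregated advisor crawl for the School of Intelligent Robotics and Advanced Manufacturing Innovation (智能机器人与先进制造创新学院)
  • Fudan University advisor list sync for the School of Computing and Intelligence Innovation (计算与智能创新学院)
  • Stage 4 LLM detail extraction for a subset of Zhejiang University advisors
  • Fudan University advisor detail pages: the logic is written, but the pages have not been crawled yet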

LHL3341 and others added 6 commits May 28, 2026 12:56
SJTU CS/AI scope is covered by 5 ex-SEIEE schools whose Webplus CMS hides
faculty behind a POST /active/ajax_teacher_list.html JSON endpoint instead
of serving anchors statically, plus the AI school (soai.sjtu.edu.cn) which
ships faculty as static <a> but lives at a different URL pattern.

- pipeline/data/advisor_college_seeds.json: add 上海交通大学 with
  5 colleges (计算机, 人工智能, 集成电路, 自动化与感知, 电气工程) and
  manually-confirmed faculty_list_url entries.
- backend/app/services/advisor_crawler_service.py:
  * _extract_sjtu_via_ajax: paginated POST to /active/ajax_teacher_list.html,
    parses cat_id/cat_code from inline JS, harvests <a href=".../<cat>/<slug>
    .html"> teacher links across pages until empty.
  * _extract_soai_static_advisors: matches /facultydetails/ anchors that
    the generic heuristic over-filters.
  * Dispatch in _crawl_one_college_advisors merges both with the generic
    result; SJTU CMS hosts additionally drop generic noise that looks like
    /index_links_* admin redirects (工会, 资实处, 校友会, etc.).
- frontend: add @tanstack/react-query dependency required by the new
  AdvisorDetailPage / RecommendationPage on this branch.
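
The "paginate until empty" loop behind _extract_sjtu_via_ajax can be sketched as follows. This is a simplified illustration, not the code in this commit: the HTTP POST is abstracted into a hypothetical `fetch_page` callable, the response is assumed to be an HTML fragment, and the `max_pages` cap is an assumed safety limit.

```python
import re
from typing import Callable, List

# Assumed shape of a teacher link: any <a> whose href ends in .html.
TEACHER_LINK_RE = re.compile(r'<a[^>]+href="([^"]+\.html)"', re.IGNORECASE)


def harvest_paginated_links(fetch_page: Callable[[int], str], max_pages: int = 200) -> List[str]:
    """Collect teacher links page by page until a page adds nothing new."""
    seen = set()
    ordered: List[str] = []
    for page_no in range(1, max_pages + 1):
        html = fetch_page(page_no) or ""
        added = 0
        for link in TEACHER_LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                added += 1
        # Stop on an empty page, or on a page that only repeats earlier links.
        if added == 0:
            break
    return ordered
```

Stopping when a page contributes no new links, rather than only when it is literally empty, also covers servers that keep returning the last page for out-of-range page numbers.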

Result: 5 SJTU colleges 957 stubs in DB, bio coverage 937/957 (98%).
Fudan reached 495/495 (100%) by simply running stage 4 with --school 复旦.

Note: DB change (88MB → 131MB) intentionally excluded — file now exceeds
GitHub's 100MB limit. Keep the local impacthub.db for queries; teammates
who need the crawled rows can regenerate by re-running stages 2-4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tsinghua's Webplus skin renders teacher cards as <a title="姓名" href>,
with the name only in the title attribute, so the existing heuristic that
scans <a> inner text finds nothing. Some THU colleges further split faculty
across institute / rank sub-pages, with no anchor text on the main page
leading to them.

- _extract_thu_via_title_anchor: matches <a title="..."> teacher cards,
  guarded by a 15+ candidate threshold to ignore pages whose only title
  anchors are nav (两院院士 / 国际交流 / 概述 / 数学 / …). Covers
  collegeai 人工智能 (83) and iiis 交叉信息 (74) cleanly.

- _llm_extract_advisors + extract_advisor_list LLM fallback: when heuristic
  and per-school adapters all return 0, ask gpt-5-mini to parse the cleaned
  HTML. Recovers sic 集成电路 (24 stubs) and all 16 ee 电子工程 sub-pages.

- _find_thu_faculty_sub_links: discovers sibling/child faculty pages by
  anchor-text rank (教授/副教授/讲师) or institute label (研究所/中心),
  constrained to share base path's first N-1 segments. Covers au 自动化系's
  8 research-institute partition (112 stubs) and ee's 7×3 institute×rank
  cross-references (78 stubs).

- pipeline/data/advisor_college_seeds.json: add 清华大学 with 8 CS/AI
  colleges (人工智能 / 软件 / 集成电路 / 自动化 / 电子工程 / 信息科学技术 /
  交叉信息 / 高等研究院); ee uses the dlyxtyjs/js2.htm hub page since the
  colleague's "在职教师" link only points to one institute's sub-page.
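
The title-anchor idea from the first bullet above can be sketched with the standard-library HTML parser. This is an illustration under assumptions, not the actual _extract_thu_via_title_anchor: the names `extract_title_anchor_cards` and `MIN_TITLE_ANCHORS` are made up for the example, and "15+" is read here as "at least 15".

```python
from html.parser import HTMLParser
from typing import List, Tuple

MIN_TITLE_ANCHORS = 15  # threshold stated in the commit message


class _TitleAnchorParser(HTMLParser):
    """Collect (title, href) pairs from <a title="..." href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.cards: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        title = (attr.get("title") or "").strip()
        href = (attr.get("href") or "").strip()
        # Keep only anchors that carry both a name and a target.
        if title and href:
            self.cards.append((title, href))


def extract_title_anchor_cards(html: str, min_candidates: int = MIN_TITLE_ANCHORS) -> List[Tuple[str, str]]:
    parser = _TitleAnchorParser()
    parser.feed(html)
    # Below the threshold the title anchors are most likely navigation, so return nothing.
    if len(parser.cards) < min_candidates:
        return []
    return parser.cards
```

The threshold trades recall for precision: a small faculty page with fewer than 15 cards would be skipped by this sketch and left to the other extractors.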

Result: THU CS/AI scope reached 632 stubs, 568 bio (90%). 3 already-done
colleges from earlier crawls (CS 127, network/space 54, statistics 19)
unchanged. ee bio at 25/78 — the rest are web.ee.tsinghua.edu.cn 503
rate-limit failures, retryable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stage 4 bio coverage for Tsinghua's Department of Electronic Engineering
(清华电子工程系) is stuck at 25/78: the web.ee.tsinghua.edu.cn subdomain
(where 53 of the 78 teachers' personal pages live) has been returning
HTTP 503 across the board since 2026-05-30. Retried once with --school 清华
--college 电子工程系: 0 ok / 52 fail / 52 rate-skipped.

The crawler itself is correct; once the subdomain comes back, re-running
stage 4 should fill the remaining 53 bios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DB has grown past GitHub's 100MB inline blob limit (152MB now, after
adding Fudan/SJTU/THU crawled rows). The previous 3 commits skipped the
db change to keep pushes working; from this commit on it goes through
LFS so future stages-2/3/4 runs can be tracked normally.

Anyone cloning the branch should have git-lfs installed
(`git lfs install`) so the db file materializes on checkout instead of
landing as a 134-byte pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
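
For anyone unsure whether their checkout contains the real database or only an LFS pointer, a small check like the following may help. It is a sketch that is not part of this PR: the helper name `is_lfs_pointer` is hypothetical. It relies on the Git LFS pointer format, in which a pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
MAX_POINTER_BYTES = 1024  # LFS pointer files are required to be smaller than this


def is_lfs_pointer(path) -> bool:
    """Return True if the file looks like a Git LFS pointer rather than real content."""
    p = Path(path)
    # A real SQLite database is far larger than any pointer file.
    if p.stat().st_size > MAX_POINTER_BYTES:
        return False
    with p.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If `is_lfs_pointer("backend/data/impacthub.db")` returns True, running `git lfs install` followed by `git lfs pull` should replace the pointer with the actual database file.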