Feature/pipeline database crawl #2
Open
HanlinKong wants to merge 22 commits into
# Conflicts:
#   backend/app/main.py
#   backend/data/impacthub.db
#   frontend/src/App.tsx
#   frontend/src/lib/api.ts
SJTU CS/AI scope is covered by 5 ex-SEIEE schools whose Webplus CMS hides
faculty behind a POST /active/ajax_teacher_list.html JSON endpoint instead
of serving anchors statically, plus the AI school (soai.sjtu.edu.cn) which
ships faculty as static <a> but lives at a different URL pattern.
- pipeline/data/advisor_college_seeds.json: add 上海交通大学 with
5 colleges (计算机, 人工智能, 集成电路, 自动化与感知, 电气工程) and
manually-confirmed faculty_list_url entries.
- backend/app/services/advisor_crawler_service.py:
* _extract_sjtu_via_ajax: paginated POST to /active/ajax_teacher_list.html,
parses cat_id/cat_code from inline JS, harvests <a href=".../<cat>/<slug>
.html"> teacher links across pages until empty.
* _extract_soai_static_advisors: matches /facultydetails/ anchors that
the generic heuristic over-filters.
* Dispatch in _crawl_one_college_advisors merges both with the generic
result; SJTU CMS hosts additionally drop generic noise that looks like
/index_links_* admin redirects (工会, 资实处, 校友会, etc.).
- frontend: add @tanstack/react-query dependency required by the new
AdvisorDetailPage / RecommendationPage on this branch.
Result: 5 SJTU colleges, 957 stubs in DB, bio coverage 937/957 (98%).
Fudan reached 495/495 (100%) by simply running stage 4 with --school 复旦.
Note: DB change (88MB → 131MB) intentionally excluded — file now exceeds
GitHub's 100MB limit. Keep the local impacthub.db for queries; teammates
who need the crawled rows can regenerate by re-running stages 2-4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
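For readers who have not opened the diff, the paginated harvest described for _extract_sjtu_via_ajax can be sketched roughly as below. This is a minimal illustration, not the code in advisor_crawler_service.py: the regexes, the function names parse_category and harvest_teachers, and the fetch_page callable are all assumptions, and the real inline JS and POST payload may look different.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical patterns: the real inline JS and link markup may differ.
CAT_ID_RE = re.compile(r"cat_id\s*[:=]\s*['\"]?(\w+)")
CAT_CODE_RE = re.compile(r"cat_code\s*[:=]\s*['\"]?(\w+)")
TEACHER_LINK_RE = re.compile(r'<a[^>]+href="([^"]+\.html)"[^>]*>(.*?)</a>', re.S)


def parse_category(page_html: str) -> Optional[Tuple[str, str]]:
    """Pull cat_id / cat_code out of inline JS; None if either is missing."""
    id_match = CAT_ID_RE.search(page_html)
    code_match = CAT_CODE_RE.search(page_html)
    if not (id_match and code_match):
        return None
    return id_match.group(1), code_match.group(1)


def harvest_teachers(
    fetch_page: Callable[[str, str, int], str],
    cat_id: str,
    cat_code: str,
    max_pages: int = 200,
) -> Dict[str, str]:
    """Request pages 1, 2, ... until one contains no teacher links.

    fetch_page(cat_id, cat_code, page) is assumed to wrap the POST to
    /active/ajax_teacher_list.html and return the HTML fragment for that page.
    Returns a mapping of href -> teacher name, de-duplicated by href.
    """
    seen: Dict[str, str] = {}
    for page in range(1, max_pages + 1):
        fragment = fetch_page(cat_id, cat_code, page)
        links = TEACHER_LINK_RE.findall(fragment)
        if not links:
            break  # an empty page marks the end of the list
        for href, inner in links:
            name = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags
            if name and href not in seen:
                seen[href] = name
    return seen
```

Passing the fetcher in as a callable keeps the pagination logic testable offline; the max_pages cap is an added safety net against an endpoint that never returns an empty page.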
Tsinghua's Webplus skin renders teacher cards as <a title="姓名" href>
(name only in the title attr, so the colleague's heuristic, which scans <a>
inner text, finds nothing). Some THU colleges further split faculty across
institute / rank sub-pages with no anchor text leading from the main page.
- backend/app/services/advisor_crawler_service.py:
* _extract_thu_via_title_anchor: matches <a title="..."> teacher cards,
guarded by a 15+ candidate threshold to ignore pages whose only title
anchors are nav (两院院士 / 国际交流 / 概述 / 数学 / …). Covers collegeai
人工智能 (83) and iiis 交叉信息 (74) cleanly.
* _llm_extract_advisors + extract_advisor_list LLM fallback: when the
heuristic and the per-school adapters all return 0, ask gpt-5-mini to parse
the cleaned HTML. Recovers sic 集成电路 (24 stubs) and all 16 ee 电子工程
sub-pages.
* _find_thu_faculty_sub_links: discovers sibling/child faculty pages by
anchor-text rank (教授/副教授/讲师) or institute label (研究所/中心),
constrained to share the base path's first N-1 segments. Covers au 自动化系's
8 research-institute partition (112 stubs) and ee's 7×3 institute×rank
cross-references (78 stubs).
- pipeline/data/advisor_college_seeds.json: add 清华大学 with 8 CS/AI
colleges (人工智能 / 软件 / 集成电路 / 自动化 / 电子工程 / 信息科学技术 /
交叉信息 / 高等研究院); ee uses the dlyxtyjs/js2.htm hub page since the
colleague's "在职教师" link only points to one institute's sub-page.
Result: THU CS/AI scope reached 632 stubs, 568 bio (90%). 3 already-done
colleges from earlier crawls (CS 127, network/space 54, statistics 19)
unchanged. ee bio at 25/78; the rest are web.ee.tsinghua.edu.cn 503
rate-limit failures, retryable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
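As a rough illustration of two of the ideas in this commit, the sketch below shows a title-attribute extractor with the 15-candidate guard and a path-prefix check for sub-page links. It is not the code from the PR: the names extract_title_anchor_cards and shares_parent_path are hypothetical, and reading "first N-1 segments" as "everything except the last path segment, on the same host" is an assumption.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urlsplit

MIN_TITLE_ANCHORS = 15  # threshold quoted in the commit message


class _TitleAnchorParser(HTMLParser):
    """Collects (title, href) pairs from <a title="..." href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.cards: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        title = (attr.get("title") or "").strip()
        href = (attr.get("href") or "").strip()
        # Skip in-page anchors and JavaScript pseudo-links.
        if title and href and not href.startswith(("#", "javascript:")):
            self.cards.append((title, href))


def extract_title_anchor_cards(
    page_html: str, min_candidates: int = MIN_TITLE_ANCHORS
) -> List[Tuple[str, str]]:
    """Return title-anchor cards, or [] when there are too few to trust."""
    parser = _TitleAnchorParser()
    parser.feed(page_html)
    unique = list(dict.fromkeys(parser.cards))  # de-duplicate, keep order
    return unique if len(unique) >= min_candidates else []


def shares_parent_path(base_url: str, candidate_url: str) -> bool:
    """True if candidate is on the same host and under base's parent path."""
    base, cand = urlsplit(base_url), urlsplit(candidate_url)
    if base.netloc != cand.netloc:
        return False
    base_segments = [s for s in base.path.split("/") if s]
    cand_segments = [s for s in cand.path.split("/") if s]
    parent = base_segments[:-1]  # all but the last segment of the base path
    return cand_segments[: len(parent)] == parent
```

The threshold trades recall for precision: a page with only a handful of title anchors is treated as navigation and yields nothing, which leaves room for the later fallbacks to run.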
清华电子工程系 stage 4 bio coverage stuck at 25/78 — web.ee.tsinghua.edu.cn subdomain (where 53 of the 78 teachers' personal pages live) is returning HTTP 503 across the board since 2026-05-30. Retried with --school 清华 --college 电子工程系 once: 0 ok / 52 fail / 52 rate-skipped. The crawler itself is correct; once the subdomain comes back, re-running stage 4 should fill the remaining 53 bios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DB has grown past GitHub's 100MB inline blob limit (152MB now, after adding Fudan/SJTU/THU crawled rows). The previous 3 commits skipped the db change to keep pushes working; from this commit on it goes through LFS so future stages-2/3/4 runs can be tracked normally. Anyone cloning the branch should have git-lfs installed (`git lfs install`) so the db file materializes on checkout instead of landing as a 134-byte pointer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
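A quick way to tell whether a checkout produced the real database or only the LFS pointer is to look at the first bytes of the file. The helper below is a minimal sketch, not part of the PR; the function name classify_db_file is hypothetical. It relies on two known headers: Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`, and SQLite database files begin with the 16-byte string `SQLite format 3` followed by a NUL byte.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
SQLITE_MAGIC = b"SQLite format 3\x00"


def classify_db_file(path: str) -> str:
    """Return 'lfs-pointer', 'sqlite', or 'unknown' based on the file header."""
    with open(path, "rb") as handle:
        head = handle.read(64)  # both headers fit well within 64 bytes
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if head.startswith(SQLITE_MAGIC):
        return "sqlite"
    return "unknown"
```

If this returns 'lfs-pointer' for backend/data/impacthub.db, running `git lfs install` followed by `git lfs pull` in the clone should replace the pointer with the actual file.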
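Summary
This PR mainly improves the advisor-database crawling pipeline. The focus is on adapting to the college and advisor pages of Zhejiang University and Fudan University, so that crawling of advisor lists and advisor detail pages is more complete and more stable.
Main changes:
…`#` links, JavaScript links, and retired-faculty sections being misidentified as advisor homepages.
Scope of impact
- pipeline
- backend/app/services/advisor_crawler_service.py
- pipeline/data/advisor_college_seeds.json
- backend/data/impacthub.db
Verification
Commands run:
In addition, crawls of Zhejiang University and Fudan University were actually run as verification, including: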