From e5249eab8394548beb3edc1144e2f7e9e5672350 Mon Sep 17 00:00:00 2001 From: Eyal Gaon Date: Mon, 25 May 2026 21:12:36 +0300 Subject: [PATCH 1/2] docs(tiktok): document Creator Search Insights (CSI) scraping CSI (tiktok.com/csi) exposes trending search queries by category and is auth-walled (redirects to /login for anonymous users). Captures the private API endpoint (/api/search/inspired_query/recommended_queries), request body shape, category filter values including Food, response fields worth keeping (popularity_v2, trending_seq_v2, query_labels, textnet), pagination, and the js()-non-async-IIFE gotcha that breaks top-level await inside the captured fetch loop. Co-Authored-By: Claude Opus 4.7 (1M context) --- domain-skills/tiktok/csi-scraping.md | 166 +++++++++++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 domain-skills/tiktok/csi-scraping.md diff --git a/domain-skills/tiktok/csi-scraping.md b/domain-skills/tiktok/csi-scraping.md new file mode 100644 index 00000000..f8c576ab --- /dev/null +++ b/domain-skills/tiktok/csi-scraping.md @@ -0,0 +1,166 @@ +# TikTok Creator Search Insights (CSI) — scraping + +URL: `https://www.tiktok.com/csi?lang=en` + +CSI surfaces what TikTok users are actively searching for, broken down by +category. Auth-walled — `/csi` redirects to `/login` for anonymous users. + +## Auth + +Requires a logged-in TikTok web session. There's no anonymous fallback. If +the harness lands on `/login`, ask the user to log in once in the attached +Chrome and retry — cookies persist across runs. + +Detect the wall by reading `page_info()["url"]` after `wait_for_load()`: + +```python +new_tab("https://www.tiktok.com/csi?lang=en") +wait_for_load() +if "/login" in page_info()["url"]: + print("LOGIN_WALL") + sys.exit(0) +``` + +## Page structure + +Two top-level tabs: **Suggested** (default) and **Trending**. The +Suggested tab is what you want for "what's hot right now in category X". 
+ +Filter pills below the tabs: **All** (default), **Content gap**, +**Searches by followers**. "Content gap" surfaces queries with high +search demand and low video supply — the gold for trend hunters; the API +exposes this as `query_labels: ["content_gap"]`. + +**Filters dropdown** (top-right button, text "Filters"): opens a panel +with **Category (multi-select)** and **Language** sections. As of +2026-05 the categories were: + +- Fashion +- **Food** +- Gaming +- Tourism +- Science +- Sports + +Apply via the red **Apply** button at the bottom of the dropdown. + +## Private API + +The query/filter UI POSTs to: + +``` +POST https://www.tiktok.com/api/search/inspired_query/recommended_queries + ?WebIdLastTime=…&aid=1988&app_language=en&app_name=tiktok_web&… (device fingerprint params) +``` + +The query string is a device-fingerprint blob (`WebIdLastTime`, +`device_id`, `msToken`, `_signature`, etc.). **Do not try to reconstruct +it cold** — capture it from a real session via a `window.fetch` interceptor +installed before triggering the filter: + +```js +window.__cap = []; +const orig = window.fetch; +window.fetch = async function(...args) { + const url = typeof args[0] === "string" ? args[0] : args[0].url; + const r = await orig.apply(this, args); + if (url.indexOf("/api/search/inspired_query/recommended_queries") !== -1) { + window.__cap.push({url, body: args[1] && args[1].body}); + } + return r; +}; +``` + +Then click Filters → Food → Apply. The captured `url + body` is your +template for paginated re-calls. 
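A minimal sketch of turning the interceptor output into a reusable template. It assumes `window.__cap` was read back as a `JSON.stringify(window.__cap)` string and that each captured `body` is itself a JSON string; neither is confirmed by the capture code above. `pick_template` and `CSI_PATH` are hypothetical names. The query string is only inspected, never rebuilt.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Path fragment the interceptor filters on (taken from the doc).
CSI_PATH = "/api/search/inspired_query/recommended_queries"


def pick_template(captured_json: str) -> dict:
    """Return the most recent captured CSI call as url, body, param_names.

    Assumption: captured_json is the serialized window.__cap array.
    """
    entries = json.loads(captured_json)
    matches = [e for e in entries if CSI_PATH in e.get("url", "")]
    if not matches:
        raise LookupError("no CSI request captured; apply a filter first")
    latest = matches[-1]
    query = urlsplit(latest["url"]).query
    return {
        # Keep the captured URL verbatim; the fingerprint params are opaque.
        "url": latest["url"],
        # Assumption: the body is a JSON string; None if nothing was captured.
        "body": json.loads(latest["body"]) if latest.get("body") else None,
        # Parameter names only, useful for spotting a stale or partial capture.
        "param_names": sorted(parse_qs(query, keep_blank_values=True)),
    }
```

Taking the last match rather than the first means a repeated Filters → Apply click overrides an older, possibly stale capture.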
+ +### Request body shape + +```json +{ + "pagination": { + "offset": 0, + "limit": 21, + "order_by": ["_score", "search_cnt"] + }, + "tab": "all", + "accept_inspiration_types": ["ecom", "phototext", "search"], + "basic_info": {"session_refresh_index": 1}, + "inspiration_vertical_filter": { + "language_filters": ["en"], + "category_filters": ["Food"] + }, + "cli_session_id": "", + "creator_source": "csi_webapp" +} +``` + +Override `pagination.offset` for paging. `limit` of 50 is accepted (UI +default is 21). Increment `basic_info.session_refresh_index` on each call +so the server treats them as a continuous session. + +### Response shape + +Top-level: `{inspiration_list, has_more, cursor, status_code, status_msg, banners, cost_tracker, extra, has_follower_searched, session_refresh_index}`. + +`status_code: 0` is success. Anything else is a soft error — read +`status_msg` for the reason; do not treat as fatal HTTP error. + +Per-item fields worth keeping (rest are debug/scoring noise): + +| field | meaning | +| ---------------------- | -------------------------------------------------------------------- | +| `query_text` | the trending search string (the trend) | +| `query_id_str` | stable identifier | +| `popularity_v2` | latest search-volume estimate (UI shows this as "Search popularity") | +| `popularity` | rank-bucket integer (small — 2–90 range observed) | +| `trending_seq_v2` | ~7-point time series of `popularity_v2`. Growth = `seq[-1]/seq[0]` | +| `popularity_updated_at`| epoch seconds | +| `video_num` | videos already published for this query | +| `query_labels` | list — `["content_gap"]` flags low-supply / high-demand | +| `textnet` | hierarchical category path (layer1..layer4) | +| `inspiration_types` | `["search"]` for searches, also `"ecom"` / `"phototext"` | + +### Pagination + +`has_more: true` + `cursor: N` means there's more; pass `cursor` as the +next `offset`. The Food category has ~150–200 items total in en. 
Loop +until `has_more: false` or you get an empty `inspiration_list`. + +### Awaiting the response from `js()` + +`browser-harness`'s `js()` wraps your code in a **non-async** IIFE, so +top-level `await` inside the expression won't compile. Use a `.then()` +chain that resolves to a `JSON.stringify` payload — `js()` will await the +returned promise via `awaitPromise: true`: + +```python +raw = js( + "return fetch(url, {...}).then(r => r.json()).then(j => JSON.stringify(j));" +) +data = json.loads(raw) +``` + +For multi-page work, paginate from the Python side — one page per `js()` +call — instead of looping inside a single async block. + +## Gotchas + +- **Toggling Filters twice closes the dropdown.** If a previous run left + the panel open, clicking the Filters button collapses it; the Food + label click then misses. Detect dropdown state or always start from a + fresh tab (`new_tab(...)`). +- **Cookies tied to TikTok profile, not Chrome profile.** Re-running + after switching TikTok accounts in the same Chrome window will swap + the session. +- **`popularity_v2` ≠ `video_num`.** Don't conflate them. A high-popularity, + low-video term is a content-gap signal; that's exactly what `query_labels` + flags. +- **Numbers in UI look bigger than the data.** The UI shows trends as + `7.93M` etc. That's `popularity_v2` formatted, not a separate field. +- **Suggested vs Trending tab.** The default tab is `Suggested` and that + matches the `tab: "all"` field in the body. The Trending tab uses a + different `tab` value; not yet captured. +- **Query string is unstable.** The fingerprint params (`msToken`, + `_signature`, etc.) rotate; cookies do too. Always re-capture the + template each scrape rather than persisting a fixed URL. 
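The pagination and status rules from the Response shape and Pagination sections can be sketched as a Python-side loop. This is a sketch under assumptions: `fetch_page` is a hypothetical callable that performs one `js()` call (or one HTTP POST) for a given offset and returns the parsed response dict, and `max_pages` is an added safety cap that the doc does not mention.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int], dict], max_pages: int = 50) -> Iterator[dict]:
    """Yield items page by page until the server reports no more.

    fetch_page(offset) is hypothetical: it must return the parsed JSON
    response for that offset.
    """
    offset = 0
    for _ in range(max_pages):  # safety cap (assumption, not from the doc)
        page = fetch_page(offset)
        # Non-zero status_code is a soft error; status_msg carries the reason.
        if page.get("status_code") != 0:
            raise RuntimeError(f"CSI soft error: {page.get('status_msg')}")
        items = page.get("inspiration_list") or []
        if not items:
            return  # an empty list also ends the scrape
        yield from items
        if not page.get("has_more"):
            return
        # The returned cursor is the next offset.
        offset = page["cursor"]
```

Keeping the loop in Python matches the one-page-per-`js()`-call advice above; the caller decides how `fetch_page` bumps `session_refresh_index` between calls.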
From 5b8c9805886f63a87b0475953fb4ef197f2bea64 Mon Sep 17 00:00:00 2001 From: Eyal Gaon Date: Tue, 26 May 2026 08:19:49 +0300 Subject: [PATCH 2/2] docs(tiktok): document tt-target-idc cookie swap for regional targeting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CSI is geo-personalized to the logged-in account region, not the IP and not via any UI control. Swapping tt-target-idc (and dropping the bound tt-target-idc-sign) reroutes the API call to a different IDC and returns region-appropriate data — confirmed alisg→useast1a moves an Israeli account from "brunch tel aviv" results to "hot cheetos" results. Also notes the language_filters sweet spot — ["en"] balances coverage and quality, ["en-US"] sometimes returns 0 items, empty/missing surfaces ~40% non-target-region noise. Co-Authored-By: Claude Opus 4.7 (1M context) --- domain-skills/tiktok/csi-scraping.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/domain-skills/tiktok/csi-scraping.md b/domain-skills/tiktok/csi-scraping.md index f8c576ab..738955af 100644 --- a/domain-skills/tiktok/csi-scraping.md +++ b/domain-skills/tiktok/csi-scraping.md @@ -21,6 +21,33 @@ if "/login" in page_info()["url"]: sys.exit(0) ``` +## Regional targeting via `tt-target-idc` cookie + +CSI is geo-personalized — the feed reflects the **logged-in account's +region**, not your IP and not anything in the filter UI (there is no +audience/country selector). Israeli accounts default to `tt-target-idc=alisg` +(Alibaba Singapore IDC) and surface heavily Israel-flavored results +("brunch tel aviv", "ethiopian tik toks"). Forcing a different IDC works: + +1. Pull cookies via CDP — `cdp("Network.getCookies", urls=["https://www.tiktok.com/"])`. +2. Replace `tt-target-idc` and `store-idc` with the target IDC (e.g. + `useast1a` for US-East, `useast2a` for US-East-2). +3. 
**Drop `tt-target-idc-sign`** — it's an HMAC bound to the original IDC + value; TikTok will fall back to deriving routing from the unsigned + `tt-target-idc` cookie alone. +4. POST the API endpoint via Python `requests` with the rewritten cookie + header. No proxy needed — TikTok routes to the requested IDC and + returns region-appropriate data. + +Some personalization still bleeds through deeper offsets (~10–20% of +items can carry account-region keywords like a city name). The `tt-target-idc` +swap gets you most of the way; post-filter obvious geo markers in your +own code rather than trying to fight it further in the API call. + +`language_filters: ["en"]` in the body is the sweet spot — `["en-US"]` +sometimes returns 0 items (over-restrictive), and an empty/missing +`language_filters` field surfaces 40%+ non-English-region noise. + ## Page structure Two top-level tabs: **Suggested** (default) and **Trending**. The