General
Tutorial: loading and using the Suicide Risk Lexicon with l.load_lexicon(name).
load/save all vs. most prototypical
Tutorial: add an example of loading stored embeddings with cts.measure(documents, stored_embeddings_path = PATH).
Add docstrings for everything.
Add example of how much a result costs for a definition (add date and model).
Load the generative AI model from the local huggingface cache instead of calling the huggingface API.
Tutorial
Save the other outputs of lexicon count and CTS.
Put CTS first.
Sort lexicon by similarity for validation (see code in lexicon repo)
Show how to load embeddings pickle to save time.
Add ipywidgets, tqdm, and jupyter to the toml so the progress bar is displayed.
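For the item on loading an embeddings pickle to save time, the idea can be sketched roughly as below. This is only an illustration: `load_or_compute_embeddings`, `embed_fn`, and `cache_path` are hypothetical names, not part of the package, and the pickle is assumed to hold a plain {token: vector} dict.

```python
import pickle
from pathlib import Path


def load_or_compute_embeddings(tokens, embed_fn, cache_path):
    """Return {token: vector}, reusing vectors already stored in a pickle.

    embed_fn is a hypothetical callable that takes a list of tokens and
    returns one vector per token, in the same order.
    """
    cache_path = Path(cache_path)
    cache = {}
    if cache_path.exists():
        with cache_path.open("rb") as f:
            cache = pickle.load(f)

    # Only embed tokens that are not in the cache yet.
    missing = [t for t in tokens if t not in cache]
    if missing:
        for token, vector in zip(missing, embed_fn(missing)):
            cache[token] = vector
        with cache_path.open("wb") as f:
            pickle.dump(cache, f)

    return {t: cache[t] for t in tokens}
```

On a second call with the same path, only tokens that were never embedded before are sent to `embed_fn`, which is where the time saving would come from.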
Lexicon
lexicon.extract should output columns called document_n and document_str.
Make l.extract() a method instead of calling lexicon.extract(l.constructs).
Obtain lexicon_dict = l.to_dict() --> {construct: [tokens]} from the lexicon object.
lexicon.add: clean up how metadata is stored automatically and manually. Maybe create a brief input() dialogue so it saves user, timestamp, source, etc.
Create a lexicon from seance. Modify the seance tutorial accordingly.
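A minimal sketch of what the l.to_dict() and l.extract() items above might look like. `MiniLexicon` is a hypothetical stand-in, not the package's class; it assumes constructs are stored as {construct: {"tokens": [...], ...metadata}} and that matching is a case-insensitive whole-word count.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MiniLexicon:
    """Hypothetical stand-in for the lexicon object (assumed layout)."""

    # Assumed storage: {construct: {"tokens": [...], plus any metadata}}
    constructs: dict = field(default_factory=dict)

    def to_dict(self):
        """Return {construct: [tokens]}, dropping metadata."""
        return {name: list(info["tokens"]) for name, info in self.constructs.items()}

    def extract(self, documents):
        """Return one row per document with document_n, document_str, and counts."""
        rows = []
        for n, doc in enumerate(documents):
            row = {"document_n": n, "document_str": doc}
            lowered = doc.lower()
            for name, tokens in self.to_dict().items():
                # Count whole-word occurrences of every token in the construct.
                row[name] = sum(
                    len(re.findall(r"\b" + re.escape(t.lower()) + r"\b", lowered))
                    for t in tokens
                )
            rows.append(row)
        return rows
```

The rows are plain dicts, so they could be passed to a DataFrame constructor to get the document_n and document_str columns alongside one column per construct.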
Outputs/visualization
Clean up output for matches_per_construct, matches_counter_d and matches_per_doc?
I created a highlight function, but I also had other code (in the lexicon repo) for looking at context by printing. Add it to the tutorial and scripts.
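As an illustration of the kind of highlighting mentioned above, here is a rough sketch that wraps matched tokens in HTML mark tags. It is not the existing highlight function; `highlight_matches` is a hypothetical name, and whole-word, case-insensitive matching is an assumption.

```python
import html
import re


def highlight_matches(text, tokens):
    """Return HTML-escaped text with lexicon matches wrapped in <mark> tags."""
    if not tokens:
        return html.escape(text)

    # Longest tokens first so multi-word tokens win over their substrings.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    parts = []
    last = 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))  # text before the match
        parts.append("<mark>" + html.escape(m.group(0)) + "</mark>")
        last = m.end()
    parts.append(html.escape(text[last:]))  # remaining text after the last match
    return "".join(parts)
```

The returned string can be written to an .html file or rendered in a notebook to inspect matches in context.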
CTS:
🔴 Instead of saving cosine_similarities (2GB for 5000 CTL chats, compressed), provide the tuple (lexicon token, document token, similarity) in the DF, and output a visualization of those tuples as HTML.
exact match within a string == 1
threshold = 0.4 (depends on embedding) for final score. Remember CTS for the bursting study, where values (without model) were too high for some features.
Add additional arguments for CTS
# TODO: double check values for temperature from paper
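The CTS items above (returning a tuple instead of the full similarity matrix, exact match == 1, and a threshold) could be combined along these lines. This is a sketch under assumptions, not the CTS implementation: `top_match` is a hypothetical name, "exact match within a string" is read as the lexicon token appearing as a substring of the document token, and scores below the threshold are treated as no match.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_match(lexicon_vectors, document_vectors, threshold=0.4):
    """Return (lexicon token, document token, similarity) for the best pair.

    Both arguments are {token: vector} dicts. Returns None when the best
    similarity is below the threshold (an assumed reading of the note above).
    """
    best = None
    for lex_token, lex_vec in lexicon_vectors.items():
        for doc_token, doc_vec in document_vectors.items():
            if lex_token.lower() in doc_token.lower():
                sim = 1.0  # exact match within the string counts as 1
            else:
                sim = cosine(lex_vec, doc_vec)
            if best is None or sim > best[2]:
                best = (lex_token, doc_token, sim)
    if best is None or best[2] < threshold:
        return None
    return best
```

Storing only this tuple per construct and document keeps the output small compared with saving every pairwise similarity.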
Outputs/visualization
CTS: Plot matches in text
CTS: show top token and cosine similarity in column of features DF
Extensions
Implement in R or have an R wrapper
Create a website where csv file can return features.
🔴 High Priority
🟡 Medium Priority
🟢 Low Priority