benevolentbandwidth · yashika-arora · May 21, 2026
diff --git a/cbis_ddsm_dataset_card.md b/cbis_ddsm_dataset_card.md
@@ -0,0 +1,52 @@
+# Dataset Card: CBIS-DDSM
+
+## Overview
+
+CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is a curated subset of the Digital Database for Screening Mammography (DDSM). It was created to address limitations in the original DDSM by providing a standardized and more accessible collection of mammography images. Unlike the original DDSM, which contains both normal and abnormal cases, CBIS-DDSM focuses on abnormal cases (both benign and malignant findings) and provides pixel-level segmentation masks that outline the lesion regions of interest (ROIs). These masks make it particularly valuable for segmentation and lesion detection tasks. The images are distributed in DICOM format and have been converted to PNG for use in this project. It is one of the smaller datasets in our pipeline, but its high-quality ROI annotations are well suited to guiding the segmentation component of the model.
+
+---
+
+## Dataset Details
+
+| Field | Details |
+|---|---|
+| Access | [Kaggle](https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset) |
+| Number of Patients | 2,620 |
+| Number of Images | 10,239 |
+| Image Format | DICOM (converted to PNG for this project) |
+| Annotations | Pixel-level segmentation masks for lesion ROI |
+| Task Suitability | Segmentation, lesion detection |
+
+---
+
+## Role in Second Look Pipeline
+
+CBIS-DDSM serves as the primary source of high-quality ROI annotations in the Second Look pipeline. Its pixel-level segmentation masks allow the model to learn not just whether an image contains a suspicious finding, but precisely where within the image that finding is located. This is critical for the app's goal of highlighting regions of interest to the user rather than simply returning a binary classification. CBIS-DDSM is used specifically to guide the segmentation component of the model, complementing RSNA, which provides large-scale exam-level labels, and VinDr-Mammo, which provides bounding box annotations and BI-RADS scores. Because CBIS-DDSM is derived from the original DDSM, raw DDSM cannot serve as a validation or test source while CBIS-DDSM appears in training, or vice versa, because of the data leakage risk.
+
+---
+
+## Biases
+
+- **Abnormal-case emphasis**: CBIS-DDSM primarily contains abnormal mammograms (benign and malignant findings) with segmentation masks. Normal screening mammograms are not meaningfully represented in this subset, so models trained solely on CBIS-DDSM may not adequately learn the characteristics of truly normal screening exams.
+
+- **Skewed distribution in original DDSM**: The original DDSM exhibits a heavily imbalanced distribution with 695 normal findings and 1,784 abnormal cases. This imbalance carries over when combining CBIS-DDSM with the original DDSM and requires a sub-sampling strategy to achieve a realistic class distribution. Even with sub-sampling, there are not enough normal cases to fully balance the dataset, which can bias model predictions toward abnormal findings.
+
+---
+
+## Limitations
+
+- **Very limited normal cases**: Because normal mammograms are barely represented, this dataset cannot be used alone for binary cancer/no-cancer classification. CBIS-DDSM must be combined with a dataset that contains normal cases, such as RSNA, before a model trained on it can reliably distinguish healthy from unhealthy scans.
+
+- **Data leakage risk**: Since CBIS-DDSM is derived from the original DDSM, one of the two datasets cannot appear in training while the other appears in the validation or test split. Using both without careful separation would result in data leakage, where the model indirectly sees test data during training, artificially inflating performance metrics.
+
+- **Small dataset size**: At 10,239 images from 2,620 patients, CBIS-DDSM is significantly smaller than RSNA (54,710 images) or VinDr-Mammo (20,000 images). This limits its use as a standalone training set for deep learning models, which typically require large volumes of data to generalize well.
+
+- **Sub-sampling required**: A sub-sampling strategy is needed to achieve a realistic class distribution, and even with sub-sampling there are not enough normal cases. This makes it difficult to train models that reflect real-world screening populations where the vast majority of cases are normal.
+
+- **Over-specification risk**: There is a risk that algorithms trained heavily on CBIS-DDSM become over-specified to this dataset distribution and exhibit degraded performance when evaluated on external screening populations with substantially different class distributions.
+
+---
+
+## Ethical Considerations
+
+CBIS-DDSM contains de-identified patient mammography data collected for research purposes. As a medical imaging dataset, it must be used responsibly and solely for legitimate scientific research. The Second Look project uses this data exclusively for the purpose of developing a privacy-preserving screening tool. No attempt is made to re-identify any individual patient. All team members working with this data are expected to handle it with care and in accordance with applicable data use agreements.
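
Review note on the data leakage limitation in cbis_ddsm_dataset_card.md: one way to enforce the CBIS-DDSM / raw DDSM separation is to assign splits per patient rather than per image. The sketch below is only an illustration under stated assumptions, not the project's actual split code. It assumes every record carries a `patient_id` field that has already been normalized so the same person has the same ID in both sources (the card does not say how the IDs correspond), and the helper names `assign_split` and `split_records` are hypothetical.

```python
import hashlib


def assign_split(patient_id: str, val_fraction: float = 0.2) -> str:
    """Deterministically map a patient ID to 'train' or 'val'.

    Hashing the ID (instead of shuffling images) means every image of a
    patient, from any source dataset, always lands in the same split.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


def split_records(records, val_fraction: float = 0.2):
    """Group records into splits by patient; 'patient_id' is an assumed key."""
    splits = {"train": [], "val": []}
    for rec in records:
        splits[assign_split(rec["patient_id"], val_fraction)].append(rec)
    return splits


if __name__ == "__main__":
    # Illustrative records only; IDs and sources are made up.
    demo = [
        {"patient_id": "P_00001", "source": "cbis"},
        {"patient_id": "P_00001", "source": "ddsm"},
        {"patient_id": "P_00002", "source": "cbis"},
    ]
    for name, recs in split_records(demo).items():
        print(name, [(r["patient_id"], r["source"]) for r in recs])
```

Because the assignment depends only on the ID, re-running the split after adding new images does not move existing patients between splits.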
diff --git a/rsna_dataset_card.md b/rsna_dataset_card.md
@@ -0,0 +1,61 @@
+# Dataset Card: RSNA Screening Mammography
+
+## Overview
+
+The RSNA Screening Mammography Breast Cancer Detection dataset contains full-field digital mammography images from a large-scale screening population covering nearly 12,000 patients and over 54,000 individual mammogram images. The dataset provides exam-level cancer labels rather than lesion-level annotations, making it particularly well suited to large-scale classification and cancer risk learning tasks. Images were originally in DICOM format and have been converted to PNG for use in this project, resulting in a dataset of approximately 144 GB. Its scale makes it the primary dataset for large-scale screening classification within the Second Look pipeline.
+
+---
+
+## Dataset Details
+
+| Field | Details |
+|---|---|
+| Access | [Kaggle](https://www.kaggle.com/competitions/rsna-breast-cancer-detection) / [AWS Open Data](https://registry.opendata.aws/rsna-screening-mammography-breast-cancer-detection/) |
+| Number of Patients | 11,914 |
+| Number of Images | 54,710 |
+| Image Format | DICOM (converted to PNG, 144 GB) |
+| Annotations | Exam-level cancer label |
+| Task Suitability | Exam-level classification, cancer risk learning |
+
+---
+
+## Positive Case Breakdown
+
+| Category | Value |
+|---|---|
+| Right breast positive | 570 |
+| Left breast positive | 588 |
+| Positive in both breasts | 2 |
+| Total positive rate | 2.1% |
+
+---
+
+## Role in Second Look Pipeline
+
+RSNA is the primary dataset for large-scale exam-level cancer and risk learning in the Second Look pipeline. Its scale makes it particularly useful for training deep learning models on large screening populations. Because it reflects a real screening population where the vast majority of cases are normal, it provides the negative examples that CBIS-DDSM lacks. It complements CBIS-DDSM, which provides high-quality ROI segmentation masks, and VinDr-Mammo, which provides modern FFDM lesion-level supervision and BI-RADS scores.
+
+---
+
+## Biases
+
+- **Extreme class imbalance**: Only 2.1% of cases are positive, meaning the dataset is heavily skewed toward normal findings. This reflects the realistic distribution of a screening population but creates significant class imbalance challenges during model training.
+
+- **Screening population bias**: The dataset reflects a screening population rather than a diagnostic population. This means it skews heavily toward normal findings, which is realistic for deployment but requires careful handling during training to ensure the model does not underperform on the rare positive cases that matter most clinically.
+
+---
+
+## Limitations
+
+- **No lesion-level annotations**: Unlike CBIS-DDSM or VinDr-Mammo, RSNA does not provide bounding boxes or segmentation masks. It cannot be used alone for lesion localization or segmentation training. The model can learn whether cancer is present at the exam level but cannot learn where within the image to look.
+
+- **Large storage requirement**: At 144 GB after PNG conversion, the dataset often requires cloud-based storage and processing workflows.
+
+- **Exam-level labels only**: Labels are provided at the exam level, not the lesion level. This limits fine-grained analysis and means RSNA must be combined with datasets like CBIS-DDSM or VinDr-Mammo for any localization or segmentation task.
+
+- **Very low positive rate**: At only 2.1% positive cases, standard training without rebalancing will produce a model biased toward predicting negative, potentially missing the rare positive cases that are most clinically significant.
+
+---
+
+## Ethical Considerations
+
+RSNA contains de-identified patient mammography data collected for research purposes through the Radiological Society of North America. As a medical imaging dataset, it must be used responsibly and solely for legitimate scientific research. The Second Look project uses this data exclusively for the purpose of developing a privacy-preserving screening tool. No attempt is made to re-identify any individual patient. All team members working with this data are expected to handle it with care and in accordance with applicable data use agreements.
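
Review note on the class imbalance items in rsna_dataset_card.md: the card says rebalancing is needed at a 2.1% positive rate but does not say how. One common option is inverse-frequency class weighting in the loss. The sketch below uses the "balanced" heuristic (weight = 1 / (number of classes × class frequency)); whether the project uses weighting, resampling, or something else is not stated in the card, and the function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(positive_rate: float) -> dict:
    """Return per-class loss weights for a binary problem.

    Each class gets weight 1 / (2 * class_frequency), so both classes
    contribute the same total weight to the loss despite the imbalance.
    """
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return {
        0: 1.0 / (2.0 * (1.0 - positive_rate)),  # negative (normal) class
        1: 1.0 / (2.0 * positive_rate),          # positive (cancer) class
    }


if __name__ == "__main__":
    # 0.021 is the positive rate reported in the dataset card.
    weights = balanced_class_weights(0.021)
    print(f"negative weight: {weights[0]:.3f}")
    print(f"positive weight: {weights[1]:.3f}")
```

At a 2.1% positive rate this gives a weight of roughly 23.8 for positives and 0.51 for negatives, so a missed positive costs about 47 times as much as a missed negative.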
diff --git a/vindr_dataset_card.md b/vindr_dataset_card.md
@@ -0,0 +1,50 @@
+# Dataset Card: VinDr-Mammo
+
+## Overview
+
+VinDr-Mammo is a large-scale full-field digital mammography dataset gathered in Vietnam from 5,000 mammography exams. Each exam contains two mediolateral oblique (MLO) and two craniocaudal (CC) views, resulting in 20,000 images in total. Unlike RSNA, which provides exam-level labels only, VinDr-Mammo includes detailed radiologist annotations covering BI-RADS scores, reported breast density, and bounding box locations of non-benign findings along with their category and BI-RADS assessment. The dataset used a double-read protocol in which two radiologists independently reviewed each case, with disagreements resolved through arbitration. Images were originally in DICOM format and have been converted to PNG for use in this project, which substantially reduces storage requirements.
+
+---
+
+## Dataset Details
+
+| Field | Details |
+|---|---|
+| Access | [PhysioNet](https://physionet.org/content/vindr-mammo/1.0.0/) — requires data use agreement |
+| Number of Exams | 5,000 |
+| Number of Images | 20,000 (2 MLO + 2 CC views per exam) |
+| Image Format | DICOM (converted to PNG, approximately 62 GB, down from 338 GB) |
+| Annotations | BI-RADS score, breast density, bounding boxes, finding category, location |
+| Task Suitability | Lesion detection, localization, BI-RADS classification, breast density assessment |
+
+---
+
+## Role in Second Look Pipeline
+
+VinDr-Mammo provides modern FFDM breast-level and lesion-level supervision in the Second Look pipeline. Its bounding box annotations allow the model to learn where within the image suspicious findings are located, complementing the pixel-level masks from CBIS-DDSM. Its BI-RADS scores map directly to the project's three-tier concern classification system (Low, Moderate, and Elevated), making it the dataset most directly aligned with the app's output format. It complements RSNA, which provides large-scale exam-level cancer labels, and CBIS-DDSM, which provides high-quality ROI segmentation masks.
+
+---
+
+## Biases
+
+- **Geographic bias**: VinDr-Mammo was gathered exclusively in Vietnam from a specific patient population. This may limit generalizability to patients from other countries or ethnic backgrounds where breast tissue density and cancer presentation patterns differ from those observed in the Vietnamese screening population.
+
+- **Not biopsy confirmed**: The positive cases do not appear to be biopsy confirmed and thus may contain some false positives. This introduces uncertainty in the ground truth labels and means some cases labeled as positive findings may not represent true malignancies.
+
+---
+
+## Limitations
+
+- **False positives possible**: Since positive findings are based on radiologist assessment rather than pathology confirmation through biopsy, some labeled positive cases may be incorrect. This could introduce noise into model training and affect the reliability of the learned representations.
+
+- **Restricted access**: Access requires signing a PhysioNet data use agreement before the dataset can be used. This limits reproducibility for external researchers who have not completed this process and adds an administrative step before the data can be accessed.
+
+- **Large storage requirement**: The original unconverted dataset is 338 GB. Even after conversion to PNG the dataset remains large at approximately 62 GB, requiring cloud-based storage and processing workflows.
+
+- **Single-country screening population**: VinDr-Mammo was collected entirely in Vietnam under a specific screening protocol. Findings and radiologist conventions from this population may not fully generalize to screening populations in other countries or healthcare systems.
+
+---
+
+## Ethical Considerations
+
+VinDr-Mammo contains de-identified patient mammography data collected for research purposes and is hosted on PhysioNet. Access requires signing a formal data use agreement, which obligates all users to protect patient privacy, use the data solely for scientific research, and not attempt to re-identify any individual. The Second Look project uses this data exclusively for the purpose of developing a privacy-preserving screening tool. All team members working with this data have signed the PhysioNet data use agreement and are expected to handle it in full compliance with its terms.
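
Review note on the BI-RADS mapping in vindr_dataset_card.md: the card states that BI-RADS scores map onto the Low / Moderate / Elevated tiers but does not give the cutoffs. The sketch below shows what such a mapping could look like. The cutoffs (1-2 Low, 3 Moderate, 4-5 Elevated) are an assumption made purely for illustration and would need to be replaced with the project's actual rule; the label format (`"BI-RADS 3"` or a bare digit) and the name `concern_tier` are also assumptions.

```python
import re

# ASSUMED cutoffs for illustration only; the dataset card does not specify them.
TIER_BY_BIRADS = {
    1: "Low",
    2: "Low",
    3: "Moderate",
    4: "Elevated",
    5: "Elevated",
}

# Accepts labels such as "BI-RADS 3", "bi-rads 3", or "3" (assumed formats).
_LABEL_PATTERN = re.compile(r"(?:BI-RADS\s*)?([1-5])", flags=re.IGNORECASE)


def concern_tier(label: str) -> str:
    """Map a BI-RADS label string to a concern tier under the assumed cutoffs."""
    match = _LABEL_PATTERN.fullmatch(label.strip())
    if match is None:
        # Fail loudly on anything outside 1-5 instead of guessing a tier.
        raise ValueError(f"Unrecognized BI-RADS label: {label!r}")
    return TIER_BY_BIRADS[int(match.group(1))]


if __name__ == "__main__":
    for example in ("BI-RADS 1", "BI-RADS 3", "BI-RADS 5"):
        print(example, "->", concern_tier(example))
```

Keeping the cutoffs in a single table makes the clinical decision explicit and easy to review, separate from the parsing logic.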