feat: enhance Code City Apocalypse dashboard with logs and logic fixes by BleakNarratives · Pull Request #32 · BleakNarratives/AIRTBench-Code

BleakNarratives · 2026-05-23T22:12:44Z

This PR significantly enhances the 'Code City Apocalypse' Streamlit dashboard.

Key changes include:

System Logs Tab: A new interactive tab that displays live and archived system telemetry in a terminal-themed view, complete with timestamp parsing and model identification.
Leaderboard Overhaul: Success rates are now calculated based on unique challenge IDs secured by each model, ensuring accurate representation of progress. All active models are listed, regardless of success count.
Metric Correction: The 'Sectors Compromised' metric has been updated to count unique challenges with at least one success, aligning with the 'Secured Sectors' theme.
Visual Polish:
- Updated CSS for .lore-card to provide a more immersive apocalyptic feel.
- Added a neon-green glow specifically for 'The Beacon of Hope' success event in the Digital Bestiary.
- Standardized role labels in chat histories (🤖 AGENT, 👤 OPERATOR, ⚙️ SYSTEM).
Robustness & Performance:
- Added checks to prevent crashes when data is filtered down to zero events.
- Used .copy() on dataframe slices to eliminate pandas warnings.
- Integrated standard library datetime and html for safer log rendering.

Verified with ruff, mypy, and automated Playwright UI testing.

PR created automatically by Jules for task 16745960992002796971 started by @BleakNarratives

- Added 'System Logs' tab for terminal-style telemetry feed. - Refactored leaderboard to correctly calculate unique successes per model. - Fixed 'Sectors Compromised' logic to reflect secured sectors accurately. - Improved 'Digital Bestiary' visuals with neon-green success highlights. - Added safety guards for empty data states in monster manifests. - Standardized use of .copy() to avoid SettingWithCopyWarning. - Ensured all active models are shown in leaderboard, even with 0 successes. Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>

google-labs-jules · 2026-05-23T22:12:46Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

gemini-code-assist

Code Review

This pull request updates the airtbench frontend by adding a System Logs tab, refining the Digital Bestiary UI, and improving the accuracy of apocalypse metrics. It also includes logic refinements for success detection and data handling. Review feedback identifies performance bottlenecks in the new logging function related to string concatenation and redundant date parsing, and recommends hoisting constant list derivations to the global scope to avoid unnecessary re-computation.

gemini-code-assist · 2026-05-23T22:15:09Z

+    log_html = '<div class="log-container">'
+    # Display newest first
+    for _, row in failure_df.iloc[::-1].iterrows():
+        # Handle timestamp: use existing col if present, else fallback
+        timestamp = "LIVE"
+        if "timestamp" in row and pd.notna(row["timestamp"]):
+            try:
+                # If numeric, assume unix
+                ts_val = row["timestamp"]
+                if isinstance(ts_val, (int, float)):
+                    timestamp = datetime.datetime.fromtimestamp(
+                        ts_val,
+                        tz=datetime.timezone.utc,
+                    ).strftime("%H:%M:%S")
+                else:
+                    # Try ISO string
+                    timestamp = pd.to_datetime(ts_val).strftime("%H:%M:%S")
+            except (ValueError, OSError):
+                timestamp = "LIVE"
+
+        model = html.escape(str(row.get("model_display", "UNKNOWN")))
+        challenge = html.escape(str(row.get("challenge", "UNKNOWN")))
+
+        if row.get("is_success"):
+            status_class = "log-status-success"
+            msg = "BEACON SECURED"
+        else:
+            status_class = "log-status-fail"
+            monster_info = get_monster_info(row)
+            msg = f"OVERWHELMED BY {monster_info['monster']}"
+
+        log_html += (
+            f'<div class="log-entry">'
+            f'<span class="log-time">[{timestamp}]</span> '
+            f'<span class="log-model">{model}</span> @ '
+            f'<span class="log-challenge">{challenge}</span>: '
+            f'<span class="{status_class}">{msg}</span>'
+            f"</div>"
+        )
+
+    log_html += "</div>"


The render_system_logs function has significant performance and scalability issues:

Scalability: It attempts to render the entire failure_df as a single HTML block. As the dataset grows, this will result in a massive HTML payload, causing the Streamlit app to lag and potentially crashing the user's browser.

Efficiency: Building a large string using += in a loop is an $O(n^2)$ operation in Python. Using "".join() with a list is much more efficient.

Redundant Parsing: pd.to_datetime is called inside the loop for every row, which is a heavy operation.

I recommend limiting the display to the most recent entries (e.g., the last 100) and optimizing the string construction.

References

Follow standard Python performance practices regarding string immutability and pandas iteration. ^(link)

Avoid iterrows() and string concatenation in loops for large DataFrames to maintain performance.

gemini-code-assist · 2026-05-23T22:15:09Z

        else row.get("error")
    )
-    error_keys = [k for k in ERROR_MAP.keys() if k != "found_flag"]
+    error_keys = [k for k in ERROR_MAP if k != "found_flag"]


The error_keys list is re-created every time get_monster_info is called. Since ERROR_MAP is a constant, this list (and the sorted version used later in the function) should be defined as a global constant to avoid redundant computations, especially when this function is called within loops (e.g., during log rendering).

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: enhance Code City Apocalypse dashboard with logs and logic fixes#32

feat: enhance Code City Apocalypse dashboard with logs and logic fixes#32
BleakNarratives wants to merge 1 commit into
mainfrom
feat/improve-code-city-dashboard-16745960992002796971

BleakNarratives commented May 23, 2026

Uh oh!

google-labs-jules Bot commented May 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

BleakNarratives commented May 23, 2026

Uh oh!

google-labs-jules Bot commented May 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant