Collect additional property tax related data using LLMs

## Goal

**Collect additional property tax related data from documents, using LLMs for parsing.**

## Overview

There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including TIF ordinances, TIF redevelopment plans, municipal/district budgets, and Special Service Area (SSA) information. *If* this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first-ever collection of such data.

The problem is that this data is *messy*. There is no standard format for something like a TIF ordinance, so each document has a different format and language depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text, usually without any OCR applied, spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.

Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We *may* be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.

> :warning: This is an experimental/moonshot task. We don't know for sure that LLMs will work here or that there's enough structured information to be useful. However, if it *does* work, then it will produce the first digitized collection of such data and a nice proof-of-concept that we can potentially use elsewhere in the office.

## Getting Started

The first thing we need to do is take inventory, first of data, then of LLMs. I would make spreadsheets tracking each of the relevant datapoints.

#### Data

We need to take stock of what data actually exists that is:

- **Available** - Easy(ish) to collect
- **Valuable** - Actually needed inside PTAXSIM and useful for analysis
- **Parsable** - Possible to be read and contextualized by an LLM

I recommend we start with the following datasets:

##### TIF information

- Includes ordinances, redevelopment plans, and annual reports
  - See examples for the [City of Chicago](https://www.chicago.gov/city/en/depts/dcd/provdrs/tif.html)
  - [Example ordinance](https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/T_150_AddisonSouthRDPORD.pdf)
  - [Redev. plan](https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/T_150_AddisonSouthRDP.pdf)
- This data might be a good place to start because it's limited in scope. There are a finite number of TIFs in Cook County, and most of their text *should* be available online
- Possible datapoints to collect include:
  - Establishment information (who proposed, what criteria were used, what was the projected revenue, initial PINs included, what projects were originally planned)
  - Expenditures and porting information (where is money going, to what other TIFs)
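
To make the establishment datapoints above concrete, here is a minimal sketch of one possible target record and a strict parser for an LLM's JSON answer. The class name `TifEstablishment`, its field names, and `parse_llm_output` are all hypothetical, not part of PTAXSIM. The idea is to reject any answer that does not match the expected shape before it gets anywhere near a SQL table.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class TifEstablishment:
    """Hypothetical record for one TIF's establishment information."""

    tif_name: str
    proposed_by: Optional[str] = None
    eligibility_criteria: List[str] = field(default_factory=list)
    projected_revenue: Optional[float] = None
    initial_pins: List[str] = field(default_factory=list)
    planned_projects: List[str] = field(default_factory=list)


def parse_llm_output(raw_json: str) -> TifEstablishment:
    """Parse an LLM's JSON answer, rejecting unknown or missing fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")

    allowed = {f.name for f in fields(TifEstablishment)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"Unexpected fields from LLM: {sorted(unknown)}")
    if "tif_name" not in data:
        raise ValueError("Missing required field: tif_name")

    return TifEstablishment(**data)


# Placeholder values only; not taken from a real ordinance.
record = parse_llm_output('{"tif_name": "Example TIF", "projected_revenue": 1000.0}')
print(record)
```

Datapoints the model could not find stay `None` or empty rather than being guessed, which should make gaps easy to spot during review.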

##### Taxing district budgets

- Includes topline expenditures by taxing agency + contextual notes on that agency's main functions
- This *might* be easier to collect for the county since most bodies publish a public budget, but condensing it down into a single SQL table will be challenging
- This will be harder to collect in the long run since there are far more taxing agencies than TIFs, budgets would need to be collected for each fiscal year, and different agencies may have different fiscal years

#### LLMs

The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities/whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:

- Can ingest a large document and provide a summary or key bits of information, ideally in a machine-readable output format
- Can ideally ingest a raw *image* PDF, rather than one that's OCR'd. If necessary, we could set up a separate OCR pipeline
- Are specifically trained on legal or government documents
- Are low-cost or free. We can run locally or on EC2 if needed

### Tasks

Before proceeding to coding, the following tasks should be complete:

- [ ] Topline inventory of data to be collected. Can be a markdown list or Excel sheet attached to this issue. Should include:
  - Which *documents* need to be collected (TIF ordinances, muni budgets, etc.)
  - Where those documents will be collected from
  - What (estimated) percentage of those documents are immediately available i.e. without a records request
  - What datapoints can be collected from those documents, e.g. establishment criteria, projected revenue, etc.
- [x] Inventory of LLMs and their capabilities. I would create a table/matrix with each LLM as a column and each capability or attribute as a row. Attach to this issue
- [ ] Specific inventory of data to be collected. Once the topline inventory is done, make a list of all the documents we need to collect, their source, whether they've been fetched, whether they've been OCR'd, whether they've been LLM-parsed, etc.
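
The specific inventory described above could start as a simple CSV. The sketch below is one possible shape, assuming hypothetical column names (`document`, `doc_type`, `source`, `fetched`, `ocr_done`, `llm_parsed`) and a derived `next_step` column. The single example row uses the ordinance linked earlier in this issue.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InventoryRow:
    """Hypothetical inventory entry for one source document."""

    document: str
    doc_type: str
    source: str
    fetched: bool = False
    ocr_done: bool = False
    llm_parsed: bool = False

    def next_step(self) -> str:
        """Return the first processing step that has not happened yet."""
        if not self.fetched:
            return "fetch"
        if not self.ocr_done:
            return "ocr"
        if not self.llm_parsed:
            return "llm_parse"
        return "done"


def to_csv(rows: Iterable[InventoryRow]) -> str:
    """Render the inventory as CSV text with a derived next_step column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["document", "doc_type", "source", "fetched", "ocr_done", "llm_parsed", "next_step"]
    )
    for r in rows:
        writer.writerow(
            [r.document, r.doc_type, r.source, r.fetched, r.ocr_done, r.llm_parsed, r.next_step()]
        )
    return buf.getvalue()


rows = [
    InventoryRow(
        document="T_150_AddisonSouthRDPORD.pdf",
        doc_type="tif_ordinance",
        source="https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/T_150_AddisonSouthRDPORD.pdf",
        fetched=True,
    ),
]
print(to_csv(rows))
```

If the chosen LLM can read image PDFs directly, the `ocr_done` column and the `"ocr"` step could simply be dropped.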

## Outline

Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram + using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.

#### Processing

Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets. We can create a new S3 bucket/dir you can use to store each stage. This is the stage that will actually use LLMs. We can scope it out further as we get closer to it.

- [ ] Share a broad overview of data processing architecture with @dfsnow
- [ ] Any scripts used for processing *must* live in `data-raw/`, though we may not want the raw data itself there
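
One way to keep the raw, processed, and completed buckets consistent is to derive every object key from a single helper. The sketch below only builds key strings and makes no S3 calls. The prefix names, the `doc_type` path segment, and both function names are assumptions for illustration, not an agreed layout.

```python
from pathlib import PurePosixPath

# Assumed stage prefixes, in processing order.
STAGES = ("raw", "processed", "completed")


def stage_key(stage: str, doc_type: str, filename: str) -> str:
    """Build an object key of the form <stage>/<doc_type>/<filename>."""
    if stage not in STAGES:
        raise ValueError(f"Unknown stage: {stage!r}")
    return str(PurePosixPath(stage) / doc_type / filename)


def next_stage_key(key: str, suffix: str) -> str:
    """Map a key to the following stage, swapping the file extension."""
    path = PurePosixPath(key)
    idx = STAGES.index(path.parts[0])  # raises ValueError for unknown stages
    if idx == len(STAGES) - 1:
        raise ValueError(f"{key!r} is already in the final stage")
    moved = PurePosixPath(STAGES[idx + 1], *path.parts[1:])
    return str(moved.with_suffix(suffix))


raw = stage_key("raw", "tif_ordinance", "T_150_AddisonSouthRDPORD.pdf")
print(raw)
print(next_stage_key(raw, ".json"))
```

Because each stage mirrors the same path below its prefix, a document's status can be read off from which prefixes contain it.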

#### Package updates

Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:

- [ ] Update `data-raw/create_db.sql` to add new table definitions for your finished data
- [ ] Add a new script (or scripts) to `data-raw/` that pulls the processed data from S3 and loads it into the SQLite DB (via `data-raw/create_db.R`)
- [ ] Update the database diagrams in the README to include your new tables
- [ ] (Reach goal) Add a new R function (or arguments to an existing function) to return your data
- [ ] (Reach goal) Add a short document to `vignettes/` describing what your data is and how to use it
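
As a rough illustration of the load step, the sketch below creates a table and inserts parsed records into an in-memory SQLite database. Python is used here only for illustration; in the package itself this work would go through `data-raw/create_db.sql` and `data-raw/create_db.R`. The table name `tif_establishment`, its columns, and `load_records` are hypothetical and do not reflect the existing PTAXSIM schema.

```python
import sqlite3
from typing import Dict, List

# Hypothetical table definition; a real one would be added to create_db.sql.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS tif_establishment (
    tif_name          TEXT NOT NULL,
    source_document   TEXT NOT NULL,
    projected_revenue REAL,
    PRIMARY KEY (tif_name, source_document)
)
"""

INSERT_SQL = """
INSERT OR REPLACE INTO tif_establishment
    (tif_name, source_document, projected_revenue)
VALUES
    (:tif_name, :source_document, :projected_revenue)
"""


def load_records(conn: sqlite3.Connection, records: List[Dict]) -> int:
    """Insert records idempotently and return the resulting row count."""
    conn.execute(CREATE_SQL)
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, records)
    return conn.execute("SELECT COUNT(*) FROM tif_establishment").fetchone()[0]


conn = sqlite3.connect(":memory:")
records = [
    # Placeholder values only; not taken from a real ordinance.
    {"tif_name": "Example TIF", "source_document": "example.pdf", "projected_revenue": 1000.0},
]
print(load_records(conn, records))
```

The composite primary key plus `INSERT OR REPLACE` means re-running the load after re-parsing a document overwrites its row instead of duplicating it.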

## Additional Requirements

- Any updates to the package *must* come via a pull request. You should work on a separate branch and notify @dfsnow when ready to merge
- Don't commit large data objects to the repository, particularly large Git LFS objects
- This data *must* be accurate. At some point down the line we will need to discuss a review process for this data

