Collect additional property tax related data from documents, using LLMs for parsing.
Overview
There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including TIF ordinance, TIF redevelopment plans, municipal/district budgets, and SSA info. If this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first ever collection of such data.
The problem is that this data is messy. There is no standard format for something like TIF ordinance, so each document will have a completely different format and language, depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text - usually without any OCR applied - spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.
Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We may be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.
⚠️ This is an experimental/moonshot task. We don't know for sure that LLMs will work here or that there's enough structured information to be useful. However, if it does work, then it will produce the first digitized collection of such data and a nice proof-of-concept that we can potentially use elsewhere in the office.
Getting Started
The first thing we need to do is take inventory, first of data, then of LLMs. I would make spreadsheets tracking each of the relevant datapoints.
Data
We need to take stock of what data actually exists that is:
Available - Easy(ish) to collect
Valuable - Actually needed inside PTAXSIM and useful for analysis
Parsable - Possible to be read and contextualized by an LLM
I recommend we start with the following datasets:
TIF information
Includes ordinance, redevelopment plans, and annual reports
This data might be a good place to start because it's limited in scope. There are a finite number of TIFs in Cook County, and most of their text should be available online
Possible datapoints to collect include:
Establishment information (who proposed, what criteria were used, what was the projected revenue, initial PINs included, what projects were originally planned)
Expenditures and porting information (where is money going, to what other TIFs)
Taxing district budgets
Includes topline expenditures by taxing agency + contextual notes on that agency's main functions
This might be easier to collect for the county since most bodies publish a public budget, but condensing it down into a single SQL table will be challenging
This will be harder to collect in the long run since there are far more taxing agencies than TIFs, budgets would need to be collected for each fiscal year, and different agencies may have different fiscal years
LLMs
The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities/whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:
Can ingest a large document and provide a summary or key bits of information, ideally in a machine-readable output format
Can ideally ingest a raw image PDF, rather than one that's OCR'd. If necessary, we could set up a separate OCR pipeline
Are specifically trained on legal or government documents
Are low-cost or free. We can run locally or on EC2 if needed
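To make the "machine-readable output format" requirement concrete, here is a minimal sketch of how an LLM's reply could be checked before it goes anywhere near a SQL table. It assumes the model is asked to reply with a single JSON object. The field names in `REQUIRED_FIELDS` and the function `parse_llm_output` are hypothetical; the real schema is still to be decided, and no particular LLM or API is implied.

```python
import json

# Hypothetical field names for one TIF record; the real schema is undecided.
REQUIRED_FIELDS = {
    "tif_name": str,
    "municipality": str,
    "year_established": int,
    "projected_revenue": (int, float),
}


def parse_llm_output(raw_text):
    """Parse an LLM's JSON reply and check it against REQUIRED_FIELDS.

    Returns (record, errors). `record` is None when the text is not a
    JSON object at all; otherwise `errors` lists missing or mistyped fields.
    """
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["expected a JSON object"]

    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for field: {field}")
    return record, errors
```

A check like this only catches structural problems (missing or mistyped fields). It says nothing about whether the extracted values are correct, which still needs a separate review.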
Tasks
Before proceeding to coding, the following tasks should be complete:
Topline inventory of data to be collected. Can be a markdown list or Excel sheet attached to this issue. Should include:
Which documents need to be collected (TIF ordinance, muni budgets, etc.)
Where those documents will be collected from
What (estimated) percentage of those documents are immediately available i.e. without a records request
What datapoints can be collected from those documents i.e. establishment criteria, projected revenue, etc.
Inventory of LLMs and their capabilities. I would create a table/matrix with each LLM as a column and each capability or attribute as a row. Attach to this issue
Specific inventory of data to be collected. Once the topline inventory is done, make a list of all the documents we need to collect, their source, whether they've been fetched, whether they've been OCR'd, whether they've been LLM-parsed, etc.
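The specific inventory could live in a spreadsheet, but as a rough sketch, the same tracking can be expressed as one record per document plus a helper that reports the next outstanding step. The class `DocumentRecord`, its column names, and the step names are all illustrative assumptions, not an agreed-upon format. The OCR step is included only because a separate OCR pipeline may turn out to be necessary.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class DocumentRecord:
    # All column names here are illustrative, not an agreed-upon schema.
    doc_id: str
    doc_type: str            # e.g. "tif_ordinance", "muni_budget"
    source: str              # where the document was (or will be) collected from
    fetched: bool = False
    ocr_done: bool = False
    llm_parsed: bool = False


def next_step(doc):
    """Return the next outstanding pipeline step for a document."""
    if not doc.fetched:
        return "fetch"
    if not doc.ocr_done:
        return "ocr"
    if not doc.llm_parsed:
        return "llm_parse"
    return "done"


def to_csv(docs):
    """Serialize the inventory to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(DocumentRecord)]
    )
    writer.writeheader()
    for doc in docs:
        writer.writerow(asdict(doc))
    return buffer.getvalue()
```

The CSV output can be attached to this issue or opened in Excel, so the same inventory works for both tracking and scripting.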
Outline
Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram + using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.
Processing
Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets. We can create a new S3 bucket/dir you can use to store each stage. This will be the stage actually using LLMs. We can scope it out further as we get closer to this stage.
Share a broad overview of data processing architecture with @dfsnow
Any scripts used for processing must live in data-raw/, though we may not want the raw data itself there
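One possible way to lay out the raw, processed, and completed buckets is a fixed key scheme of `prefix/stage/doc_type/filename`. The sketch below only builds key strings; it does not talk to S3. The prefix `llm-documents`, the document type names, and both helper functions are hypothetical, since the actual bucket/dir does not exist yet.

```python
from pathlib import PurePosixPath

# Stage names taken from the description above, in processing order.
STAGES = ("raw", "processed", "completed")

# Hypothetical single-segment prefix; the real bucket/dir is not created yet.
BASE_PREFIX = "llm-documents"


def stage_key(stage, doc_type, filename):
    """Build the object key for a document in a given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return str(PurePosixPath(BASE_PREFIX, stage, doc_type, filename))


def promote(key, new_suffix=None):
    """Return the key for the same document in the next stage.

    Optionally swap the file extension, e.g. ".pdf" -> ".json" once a
    scan has been parsed into structured output.
    """
    prefix, stage, doc_type, filename = PurePosixPath(key).parts
    index = STAGES.index(stage)  # raises ValueError for an unknown stage
    if index == len(STAGES) - 1:
        raise ValueError(f"{stage!r} is already the final stage")
    name = PurePosixPath(filename)
    if new_suffix is not None:
        name = name.with_suffix(new_suffix)
    return stage_key(STAGES[index + 1], doc_type, str(name))
```

For example, `promote("llm-documents/raw/tif_ordinance/doc-001.pdf", ".json")` gives `llm-documents/processed/tif_ordinance/doc-001.json`. Keeping the stage in the key makes it easy to list everything in a given state.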
Package updates
Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:
Update data-raw/create_db.sql to add new table definitions for your finished data
Add a new script (or scripts) to data-raw/ that pulls the processed data from S3 and loads it into the SQLite DB (via data-raw/create_db.R)
Update the database diagrams in the README to include your new tables
(Reach goal) Add a new R function (or arguments to an existing function) to return your data
(Reach goal) Add a short document to vignettes/ describing what your data is and how to use it
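The package itself builds the database through data-raw/create_db.sql and data-raw/create_db.R, so the real loader would most likely be written in R. Purely to illustrate the shape of the load step, here is a Python sketch using the standard library's `sqlite3` module. The table `tif_establishment`, its columns, and the function `load_records` are hypothetical and are not part of the existing PTAXSIM schema.

```python
import sqlite3

# Hypothetical table and columns, for illustration only.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS tif_establishment (
    tif_name          TEXT NOT NULL,
    municipality      TEXT NOT NULL,
    year_established  INTEGER NOT NULL,
    projected_revenue REAL,
    PRIMARY KEY (tif_name, municipality)
)
"""

INSERT_SQL = """
INSERT OR REPLACE INTO tif_establishment
VALUES (:tif_name, :municipality, :year_established, :projected_revenue)
"""


def load_records(conn, records):
    """Create the table if needed, upsert the records, return the row count.

    `records` is a list of dicts keyed by the column names above.
    """
    conn.execute(CREATE_SQL)
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, records)
    return conn.execute("SELECT COUNT(*) FROM tif_establishment").fetchone()[0]
```

Using `INSERT OR REPLACE` with a primary key means the load can be re-run on the same processed data without creating duplicate rows.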
Additional Requirements
Any updates to the package must come via a pull request. You should work on a separate branch and notify @dfsnow when ready to merge
Don't commit large data objects to the repository, particularly large Git LFS objects
This data must be accurate. At some point down the line we will need to discuss a review process for this data