Data extraction
https://github.com/jheasly/homeless-cleanups
Using the pdfplumber Python package, wrote a Jupyter notebook script that goes through 6,590 pages in 2,814 documents FOIA'd from the City of Eugene, extracts the work order data and writes it to a .csv file.
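A minimal sketch of that kind of extraction, not the notebook's actual code. The field labels in `FIELD_PATTERNS`, the helper names and the one-row-per-page layout are all assumptions made for illustration; the real work orders may be laid out quite differently.

```python
import csv
import re
from pathlib import Path

# Hypothetical field labels; the real work orders may use different ones.
FIELD_PATTERNS = {
    "work_order": re.compile(r"Work Order\s*#?:?\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{1,2}/\d{1,2}/\d{4})"),
    "location": re.compile(r"Location:\s*(.+)"),
}


def parse_work_order(text):
    """Pull the labeled fields out of one page of extracted text."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text or "")
        row[name] = match.group(1).strip() if match else ""
    return row


def extract_pdfs_to_csv(pdf_dir, out_csv):
    """Walk every PDF in pdf_dir and write one CSV row per page."""
    import pdfplumber  # third-party: pip install pdfplumber

    fieldnames = ["source_file", "page", *FIELD_PATTERNS]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            with pdfplumber.open(str(pdf_path)) as pdf:
                for number, page in enumerate(pdf.pages, start=1):
                    # extract_text() can come back empty on scanned pages.
                    row = parse_work_order(page.extract_text())
                    row.update(source_file=pdf_path.name, page=number)
                    writer.writerow(row)
```

Keeping the text parsing separate from the PDF walking makes the regexes easy to check against a pasted page of text before running all the documents.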
Dataviz
https://www.gannett-cdn.com/west-hub-production/homeless-camps/index.html
Using data from the .csv file created above, made a map in Mapbox GL, hosted it in a Google Cloud Platform bucket and embedded it in this story presentation, which I built using Gannett's proprietary In Depth framework.
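One plausible way to get from a .csv file to something a web map can load is to convert the rows to GeoJSON points. This is a sketch only: it assumes the rows carry coordinates in columns named `latitude` and `longitude`, which are hypothetical names, and it is not necessarily how the published map was fed.

```python
import csv


def csv_to_geojson(csv_file, lat_field="latitude", lon_field="longitude"):
    """Build a GeoJSON FeatureCollection of points from CSV rows."""
    features = []
    for row in csv.DictReader(csv_file):
        try:
            lat = float(row.pop(lat_field))
            lon = float(row.pop(lon_field))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column becomes a feature property.
            "properties": row,
        })
    return {"type": "FeatureCollection", "features": features}
```

The returned dictionary can be written out with `json.dump` and hosted as a static file.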
How Lane County voted for president
First attempt at a choropleth, using Mapbox GL & QGIS to join scraped county election results with a precinct shapefile. Exported GeoJSON out of QGIS (and then learned about reducing GeoJSON file size).
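The join here was done in QGIS; the sketch below is a Python stand-in for the same idea, paired with one common way to shrink GeoJSON (rounding coordinates and dropping whitespace). The `precinct` key, the function names and the five-decimal default are assumptions, and this may not be the size-reduction approach actually used.

```python
import json


def round_coords(coords, places=5):
    """Recursively round a GeoJSON coordinate array to `places` decimals."""
    if isinstance(coords, (int, float)):
        return round(coords, places)
    return [round_coords(item, places) for item in coords]


def join_and_slim(geojson, results, key="precinct", places=5):
    """Attach results to matching features and trim coordinate precision."""
    for feature in geojson["features"]:
        props = feature["properties"]
        # Features with no matching result are left unchanged.
        props.update(results.get(props.get(key), {}))
        geometry = feature["geometry"]
        geometry["coordinates"] = round_coords(geometry["coordinates"], places)
    # Compact separators drop the whitespace json.dumps adds by default.
    return json.dumps(geojson, separators=(",", ":"))
```

Five decimal places of a degree is roughly a meter of precision, which is usually plenty for precinct boundaries on a web map.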
Scrapers
USA TODAY national data & investigations team: ‘A national disgrace’: 40,600 deaths tied to US nursing homes
Pitched in on a distributed three-person team of Python developers building scrapers on deadline to supplement the manual collection of state nursing home data for a USA TODAY story detailing the national COVID-19 death toll at long-term care facilities.
Eugene, Ore., police call log
First scraper I wrote; Dec. 2008. Scrapes Eugene Police Department police call log every 15 minutes. Currently >920K rows.
Springfield, Ore., police call log
Scrapes Springfield police call log every 15 minutes. Since 2013, >230K rows.
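A scraper that runs every 15 minutes will usually see some of the same calls on consecutive runs, so the storage step has to ignore rows it already has. A minimal sketch of that step using SQLite; the table layout, column names and the idea that each call has a unique `call_id` are assumptions, not the actual schema behind either log.

```python
import sqlite3


def open_log(path=":memory:"):
    """Create the table if needed; the primary key blocks duplicate calls."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS calls ("
        "call_id TEXT PRIMARY KEY, received TEXT, description TEXT)"
    )
    return conn


def store_calls(conn, calls):
    """Insert scraped calls, skipping ones already stored; return new count."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO calls VALUES (:call_id, :received, :description)",
        calls,
    )
    conn.commit()
    return conn.total_changes - before
```

Each scheduled run would pass its freshly scraped rows to `store_calls`; the return value is how many were actually new.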
Websites that reverse publish, APIs
http://local2.registerguard.com/civic/meetings/
A place for credentialed local entities to enter meeting information as required by law. Password-protected posts publish immediately to the web (the owner has CRUD capability) and reverse publish daily into the print Civic Calendar item.
Public repo: github.com/registerguard/civic_calendar2
http://vote.registerguard.com
No link, for it is currently sad and moribund. (Perhaps resurrected in 2020.) A landing page for local election information. Powered by JSON feeds that come from a Django backend fed by a Selenium-powered web scraper of the Oregon Secretary of State site. Outputs results in InDesign tagged text for use in print. (Okay, if you must look, here's a link.)
Public repo: github.com/registerguard/ballot
Sample JSON API response: vote.registerguard.com/results/laneco.json
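To give a sense of the reverse-publish step, here is a rough sketch of turning results into InDesign tagged text. The input structure and the paragraph style names (`ContestHead`, `ResultLine`) are hypothetical, and the sketch does not escape special characters. The right file header and line endings depend on the platform and encoding, so check them against Adobe's tagged text documentation.

```python
def results_to_tagged_text(contests):
    """Render contests as InDesign tagged text, one paragraph per line."""
    lines = ["<ASCII-MAC>"]  # file header; other encodings use other headers
    for contest in contests:
        # Style names are placeholders for styles defined in the print template.
        lines.append(f"<ParaStyle:ContestHead>{contest['name']}")
        for candidate in contest["candidates"]:
            lines.append(
                f"<ParaStyle:ResultLine>{candidate['name']}\t{candidate['votes']:,}"
            )
    return "\n".join(lines) + "\n"
```

The tab between name and vote count lets a tab stop in the paragraph style line up the numbers in print.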
http://go.registerguard.com/entertainment/
A since-superseded Django entertainment calendar app that let anonymous and trusted users enter event information, which was available online and also generated the weekly Entertainment section listing via InDesign tagged text.
https://cloud.registerguard.com/discovery/
Online adventure guide listing built with Leaflet & OpenStreetMap, powered by Tarbell and Google Sheets, that also produces InDesign tagged text for print.
Public repo: github.com/registerguard/discovery
XML feed mungers, Twitter bots, RSS feeds
http://projects.registerguard.com/school-closings/
Parses a push FlashAlert.net XML feed every 15 minutes. When there are delayed school openings or closures, the result is this index page, a home page widget and a Tweet from @registerguard. (Note: If there currently isn’t bad weather in Lane County, Ore., USA, there probably isn’t a lot to see here.)
Public repo: github.com/registerguard/django-flashnews
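A simplified sketch of the parse-then-announce flow, not the code in the repo. The feed shape in `SAMPLE_FEED` is invented for illustration (FlashAlert's real schema is not shown here), and `parse_closures` and `compose_tweet` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed shape; the real FlashAlert XML will differ.
SAMPLE_FEED = """<alerts>
  <alert><org>Example School District</org><status>Closed</status></alert>
  <alert><org>Sample Academy</org><status>2-hour delay</status></alert>
  <alert><org>Quiet School</org><status></status></alert>
</alerts>"""


def parse_closures(xml_text):
    """Return (organization, status) pairs for alerts that carry a status."""
    root = ET.fromstring(xml_text)
    closures = []
    for alert in root.iter("alert"):
        org = (alert.findtext("org") or "").strip()
        status = (alert.findtext("status") or "").strip()
        if org and status:
            closures.append((org, status))
    return closures


def compose_tweet(closures, limit=280):
    """Summarize closures in one message, truncated to the length limit."""
    if not closures:
        return ""  # nothing to announce when there are no closures
    text = "; ".join(f"{org}: {status}" for org, status in closures)
    return text if len(text) <= limit else text[: limit - 1] + "…"
```

The same list of closures could feed the index page, the home page widget and the Tweet, so the feed only needs to be parsed once per run.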
http://projects.registerguard.com/school-closings/roads/
Ditto.
Public repo: github.com/registerguard/django-flashnews
Also, built an automated print archive.
Our previous CMS had no public-facing archive, so I took the initiative to build one. The only available database driver was written in Java, so I learned enough Jython to get a nightly cronjob export working.
The archive was useful for many things, e.g. it powered story feeds used by The Associated Press, ProQuest, etc. Here's a NewsBank Atom feed.
When it came time to transition to a new CMS, I used the archive app to quickly pull together a custom XML export of nine years' worth of stories (more than 250,000 locally produced items plus related assets) that were all imported into the new CMS; no stories lost.