PeopleFinderTech

From Katrina Help Info

What we are doing

There are over 50 sites that help people find lost loved ones from New Orleans. The problem is that none of these sites talk to one another. We are solving this by building automated data interchange systems and by scraping data sets. We need your help!

Here are the goals of this project

  1. Implement automated data interchange systems around the PFIF spec
  2. Scrape and merge data from sets that will not implement PFIF
  3. Minimize duplicate records
  4. Make the central database available for searching

We are in contact with KatrinaSafe and they will be using data we collect.

Coordination & Leadership

PFIF spec leaders:

  • Ka-Ping Yee (ping [at] zesty ca)
  • Josh Kleinpeter (kleinpeterj [at] corp.earthlink.net)
  • Jon Plax (salesforce.com)

Scraping Effort Leader:

  • Zack Rosen <zack at civicspacelabs.org>

PFIF implementation coordinator:

  • Zack Rosen <zack at civicspacelabs.org>

Discussions

Master Database

http://katrinalist.net - does not yet include any scraped or PFIF-fed data

Data interchange spec

The FINAL PeopleFinder Interchange Format (PFIF) specification is available at

http://zesty.ca/pfif/


PFIF/RDF

  • cilibrar has used this as the basis for the [removed SQL] and [removed XML] he made; however, he had to make two small adjustments:
  • In both tables the primary key is an int named id, for easy Ruby on Rails (http://www.rubyonrails.org/) compatibility. The other key (the incoming source database key string) keeps its original name, but it is no longer the real primary key for our database.
  • The entry_date was converted to an integer to be maximally portable, since different databases and SQL flavors handle dates differently. The integer is the number of seconds elapsed since the epoch, so it can be converted easily to human-readable form in any language using standard time functions.
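
The entry_date conversion described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and it assumes the incoming date is a UTC timestamp written as yyyy-mm-ddThh:mm:ssZ, which may not match every source.

```python
from datetime import datetime, timezone

# Assumed text layout of the incoming date (UTC, trailing "Z").
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def entry_date_to_epoch(text):
    """Convert a UTC timestamp string to whole seconds since the epoch."""
    parsed = datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def epoch_to_entry_date(seconds):
    """Convert seconds since the epoch back to the human-readable form."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime(DATE_FORMAT)
```

Storing the integer keeps the column type identical across databases; the readable form is rebuilt only when the record is displayed or exported.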

The SQL database schema is included with the SQL dump. Both are compressed using bzip2 (http://www.bzip.org/). If you are willing to dump your records in the same PFIF-based SQL schema described above, please do so and I can merge all the dumps. The FTP site, username, and password are available from the KatrinaDev list.

How to get involved

Scrape data sets

  1. Sign up on the KatrinaScrapers mailing list: mailto:katrinascrapers-subscribe@civicspacelabs.org
  2. Choose a set from the list PeopleFinderTech#Sites_that_need_to_be_scraped
  3. Move it under the "Sites that are currently being scraped" heading
  4. Update its status on PeopleFinderTechStructuredDataSets
  5. Let people know you are scraping the set on the KatrinaScrapers mailing list
  6. When you are done scraping, validate the data:
    1. Upload a single record of the data to http://www.katrinalist.net/uploadPFIF/
    2. Run the set through the PFIF validator: http://www.w3.org/2001/03/webdata/xsv
  7. Link to your data on the password-protected wiki (email zack [at] civicspacelabs.org for access)
  8. Move your wiki listing under the PeopleFinderTech#Validated section after it is validated
  9. Let the KatrinaScrapers mailing list know you have successfully scraped and validated a data set

If you have trouble getting your data to validate, feel free to ask PFIF questions on the KatrinaDev mailing list.

Validate data sets

  1. Choose a set, move it under the PeopleFinderTech#Being_validated section, and add your name and email address below its listing
  2. Notify the KatrinaScrapers mailing list which data set you are validating
  3. Get access to the file on the password-protected wiki (email zack [at] civicspacelabs.org for access)
  4. Validate the data:
    1. Upload a single record of the data to http://www.katrinalist.net/uploadPFIF/
    2. Run the set through the PFIF validator: http://www.w3.org/2001/03/webdata/xsv
    3. If the file is too large for this interface, a recommended alternative is a *NIX command-line utility:
      1. Get xmllint (it comes with most Unix distributions and Cygwin; go to http://xmlsoft.org/downloads.html for source, binaries, etc.)
      2. Download the XSD file at http://zesty.ca/pfif/1.1/pfif-1.1.xsd
      3. Invoke xmllint on your XML file (assume we call it pfif.xml):
        xmllint --noout --schema pfif-1.1.xsd pfif.xml
  5. If the feed is valid, move it under the PeopleFinderTech#Validated section. If it is invalid, move it under the PeopleFinderTech#Invalid_sets heading, then contact the person who scraped the data set and help them fix it
  6. Notify the KatrinaScrapers mailing list of your results
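
For anyone validating several files, the xmllint step above could be wrapped in a small Python helper. This is a rough sketch, not part of the official process: the function names are hypothetical, and it assumes xmllint is on your PATH and that pfif-1.1.xsd has already been downloaded to the working directory.

```python
import subprocess


def xmllint_command(xml_path, xsd_path="pfif-1.1.xsd"):
    """Build the same xmllint invocation shown in the steps above."""
    return ["xmllint", "--noout", "--schema", xsd_path, xml_path]


def validate_pfif(xml_path, xsd_path="pfif-1.1.xsd"):
    """Run xmllint and return (is_valid, messages).

    xmllint exits with status 0 when the file validates and reports
    problems on standard error otherwise.
    """
    result = subprocess.run(
        xmllint_command(xml_path, xsd_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()
```

Calling validate_pfif("pfif.xml") returns a pair you can use to decide whether the set belongs under Validated or Invalid sets, with the messages ready to forward to the scraper.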

Helping site admins implement PFIF feeds

  1. Choose a site from this list PeopleFinderTech#Sites_that_need_help_implementing_PFIF_feeds
  2. Contact the site admin and offer assistance
  3. Move the listing under the heading "Sites currently implementing PFIF feeds"
  4. When the site is putting out a validated PFIF feed, send a note to the KatrinaDev mailing list

We also maintain a task list: Task List


Data Sets

A list of structured data sets and contact information for the owners is up on PeopleFinderTechStructuredDataSets

PFIF Feeds

PFIF/RDF TRANSFORM


Courtesy of Peter Mika <pmika at cs.vu.nl> (http://prauw.cs.vu.nl:8080/pfif/). Feedback welcome.

Sites that have PFIF feeds

Sites currently implementing PFIF feeds

Sites that need help implementing PFIF feeds

Sites that agreed to implement PFIF but have unknown status

PFIF Implementation Volunteers

If you are available to help site admins implement PFIF, please add your name and email address to the list below:

  • Tony Chang: tony [at] ponderer.org - email me if you want help implementing PFIF
  • Andy Schmitz: andy.schmitz [at] gmail.com - at school most of the day, but can help in the evening.
  • Gordon E. Amond: Gordon [at] amonds.net - I would be proud to help my American neighbors.
  • Geoff Webb: geofflwebb [at] yahoo.com - I have time in the evenings and weekends.

Scraping

  • Mark sets that have been scraped.
  • Mark sets that have been uploaded to the salesforce.com repository with the date/time of the scrape and the date/time of the upload.
    • Uploads MUST conform to PFIF.
    • Source Name MUST be clear, unique, and explicit. It MUST be the same across all records from a single source and MUST include the time OF THE SCRAPE (for example: Scrape-gulfcoastnews-bycilibrar-9/5/2005-10am).
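
One way to keep the Source Name identical across every record from a scrape is to build it once from the scrape time. The sketch below is an assumption about the layout, inferred only from the single example given above; the function name and parameters are hypothetical.

```python
from datetime import datetime


def source_name(site, scraper, scraped_at):
    """Build a Source Name like Scrape-<site>-by<scraper>-<m/d/yyyy>-<hour><am|pm>."""
    hour12 = scraped_at.hour % 12 or 12  # 0 -> 12am, 13 -> 1pm
    suffix = "am" if scraped_at.hour < 12 else "pm"
    date_part = f"{scraped_at.month}/{scraped_at.day}/{scraped_at.year}"
    return f"Scrape-{site}-by{scraper}-{date_part}-{hour12}{suffix}"


# Compute the name once, at scrape time, and reuse it for every record.
name = source_name("gulfcoastnews", "cilibrar", datetime(2005, 9, 5, 10))
```

Computing the value once and reusing it avoids records from the same scrape ending up with slightly different source names.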

Sites that have been scraped

Imported

Being Uploaded to SalesForce

  • Family messages - 20,000 records
    • PeopleFinderTechStructuredDataSets#Family_Messages (more information)
    • Being uploaded --Aschmitz 21:13, 14 Sep 2005 (EDT). Status is here (http://lardbucket.org/projects/katrina/status_sul.txt)
    • Validation complete <darci.hanning @ gmail.com> (xmllint, XMLSpy and one record uploaded successfully), with the following outstanding questions from Dan <chaney @ dcre-labs.com>:
      • Q1: Zipcodes. The first unresolved error involves the zipcode field. It demands an integer (which I suspect will change in PFIF 1.2), so for now, is it appropriate to put in 00000 when the zipcode is unavailable (and to strip out +4 zip codes)?
      • A1: Yes.
      • Q2: Null date fields. Null date fields aren't allowed, nor is the unsightly "unknown". When I have no date for the source or entry dates, is the preferred action to not list the tagset at all?
      • A2: Source date should be the current date(?); entry date should either be provided or be an old date(?). I'm not sure about this. If Ping could take a look and give an authoritative answer, that would help.
      • Q3: In general, if I do not have data for a field, should I just not print a tagset for it?
      • A3: It's not clear. I would add it with blank data, otherwise SalesForce may choke on it.
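
The zipcode handling agreed in Q1/A1 above could look roughly like this in Python. It is a minimal sketch under that interim answer only: the function name is hypothetical, and the behavior may need to change if a later PFIF version relaxes the integer requirement.

```python
import re

# Five digits, optionally followed by a +4 extension such as "-1234".
ZIP_PATTERN = re.compile(r"^\s*(\d{5})(?:-\d{4})?\s*$")

PLACEHOLDER_ZIP = "00000"  # interim value for unavailable zipcodes (per A1)


def normalize_zip(raw):
    """Return a five-digit zipcode, dropping any +4 part.

    Missing or unparseable values become the 00000 placeholder.
    """
    if raw is None:
        return PLACEHOLDER_ZIP
    match = ZIP_PATTERN.match(raw)
    return match.group(1) if match else PLACEHOLDER_ZIP
```

For example, normalize_zip("70112-1234") gives "70112", while an empty string or "unknown" gives "00000".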


Validated

  • None other than the sets that are being uploaded or have already been uploaded

Invalid sets

  • None

Being validated

  • None

Need to be validated

  • None

Sites that are currently being scraped

Sites that need to be scraped

Sites that can't be scraped

Scraping volunteers

Please sign up on the KatrinaScrapers mailing list (mailto:katrinascrapers-subscribe@civicspacelabs.org) and introduce yourself.

Tools


PFIF XML Generators

  • PFIF XML Generation (http://ponderer.org/cvs/index.pl/python/katrina/src/) (Python) - objects that can easily be serialized into PFIF XML.
  • PFIF XML Generation (http://katrina.internet2.edu/~cilibrar/pfifmake.rb) (Ruby) - a single function call that takes an array of Person objects. Based on Josh's script sent to the list.
  • Perl XML::PFIF module (http://erislabs.net/ianb/projects/pfif/) (Perl) - report problems to <ianb [at] erislabs.net>.
  • PFIF XML exporter (http://www.hurricanerefugee.com/pfif_asp_code/) (ASP) - sample code for generating PFIF from SQL Server <egvandell at hotmail.com>

Scrapers

  • ICRC scraper (http://www.billglover.com/software/katrina/scrape_ICRC) (Perl) - This is deprecated; Brent has a new Java version with fixes for some problems.
  • CNN scraping code (http://www.summertime-software.com/CNNScrape.090705.0526.zip) (Python)
  • Gulf Coast Scraper (http://homepages.cwi.nl/~cilibrar/projects/a/gulfcoast/process.rb) (Ruby)
  • OO PHP Scraping Tool hacked together by Jonathan Lambert (PHP) - The main scraper script (http://workhabit.com/framework/scraper.phps), which in this case was used to scrape http://www.publicpeoplelocator.com, and the http class that does the work (http://workhabit.com/framework/class_http.phps). This should be really easy to adjust to scrape virtually any site. It automatically rips tables to arrays, generates the header and footer, cleans up, etc. in a couple of lines of code. Does not appear to support notes.

Misc

  • PFIF Uploader (http://lardbucket.org/projects/katrina/split_upload.phps) (PHP) Splits a large XML file into smaller (30 people) chunks and uploads them. Requires libCurl for PHP and write access to the current directory. Edit the second line to refer to your PFIF file. Andy Schmitz <andy.schmitz at gmail.com>
  • FindAPlace Application (Drupal)
    Image: http://civicspacelabs.org/home/files/images/FindAPlace.jpeg
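
The chunking idea behind the PFIF Uploader listed above (splitting a large set into groups of 30 people before uploading) can be sketched in Python. This is not the PHP tool's actual code: the function name is hypothetical, and the records are assumed to already be parsed into a list.

```python
def chunk_records(records, size=30):
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Example with 65 placeholder records: two full chunks and one partial.
chunk_sizes = [len(chunk) for chunk in chunk_records(list(range(65)))]
```

Each chunk would then be serialized to its own small PFIF document and uploaded separately, so a single failed upload does not require resending the whole set.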
