04-28-2008, 02:56 PM
naraht
Automation of download of chapter list page

I've written a script that downloads the information on the
http://www.apo.org/show/How_to_Start...r/Chapter_List page (the listing you get after clicking Go) and boils it down to a vertical-bar-delimited file suitable for importing into spreadsheets and databases. It should work on any Unix/Linux/Mac machine that has netcat installed. Depending on the machine, the command is called either nc or netcat.
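
Since the command name differs between machines, a script can pick whichever one is present. This is only a minimal sketch: the function name find_netcat and the variable NC are made up for illustration and are not part of the script in this post.

```shell
#!/bin/bash
# Hypothetical helper: print the first command name found on PATH,
# or return non-zero if none of the candidates exists.
find_netcat() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Try the two names mentioned in the post, in that order.
NC=$(find_netcat nc netcat) || echo "neither nc nor netcat found" >&2
```

With that in place, the script's nc could be written as "$NC" instead.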

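#!/bin/bash
# Name the output after today's date (YYMMDD), computed once so both files match.
out=apo$(date '+%y%m%d')
# Send the saved request, split the table into one row per line, drop the header rows, turn the markup into | separators and strip the remaining tags.
nc www.apo.org 80 < nc.apo.in | grep "<table>" | sed -e 's#</tr><tr>#</tr>+<tr>#g' | tr "+" "\n" | grep -v colspan | grep -v "td width=" | sed -e 's/Send Email//g' -e 's#</b><br>#|#g' -e 's#<br>Region:#|Region:#g' -e 's#<i>#|#g' | perl -pe 's/<[^>]*>//g' > "$out"
# Count how many rows share each value of the fourth field.
cut -f 4 -d '|' "$out" | sort | uniq -c > "$out.count"

The pipeline that starts with nc and ends with > "$out" is a single line. Any line breaks inside it are simply from wordwrap and should not be in the program. The same goes for the comment line above it.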
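
To see what the sed/tr steps do without hitting the server, here is a small sketch run on made-up input. The sample HTML is hypothetical (I am only guessing at a row shape that fits the substitutions), the function names strip_rows and count_field4 are mine, and sed stands in for the perl tag-stripping step using the same regular expression.

```shell
#!/bin/bash
# Hypothetical sample: two table rows on a single line, which is what the
# pipeline expects before it splits rows apart.
sample='<table><tr><td><b>Alpha</b><br>Example University<br>Region: I<i>Active</i></td></tr><tr><td><b>Beta</b><br>Sample College<br>Region: II<i>Inactive</i></td></tr></table>'

# Split rows onto separate lines ("+" is a temporary marker that tr turns
# into a newline), turn the field-separating markup into "|", then strip
# every remaining tag.
strip_rows() {
  sed -e 's#</tr><tr>#</tr>+<tr>#g' | tr '+' '\n' |
    sed -e 's#</b><br>#|#g' -e 's#<br>Region:#|Region:#g' -e 's#<i>#|#g' \
        -e 's/<[^>]*>//g'
}

# Count how many rows share each value of the fourth field.
count_field4() {
  cut -f 4 -d '|' | sort | uniq -c
}

rows=$(printf '%s\n' "$sample" | strip_rows)
printf '%s\n' "$rows"
# Alpha|Example University|Region: I|Active
# Beta|Sample College|Region: II|Inactive
printf '%s\n' "$rows" | count_field4
```

One thing this makes visible: because "+" is used as the row marker, a literal "+" anywhere in the page data would also be turned into a line break.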

In addition, the file nc.apo.in needs to exist in the directory you run the script from. It contains:

POST /show/How_to_Start_a_Chapter/Chapter_List HTTP/1.0
Content-Length: 71
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: www.apo.org

bystatus=0&byregion=0&bysection=0&bycity=&bystate=0&bysort=S&submit=Go
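
If you change any of the form values, the Content-Length header has to change with them. A minimal sketch that builds the request and computes the length, assuming the body is one line with no space after bystate= (the forum tends to insert one into long strings) and that the file ends with a single newline. The function name build_request is made up for illustration. With the values above the body is 70 characters, so 70 plus the newline gives the 71 in the header.

```shell
#!/bin/bash
# Form body from the post, written without any space after "bystate=".
body='bystatus=0&byregion=0&bysection=0&bycity=&bystate=0&bysort=S&submit=Go'

# Bytes sent after the blank line: the body plus the newline that ends the file.
length=$(printf '%s\n' "$body" | wc -c | tr -d ' ')

# Print the whole request; the header values are copied from the post.
build_request() {
  printf 'POST /show/How_to_Start_a_Chapter/Chapter_List HTTP/1.0\n'
  printf 'Content-Length: %s\n' "$length"
  printf 'Content-Type: application/x-www-form-urlencoded\n'
  printf 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\n'
  printf 'Host: www.apo.org\n'
  printf '\n'
  printf '%s\n' "$body"
}
```

Running build_request > nc.apo.in then writes the file the script reads.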
__________________
Because "undergrads, please abandon your national policies and make something up" will end well --KnightShadow

Last edited by naraht; 04-28-2008 at 03:20 PM. Reason: added other file.