Automating download of the chapter list page

naraht · #1 04-28-2008, 02:56 PM

I've written a script that downloads the information on the http://www.apo.org/show/How_to_Start...r/Chapter_List page (what you get after you click Go) and boils it down to a vertical-bar-delimited file suitable for importing into spreadsheets and databases. This should work on any Unix/Linux/Mac machine that has netcat installed. The command may be named nc or netcat, depending on the machine.

#!/bin/bash
(nc www.apo.org 80 < nc.apo.in) | grep "<table>"| sed -e 's#</tr><tr>#</tr>+<tr>#g'| tr "+" "\n"| grep -v colspan | grep -v "td width="| sed -e 's/Send Email//g'|sed -e 's#</b><br>#|#g' | sed -e 's#<br>Region:#|Region:#g' |sed -e 's#<i>#|#g'| perl -pe 's/<[^>]*>//g' > apo`date '+%y%m%d'`
cut -f 4 -d \| apo`date '+%y%m%d'`| sort | uniq -c > apo`date '+%y%m%d'`.count

All of the line breaks except the one before the word 'cut' are simply from wordwrap and should not be in the program.

In addition, the file nc.apo.in needs to exist in the same directory, with the following contents:

POST /show/How_to_Start_a_Chapter/Chapter_List HTTP/1.0
Content-Length: 71
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: www.apo.org

bystatus=0&byregion=0&bysection=0&bycity=&bystate=0&bysort=S&submit=Go
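
If the form fields ever need changing, the Content-Length header has to change with them. Here is a minimal sketch that writes nc.apo.in and derives the length from the body rather than hard-coding it. It assumes the body is one line with no spaces in it, followed by a single trailing newline that is counted in the length; under that assumption the result comes out to the 71 given above.

```shell
#!/bin/bash
# Form body as a single line, with no spaces anywhere in it (assumption).
body='bystatus=0&byregion=0&bysection=0&bycity=&bystate=0&bysort=S&submit=Go'

# The file ends with one newline after the body, so count that byte as well.
# ${#body} counts characters, which equals bytes for this ASCII-only body.
len=$(( ${#body} + 1 ))

# Write the request file: headers, a blank line, then the body.
{
  printf 'POST /show/How_to_Start_a_Chapter/Chapter_List HTTP/1.0\n'
  printf 'Content-Length: %d\n' "$len"
  printf 'Content-Type: application/x-www-form-urlencoded\n'
  printf 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\n'
  printf 'Host: www.apo.org\n'
  printf '\n'
  printf '%s\n' "$body"
} > nc.apo.in
```

Like the hand-written file, this uses plain newlines rather than the CRLF line endings that HTTP formally specifies, so it relies on the server tolerating that.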