Thursday 30 May 2013

Get a list of files that exist on a website via curl and strip out HTML code

The following displays a list of the .csv.gz files linked from a web page, with all of the HTML code stripped out:
 curl --silent http://www.theurl.com/thefiles/ \
   | grep -Eo '<a href=[^>]*\.csv\.gz' \
   | sed 's/<a href="\([^"]*\).*/\1/'

The --silent flag in curl suppresses the progress meter and any error messages. Add --show-error (-S) if error messages should still be shown.
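
To try the filter without touching a real server, the grep and sed stages can be wrapped in a small function and fed a sample page. This is only a sketch: the function name list_csv_gz and the sample HTML are made up for illustration, and it assumes each link sits on its own line with the href value in double quotes.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin, print the href of each .csv.gz link.
list_csv_gz() {
  # Keep only the '<a href="...csv.gz' part of each matching line,
  # then strip the leading '<a href="' so just the file name remains.
  grep -Eo '<a href=[^>]*\.csv\.gz' | sed 's/<a href="\([^"]*\).*/\1/'
}

# Made-up directory listing, standing in for the output of curl.
sample='<html><body>
<a href="../">Parent</a>
<a href="data-2013-05-29.csv.gz">data-2013-05-29.csv.gz</a>
<a href="data-2013-05-30.csv.gz">data-2013-05-30.csv.gz</a>
<a href="readme.txt">readme.txt</a>
</body></html>'

printf '%s\n' "$sample" | list_csv_gz
```

Against a live site the same function would sit at the end of the curl pipeline, for example: curl --silent --show-error http://www.theurl.com/thefiles/ | list_csv_gz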