Clojure Web Scraper
Mar 1, 2013 09:03 · 478 words · 3 minute read
Okay so another thing that I did this week was go to a clojure meet up. I really enjoyed it. There I met some really cool people like Anthony from WalmartLabs. In the end we created a hacker news scraper from scratch in a couple of hours. So here is a brief tutorial on how we did it. (Note I’m not that great with Clojure yet thought I do think it is rather awesome. So most of this is because Anthony is a rock star.)
Things you are going to need (or I expect to be installed):
- Java 1.5 or greater
- A text editor
- Lein
So first install lein which means going here and downloading the script.
Then move that script to somewhere in your path.
$ export PATH=$PATH:/path/to/script
Next make a new lein project.
$ lein new hnharvest
Then cd into the project.
$ cd hnharvest
So we can add some text to project.clj.
$ vim project.clj
Then make it look like this. If you don’t know how to use vim use any text editor you like.
(defproject hnharvest "0.1.0-SNAPSHOT"
:description "FIXME: write description"
:url "http://example.com/FIXME"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:dependencies [[org.clojure/clojure "1.4.0"]
[clj-http/clj-http "0.6.4"]
[enlive "1.1.1"]
[org.clojure/data.json "0.2.1"]])
Next we need to add our modify core.clj.
$ vim src/hnharvest/core.clj
Make it look like this.
(ns hnharvest.core
(:require [net.cgrand.enlive-html :as html]
[clojure.string :as s]))
(defn fetch-url []
(html/html-resource
(java.net.URL. "http://news.ycombinator.com")))
(defonce content (fetch-url))
(defn parse-int [s]
(try
(Integer/parseInt s)
(catch NumberFormatException e
0)))
(defn hn-str->number [s]
(-> (s/split s #"\s")
first
parse-int))
(defn get-info
[{[{[points-string] :content}
_
_
_
{[comments-string] :content}] :content}]
(when (and points-string comments-string)
{:points (hn-str->number points-string)
:comments (hn-str->number comments-string)}))
(defn go []
(remove nil? (map get-info (html/select content [:td.subtext]))))
Finally we need to open a repl session.
$ lein repl
Then we need to import our source files and change the name space and start the function.
user=> (use 'hnharvest.core)
user=> (ns hnharvest.core)
hnharvest.core=> (go)
This should spit out something like this:
({:points 283, :comments 105} {:points 99, :comments 29} {:points 345, :comments 123} {:points 51, :comments 23} {:points 398, :comments 252} {:points 142, :comments 97} {:points 219, :comments 63} {:points 18, :comments 4} {:points 228, :comments 78} {:points 264, :comments 153} {:points 125, :comments 19} {:points 67, :comments 37} {:points 202, :comments 31} {:points 34, :comments 19} {:points 43, :comments 13} {:points 36, :comments 6} {:points 136, :comments 44} {:points 87, :comments 17} {:points 44, :comments 3} {:points 236, :comments 100} {:points 35, :comments 21} {:points 26, :comments 4} {:points 68, :comments 17} {:points 41, :comments 8} {:points 193, :comments 69} {:points 67, :comments 20} {:points 55, :comments 16} {:points 608, :comments 181} {:points 63, :comments 16})
This is just a current listing of points and comments from http://news.ycombinator.com/
I thought it was really cool how quickly we got something up and running with Clojure. I hope you did too.