Intro to Machine Learning in Ruby
This blog post was originally published on the Siyelo blog in December 2011.
Machine learning is a branch of Artificial Intelligence (AI) concerned with the design and development of algorithms that allow computers to learn from data. It's a very broad subject, so we will focus on a simple example that uses statistical classification.
Let’s build…
In this tutorial we are going to build a simple news classification application that will parse and classify RSS/HTML articles from the Times Live newspaper.
For the job, we will use the nokogiri gem and two Ruby standard libraries: open-uri and rss/2.0.
RSS Parser
To find sources of articles for processing, we could build a complex search engine, or we can simply use the RSS feeds the newspaper already provides to discover links. The RssParser class does exactly that: you initialize it with a feed URL and it gives you back the links to all the articles discovered in that feed.
class RssParser
  attr_accessor :url

  def initialize(url)
    @url = url
  end

  # Parses the feed and returns the link of every item in it.
  # Kernel#open is provided by open-uri here (on Ruby 3+ use URI.open).
  def article_urls
    RSS::Parser.parse(open(url), false).items.map { |item| item.link }
  end
end
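As a quick sanity check, the parser can be used on its own (the feed URL here is hypothetical):

# Print every article link found in a feed.
parser = RssParser.new('http://example.com/news/index.rss')
parser.article_urls.each { |link| puts link }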
HTML Parser
With the article links in hand, we need to parse the page content and extract the meaningful parts. The HtmlParser class is initialized with a page URL and a DOM selector. In this example we use a CSS selector to extract the content from articles; Firebug and jQuery were used to find the selector for the text we are extracting. You will also notice the clean_whitespace method, which normalizes the whitespace in the extracted text.
class HtmlParser
  attr_accessor :url, :selector

  def initialize(url, selector)
    @url = url
    @selector = selector
  end

  # Fetches the page and joins the text of every element matching the selector.
  def content
    doc = Nokogiri::HTML(open(url))
    html_elements = doc.search(selector)
    html_elements.map { |element| clean_whitespace(element.text) }.join(' ')
  end

  private

  # Collapses runs of whitespace, tabs and newlines into single spaces.
  def clean_whitespace(text)
    text.gsub(/\s{2,}|\t|\n/, ' ').strip
  end
end
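A quick usage sketch, reusing one of the Wikipedia pages we train on below:

# Extract the body text of a Wikipedia article and peek at it.
wiki = HtmlParser.new('http://en.wikipedia.org/wiki/Economy', '.mw-content-ltr')
puts wiki.content[0, 200]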
Statistical Classifier
We introduce a class that is responsible for the classification of articles. It is initialized with a hash of categories (keys) and training texts (values).
Training data is used to discover potential relationships between articles and categories, so it should be carefully selected to give good classification results. During training, each word is weighted by its frequency relative to the total number of words in that category's text (see the train_data method).
In this example we use the content of the Wikipedia articles for economy, sport and health as training data for our categories.
When classifying articles we want to compare only meaningful words and ignore words that add no value to any category. We (partially) solve this problem with a stop word list.
Finally, the scores method computes a score for each category for the text we are testing: the average trained weight of the text's words, scaled by 1000 for readability.
class Classifier
  attr_accessor :training_sets, :noise_words

  def initialize(data)
    @training_sets = {}
    # Stop words are read from stop_words.txt next to this file, one word per line.
    filename = File.join(File.dirname(__FILE__), 'stop_words.txt')
    @noise_words = File.readlines(filename).map(&:chomp)
    train_data(data)
  end

  # Returns a hash of category => score for the given text.
  def scores(text)
    words = text.downcase.scan(/[a-z]+/)
    scores = {}
    training_sets.each_pair do |category, word_weights|
      scores[category] = score(word_weights, words)
    end
    scores
  end

  # Builds a word => weight hash per category: each weight is the word's
  # frequency relative to the total number of words in the training text.
  def train_data(data)
    data.each_pair do |category, text|
      words = text.downcase.scan(/[a-z]+/)
      word_weights = Hash.new(0)
      words.each { |word| word_weights[word] += 1 unless noise_words.include?(word) }
      ratio = 1.0 / words.length
      word_weights.keys.each { |key| word_weights[key] *= ratio }
      training_sets[category] = word_weights
    end
  end

  private

  # Average trained weight of the text's words, scaled by 1000 for readability.
  def score(word_weights, words)
    score = words.inject(0) { |acc, word| acc + word_weights[word] }
    1000.0 * score / words.size
  end
end
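To see the arithmetic at a small scale, here is a minimal sketch with made-up training data. It assumes the classes above live in the same file, and writes a throwaway stop_words.txt so Classifier can load it:

# Hypothetical mini example; the training text and stop words are invented.
File.write(File.join(File.dirname(__FILE__), 'stop_words.txt'), "is\nthe\n")

classifier = Classifier.new(:sport => 'sport is fun fun')
# Training: 4 words in total, 'is' is a stop word, so the weights are
# 'sport' => 1/4 = 0.25 and 'fun' => 2/4 = 0.5.
p classifier.scores('fun sport day')
# => {:sport=>250.0}   i.e. (0.5 + 0.25 + 0) / 3 * 1000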
Let's have a go
Here is the script that runs the program:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rss/2.0'
# RssParser, HtmlParser and Classifier (defined above) must also be loaded here.

# Training data samples: Wikipedia articles for each category.
economy = HtmlParser.new('http://en.wikipedia.org/wiki/Economy', '.mw-content-ltr')
sport   = HtmlParser.new('http://en.wikipedia.org/wiki/Sport', '.mw-content-ltr')
health  = HtmlParser.new('http://en.wikipedia.org/wiki/Health', '.mw-content-ltr')

training_data = {
  :economy => economy.content,
  :sport   => sport.content,
  :health  => health.content
}

classifier = Classifier.new(training_data)

results = {
  :economy => [],
  :sport   => [],
  :health  => []
}

# Classify every article in the feed into its highest-scoring category.
rss_parser = RssParser.new('http://avusa.feedsportal.com/c/33051/f/534658/index.rss')
rss_parser.article_urls.each do |article_url|
  article = HtmlParser.new(article_url, '#article .area > h3, #article .area > p, #article > h3')
  scores = classifier.scores(article.content)
  category_name, score = scores.max_by { |k, v| v }
  results[category_name] << article_url
end

p results
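The output is a hash mapping each category to the URLs of the articles it won. The shape looks something like this (URLs invented for illustration):

# Hypothetical output:
{:economy => ["http://www.timeslive.co.za/example-economy-article"],
 :sport   => ["http://www.timeslive.co.za/example-sport-article"],
 :health  => []}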
Although our statistical classification algorithm is very simple, it can give remarkably good results provided the training data is good. For even better results you can try other classification techniques such as naive Bayes or Latent Semantic Analysis.
If you are interested in a more in-depth example of a news aggregator application, you can check out newsagg on GitHub. It's a simple Sinatra application with a Redis datastore that I put together. It crawls, classifies and creates 'clusters' of articles using statistical algorithms.
If you want to learn more about machine learning, check out the books Programming Collective Intelligence (code examples in Python) and Scripting Intelligence (code examples in Ruby).