Wednesday, November 20, 2013

Web Scraping in Ruby with Mechanize and Nokogiri

I recently received a code challenge during an interview process. Though I didn't get the job, I realized that while I've managed to build a Rails application, I haven't really solved many problems with Ruby itself. This sent me searching for other problems, and while I found some nifty code challenges and general resources, the challenges weren't being maintained and the resources all seemed bland. It was at this point I realized that my friends still back in college are immersed in a world full of quality challenges.

I asked one of my friends, a young man in Louisiana, if he would be willing to share the prompt for his semester final project with me. While it wasn't a Ruby or Rails class, it was a database class couched thoroughly in Java, JDBC and MySQL, which made it rather trivial to reimagine the project as a Rails application and get to work. The prompt was essentially to build a travel agency flight booking tool. To meet the prompt's requirements at the scale I felt fitting, I found myself needing a list of airlines, their destination airports, and a list of airports.

Sure, this data could be stubbed out, but the whole goal of this adventure was to solve problems in Ruby, so why not learn how to scrape web pages?

The Tools
 - Mechanize https://github.com/sparklemotion/mechanize
 - Nokogiri (Mechanize prerequisite) http://nokogiri.org/

Desired Data
  - Each airport's name, city, country and IATA code.
  - Each airline's name and its destination airports' IATA codes.

The Targets
 - Airports: http://www.flightradar24.com/data/airports/
 - Airlines: http://en.wikipedia.org/wiki/Airline_codes-All

Wikipedia of course contains a more comprehensive list of airports than flightradar24 does. However, as we'll see in this post, the highly structured markup of flightradar24 makes it MUCH easier to scrape than Wikipedia.

Part One - Parsing FlightRadar24
This being the first time I've ever scraped a web page, I found myself looking for data formats I could easily understand. A quick look at the source of http://www.flightradar24.com/data/airports/ reveals a very well-defined unordered list being styled into the pretty web page.
<ul id="countriesList">
     <li>
          <a title="Airports in Albania" href="/data/airports/Albania">
               <div class="left"><img class="lazy" src="http://www.flightradar24.com/data/images/destination/flags-big/albania.gif" alt="Albania flag" /></div>
               <div class="right">Albania airports</div>
          </a>
     </li>
     <li>
          <a title="Airports in Algeria" href="/data/airports/Algeria">
               <div class="left"><img class="lazy" src="http://www.flightradar24.com/data/images/destination/flags-big/algeria.gif" alt="Algeria flag" /></div>
               <div class="right">Algeria airports</div>
          </a>
     </li>

Within the unordered list are individual list items containing the link, along with some image and formatting information. Mechanize is not a full-featured HTML parser like Nokogiri, but it does have built-in functionality to find links, click links, and fill out forms. These helpers use Nokogiri under the hood to locate elements within the DOM and interact with them. Unfortunately, a lot of the Nokogiri interaction is abstracted away from the user in these methods, and there is some loss of functionality as a result, though I am not skilled enough to say how much.

In order to parse this page, I relied almost exclusively on Mechanize's links.find_all method. Unfortunately, the source code and documentation I've been able to find make absolutely no mention of this method.

agent = Mechanize.new
page = agent.get('http://www.flightradar24.com/data/airports/')
country_links = page.links.find_all { |l| l.attributes.parent.name == 'li' }
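As best I can tell, the reason for the missing documentation is that page.links returns a plain Ruby Array of Mechanize::Page::Link objects, so find_all is simply Enumerable#find_all rather than a Mechanize method. The same line can be written with select, find_all's synonym — a minimal sketch:

#page.links is a plain Array, so any Enumerable method works on it
country_links = page.links.select { |l| l.attributes.parent.name == 'li' }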

For each link found on the page, the name of the link's parent element is compared against the string 'li'; if the parent is a list item, the link is kept. We know from above that each link we want is wrapped in a list item, but we haven't proven that there are no other links on the page matching this pattern. Instead, the assumption is made that this pattern is accurate enough, and we proceed to the next step: clicking the links contained in country_links.

1:  country_links.each do |c|  
2:       if country_links.index(c) > 19  
3:            country = c.click  
4:            airport_links = country.links.find_all { |d| d.attributes.parent.name == 'div'}  

In the above snippet at line 2 you can see the result of the "accurate enough" assumption made earlier. I encountered an error where I was not being routed to the country pages as expected. By printing each link along with its index in the array, I was able to simply add a branch that ignores the first 20 indexes. While adding an if branch is not terribly expensive, ascertaining what condition to set is not always easy.
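The troubleshooting itself was nothing fancy; a throwaway loop along these lines (a sketch, not my exact code) is enough to make the junk indexes stand out:

#print each link's index, text and href to spot where the
#navigation links end and the country links begin
country_links.each_with_index do |link, i|
     puts "#{i}: #{link.text} -> #{link.href}"
end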

The link is then clicked and the resulting page is assigned to the variable country. Notice that Mechanize is smart enough to correctly resolve the relative href for us. The country page is then searched for links and the same pattern is applied, except this time looking for links whose immediate parent is a div element. The resulting links are stored in the airport_links array and we move on to the next step.

1:  airport_links.each do |a|  
2:                 if airport_links.index(a).even?  
3:                      airport = a.click  
4:                      doc = airport.parser  

At line two we once again see the consequences of this method of data acquisition. I repeated the same troubleshooting approach and realized that every even index held valid data; as a result we have another almost-random if branch. With valid data in hand, the link is clicked, but this time we aren't looking for links. Instead the page is assigned to the doc variable after being run through Mechanize's .parser method. The doc variable may now be treated as a Nokogiri object, and the full spectrum of Nokogiri methods, from .css to .xpath, may be run against it.
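As far as I can tell, .parser simply hands back the Nokogiri document Mechanize already built when it loaded the page, making the two objects below equivalent (a sketch, assuming the page fetched cleanly):

doc = airport.parser                        #the Nokogiri::HTML::Document Mechanize built
doc_again = Nokogiri::HTML(airport.body)    #parsing the raw response body yourself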

Airport Source
1:  <div class="lightNoise" style="padding: 6px 12px 12px; width: 225px;">  
2:                 <h3 class="noborder">Information</h3>  
3:                 <ul>  
4:                      <li><strong>Airport name: </strong>Albuquerque International Sunport</li>  
5:                      <li><strong>IATA code: </strong>ABQ</li>  
6:                      <li><strong>ICAO code: </strong>KABQ</li>  
7:                      <li><strong>Country: </strong>United States</li>  
8:                      <li><strong>State: </strong>NM</li>                    <li><strong>City: </strong>Albuquerque</li>  
9:                      <li><strong>Time: </strong>17:44 (UTC: 00:44)</li>  
10:                      <li><strong>Latitude: </strong>35.040218</li>  
11:                      <li><strong>Longitude: </strong>-106.609001</li>  
12:                      <li><strong>Altitude: </strong>5355 feet</li>                    <li><strong>Homepage: </strong><a target="_blank" title="Visit homepage" href="">Visit homepage</a></li>  
13:                      <li><strong>Airport info: </strong><a target="_blank" title="Visit Flightstats" href="http://www.flightstats.com/go/Airport/airportDetails.do?airportCode=ABQ">Visit Flightstats</a></li>  
14:                      <li><strong>More info: </strong><a target="_blank" title="Visit Great Circle Mapper" href="http://gc.kls2.com/airport/ABQ">Visit Great Circle Mapper</a></li>  
15:                 </ul>  
16:            </div>  

Parser
1: doc.css('div.lightNoise li').each do |l|  
2:                           if doc.css('div.lightNoise li').index(l) == 0  
3:                                temp_storage[:name] = l.text.slice(14, l.text.length)  
4:                                #puts temp_storage[:name]  
5:                           end  
6:                           if doc.css('div.lightNoise li').index(l) == 1  
7:                                temp_storage[:i_code] = l.text.slice(11, l.text.length)  
8:                                #puts temp_storage[:i_code]  
9:                           end  
10:                           if doc.css('div.lightNoise li').index(l) == 3  
11:                                temp_storage[:country] = l.text.slice(9, l.text.length)  
12:                                #puts temp_storage[:country]  
13:                           end  
14:                           if doc.css('div.lightNoise li').index(l) == 4  
15:                                temp_storage[:city] = l.text.slice(6, l.text.length)  
16:                                #puts temp_storage[:city]  
17:                           end  
18:                      end  
19:                      puts temp_storage[:name]  
20:                      storage.push(temp_storage.clone)  

Using Nokogiri's .css selectors, parsing an airport page is relatively trivial. In line one of Airport Source you can see there is a div with the class 'lightNoise'. Contained within this div at lines 4, 5, 7 and 8 are the airport name, IATA code, country and city.

In line one of the parser you can see that the selector follows the standard CSS convention of delineating classes and IDs: a period for classes, a hash mark for IDs. The selector also accepts descendant elements delimited by a space; you can see this at work in line one as well, where I built an array of all list elements within div.lightNoise. This array was then iterated through, and since the list elements are ordered identically from page to page, it was possible to extract data based on index position. The first part of the returned text is always the same as well, so the static prefix was removed using Ruby's String#slice method. This left only the dynamic data; in this case the name "Albuquerque International Sunport", IATA code "ABQ", city "Albuquerque", and country "United States".
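To make the slice arithmetic concrete: the magic numbers 14, 11, 9 and 6 are just the lengths of the static prefixes being discarded. A sketch using the name field:

text = "Airport name: Albuquerque International Sunport"
prefix = "Airport name: "                 #14 characters
text.slice(prefix.length, text.length)    #=> "Albuquerque International Sunport"

Passing text.length as the second argument is safe because slice treats it as a maximum length, not an exact count.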

While the target data was successfully retrieved, the methods used to find the target links in this section are undeniably inefficient and error-prone. Requiring two different magic-number branches to get the right interactions isn't necessarily terrible, but on a larger project this would have become much more unwieldy.

Source
Scraper: https://github.com/islador/mechanize_scrapers/blob/master/v2.rb
Output: https://github.com/islador/mechanize_scrapers/blob/master/seed/airports.yml


Part Two - Parsing Wikipedia

Taking the experience I gained parsing FlightRadar24, I began work on gathering the airline data I would need. The best source I found was Wikipedia, specifically http://en.wikipedia.org/wiki/Airline_codes-All. Fortunately the starting point for this scrape is a highly regimented table.

As before, we start by instantiating a Mechanize agent and visiting a web page.

agent = Mechanize.new
begin
     page = agent.get('http://en.wikipedia.org/wiki/Airline_codes-All')
rescue Mechanize::ResponseCodeError => e
     puts e.to_s
end

You'll notice that this time the process is wrapped in some basic error handling. Mechanize has a few error types; you can find more detail on them by checking the source. This specific rescue block catches 404s and other non-success HTTP response codes.
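If you need to react differently to different statuses, the exception carries the code itself — a sketch, assuming Mechanize's default behavior of raising on non-success responses:

begin
     page = agent.get('http://en.wikipedia.org/wiki/Airline_codes-All')
rescue Mechanize::ResponseCodeError => e
     #response_code is a string such as "404" or "503"
     if e.response_code == '404'
          puts "Page not found, moving on."
     else
          raise
     end
end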

From there, the page is parsed and sent to a helper function.

airline_code_parser = page.parser
storage = []
destination_airports = []
#find all non-red airline links in the table
airlines = extract_column_airlines(airline_code_parser, "Airline")

def extract_column_airlines(page, column_name)
     airport_links = []
     table_width = 0
     airport_index = 0
     #parse the table head and record the index of the named column as well as the total column count
     page.css('table.toccolours.sortable th').each do |c|
          if c.text.eql?(column_name)
               airport_index = page.css('table.toccolours.sortable th').index(c)
               table_width = page.css('table.toccolours.sortable th').length
               break
          end
     end
     #parse the table rows
     page.css('table.toccolours.sortable tr').each do |tr|
          #for each row, parse the columns
          tr.css('td').each do |td|
               #if the column's index modulo the table width equals the desired index
               if tr.css('td').index(td) % table_width == airport_index
                    #retrieve the links within that column
                    td.css('a').each do |a|
                         if a['href'].to_s.include?("redlink=1") == false
                              airport_links.push(a['href'].clone)
                         end
                    end
               end
          end
     end
     return airport_links
end

The helper function takes two arguments, a Nokogiri object and a string. The method locates the table with class="toccolours sortable" and iterates through its header elements. When it finds a header matching the supplied string, it saves the index of that header as well as the length of the header array being iterated through.

The method then iterates through each row in the table, extracting the links contained in the column matching the index retrieved earlier. You'll notice that "redlink=1" results in a link being discarded. Hrefs containing "redlink=1" (e.g. /w/index.php?title=Some_Airline&action=edit&redlink=1) are Wikipedia's markers for articles that don't exist yet, so a large portion of the links in the table are automatically discarded, as they're known to lead nowhere. Links that survive are pushed onto the airport_links array, which is returned at the end of the method.

1:  airlines = extract_column_airlines(airline_code_parser, "Airline")  
2:  #iterate through those links  
3:  airlines.each do |d|  
4:       puts d  
5:       #Visit each airline page  
6:       begin  
7:            airline = agent.get('http://en.wikipedia.org' + d)  
8:       rescue Mechanize::ResponseCodeError, StandardError => e  
9:            puts "Error fetching airline pages: " + e.to_s  
10:           next #skip this airline, otherwise airline would be nil below  
11:      end  
12:      #Build the nokogiri object  
13:      airline_parser_object = airline.parser  
14:      #Extract the airline name from the page title by trimming off the Wikipedia suffix  
15:      airline_name = airline_parser_object.title.slice(0, airline_parser_object.title.length-35)  
16:      #Search for a main destinations article  
17:      main_dest_page = find_main_destinations(airline_parser_object)  

The returned array is then iterated through. Notice that in this situation we never have to visit the source page again, as we've already retrieved everything useful from it. Each link is appended to the domain and visited; in line 8 there is also the addition of StandardError to the rescue block. This is done because Net::HTTP will throw a getaddrinfo error if d already begins with 'http://', since the concatenation then produces a malformed URL; rather than sanitize the data, I elected to catch the error and skip the entry. A Nokogiri object is then made, the airline's name is extracted from the page title, and the Nokogiri object is passed to the helper method below.
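For comparison, sanitizing instead of rescuing would look something like this sketch; either approach works, I just found the rescue simpler:

#only prepend the domain when the href is actually relative
url = d.start_with?('http') ? d : 'http://en.wikipedia.org' + d
airline = agent.get(url)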

def find_main_destinations(page)
     storage = ""
     page.css('div.rellink a').each do |ad|
          if ad['href'].slice(ad['href'].length-13, ad['href'].length).eql?("_destinations")
               #returns a string
               storage = storage + ad['href']
               break
          end
     end
     return storage
end

The find_main_destinations method was built in response to the fact that not all airline pages on Wikipedia list their destination airports inline. Some have a separate page, linked from the destinations section, that contains a list or table of airports. Fortunately all of these relevant links are wrapped in divs with a class of 'rellink', so isolating them is fairly straightforward. I then compare the last thirteen characters of each href; if they equal "_destinations", the href is appended to storage and the loop breaks. This way, if a proper rellink is not found, the method returns an empty string.
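In hindsight, Ruby's String#end_with? expresses the same comparison more directly and sidesteps the slice arithmetic entirely — a sketch of the condition:

if ad['href'].to_s.end_with?("_destinations")
     storage = storage + ad['href']
     break
end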

1:  if main_dest_page.empty?  
2:            puts "Main dest not found."  
3:            h2_index = 0  
4:            #Search for the destinations section.  
5:            airline_parser_object.css('div#mw-content-text h2').each do |c|  
6:                 #puts "Parsing links"  
7:                 if c.css("span.mw-headline").text.eql?"Destinations"  
8:                      h2_index = airline_parser_object.css('div#mw-content-text h2').index(c)  
9:                 end  
10:            end  

There are two common situations: a separate page contains the destinations, or the same page does. In the event that main_dest_page is empty, meaning there is no separate page, it is necessary to check for a Destinations section and, if one is found, to check it for destination airports. Line 5 searches for every h2 within div#mw-content-text, the primary content div used by Wikipedia. Each found h2 is then examined by checking the text of its child span.mw-headline element; if it equals "Destinations", the h2's index is saved as h2_index.

if h2_index != 0
     destinations = extract_links(["h2"], h2_index, airline_parser_object)

Assuming h2_index isn't still zero, it is passed to the extract_links method along with an array of delimiter tags and the current Nokogiri object. Be aware that in the unlikely event that 0 is actually the real index of the Destinations h2, the section will not be processed.
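One way to close that hole would be to initialize h2_index to nil rather than 0, so "not found" and "found at index 0" stop being the same value — a sketch:

h2_index = nil
airline_parser_object.css('div#mw-content-text h2').each_with_index do |c, i|
     h2_index = i if c.css("span.mw-headline").text.eql?("Destinations")
end
#nil now unambiguously means the section wasn't found
destinations = extract_links(["h2"], h2_index, airline_parser_object) unless h2_index.nil?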

def extract_links(delim_tags, index, page)
     extract_ranges = [index...index+1]
     doc = page
     extracted_links = []
     i = 0
     # Change /"html"/"body" to the correct path of the tag which contains this list
     (doc/"html"/"body"/"div").children.each do |el|
          if (delim_tags.include? el.name)
               i += 1
          else
               extract = false
               extract_ranges.each do |cur_range|
                    if (cur_range.include? i)
                         extract = true
                         break
                    end
               end
               if extract
                    el.children.each do |d|
                         d.css("a").each do |k|
                              #destination = agent.get('en.wikipedia.org' + k['href'])
                              extracted_links.push(k['href'].clone)
                         end
                    end
               end
          end
     end
     return extracted_links
end

The extract_links method is a modified version of Dan Healy's code from Stack Overflow. It pulls every link contained between the passed-in index and the next occurrence of a delimiter tag. This is accomplished by navigating to the container that holds the h2s and iterating through its child elements. Each child is checked against the delimiter list; if it matches, i is incremented. If it does not, the extract_ranges are iterated through, and should one of them include i, extract is set to true and the extract_ranges loop breaks. If extract is true, each of the child's own children is searched for links, and every link found is pushed onto the extracted_links array. This repeats until i passes beyond the extract range, and the method then returns the extracted_links array.

This means that all links between two h2s are returned. The methodology is not very precise, but it ensures that every link within the Destinations section of a page will be checked.
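To see the delimiter counting in action, here is a self-contained toy run against hypothetical markup, assuming the extract_links method above is already loaded. Note that i counts the h2s already passed, so the list after the second h2 lives at i == 2:

require 'nokogiri'

html = <<-HTML
<html><body><div>
  <h2>History</h2>
  <ul><li><a href="/wiki/History_link">history</a></li></ul>
  <h2>Destinations</h2>
  <ul>
    <li><a href="/wiki/Airport_A">Airport A</a></li>
    <li><a href="/wiki/Airport_B">Airport B</a></li>
  </ul>
  <h2>Fleet</h2>
  <ul><li><a href="/wiki/Fleet_link">fleet</a></li></ul>
</div></body></html>
HTML

doc = Nokogiri::HTML(html)
puts extract_links(["h2"], 2, doc).inspect
#=> ["/wiki/Airport_A", "/wiki/Airport_B"]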

destinations.each do |d|
     begin
          #puts "Querying page: " + d
          page = agent.get('http://en.wikipedia.org' + d)
          nokogiri_page = page.parser
          #store that IATA code
          destination_airports.push(extract_iata_code(nokogiri_page))
     rescue Mechanize::ResponseCodeError, StandardError => e
          puts "Error fetching airline destination IATA codes: " + e.to_s
     end
end
end  #closes if h2_index != 0

The returned links are then iterated through. Each link is visited and, if the page doesn't throw an error, it is parsed into a Nokogiri object and passed to the extract_iata_code method.

def extract_iata_code(page)
     page.css('th a').each do |d|
          if d['href'].eql?("/wiki/International_Air_Transport_Association_airport_code")
               return d.next_element.text
          end
     end
     nil #no IATA link found on this page
end

The extract_iata_code method searches the page for links inside table headers. For each one found, it checks whether the href matches the Wikipedia article on IATA airport codes. If that link is found, the text of the next sibling element is returned. The table being parsed here is the infobox that sits on the right-hand side of every airport's Wikipedia article. Each returned value is pushed onto the destination_airports array and this branch ends.
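The markup shape the method depends on looks roughly like the hypothetical fragment below (I haven't reproduced Wikipedia's exact infobox HTML): an anchor to the IATA article inside a th, with the code itself in the next sibling element. Assuming extract_iata_code above is loaded:

require 'nokogiri'

html = <<-HTML
<table class="infobox">
     <tr><th>
          <a href="/wiki/International_Air_Transport_Association_airport_code">IATA</a>
          <span>ABQ</span>
     </th></tr>
</table>
HTML

puts extract_iata_code(Nokogiri::HTML(html))    #=> ABQ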

In the next branch, airline pages with separate main destination articles are handled.

else
     puts "Main dest found."
     begin
          airline_destination = agent.get('http://en.wikipedia.org' + main_dest_page)
          airline_destination_parser = airline_destination.parser
          airport_links = extract_column_airports(airline_destination_parser, "Airport")

This branch runs when the main_dest_page variable is not empty. The page it names is visited and a Nokogiri object is made from it. That object is then passed to the extract_column_airports method along with the name of the column to be parsed for links, in this case "Airport".

def extract_column_airports(page, column_name)
     airport_links = []
     table_width = 0
     airport_index = 0
     #parse the table head and record the index of the named column as well as the total column count
     page.css('table.wikitable.sortable th').each do |c|
          if c.text.eql?(column_name)
               airport_index = page.css('table.wikitable.sortable th').index(c)
               table_width = page.css('table.wikitable.sortable th').length
               break
          end
     end
     #parse the table rows
     page.css('table.wikitable.sortable tr').each do |tr|
          #for each row, parse the columns
          tr.css('td').each do |td|
               #if the column's index modulo the table width equals the desired index
               if tr.css('td').index(td) % table_width == airport_index
                    #retrieve the links within that column
                    td.css('a').each do |a|
                         airport_links.push(a['href'].clone)
                    end
               end
          end
     end
     return airport_links
end

The extract_column_airports method is functionally identical to the extract_column_airlines method detailed above. The one major difference is that it targets a different class of table, "wikitable" instead of "toccolours".
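Since the selector is the only difference, the two helpers could collapse into a single method that takes it as a parameter — a sketch of how I might refactor this, with the redlink filter made optional:

#one helper for both tables; table_selector would be
#'table.toccolours.sortable' or 'table.wikitable.sortable'
def extract_column_links(page, table_selector, column_name, skip_redlinks = false)
     links = []
     headers = page.css("#{table_selector} th")
     column_index = headers.find_index { |th| th.text.eql?(column_name) }
     return links if column_index.nil?
     table_width = headers.length
     page.css("#{table_selector} tr").each do |tr|
          tr.css('td').each_with_index do |td, i|
               next unless i % table_width == column_index
               td.css('a').each do |a|
                    next if skip_redlinks && a['href'].to_s.include?('redlink=1')
                    links.push(a['href'].clone)
               end
          end
     end
     links
end

extract_column_links(airline_code_parser, 'table.toccolours.sortable', 'Airline', true) would then replace extract_column_airlines, and extract_column_links(airline_destination_parser, 'table.wikitable.sortable', 'Airport') would replace extract_column_airports.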

airport_links.each do |d|
     begin
          page = agent.get('http://en.wikipedia.org' + d)
          nokogiri_page = page.parser
          #store that IATA code
          destination_airports.push(extract_iata_code(nokogiri_page))
     rescue Mechanize::ResponseCodeError, StandardError => e
          puts "Error fetching airport IATA codes from main destination article: " + e.to_s
     end
end
rescue Mechanize::ResponseCodeError, StandardError => e
     puts "Error fetching main airline destination article: " + e.to_s
end  #closes the begin at the start of this branch
end  #closes the if main_dest_page.empty? / else branch

Each returned airport_link is then iterated through and the IATA code is extracted in the same manner as above. IATA codes are then pushed onto the destination_airports array and the destination handling if branch ends.

storage.push({name: airline_name.clone, destinations: destination_airports.clone})
#then reset the destination airports array to empty
destination_airports = []
if airlines.index(d) % 50 == 0
     puts "Writing airlines#{airlines.index(d)}"
     FileUtils.mkdir_p "./seed/"
     File.open("./seed/airlines#{airlines.index(d)}.yml", 'w') do |out|
          YAML.dump(storage, out)
     end
end
end  #closes the airlines loop
puts "Converting final array to yaml."
FileUtils.mkdir_p "./seed/"
File.open("./seed/airlines.yml", 'w') do |out|
     YAML.dump(storage, out)
end

Within the airlines loop, after the destination_airports array has been built for the current airline, a hash is built consisting of the airline's name and the destinations it flies to. This hash is pushed onto the storage array, so storage ends up as an array of hashes, each holding a string and an array of strings. The destination_airports array is then reset to empty, which prevents the destinations of every airline parsed so far from bleeding into each subsequent airline's entry.
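Concretely, each stored hash looks like {:name=>"Example Air", :destinations=>["ABQ", "ATL"]} (values made up for illustration), which YAML.dump renders roughly as:

- :name: Example Air
  :destinations:
  - ABQ
  - ATL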

Every fifty airlines scraped, the storage array is written out to a YAML file. Each checkpoint file gets a fresh name, so there is heavy data duplication. However, this allows the process to error out or otherwise fail, and the user to restart it from a recent point in the airlines array by adding a simple if block. Considering that this scraper has to visit in excess of 15,000 web pages, I consider that a worthwhile trade.
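The restart hook I have in mind is nothing fancier than skipping ahead — roughly this sketch, where resume_index is a hypothetical marker for wherever the last checkpoint file left off:

resume_index = 350    #hypothetical: last index safely written to disk
airlines.each do |d|
     next if airlines.index(d) <= resume_index
     #...scrape as before...
end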

Lastly, once the scraper finishes and the airlines loop exits, a master airlines.yml is saved to the drive.

Sources
Scraper: https://github.com/islador/mechanize_scrapers/blob/master/airline_scraper_v2.rb
Output: https://github.com/islador/mechanize_scrapers/blob/master/seed/airlines.yml
