Crawl a web page and extract URLs from its HTML using a naive loop implementation. Its running time is comparable to, or less than, using URI.extract().
Ali Hamdi Ali Fadel · @AliOsm · over 3 years
require 'uri'
require 'net/http'

def crawl_and_extract_urls(page_url)
  uri = URI.parse(page_url)
  response = Net::HTTP.get_response(uri)
  body = response.body.force_encoding('UTF-8')

  urls = []
  # Find the first occurrence of a URL-like prefix in the HTML.
  str_index = body.index(/http|www/)
  until str_index.nil?
    # A URL inside an HTML attribute ends at its closing quote,
    # which may be double or single.
    end_index_1 = body.index('"', str_index + 1)
    end_index_2 = body.index("'", str_index + 1)
    # Either quote may be missing near the end of the document,
    # so drop nils before taking the nearer of the two.
    end_index = [end_index_1, end_index_2].compact.min
    break if end_index.nil?
    urls << body[str_index, end_index - str_index]
    # Continue scanning from just past the current match.
    str_index = body.index(/http|www/, str_index + 1)
  end
  urls
end
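For comparison, here is a minimal sketch of the URI.extract approach mentioned above, run on a hypothetical HTML fragment (the `html` string is an assumption for illustration). Note that URI.extract has been marked obsolete in the standard library since Ruby 2.2, though it still works with a deprecation warning:

```ruby
require 'uri'

# Hypothetical HTML fragment for illustration.
html = '<a href="https://example.com/page">link</a>'

# URI.extract scans the string for well-formed URIs of the given
# schemes; it stops at characters that are invalid in a URI, such
# as the closing quote of the href attribute.
urls = URI.extract(html, %w[http https])
# → ["https://example.com/page"]
```

Unlike the naive loop, URI.extract only matches syntactically valid absolute URIs, so scheme-less links such as `www.example.com` are not captured.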