crawl_and_extract_urls

Ali Hamdi Ali Fadel · @AliOsm · about 4 years ago

Crawl a web page and extract the URLs from its HTML with a naive scanning loop. Its running time is comparable to, or lower than, that of Ruby's built-in URI.extract.

require 'uri'
require 'net/http'

# Fetch a web page and extract URLs from its HTML by scanning for
# "http"/"www" prefixes and cutting each match at the next quote.
def crawl_and_extract_urls(page_url)
  uri = URI.parse(page_url)
  response = Net::HTTP.get_response(uri)
  body = response.body.force_encoding('UTF-8')

  urls = []
  str_index = body.index(/http|www/)
  until str_index.nil?
    # A URL embedded in an HTML attribute ends at the next single or
    # double quote; either delimiter may be missing, so drop nils
    # before taking the minimum (the original raised on nil here).
    end_index = [body.index('"', str_index + 1),
                 body.index("'", str_index + 1)].compact.min
    break if end_index.nil?

    urls << body[str_index...end_index]
    str_index = body.index(/http|www/, str_index + 1)
  end
  urls
end
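
A minimal sketch of how you might try it out and sanity-check the result against URI.extract, the stdlib extractor the description compares against (deprecated since Ruby 2.6 but still available). The page URL below is only a placeholder, and the two counts will usually differ, since the naive loop also catches bare "www" links that URI.extract skips when restricted to http(s) schemes; a fair timing comparison would also need to exclude the network fetch:

require 'uri'
require 'net/http'

# Placeholder URL; substitute any page you want to crawl.
page = 'https://www.ruby-lang.org/en/'

# Naive loop extractor defined above.
urls = crawl_and_extract_urls(page)
puts "naive loop:  #{urls.length} URLs"

# Stdlib extractor on the same body, restricted to http(s) schemes.
body = Net::HTTP.get(URI.parse(page)).force_encoding('UTF-8')
puts "URI.extract: #{URI.extract(body, %w[http https]).length} URLs"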