crawl_and_extract_urls

Ali Hamdi Ali Fadel · @AliOsm · about 4 years ago

Crawl a web page and extract the URLs from its HTML with a naive scanning loop. Its running time is comparable to, or lower than, that of Ruby's built-in URI.extract.

require 'uri'
require 'net/http'

# Fetch a web page and extract URLs from its HTML by scanning for
# "http"/"www" prefixes and cutting each match at the next quote.
def crawl_and_extract_urls(page_url)
  uri = URI.parse(page_url)
  response = Net::HTTP.get_response(uri)
  body = response.body.force_encoding('UTF-8')

  urls = []
  str_index = body.index(/http|www/)
  until str_index.nil?
    # A URL embedded in an HTML attribute ends at the next single or
    # double quote; either delimiter may be missing, so drop nils
    # before taking the minimum (the original raised on nil here).
    end_index = [body.index('"', str_index + 1),
                 body.index("'", str_index + 1)].compact.min
    break if end_index.nil?

    urls << body[str_index...end_index]
    str_index = body.index(/http|www/, str_index + 1)
  end
  urls
end
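
A minimal sketch of how you might try it out and sanity-check the result against URI.extract, the stdlib extractor the description compares against (deprecated since Ruby 2.6 but still available). The page URL below is only a placeholder, and the two counts will usually differ, since the naive loop also catches bare "www" links that URI.extract skips when restricted to http(s) schemes; a fair timing comparison would also need to exclude the network fetch:

require 'uri'
require 'net/http'

# Placeholder URL; substitute any page you want to crawl.
page = 'https://www.ruby-lang.org/en/'

# Naive loop extractor defined above.
urls = crawl_and_extract_urls(page)
puts "naive loop:  #{urls.length} URLs"

# Stdlib extractor on the same body, restricted to http(s) schemes.
body = Net::HTTP.get(URI.parse(page)).force_encoding('UTF-8')
puts "URI.extract: #{URI.extract(body, %w[http https]).length} URLs"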