Web Scraping Framework

Overview

Grab Framework Documentation

pytest status coveralls documentation

Installation

    $ pip install -U grab

See details about installing Grab on different platforms here http://docs.grablib.org/en/latest/usage/installation.html

Support

Documentation: https://grablab.org/docs/

Russian telegram chat: https://t.me/grablab_ru

English telegram chat: https://t.me/grablab

To report bug please use GitHub issue tracker: https://github.com/lorien/grab/issues

What is Grab?

Grab is a python web scraping framework. Grab provides a number of helpful methods to perform network requests, scrape web sites and process the scraped content:

  • Automatic cookies (session) support
  • HTTP and SOCKS proxy with/without authorization
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API to extract data from DOM tree of HTML documents with XPATH queries
  • Asynchronous API to make thousands of simultaneous queries. This part of library called Spider. See list of spider fetures below.
  • Python 3 ready

Spider is a framework for writing web-site scrapers. Features:

  • Rules and conventions to organize the request/parse logic in separate blocks of codes
  • Multiple parallel network requests
  • Automatic processing of network errors (failed tasks go back to task queue)
  • You can create network requests and parse responses with Grab API (see above)
  • HTTP proxy support
  • Caching network results in permanent storage
  • Different backends for task queue (in-memory, redis, mongodb)
  • Tools to debug and collect statistics

Grab Example

    import logging

    from grab import Grab

    logging.basicConfig(level=logging.DEBUG)

    g = Grab()

    g.go('https://github.com/login')
    g.doc.set_input('login', '****')
    g.doc.set_input('password', '****')
    g.doc.submit()

    g.doc.save('/tmp/x.html')

    g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'

    g.go(repo_url)

    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(),
                          g.make_url_absolute(elem.attr('href'))))

Grab::Spider Example

    import logging

    from grab.spider import Spider, Task

    logging.basicConfig(level=logging.DEBUG)


    class ExampleSpider(Spider):
        def task_generator(self):
            for lang in 'python', 'ruby', 'perl':
                url = 'https://www.google.com/search?q=%s' % lang
                yield Task('search', url=url, lang=lang)

        def task_search(self, grab, task):
            print('%s: %s' % (task.lang,
                              grab.doc('//div[@class="s"]//cite').text()))


    bot = ExampleSpider(thread_number=2)
    bot.run()
Comments
  • Spider hangs during work randomly

    Spider hangs during work randomly

    now I have an issue I can't crack myself, last log entries looks this way:

    [12.05.2018] [15:23:07] [DEBUG] [1535-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:08] [DEBUG] RPS: 2.02 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:08] [DEBUG] RPS: 0.51 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:08] [DEBUG] [1536-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:08] [DEBUG] [1537-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:10] [DEBUG] RPS: 0.76 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:10] [DEBUG] RPS: 0.76 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:10] [DEBUG] [1538-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:10] [DEBUG] [1539-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:10] [DEBUG] [1540-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:11] [DEBUG] RPS: 1.17 [error:grab-connection-error=284, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:11] [DEBUG] RPS: 0.59 [error:grab-connection-error=284, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:11] [DEBUG] [1541-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:11] [DEBUG] [1542-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:12] [DEBUG] [1543-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:12] [DEBUG] RPS: 1.73 [error:grab-connection-error=285, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    [12.05.2018] [15:23:12] [DEBUG] [1544-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:12] [DEBUG] [1545-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [12.05.2018] [15:23:14] [DEBUG] RPS: 3.69 [error:grab-connection-error=285, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
    

    or this:

    [13.05.2018] [07:13:53] [DEBUG] RPS: 1.49 [fatal=24]
    [13.05.2018] [07:13:53] [DEBUG] RPS: 0.75 [fatal=24]
    [13.05.2018] [07:13:53] [DEBUG] [363-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:53] [DEBUG] [364-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:53] [DEBUG] [365-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:54] [DEBUG] RPS: 1.29 [fatal=24]
    [13.05.2018] [07:13:54] [DEBUG] RPS: 0.65 [fatal=24]
    [13.05.2018] [07:13:54] [DEBUG] [366-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:54] [DEBUG] [367-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:55] [DEBUG] [368-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:56] [DEBUG] RPS: 1.21 [fatal=24]
    [13.05.2018] [07:13:56] [DEBUG] RPS: 0.61 [fatal=24]
    [13.05.2018] [07:13:56] [DEBUG] [369-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:56] [DEBUG] [370-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:56] [DEBUG] [371-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:58] [DEBUG] RPS: 1.38 [fatal=24]
    [13.05.2018] [07:13:58] [DEBUG] RPS: 0.69 [fatal=24]
    [13.05.2018] [07:13:58] [DEBUG] [372-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:58] [DEBUG] [373-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:58] [DEBUG] [374-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:59] [DEBUG] RPS: 1.77 [fatal=24]
    [13.05.2018] [07:13:59] [DEBUG] RPS: 0.00 [fatal=24]
    [13.05.2018] [07:13:59] [DEBUG] [375-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:59] [DEBUG] [376-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:13:59] [DEBUG] [377-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:14:01] [DEBUG] RPS: 1.21 [fatal=24]
    [13.05.2018] [07:14:01] [DEBUG] RPS: 0.60 [fatal=24]
    [13.05.2018] [07:14:01] [DEBUG] [378-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
    [13.05.2018] [07:14:02] [DEBUG] RPS: 2.20 [fatal=24]
    

    it's final entries I can see in log. no mater how much time to wait (I was waiting for 20 hours to see result), spider just eat 1 CPU core at 100% and do absolutely nothing.

    my bot have 5 stages of work and it can be hanged on any stage.

    it's very difficult to debug this issue because no any errors in log, and it happens randomly

    I am running bot this way:

    bot = ExampleSpider(thread_number=3, network_service='threaded', grab_transport='pycurl')
    bot.load_proxylist("./proxy_tor.txt", "text_file", "socks5")
    bot.run()
    

    any ideas guys? :confused:

    bug 
    opened by EnzoRondo 25
  • Ошибка: pycurl.error: (0, '') при попытке отправки формы

    Ошибка: pycurl.error: (0, '') при попытке отправки формы

    Имеется html с формой:

    <form enctype="multipart/form-data" action="http://example.com/" method="post" accept-charset="UTF-8">
        <textarea name="body">Beställa</textarea>
        <input type="submit" name="op" value="Save"/>
    </form>
    

    Есть который отправляет форму:

    from grab import Grab
    
    g = Grab()
    g.setup(debug=True)
    
    g.go('file:///C:/page.html') # тут вставьте путь до файла что указан выше
    
    g.doc.set_input('op', 'Save')
    g.doc.submit(submit_name='op')
    

    При отправке формы получаем ошибку:

    pycurl.error: (0, '')
    

    Но если заменить внутри textarea код на другой, например вот так сделать в textarea:

    <textarea name="body">123</textarea>
    

    То всё отправится нормально.

    Как это исправить?

    opened by InputError 23
  • Python 3.5 - Unable to build DOM tree.

    Python 3.5 - Unable to build DOM tree.

    File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
      File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
      File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
      File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
      File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
      File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
      File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
      File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
      File "<string>", line None
    lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
    

    With preceding:

    encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
    encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
    I/O error : encoder error
    

    Example:

    class Scraper(Spider):
        def task_generator(self):
            urls = [
                'https://au.linkedin.com/directory/people-a/',
                'https://www.linkedin.com/directory/people-a/'
            ]
            for url in urls:
                yield Task('url', url=url)
    
        def task_url(self, grab, task):
            links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')
    
    
    bot = Scraper()
    bot.run()
    

    That's happened on some pages, perhaps lxml failed to detect correct encoding.

    bug 
    opened by oiwn 23
  • Пустые cookie

    Пустые cookie

    from grab import Grab
    url = 'https://www.fiverr.com/'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
    session_f_name1 = 'fiverr_session1.txt'
    session_f_name2 = 'fiverr_session2.txt'
    
    g1 = Grab()
    g1.setup(cookiefile=session_f_name1)
    g1.go(url)
    print 'g1', g1.cookies.cookiejar
    
    g2 = Grab()
    g2.setup(cookiefile=session_f_name2, user_agent=user_agent)
    g2.go(url)
    print 'g2', g2.cookies.cookiejar
    

    в первом случае кука есть, во втором нет

    opened by nodermann 22
  • Cookies issue on windows with pycurl version pycurl 7.43.0.1

    Cookies issue on windows with pycurl version pycurl 7.43.0.1

    Проверял на:

    Microsoft Windows Server 2012 Standard
    Microsoft Windows 7 Ultimate
    

    Версия питона на обоих машинах: Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

    grab=0.6.3.8

    Код:

    from grab import Grab, error
    import sys
    import logging
    import base64
    
    g = Grab()
    g.setup(timeout=60)
    g.setup(debug=True, debug_post=True)
    
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    
    # запрятано в интересах приватности
    url, login, pwd = base64.b64decode("v1WZktzbtVGZ7AHaw5CelRmbp9icvRXYyR3cp5WatRWYvwGcu0Wdw12LvoDc0RHa"[::-1])\
        .decode("utf-8").split(";")
    
    g.go(url)
    g.doc.set_input('username', login)
    g.doc.set_input('passwd', pwd)
    try:
        g.doc.set_input('lang', 'en-GB')
    except:
        pass
    g.doc.submit()
    
    is_logged = g.doc.text_search("task=logout")
    if not is_logged:
        raise error.GrabError("не вошли")
    
    print("all right!")
    

    Ставим pycurl-7.19.5.3, получаем результат: all right!

    Ставим последний pycurl-7.43.0.1, получаем: grab.error.GrabError: не вошли

    Это не особенность сайта, я проверял в других местах и там тоже самое. Сверял руками тело пост запроса с тем что отправляется в браузере, содержимое идентично.

    opened by InputError 20
  • GrabTimeoutError

    GrabTimeoutError

    Hello, I have exception like https://github.com/lorien/grab/issues/140, but I haven`t some DNS Errors. OS and network: OS: Windows 8 (x64) fixed DNS: 8.8.8.8

    Script:

    import pycurl; 
    from grab import Grab
    import logging
    
    print(pycurl.version); 
    print(pycurl.__file__);
    
    logging.basicConfig(level=logging.DEBUG)
    g = Grab(verbose_logging=True, debug=True)
    g.go('http://github.com')
    print g.xpath_text('//title')
    

    script output:

    PycURL/7.43.0 libcurl/7.47.0 OpenSSL/1.0.2f zlib/1.2.8 c-ares/1.10.0 libssh2/1.6.0
    c:\Python27\lib\site-packages\pycurl.pyd
    DEBUG:grab.network:[01] GET http://github.com
    DEBUG:grab.transport.curl:i: Rebuilt URL to: http://github.com/
    DEBUG:grab.transport.curl:i: Resolving timed out after 3000 milliseconds
    DEBUG:grab.transport.curl:i: Closing connection 0
    Traceback (most recent call last):
      File "D:\pr_files\source\python\planned\htmlParser\bgsParser\NewPythonProject\src\bgsParser.py", line 10, in <module>
        g.go('http://github.com')
      File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\base.py", line 377, in go
        return self.request(url=url, **kwargs)
      File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\base.py", line 450, in request
        self.transport.request()
      File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\transport\curl.py", line 489, in request
        raise error.GrabTimeoutError(ex.args[0], ex.args[1])
    grab.error.GrabTimeoutError: [Errno 28] Resolving timed out after 3000 milliseconds
    

    I tried to reinstall curl:

    C:\Python27\Scripts>pip install pycurl-7.43.0-cp27-none-win_amd64.whl --upgrade
    Processing c:\python27\scripts\pycurl-7.43.0-cp27-none-win_amd64.whl
    Installing collected packages: pycurl
      Found existing installation: pycurl 7.43.0
        Uninstalling pycurl-7.43.0:
          Successfully uninstalled pycurl-7.43.0
    Successfully installed pycurl-7.43.0
    

    but It doesn`t work. What can be wrong?

    Thank you for help.

    opened by tofflife 20
  • socks5 ip mismatch

    socks5 ip mismatch

    Hi, I am trying to use socks5 proxy list with grab.spider

    Here is my small test script:

    from grab.spider import Spider, Task
    import logging
    
    
    class TestSpider(Spider):
        def prepare(self):
            self.load_proxylist(
                'proxy.list',
                source_type='text_file', proxy_type='socks5',
                auto_change=True,
                read_timeout=180
            )
            self.set_proxy = set()
            self.real_proxy = set()
    
        def task_generator(self):
            for i in range(200):
                yield Task('2ip', 'http://2ip.ru/')
    
        def task_2ip(self, grab, task):
            ip = grab.doc.select('//big[@id="d_clip_button"]').text()
            self.real_proxy.add(ip)
    
            proxy = grab.config['proxy'].split(':')[0]
            self.set_proxy.add(proxy)
    
            # if proxy != ip:
            #     print proxy, ip
    
        def shutdown(self):
            print len(self.set_proxy), len(self.real_proxy)
    
    
    logging.basicConfig(level=logging.DEBUG)
    TestSpider(thread_number=16).run()
    

    The result is: 197 16

    As we can see, the real number of used proxies is only 16, the same as thread_number.

    I am using grab version from pip. The version of the curl is

    curl 7.38.0 (x86_64-pc-linux-gnu) libcurl/7.38.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
    Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp 
    Features: AsynchDNS IDN IPv6 Largefile GSS-API SPNEGO NTLM NTLM_WB SSL libz TLS-SRP 
    

    I think, the problem is with libcurl. What shall I do to fix it?

    opened by asluchevskiy 16
  • accidental work of grab.spider

    accidental work of grab.spider

    Hey there (@lorien), thanks a lot for great library :smiley:

    I am learning your library and now see unexpected behavior during work, here is my code sample which is based on example in documentation:

    import csv
    import logging
    import re
    
    from grab.spider import Spider, Task
    
    
    class ExampleSpider(Spider):
        def create_grab_instance(self, **kwargs):
            g = super(ExampleSpider, self).create_grab_instance(**kwargs)
            g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
            return g
    
        def task_generator(self):
            for i in range(1, 1 + 1):
                page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
                # print("page url: {}".format(page_url))
                yield Task('stage_two', url=page_url)
    
        def prepare(self):
            # Prepare the file handler to save results.
            # The method `prepare` is called one time before the
            # spider has started working
            self.result_file = csv.writer(open('result.txt', 'w'))
    
            # This counter will be used to enumerate found images
            # to simplify image file naming
            self.result_counter = 0
    
        def task_stage_two(self, grab, task):
            for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
                part = elem.attr("onclick")
                url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
                end_url = grab.make_url_absolute(url_part)
                yield Task('stage_three', url=end_url)
    
        def task_stage_three(self, grab, task):
            # First, save URL and title into dictionary
            post = {
                'url': task.url,
                'title': grab.doc.xpath_text("//title/text()"),
            }
            self.result_file.writerow([
                post['url'],
                post['title'],
            ])
            # Increment image counter
            self.result_counter += 1
    
    
    if __name__ == '__main__':
        logging.basicConfig(level=logging.DEBUG)
        # Let's start spider with two network concurrent streams
        bot = ExampleSpider(thread_number=2)
        bot.run()
    
    

    first run:

    DEBUG:grab.spider.base:Using memory backend for task queue
    DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
    DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
    DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
    DEBUG:grab.spider.base:Main process [pid=4064]: work done
    

    :confused:

    then I am running code again ~20 attempts and have same shit, but 21 time gives success and I see what I want to see:

    DEBUG:grab.spider.base:Using memory backend for task queue
    DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.stat:RPS: 0.52 []
    DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
    DEBUG:grab.stat:RPS: 2.36 []
    DEBUG:grab.stat:RPS: 1.28 []
    DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
    DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
    DEBUG:grab.spider.base:Main process [pid=4860]: work done
    

    why it happens?

    bug 
    opened by EnzoRondo 15
  • Баг post запросов windows x64

    Баг post запросов windows x64

    с 32 битными системами все проще, один добрый человек собрал курл под него, есть ли у вас возможность сделать тоже самое для x64? Мы тут всей командой мучаемся из за этого бага. Библиотека мощная и труда в нее вложено не мало, но она становится бесполезной при таком баге. Или возможно ли сделать его не на curl, а на сокетах к примеру.

    opened by ArturFis 15
  • Добавление возможности накопления проксей между их обновлениями

    Добавление возможности накопления проксей между их обновлениями

    Раньше, при каждом обновлении прокси-листа, старый лист затирался новым. Добавил возможность держать старые прокси в списке, а при обновлении просто расширять список проксями, которых ещё нет в списке. (т.е. добавлять только свежие соксы, не дублируя старые) Также, ранее, при обновлении проксей, итератор каждый раз создавался заново, что влечёт за собой такую ситацию: если время обновления списка относительно маленькое, то итератор не успевает доходить до конца списка и обновляется и , в итоге, последние прокси в списке так и не юзаются. Поэтому сейчас он создаётся единожды, но вместо itertools.cycle использую свой cycle, потому что в itertools список кешируется, т.е. не получится динамически его обновлять.

    opened by temptask 13
  • Как избавиться от

    Как избавиться от "operation-timeouted"?

    Здравствуйте. Только начал использовать Grab, но все очень нравится.

    Проблема следующая: Парсю больше 3 миллионов инвентарей со стима. Инвентари - это просто json-файлы, бывают большие и маленькие. Маленькие инвентари парсятся Spyder'ом без проблем, но вот инвентари побольше как-то страшно подвешивают поток, а потом выдают ошибку вида:

    DEBUG:grab.stat:RPS: 0.26 [error:operation-timeouted=7]

    Искал везде: тут, на гитхабе, в гуглгруппах, в документации, но ничего не нашел про то, что это значит и как с этим бороться. Пробовал вручную создавать инстансы Grab'а, передавать им connection_timeout и timeout и сувать их в Task'и, но видимого эффекта не получил.

    opened by seniorjoinu 10
Releases(v0.6.40)
  • v0.6.40(May 14, 2018)

  • v0.6.39(May 10, 2018)

    Fixed

    • Fix bug: task generator works incorrectly
    • Fix bug: pypi package misses http api html file
    • Fix bug: dictionary changed size during iteration in stat logging
    • Fix bug: multiple errors in urllib3 transport and threaded network service
    • Fix short names of errors in stat logging
    • Improve error handling in urrllib3 transport
    • Fix #299: multi-added errors
    • Fix bug: pypi package misses http api html file
    • Fix #285: pyquery extension parses html incorrectly
    • Fix #267: normalize handling of too many redirect error
    • Fix #268: fix processing of utf cookies
    • Fix #241: form_fields() fails on some HTML forms
    • Fix normalize_unicode issue in debug post method
    • Fix #323: urllib3 transport fails with UnicodeError on some invalid URLs
    • Fix #31: support for multivalue form inputs
    • Fix #328, fix #67: remove hard link between document and grab
    • Fix #284: option headers affects content of common_headers
    • Fix #293: processing non-latin chars in Location header
    • Fix #324: refactor response header processing

    Changed

    • Refactor Spider into set of async. services
    • Add certifi dependency into grab[full] setup target
    • Fix #315: use psycopg2-binary package for postgres cache
    • Related to #206: do not use connection_reuse=False for proxy connections in spider

    Removed

    • Remove cache timeout option
    • Remove structured extension
    Source code(tar.gz)
    Source code(zip)
  • v0.6.38(May 10, 2018)

    Fixed

    • Fix "error:None" in spider rps logging
    • Fix race condition bug in task generator

    Added

    • Add original_exc attribute to GrabNetworkError (and subclasses) that points to original exception

    Changed

    • Remove IOError from the ancestors of GrabNetworkError
    • Add default values to --spider-transport and --grab-transport options of crawl script
    Source code(tar.gz)
    Source code(zip)
  • v0.6.37(May 10, 2018)

    Added

    • Add --spider-transport and --grab-transport options to crawl script
    • Add SOCKS5 proxy support in urllib3 transport

    Fixed

    • Fix #237: urllib3 transport fails without pycurl installed
    • Fix bug: incorrect spider request logging when cache is enabled
    • Fix bug: crawl script fails while trying to process a lock key
    • Fix bug: urllib3 transport fails while trying to throw GrabConnectionError exception
    • Fix bug: Spider add_task method fails while trying to log invalid URL error

    Removed

    • Remove obsoleted hammer_mode and hammer_timeout config options
    Source code(tar.gz)
    Source code(zip)
  • v0.6.36(May 10, 2018)

    Added

    • Add pylint to default test set

    Fixed

    • Fix #229: using deprecated response object inside Grab

    Removed

    • Remove spider project template and start_project script
    Source code(tar.gz)
    Source code(zip)
Parse feeds in Python

feedparser - Parse Atom and RSS feeds in Python. Copyright 2010-2020 Kurt McKee Kurt McKee 1.5k Dec 30, 2022

An experiment to deploy a serverless infrastructure for a scrapy project.

Serverless Scrapy project This project aims to evaluate the feasibility of an architecture based on serverless technology for a web crawler using scra

José Ferraz Neto 5 Jul 08, 2022
Download images from forum threads

Forum Image Scraper Downloads images from forum threads Only works with forums which doesn't require a login to view and have an incremental paginatio

9 Nov 16, 2022
WebScraping - Scrapes Job website for python developer jobs and exports the data to a csv file

WebScraping Web scraping Pyton program that scrapes Job website for python devel

Michelle 2 Jul 22, 2022
This script is intended to crawl license information of repositories through the GitHub API.

GithubLicenseCrawler This script is intended to crawl license information of repositories through the GitHub API. Taking a csv file with requirements.

schutera 4 Oct 25, 2022
A universal package of scraper scripts for humans

Scrapera is a completely Chromedriver free package that provides access to a variety of scraper scripts for most commonly used machine learning and data science domains.

299 Dec 15, 2022
京东抢茅台,秒杀成功很多次讨论,天猫抢购,赚钱交流等。

Jd_Seckill 特别声明: 请添加个人微信:19972009719 进群交流讨论 目前群里很多人抢到【扫描微信添加群就好,满200关闭群,有喜欢薅信用卡羊毛的也可以找我交流】 本仓库发布的jd_seckill项目中涉及的任何脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合法性,准确性

50 Jan 05, 2023
Rottentomatoes, Goodreads and IMDB sites crawler. Semantic Web final project.

Crawler Rottentomatoes, Goodreads and IMDB sites crawler. Crawler written by beautifulsoup, selenium and lxml to gather books and films information an

Faeze Ghorbanpour 1 Dec 30, 2021
a way to scrape a database of all of the isef projects

ISEF Database This is a simple web scraper which gets all of the projects and abstract information from here. My goal for this is for someone to get i

William Kaiser 1 Mar 18, 2022
A python tool to scrape NFT's off of OpenSea

Right Click Bot A script to download NFT PNG's from OpenSea. All the NFT's you could ever want, no blockchain, for free. Usage Must Use Python 3! Auto

15 Jul 16, 2022
Scrap the 42 Intranet's elearning videos in a single click

42intra_scraper Scrap the 42 Intranet's elearning videos in a single click. Why you would want to use it ? Adjust speed at your convenience. (The intr

Noufel 5 Oct 27, 2022
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
Scrap-mtg-top-8 - A top 8 mtg scraper using python

Scrap-mtg-top-8 - A top 8 mtg scraper using python

1 Jan 24, 2022
Web-Scrapper using Python and Flask

Web-Scrapper "[초급]Python으로 웹 스크래퍼 만들기" 코스 -NomadCoders 기초적인 Python 문법강의부터 시작하여 웹사이트의 html파일에서 원하는 내용을 Scrapping해서 출력, csv 파일로 저장, flask를 이용한 간단한 웹페이지

윤성도 1 Nov 10, 2021
Scrapes Every Email Address of Every Society in Every University

society-email-scrape Site Live at https://kcsoc.github.io/society-email-scrape/ How to automatically generate new data Go to unis.yml Add your uni Cre

Krishna Consciousness Society 18 Dec 14, 2022
京东云无线宝积分推送,支持查看多设备积分使用情况

JDRouterPush 项目简介 本项目调用京东云无线宝API,可每天定时推送积分收益情况,帮助你更好的观察主要信息 更新日志 2021-03-02: 查询绑定的京东账户 通知排版优化 脚本检测更新 支持Server酱Turbo版 2021-02-25: 实现多设备查询 查询今

雷疯 199 Dec 12, 2022
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

Porames Vatanaprasan 31 Apr 17, 2022
Using Selenium with Python to Web Scrap Popular Youtube Tech Channels.

Web Scrapping Popular Youtube Tech Channels with Selenium Data Mining, Data Wrangling, and Exploratory Data Analysis About the Data Web scrapi

David Rusho 0 Aug 18, 2021
Python script to check if there is any differences in responses of an application when the request comes from a search engine's crawler.

crawlersuseragents This Python script can be used to check if there is any differences in responses of an application when the request comes from a se

Podalirius 13 Dec 27, 2022
Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN

Lexile-Atos-Scraper Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN You will need to install the chrome webdriver if you have n

1 Feb 11, 2022