Extract embedded metadata from HTML markup

Overview

extruct

Build Status Coverage report PyPI Version

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Installation

pip install extruct

Usage

All-in-one extraction

The simplest example how to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with some HTML string and an optional base URL.

Let's try this on a webpage that uses all the syntaxes supported (RDFa with ogp).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                      'content': 'What is Open Graph Protocol '
                                                 'and why you need it? Learn to '
                                                 'implement Open Graph Protocol '
                                                 'for Facebook on your website. '
                                                 'Open Graph Protocol Meta Tags.',
                                      'name': 'description'}],
                      'namespaces': {},
                      'terms': []}],

'json-ld': [ { '@context': 'https://schema.org',
                 '@id': '#organization',
                 '@type': 'Organization',
                 'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                 'name': 'Optimize Smart',
                 'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                             'https://uk.linkedin.com/in/analyticsnerd',
                             'https://www.youtube.com/user/optimizesmart',
                             'https://twitter.com/analyticsnerd'],
                 'url': 'https://www.optimizesmart.com/'}],
  'microdata': [ { 'properties': {'headline': ''},
                   'type': 'http://schema.org/WPHeader'}],
  'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                     'name': [ 'Open Graph '
                                                               'Protocol for '
                                                               'Facebook '
                                                               'explained with '
                                                               'examples\n'
                                                               '\n'
                                                               'Specialized '
                                                               'Tracking\n'
                                                               '\n'
                                                               '\n'
                                                               (...)
                                                               'Follow '
                                                               '@analyticsnerd\n'
                                                               '!function(d,s,id){var '
                                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                               "'script', "
                                                               "'twitter-wjs');"]},
                                     'type': ['h-entry']}],
                     'properties': { 'name': [ 'Open Graph Protocol for '
                                               'Facebook explained with '
                                               'examples\n'
                                               (...)
                                               'Follow @analyticsnerd\n'
                                               '!function(d,s,id){var '
                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                               "'script', 'twitter-wjs');"]},
                     'type': ['h-feed']}],
  'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                   'properties': [ ('og:locale', 'en_US'),
                                   ('og:type', 'article'),
                                   ( 'og:title',
                                     'Open Graph Protocol for Facebook '
                                     'explained with examples'),
                                   ( 'og:description',
                                     'What is Open Graph Protocol and why you '
                                     'need it? Learn to implement Open Graph '
                                     'Protocol for Facebook on your website. '
                                     'Open Graph Protocol Meta Tags.'),
                                   ( 'og:url',
                                     'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                   ('og:site_name', 'Optimize Smart'),
                                   ( 'og:updated_time',
                                     '2018-03-09T16:26:35+00:00'),
                                   ( 'og:image',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                   ( 'og:image:secure_url',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
  'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
              'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
            { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
              'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
              'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
              'article:section': [{'@value': 'Specialized Tracking'}],
              'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                            'Graph Protocol '
                                                            'and why you need '
                                                            'it? Learn to '
                                                            'implement Open '
                                                            'Graph Protocol '
                                                            'for Facebook on '
                                                            'your website. '
                                                            'Open Graph '
                                                            'Protocol Meta '
                                                            'Tags.'}],
              'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
              'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                      'Facebook explained with '
                                                      'examples'}],
              'http://ogp.me/ns#type': [{'@value': 'article'}],
              'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
              'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}

Select syntaxes

It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed all syntaxes will be extracted and returned:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                  'fb': 'http://www.facebook.com/2008/fbml',
                                  'og': 'http://ogp.me/ns#'},
                   'properties': [ ('fb:app_id', '308540029359'),
                                   ('og:site_name', 'Songkick'),
                                   ('og:type', 'songkick-concerts:artist'),
                                   ('og:title', 'Elysian Fields'),
                                   ( 'og:description',
                                     'Find out when Elysian Fields is next '
                                     'playing live near you. List of all '
                                     'Elysian Fields tour dates and concerts.'),
                                   ( 'og:url',
                                     'https://www.songkick.com/artists/236156-elysian-fields'),
                                   ( 'og:image',
                                     'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Uniform

Another option is to uniform the output of microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure:

{'@context': 'http://example.com',
             '@type': 'example_type',
             /* All other the properties in keys here */
             }

To do so set uniform=True when calling extract, it's false by default for backward compatibility. Here the same example as before but with uniform set to True:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                               'fb': 'http://www.facebook.com/2008/fbml',
                               'og': 'http://ogp.me/ns#'},
                 '@type': 'songkick-concerts:artist',
                 'fb:app_id': '308540029359',
                 'og:description': 'Find out when Elysian Fields is next '
                                   'playing live near you. List of all '
                                   'Elysian Fields tour dates and concerts.',
                 'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                 'og:site_name': 'Songkick',
                 'og:title': 'Elysian Fields',
                 'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB rdfa structure is not uniformed yet

Returning HTML node

It is also possible to get references to HTML node for every extracted metadata item. The feature is supported only by microdata syntax.

To use that, just set the return_html_node option of extract method to True. As the result, an additional key "nodeHtml" will be included in the result for every item. Each node is of lxml.etree.Element type:

>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                   'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                  'Not your thin sticky pad, '
                                                  'No-Muv is truly the best!',
                                   'image': ['', ''],
                                   'name': ['No-Muv', 'No-Muv'],
                                   'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': 'Price:  '
                                                                          '$45'},
                                                 'type': 'http://schema.org/Offer'},
                                               { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': '(Select '
                                                                          'Size/Shape '
                                                                          'for '
                                                                          'Pricing)'},
                                                 'type': 'http://schema.org/Offer'}],
                                   'ratingValue': ['5.00', '5.00']},
                   'type': 'http://schema.org/Product'}]}

Single extractors

You can also use each extractor individually. See below.

Microdata extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]

RDFa extraction (experimental)

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.

Open Graph extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <meta property="og:type" content="article"/>
...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
...   <meta property="fb:admins" content="himanshu160"/>
...   <meta property="og:site_name" content="Event Education"/>
...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
...  </head>
...  <body>
...   <div id="fb-root"></div>
...   <script>(function(d, s, id) {
...               var js, fjs = d.getElementsByTagName(s)[0];
...               if (d.getElementById(id)) return;
...                  js = d.createElement(s); js.id = id;
...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
...                  fjs.parentNode.insertBefore(js, fjs);
...                  }(document, 'script', 'facebook-jssdk'));</script>
...  </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
      "og": "http://ogp.me/ns#"
  },
  "properties": [
      [
          "og:title",
          "Himanshu's Open Graph Protocol"
      ],
      [
          "og:type",
          "article"
      ],
      [
          "og:url",
          "https://www.eventeducation.com/test.php"
      ],
      [
          "og:image",
          "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
      ],
      [
          "og:site_name",
          "Event Education"
      ],
      [
          "og:description",
          "Event Education provides free courses on event planning and management to event professionals worldwide."
      ]
    ]
 }]

Microformat extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <article class="h-entry">
...    <h1 class="p-name">Microformats are amazing</h1>
...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
...    <div class="e-content">
...     <p>Blah blah blah</p>
...    </div>
...   </article>
...  </head>
...  <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
      "h-entry"
  ],
  "properties": {
      "name": [
          "Microformats are amazing"
      ],
      "author": [
          {
              "type": [
                  "h-card"
              ],
              "properties": {
                  "name": [
                      "W. Developer"
                  ],
                  "url": [
                      "http://example.com"
                  ]
              },
              "value": "W. Developer"
          }
      ],
      "published": [
          "2013-06-13 12:00:00"
      ],
      "summary": [
          "In which I extoll the virtues of using microformats."
      ],
      "content": [
          {
              "html": "\n<p>Blah blah blah</p>\n",
              "value": "\nBlah blah blah\n"
          }
      ]
    }
 }]

DublinCore extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML <meta> elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]

Command Line Tool

extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD and RDFa, Open Graph and Microformat metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). If you want to restrict the output to just one or a subset of those, you can pass their individual names collected in a list through 'syntaxes' argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":

extruct "http://example.com" --syntaxes microdata json-ld

NB syntaxes names passed must correspond to these: microdata, json-ld, rdfa, opengraph, microformat

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
Demonstration on how to use async python to control multiple playwright browsers for web-scraping

Playwright Browser Pool This example illustrates how it's possible to use a pool of browsers to retrieve page urls in a single asynchronous process. i

Bernardas Ališauskas 8 Oct 27, 2022
This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

Devansh Singh 1 Feb 10, 2022
Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Joseph Lai 543 Jan 03, 2023
This is a script that scrapes the longitude and latitude on food.grab.com

grab This is a script that scrapes the longitude and latitude for any restaurant in Manila on food.grab.com, location can be adjusted. Search Result p

0 Nov 22, 2021
Scrape data on SpaceX: Capsules, Rockets, Cores, Roadsters, SpaceX Info

SpaceX Sofware I developed software to scrape data on SpaceX: Capsules, Rockets, Cores, Roadsters, SpaceX Info to use the software you need Python a

Maxence Rémy 16 Aug 02, 2022
A simple app to scrap data from Twitter.

Twitter-Scraping-App A simple app to scrap data from Twitter. Available Features Search query. Select number of data you want to fetch from twitter. C

Davis David 2 Oct 31, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
script to scrape direct download links (ddls) from google drive index.

bhadoo Google Personal/Shared Drive Index scraper. A small script to scrape direct download links (ddls) of downloadable files from bhadoo google driv

sαɴᴊɪᴛ sɪɴʜα 53 Dec 16, 2022
A Web Scraping Program.

Web Scraping AUTHOR: Saurabh G. MTech Information Security, IIT Jammu. If you find this repository useful. I would appreciate if you Star it and Fork

Saurabh G. 2 Dec 14, 2022
A social networking service scraper in Python

snscrape snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the disco

2.4k Jan 01, 2023
Scrape plants scientific name information from Agroforestry Species Switchboard 2.0.

Agroforestry Species Switchboard 2.0 Scraper Scrape plants scientific name information from Species Switchboard 2.0. Requirements python = 3.10 (you

Mgs. M. Rizqi Fadhlurrahman 2 Dec 23, 2021
A web crawler for recording posts in "sina weibo"

Web Crawler for "sina weibo" A web crawler for recording posts in "sina weibo" Introduction This script helps collect attributes of posts in "sina wei

4 Aug 20, 2022
自动完成每日体温上报(Github Actions)

体温上报助手 简介 每天 10:30 GMT+8 自动完成体温上报,如想修改定时运行的时间,可修改 .github/workflows/SduHealthReport.yml 中 schedule 属性。 如果当日有异常,请手动在小程序端/PC 端填写!

Teng Zhang 23 Sep 15, 2022
Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

Albert Marrero 1 Jan 12, 2022
Scrapping the data from each page of biocides listed on the BAUA website into a csv file

Scrapping the data from each page of biocides listed on the BAUA website into a csv file

Eric DE MARIA 1 Nov 30, 2021
A dead simple crawler to get books information from Douban.

Introduction A dead simple crawler to get books information from Douban. Pre-requesites Python 3 Install dependencies from requirements.txt (Optional)

Yun Wang 1 Jan 10, 2022
抖音批量下载用户所有无水印视频

Douyincrawler 抖音批量下载用户所有无水印视频 Run 安装python3, 安装依赖

28 Dec 08, 2022
Fundamentus scrapy

Fundamentus_scrapy Baixa informacões que os outros scrapys do fundamentus não realizam. Para iniciar (python main.py), sera criado um arquivo chamado

Guilherme Silva Uchoa 1 Oct 24, 2021
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper GDev Profile Badge Scraper is a Google Developer Profile Web Scraper which scrapes for specific badges in a use

Siddhant Lad 7 Jan 10, 2022