Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Overview

tldextract

Python Module PyPI version Build Status

tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last 2 elements goes a long way only if you're thinking of simple e.g. .com domains. Think parsing http://forums.bbc.co.uk for example: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.

tldextract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List (PSL). So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')

>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
>>> # rejoin subdomain and domain
>>> '.'.join(ext[:2])
'forums.bbc'
>>> # a common alias
>>> ext.registered_domain
'bbc.co.uk'

Note subdomain and suffix are optional. Not all URL-like inputs have a subdomain or a valid suffix.

>>> tldextract.extract('google.com')
ExtractResult(subdomain='', domain='google', suffix='com')

>>> tldextract.extract('google.notavalidsuffix')
ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='')

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='')

If you want to rejoin the whole namedtuple, regardless of whether a subdomain or suffix were found:

>>> ext = tldextract.extract('http://127.0.0.1:8080/deployed/')
>>> # this has unwanted dots
>>> '.'.join(ext)
'.127.0.0.1.'
>>> # join each part only if it's truthy
>>> '.'.join(part for part in ext if part)
'127.0.0.1'

By default, this package supports the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well.

This module started by implementing the chosen answer from this StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country codes like com.au, or the exceptions to country codes like the registered domain parliament.uk. The Public Suffix List does, and so does this module.

Installation

Latest release on PyPI:

pip install tldextract

Or the latest dev version:

pip install -e 'git://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage, splits the url components by space:

tldextract http://forums.bbc.co.uk
# forums bbc co.uk

Note About Caching

Beware when first running the module, it updates its TLD list with a live HTTP request. This updated TLD set is usually cached indefinitely in ``$HOME/.cache/python-tldextract`. To control the cache's location, set TLDEXTRACT_CACHE environment variable or set the cache_dir path in TLDExtract initialization.

(Arguably runtime bootstrapping like that shouldn't be the default behavior, like for production systems. But I want you to have the latest TLDs, especially when I haven't kept this code up to date.)

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)
no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')

If you want to stay fresh with the TLD definitions--though they don't change often--delete the cache file occasionally, or run

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

It is also recommended to delete the file after upgrading this lib.

Advanced Usage

Public vs. Private Domains

The PSL maintains a concept of "private" domains.

PRIVATE domains are amendments submitted by the domain holder, as an expression of how they operate their domain security policy. … While some applications, such as browsers when considering cookie-setting, treat all entries the same, other applications may wish to treat ICANN domains and PRIVATE domains differently.

By default, tldextract treats public and private domains the same.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com')

The following overrides this.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com', include_psl_private_domains=True)
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

or to change the default for all extract calls,

>>> extract = tldextract.TLDExtract( include_psl_private_domains=True)
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

The thinking behind the default is, it's the more common case when people mentally parse a URL. It doesn't assume familiarity with the PSL nor that the PSL makes such a distinction. Note this may run counter to the default parsing behavior of other, PSL-based libraries.

Specifying your own URL or file for the Suffix List data

You can specify your own input data in place of the default Mozilla Public Suffix List:

extract = tldextract.TLDExtract(
    suffix_list_urls=["http://foo.bar.baz"],
    # Recommended: Specify your own cache file, to minimize ambiguities about where
    # tldextract is getting its data, or cached data, from.
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

The above snippet will fetch from the URL you specified, upon first need to download the suffix list (i.e. if the cached version doesn't exist).

If you want to use input data from your local filesystem, just use the file:// protocol:

extract = tldextract.TLDExtract(
    suffix_list_urls=["file://absolute/path/to/your/local/suffix/list/file"],
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend.

FAQ

Can you add suffix ____? Can you make an exception for domain ____?

This project doesn't contain an actual list of public suffixes. That comes from the Public Suffix List (PSL). Submit amendments there.

(In the meantime, you can tell tldextract about your exception by either forking the PSL and using your fork in the suffix_list_urls param, or adding your suffix piecemeal with the extra_suffixes param.)

If I pass an invalid URL, I still get a result, no error. What gives?

To keep tldextract light in LoC & overhead, and because there are plenty of URL validators out there, this library is very lenient with input. If valid URLs are important to you, validate them before calling tldextract.

This lenient stance lowers the learning curve of using the library, at the cost of desensitizing users to the nuances of URLs. Who knows how much. But in the future, I would consider an overhaul. For example, users could opt into validation, either receiving exceptions or error metadata on results.

Contribute

Setting up

  1. git clone this repository.
  2. Change into the new directory.
  3. pip install tox

Running the Test Suite

Run all tests against all supported Python versions:

tox --parallel

Run all tests against a specific Python environment configuration:

tox -l
tox -e py37
Owner
John Kurkowski
UX Engineering Consultant
John Kurkowski
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Bearsql allows you to query pandas dataframe with sql syntax.

Bearsql adds sql syntax on pandas dataframe. It uses duckdb to speedup the pandas processing and as the sql engine

14 Jun 22, 2022
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

MatrixProfile MatrixProfile is a Python 3 library, brought to you by the Matrix Profile Foundation, for mining time series data. The Matrix Profile is

Matrix Profile Foundation 302 Dec 29, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field of D

Joachim 1 Dec 26, 2021
Detecting Underwater Objects (DUO)

Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by ad

27 Dec 12, 2022
Convert tables stored as images to an usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

711 Dec 26, 2022
ASOUL直播间弹幕抓取&&数据分析

ASOUL直播间弹幕抓取&&数据分析(更新中) 这些文件用于爬取ASOUL直播间的弹幕(其他直播间也可以)和其他信息,以及简单的数据分析生成。

159 Dec 10, 2022
Data imputations library to preprocess datasets with missing data

Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.

Elton Law 329 Dec 05, 2022
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
Stochastic Gradient Trees implementation in Python

Stochastic Gradient Trees - Python Stochastic Gradient Trees1 by Henry Gouk, Bernhard Pfahringer, and Eibe Frank implementation in Python. Based on th

John Koumentis 2 Nov 18, 2022
Top 50 best selling books on amazon

It's a dashboard that shows the detailed information about each book in the top 50 best selling books on amazon over the last ten years

Nahla Tarek 1 Nov 18, 2021
DefAP is a program developed to facilitate the exploration of a material's defect chemistry

DefAP is a program developed to facilitate the exploration of a material's defect chemistry. A large number of features are provided and rapid exploration is supported through the use of autoplotting

6 Oct 25, 2022
Maximum Covariance Analysis in Python

xMCA | Maximum Covariance Analysis in Python The aim of this package is to provide a flexible tool for the climate science community to perform Maximu

Niclas Rieger 39 Jan 03, 2023
track your GitHub statistics

GitHub-Stalker track your github statistics 👀 features find new followers or unfollowers find who got a star on your project or remove stars find who

Bahadır Araz 34 Nov 18, 2022
Implementation in Python of the reliability measures such as Omega.

OmegaPy Summary Simple implementation in Python of the reliability measures: Omega Total, Omega Hierarchical and Omega Hierarchical Total. Name Link O

Rafael Valero Fernández 2 Apr 27, 2022
An Indexer that works out-of-the-box when you have less than 100K stored Documents

U100KIndexer An Indexer that works out-of-the-box when you have less than 100K stored Documents. U100K means under 100K. At 100K stored Documents with

Jina AI 7 Mar 15, 2022
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021