Make Your Own Dictionary from Wiktionary Data Using Python - A Practical Guide

Why Do You Need Your Own Dictionary?

For many users there is no need to make their own dictionary, because there are already many great free online dictionaries available. But some geeky users like me may still want an offline dictionary that contains the necessary things, but not too many.

Wiktionary may be the only freely accessible and downloadable dictionary data source out there with text, image and audio data.

Since I personally most need an English dictionary, I made my own English dictionary based on Wiktionary data. Why did I choose Wiktionary? Because it has the most necessary parts a good-enough dictionary should have, in my opinion: pronunciation with audio data, illustrations with images, and definitions with etymology explanations.

What Does the Finished Product Look Like?

How Can You Make It Too?

Yes, this is the topic of this article. I am going to tell you how I made it and give you ideas on how you can make it too.

As an overview, there are the following steps:

  1. fetch the wordlist and get the URL for each word
  2. fetch the HTML page of each word, along with its important media data files
  3. process the downloaded HTML files - extracting the interesting parts, changing media reference URLs into local URLs and constructing your own HTML documents

Then let's look into each step in detail.

Fetching the wordlist and word URLs

For the first step I used the Wiktionary Appendix:Basic English word list. After comparing several wordlists, I got the impression that this is the most well-received basic English wordlist; the words in it are the most essential ones.

get_wiktionary_page_list(the_url, downloads_dir_path)

HTTPDownloader.download_and_save(the_url, html_file_path)

import re
from bs4 import BeautifulSoup

ref_list = []
with open(html_file_path, 'r') as html_fileobj:
    # NOTE: 20210602, 'html5lib' can be an alternative to Python's 'html.parser'
    html_bs = BeautifulSoup(html_fileobj, 'html.parser')

    a_bs_list = html_bs.find_all('a', href=re.compile('^/wiki/[^/?:]+$'), title=True)
    for a_bs in a_bs_list:
        url = a_bs['href']
        if url.startswith('//'):
            # protocol-relative URL: prepend the scheme of the_url
            url = the_url.split(':', 1)[0] + ':' + url
        elif url.startswith('/'):
            # site-relative URL: prepend the scheme and host of the_url
            url = the_url.split(':', 1)[0] + '://' + the_url.split(':', 1)[1][2:].split('/', 1)[0] + url
        if not url.startswith('http'):
            print('New URL with unsupported pattern: %s' % url)

        title = a_bs['title'] or a_bs.get_text()
        ref_list.append((title, url))

As the above Python code snippet shows, fetching the wordlist and the word page URLs is not complex.

  1. download the wordlist page and save it as HTML
  2. parse the downloaded HTML file and use the regex pattern ^/wiki/[^/?:]+$ to find all word links (with the help of the BeautifulSoup4 library)

One thing to mention here: the above snippet yields 852 words in total. Two of them, Wikipedia and vocabulary, are exceptions; handling them manually would not take long.

Fetching word pages and their media files

With the word and URL list fetched as described in the last step, this step is a little harder but still easy to do.

Downloading the word HTML pages is straightforward, although getting the referenced media files is a little more involved. But let's look at the code; it should also be easy.

The first sub-step here is to get the URLs of the referenced media files.

audio_bs_list = html_bs.find_all('audio')
for audio_bs in audio_bs_list:
    source_bs_list = audio_bs.find_all('source')
    for source_bs in source_bs_list:
        url = source_bs['src']
        if url.startswith('//'):
            url = the_url.split(':', 1)[0] + ':' + url
        if not url.startswith('http'):
            print('New URL with unsupported pattern: %s' % url)

        content_type = source_bs['type'].split('/')[0]
        ref_list.append((url, content_type, mediadir_path))

img_bs_list = html_bs.find_all('img', class_='thumbimage')
for img_bs in img_bs_list:
    img_src = img_bs['src']
    img_srcset = img_bs.get('srcset', None)

    url_list = [img_src]
    if img_srcset:
        src_list = [x.strip() for x in img_srcset.split(',')]
        for src in src_list:
            if src:
                url = src.rsplit(' ', 1)[0]
                url_list.append(url)
    for url in url_list:
        if url.startswith('//'):
            url = the_url.split(':', 1)[0] + ':' + url
            if not url.startswith('http'):
            print('New URL with unsupported pattern: %s' % url)

        content_type = 'image'
        ref_list.append((url, content_type, mediadir_path))

And the second sub-step is to download those URLs.

Downloading the URLs is simply the unexciting part here; do it any way you like. In Python, multiprocessing.Pool(MAX_PROCESS_NUM).map(download_wiktionary_page_ref, current_download_list) plus a sleep(1 + random() * 2) can be a solution. Or you can use any other downloading application by providing it the list of URLs.

Processing the downloaded HTML files

This is the relatively difficult part, but fortunately the HTML structure of Wiktionary word pages is not that bad. With the help of BeautifulSoup4, it was still doable for me in one or two days.

The first question here is: what to keep? After carefully checking the HTML structure of Wiktionary word pages, I ended up with the following.

def is_headword_wrapper_tag(tag):
    return (tag.parent and == 'p') and (
        ( == 'strong' and tag.has_attr('class') and 'headword' in tag['class'])
        or ( == 'b' and (
            tag.previous_sibling is None
            or tag.previous_element == tag.parent))
    )

def is_etymology_line_tag(tag):
    return == 'h3' and tag.find('span', string=re.compile('^Etymology'))

# SECTION: Pronunciation
pronunciation_tags = mw_content_tag.find_all('span', class_='mw-headline', string='Pronunciation')
for pronunciation_tag in pronunciation_tags:
    pronunciation_headlines = []
    pronunciation_line_tag_names = ['h2', 'h3', 'h4', 'h5']

    pronunciation_line_tag = pronunciation_tag.parent
    if in pronunciation_line_tag_names:
        current_line_tag = pronunciation_line_tag

        while True:
            tag_name_index = pronunciation_line_tag_names.index(
            if tag_name_index == 0:
                break

            parent_line_tag = current_line_tag.find_previous_sibling(pronunciation_line_tag_names[tag_name_index - 1])
            # parent_line_tag is assumed to exist here
            parent_line_tag_headline = parent_line_tag.find(class_='mw-headline').string
            pronunciation_headlines.append(parent_line_tag_headline)

            current_line_tag = parent_line_tag
    else:
        print(f'pronunciation_line_tag {pronunciation_line_tag}')
        sys.exit('Found pronunciation tag which is in fact not one.')

    if pronunciation_headlines[-1] == 'English':
        pronunciation_content_list_tag = pronunciation_line_tag.find_next_sibling('ul')

# SECTION: Images
illustration_image_tags = mw_content_tag.find_all('img', class_='thumbimage')
for illustration_image_tag in illustration_image_tags:
    illustration_line_tag = illustration_image_tag.find_parent('div', class_=re.compile(r'(^|\s)thumb(\s|$)'))

# SECTION: Definitions
definition_pending_tags = []
definition_headword_tags = mw_content_tag.find_all(is_headword_wrapper_tag, string=re.compile(f'^{keyword}$'))
for definition_headword_tag in definition_headword_tags:
    definition_line_tag = definition_headword_tag.parent
    language_section_line_tag = definition_line_tag.find_previous_sibling('h2')
    if not language_section_line_tag:
        sys.exit('Found definition headword tag which is in fact not definition headword tag')

    language_section_headline_tag = language_section_line_tag.find('span', 'mw-headline')
    if not language_section_headline_tag or language_section_headline_tag.get_text() != 'English':
        continue
    selected_leading_tags = []

    pos_line_tag = definition_line_tag.previous_sibling
    max_number_of_iteration = 10
    current_number_of_iteration = 0
    while pos_line_tag:
        current_number_of_iteration += 1
        if current_number_of_iteration > max_number_of_iteration:
            print(definition_line_tag, pos_line_tag)
            sys.exit('Unable to find headline line')

        pos_line_tag_names = ['h3', 'h4', 'h5']
        if hasattr(pos_line_tag, 'name') and in pos_line_tag_names:
            break
        pos_line_tag = pos_line_tag.previous_sibling

    if not pos_line_tag:
        sys.exit('Unable to find headline line')

    selected_leading_tags = [pos_line_tag] + selected_leading_tags

    etymology_line_tag = pos_line_tag.find_previous_sibling(is_etymology_line_tag)

    if etymology_line_tag and (etymology_line_tag not in definition_pending_tags):
        etymology_content_tags = []
        etymology_content_tag = etymology_line_tag.next_sibling
        max_number_of_iteration = 20
        current_number_of_iteration = 0
        while etymology_content_tag:
            current_number_of_iteration += 1
            if current_number_of_iteration > max_number_of_iteration:
                print(definition_line_tag, etymology_content_tag)
                sys.exit('Unable to find end of etymology content tags')

            if hasattr(etymology_content_tag, 'name') and in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                break

            if hasattr(etymology_content_tag, 'name') and == 'div' \
                    and etymology_content_tag.has_attr('class') \
                    and 'thumb' in etymology_content_tag['class']:
                # skip thumbnail divs; illustration images are collected separately
                pass
            else:
                etymology_content_tags.append(etymology_content_tag)

            etymology_content_tag = etymology_content_tag.next_sibling

        selected_leading_tags = [etymology_line_tag] + etymology_content_tags + selected_leading_tags

    selected_following_tags = []

    explanation_list_line_tag = definition_line_tag.next_sibling
    max_number_of_iteration = 10
    current_number_of_iteration = 0
    while explanation_list_line_tag:
        current_number_of_iteration += 1
        if current_number_of_iteration > max_number_of_iteration:
            print(definition_line_tag, explanation_list_line_tag)
            sys.exit('Unable to find explanation list line')

        explanation_list_line_tag_names = ['ol']
        if hasattr(explanation_list_line_tag, 'name') and in explanation_list_line_tag_names:
            break

        explanation_list_line_tag = explanation_list_line_tag.next_sibling

    if not explanation_list_line_tag:
        sys.exit('Unable to find explanation list line')

    selected_following_tags.append(explanation_list_line_tag)


    for selected_leading_tag in selected_leading_tags:
        definition_pending_tags.append(selected_leading_tag)
    for selected_following_tag in selected_following_tags:
        definition_pending_tags.append(selected_following_tag)

# SECTION: Removing unnecessary
mw_editsection_tags = output_bs.find_all('span', class_='mw-editsection')
for mw_editsection_tag in mw_editsection_tags:
    mw_editsection_tag.decompose()

sister_wikipedia_tags = output_bs.find_all('div', class_=re.compile(r'(^|\s)sister-wikipedia(\s|$)'))
for sister_wikipedia_tag in sister_wikipedia_tags:
    sister_wikipedia_tag.decompose()

sup_tags = output_bs.find_all('sup')
for sup_tag in sup_tags:
    sup_first_child_or_sibling_tag = sup_tag.next_element
    if hasattr(sup_first_child_or_sibling_tag, 'name') and == 'a' \
            and sup_first_child_or_sibling_tag.get_text().startswith('[') \
            and sup_first_child_or_sibling_tag.get_text().endswith(']') \
            and (
                (sup_first_child_or_sibling_tag.get('href') or '').startswith('#')
                or 'external' in (sup_first_child_or_sibling_tag.get('class') or [])
            ):
        sup_tag.decompose()
# SECTION: output and regex string replacement
# do it as you like

* cached version, generated at 2021-10-27 00:34:10 UTC.
