Make Your Own Dictionary from Wiktionary Data Using Python - A Practical Guide

Why Would You Need Your Own Dictionary?

For many users there is no need to make your own dictionary, because there are already many great free online dictionaries available. But some geeky users like me may want an offline copy of a dictionary that contains the necessary things, but not too much more.

Wiktionary may be the only freely accessible and downloadable dictionary data source out there that offers text, images and audio data.

Since what I personally need most is an English dictionary, I made my own English dictionary based on Wiktionary data. Why did I choose Wiktionary? Because Wiktionary has the parts that, in my opinion, a good enough dictionary should have: pronunciation with audio data, illustrations with images, and definitions with etymology explanations.

What Does the Finished Product Look Like?




How Can You Make It Yourself?

Yes, this is the topic of this article. I am going to tell you how I made it and give you ideas on how you can make it too.

As an overview, there are the following steps:

  1. fetch the wordlist and get the URL for each word
  2. fetch the HTML page of each word together with its important media files
  3. process the downloaded HTML files - extract the interesting parts, change media reference URLs into local URLs and construct your own HTML documents

Now let's look into each step in detail.

Fetching the wordlist and word URLs

For the first step I used the Wiktionary Appendix:Basic English word list. After comparing several wordlists, I got the impression that this is the most well-received basic English wordlist; the words in it are the most essential ones.

get_wiktionary_page_list(the_url, downloads_dir_path)

HTTPDownloader.download_and_save(the_url, html_file_path)

import re
from bs4 import BeautifulSoup

ref_list = []
with open(html_file_path, 'r') as html_fileobj:
    # NOTE: 20210602, 'html5lib' can be an alternative to Python's 'html.parser'
    html_bs = BeautifulSoup(html_fileobj, 'html.parser')

    a_bs_list = html_bs.find_all('a', href=re.compile('^/wiki/[^/?:]+$'), title=True)
    for a_bs in a_bs_list:
        url = a_bs['href']
        if url.startswith('//'):
            url = the_url.split(':', 1)[0] + ':' + url
        elif url.startswith('/'):
            url = the_url.split(':', 1)[0] + '://' + the_url.split(':', 1)[1][2:].split('/', 1)[0] + url
        elif url.startswith('http'):
            # already an absolute URL, nothing to do
            pass
        else:
            print('New URL with unsupported pattern: %s' % url)
            continue

        title = a_bs['title'] or a_bs.get_text()
        ref_list.append((title, url))

As shown by the above Python code snippet, fetching the wordlist and the word page URLs is not complex.

  1. download the wordlist page and save it as HTML
  2. analyze the downloaded HTML file and use the regex pattern ^/wiki/[^/?:]+$ to find all word links (with the help of the BeautifulSoup4 library)

One thing to mention here: with the above snippet there will be 852 entries in total. The two exceptional entries are Wikipedia and vocabulary, which do not belong to the actual wordlist. Handling them manually would not take long; a small filter like the one below works too.
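
This is just a sketch; it assumes the two extra links carry exactly the titles 'Wikipedia' and 'vocabulary':

# assumption: the two non-wordlist links are titled exactly 'Wikipedia' and 'vocabulary'
ref_list = [(title, url) for (title, url) in ref_list
            if title not in ('Wikipedia', 'vocabulary')]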

Fetching word pages and their media files

With the word and URL list from the last step in hand, this step is a little bit harder but still easy to do.

Downloading the word HTML pages is straightforward, while getting the referenced media files is a little more involved. But once we look at it in code, it should also turn out to be easy.

The first sub-step here is to get the URLs of the referenced media files.

# collect pronunciation audio references from the word page
audio_bs_list = html_bs.find_all('audio')
for audio_bs in audio_bs_list:
    source_bs_list = audio_bs.find_all('source')
    for source_bs in source_bs_list:
        url = source_bs['src']
        if url.startswith('//'):
            url = the_url.split(':', 1)[0] + ':' + url
        elif url.startswith('http'):
            # already an absolute URL, nothing to do
            pass
        else:
            print('New URL with unsupported pattern: %s' % url)
            continue

        content_type = source_bs['type'].split('/')[0]
        ref_list.append((url, content_type, mediadir_path))


# collect illustration images (thumbnails) referenced by the word page
img_bs_list = html_bs.find_all('img', class_='thumbimage')
for img_bs in img_bs_list:
    img_src = img_bs['src']
    img_srcset = img_bs.get('srcset', None)

    url_list = [img_src]
    if img_srcset:
        src_list = [x.strip() for x in img_srcset.split(',')]
        for src in src_list:
            if src:
                url = src.rsplit(' ', 1)[0]
                url_list.append(url)

    for url in url_list:
        if url.startswith('//'):
            url = the_url.split(':', 1)[0] + ':' + url
        elif url.startswith('http'):
            # already an absolute URL, nothing to do
            pass
        else:
            print('New URL with unsupported pattern: %s' % url)
            continue

        content_type = 'image'
        ref_list.append((url, content_type, mediadir_path))

And the second sub-step is to download those URLs.

Downloading the URLs is the least clever part here; do it any way you like. In Python, multiprocessing.Pool(MAX_PROCESS_NUM).map(download_wiktionary_page_ref, current_download_list) combined with a sleep(1 + random() * 2) in the worker can be a solution. Or you can use any other download tool and just feed it the list of URLs.
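
For illustration, here is a minimal sketch of such a pool-based downloader. The worker name download_wiktionary_page_ref and MAX_PROCESS_NUM follow the sentence above; the use of urllib.request.urlretrieve and the flat target-path convention are my assumptions (you could just as well reuse HTTPDownloader.download_and_save from earlier):

import os
from multiprocessing import Pool
from random import random
from time import sleep
from urllib.request import urlretrieve

MAX_PROCESS_NUM = 4  # assumption: a small pool keeps the load on the servers polite

def download_wiktionary_page_ref(ref):
    # ref is a (url, content_type, mediadir_path) tuple as collected above
    url, content_type, mediadir_path = ref
    local_path = os.path.join(mediadir_path, url.rsplit('/', 1)[-1])
    if not os.path.exists(local_path):
        # note: some servers reject the default Python User-Agent;
        # swap in your own downloader here if that happens
        urlretrieve(url, local_path)
        sleep(1 + random() * 2)  # random pause between requests

if __name__ == '__main__':
    current_download_list = []  # fill with the (url, content_type, mediadir_path) entries
    with Pool(MAX_PROCESS_NUM) as pool:
        pool.map(download_wiktionary_page_ref, current_download_list)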

Processing the downloaded HTML files

This is the relatively difficult part, but fortunately the HTML structure of Wiktionary word pages is not that bad. With the help of BeautifulSoup4, it was still doable for me within one or two days.

The first question here is: what to keep? After a careful check of the HTML structure of Wiktionary word pages, I ended up with the following extraction code.

# a headword is wrapped either in <strong class="headword"> or in a leading <b>, directly inside a <p>
def is_headword_wrapper_tag(tag):
    return (tag.parent and tag.parent.name == 'p') and (
            (tag.name == 'strong' and tag.has_attr('class') and ('headword' in tag['class'])) or (
                tag.name == 'b' and (
                    tag.previous_sibling is None
                    or tag.previous_element == tag.parent
                )
            )
        )

# an etymology section starts with an <h3> whose headline span text begins with 'Etymology'
def is_etymology_line_tag(tag):
    return tag.name == 'h3' and tag.find('span', string=re.compile('^Etymology'))


# SECTION: Pronunciation
pronunciation_tags = mw_content_tag.find_all('span', class_='mw-headline', string='Pronunciation')
for pronunciation_tag in pronunciation_tags:
    pronunciation_headlines = []
    pronunciation_line_tag_names = ['h2', 'h3', 'h4', 'h5']

    pronunciation_line_tag = pronunciation_tag.parent
    if pronunciation_line_tag.name in pronunciation_line_tag_names:
        current_line_tag = pronunciation_line_tag

        while True:
            tag_name_index = pronunciation_line_tag_names.index(current_line_tag.name)
            if tag_name_index == 0:
                break

            parent_line_tag = current_line_tag.find_previous_sibling(pronunciation_line_tag_names[tag_name_index - 1])
            # assert parent_line_tag exists
            parent_line_tag_headline = parent_line_tag.find(class_='mw-headline').string
            pronunciation_headlines.append(parent_line_tag_headline)

            current_line_tag = parent_line_tag
    else:
        print(f'pronunciation_line_tag {pronunciation_line_tag}')
        sys.exit('Found pronunciation tag which is in fact not.')

    if pronunciation_headlines and pronunciation_headlines[-1] == 'English':
        pronunciation_content_list_tag = pronunciation_line_tag.find_next_sibling('ul')


# SECTION: Images
illustration_image_tags = mw_content_tag.find_all('img', class_='thumbimage')
for illustration_image_tag in illustration_image_tags:
    illustration_line_tag = illustration_image_tag.find_parent('div', class_=re.compile(r'(^|\s)thumb(\s|$)'))


# SECTION: Definitions
definition_pending_tags = []
definition_headword_tags = mw_content_tag.find_all(is_headword_wrapper_tag, string=re.compile(f'^{keyword}$'))
for definition_headword_tag in definition_headword_tags:
    definition_line_tag = definition_headword_tag.parent
    language_section_line_tag = definition_line_tag.find_previous_sibling('h2')
    if not language_section_line_tag:
        print(definition_line_tag)
        sys.exit('Found definition headword tag which is in fact not definition headword tag')

    language_section_headline_tag = language_section_line_tag.find('span', 'mw-headline')
    if not language_section_headline_tag or language_section_headline_tag.get_text() != 'English':
        continue

    selected_leading_tags = []

    pos_line_tag = definition_line_tag.previous_sibling
    max_number_of_iteration = 10
    current_number_of_iteration = 0
    while pos_line_tag:
        current_number_of_iteration += 1
        if current_number_of_iteration > max_number_of_iteration:
            print(definition_line_tag, pos_line_tag)
            sys.exit('Unable to find headline line')

        pos_line_tag_names = ['h3', 'h4', 'h5']
        if hasattr(pos_line_tag, 'name') and pos_line_tag.name in pos_line_tag_names:
            break

        pos_line_tag = pos_line_tag.previous_sibling

    if not pos_line_tag:
        print(definition_line_tag)
        sys.exit('Unable to find headline line')

    selected_leading_tags = [pos_line_tag] + selected_leading_tags

    etymology_line_tag = pos_line_tag.find_previous_sibling(is_etymology_line_tag)

    if etymology_line_tag and (etymology_line_tag not in definition_pending_tags):
        etymology_content_tags = []
        etymology_content_tag = etymology_line_tag.next_sibling
        max_number_of_iteration = 20
        current_number_of_iteration = 0
        while etymology_content_tag:
            current_number_of_iteration += 1
            if current_number_of_iteration > max_number_of_iteration:
                print(definition_line_tag, etymology_content_tag, etymology_content_tag.name)
                sys.exit('Unable to end etymology content tag')

            if hasattr(etymology_content_tag, 'name') and etymology_content_tag.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                break

            # skip thumbnail divs here; illustrations are collected in their own section
            if getattr(etymology_content_tag, 'name', None) == 'div' and \
                    'thumb' in (etymology_content_tag.get('class') or []):
                pass
            else:
                etymology_content_tags.append(etymology_content_tag)

            etymology_content_tag = etymology_content_tag.next_sibling

        selected_leading_tags = [etymology_line_tag] + etymology_content_tags + selected_leading_tags

    selected_following_tags = []

    explanation_list_line_tag = definition_line_tag.next_sibling
    max_number_of_iteration = 10
    current_number_of_iteration = 0
    while explanation_list_line_tag:
        current_number_of_iteration += 1
        if current_number_of_iteration > max_number_of_iteration:
            print(definition_line_tag, explanation_list_line_tag)
            sys.exit('Unable to find explanation list line')

        explanation_list_line_tag_names = ['ol']
        if hasattr(explanation_list_line_tag, 'name') and explanation_list_line_tag.name in explanation_list_line_tag_names:
            break

        explanation_list_line_tag = explanation_list_line_tag.next_sibling

    if not explanation_list_line_tag:
        print(definition_line_tag)
        sys.exit('Unable to find explanation list line')

    selected_following_tags.append(explanation_list_line_tag)

    for selected_leading_tag in selected_leading_tags:
        definition_pending_tags.append(selected_leading_tag)
    definition_pending_tags.append(definition_line_tag)
    for selected_following_tag in selected_following_tags:
        definition_pending_tags.append(selected_following_tag)


# SECTION: Removing unnecessary
mw_editsection_tags = output_bs.find_all('span', class_='mw-editsection')
for mw_editsection_tag in mw_editsection_tags:
    mw_editsection_tag.extract()

sister_wikipedia_tags = output_bs.find_all('div', class_=re.compile(r'(^|\s)sister-wikipedia(\s|$)'))
for sister_wikipedia_tag in sister_wikipedia_tags:
    sister_wikipedia_tag.extract()

sup_tags = output_bs.find_all('sup')
for sup_tag in sup_tags:
    sup_first_child_or_sibling_tag = sup_tag.next_element
    if hasattr(sup_first_child_or_sibling_tag, 'name') and sup_first_child_or_sibling_tag.name == 'a' \
        and sup_first_child_or_sibling_tag.get_text().startswith('[') \
        and sup_first_child_or_sibling_tag.get_text().endswith(']') \
        and (
            (sup_first_child_or_sibling_tag.get('href') or '').startswith('#')
            or 'external' in (sup_first_child_or_sibling_tag.get('class') or '')
        ):
        sup_tag.extract()


# SECTION: output and regex string replacement
# do it as you like
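
As a rough sketch of what this last part can look like, the snippet below rewrites the remaining media references in output_bs to local paths and writes the result out. The flat media/ directory layout and the output_file_path variable are assumptions; adapt them to however you stored the downloaded files:

# point media references at the locally downloaded copies (assumed to live in 'media/')
for media_tag in output_bs.find_all(['img', 'source']):
    src = media_tag.get('src')
    if src:
        media_tag['src'] = 'media/' + src.rsplit('/', 1)[-1]
    if media_tag.has_attr('srcset'):
        # drop srcset to keep the local page simple (or rewrite its URLs the same way)
        del media_tag['srcset']

output_html = str(output_bs)
# example of a regex string replacement: turn leftover protocol-relative links absolute
output_html = re.sub(r'(src|href)="//', r'\1="https://', output_html)

with open(output_file_path, 'w') as output_fileobj:
    output_fileobj.write(output_html)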
