JH-Articles: Canonical URLs

Canonical URLs reduce the confusion caused by necessary duplication.

Foreword

This is the first release note of JH-Articles that I have decided to post on this blog, so let me first explain the name "JH-Articles".

The product JH-Articles started more than 6 years ago, after I decided to build my own blog website from the ground up instead of continuing to post on hosted platforms such as CSDN, Google BlogSpot, and Google App Engine + FOSS (AppSpot).

After several years of development, I have deployed it multiple times under different website names, currently mainly JH-Articles, JH-Catalogues and RhoPhotos-Articles. All of these websites run the same product, just with different configurations. In other words, JH-Articles as the product name stands for all of these websites. It is worth mentioning that JH-Blog (the product behind the current website) is a separate product, although it originated from the same code base as JH-Articles and shares a lot with it, such as the GUI designs.

The biggest difference between JH-Articles and JH-Blog is the information structure. Posts on JH-Articles are indexed hierarchically in nested topics, while posts on JH-Blog are indexed by language, then topic and finally date. This also means that posts on JH-Articles are generally not time-stamped, while posts on JH-Blog are strongly tied to their dates.

Of course, this is not the first time I have restructured my posts on the Internet. Over the past several weeks, I have spent quite some time "migrating" my JH-Articles websites to new URLs, which are all under https://www.johannhuang.com/, in order to aggregate my page-view (PV) statistics and to be serious about web posting. As a result, JH-Articles, JH-Catalogues and RhoPhotos-Articles are now served at https://www.johannhuang.com/articles/, https://www.johannhuang.com/catalogues/ and https://www.rhophotos.com/articles/ instead of at separate subdomains such as articles.johannhuang.com and catalogues.johannhuang.com (as announced in my tweet at https://twitter.com/johannhuang_com/status/1453493629251620868).

What and Why?

In general, canonical URLs are necessary to reduce confusion for search engines when multiple copies of the same content are published on different distribution platforms, and to give readers a clue for finding the most authoritative copy.

It is still worth mentioning that, in my case, being serious about web posting also means I will do some marketing later, instead of just relying on the strategy of "酒香不怕巷子深" (Good wine needs no bush).

Unlike JH-Blog, posts on JH-Articles are not time-stamped, so it is not possible to use the timestamp as a unique ID for each post. And considering the possibility of renaming, content updates, removal and even re-categorization, it is not so easy to find a stable ID for each post. (This is much the same issue as the one Zettlr describes for file identification at https://docs.zettlr.com/en/academic/zkn-method/#file-identification; I mention it here as a nod to Zettlr, although I mainly use Obsidian to write my Markdown notes now.)

In the past, I implemented an algorithm that generates article IDs such as "qia-browser-libraries--2562351" under categories such as "++Qia-Software". However, it was designed for cache invalidation (as indicated by the bottom line of articles on JH-Articles, such as "* cached version, generated at 2021-11-06 12:53:52 UTC"), and the IDs are stable only in some cases. That is to say, if I have updated the content of a post, the article ID will most likely change as well when I update the index.

To avoid confusing search engines such as Google when re-posting on marketing platforms such as Medium, it is recommended to provide a canonical URL pointing to the corresponding original post. (I guess this is at least partially because platforms such as Medium are much more popular than an individual blog: if no canonical URL is provided, search engines would tend to treat the copy on the marketing platform as the original post, which is usually not what is wanted.)
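On pages where the author controls the HTML, a canonical URL is commonly declared with a link element of rel="canonical" in the page head. The following is only a minimal Python sketch of producing such an element; the helper name canonical_link_tag and the example.com URL are hypothetical and not taken from JH-Articles.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build a <link rel="canonical"> element for the page <head>.

    Hypothetical helper for illustration; the URL is HTML-escaped so that
    characters such as "&" in a query string stay valid inside the attribute.
    """
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Placeholder URL, not a real post address.
print(canonical_link_tag("https://example.com/articles/some-post"))
```

A re-posted copy carrying this element in its head tells crawlers which URL should be treated as the original.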

How do I generate the Canonical URLs?

For the sake of marketing, I need to work out a sufficiently stable way to generate a super stable ID that identifies a post even when its article ID has changed, so that I can build a canonical URL for the post.

The good news is that JH-Articles is not a product for mass production; it is only supposed to be used by a small group of people, mainly myself. Therefore, I can simply provide a unique ID for each post manually, as metadata.

So what kind of ID should I use to keep it unique? My first idea was to use a meaningless ID, because if it is meaningless, then I have no urge to touch it. In contrast, if I use an ID with a clear meaning such as "canonical-url", I may really want to change it in the future, if by then I like "canonical-link" or "Canonical-URL" much better or I have extended the scope of the post. What if the meaningless ID still carries some degree of meaning, but one that nobody usually cares about? Here, the UUID comes to the rescue.

UUIDs are also great because they are universally supported by programming languages and are therefore quite cheap to produce. They are also usually of equal length and so look neat, which matches the design language of JH-Articles quite well.
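To show how cheap this is, here is a small sketch using Python's standard uuid module. The choice of a random (version 4) UUID and the function name new_post_uuid are assumptions for illustration; the post does not depend on a particular UUID version.

```python
import uuid


def new_post_uuid() -> str:
    """Return a random (version 4) UUID in its usual 36-character text form.

    Assumption: version 4 is used here only as an example.
    """
    return str(uuid.uuid4())


# Generate once, then paste the value into the post's metadata by hand.
print(new_post_uuid())
```

The value is generated once per post and never regenerated, which is what keeps it stable across renames and content updates.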

And the issue is resolved!

As a result, I manually add a metadata item to each article post, and then I get a canonical URL when I publish the article on my websites. One great thing is that when users or search engine spiders visit the canonical URL, they are always redirected (HTTP 302) to the up-to-date version of the post through an internal mapping in JH-Articles, no matter which URL of the post (category and article ID) ends up in the address bar.
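The redirect described above can be sketched as a lookup from the stable UUID to the current category and article ID. This is not the actual JH-Articles code: the index structure, the function name resolve_canonical, the /articles/... URL layout and the sample UUID are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Hypothetical index: stable UUID from the post metadata -> current
# (category, article ID). It would be rebuilt whenever the index is updated.
canonical_index: Dict[str, Tuple[str, str]] = {
    "123e4567-e89b-12d3-a456-426614174000": (  # sample UUID, not a real post
        "++Qia-Software",
        "qia-browser-libraries--2562351",
    ),
}


def resolve_canonical(post_uuid: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) for a request to a canonical URL."""
    entry = canonical_index.get(post_uuid.lower())
    if entry is None:
        return 404, None
    category, article_id = entry
    # Assumed URL layout; percent-encode so characters like "+" survive.
    return 302, f"/articles/{quote(category)}/{quote(article_id)}"


print(resolve_canonical("123e4567-e89b-12d3-a456-426614174000"))
```

A temporary redirect fits this design because the target may change again the next time the index is updated, while the canonical URL itself stays the same.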

Afterword

This is the first release note I have written on this blog, and also my first practice after I became aware that release notes are the best example of posts that have a strong relationship to timestamps.

I will do more in the future to better introduce my products to the world and also to enrich my blog. (To be honest, I have had difficulty finding proper content to put on this blog since I separated my blog from my articles.)


* cached version, generated at 2023-12-12 01:35:54 UTC.
