Time Series Data Analysis and Visualization: Git Commit Timestamps as An Example
- UUID: 06494758-96ed-4f04-bac9-c1e855d883cd
- This article is essentially a new use case (Git commits) of the same time series data analysis and visualization techniques used in my other article, JH-Articles: Data and Visualization of Making My PONS Dictionary.
- 20240111.Added this article
Timestamps are not only universal but also interesting in many cases. Just as time proceeds every second without a stop, timestamp data keeps accumulating as time goes by.
Time is a human concept that we use to mark the progress of our universe. Because not all human beings live in one place, we have also invented further concepts within the scope of time, such as years, months, days and time zones.
Time can of course be tied to events. But a time series by itself is already a very fruitful topic, so in this article let's focus on pure timestamps.
Digital Representation of Time
A timestamp, in my impression, is mainly a computer concept: the number of seconds elapsed since Thursday 1 January 1970 00:00:00 UTC, a point in time known as the Unix epoch. One good thing about the Unix timestamp is that it is based on UTC and is therefore time-zone independent. But as humans, in daily life, we think of time in relation to daylight, so we have created time zones to allow people all around the globe to use the same number to describe noon or midnight.
Therefore, when handling timestamps, it is necessary to consider time zones if the relation of a time point to daylight is of more interest.
Take one of my Git projects, Qia-Articles, the program project for the website JH-Articles, as an example. I started using Git for the program files in 2017 in P.R. China, which uses CST (+0800) as its local time standard (which also means that people's daily life, mine included, adapts to this time representation). I continued the project after I came to Germany, which uses CET (+0100) or CEST (+0200) depending on the time of the year. To make meaningful statistics about when in a day I usually make Git commits, it is necessary to convert the Git commit timestamps into local time with the time zone taken into account.
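As a minimal sketch of this point, the following Python snippet renders one Unix timestamp in three zones. The instant and the fixed-offset zone objects are assumptions made for illustration; they are not taken from the project data.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones for illustration only; the names are just labels.
CST = timezone(timedelta(hours=8), "CST")
CET = timezone(timedelta(hours=1), "CET")


def describe(epoch_seconds: int, tz: timezone) -> str:
    """Render a Unix timestamp as wall-clock time in the given zone."""
    moment = datetime.fromtimestamp(epoch_seconds, tz)
    return moment.strftime("%Y-%m-%d %H:%M:%S,%z %Z")


# An arbitrary instant chosen for the example (20:30 UTC on 1 January 2020).
instant = int(datetime(2020, 1, 1, 20, 30, tzinfo=timezone.utc).timestamp())

print(describe(instant, timezone.utc))  # 2020-01-01 20:30:00,+0000 UTC
print(describe(instant, CST))           # 2020-01-02 04:30:00,+0800 CST
print(describe(instant, CET))           # 2020-01-01 21:30:00,+0100 CET
```

The same number of seconds is an evening in Germany but already the early morning of the next day in China, which is why the zone matters for any statistic about hours of the day.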
Git Commit Timestamps
Git commit timestamps can be easily fetched using the git log command, where %ct prints the committer date as a Unix timestamp:
$ git log --all --reverse --pretty=format:'%ct'
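The same command can also be run from Python. The sketch below is one possible way, assuming git is on the PATH; the function names `parse_timestamps` and `fetch_commit_timestamps` are hypothetical helpers, not part of the original program.

```python
import subprocess


def parse_timestamps(log_output: str) -> list[int]:
    """Turn the one-number-per-line output of `git log` into integers."""
    return [int(line) for line in log_output.splitlines() if line.strip()]


def fetch_commit_timestamps(repo_path: str = ".") -> list[int]:
    """Run the git log command shown above inside repo_path."""
    result = subprocess.run(
        ["git", "log", "--all", "--reverse", "--pretty=format:%ct"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return parse_timestamps(result.stdout)
```

Calling `fetch_commit_timestamps("path/to/repo")` would return the committer timestamps as a list of integers, in the order printed by git.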
With some Python magic, the timestamps can be transformed into something like the following.
2017-10-14 13:48:20,+0800 CST
2017-10-21 17:28:50,+0800 CST
2017-10-21 19:04:01,+0800 CST
2019-08-02 00:24:14,+0800 CST
2020-03-01 14:26:54,+0100 CET
2020-03-27 16:08:38,+0100 CET
2020-03-27 16:25:26,+0100 CET
2020-03-27 16:48:00,+0100 CET
2020-05-05 17:37:16,+0200 CEST
2020-05-14 12:16:39,+0200 CEST
2020-05-14 12:17:03,+0200 CEST
These are much easier for humans to read and interpret, for example to see how many commits were made on a certain day or at which hour certain commits were made. To keep the time series data analysis and visualization focused, let's use only the Datetime column as the dataset and leave out the Timezone column.
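The "Python magic" is not shown in this article. One way it could look is sketched below with the standard zoneinfo module (Python 3.9+, which needs time zone data available on the system). The relocation cutoff `ASSUMED_MOVE` is a hypothetical date chosen only so that the example reproduces rows from the listing above; the real date is not given here.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

SHANGHAI = ZoneInfo("Asia/Shanghai")
BERLIN = ZoneInfo("Europe/Berlin")

# Hypothetical relocation date: earlier commits are shown in Chinese local
# time, later ones in German local time. This is an assumption.
ASSUMED_MOVE = datetime(2019, 9, 1, tzinfo=timezone.utc)


def to_local_row(epoch_seconds: int) -> str:
    """Format a Unix timestamp as 'yyyy-mm-dd hh:mm:ss,+hhmm ZONE'."""
    utc_moment = datetime.fromtimestamp(epoch_seconds, timezone.utc)
    zone = SHANGHAI if utc_moment < ASSUMED_MOVE else BERLIN
    return utc_moment.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S,%z %Z")


def epoch(*utc_fields: int) -> int:
    """Helper for the demo: build a Unix timestamp from UTC date fields."""
    return int(datetime(*utc_fields, tzinfo=timezone.utc).timestamp())


print(to_local_row(epoch(2017, 10, 14, 5, 48, 20)))   # 2017-10-14 13:48:20,+0800 CST
print(to_local_row(epoch(2020, 3, 1, 13, 26, 54)))    # 2020-03-01 14:26:54,+0100 CET
print(to_local_row(epoch(2020, 5, 5, 15, 37, 16)))    # 2020-05-05 17:37:16,+0200 CEST
```

Using a named zone such as Europe/Berlin instead of a fixed offset lets the library pick CET or CEST according to the date.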
Time Series Data Analysis
What are the interesting questions we can ask of the dataset? I think at least the following.
How many commits have I made per day along the timeline?
This question also touches on which days: are there big gaps, i.e. sequences of days with 0 commits, or are there big chunks of days with many commits?
How to answer this question? Just aggregate (count) the date-time data by the first 10 characters (yyyy-mm-dd). For better human consumption, let's look at the result in the visualization part later.
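A minimal Python sketch of this aggregation follows, using the Datetime column of the sample rows shown earlier. The helper names `commits_per_day` and `longest_gap` are made up for this example, and `longest_gap` assumes at least two distinct commit days.

```python
from collections import Counter
from datetime import date

# Datetime column of the sample rows listed earlier in the article.
rows = [
    "2017-10-14 13:48:20", "2017-10-21 17:28:50", "2017-10-21 19:04:01",
    "2019-08-02 00:24:14", "2020-03-01 14:26:54", "2020-03-27 16:08:38",
    "2020-03-27 16:25:26", "2020-03-27 16:48:00", "2020-05-05 17:37:16",
    "2020-05-14 12:16:39", "2020-05-14 12:17:03",
]


def commits_per_day(rows: list[str]) -> Counter:
    """Count rows by their first 10 characters (yyyy-mm-dd)."""
    return Counter(row[:10] for row in rows)


def longest_gap(rows: list[str]) -> tuple[date, date]:
    """Return the two consecutive commit days that are farthest apart."""
    days = sorted({date.fromisoformat(row[:10]) for row in rows})
    return max(zip(days, days[1:]), key=lambda pair: pair[1] - pair[0])


per_day = commits_per_day(rows)
print(per_day["2020-03-27"])  # 3
print(longest_gap(rows))      # (datetime.date(2017, 10, 21), datetime.date(2019, 8, 2))
```

Days without commits do not appear in the counter at all, so a timeline plot would still need to fill them in with 0.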
At what hours did I make most commits or fewest commits?
Hours on different days are usually not equivalent either: 9 am on a workday is usually quite different from 9 am on a weekend, and the same holds between a workday and a holiday. As for frequency, both the average and the sum are good indicators; I prefer the sum as it is an integer. So the date-time data can be grouped by the 12th and 13th characters (hh), with the first 10 characters used as metadata to retrieve the day of the week.
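The grouping described above can be sketched in Python as follows, again on the sample rows. The function name `hourly_counts` is hypothetical, and holidays are not handled: only Saturday and Sunday are treated as non-working days.

```python
from collections import Counter
from datetime import date

# Datetime column of the sample rows listed earlier in the article.
rows = [
    "2017-10-14 13:48:20", "2017-10-21 17:28:50", "2017-10-21 19:04:01",
    "2019-08-02 00:24:14", "2020-03-01 14:26:54", "2020-03-27 16:08:38",
    "2020-03-27 16:25:26", "2020-03-27 16:48:00", "2020-05-05 17:37:16",
    "2020-05-14 12:16:39", "2020-05-14 12:17:03",
]


def hourly_counts(rows: list[str]) -> tuple[Counter, Counter]:
    """Sum commits per hour, separately for weekdays and weekends."""
    weekday, weekend = Counter(), Counter()
    for row in rows:
        hour = int(row[11:13])              # 12th and 13th characters (hh)
        day = date.fromisoformat(row[:10])  # first 10 characters (yyyy-mm-dd)
        # date.weekday() returns 0 for Monday up to 6 for Sunday.
        target = weekend if day.weekday() >= 5 else weekday
        target[hour] += 1
    return weekday, weekend


weekday, weekend = hourly_counts(rows)
print(sorted(weekday.items()))  # [(0, 1), (12, 2), (16, 3), (17, 1)]
print(sorted(weekend.items()))  # [(13, 1), (14, 1), (17, 1), (19, 1)]
```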
Time Series Data Visualization
Daily Numbers of Commits Along the Timeline
Hourly Numbers of Commits (Separated in Weekday and Weekend to Compare Between)
Hourly Numbers of Commits (Separated in Years to Compare Among)
When should a day start?
Usually this is not a question, but sometimes it is. I first became aware of it when I came across the option in Anki (a flashcard program that helps with remembering things). The default boundary was 4 am as far as I can remember, which is quite reasonable as people are commonly asleep at 4 am.
4 am seems to be a good separation point. But there is another way of thinking, especially for people who live in mainland Europe. In Germany, there are the CET and CEST time zones, and the switch between them happens between 2 am and 3 am. From summer time to winter time, 02:59+0200 is followed by 02:00+0100; from winter time to summer time, 01:59+0100 is followed by 03:00+0200. Therefore, 3 am seems to be a good starting point for a new day, as some people really do stay awake later than 2 am.
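A shifted day boundary is easy to apply to the dataset: subtract the boundary hour before taking the date. The sketch below assumes rows in the `yyyy-mm-dd hh:mm:ss` format used above; `logical_day` is a name made up for this example.

```python
from datetime import date, datetime, timedelta


def logical_day(row: str, day_start_hour: int = 4) -> date:
    """Return the day a commit belongs to if days start at day_start_hour."""
    moment = datetime.strptime(row, "%Y-%m-%d %H:%M:%S")
    # Shifting the clock back moves late-night commits to the previous day.
    return (moment - timedelta(hours=day_start_hour)).date()


print(logical_day("2019-08-02 00:24:14"))  # 2019-08-01
print(logical_day("2020-03-27 16:08:38"))  # 2020-03-27
```

With this rule, the commit made at 00:24 on 2019-08-02 counts towards 2019-08-01, which matches the feeling of "still the same working day".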
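The clock jumps described here can be checked with Python's zoneinfo module (Python 3.9+, with time zone data available on the system). The year 2020 is picked only as an example; in that year the switches fell on 29 March and 25 October.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def berlin_clock(utc_moment: datetime) -> str:
    """Show a UTC instant as German wall-clock time with its UTC offset."""
    return utc_moment.astimezone(BERLIN).strftime("%H:%M%z")


one_minute = timedelta(minutes=1)
# The last minute before each switch in 2020, expressed in UTC.
spring = datetime(2020, 3, 29, 0, 59, tzinfo=timezone.utc)
autumn = datetime(2020, 10, 25, 0, 59, tzinfo=timezone.utc)

print(berlin_clock(spring), "->", berlin_clock(spring + one_minute))  # 01:59+0100 -> 03:00+0200
print(berlin_clock(autumn), "->", berlin_clock(autumn + one_minute))  # 02:59+0200 -> 02:00+0100
```

In spring the local hour from 02:00 to 02:59 does not exist, and in autumn it occurs twice, so a day boundary at 3 am never falls into the skipped or repeated hour.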