Create your own ‘Reading Stats’ page for medium.com using Python

Part 1: Extracting the data from articles read

Richard Quinn
5 min read · Jul 15, 2020
Photo by Carlos Muza on Unsplash

I have known about ‘medium’ for some time, but have always been a free-to-play member. I would use my 5 free reads on the first day of the month and then bookmark loads of articles I intended to read the following month. I am probably the perfect candidate for transitioning to paid membership, so that’s what I did a couple of weeks ago.

I started consuming huge amounts of content (as I’m sure everyone does at the beginning of their membership), and when I looked at the stats the site provides, I found they are very focused on the writers. I realised there was no gamification in the reading. Where’s my ‘read 10 articles’ badge! It sounds silly, but this would motivate me to get the most out of my membership.

I started to think about what a stats page would look like. We have the ‘recently viewed’ articles, which is just a meaningless list. I would like to know how many articles I’ve read this month, last month, the last 6 months and even for all time. I want to know which topics I read about the most, and whether I read excessively from just one publisher. Important to me is the value I receive from my subscription fee, so I’d like to see how many articles I read for £5, and how that equates to cost-per-article.

Medium sends out a few emails when you first sign up, and one of them invites requests for improvements, so I shared this feedback with medium. I doubt this will result in a fancy new ‘reading stats’ section, but I suppose it might. Instead of waiting to see if it ever arrives I decided to write a script that would result in a stats page for my own personal use. You can do the same by following along with the code (although, very likely I am the only person who cares about this!).

The first thing to consider was how to get the information about what I’m reading to my script. I messed about with the idea of scraping the ‘recently viewed’ list, but often I will go into an article to have a look, realise that the content is not for me and immediately leave; that article is then added to the list, but I wouldn’t want to include it in my stats. I thought about perhaps writing a Chrome plugin, but rejected this idea since I do a lot of my reading on mobile. The solution I landed on does require a user task: to share the article via email. The script then monitors an inbox for these shared links.

Mailbox monitoring

I signed up for a new gmail account which will only be used for sharing medium stories. Initially I struggled to connect to the inbox via POP3 and kept getting authentication errors. This is because you have to change a setting in the google account called ‘Allow less secure apps’; once this has been set, you’ll have no problem accessing the mailbox. The emails are accessed with the Mailbox.retr() function, which returns the message in several small chunks, and occasionally I have seen it split a link with a trailing = character. This is why we join the raw list and remove that character. We now have a list of all the hyperlinks to the stories I’ve read since the script was last run.
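As a rough sketch, the mailbox check could look something like this, using Python’s built-in poplib (the account details, the fetchStoryLinks name and the link regex are placeholders to swap for your own):

import poplib
import re

# Placeholder details for the dedicated gmail account; swap in your own
POP_HOST = "pop.gmail.com"
POP_USER = "my.reading.stats@gmail.com"
POP_PASS = "account-password-here"

def fetchStoryLinks():
    """Connect to the inbox over POP3 and return any medium links found."""
    mailbox = poplib.POP3_SSL(POP_HOST)
    mailbox.user(POP_USER)
    mailbox.pass_(POP_PASS)

    links = []
    messageCount = len(mailbox.list()[1])
    for i in range(messageCount):
        # retr() returns the message body as a list of byte strings, one per line
        rawLines = mailbox.retr(i + 1)[1]
        lines = [line.decode("utf-8", errors="ignore") for line in rawLines]
        # A long link is sometimes split across lines with a trailing '=' character,
        # so glue those continuations back together before searching for links
        body = ""
        for line in lines:
            body += line[:-1] if line.endswith("=") else line + "\n"
        links.extend(re.findall(r"https://medium\.com/\S+", body))

    mailbox.quit()
    return list(set(links))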

We then set up an sqlite3 database to locally store information about the stories read.

Create SQLite database and connect
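A minimal sketch of that setup, assuming one table keyed on the story url, with columns for the fields we extract later (the file name, table name and column layout are my own choices):

import sqlite3

# One row per story, keyed on the story url; the schema here is a sketch
conn = sqlite3.connect("medium_reads.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        url         TEXT PRIMARY KEY,
        title       TEXT,
        description TEXT,
        author      TEXT,
        tags        TEXT,
        publisher   TEXT,
        month       INTEGER,
        year        INTEGER
    )
""")
conn.commit()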

But the data I’m referring to in this table is not available in the share link from the email, so we need to do a bit of webscraping.

Setup for Selenium
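Something along these lines works, assuming the 2020-era Selenium API where the driver path is passed as executable_path (newer releases expect a Service object instead); the chromePath value below is just a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Full path to the chromedriver executable (a placeholder; see the notes below)
chromePath = "/Users/richard/webdrivers/chromedriver"

options = Options()
options.add_argument("--headless")   # run Chrome without opening a window
driver = webdriver.Chrome(executable_path=chromePath, options=options)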

This can be tricky to get right. Firstly, download the chromedriver executable and save it somewhere sensible, then add that directory to your PATH. PATH is a system-wide variable that holds a list of directories; when a command is passed to the system, it looks through each of the directories in PATH to see if one of them contains the command.

The easiest way to add a directory to PATH on a Mac (for Windows look here) is to open a terminal window and type: sudo nano /etc/paths. You’ll be asked for the admin password, and then you can edit the list of PATH directories directly. Add the path to the directory where chromedriver is saved, then press ctrl+X to save and exit. Another thing to note is that the chromePath variable in the code above needs to be the full path to the executable (including the actual file, not just the directory), and that on a Mac it has no file extension.

Next we define a couple of helper methods for the scraping.

helper functions for data extraction

Scraping text from websites presents some challenges with character encoding: if any characters are returned that are not in the ASCII set, naive handling of the string can throw a UnicodeDecodeError. I messed around with decoding the returned string as utf-8, but I found that having a lookup table was the most consistent solution (although it does require a bit of trial and error to set up).
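A sketch of what those helpers could look like (the function names cleanText and extractFirst, and the entries in the lookup table, are illustrative):

import re

# Lookup table mapping awkward non-ASCII characters to plain stand-ins;
# built up by trial and error as new characters turn up in scraped text
charLookup = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en and em dashes
    "\u2026": "...",                 # ellipsis
}

def cleanText(text):
    """Replace known troublesome characters so the string stores cleanly."""
    for char, replacement in charLookup.items():
        text = text.replace(char, replacement)
    return text

def extractFirst(regex, source, default=""):
    """Return the first capture-group match found in source, or a default."""
    matches = re.findall(regex, source)
    return cleanText(matches[0]) if matches else default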

Next we define the regex terms we’ll be looking for in the medium story.

titleRegex = r'<meta data-rh="true" property="og:title" content="(.*?)"'
descRegex = r'<meta data-rh="true" name="description" content="(.*?)"'
urlRegex = r'<meta data-rh="true" property="og:url" content="(.*?)"'
tagsRegex = r'"keywords":\[(.*?)\]'
authorRegex = r'"author":{.+?"name":"(.*?)"'

Regex terms can be complicated, but because the information we are searching for is laid out very consistently, we can just use a long literal string that points us to the exact location, followed by a capturing group, i.e. (.*?). Whatever is found inside the parentheses is the string returned by the findall() function. The . character matches any character, the * says there can be zero or more of them, and the trailing ? makes the match non-greedy, so the minimum number of characters is captured between the long search term at the start and the final character of the regex term.
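For example, run against a made-up snippet of page source, findall() hands back just the captured title:

sample = '<meta data-rh="true" property="og:title" content="My Story Title"'
re.findall(titleRegex, sample)   # returns ['My Story Title']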

Search through the article source code and extract relevant bits

So what we are doing here is, for each link in our list, opening the link with our headless chromedriver (headless meaning it doesn’t display anything) and performing all the regex searches using the helper functions we looked at above. The tags regex returns a mix of tags, genres and the publisher, so we split that information out using a list comprehension. The final step is to search our database for a row that already uses this story’s url; if there isn’t one, we save all the information we have found. We also insert the month and year that we extracted from datetime. (I started by regexing the incoming email for the date received, but I figured that if I set the script to run daily, it would be the same anyway.)
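Putting it together, the loop looks roughly like this, reusing the placeholder names from the sketches above (fetchStoryLinks, extractFirst, cursor, driver). The tag handling here is simpler than the split described above: the cleaned-up keyword list is stored as one string and the publisher column is left blank.

from datetime import datetime

now = datetime.now()

for link in fetchStoryLinks():
    driver.get(link)
    source = driver.page_source

    url = extractFirst(urlRegex, source, default=link)

    # Skip stories that are already logged in the database
    cursor.execute("SELECT 1 FROM stories WHERE url = ?", (url,))
    if cursor.fetchone():
        continue

    title = extractFirst(titleRegex, source)
    desc = extractFirst(descRegex, source)
    author = extractFirst(authorRegex, source)

    # The keywords capture is a comma-separated list of quoted strings;
    # here it is only tidied up, not split into tags, genre and publisher
    rawTags = extractFirst(tagsRegex, source)
    tags = [t.strip().strip('"') for t in rawTags.split(",") if t.strip().strip('"')]

    cursor.execute(
        "INSERT INTO stories VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (url, title, desc, author, ",".join(tags), "", now.month, now.year),
    )
    conn.commit()

driver.quit()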

We now have a system for logging our medium reads, and a script that extracts information about each article read. The data is ready to be presented, which will be the topic of part 2 of this story.

If you got this far…thanks so much for reading!
