HoroscopeFS: A virtual filesystem in Python for horoscope aggregation
This is an experiment in writing a virtual filesystem in Python using fusepy.
The entire code is available on my github page.
What does this application do ?
Aggregates daily, weekly and monthly horoscopes based on your sunsign and/or moonsign from various horoscope websites and makes them accessible under a single directory in a predictable tree like directory structure.
Let’s get started.
How to run it ?
$ mkdir -p /tmp/mnt; python horoscopeFS.py /tmp/mnt/ <sunsign> <moonsign>
# Run the commands below from another terminal
$ cd /tmp/mnt
$ tree /tmp/mnt
/tmp/mnt
|-- Astrosage
| |-- daily
| |-- monthly
| `-- weekly
|-- Astroyogi
| |-- daily
| |-- monthly
| `-- weekly
|-- AstroyogiCareer
| |-- daily
| |-- monthly
| `-- weekly
`-- IndianAstrology2000
|-- daily
|-- monthly
`-- weekly
You can see how the application provides a clean and consistent tree like directory structure for horoscopes aggregated from different websites. You can navigate directories and read files just like you would any other directory or file.
The usual imports
import os
import sys
import bs4
import fuse
import tempfile
import requests
import textwrap
import argparse
-
bs4 - Beautiful Soup for parsing HTML.
-
tempfile - To create temporary files and directories.
-
requests - To fetch HTML contents from a website.
-
textwrap - To format text for display.
-
argparse - To parse command line arguments.
The starting point
Let’s start by parsing the command line arguments and instantiating an object that provides the filesystem functionality.
The user needs to provide a mount point for the filesystem and his/her sunsign and moonsign.
def main(mountpoint, sunsign, moonsign):
fuse.FUSE(HoroscopeFS(sunsign, moonsign),
mountpoint,
nothreads=True,
foreground=True)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("mountpoint", help="Mount point for the Virtual File System")
parser.add_argument("sunsign", help="Your sun sign")
parser.add_argument("moonsign", help="Your moon sign")
args = parser.parse_args()
main(args.mountpoint, args.sunsign.lower(), args.moonsign.lower())
The filesystem implementation
The following is the main class that implements the filesystem functionality. Without this, you won’t be able to navigate the filesystem and read data. This might seem like a hard thing to implement, but, as you’ll see as you read further, it is actually pretty simple for the application we’re writing.
Note that I’ve removed a lot of the original code (which you can follow along from my github page) in order to concentrate on things that are relevant at this stage.
class HoroscopeFS(fuse.Operations):
"""Virtual filesystem for aggregating horoscopes from various websites"""
def __init__(self, sunsign, moonsign):
pass
def getattr(self, path, fh=None):
pass
def readdir(self, path, fh):
pass
def read(self, path, length, offset, fh):
pass
Now, let’s take a moment to think about what filesystem functionality our application will need to provide to its users.
Firstly, we’ll need to provide filesystem navigation to navigate the tree like directory structure that our application will make visible to its users.
Secondly, we’ll need to provide our filesystem with the ability to list the contents of a directory, i.e, files in a directory.
Lastly, we’ll need to be able to read the contents of files.
You’d be surprised, but the last three class methods above -
getattr(), readdir() and read()
- are the only ones you’ll need to
implement to provide these filesystem functionalities.
getattr()
as you might have guessed is used to get directory/file attributes
like permissions, size, etc.
readdir()
is used to list the contents of a directory, i.e, list the files
inside a directory.
read()
is used to read the contents of a file.
Now, lets walk through how you might actually implement these.
def __init__(self, sunsign, moonsign):
# Get default stats for an empty directory and empty file.
# The temporary directory and file are automatically deleted.
with tempfile.TemporaryDirectory() as tmp_dir:
self.stat_dict_dir = \
self._convert_stat_to_dict(os.lstat(tmp_dir))
with tempfile.NamedTemporaryFile() as tmp_file:
self.stat_dict_file = \
self._convert_stat_to_dict(os.lstat(tmp_file.name))
self.sunsign = sunsign
self.moonsign = moonsign
self.dot_dirs = ['.', '..']
self.current_module = sys.modules[__name__]
self.horoscope_objs = {}
The tempfile
module is used to create a temporary directory and a temporary
file in order to get their stats which can then be replicated for all the
other directories and files in our virtual filesystem with modifications wherever
required.
self.dot_dirs
stores the default current and parent directories that
all directories have so that it can be used when a directories contents
are listed later on.
self.horoscope_objs
contains one object per horoscope website that we’re
aggregating from. It is currently empty and will be instantiated on-demand,
which is to say an object will be created for a particular website when the user
navigates into the top level directory corresponding to that website.
def getattr(self, path, fh=None):
if any(map(path.endswith, horoscope_sites)):
# For directories corresponding to the horoscope websites,
# return the default stats for directory.
return self.stat_dict_dir
elif any(map(path.endswith, horoscope_types)):
# Fetch content from the horoscope site we are looking at on-demand.
self._construct_obj_from_path(path)
# For files corresponding to the horoscope types,
# return the stats for the file with st_size set appropriately.
stat = dict(self.stat_dict_file) # Create a copy before modifying
stat['st_size'] = self._get_file_size_from_path(path)
return stat
else:
# For all other files/directories, return the stats from the OS.
return self._convert_stat_to_dict(os.lstat(path))
The getattr()
method is responsible for returning a dictionary of
directory/file attributes.
We know that any directory in our virtual filesystem ends with the name
of the horoscope website. We use this information to return the attributes
of the temporary directory which we had created during __init__()
without
any modifications whatsoever.
For files, we need to do a bit more work. The attributes of a file depend
on its contents (size of the file, for example). So, we will first need to
fetch the contents (horoscope) of the file from the horoscope website before
we can decide what the file’s attributes should be. This is exactly what the
_construct_obj_from_path()
method does. Given the full path to the file,
it infers the horoscope website from the path and instantiates an object of
the class corresponding to that horoscope website on-demand. The act of
instantiating an object fetches the horoscope from the horoscope website and
stores it for future use (we’ll see how to fetch the contents from the website
later in this post). Note that an object is instantiated only if an
object for that horoscope website doesn’t already exist. Once the contents are
fetched and stored locally, we can calculate the file size depending on the
size of the fetched contents. This is exactly what _get_file_size_from_path()
does. We then override only the size of the temporary file we created during
__init__()
and return the file’s attributes.
def readdir(self, path, fh):
if any(map(path.endswith, horoscope_sites)):
# Each horoscope website directory contains one file for each
# horoscope type.
return self.dot_dirs + horoscope_types
else:
# Top level directory (mountpoint) contains one directory for each
# horoscope website.
return self.dot_dirs + horoscope_sites
The readdir()
method is used to list the contents of a directory.
If the directory’s name ends with the horoscope website, then we already know what files it should contain - The files corresponding to the “daily”, “weekly” and monthly horoscopes.
If the directory’s name is something else (in our case, it can only be the mount point because there are no other directories in our application), then, we return the names of the directories corresponding to the horoscope websites.
Ofcourse, we have to always return the two special directories (‘.’ and ‘..’) corresponding to the current and parent directories.
def read(self, path, length, offset, fh):
return self._read_data_from_path(path, length, offset)
read()
is used to read the contents of a file (the actual horoscope).
The _read_data_from_path()
method simply return the contents that were fetched
from the horoscope website when an object was instantiated for that website
in _construct_obj_from_path()
.
That wasn’t too difficult !!!. Now all that remains to be seen is how to fetch the horoscope from the horoscope website.
The business logic
Lets walk through fetching contents from one horoscope website. All others should work similarly.
Like I mentioned previously, each horoscope website is a class on its own. Contents are fetched from the website when an object of this class is instantiated. As long as the application is running, only one object of each class can be instantiated. Objects are instantiated on-demand i.e, when the user actually tries to list or read the contents of the directory corresponding to the horoscope website.
The name of the directory corresponding to the horoscope website is same as the name of the class corresponding to that horoscope website.
class Req(object):
"""Get HTML page using requests and parse it using BeautifulSoup"""
def __init__(self):
super().__init__()
def _get(self, url, timeout=30):
try:
response = requests.get(url, timeout=timeout)
response.raise_for_status()
return bs4.BeautifulSoup(response.text, "html.parser")
except:
return None
class Astrosage(Req):
"""Horoscopes from www.astrosage.com"""
def __init__(self, sunsign, moonsign):
super().__init__()
base_url = "http://www.astrosage.com/horoscope/"
self.horoscope = {}
for horoscope_type in horoscope_types:
url = "{}/{}-{}-horoscope.asp"
url = url.format(base_url, horoscope_type, moonsign)
self.horoscope[horoscope_type] = self._parse_html(url, horoscope_type)
def _parse_html(self, url, horoscope_type):
soup = self._get(url)
if soup:
if horoscope_type == "daily":
html_class_attr = "ui-large-content-box"
else:
html_class_attr = "ui-sign-content-box"
content = soup.find(class_=html_class_attr).text
content = textwrap.fill(content.strip()) + "\n"
return content.encode()
else:
return NA
__init__()
is called when we want to instantiate an object for a horoscope website.
This is when we have to fetch the contents (horoscope) from the horoscope website
and parse the HTML returned. This is exactly what the _get()
method inherited
from the Req
class does. It uses the requests
module to get HTML from the
horoscope website and then runs Beautiful Soup
to parse the returned HTML.
Finally, in _parse_html()
, we search for the specific HTML tag that contains the
horoscope contents we’re looking for and extract only that bit which is then
passed through the textwrap
module’s fill()
method to restrict the output
to 80 columns so that it is easier to read on the terminal. This is the content
that the read()
method of the HoroscopeFS
class will eventually get and cause
it to be displayed to the user.
That’s it !!!. We’ve now written a simple virtual filesystem in Python using FUSE. There are a lot of other filesystem functionalities that we didn’t need to use/override for our application, but, if you are ever in need of any of them for your own application, I’d recommend looking at the fusepy examples section.