Free space with freedup

A useful tool I came across recently is freedup. If you’re like me, you have many copies of the same file scattered across your disk (particularly true in my case because I keep version-controlled backups of my most important files). Whilst multiple copies of the same file make restoring from older copies easy, they also make chewing through disk space easy. To solve this, it’s possible to link two identical files together so they share the same data on the disk. In essence, a file name is really just a pointer to where the data live on the disk, so it’s easy to have two names point at the same location. That way you don’t need two copies of the same data; you just point both names at one copy. These types of links are often called hard links.
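As a quick illustration of a hard link (the file names here are just examples):

ln large-file.dat copy.dat            # hard link: one set of data blocks, two names
ls -li large-file.dat copy.dat        # both names now report the same inode number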

Whilst it’s possible to find all duplicates manually and link the two files together (through the ln command), it’s tedious if you have hundreds or thousands of duplicates. That’s where freedup comes in handy. It can search through specified paths finding all duplicates, and hard link them together for you, telling you how much space you’re saving in the process. A typical command might look like:

freedup -c -d -t -0 -n ./

where -c counts the disk space saved by each link, -d forces modification times to be identical, and -t and -0 disable the external hash functions. Most importantly at this stage, -n forces freedup to perform a dummy run-through. Piping the output into a file or a pager (like less) means you can verify it’s looking in the right places for files and that it’s linking what you expected. Remove the -n, rerun the command, and it’ll link the files identified in the dummy run. In my case this saved several gigabytes on my external disk, which is not to be sniffed at.
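In practice, the dry run plus the real run look something like:

freedup -c -d -t -0 -n ./ | less      # dry run: page through the proposed links first
freedup -c -d -t -0 ./                # happy with the list? run it for real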

Unison vs. Dropbox

A mature project by the name of Unison purports to synchronise directories between a number of computers. It’s cross-platform (at least Windows, Mac and Linux) so it seemed suitable for a test run. Given my adventures with Dropbox, then SpiderOak, this looked promising.

Unison differs in some important respects from both SpiderOak and Dropbox. Firstly, there’s no remote backup (a.k.a. storage in the “cloud”): you can synchronise directories across many machines but you have to have at least some form of access to them (usually SSH). Secondly, Unison doesn’t run as a daemon like SpiderOak and Dropbox do. Those two launch transfers based on input/output (I/O) activity (i.e. you’ve added or removed a file in the synced directories); Unison doesn’t do this on its own. Thirdly, Unison doesn’t do versioning, so you can’t view the changes to a file over a given time. In SpiderOak’s case, this versioning goes back forever whilst Dropbox does so only for the last thirty days. These limitations can be overcome through the use of additional tools (see here for more info), notably file monitoring tools.

Instead, however, I decided a more straightforward approach would suffice. I have essentially three machines on which I would like to synchronise a directory (and its contents). I decided that a star topology would work best, which is to say one of the machines would act as the master copy and the two other clients would connect to it to upload and download files. The advantage of this approach is that I need only run an SSH server on one machine; the clients need only have SSH clients on them. Since one of these machines is a laptop and the other is behind a corporate firewall, this made setup a lot easier.

The first thing to note is that in order for this to be as Dropbox-like as possible, key-based SSH logins are necessary. Once I’d successfully set up key-based logins from each client machine to the master, setting up Unison was pretty straightforward. Their documentation is actually pretty good, and I was able to set up the profiles on the laptop and the desktop machine with little hassle.
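The key-based login part is the usual routine (the hostname and port here are placeholders matching the profile below):

ssh-keygen -t rsa                     # an empty passphrase keeps the sync unattended
cat ~/.ssh/id_rsa.pub | ssh -p 2222 user@remote-master.com 'cat >> ~/.ssh/authorized_keys'
ssh -p 2222 user@remote-master.com    # should now log in without a password prompt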

One point worth making is that the Unison versions on the clients in the star network must be close to (or exactly the same as) that on the master. Apparently there are some differences between versions in the way Unison checks for new files. I’ve used versions 2.40.63 (Linux) and 2.40.61 (Mac and Windows) and haven’t received any error messages about conflicts. On the Windows machine, it was easiest to use the Cygwin version of Unison with Cygwin’s version of OpenSSH too. I didn’t have much luck with the GUI tool on any platform; in fact, it was much easier to use the command line and text files.

To set up a profile, Unison expected a .prf file in $HOME/.unison with the following format:

# Some comments, if you want
root = C:\Users\slackset\Unison
root = ssh://user@remote-master.com:2222//home/slackset/Unison
sshargs = -C

As you can see, the syntax is pretty simple. The :2222 after remote-master.com specifies the SSH port (omit it if using the default, 22), and the double slash before /home/slackset/Unison makes the remote path absolute rather than relative to the user’s home directory. This will synchronise the contents of C:\Users\slackset\Unison on a Windows machine with /home/slackset/Unison on the master. The process is the same on a Mac, but the .prf files live in $HOME/Library/Application Data/Unison. You can create as many profiles as you want, which is more akin to the functionality in SpiderOak, but missing in Dropbox (which can synchronise only one directory).

There’s a number of options for the command-line invocation of Unison to run this in a daemon-like manner:

/usr/bin/unison ProfileName -contactquietly -batch -repeat 180 -times -ui text -logfile /tmp/unison-"$(date +%Y-%m-%d).log"

The important flags are -batch and -repeat n, which force the synchronisation to occur without prompts and repeat it every n seconds (in my case 180, or three minutes). If you omit -logfile and a target, a unison.log will be left in your home directory (which is annoying). I put this in the crontab with the @reboot keyword (see man 5 crontab) on the Windows machine (through Cygwin) and the Mac, so every three minutes my files are synchronised. That’s not quite instantaneous, so for the times I’m feeling impatient, I created an alias to run the same command without the -repeat 180:

alias syncme='/usr/bin/unison ProfileName -contactquietly -batch -times -ui text -logfile /tmp/unison-"$(date +%Y-%m-%d).log"'

It spits out a list of the files which will be updated (either uploaded or downloaded) to standard output. I could bind this to a keyboard shortcut (with AutoHotKey on Windows, for example) or as an icon somewhere, but since I have a terminal open all the time, it seems easier to just type syncme when I’m impatient.
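For completeness, the @reboot crontab entry mentioned above looks something like this (a sketch, with the profile name and log path as before; note that % signs must be escaped in a crontab):

# start the repeating sync at boot
@reboot /usr/bin/unison ProfileName -contactquietly -batch -repeat 180 -times -ui text -logfile /tmp/unison-"$(date +\%Y-\%m-\%d).log"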

So far, this is working pretty well as a Dropbox replacement, but I do miss the fallback of having versioning. I may eventually get around to setting up git on the master in the star network, which I think would give me good versioning. Something for a rainy day, perhaps.

Dropbox vs. SpiderOak

There was a hoohah on the internet recently with regard to Dropbox, its Terms of Service and the manner in which they encrypt your data on their servers. As a result, I heard talk of alternative systems for syncing files across a few machines. The prerequisites were that it needed to be cross-platform (Linux, Windows and Mac), offer at least the same level of encryption as Dropbox (preferably better), and be relatively easy to use.

Top of the list of alternatives was SpiderOak, a service principally aimed at online backup (more often known as storing your data in the cloud these days) but which also provides functionality to sync backed up folders across several machines.

The first problem came with the installation of SpiderOak on Slackware. Their website provides a package for Slackware 12.1; at the time of that release, Slackware was 32 bit only, so there’s no 64 bit package available. There are 64 bit packages for a number of other distros, including Debian, Fedora, CentOS, openSUSE and Ubuntu. I decided to avoid the Slackware package since I wasn’t sure it would play nicely on a more recent (13.37) version of Slackware; instead, I set about writing a SlackBuild script to repackage one of the other distros’ packages. In the end, I settled on the Ubuntu .debs.

I modified the autotools SlackBuilds.org template to unpack the .deb files instead of the more usual tarballs. Running the SlackBuild produced a package, but after I installed it I received the following error message:

Traceback (most recent call last):
File “<string>”, line 5, in <module>
zipimport.ZipImportError: not a Zip file: ‘/usr/lib/SpiderOak/library.zip’

I found the relevant file in /usr/lib and the file utility confirmed it was not a zip file. After much head-scratching, it turned out that the strip lines in the standard SlackBuild significantly alter a number of files in the SpiderOak package. After removing those lines, the package built, installed and ran as expected. edit: if you’re looking for a copy of the SpiderOak SlackBuild, I’ve uploaded it here, along with the doinst.sh, spideroak.info and slack-desc.
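For the curious, the core of the repackaging is just unpacking the .deb into the build tree instead of a source tarball; a rough sketch (the .deb filename and archive member names are assumptions about the SpiderOak package):

cd $TMP/package-spideroak
ar x $CWD/spideroak_*_amd64.deb       # yields debian-binary, control.tar.gz and data.tar.gz
tar xvf data.tar.gz                   # the usr/ and etc/ trees that get repackaged
# the usual strip/compress lines are then left out, since they corrupt library.zip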

The next step was migrating all my data from Dropbox to SpiderOak. SpiderOak allows you to define a number of folders to back up, unlike Dropbox where only a single folder can be synced. I created a new directory (~/SpiderOak seemed fitting) and copied the contents of my Dropbox folder over to the new SpiderOak folder. I added this folder as a folder to back up in SpiderOak and let it upload the files.

I changed all the relevant scripts which used to dump data into my Dropbox folder to do so into the new SpiderOak folder instead. After setting up similar folders on my desktop (Windows) and my Mac (laptop), I was able to create a sync whereby all three versions would be synced with one another, effectively emulating the Dropbox functionality.

After a few days of dumping data into the folder, all seemed well. A few days after that, however, I started to get worried that my 2GB of free storage on the SpiderOak servers was filling up rapidly. After some investigation, it became apparent that the way SpiderOak works is subtly, but in my case crucially, different, and the difference lies in how overwritten or deleted files are handled.

Dropbox stores old versions of files for up to 30 days after they have been overwritten or deleted, which meant my frequent writes didn’t fill up all my space. SpiderOak, conversely, never discards overwritten or deleted files. That’s a nice feature if you accidentally delete an important file; for my purposes, however, it merely served to swallow my free 2GB of storage in pretty short order.

It is possible to empty the deleted items for a given profile (i.e. each machine currently being backed up to the SpiderOak servers). Irritatingly, however, it is not possible to empty the deleted items for one machine from a different machine; you can view the deleted files and how much space they’re taking, but you can only remove them from the machine on which they were deleted. Since most of my files are created and deleted on my headless server, using VNC to open the SpiderOak GUI and empty the files every couple of days quickly lost its appeal. I searched through the SpiderOak FAQs and documentation (such as I could find) and found only one option to do this automatically from the command line. The command is SpiderOak --empty-garbage-bin, though it is accompanied by this dire warning:

Dangerous/Support Commands: 

Caution: Do not use these commands unless advised by SpiderOak support. They can damage your installation if used improperly.

So, for my purposes, the Dropbox approach of removing anything you’ve deleted or modified after 30 days is much more usable. Since I don’t keep anything particularly sensitive or critical on there, the hoohah about privacy and encryption isn’t much of a concern to me. After a month of SpiderOak, I was fed up with having to delete all the deleted files over VNC, so I moved everything back to Dropbox. I suppose the lesson there is “if it ain’t broke, don’t fix it”.

Vodafone Mobile IP ranges

I recently added a firewall to my server so that, even if there’s a security hole in the router’s firewall, the server inside the network is still protected by its own. This server-side firewall is more restrictive than the router’s, with a default DROP policy, and I have to punch holes in it to get access.

One of the things I like being able to do is access my files from my phone. Unsurprisingly, you get a dynamic IP address with mobile broadband, but if I could find the range of IP addresses Vodafone assign to their mobile broadband customers, I could allow that range through the firewall. Obviously this only allows those IPs to attempt a connection; they still need the correct credentials to get in.

After a bit of searching, I came across this thread on the Vodafone forums. After some initial reticence on the Vodafone side, they eventually listed the IP ranges they used for their mobile broadband. To save some searching, this is the appropriate set of ranges:

212.183.140.48/28
212.183.140.102/31
212.183.140.16/28
212.183.140.98/31
212.183.140.32/28
212.183.140.100/31
212.183.140.0/28
212.183.140.96/31 
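To turn those into firewall rules, something along these lines does the job (a sketch, assuming iptables, SSH on port 22 and an INPUT chain with a default DROP policy):

for range in 212.183.140.0/28 212.183.140.16/28 212.183.140.32/28 212.183.140.48/28 \
             212.183.140.96/31 212.183.140.98/31 212.183.140.100/31 212.183.140.102/31; do
    iptables -A INPUT -p tcp -s "$range" --dport 22 -j ACCEPT   # let these ranges reach SSH
done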

An initial connection from my phone indicates this range is valid (for the time being). This solves the problem of accessing my server from unknown networks: I can simply tether my phone to my laptop and know I’ll be able to get in to the server, at which point I can make temporary changes to the firewall to allow that specific new IP address.

I would, eventually, like to add port knocking to the server so that, even though there are open ports on the router and firewall, they appear closed to a simple scan.

Data recovery

A Windows laptop bluescreened midway through a transfer of data from the internal disk to a 500GB (Michael Jackson) external disk. Windows refused to acknowledge the existence of the FAT32 partition, saying the disk needed to be formatted; my Mac fared no better, claiming I needed to initialise the disk.

This was a backup disk (in fact, the computer was backing up when it bluescreened), so nothing on there was irreplaceable, and I decided to have a bit of a play with some data recovery tools.

The first thing I needed to do was get a disk image so that I could fiddle around to my heart’s content without worrying about damaging the disk. The disk cloning utility dd took care of that for me:

dd if=/dev/sdc of=./michael.img

I cloned the entire device (/dev/sdc rather than /dev/sdc1, for example) since the partition table appeared to be corrupted. I didn’t set any special options and, since I was in no particular hurry, I let it do its thing overnight. Once I had a disk image, I tried testdisk to see if it could rebuild the partition table, or at least let me copy the contents of the partition somewhere else.

testdisk ./michael.img

For a more comprehensive look at testdisk’s functionality, check their wiki. In essence, I used the Advanced section (Filesystem Utils) to do a boot sector recovery, after which I could access the filesystem contents and select the files I wanted to copy to a directory, ready for copying back onto the external disk.

Although this is by no means an in-depth look at testdisk, its tools are impressive. It can rebuild partition tables to allow corrupt disks to boot again. The sister program, photorec, is aimed more at recovering images and other media based on the signatures those file types leave in a filesystem.

For my purposes, however, the recovery was pretty straightforward and the data have been successfully recovered. The last job is to format the disk with a fresh (probably NTFS) filesystem, and then copy the recovered data back.
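That last step will be something like the following (a sketch: the device name and label are assumptions, and mkfs.ntfs comes from the ntfsprogs/ntfs-3g tools):

fdisk /dev/sdc                        # recreate a single partition spanning the disk
mkfs.ntfs -f -L michael /dev/sdc1     # quick-format the new partition as NTFS, with a label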

Virgin DNS

Virgin Broadband intercept invalid DNS requests and supply their own results. Whilst this is convenient to some extent, it does mean they’re interfering with the way things were designed to work. Fortunately, however, it’s easy enough to turn off.

Virgin provide a service to turn off this DNS interception here. A word of caution: you can only apply this change from your home connection.
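An easy way to check whether the change has taken effect is to resolve a name that shouldn’t exist and see what comes back (the domain below is just a placeholder):

dig some-name-that-does-not-exist.example.com
# with interception off, the status is NXDOMAIN and the answer section is empty;
# with it on, you get an A record pointing at Virgin's search page instead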

Virgin Broadband Update

Following on from my last post about Virgin Broadband and their traffic management, I came to a realisation. I phoned them up and asked about the usage on our account over the last few weeks. They gave me this data:

Daily Downstream Total:

Period       Total (GB)
26/03/2011   0.432
27/03/2011   0.362
28/03/2011   0.748
...
17/04/2011   0.591
18/04/2011   0.863
20/04/2011   9.496
21/04/2011   9.912

What is apparent is that around the 18th of April our usage dramatically increased. I couldn’t account for this in general browsing terms, especially as we were away for a few days over that period. Thinking back, I looked over the iperf commands I was using to measure the bandwidth. On the client side, I was running this command every ten minutes:

iperf -c www.example.com -P 4 -f k -w 256k -t 60

This was using a TCP window of 256k, a transmit time of 60 seconds and four parallel threads (-P 4). The net result is that I was sending 70MB of data from the client to the server six times an hour. Multiply that by 24 and you get a number very close to 10GB (which is what Virgin indicated we were downloading). So, after a bit of reading, I changed the client iperf options:

iperf -c www.example.com -P 1 -f k -t 5 -x CMSV

Now only a single thread runs, for five seconds, with the default TCP window size. The addition of -x CMSV reduces the output from iperf and means I don’t have to grep for SUM to get the relevant line. Furthermore, I think a ten-minute sampling interval is a little on the high side, so I changed the cronjob to run the iperf script every half an hour, which should still give the granularity I want.
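The corresponding crontab entry just changes the minute field (the script path is as in the original post):

*/30 * * * * $HOME/scripts/bw.sh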

Of note is that with a smaller TCP window and a shorter transmit time, the results are slightly more erratic. They are, however, still showing the same pattern as before.

So, it seems Virgin were correctly measuring our downloads and managing our traffic accordingly. What caused the initial slow-down that prompted me to measure the bandwidth remains a mystery.

Bandwidth with Virgin Cable Broadband

We recently switched from o2’s ADSL to Virgin’s cable broadband. I knew that Virgin did traffic management, but I hadn’t really paid it much attention. However, after one evening of particularly slow (think 56k modem-slow) internet, I decided to look into it a bit more.

Stumbling my way through the internet, I came across a tool called iperf. Given two Linux computers, it would tell you the network speed between those two computers. I installed iperf on my Slackware server from SlackBuilds.org and compiled a copy at work on a Red Hat 5 machine. I know that the connection at work is much much faster than my home one, so I was confident I’d be measuring the bandwidth as limited by my home connection rather than by work’s connection.

On the server, I launched the following command:

iperf -s -f k

This starts the server (-s) and outputs results in kilobits per second (-f k). By default, iperf runs on port 5001, so I opened the appropriate port on the Virgin Media Hub (a NETGEAR VMDG280 Wireless ‘N’ Cable Gateway modem and wireless router combo).

On the client at work, I wrote a little bash script (beware line-wrapping):

#!/bin/bash
OUT=$(iperf -c www.example.com -P 4 -f k -w 256k -t 60 | grep SUM)
DATE=$(date +%Y%m%dT%H%M%S)
printf "$DATE " >> $HOME/logs/bw.log
echo "$OUT" >> $HOME/logs/bw.log

(I’ve replaced my home IP with www.example.com, obviously.) This little script saves the output of iperf in $OUT. The switches do the following:
  • -c – client mode
  • -P 4 – runs four parallel client threads
  • -f k – outputs in kilobits per second
  • -w 256k – sets the TCP window size to 256k
  • -t 60 – sets a sixty second transmit time
The line of interest in the iperf output is the SUM line, so I grep for that to save the results in $OUT. The rest of the script just saves the result with a timestamp to a log file. I run this from cron every ten minutes on the client:

*/10 * * * * $HOME/scripts/bw.sh

To graph the results, I use the Generic Mapping Tools (GMT) package (SlackBuild available here). Figure 1 shows the analysis for the last two days or so:
Figure 1: Bandwidth at home for a 40 hour period. Visible is the 75% bandwidth reduction from 1900 to 0000 on the 19th April.
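The plotting itself is nothing fancy; a rough sketch of the sort of thing involved (GMT 4 syntax, and the awk field positions are assumptions about the log format above):

awk '{print NR, $(NF-1)}' $HOME/logs/bw.log > bw.xy                       # sample number vs. rate in Kbits/sec
psxy bw.xy -JX18c/10c -R0/288/0/12000 -W2p,blue -Ba48/a2000WSen > bw.ps   # ~2 days of 10-minute samples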
Since I can now see exactly what the bandwidth at home is every ten minutes, I decided to try to find out exactly how Virgin were limiting the speed. Some deft searching revealed this page, which outlines their traffic management policy. N.B. this is for areas which have received the upstream upgrade; for those that haven’t, this is the appropriate page.
Although the table is a bit opaque, it can be summarised as follows for the Large bundle:
  • Between 1000 and 1500, downloads must not exceed 3GB (600MB/hr)
  • Between 1600 and 2100, downloads must not exceed 1.5GB (300MB/hr)
  • Between 1500 and 2000, uploads must not exceed 1.5GB (300MB/hr)
The rest of the time, any amount of traffic is permitted, and no traffic management will be imposed. Traffic management is a 75% reduction in connection speed (from 10 megabits per second to 2.5 megabits per second in our case) for 5 hours. Interestingly, from 1500 every day, you get one hour to download as much as you like, and it won’t count towards your total for traffic management.

Comparing the bandwidth in Figure 1 with this traffic management information, we were traffic managed on the 19th April between 1900 and midnight. However, I have to say, you get what you pay for the rest of the time: Virgin’s Large bundle (10 megabit line) really is very close to 10 megabits per second most of the time (generally around 9.8 megabits per second).

I do have a script running which backs up my server to my computer at work (when it’s on), but it does incremental backups, and those are scheduled to run at 0440, i.e. in the middle of the night, when there’s no limit on traffic.

Between 1500 and 2100 I don’t consider us to be a particularly voracious household when it comes to consumption of material online, yet we still fall into the “very small proportion of customers who are downloading and/or uploading an unusually high amount”. There are only two of us, and we don’t watch iPlayer very often. We might watch a video online every now and then, but we’re certainly not downloading 600MB of content every hour for five hours straight.

Temperature Monitoring with TEMPer

I got a USB thermometer from eBay called TEMPer. I found some code which provides a driver and reporting app which, with some patching, I used in a little cronjob. The code can be found at SlackBuilds.org and seems to work OK on Slackware 13.1. It does, however, report temperatures which are generally 9.3 Celsius higher than the actual temperature, so the figure below has erroneously high temperatures.
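The cronjob itself is nothing fancy; a sketch of the sort of thing involved (the temper binary name and its output are assumptions about the SlackBuilds.org package):

# every ten minutes, append a timestamp and the current reading to a log
# (the % signs must be escaped in a crontab)
*/10 * * * * echo "$(date +\%Y\%m\%dT\%H\%M\%S) $(/usr/bin/temper)" >> $HOME/logs/temperature.log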

Below is a plot of last month’s temperature data from the TEMPer.

Raw (blue) and filtered (green) temperature (Celsius)

I wrote a little Octave/MATLAB script which takes the raw data from the cronjob output and does a frequency analysis with the FFT tools in Octave/MATLAB. I use the frequency domain to lowpass filter the raw data (the blue line) with a cutoff at 1/7 cycles per day to smooth the data (the green line). Comparing this filtered line with some regional data from a meteorological station shows pretty good agreement:

Meteorological data from http://www.southamptonweather.co.uk/wxgraphs.php for the same time period as the graph above.

The line of interest is the green line in the lowest panel. The duration of the two plots is the same, but the y-axes differ in scaling.

There are a number of interesting trends in the two data sets. Both graphs decrease for the first five or six days, followed by a relatively slow increase in temperature which peaks at ~12 days before present (the 15th). From the 15th onwards, temperatures decrease until the 17th or 18th, when they begin to rise again, until the 22nd, where the two graphs diverge. The meteorological data continue on their upward trend; the TEMPer data, however, suddenly drop 2-3 Celsius.

The reason for this divergence is that we turned our heating off around the 21st. Following this static shift of a few degrees, the two graphs track the same trend, but the TEMPer data are now lower.

Finally, the huge temperature spike in the TEMPer data for the last two days or so is from friends visiting and the heating being put back on for two late nights.

I’ve got a longer time series from the internal sensors on the PC, but hopefully this USB thermometer will give me less CPU-load-dependent temperatures.

PPTPd installation and configuration

Setting up a PPTP server (aka VPN in Microsoft Windows operating systems) on Slackware 13.1 with the aid of SlackBuilds.org (SBo) and sbopkg. Most of this is lifted from here, which was the most recent set of instructions I could find. Everything else dated from a few years ago, and that makes those documents about as useful as a chocolate teapot.

Install pptpd from SBo. Use sbopkg if you like, otherwise follow the instructions here.

Once that’s complete, edit /etc/ppp/chap-secrets with your favourite editor. I like vim, so:

vim /etc/ppp/chap-secrets

Add a new username and password to log in:

someusername pptpd somestrongpassword *

Replace someusername and somestrongpassword with the username and password you wish to use to connect to your VPN.

Now we need to tell pptpd how to handle the new connections’ IP addresses on the local network. Edit /etc/pptpd.conf with your favourite editor:

vim /etc/pptpd.conf

In /etc/pptpd.conf, add the following lines to give the remote machine an IP on the local network in the 192.168.111.0/24 subnet:

localip 192.168.111.1
remoteip 192.168.111.234-238,192.168.111.245

Moving on, edit /etc/ppp/options.pptpd

vim /etc/ppp/options.pptpd

In that file, replace ms-dns 192.168.1.1 and ms-dns 192.168.1.2 with Google’s DNS servers:

ms-dns 8.8.8.8 
ms-dns 8.8.4.4

The final step is opening up port 1723 on your router and setting up dynamic DNS to provide a more easily remembered address to connect to from your remote host.
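If the server runs its own firewall (as mine does), PPTP also needs TCP port 1723 and the GRE protocol allowed through; a minimal iptables sketch, assuming a default DROP INPUT policy:

iptables -A INPUT -p tcp --dport 1723 -j ACCEPT   # PPTP control channel
iptables -A INPUT -p gre -j ACCEPT                # GRE carries the tunnelled traffic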

When all that’s done, launch pptpd as root and try connecting to your new PPTP/VPN server.

I tested this from a different machine on a different network and was able to browse just fine through my PPTP server. Browsing to www.whatismyip.com gave me my PPTP server IP address, so it worked just fine. What I need now is more bandwidth at home!