A full Debian software repository is usually around 50 GB. It contains lots of software, documentation and debug packages, and unless you are providing services to a wide range of clients, you likely need only a fraction of them.
Most people use apt-mirror to download their repos. Although it is pretty fast thanks to its multithreaded downloads (which only matters if you have a really good connection), I haven't found a way to make it filter packages, so we'll stick with debmirror for now, which gives amazing results in this area.
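The debmirror script has two main options for filtering the repo: --exclude-deb-section and --exclude. The former filters by package section and the latter by URL, that is, the file's path inside the repository. A matching --include keeps files that an --exclude would otherwise drop. All of these options take regular expressions, which can be tricky sometimes, but we'll keep it simple.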
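Since the regular expressions are the tricky part, it can help to try a pattern against a few paths before handing it to debmirror. The sketch below is only an approximation: it uses grep -E, whose extended regular expressions agree with debmirror's Perl ones for simple patterns like these, but not in general. The helper name matches_pattern and the sample paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the sample paths that a candidate pattern matches.
# grep -E stands in for debmirror's Perl regexes (an approximation).
matches_pattern() {
    pattern=$1
    shift
    # One path per line in, only the matching lines out.
    printf '%s\n' "$@" | grep -E -- "$pattern"
}

# Made-up sample paths; only the kfreebsd one should be printed.
matches_pattern '(kfreebsd-.*)' \
    'pool/main/k/kfreebsd-9/kfreebsd-image-example_amd64.deb' \
    'pool/main/p/python2.7/python2.7-doc-example_all.deb'
```

Anything the helper prints is a path the pattern would catch, so an empty result means the pattern is too narrow and a very long one means it is too greedy.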
With a script like this, you keep pruning the repository until it reaches the desired size. Thanks to this technique, all my repos are smaller than 15 GB nowadays.
debmirror --method=http --config-file=/opt/get-wheezy/wheezy.conf \
    --nosource /ftp/repos/debian/wheezy/debian/ --ignore-release-gpg \
    --no-check-gpg --postcleanup --allow-dist-rename \
    --root=debian --rsync-extra=none -d wheezy \
    --exclude-deb-section='(games|debug|news|gnustep|ocaml|hamradio|gnu-r)' \
    --exclude='(/i18n/Translation-.*\.bz2)' \
    --exclude='(kfreebsd-.*)' \
    --include='(/python[0-9.]*-doc)' \
    --include='(/.*coinor.*)' \
    --exclude='(.*java.*doc.*)' \
    --exclude='(.*debian.*reference.*)'
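Deciding what to prune next mostly comes down to seeing which packages take the most space in the mirror. Here is a minimal sketch of that step, assuming GNU find (as shipped on Debian) and a mirror directory passed as the first argument. The name largest_debs is hypothetical and not part of debmirror.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest .deb files under a mirror directory.
# Output is "size-in-bytes<TAB>path", biggest first. Requires GNU find (-printf).
largest_debs() {
    mirror_dir=$1
    count=${2:-20}   # default to the top 20 when no count is given
    find "$mirror_dir" -type f -name '*.deb' -printf '%s\t%p\n' |
        sort -rn |
        head -n "$count"
}

# Example call, using the mirror path from the debmirror command:
# largest_debs /ftp/repos/debian/wheezy/debian/ 20
```

The paths it prints are the same ones --exclude is matched against, so a large package you do not need can be turned into a new pattern directly.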
To save you some du and find hacking, I've set up a GitHub repo with all my working filters. Keep in mind that I'm running Debian stable (or sometimes testing) with KDE and that my main programming language is Python. However, I do not filter out the packages necessary to compile other software.
If you find any new package worth filtering, please open a pull request and I'll take it into consideration.
Warning: This config works OK for me, but you may have to unfilter some packages according to your own needs.