Last-Modified-Date: 2004-11-05

"Scripting Language" My Arse: Using Python for Voice over IP
============================================================

Anthony Baxter

Abstract
--------

A common complaint made of Python is that it is not suitable for serious application development, and is only suitable for "scripting" or "prototyping" tasks.

The Shtoom toolkit (http://shtoom.divmod.org) is a Voice over IP (VoIP) toolkit implemented in Python using the Twisted framework. It includes 'shtoom' itself, a software phone using the toolkit, as well as code for creating voice applications, known as 'Doug'.

This paper covers the basics of SIP and RTP (the protocols underlying Voice over IP), examines some of the issues relating to the implementation of Shtoom (with a digression on issues relating to timing), and will hopefully help demonstrate why implementing applications in Python is perfectly feasible.

Introduction
------------

Why would I choose Python for VoIP? Python's a high-level language, with many constructs that make it extremely pleasant to work with. In addition, the Twisted framework provides an efficient and elegant model for implementing network protocols. In implementing the software phone, a nice-to-have was that the phone would work in a cross-platform way - I am not aware of any existing cross-platform software phones.

Why would I not choose Python for VoIP? The primary reason would seem to be performance - VoIP is a complex beast, with requirements for throwing around packets of audio at some speed. It would seem from a first look that an interpreted language like Python would not be suitable for this task.

Why Shtoom?
-----------

There were a number of reasons for starting a voice over IP project. I wanted to investigate new approaches to writing voice applications.
In addition, I had a need for a VoIP client that could be scripted for automated testing of our cisco gateways running a number of large, complex IVR scripts (an IVR is one of those automated phone systems you interact with via the phone, pushing phone buttons to respond to menus). I was looking for a replacement for the current conference calling application we use (stupidmcu, derived from OpenH323's openmcu). I also wanted to investigate writing voice applications on a platform that was less limited than cisco's embedded Tcl engine.

We had been using OpenH323 [openh323] internally for a number of applications, so my first approach was to examine its suitability for wrapping in Python. It's a large complex C++ library and, as seems mandatory for any large C++ library, it implemented its own basic types. I started down the path of using Boost.Python to wrap this library, but abandoned this after a few days' work and pain. Just wrapping the basic types it needed would have been a couple of weeks' tedious work. As a programmer who prefers to code in Python, this struck me as a very very boring approach. In addition, the OpenH323 libraries were (in my experience) extremely awkward to debug -- this is largely because the underlying H.323 protocol is itself a nightmare. I'll come back to H.323 in a bit.

I then investigated using SIP (the Session Initiation Protocol) instead of H.323. SIP is the Internet's answer to H.323 (much more on SIP in later sections). There was a partial implementation of the SIP protocol as part of Twisted (enough to implement a SIP Registration server), so this was a good base to begin with. I'd already had experience implementing RTP (the Real-time Transport Protocol), the UDP-based protocol that carries the audio over the Internet portion of a call, in C, so I felt I was up to the task.

Why Python?
-----------

There's a few obvious reasons for choosing Python [python] for Shtoom:

It's easy to work with, and to debug.
For implementing a network protocol from scratch, Python is hard to beat. When you add in the Twisted framework, the choice was pretty obvious for me.

It's cross platform - while my initial requirements were for something that would work on Linux and Solaris, having it work on other platforms would be a nice-to-have.

There's a variety of UI toolkits available from Python, including two (Tkinter and wx) that are cross platform.

And finally, of course, Python is fun to code.

Why not Python?
---------------

The first concern I had was whether Python would be fast enough to handle VoIP. VoIP traffic consists of a lot of little packets flying back and forth with some fairly harsh timing requirements, and with certain applications (such as conferencing) you need to do software mixing of multiple audio samples down to a single sample.

The next concern about Python was the interfaces to the audio hardware, in particular, capturing audio. We'll cover this more, later.

There's no single user interface for Python. I regard this as something of a positive - Shtoom has a pluggable user interface layer. Currently the code has Qt, GNOME, Tk, Cocoa, wx and command line user interfaces. An MFC (Windows) interface may be added at some point in the future.

The quite rigorous timing requirements of the RTP protocol - you need to send a packet of audio every 20ms, and very little delay is acceptable - were the major concern I had about Python's suitability for this task. We'll come back to this, later.

Why Twisted?
------------

A big advantage in using Python is the Twisted framework [twisted]. Twisted is an open-source Python framework for writing network applications, using an asynchronous event model. I'd previously used Twisted in another project [pydirector] and was impressed with the stability, flexibility and performance of the core library.
Twisted also includes a whole pile of useful code that was already written for me - this meant I could concentrate on the interesting bits of the problem, rather than re-inventing every single wheel.

Voice over IP - A Short and Biased Summary
------------------------------------------

Voice over IP (VoIP) refers to the carriage of telephone calls over the Internet, rather than the traditional public switched telephone network (PSTN) -- the copper wires and fibres that connect every house together. VoIP is used heavily by carriers (telephone companies) for their internal networks, and is gaining increasing popularity as high-speed Internet links to the home become more common.

As well as being considerably cheaper than traditional phone calls (effectively free, assuming your Internet link is already paid for), VoIP allows for more sophisticated telephone services, such as video, multi-party communications (conferencing), and, well, pretty much anything you can think of. This is one of the most exciting aspects of the Internet absorbing the telephone network - it takes control of the network off the existing carriers, and allows for a wide variety of people to come up with new and interesting services.

Once your phone call is being routed over the Internet, it can, in theory, go anywhere. Well, anywhere that's on the Internet. This of course probably doesn't include your mother, or the friend who's walking down the street with a mobile phone. To get around this problem, many people provide gateways to the PSTN from VoIP. Most of these gateways are commercial, but they are usually much cheaper than the phone call over a landline would be. The standardisation of the VoIP protocols also means you have a large variety of companies who can accept your business. There are also a number of providers of free PSTN gateways, such as Free World Dialup.

So how do you use this wonderful VoIP thingy? Well, obviously you're going to need an Internet link.
And then you need a device that allows you to enter a phone number or net address, connects you to the other end, and then transmits the audio over the Internet. In the non-VoIP world, we call this a "telephone". In the VoIP world, telephones can be broken into two basic categories.

The first is the hardware phone. These have gone from being an expensive toy requiring extensive infrastructure and used only by large corporates only a couple of years ago, to a much more affordable consumer item that you can pick up for around US$100-US$200 today. SIPphone [sipphone], started up by MP3.com's Michael Robertson, sells an adapter that has a phone jack on one side and an Ethernet port on the other side. You simply plug an existing handset into one side of the adapter, an Ethernet cable into the other side, and you're ready to go. Other carriers, such as Vonage, also provide these interfaces. Vonage [vonage] have a similar device, targeted at the home consumer, and seem to be doing very nicely with these devices. There are also dedicated SIP phones - these look something like a regular phone, but with an Ethernet port on the back.

The second category is the soft-phone, or, to use a term most people should be familiar with, a computer program. (Telephony types *love* their terminology). It uses the existing PC sound hardware (speakers, microphone) and communicates via an existing Internet connection. There are a few free softphones out there, as well as many commercial phones. I had a look through the existing phones before I started on the implementation of Shtoom, to figure out what I liked and disliked. At the moment, the most polished of the free phones that I looked at is XTen's X-Lite. This is a closed-source phone (for Windows and OSX), so was only useful to me for interoperability testing. In addition many chat programs, including Microsoft's Messenger and Apple's iChat, are in fact SIP clients - they use SIP under the hood for voice chats.
VoIP: The Protocols
-------------------

H.323
~~~~~

Once upon a time, the only VoIP protocol was H.323 [h323]. This was a standard created by the ITU-T, the same organisation that gave us the runaway success of the X.500 directory service and X.400 email. H.323 has much in common with other ITU-T standards - it features a complex binary wire protocol, a nightmarish implementation, and a bulk that can be used to fell medium-to-large predatory animals. OpenH323, an open-source implementation of this protocol, consists of over 7 MB of C++ code (the UNIX utility 'wc' reports that it's over 2.4 million lines of code). This doesn't include the code to actually encode and decode the audio.

I don't intend to cover H.323 in detail in this paper - there are many fine resources on the net for you to peruse if you wish to inflict this on yourself. It should be noted, though, that H.323 is only one of a suite of protocols - it depends on H.225, H.245 and a swarm of other protocols, clustered around H.323 like the world's ugliest remora fish.

I'm unaware of anyone having implemented even a fraction of H.323 in Python. Doing so would require a special kind of dedication, and quite possibly a large amount of whiskey and prescription medication.

SIP
~~~

SIP (the Session Initiation Protocol [sip]) is a creation of the IETF, the organisation that produces Internet standards. While it is a complex protocol, it features many advantages over H.323:

- It uses text message bodies, in a format that should be familiar to anyone who's looked at the headers of an email message or a web request.
- It is based on a variety of existing IETF protocols, including SDP and HTTP.
- It wasn't designed by an organisation of telcos based in Switzerland.

One common complaint of SIP regards its complexity.
While it is on the large end of a typical Internet protocol (the base RFC, 3261, comes in at 269 pages, and there's dozens of related RFCs and Internet-Drafts), the problem it's solving is a complex one, and to supplant H.323 it needs to support a ridiculous number of options. But again, it's only complex compared to other Internet protocols. Compared to ITU protocols, it's a work of austere elegance.

An aside on Standards
~~~~~~~~~~~~~~~~~~~~~

Standards are good. Standards make a lot of pain go away, and make everything easier. This is particularly true for VoIP -- the whole point of VoIP is being able to talk to other people. This obviously gets somewhat tricky if the phones don't talk the same protocol.

There's a number of non-standard approaches out there. The most visible is Skype [skype], a Windows softphone that uses a proprietary (and secret) protocol. Skype claim a whole pile of benefits to having their own protocol - it works better with a variety of firewalls, it's more efficient, blah blah blah. Unfortunately the trade off for this is that you can only talk to other Skype users - while there's now clients for most major operating systems, it's still a pretty limiting thing. Skype's unlikely to have the variety of hardware phones and phone adapters that you can get for SIP, and certainly there's never going to be the same range of vendors and implementations. In addition, when you want to call from your Skype client to the existing PSTN network, your choice of gateway is Skype, Skype, or Skype.

Another non-standard is Asterisk's IAX. IAX is designed to be more network-friendly and network-efficient than SIP. While this is an open protocol, it was only relatively recently documented, and as yet it's not standardised, aside from "the implementation in Asterisk". Nonetheless, I do plan to investigate an implementation of IAX in the future, mostly for my own amusement.
One more point on standards -- there seems to be a push by a number of vendors to "destandardise" SIP, through undocumented and unstandardised extensions to SIP. Vendor-specific extensions to SIP run the very real risk of locking their customers into one particular vendor's systems, and have the potential to cause immense damage to the takeup of VoIP. If, in your dealings with a vendor, they start pushing the advantages of "their" extensions to SIP, run away.

A Peek under SIP's hood
-----------------------

The two main divisions of work in implementing a VoIP application are the implementation of SIP, which controls the call negotiation and setup, and the implementation of the underlying protocol that passes the audio back and forth.

The latter uses a protocol known as RTP, the Real-time Transport Protocol [rtp]. This is a quite venerable Internet protocol, initially developed for use in Multicast applications. RTP consists of small packets of audio, transmitted as UDP. A typical packet holds just 20ms of audio. There is a companion protocol, RTCP (the RTP Control Protocol), that is used to communicate information such as delivery reports. The audio can be in a number of different formats - the format negotiation is explicitly *not* part of RTP, but is left to a higher level protocol, such as SIP.

One interesting aspect of implementing SIP is that every SIP implementation is both a client and a server. Either end of a SIP conversation can initiate a request or reply to a request. This is quite different to HTTP, which SIP superficially resembles. The protocol itself is also quite stateful - in the implementation there's a number of state machines for handling the various states of a call.

Shtoom Details
--------------

Shtoom Architecture
~~~~~~~~~~~~~~~~~~~

Ooo.
ASCII art::

     +-----------+
     |    UI     |
     +-----------+
           |         +-------+
    +-------------+ /|  SIP  |
    |             |/ +-------+
    | application |
    |             |\ +-------+
    +-------------+ \|  RTP  |
           |         +-------+
     +-----------+
     |   audio   |
     +-----------+

The application is the core element of a Shtoom application. It controls the flow of calls, handles the (high level) incoming events, and deals with the flow of data between the other components (for instance, between the audio layer and the RTP layer).

The audio layer is an abstraction on top of the audio hardware and any audio codecs that might be present. The application calls into the audio layer to query and select audio formats, and to deliver and retrieve audio.

The UI layer is only present on those applications that require a user interface (currently only the phone). The application passes requests to the UI (for instance, when an incoming call arrives) and the UI calls into the application when the user requests something (for instance, when the user enters an address and hits 'call'). Some UI toolkits (such as wx, or MFC) don't play well with other event loops - for these, we run the UI layer in its own thread (with the network and audio side all in one other thread).

The SIP layer is an implementation of SIP. It listens for requests and responses and passes higher level requests to the application. At the moment Shtoom's SIP implementation is not complete - I'm adding to it as I hit a requirement for a new feature.

An RTP layer is created for each incoming or outgoing call. It merely passes the audio to and from the network. Each RTP layer is responsible for its own timer loop. In the future, it would be possible for an RTP layer to be instantiated on a different machine, to allow load spreading.

Multiple User Interfaces
~~~~~~~~~~~~~~~~~~~~~~~~

One nice thing about Python is the wide variety of user interfaces available, and the ease of working with them.
I don't think any application implemented in a lower-level language would attempt to ship with 4, 5 or 6 user interfaces. In Python, though, this is really quite easy. In addition, I've made efforts in Shtoom to produce a higher-level API to reduce my workload.

One thing that's reduced my workload significantly is the Preferences interface. Trying to maintain various preferences dialogs and keep them in sync for the different platforms struck me as a very boring task, so instead I developed code that described the preferences available in an application, and then the user interface layer inspects the options object to build the preferences UI. This allows me to tweak the preferences without having to rebuild the dialogs in each UI. There's additional code that works from the same options object to build a command-line parser (using optparse) and to load and save from Config.ini-style settings files. This is probably useful enough that I'll look at releasing this independently of Shtoom.

Another reason for Shtoom's multiple user interfaces (aside from indecision on my part) was a desire to have a nice example of the different user interface toolkits and how they interface with Python. Hopefully this will be useful in the future - both for people trying to choose between UI toolkits and for people wondering about converting from one toolkit to another. I'm not aware of any projects that provide the same UI using a number of different toolkits.

However, I'm not silly enough to offer an opinion as to which one I consider "the best" - no matter which I choose, someone will disagree violently, and attempt to engage me in a long and tedious discussion about the merits of their toolkit of choice. I really don't care enough to put myself through this. My only comment would be that while Tk is very simple and easy to code, it's... very simple. A lot of things you take for granted in a more modern toolkit require additional packages on top of Tk.
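The preferences mechanism described earlier in this section can be illustrated roughly as follows. This is a minimal sketch, not Shtoom's actual options code: the `OPTIONS` table, the option names, the `shtoom` section name and both helper functions are hypothetical. It only shows the idea of describing each preference once and deriving both an optparse parser and an ini-file loader from that one description.

```python
import configparser
import optparse

# Hypothetical preference descriptions: (name, optparse type, default, help).
OPTIONS = [
    ("username", "string", "anonymous", "SIP user name"),
    ("listenport", "int", 5060, "UDP port to listen on"),
]


def build_parser(options):
    """Build a command-line parser from the shared option descriptions."""
    parser = optparse.OptionParser()
    for name, kind, default, help_text in options:
        parser.add_option("--" + name, dest=name, type=kind,
                          default=default, help=help_text)
    return parser


def load_ini(options, text, section="shtoom"):
    """Read the same options from ini-style text, falling back to defaults."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    values = {}
    for name, kind, default, _help in options:
        if cfg.has_option(section, name):
            if kind == "int":
                values[name] = cfg.getint(section, name)
            else:
                values[name] = cfg.get(section, name)
        else:
            values[name] = default
    return values


if __name__ == "__main__":
    opts, _args = build_parser(OPTIONS).parse_args(["--listenport", "5062"])
    print(opts.username, opts.listenport)
    print(load_ini(OPTIONS, "[shtoom]\nusername = alice\n"))
```

A UI layer could walk the same `OPTIONS` list to build its preferences dialog, which is the part that saves the per-toolkit maintenance work.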
Other Shtoom applications
-------------------------

While the most visible part of shtoom is the phone application, there are a number of other applications in the package. Initially these were written as standalone applications - over time, they've been rewritten to use the Doug framework for applications -- more on Doug, soon.

The first two are a simple announcements server (available by placing a call to 'sip:testcall@divmod.com') and a basic voicemail server. The latter plays a per-user announcement, then records the audio from the person calling. When the person hangs up the call it then emails the audio to the user.

There's also a simple echo server - it simply replays the audio sent to it back to the caller. This is extremely useful for debugging.

A recent addition to Doug is support for conferencing. Multiple people call into the conferencing server and can talk to each other. This is less complex than it sounds - you simply keep track of all participants in a conference, and when a bit of audio comes in, you pass it to the other users. The tricky bit is mixing audio - when multiple people are talking, you need to make sure that you do the right thing and mix the audio samples together. I'll come back to this a bit later in the paper in a discussion on performance. It's very early days for this code, and it's quite rough around the edges. A bit further down the track, the conferencing will also be exposed in the phone program - this will allow users to connect together multiple calls into a single multi party call.

Voice Applications
------------------

I've thrown the term 'voice applications' around a bit so far, so it seems only fair to describe what I mean. First off, voice applications are not the same thing as speech recognition, although a voice application might *use* speech recognition as a way of getting input from the user.
I tend to think of two categories for voice applications - the first is the handling of calls, whether it be switching them to the correct person, conferencing multiple calls into a single call, or passing off to a voicemail system if the user hasn't replied.

The second category covers the more interactive systems, typically referred to as an IVR. These are the phone menus that everyone is no doubt familiar with. An IVR is an interesting exercise - you're extremely limited in your user interface (12 buttons, audio prompts), and designing a good IVR is more art than science. I've done far too much work with IVRs, and will no doubt do more in the future.

This brought on the next major part of Shtoom - Doug, the voice application server.

Doug: The Shtoom Application Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Doug is designed for writing voice applications. It's been influenced by a lot of experience with the cisco Tcl engine (embedded in their calling gateways such as the AS5300 and AS5400). A significant part of the design comes from my frustrations at dealing with the very limited possibilities of the cisco engine.

Doug's an event-driven application server for writing voice applications. So far it supports prompts and menus, collecting input both from the out-of-band DTMF (touch-tone) signals, and detecting the in band tones from phones. There's support for conferencing calls into a single conference and bridging incoming and outgoing calls together. It can both receive and make calls (the latter is used in the automated testing of the cisco gateways at my day job).

In addition, because it's written in a full programming language, on a standard computer, you have access to pretty much any network protocols you might want (as a counter-example, the only real queries supported in the cisco engine are via RADIUS). There's a pile of other interesting protocols supported, such as RADIUS and tftp (so I can re-use our existing infrastructure, built for the ciscos).
The goal of Doug is to make it easier for people to write voice applications. I don't pretend to know what the next great voice applications will be - but I'd like to make it easy for people to write them. Doug is still under development, but is useful enough already.

Implementation of VoIP
----------------------

The next section discusses implementation details of shtoom, showing some of the issues I confronted, and the approaches I took to solving them.

Timing and Buffering
--------------------

There are a few issues you encounter as part of implementing RTP (the low-level protocol used to transmit the audio data). The first is the simple trade-off of buffering vs latency. Simply put, the more you buffer before playing, the more robust you are in the face of a glitchy network that delays individual packets of audio, but the more delay there is in playing the audio.

Initially, shtoom took a brute force approach and did no buffering at all - as soon as a packet arrived, it was sent to the audio device. And every 20ms, a lump of audio was read from the audio device and sent to the network. This gave adequate performance on a local network, but when it was exposed to the vagaries of the Internet, it turned out to be, well, awful. The audio from the network would gradually slip further and further behind, leading to incredible frustration for the users. There's now a couple of simple playout buffer algorithms implemented - one that uses a fixed buffer size of 20-40ms, and a more sophisticated one that dynamically adjusts its buffer size.

The next issue is that RTP requires a reliable source of audio - you need to send the audio every 20ms. The real problem with this is that most modern computers have a timer clock that runs at only 100Hz. This means that the resolution of the timer is just 10ms. This has the unfortunate implication that if you miss the 20ms clock tick (even by a single millisecond) you get a 10ms delay.
This delay is quite obvious to the listener and can render an audio stream unusable, even if only one in 10 samples is delayed in this way.

I initially assumed that this real-time requirement would make Python an unsuitable language for implementing RTP - indeed, in a previous (non open-source) project, I just assumed that this would be the case, and implemented the RTP component of the application in C. This time around, though, I tested my assumption first. I was pleasantly surprised.

Timing Strategies
~~~~~~~~~~~~~~~~~

The first and most obvious approach to this sort of timing is to use a timer signal - on UNIX, for instance, there is a setitimer() call that allows you to specify a repeating timer loop, implemented via signals. This has a few problems - it's non-portable, it relies on signals, and doesn't work if you have multiple timer loops in a single application. (Did I mention that it relies on signals?) Nonetheless, this was the first approach I took, to determine whether Python was able to package up a bundle of audio and send it out within the 20ms available. I was quite happy to find that this was in fact extremely easy. On my (admittedly overpowered) laptop this takes less than a third of a millisecond. So, having determined that this wasn't going to be a problem, I went back to the original problem of getting the timing right.

The second approach is to schedule a call, and have the call reschedule itself immediately. Something like::

    from twisted.internet import reactor

    def nextpacket(self):
        # Re-arm the timer first, so the work below doesn't add to the delay
        reactor.callLater(0.020, self.nextpacket)
        # Send the current packet
        # read the next audio for the next packet

The problem here is that if there is a delay in calling the nextpacket() routine for some reason, the next packet might miss the 20ms timer and instead hit a 30ms timer. You can do a hacky workaround for this by setting the timer to, say, 18ms, and hoping that any delay will fit inside this 2ms window of error. This is extremely ugly and rather brittle.
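To make the drift concrete, here is a small simulation (not Shtoom code) of two scheduling rules. It assumes, purely for illustration, that every timer fires 1ms late; the function names are made up. Rescheduling relative to the moment the callback actually ran accumulates the lateness, while scheduling each tick against a fixed start time does not.

```python
def naive_fire_times(interval, latencies):
    """Each tick is scheduled `interval` after the previous tick actually ran."""
    now = 0.0
    times = []
    for late in latencies:
        # The timer fires late, and the next timer is counted from here,
        # so every bit of lateness is carried forward.
        now += interval + late
        times.append(now)
    return times


def anchored_fire_times(interval, latencies):
    """Each tick is scheduled for start + n * interval, whatever came before."""
    times = []
    for n, late in enumerate(latencies, start=1):
        # Lateness affects only this tick; the schedule itself never moves.
        times.append(n * interval + late)
    return times


if __name__ == "__main__":
    late = [0.001] * 50          # 50 ticks = one second of 20ms packets
    print("naive last tick:    %.3fs" % naive_fire_times(0.020, late)[-1])
    print("anchored last tick: %.3fs" % anchored_fire_times(0.020, late)[-1])
```

Under these assumptions, after one second of audio the first rule is 50ms behind the ideal schedule, while the second is never more than 1ms late.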
The approach Shtoom now uses is to use a construct called LoopingCall, developed by JP Calderone. The guts of the LoopingCall are as follows::

    def _loop(self):
        # Call the function, with the stored args and kwargs
        self.f(*self.a, **self.kw)
        # Now re-calculate the next timer delay
        self.count += 1
        # How far are we from the start time? (negative once running)
        fromNow = self.starttime - time.time()
        # When should the next timer be scheduled, relative to the start?
        fromStart = self.count * self.interval
        delay = fromNow + fromStart
        if delay > 0:
            self.call = reactor.callLater(delay, self._loop)
            return

The approach is that the LoopingCall calls the function, then determines when the next timer call is due. It then schedules a timer call for the delay needed. This approach has proven rock-solid in use, and remnants of the previous code that used setitimer() have been removed from the codebase.

This removed the first major concern I had about implementing SIP in Python. The next, mixing together audio, seemed like a harder problem.

Performance
-----------

Many people in the computer industry are obsessed with performance over everything else. This obsession often misses a fundamental point - the question that should be asked for most applications is not "how fast is it?" but instead "is it fast enough?"

For most of shtoom, Python is easily fast enough. To really put it to the test, though, I concentrated on one of the most CPU intensive components - the mixing of audio for conference calls.

In a conference call, we have many audio sources contributing to the audio transmitted out. For each user, we want to find the "loudest" N audio sources (not including the user) and mix the audio samples together. A simple approach to take is to take each of the contributing audio sources, estimate their (power) volume (using a root-mean-squared of all samples), sort the samples by this power number, and then take the top N (for my example, N is 4). We then scale each audio signal by 1/N and add them together.
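As a rough illustration of the select-and-mix step just described, here is a minimal pure-Python sketch. It is not the listing from Shtoom: the function names are mine, and it assumes each chunk is a byte string of little-endian signed 16-bit samples, with the caller having already left out the listener's own audio.

```python
import math
import struct


def rms(samples):
    """Root-mean-square of a sequence of integer samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix(chunks, n=4):
    """Mix the n loudest chunks, scaling each by 1/n.

    chunks: list of equal-length byte strings of signed 16-bit samples
    (little-endian is an assumption of this sketch).
    """
    decoded = [struct.unpack("<%dh" % (len(c) // 2), c) for c in chunks]
    # Sort by estimated power and keep the n loudest sources.
    loudest = sorted(decoded, key=rms, reverse=True)[:n]
    # Summing then dividing by n is the same as scaling each source by 1/n,
    # and keeps the result inside the 16-bit range.
    mixed = [int(sum(column) / n) for column in zip(*loudest)]
    return struct.pack("<%dh" % len(mixed), *mixed)


if __name__ == "__main__":
    def tone(level):
        return struct.pack("<160h", *([level] * 160))

    out = mix([tone(10), tone(1000), tone(2000), tone(-4000), tone(400)])
    print(len(out), struct.unpack("<160h", out)[0])
```

With five constant-level inputs, the quietest one (level 10) is dropped and the remaining four are averaged.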
Initially, I tried an implementation in straight Python. [listing mixPython]

In this (and all following examples) the input is a list of 320-byte strings, each holding 160 16-bit signed sample values, representing 20ms of audio. The output should also be a 320-byte string, in the same format.

We first take the RMS of each audio chunk and sort them by this value. We then take the top 4 samples, scale them down, then add them.

On my test machine, feeding in 18 audio samples, selecting the top 4, and then mixing them together took around 2.2ms. This is purely in Python, and only minimal efforts to optimise this were taken (I'm sure people can point to obvious speedups).

The second approach I tried was to use the Numarray [numarray] (formerly Numeric) Python extension. This is shown in [listing mixNumeric]. This turned out to be slightly slower than the pure-Python implementation (around 2.4ms). Examining this closer, it showed that while the scaling and adding were about 3 times faster, this was outweighed by the increased time taken in constructing the arrays. A discussion with the numarray folks after my pycon presentation showed that I could avoid this by re-using the array objects by assigning to them using slice notation.

Psyco [psyco] was also tried - this produced only minimal speedups. I'm open to ideas as to why. Some brief discussions with people knowledgeable in the inner workings of psyco suggest that if the code was reorganised into a form that psyco could better recognise, I'd get much better results. Psyco's still not a full python compiler, so I guess it's not entirely surprising to see this result.

Next was to start implementing sections of the code in Pyrex [pyrex]. Pyrex is a dialect of Python that is translated directly into C code. It's by far the most pleasant way to write C extensions for Python. Profiling the Python code revealed that the most expensive part of the calculation was the RMS - it was responsible for around 65% of the time taken.
Moving just the RMS operation to Pyrex reduced that component from around 1.4ms to just 0.35ms - taking the overall time to just over 1ms.

Having done all this work, I noticed that the standard Python module 'audioop' had most of the functions I needed, implemented in C code. Using these reduced the time taken to around 120 microseconds (0.12ms). This is an impressive 20 times speedup, and as an added bonus, the code is considerably smaller and easier to understand.

This isn't *quite* the end of the calculations, though - this only does mixing for a single output. We need to do this for each participant in the conference. We can re-use a lot of the calculations at each stage - we only need to calculate the power once for each sample, and all users that are not one of the N+1 loudest samples can re-use the same output sample.

For comparison's sake, times were taken for the stupid approach (recalculating the scaling and adding for each user) and for the smart method which does the minimum work necessary.

========  =========  ==================  ===================
Method    1 channel  18 channels (dumb)  18 channels (smart)
========  =========  ==================  ===================
Python    2.2ms      8.7ms               2.7ms
Numeric   2.4ms      5.0ms               2.7ms
Pyrex     1.1ms      7.7ms               1.6ms
audioop   0.12ms     0.80ms              0.18ms
========  =========  ==================  ===================

The "smart" approach also has the benefit that it scales up to a large number of participants very well.

So, back to the original point. Is this "fast enough" - well, this is for an audio sample of 20ms duration. The above figures show that we can produce audio output in around 1% of the time limit we have to meet. In the real world this would be even better performance - particularly with VoIP clients that have silence suppression, where they don't send audio if the user isn't talking. Even taking the pure-python implementation of the stupid mixing approach, we're "fast enough" (but there's not much spare CPU time). With a small amount of optimisation we can produce code that's easily fast enough.

All of the above methods were produced in about 3 hours of work.
At least 45 minutes of that was reading Numarray documentation, as it'd been a long time since I'd used it for anything.

It should also be noted that for most cases, even the stupid approach using straight Python is probably "fast enough" for a case with just a handful of users. If the user count is 4 or below, we can skip the entire RMS calculation and mix in all the user audio.

These results suggest that a whole host of other audio manipulation tasks (such as silence detection) should also be quite possible in Python.

DTMF/Touch-tones
----------------

When dealing with a telephone user, the only user interface available is hitting keys on the keypad of the phone. Doug needs to be able to detect these and pass them on to the application.

The standard way to carry DTMF in RTP is as a separate media type, alongside the audio (marked with a different payload type). Carrying them in this way means that you don't need to do relatively expensive signal processing all through the network, just at the edge connected to the PSTN. These edge gateways detect the tone coming in from the phone, and generate the "Button 3 start", "Button 3 stopped" packets. An interesting wrinkle is that because RTP is based on UDP, it's necessary to send multiple start/stop packets, to make sure one gets through. But beware - when dealing with Cisco's implementation of RTP, a duplicate stop packet generates a new event.

Unfortunately, not all implementations provide these out-of-band DTMF signals - sometimes they're in band as audio tones. There's a fairly simple, but robust, DTMF detection module included in shtoom. It uses an FFT of the incoming audio to detect the DTMF (the FFT is implemented using numarray). All DTMF signals are sent as a combination of 2 frequencies. The ugly problem here is that when using a heavily compressed voice codec, the DTMF may not survive the conversion in a recognisable form.
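To make the in-band detection idea concrete, here is a small sketch. Note that it deliberately swaps in a different technique from the one shtoom uses: rather than a numarray FFT, it uses the Goertzel recurrence, which measures the power at a handful of chosen frequencies in pure Python. The standard DTMF frequency grid is real; the function names, the 160 sample frame size and the 10% power threshold are assumptions picked for the example.

```python
import math

SAMPLE_RATE = 8000
LOW_FREQS = (697, 770, 852, 941)         # keypad row frequencies, in Hz
HIGH_FREQS = (1209, 1336, 1477, 1633)    # keypad column frequencies, in Hz
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, rate=SAMPLE_RATE):
    """Signal power at a single frequency, via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples, min_fraction=0.1):
    """Return the DTMF key present in samples, or None if there isn't one."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return None
    low_power, low = max((goertzel_power(samples, f), f) for f in LOW_FREQS)
    high_power, high = max((goertzel_power(samples, f), f) for f in HIGH_FREQS)
    # Assumed threshold: both tones must carry a decent share of the energy.
    floor = min_fraction * energy * len(samples)
    if low_power < floor or high_power < floor:
        return None
    return KEYS[LOW_FREQS.index(low)][HIGH_FREQS.index(high)]


def make_tone(key, count=160, amplitude=8000, rate=SAMPLE_RATE):
    """Generate count samples of the two-frequency tone for a key."""
    row = next(i for i, keys in enumerate(KEYS) if key in keys)
    f_low, f_high = LOW_FREQS[row], HIGH_FREQS[KEYS[row].index(key)]
    return [int(amplitude * (math.sin(2 * math.pi * f_low * n / rate) +
                             math.sin(2 * math.pi * f_high * n / rate)))
            for n in range(count)]
```

On clean audio, `detect_key(make_tone("5"))` picks out the 770Hz and 1336Hz pair. A sketch like this has no defence against audio that has been mangled by a compressed codec, which is where the real trouble starts.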
Voice codecs are designed to carry voice, and there's a lot of leeway for loss of audio quality. The human ear is very good at working with partially mangled audio, but detecting DTMF from this damaged audio stream is much, much harder. In addition, most codecs are optimised for speech, and do fairly horrible things to data signals such as DTMF.

One final digression - the shtoom phone program can, of course, generate DTMF, in the out-of-band format. Many other phones either don't support this at all, or have buggy implementations. This discovery (that existing phones were sucky) was one of the original reasons I looked at writing my own.

Python and Audio Recording
--------------------------

There's no portable approach to capturing audio in the Python standard library. The library ships with an 'ossaudiodev' module, which works on Linux and the \*BSDs. Caspar Wilstrup contributed a Python wrapper for ALSA [alsa], and Donovan Preston contributed one for CoreAudio, Apple's OS X audio interface. There's also a DirectSound driver that will be integrated into shtoom in the near future.

In addition, shtoom can use a PortAudio [portaudio] driver. PortAudio is a platform independent library for accessing audio hardware, with an existing Python wrapper [fastaudio]. (Some minor problems: the current release of PortAudio (v18) doesn't work with ALSA, and the fastaudio wrapper doesn't work on OS X, even though PortAudio itself works on OS X.)

So we've no shortage of audio drivers. In keeping with the shtoom approach of "the more the merrier" I've support for all of these, with a standard interface wrapped around each low-level driver.

Audio Encoding
--------------

As already mentioned, RTP supports multiple audio encodings. SIP negotiates a common set of encodings that all participants in the call can handle. Shtoom's underlying audio layer reads and writes audio as signed 16 bit PCM at 8kHz. This is then converted to whichever format is required for the call.
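As an example of what such a conversion involves, here is a sketch of turning 16 bit linear PCM into 8 bit G.711 mu-law, one of the formats a call can use. This is the widely published bias-and-segment algorithm written out in pure Python for illustration, not shtoom's code (the standard library's 'audioop' module does the same job in C). The function names and the little-endian sample layout are assumptions.

```python
import struct

BIAS = 0x84      # added before encoding so small values land in segment 0
CLIP = 32635     # largest magnitude that still fits once the bias is added


def linear_to_ulaw(sample):
    """Convert one signed 16 bit sample to an 8 bit mu-law value."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The exponent is the position of the highest set bit (segment 0 to 7).
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_frame(pcm):
    """Convert a string of 16 bit samples to mu-law, one byte per sample."""
    count = len(pcm) // 2
    # Assumption: samples are little-endian.
    samples = struct.unpack("<%dh" % count, pcm)
    return bytes(linear_to_ulaw(s) for s in samples)
```

A 320 byte frame of 20ms audio comes out as 160 bytes, which is where the 64kbit/sec figure for this codec comes from (160 bytes, 50 times a second).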
Requiring the barest minimum from the audio device leads to far fewer headaches. In future, it might be worthwhile supporting higher sampling rates for improved audio quality - this isn't planned for any time soon.

The easiest audio codec to support is G.711 ULAW. This is 8 bit ULAW at 8kHz (and is also the format used in an ISDN call). The standard Python 'audioop' module can convert to and from this format. The downside to this codec is that it consumes 64kbit/sec for each direction for the data (by the time you include UDP/IP and Ethernet headers, you're up to nearly 87kbit/sec!)

The next codec supported is GSM 06.10. This is a rather complex beast to implement - fortunately, though, other people have already done the work [gsm]. Itamar Shtull-Trauring wrote a simple wrapper around this library - it takes a sequence of 13-bit samples (the least significant three bits of each sample are discarded) and produces 33 bytes of output for each 20ms of sound. GSM 06.10 consumes about 13kbit/sec.

A recent addition to shtoom is support for the Speex codec [speex]. Speex's main claim to fame is that it is efficient and unencumbered by patents. In addition, the Speex codec can handle more sophisticated problems such as silence suppression (which avoids sending data if no-one is talking). We're not currently using these added features - something for the future.

There's a variety of other codecs in the standard G.72x family - unfortunately they are all patented up the wazoo, and require you to purchase licenses to use them. Shtoom will support these with C-code wrappers around the reference implementations, but obtaining a license before using them is obviously up to the end-user. (Lawyers are welcome to let me know whether simply providing an interface to a patented codec is going to get me sued - I hope not!)

A Call with SIP
---------------

So, how does this SIP thing all hook together? A simple example is probably in order, showing how shtoom handles the calls.
We'll show a small example here with one user calling another. (Under international treaties describing technical papers these users must be called "Alice" and "Bob".)

We'll assume both Alice and Bob are using shtoom - Alice is at home looking after a sick cat, while Bob is sitting at his desk at the office goofing off and looking for a reason to avoid work.

When Bob fired up his copy of shtoom after getting back from lunch, the first thing shtoom did was register Bob with a SIP location service. This consists of a message saying "Any call for Bob@divmod.com, send them to this IP address". The SIP proxy might request authentication (using HTTP's Digest Authentication), then registers the user.

Alice is sitting at home bored and decides that boredom shared is boredom halved, and places a call to her friend Bob@divmod.com. This isn't Bob's work address - divmod.com in this case is a SIP location server that Bob has an account on. Alice enters the address, and hits Call. The Shtoom UI calls into the application layer - this in turn calls into the SIP layer. The SIP layer creates a new Call object. This first determines which encodings are available, creates an INVITE message, and then sends this to the divmod.com SIP server. The INVITE contains a few pieces of information:

- The destination for the INVITE
- Who's doing the calling
- The network ports to use for the low level audio (SIP uses dynamically allocated ports)
- A description of the media that the caller can use (audio encodings, video encodings and the like).

The divmod.com SIP server looks up its internal database to figure out how to currently contact Bob, and forwards the SIP INVITE on to Bob's computer.

The SIP layer in Bob's shtoom creates a new call (based on the Call-ID header in the INVITE) and hands control off to this new call. The first thing it does for a new call is pop up a message saying "Alice is calling, answer?" Bob clicks 'Yes', and this is passed through to the newly created call.
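For the curious, the INVITE that Alice's phone sent at the start of this exchange might look roughly like the output of the following sketch. It shows the four pieces of information listed earlier: the destination, the caller, the audio port, and the media description (carried as an SDP body). This is an illustration of the message format, not shtoom's code - the function names, Alice's address and IP, the port numbers and the identifier values are all made up for the example.

```python
def build_sdp(local_ip, rtp_port, session_id=1):
    """Describe the media the caller can handle (G.711 ULAW and GSM)."""
    lines = [
        "v=0",
        "o=alice %d %d IN IP4 %s" % (session_id, session_id, local_ip),
        "s=call",
        "c=IN IP4 %s" % local_ip,
        "t=0 0",
        # RTP payload type 0 is PCMU (ULAW), payload type 3 is GSM.
        "m=audio %d RTP/AVP 0 3" % rtp_port,
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:3 GSM/8000",
    ]
    return "\r\n".join(lines) + "\r\n"


def build_invite(caller, callee, local_ip, sip_port, rtp_port,
                 call_id, branch, tag):
    """Format a SIP INVITE, HTTP-style: request line, headers, body."""
    body = build_sdp(local_ip, rtp_port)
    user = caller.split("@")[0]
    headers = [
        "INVITE sip:%s SIP/2.0" % callee,
        "Via: SIP/2.0/UDP %s:%d;branch=z9hG4bK%s" % (local_ip, sip_port, branch),
        "Max-Forwards: 70",
        "From: <sip:%s>;tag=%s" % (caller, tag),
        "To: <sip:%s>" % callee,
        "Call-ID: %s" % call_id,
        "CSeq: 1 INVITE",
        "Contact: <sip:%s@%s:%d>" % (user, local_ip, sip_port),
        "Content-Type: application/sdp",
        "Content-Length: %d" % len(body.encode("ascii")),
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


# Hypothetical values, for illustration only.
invite = build_invite("alice@example.org", "Bob@divmod.com", "192.0.2.10",
                      5060, 30000, "a84b4c76e66710@192.0.2.10",
                      "776asdhds", "1928301774")
```

The Call-ID header is what lets each end match later messages (the OK, the ACK, the eventual BYE) to the right call object.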
It sends back a 200 OK response to the proxy. (SIP shares much with HTTP, including the formatting of messages and many of the response codes. OK in SIP is 200, just like HTTP.) The response includes Bob's real network address, the list of media that Bob can handle from Alice's original INVITE, and the network ports that Bob's phone program will be using. The proxy forwards the response back to Alice.

Alice's phone receives the response, looks up the relevant call, and passes the response to it. It next replies with an ACK request directly to Bob's computer, using the network address in his response. After the ACK is sent, the connection starts up - audio flows between the network ports negotiated in the INVITE/OK messages.

Eventually one party or the other will terminate the connection - at that point, they click 'Hang up'. The application passes this to the SIP layer, along with the current call id. The SIP layer asks the relevant call object to format a BYE message, and sends it to the other phone. On receiving a BYE, the receiving phone hangs up the line and sends back an OK response.

Conclusions
-----------

So, the conclusions that can be drawn from this effort? Well, the first, and most obvious, is that Python's a hell of a lot more capable than you might think. I was actually surprised at how easily Python was able to handle what I was throwing at it - even without "cheating" and using code from the standard library, it was actually fast enough.

Pyrex makes it pretty trivial to remove the bits of the code that are potential bottlenecks. Pyrex rocks.

Once you remove the "performance" reason for avoiding Python, the list of reasons for not using Python is pretty thin. "It makes the C++ coders cry", while true and fair, isn't a suitable justification. (And yes, I've had people tell me that I should have coded this in C++, or Java, because they're "real" languages. It's safe to say that I do not share this opinion.)
The final point to be drawn from the performance section is that when you're looking at a problem that seems "hard" - examine the standard library. Python's "batteries included" philosophy, with its extensive library, means that someone's quite possibly already done the work for you. And there's no software methodology in the world that will produce a result faster than "someone already did it for me".

Overall, the number one thing I've figured out from this whole exercise is that people who refer to Python dismissively as "just a scripting language" probably don't know what they're talking about.

Future Work
-----------

In no particular order here's some future work that I, or someone else, might look at.

Video
~~~~~

My initial thoughts were that video would be completely impossible to handle with Python. Having been surprised once, though, I'm going to check this. I suspect that the problems of platform-independent audio interfaces will be even worse for video capture, and that this will be the major pain with implementing video.

More of SIP
~~~~~~~~~~~

The SIP implementation is not yet complete - my goal was to build something that works first, then fill out the details. There's still a lot to be done - SIP is a big protocol. Largely the implementation of these features is driven by necessity - as a new wrinkle is discovered, it's implemented.

There's also a mind-numbing number of extensions to SIP that are either standardised, or proposed as standards. These will be implemented if and when I find them useful, or if someone else finds them useful and contributes the code back.

Additional Platforms
~~~~~~~~~~~~~~~~~~~~

There's already support for many different platforms either checked in, or planned (shtoom even works on WinCE, or whatever the slightly-less-bloated version of Microsoft Windows for handhelds is called this week). There's always more, though.
Additional Interoperation
~~~~~~~~~~~~~~~~~~~~~~~~~

It would be nice to interoperate with programs such as Messenger and iChat - I've not begun this task. It appears that they use their own protocols for some of the call setup work - hopefully this won't require extensive protocol reverse engineering. iChat in particular is a bit interesting, as it uses an AIM message to signal that a call is starting - until then, it doesn't listen for SIP messages.

I can only test Shtoom against implementations that I have access to - as time goes on, I hope to be able to expand the list of tested systems.

Instant Messaging (SIMPLE)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The IETF is moving towards standardising an instant messenger protocol based on SIP. This could be an interesting direction to explore.

Additional Phone Features
~~~~~~~~~~~~~~~~~~~~~~~~~

Adding the ability to handle multiple calls is an obvious step for the phone application. Most of the work is in the user interface side for this.

One nice extension would be an ad-hoc conferencing facility. There's no reason that any given phone couldn't patch two or more incoming calls into the same audio session. Or allow the phone to make an additional outbound call, and patch it into the existing call.

There's a bunch of bells and whistles that can and probably will be added to the phone - playing ringing sounds and the like. At some point, it'd be nice to have someone who has a clue about UI design to help come up with a better phone UI. I'm a software guy, not a UI guy, and the existing UIs show this.

More Native AddressBook Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The shtoom phone program already has a basic address book. Hooking into the platform-specific address book (e.g. Outlook on Windows) would be neat!

Doug
~~~~

There's still a fair amount of work to get Doug to a state where I'm completely happy with it. This will largely be an iterative process - as new requirements come up, the platform will change.
Dinsdale
~~~~~~~~

At some point in the future I'll be implementing VoiceXML as an alternative to writing applications in Doug. The VoiceXML server will be known as Dinsdale. I'm not a huge fan of VoiceXML, in general, but it will be an interesting exercise to implement it.

A Long Footnote: Firewalls and SIP
----------------------------------

Firewalls and Network Address Translation (NAT) boxes are the bane of VoIP. With dynamically allocated UDP ports, and an announcement protocol that needs to know what ports the traffic is going to be on, it's almost certain that every firewall known to man will screw it up in some way.

There's a few solutions to this.

Proxies
~~~~~~~

One solution is to have a SIP proxy server. This is a pain in the backside for all concerned - it's also not very practical. Most people behind a firewall don't have the ability to run a proxy server. One day I might look at implementing this, but it's pretty low on my list of things to do.

Fixed Port Numbers
~~~~~~~~~~~~~~~~~~

Another approach is to have fixed local port numbers, and have the firewall forward those ports onto the inbound system. This is OK if it's only one or two people using the system, but quickly becomes a nightmare to manage for any larger number of users.

SIP-aware firewalls
~~~~~~~~~~~~~~~~~~~

It would be nice if firewalls gained knowledge of SIP, and would do the necessary magic to allow packets to flow through. This of course then means you're relying on the firewall vendor to get it right. This could be considered unlikely. Certain variants of Cisco's IOS implement this, and it's been reported to me that they actually work OK.

A Different Protocol
~~~~~~~~~~~~~~~~~~~~

One solution might be to use a different protocol that's a little more forgiving of firewalls. The problem here is getting the protocol standardised and deployed widely. This might be a long term approach - I've not seen much happening in this area.
STUN
~~~~

STUN [stun] is a UDP protocol hack to help you determine what your firewall is doing. Briefly, a STUN request involves sending a packet from the port you're going to be communicating on, to a STUN server outside your firewall. The STUN server examines the packet, and replies with the IP address and port number that it saw the packet come from.

STUN is only half of the solution. You also need your firewall to do stateful UDP filtering - that is, if packets go out from a port, allow the replies to come back in. A side-benefit of STUN is that the outbound request should allow traffic to flow back in, for most firewalls that handle this. There's some complications here that I've elided for space reasons - email me if you're interested.

Note that STUN doesn't help you if your firewall doesn't handle stateful UDP filtering. In the words of one correspondent "STUN just lets you discover how screwed you are". That is, it allows you to figure out whether your firewall is usable, and how you can work with it.

Shtoom implements STUN for both SIP and RTP traffic. The STUN implementation will eventually be folded back into the core Twisted framework - it's useful for any UDP protocol that needs firewall traversal.

UPnP
~~~~

Microsoft's entry in this cavalcade of horrors is Universal Plug and Play (UPnP). This is a protocol that allows networked devices to discover and control aspects of their local network. In the case of a firewall, it allows an end-user system to request a dynamic port-forwarding from the firewall to the box. Many network administrators will probably (rightly) recoil at letting applications on a Windows box dictate firewall policy. UPnP, while implemented initially on Windows, is now an open protocol.

As an aside, UPnP's implementation (which features SOAP, HTTP over multicast/broadcast UDP, and extremely odd XML) is a must-read for fans of unnatural and baroque network protocols.
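To give a flavour of the "HTTP over multicast UDP" part, here is a sketch of the discovery step: an HTTP-formatted M-SEARCH request sent as a single UDP datagram to the well-known SSDP multicast address, asking any Internet gateway device to identify itself. This is not shtoom's UPnP code - the function names and the timeout are assumptions, and the SOAP conversation needed to actually request a port forward is left out entirely.

```python
import socket

SSDP_ADDR = ("239.255.255.250", 1900)    # well-known SSDP multicast group
GATEWAY_TYPE = "urn:schemas-upnp-org:device:InternetGatewayDevice:1"


def build_msearch(search_target=GATEWAY_TYPE, max_wait=2):
    """Format an SSDP discovery request - HTTP headers, but no body."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        "HOST: %s:%d" % SSDP_ADDR,
        'MAN: "ssdp:discover"',
        "MX: %d" % max_wait,          # seconds a device may delay its reply
        "ST: %s" % search_target,     # the kind of device we're looking for
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def discover(timeout=3.0):
    """Send one M-SEARCH and return (address, reply), or None if no reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_msearch(), SSDP_ADDR)
        data, addr = sock.recvfrom(4096)
        return addr, data.decode("ascii", "replace")
    except OSError:
        # Covers timeouts and networks where multicast isn't available.
        return None
    finally:
        sock.close()
```

A gateway that answers includes a LOCATION header pointing at an XML description of itself, which is where the extremely odd XML comes in.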
There's a partial implementation of UPnP in Shtoom - I hope to finish it in the not too distant future. It's not clear how useful this will be - the first worm/virus that uses UPnP to punch holes through firewalls will probably result in UPnP being disabled everywhere. Most of the routers that implement UPnP are also capable enough that it's unnecessary.

Acknowledgments
---------------

Thanks to the entire Divmod and Twisted teams for assistance in the development of Shtoom. Special thanks to Amir Bakhtiar for providing hardware for testing, patient Windows user feedback, and for getting me along to PyCon in March 2004, which provided me with the impetus to write the original version of this paper.

Thanks also to Andy Hird, Dougal Scott, Cam Blackwood, Benno Rice, Toby Sargeant and Damien Moore for feedback on various versions of this paper.

Any mistakes remaining are, of course, entirely my fault. And my cats' fault. They're always messing things up.

More Information
----------------

The tragically-in-need-of-an-update website is at shtoom.divmod.org. There's a mailing list at shtoom@python.org, and a #shtoom channel on irc.freenode.net. At the moment, shtoom is heading towards another release - this is planned for before OSDC, but it might not make it.
References
----------

Web URLs referenced in the paper:

- [alsa] http://www.alsa-project.org/
- [fastaudio] http://www.freenet.org.nz/python/pyPortAudio/
- [gsm] http://kbs.cs.tu-berlin.de/~jutta/toast.html
- [h323] http://www.packetizer.com/voip/h323/
- [numarray] http://www.stsci.edu/resources/software_hardware/numarray
- [openh323] http://www.openh323.org/
- [portaudio] http://www.portaudio.com/
- [psyco] http://psyco.sourceforge.net/
- [pydirector] http://pythondirector.sf.net/
- [pyrex] http://nz.cosc.canterbury.ac.nz/~greg/python/Pyrex/
- [python] http://www.python.org/
- [rtp] http://www.cs.columbia.edu/~hgs/rtp/
- [shtoom] http://divmod.org/Home/Projects/Shtoom
- [sip] http://www.cs.columbia.edu/sip/
- [sipphone] http://www.sipphone.com/
- [skype] http://www.skype.com/
- [speex] http://www.speex.org/
- [stun] http://www.ietf.org/rfc/rfc3489.txt
- [twisted] http://www.twistedmatrix.com/
- [vonage] http://www.vonage.com/

Books:

The must-have book on RTP is "RTP: Audio and Video for the Internet", by Colin Perkins (Addison Wesley, 2003). If you're interested in this topic, this is _the_ book to get.

There's quite a large number of books available on VoIP - the older ones mostly discuss H.323, usually with a brief mention of SIP. I've not yet found any books on SIP that I'd care to recommend. If you know of one, I'd love to hear about it.

I'd also love to be able to recommend a good book on audio processing for software people. The books I've found (and use) vary between solid signal processing textbooks (heavy on the math, not so good on the implementation side) and thumb-suckers for people who have no idea about audio ("See Jack run. See Jack record an audio sample.") Again, I'd love to know of anything better. The Perkins RTP book has a fair amount on dealing with packet-based audio.

If you're interested in Python, consider either or both of Alex Martelli's "Python in a Nutshell" (O'Reilly, 2003) and Mark Pilgrim's "Dive Into Python" (Apress, 2004).
The latter is also available online at http://diveintopython.org/