Kaffeine: Smart Video Editing using Machine Learning
“The difference between something good and something great is attention to detail.” Kraenion's automatic video editing, based on speech and image analysis, makes video presentations go from good to great with one click.
Even experienced professional speakers, like professors and realtors, are increasingly competing for viewer attention online. There are hundreds of online courses on advanced topics from Artificial Intelligence to Quantum Physics, dozens of dentists in every zip code explaining how they fit braces for teenagers, and realtors offering tours of every property in any neighborhood. In real life, audiences might be patient and respectful; however, when online, they skip through videos or click on a competitor's video if your video delivery fails to keep them constantly engaged. Because your audience is valuable, you want to improve your video presentations to meet this critical engagement need. Unfortunately, editing videos frame-by-frame takes a lot of time and it is expensive to have someone else do this for you.
Our automatic video editor energizes your presentations by making them concise without losing relevant content. Kaffeine algorithms can eliminate drawls, pauses, and fillers; detect and cut sirens and dog barks; silence noisy air conditioners; re-frame your words; and more. Kaffeine makes hundreds of millisecond-level edits, and it is this scale that takes video editing to a level beyond human capabilities. If you desire a greater degree of control, Kaffeine provides granular feedback, flagging segments with jittery video, extreme closeups, too little speech for the segment duration, and so on, and you can decide what to keep or snip. Kaffeine also lets you set a target time (say 10 minutes) or a time to shrink (say 45 seconds), and the algorithm automatically edits to that requirement; you can iterate on these quickly. Once you watch your own presentation using Kaffeine editing, there is no going back. You will never expose yourself on Facebook or YouTube without it. Why would you not want to be your best self online?
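To make the "time to shrink" idea concrete, here is a minimal sketch of how a set of candidate edits could be chosen to meet a 45-second goal. It is not Kaffeine's actual algorithm: the `Cut` record, the `penalty` score, and the greedy "cheapest per second saved" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cut:
    """A hypothetical candidate edit: removing [start, end) seconds of video."""
    start: float
    end: float
    penalty: float  # assumed cost of the cut; lower means safer to remove

    @property
    def duration(self) -> float:
        return self.end - self.start


def choose_cuts(cuts: list[Cut], seconds_to_shrink: float) -> list[Cut]:
    """Greedily pick the cheapest cuts per second saved until the goal is met."""
    chosen: list[Cut] = []
    saved = 0.0
    usable = [c for c in cuts if c.duration > 0]
    # Sort by penalty per second saved, so long, low-risk cuts come first.
    for cut in sorted(usable, key=lambda c: c.penalty / c.duration):
        if saved >= seconds_to_shrink:
            break
        chosen.append(cut)
        saved += cut.duration
    # Return in timeline order so the cuts can be applied front to back.
    return sorted(chosen, key=lambda c: c.start)


candidates = [
    Cut(12.0, 14.5, penalty=0.1),   # long pause
    Cut(30.0, 30.4, penalty=0.2),   # filler word
    Cut(51.0, 86.0, penalty=3.0),   # speaker writing on the board
    Cut(95.0, 105.0, penalty=0.5),  # siren in the background
]
plan = choose_cuts(candidates, seconds_to_shrink=45.0)
saved_seconds = sum(c.duration for c in plan)
```

With these made-up candidates, the plan keeps the filler word and removes the other three spans. A real editor would also have to smooth each cut rather than simply delete it.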
Use cases: Video lectures and MOOCs, professionals with a YouTube channel, how-to videos, nearly live interviews and broadcasts, post-edit polishing tool for professional video editors, corporate training material, earnings calls, etc.
Kaffeine's machine learning system analyzes both the audio and video streams, as shown in Figure 1.
- The audio stream is analyzed for background noise (HVAC, car, wind sounds), interruptions (pet sounds, alarms, sirens), filler words, pauses, and drawls, and editing opportunities are annotated.
- The presence of music and of clicking or popping sounds (e.g., chalk on a board, key clicks) is also identified.
- Faces are detected and tracked in the video stream whenever the sizes are large enough for viewers to track lip movements.
- Scene changes are determined, and frames that correspond to titles, credits, etc. are annotated. All of these annotations provide opportunities for condensing the video.
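The annotations listed above can be pictured as time spans tagged with what was detected. The sketch below is an illustration only: the `Annotation` record and its `kind` labels are hypothetical, not Kaffeine's data model. It shows why spans from different detectors need to be merged before estimating how much of a video could be condensed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical record for one detected editing opportunity."""
    kind: str     # e.g. "pause", "filler", "siren", "title_frame"
    start: float  # seconds from the beginning of the video
    end: float


def merge_spans(annotations: list[Annotation]) -> list[tuple[float, float]]:
    """Merge overlapping or touching spans so no second is counted twice."""
    merged: list[tuple[float, float]] = []
    for a in sorted(annotations, key=lambda a: a.start):
        if merged and a.start <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], a.end))
        else:
            merged.append((a.start, a.end))
    return merged


detected = [
    Annotation("pause", 4.0, 6.0),
    Annotation("siren", 5.0, 9.0),         # overlaps the pause
    Annotation("filler", 20.0, 20.5),
    Annotation("title_frame", 0.0, 3.0),   # from the video stream
]
spans = merge_spans(detected)
removable_seconds = sum(end - start for start, end in spans)
```

Here the pause and the siren collapse into a single span, so the overlap between them is counted once.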
Maintaining alignment between the audio and video tracks is crucial. When a speaker is not in the frame but may be narrating off-frame, we allow the video and audio tracks to diverge for a few seconds at most, and our automatic re-timing and synchronization technology brings the tracks back into alignment every few seconds. When the speaker is in the frame, we maintain a much tighter alignment between video and audio, but still edit out fillers and nuisance noise.
Even after identifying all the editing opportunities, one cannot just delete the offending segments without considering the impact on the overall video. If video frames are dropped indiscriminately, the playback will appear jerky. Audio is quite unforgiving: deleting segments can often lead to click and pop sounds or other unpleasant disturbances. We use our estimates of scene change along with audio signal processing and filtering technology to avoid such undesirable effects.
Aligning audio and video while controlling lag requires solving a problem known as time warping. It is possible to solve this using a machine learning approach, but an algorithmic approach affords a greater degree of control. Our technology uses a combination of techniques: dynamic time warping, time warping by sampling from the cumulative distribution function of the difference distribution, and uniform sampling in suitable areas. For sections with background music only, we analyze the scene progression and sometimes shorten the video using a time lapse and shorten the music track.
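For readers unfamiliar with dynamic time warping, the following is a minimal textbook sketch on two one-dimensional sequences. How Kaffeine applies the technique to audio and video is not described here, so the inputs (two per-frame feature sequences) and the absolute-difference cost are assumptions chosen to keep the example small.

```python
def dtw(a: list[float], b: list[float]) -> tuple[float, list[tuple[int, int]]]:
    """Return the minimal alignment cost and the warping path between a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which samples were matched.
    path: list[tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Two made-up feature tracks; the second lingers one extra frame on the value 1.
track_a = [0.0, 1.0, 2.0, 3.0]
track_b = [0.0, 1.0, 1.0, 2.0, 3.0]
total_cost, path = dtw(track_a, track_b)
```

The returned path pairs index 1 of `track_a` with indices 1 and 2 of `track_b`, which is the kind of local stretching a re-timing step would need in order to keep two tracks in step.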
It is the simultaneous application of machine learning, digital signal processing, and lag management that makes the Kaffeine technology stand apart from traditional editors. For most videos, we achieve a speed-up of approximately 1.5x without dropping intended content or affecting the user experience. The end result is that the speaker comes across as confident and professional.
Kaffeine editing opens up new and unprecedented opportunities in areas like near-live broadcasting, video lectures and professional YouTube channels.
For live broadcasts, human editing is not possible, and viewers often have to suffer through segments where nothing significant happens (e.g., walking around, applause). Many news and talk shows have live segments with untrained speakers, such as first responders at an accident, witnesses at an event, and legal experts providing opinions. Kaffeine's video editing system provides new revenue opportunities for broadcasters of such events.
When a few minutes of delay is acceptable, the live show can be started slightly before the scheduled broadcast time (the run-ahead time) and piped through Kaffeine prior to broadcast. As shown in the demonstration videos, Kaffeine maintains accurate estimates of the time saved. When the saved time exceeds the run-ahead time, broadcasters can opportunistically insert previously recorded advertisements, announcements, etc. while Kaffeine continues to buffer up the live show. While the ads run, the saved time gets depleted. When enough of the saved time has been used up, the broadcast can switch back to the edited show. The broadcaster can also sell opportunistic ads on a network where the price is set based on what occurred in the news, and use our system to create new slots to show the ads. For example, if a segment had a dentist as a guest, local dentists may want to bid to air previously prepared ads. On the web, this is often done through auctions where software agents bid on behalf of users. In the worst case, if the publisher has no alternate material available, we can stop or reduce the aggressiveness of our automatic editing a few seconds before the system exhausts the run-ahead time, and the show will continue as a nearly live stream.
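The saved-time bookkeeping described above can be sketched as a simple ledger. This is a simplified illustration, not the broadcast system itself: the function name, the per-segment savings, and the fixed ad length are hypothetical, and the trigger follows the rule as stated in the text (insert an ad once the unspent saved time reaches the run-ahead time).

```python
def schedule_ads(
    saved_per_segment: list[float],
    run_ahead: float,
    ad_length: float,
) -> tuple[list[int], float]:
    """Return the segment indices after which an ad runs, and the leftover saved time."""
    balance = 0.0  # saved time (seconds) not yet spent on ads
    ad_slots: list[int] = []
    for index, saved in enumerate(saved_per_segment):
        balance += saved
        if balance >= run_ahead:
            # Enough time has been saved: run an ad, which depletes the balance.
            ad_slots.append(index)
            balance = max(0.0, balance - ad_length)
    return ad_slots, balance


# Made-up numbers: seconds saved by editing each successive live segment.
savings = [10.0, 12.0, 9.0, 8.0, 15.0, 11.0]
slots, leftover = schedule_ads(savings, run_ahead=30.0, ad_length=20.0)
```

With these numbers an ad runs after the third and fifth segments, and 25 seconds of saved time remain at the end. The worst-case fallback (reducing how aggressively the show is edited when no ad is available) is not modeled here.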
All of these benefits are also applicable to broadcasts of previously recorded shows. In addition, Kaffeine can be used as a power assistant to a human video editor by automating the more routine tasks, thereby freeing up the human editor to focus on the higher-level cognitive tasks that make videos even more engaging for the audience.
Educational Content (MOOCs & Lectures)
“Watching an 80 minute lecture online is like poking your eyes with needles” says Prof. Aswath Damodaran of the NYU Stern School of Business.
Condensing lectures is a huge need, since most universities now offer lectures online, and educational content is a big part of YouTube, Apple iTunes U, Udacity, Coursera, etc. The key issue for educational content is that the pace of a live lecture is much slower than the ideal pace for an online audience. For example, students who are preparing for an exam likely need to watch hundreds of hours of recorded lectures. Another problem is that the slow pace of online video lectures likely reduces engagement for professionals (busy adults), who drop out and never finish the course. Kaffeine addresses these needs by letting publishers release edited, condensed content suited to these audiences.
We present a lecture clip that has been shortened from 1:20 to 0:59 (minutes:seconds), an overall speedup of about 1.4x, without losing any relevant content. The experience from this clip is superior to that of watching the original at 1.5x on YouTube: the Kaffeine-edited version is much more understandable because it is smart about what to speed up (e.g., notice that it speeds up where the lecturer is writing on the board) and where to play at original speed (e.g., during spoken words).
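The speedup figure follows directly from the two clip lengths. As a quick check, the snippet below redoes the arithmetic; the helper `to_seconds` is introduced here only for illustration.

```python
def to_seconds(clock: str) -> int:
    """Convert an "m:ss" clock string to a number of seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


original = to_seconds("1:20")  # 80 seconds
edited = to_seconds("0:59")    # 59 seconds
speedup = original / edited    # about 1.36, which rounds to 1.4x
uniform_1_5x = original / 1.5  # about 53.3 seconds at a flat 1.5x
```

The edited clip runs only a few seconds longer than a flat 1.5x playback of the same material, while keeping spoken passages at their original speed.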
For comparison, the standard YouTube speedup of 1.5x is hard to understand:
Individual publishers (Professionals, How-to videos, YouTube Channels)
Most of us search YouTube for instructions on how to assemble furniture, fix appliances, tour a new city, etc. Many publishers of such content are individuals who may not have access to professional video editors. Some individuals make a livelihood from this, while others use such videos to attract new customers and build or maintain their brand. The following video tour is an example made by a professional realtor for potential buyers to view a property that is available for purchase. Kaffeine edited out segments where the realtor walked up stairs or paused too long, and the effective speedup is 1.54x. In this case, since the speaker is not in the video frame, audio and video content can be adjusted more easily without requiring strict alignment. Kaffeine can save professional YouTube producers hours of tedious editing labor and yield polished content, leading to a competitive advantage. In the realtor scenario, the first realtor to reach a property can use a smartphone to record the tour, pipe it through Kaffeine's editor (in the cloud), and publish the edited version within minutes, ahead of peers using traditional methods.