FaceFusion Voice + Video Lip-Sync: Detailed Steps for Local Installation and Upgrade

AI Exploration and Discovery
26 Feb 2024 12:00

Summary

TL;DR: This video explores the latest updates to FaceFusion, a tool that now enables video characters to speak by syncing audio with lip movements. It also features an improved face recognition model, YOLOface, which enhances accuracy in low light and for side-profile or downward-tilted faces. The tutorial guides users through upgrading from older versions, covering program-file updates, dependency upgrades, and enabling GPU acceleration. It demonstrates how to use the new voice-driven lip-syncing feature and provides tips on optimizing settings for better performance. Additionally, it discusses the nuances of using YOLOface for face swapping in various scenarios.

Takeaways

  • 😀 FaceFusion now has a new feature that allows video characters to speak by synchronizing voice and lip movements using a provided audio and a frontal video.
  • 😎 The update includes a new YOLOface recognition model, which significantly improves accuracy in poor lighting and for low-angle and side-face shots.
  • 🛠️ To upgrade from an older version (2.2.1 or earlier), users need to use Git commands and update dependencies, ensuring they have the correct CUDA version for NVIDIA GPUs.
  • 🌐 Users in Mainland China are advised to use a proxy for smoother updates and model downloads due to potential connectivity issues.
  • 💻 After upgrading, users can verify the success by checking the version number and ensuring the new features appear on the interface.
  • 🎨 The new version supports manual download of models if automatic download is slow, by accessing the official model list and placing them in the models directory.
  • 🚀 A one-click run script can be created to simplify the launch process by modifying a text file into a batch file with the necessary run commands.
  • 🗣️ The voice-driven lip-sync feature works best with English audio and mid-shot videos; it can also handle songs, though results may vary.
  • 🔍 The YOLO model offers better performance in recognizing multiple faces in a scene compared to the previous RetinaFace model, with optimizations for various angles and lighting.
  • 🛠️ For complex facial scenes, users may need to experiment with different model and mask combinations to achieve optimal results, as some extreme angles may still pose challenges.
  • 🎵 The script concludes with a demonstration of the new model’s effectiveness on a video with many low-angle and side-face shots, inviting viewers to spot any imperfections.

Q & A

  • What are the new features introduced in the latest update of FaceFusion?

    -The latest update of FaceFusion includes the ability to make video characters speak by syncing provided audio with their lip movements, and a new facial recognition model that improves accuracy in low light conditions and for identifying faces in low angles or side profiles.

  • What are the prerequisites for upgrading FaceFusion from an older version?

    -To upgrade FaceFusion, you must have successfully installed version 2.2.1 or earlier using the git tool for manual installation. If you haven't installed any version before, you can refer to the video to install the latest version directly.

  • What steps are involved in upgrading FaceFusion using git?

    -The steps include: 1) Navigating to the FaceFusion installation directory and opening the command window. 2) Running the command 'git pull' to update the program files. 3) Activating the virtual environment. 4) Updating the installed dependencies using a specific command. 5) Verifying the upgrade by running the program and checking the version number.
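
    For reference, here is a minimal command sketch of those steps. It assumes the install directory shown in the video (D:\AI\facefusion), a virtual environment named venv inside it, and dependencies tracked in requirements.txt; the venv name, requirements file, and the exact run-command flag are assumptions, since the video does not spell them out on screen.

      cd /d D:\AI\facefusion
      :: 1) pull the new program files
      git pull
      :: 2) activate the virtual environment (name assumed to be "venv")
      venv\Scripts\activate
      :: 3) update the installed dependencies (assumes a requirements.txt)
      pip install --upgrade -r requirements.txt
      :: 4) run the program and check that the UI reports version 2.3.0
      python run.py --skip-download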

  • Why is it recommended to use a proxy when updating FaceFusion in Mainland China?

    -Using a proxy is recommended to ensure smooth updates, especially when downloading dependencies and models, as it can help bypass potential network restrictions and improve download speeds.

  • How can users enable GPU acceleration after upgrading FaceFusion?

    -To enable GPU acceleration, users need to check the CUDA version used in the original installation (e.g., cu118 for CUDA 11.8) and then run the 'python install.py' command, selecting the appropriate CUDA version during the installation process.
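
    A hedged sketch of that check (the findstr filter is only a convenience; the suffix after the plus sign in the torch line identifies the CUDA build):

      :: show the installed torch build, e.g. "torch 2.x.x+cu118" -> CUDA 11.8
      pip list | findstr torch
      :: re-run the installer and pick the matching CUDA version when prompted
      python install.py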

  • What is the purpose of the YOLOface model in FaceFusion?

    -The YOLOface model is designed to improve facial recognition accuracy, especially in challenging conditions such as low light, low angles, and side profiles. It is more efficient and accurate compared to the previous RetinaFace model.

  • How can users manually download and install the new models if the automatic download is slow?

    -Users can visit the official project's model list page, download the 5 new models for version 2.3, and then copy them into the 'models' directory of the FaceFusion installation folder.
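
    As a sketch of the copy step, assuming the five downloaded files are .onnx models sitting in your Downloads folder and the installation lives at D:\AI\facefusion (both paths are examples; use the models folder named in the video for your version):

      move "%USERPROFILE%\Downloads\*.onnx" "D:\AI\facefusion\models\"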

  • What are the recommended settings for using the voice-driven lip-syncing feature in FaceFusion?

    -For voice-driven lip-syncing, it is recommended to use the YOLO model for facial recognition, enable facial enhancement to improve lip clarity, and adjust the thread count to optimize processing speed based on GPU memory.

  • How does the new YOLO model perform compared to the old RetinaFace model in different scenarios?

    -The YOLO model is faster and more accurate in general facial recognition tasks, especially for low angles and side profiles. However, it may perform slightly worse with closed masks compared to RetinaFace. Users should choose the model based on the specific requirements of their video content.

  • What are some limitations of FaceFusion when dealing with complex facial angles?

    -FaceFusion may struggle with extremely complex facial angles, such as when the head is turned close to 90 degrees. In such cases, the model may produce deformed results or fail to handle the facial features correctly.

  • How can users create a one-click script to run FaceFusion easily?

    -Users can create a one-click script by creating a new text file in the FaceFusion installation directory, renaming it with a '.bat' extension, and pasting the appropriate command inside the file. This allows them to run FaceFusion directly by double-clicking the script.
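
    A minimal example of such a script, saved as, for instance, 一键启动.bat in the install directory. The virtual-environment name (venv) and the run command are assumptions based on the setup described above, not text shown in the video:

      @echo off
      :: switch to the folder this script lives in
      cd /d %~dp0
      :: activate the virtual environment created during installation
      call venv\Scripts\activate
      :: launch FaceFusion, skipping the model re-download check
      python run.py --skip-download
      pause

    Double-clicking the file then starts FaceFusion without any typing.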

Outlines

00:00

🚀 Upgrading FaceFusion and Exploring New Features

This paragraph provides a comprehensive guide to upgrading FaceFusion to version 2.3.0 and introduces the new voice-driven lip-syncing feature and the improved YOLOface recognition model. The process involves using Git commands to update the software, activating a virtual environment, and updating dependencies. It also highlights the importance of checking for GPU acceleration and manually downloading models if necessary. Additionally, it explains how to create a one-click script to simplify the launch process. Voice-driven lip-syncing requires selecting the appropriate options and adjusting parameters like thread count for optimal performance. The paragraph also discusses the improved YOLOface model's accuracy in low-light conditions and its ability to handle different facial orientations.

05:01

🔍 Detailed Usage of New Features and Model Comparisons

This paragraph delves into the detailed usage of the new features in FaceFusion, focusing on voice-driven lip-syncing and the YOLOface model. It explains the importance of adjusting parameters such as thread count to improve video processing speed and notes a potential bug with direct input values. The paragraph also highlights the YOLOface model's performance, comparing it with the previous RetinaFace model in terms of speed, memory usage, and accuracy in various scenarios. It provides specific recommendations for using different models and masks based on the video content, such as using YOLO with face enhancement for voice-driven lip-syncing and RetinaFace with closed masks for heavily occluded faces. The paragraph concludes by emphasizing the need for multiple attempts to achieve the best results, especially for complex facial scenes, and notes limitations in handling extreme head rotations.

10:01

🎵 Conclusion and Demonstration of Model Effectiveness

This paragraph concludes the video script by inviting viewers to listen to a music segment and observe the effectiveness of the new model in handling a video with numerous low-angle and side-face shots. It challenges viewers to spot any imperfections in the face-swapping results and expresses confidence in the new model's capabilities. The paragraph also includes a promotional statement about FaceFusion being a next-generation face swapper and enhancer, emphasizing its superiority over other tools. It ends with a thank-you note to the viewers and a hint for future content.

Mindmap

Voice-Driven Lip Sync
Enhanced Face Recognition Model (YOLOface)
New Features Overview
Cost-Effective Digital Human Creation
Improved Accuracy in Low-Light Conditions
Better Handling of Side Profiles and Downward Faces
Benefits of Updates
Introduction to FaceFusion Updates
Installed Version 2.2.1 or Earlier
Manual Installation via Git
Prerequisites
Navigate to Installation Directory
Execute 'git pull' for Program Update
Activate Virtual Environment
Update Dependencies
Verify Upgrade by Running Program
Upgrade Steps
Check CUDA Version via 'pip list'
Run 'python install.py' to Update Dependencies
Additional Steps for NVIDIA GPU Users
Automatic Model Download on First Use
Manual Download for Slow Network Speeds
Copy Models to 'models' Directory
Model Download and Manual Update
Upgrade Process
Synchronize Voice with Video Lip Movements
Supports Both Speech and Music
Functionality
Disable Default Face Swap Option
Upload Voice and Video Files
Adjust Thread Count for Performance
Optional Face Enhancement for Better Results
Usage Steps
Better Results with English Voice
More Natural Lip Movements in Medium Shots
Performance Observations
New Feature: Voice-Driven Lip Sync
Developed by Google
High Speed and Accuracy
Model Overview
Recommended for Voice-Driven Lip Sync
Better Handling of Multiple Faces in Frame
Usage in FaceFusion
Faster Replacement Speed
Higher GPU Memory Utilization
Performance Improvements
Age and Gender Prediction
Increased Feature Points for Recognition
Additional Features
Potential Performance Drop with Closed-Mask
Limitations
Enhanced Face Recognition Model: YOLOface
YOLO + Face Enhancement for Voice-Driven Lip Sync
YOLO + Box Mask for Unobstructed Faces
Retinaface + Closed Mask for Heavily Obstructed Faces
YOLO + Box Mask for Partially Obstructed Faces with Side Profiles
Model and Mask Combination
Multiple Attempts Required for Optimal Results
Limitations with Extreme Head Angles
Handling Complex Scenarios
Practical Application and Recommendations
Summary of Updates and New Features
Encouragement for Continued Exploration and Testing
Conclusion
AI Exploration and Discovery: FaceFusion Updates

Keywords

💡FaceFusion

FaceFusion is a software tool that allows users to swap faces in videos and now also enables the synchronization of speech with lip movements. It is central to the video's theme as the entire script revolves around its new features and upgrades. The script mentions how FaceFusion has evolved from just face swapping to incorporating speech-driven lip-syncing, making it a more versatile tool for creating digital human videos.

💡Speech-Driven Lip-Syncing

This refers to the new feature in FaceFusion that allows the software to match the lip movements of a person in a video with a provided audio track. It is a significant update discussed in the video, enabling more realistic and engaging digital human videos. The script explains how users can now provide a voice clip and a frontal video, and the software will synchronize the speech with the lip movements, enhancing the capabilities of low-cost digital human video creation.

💡YOLOface

YOLOface is a new facial recognition model introduced in the latest version of FaceFusion. It is designed to improve the accuracy of facial recognition, especially in challenging conditions such as poor lighting or when the face is in a low-angle or side profile. The script highlights how YOLOface enhances the software's ability to identify and replace faces more accurately compared to the previous model, RetinaFace, particularly in complex scenarios.

💡CUDA Acceleration

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. In the context of FaceFusion, CUDA acceleration allows the software to leverage the power of NVIDIA GPUs to speed up processing tasks. The script mentions that users with NVIDIA GPUs need to perform an additional step to enable CUDA acceleration, which significantly improves the performance and speed of the software, especially when processing high-resolution videos.

💡Virtual Environment

A virtual environment is an isolated space where software dependencies can be managed without affecting the system-wide settings. In the script, the virtual environment is used to manage the dependencies required by FaceFusion. The update process involves activating the virtual environment to ensure that all necessary packages are updated correctly, allowing the software to run smoothly with the new features.

💡Dependency Packages

These are software libraries and modules that FaceFusion relies on to function properly. The script explains that after updating the software using 'git pull', users need to update the dependency packages. This step ensures that all the necessary components are compatible with the new version of FaceFusion, allowing it to utilize its new features such as speech-driven lip-syncing and the YOLOface model.

💡Git

Git is a version control system used for tracking changes in source code during software development. In the context of this video, Git is used to manage the installation and updates of FaceFusion. The script instructs users to navigate to the FaceFusion directory and use the 'git pull' command to update the software, highlighting Git's role in keeping the software up-to-date with the latest features and improvements.

💡Digital Human Videos

These are videos that feature digital representations of humans, often created using software like FaceFusion. The script discusses how FaceFusion's new features, such as speech-driven lip-syncing, contribute to the creation of more realistic and engaging digital human videos. This concept is central to the video's theme as it showcases the potential applications of FaceFusion in various fields, including entertainment, education, and social media.

💡Face Enhancement

Face enhancement is a feature in FaceFusion that improves the visual quality of faces in videos. The script mentions that when using the speech-driven lip-syncing feature, the mouth area may appear blurry, so adding face enhancement can improve the overall visual quality. This feature is important for creating more realistic and visually appealing digital human videos, especially when dealing with low-quality source videos.

💡Model Download

In the context of FaceFusion, model download refers to the process of obtaining the necessary pre-trained models required for the software to function properly. The script explains that the new version of FaceFusion will automatically download the required models during the first use. However, due to potential slow download speeds in certain regions, users are advised to manually download and place the models in the appropriate directory. This ensures that the software can access the necessary components to utilize its new features effectively.

💡Batch Script

A batch script is a text file containing a series of commands that can be executed automatically. The script mentions creating a batch script to simplify the process of running FaceFusion. By creating a batch file with the necessary commands, users can easily launch the software without having to manually enter the commands each time, making the usage of FaceFusion more convenient and efficient.

Highlights

FaceFusion now supports not only face swapping but also lip-syncing with voice input.

The new version includes a YOLOface model, improving face recognition accuracy in poor lighting and for low-angle or side faces.

Upgrade from previous versions requires git pull, activating virtual environment, and updating dependencies.

For Nvidia GPU acceleration, additional steps are needed to confirm and update CUDA versions.

New models will be automatically downloaded on first use, but manual download is recommended for faster setup in China.

A one-click run script can be created to simplify the program launch process.

Voice-driven lip-syncing works best with English and in mid-shot videos rather than close-ups.

The YOLO model offers faster face replacement speed compared to the previous RetinaFace model.

The new YOLO model includes age and gender prediction features.

For videos with unobstructed faces, face replacement with the YOLO model and a box mask is recommended.

For videos with partially obstructed faces and many low-angle shots, YOLO with box masks is suggested.

For heavily obstructed faces, the old RetinaFace model with closed masks is recommended.

The new model struggles with extreme head rotations close to 90 degrees.

The new version optimizes face recognition by adding a 5-point facial landmark mode (the previous model used 68 points), which is less error-prone on downward-tilted, side-profile, or poorly lit faces.

Despite optimizations, YOLO's performance may decrease when using closed masks.

Transcripts

00:00

Hello, everyone!

00:00

Welcome to AI Exploration and Discovery

00:02

FaceFusion has new features again

00:04

Now it can not only swap faces

00:05

it can also make the person in a video speak

00:07

As long as you provide an audio clip

00:09

and a video shot facing the camera, after fusion

00:12

the voice and the lip movements can be synchronized

00:14

Low-cost digital human video now has one more option

00:17

Another major improvement in this update

00:19

is the addition of a new face recognition model

00:21

which greatly improves, when lighting is poor,

00:23

the accuracy of face recognition

00:25

It is especially more precise on head-down and side-profile shots

00:28

Today's video explains in detail

00:30

how to upgrade from an older version

00:32

and demonstrates how the new features work and what they produce

00:34

The prerequisite for upgrading from an old version is

00:36

that you have successfully installed version 2.2.1 or earlier

00:39

and that it was installed manually by cloning with the git tool

00:42

If you have never installed any version

00:44

you can also follow this video

00:46

and install the latest version directly

00:48

First, go to your FaceFusion installation directory

00:50

For example, my installation directory is D:\AI\facefusion

00:54

Type CMD in the address bar to open a command window

00:58

Enter the first command: git pull

01:01

A reminder here for friends in Mainland China:

01:03

if you can get around the firewall, turn on your proxy before updating

01:07

Once the program files are updated successfully

01:08

enter the second command to activate the virtual environment you created

01:12

Then enter the third command

01:14

to update the installed dependency packages

01:17

If no errors occur while updating the dependencies

01:19

the program upgrade is complete

01:21

Finally, let's run the program

01:23

to verify that the upgrade succeeded

01:25

Enter the run command

01:26

and add this skip-download parameter

01:29

Enter the local address in your browser

01:32

The main interface opens successfully

01:34

and the version here shows 2.3.0

01:37

This option is the new voice-driven lip-sync feature

01:41

Scroll to the bottom

01:42

and you can see the new face recognition model, YOLOface

01:46

The whole interface looks

01:47

as if the upgrade has succeeded

01:49

But look closely:

01:51

the options here are wrong

01:52

there is no GPU acceleration

01:55

So if you have an NVIDIA GPU

01:57

there is one final upgrade step

02:00

Switch back to the command window

02:02

Press Ctrl+C to stop the program

02:04

First enter: pip list

02:06

to check which CUDA version the original installation used

02:09

You only need to look at the torch line

02:11

Check the characters after the plus sign

02:14

Here it shows cu118

02:16

which means my original installation used the CUDA 11.8 build of PyTorch

02:20

Once that's confirmed

02:21

enter this command: python install.py

02:24

The first installation option appears here

02:26

asking us to choose a CUDA version

02:28

We already confirmed CUDA is 11.8

02:31

so choose the last one

02:33

Then a second option appears

02:35

Again, choose

02:36

the same CUDA version as your previous installation

02:39

After selecting, press Enter

02:40

and the installer will automatically update the related dependency packages

02:43

Likewise, for this update

02:44

friends in Mainland China also need a proxy

02:48

The whole update process needs no intervention

02:50

and don't click inside the command window with the mouse

02:52

Just wait patiently until everything finishes updating

02:56

If no errors appear during the whole process

02:58

the update has succeeded

03:00

Enter the run command

03:02

refresh the web page

03:04

and now the CUDA option shows up here

03:07

The models added in the new version

03:08

are downloaded automatically on first use

03:10

For example, if you now select the voice-driven lip-sync feature

03:13

and the program can't find the relevant models locally

03:15

it will download them first

03:17

You can see the download progress in the command window

03:20

However, for friends in Mainland China

03:21

even with a proxy enabled

03:22

the download here is very slow

03:24

The solution is to download the models manually

03:27

Open the official project's model list page

03:30

The ones marked with red boxes are the models added in version 2.3

03:33

There are 5 in total

03:34

After downloading, cut and paste them into

03:37

the models directory under the FaceFusion installation directory

03:40

At this point the version upgrade and model update are complete

03:44

Re-run the program and you can use the new version

03:47

If typing the run command every time is too much trouble

03:50

here is a way to create a one-click run script

03:53

First, go to the FaceFusion installation directory

03:56

Create a new text file

03:58

then rename it

03:59

for example, 'One-Click Start'

04:02

Double-click to open it

04:03

copy this command and paste it in

04:06

Save and close

04:08

Finally, change the file extension from TXT to BAT

04:13

If you can't see the TXT extension on your system

04:16

click [View] here

04:18

and check [File name extensions] to make it visible

04:21

Now double-click One-Click Start

04:23

and FaceFusion runs directly

04:27

Next, let's look at the two new features

04:29

First, driving a video character's lip movements with audio

04:33

First check this option

04:35

then uncheck the default face-swap option

04:37

because lip-sync and face swapping cannot be used at the same time

04:42

Drag and drop your prepared audio file here

04:45

I'll test with this 8-second audio clip

04:57

Then drag the video in

05:00

Check CUDA acceleration

05:03

Adjust the thread count

05:05

This parameter matters

05:06

it directly affects video processing speed

05:09

However, the program has a small bug here

05:11

sometimes typing a value directly has no effect

05:13

It's better to use the up/down buttons to adjust it

05:16

Generally the maximum can be about twice your VRAM

05:18

for example, with 8GB of VRAM you can go up to 16

05:21

The parameters below basically don't need changing

05:23

the defaults are the best settings

05:26

Then the parameters on the right

05:27

You can see the default face recognition model here is YOLO

05:31

This model was introduced in the deepfacelive episode

05:34

Its full name is You Only Look Once

05:36

made by Google

05:37

Both its recognition speed and its accuracy are excellent

05:40

It's the recommended choice for the voice-driven lip-sync feature

05:44

All other parameters can stay at their defaults

05:47

However, in the preview you can clearly see

05:49

that after driving the lips with audio

05:51

the mouth area becomes very blurry

05:53

so it's recommended to add face enhancement as well

05:56

Let's run the replacement and see the result

06:10

The lips basically match

06:11

though the teeth don't look great

06:14

The audio here can also be a song

06:17

After fusion, this is the result

06:30

From my testing of this feature

06:31

English audio works noticeably better than Chinese

06:39

If the video is a medium shot

06:40

the lip movements after replacement

06:42

look more natural than in a facial close-up

06:51

Finally, some details on using the new YOLO model

06:54

when swapping faces

06:56

The settings on the left are the same as in previous versions

06:58

nothing has changed

07:00

Add face enhancement

07:01

set CUDA acceleration

07:03

remove the default CPU option

07:06

and increase the thread count to speed up face swapping

07:09

This parameter on the right

07:10

the face selection mode, has changed slightly

07:14

If the video frame

07:15

contains two or more faces at the same time

07:18

and you only want to replace one of them

07:20

it's recommended to use [one] mode to specify it

07:22

With the default YOLO model

07:24

when multiple faces appear in the frame

07:26

using the default first option here, reference mode,

07:29

specifying the face to replace sometimes doesn't work

07:32

Let's run the replacement and see the result

07:35

In replacement speed, the new model

07:36

is slightly faster than the previous version's RetinaFace

07:39

Also, VRAM utilization during replacement is higher

07:43

With 12 threads set

07:44

8GB of VRAM is nearly maxed out

07:47

Replacement complete

08:22

When a closed mask is not used

08:24

the overall result of the YOLO model

08:26

is better than the old RetinaFace

08:29

mainly because the new version has many face recognition optimizations

08:33

We can turn on the debug option

08:35

to see these changes

08:37

including the addition of 5-landmark face recognition

08:39

previously it was 68 landmarks

08:42

Although more landmarks

08:43

mean greater precision on frontal faces

08:46

for head-down or side-profile faces

08:47

or faces in poor lighting

08:49

they are prone to errors

08:51

This release also adds age prediction and gender prediction for faces

08:56

However, when the new YOLO is paired with a closed mask

08:59

overall performance seems to drop

09:01

So based on the video content you want to replace

09:04

choose flexibly among the face recognition models

09:06

Here are recommended model and mask combinations for several scenarios

09:10

If you use the voice-driven video feature

09:12

use YOLO plus face enhancement for face recognition

09:16

For face swapping

09:17

if the faces in the video are unobstructed

09:19

YOLO plus a box mask is recommended

09:22

If the faces are heavily obstructed

09:25

the old RetinaFace plus a closed mask is recommended

09:28

If the faces are only slightly obstructed

09:30

but there are many head-down and side-profile shots

09:33

YOLO plus a box mask is recommended

09:36

These recommendations are based on testing the current version

09:38

and may change in future versions

09:40

In practice, for complex facial shots

09:43

you still need multiple attempts to get the best result

09:45

When some facial shots exceed what 2D face recognition

09:49

and the swapping model can handle

09:51

for example, in this video

09:52

when the head turns close to 90 degrees

09:54

the model can no longer cope

09:56

No matter how you adjust the parameters

09:58

the redrawn face is severely deformed

10:01

and may even have multiple eyes

10:04

Finally, please listen to a piece of music

10:06

This video has many head-down and side-profile shots

10:09

but the angles don't exceed what the model can handle

10:11

After replacement with the new model

10:13

see if you can spot any flaws

10:16

OK, that's all for today

10:17

Thanks for watching, see you next time

06:00

Your braid is so long

06:02

Your eyes are so bright

06:04

My heart is all aflutter

06:06

My brain is short of oxygen

06:35

fusion is a face swapper and enhancer of the next generation

06:46

face fusion is the best face swapper out there