In this chapter we build an example on top of Byzer-LLM to help you understand how to use the Byzer-LLM engine and the value it brings, and to help enterprises quickly validate the results and move them into production.

Virtual Foreign Language Teacher

Building a virtual foreign language teacher involves three large models:

  1. Speech-to-text
  2. A large language model
  3. Text-to-speech

For these we use, respectively:

  1. faster-whisper
  2. LLama 13B
  3. Bark

Before continuing with the steps below, please make sure the environment has been set up according to the official documentation.

Deploying Faster Whisper

Model download: https://huggingface.co/guillaumekln/faster-whisper-large-v2. The model needs to be downloaded in advance onto the server where Ray is running.

  1. Because this model is optimized for speed, it depends on the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x. Please download them from https://developer.nvidia.com/cudnn and, following the corresponding installation steps, copy the required libraries into the symlink directory we created earlier. Alternatively, install them with:
conda install -y libcublas -c nvidia/label/cuda-11.8.0
conda install -y cudnn -c nvidia/label/cuda-11.8.0

  2. Byzer-LLM does not ship with this model's Python dependency by default, so it needs to be installed manually (see the sanity-check sketch after this list):
pip install faster-whisper
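
Before starting the model, you can optionally run a quick sanity check in Python to confirm that the CUDA libraries and the faster-whisper package are visible to the environment. This is only a sketch: the shared-library names assume CUDA 11.x / cuDNN 8.x and may differ on your machine.

import ctypes

# Assumed library names for CUDA 11.x / cuDNN 8.x; adjust to your installed versions.
for lib in ("libcublas.so.11", "libcudnn.so.8"):
    try:
        ctypes.CDLL(lib)
        print(f"{lib}: OK")
    except OSError as err:
        print(f"{lib}: NOT FOUND ({err})")

# The package installed above should now be importable.
import faster_whisper
print("faster-whisper imported OK")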

Finally, start the model in Byzer Notebook:

!byzerllm setup single;
!byzerllm setup "num_gpus=1";
-- !byzerllm setup "resource.master=0.01";

run command as LLM.`` where 
action="infer"
and pretrainedModelType="whisper"
and localModelDir="/home/byzerllm/models/faster-whisper-large-v2"
and udfName="voice_to_text"
and reconnect="false"
and modelTable="command";
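
Once the UDF is registered, you can optionally verify it from Python through the engine's /model/predict REST endpoint. A minimal sketch, assuming the engine listens on 127.0.0.1:9003 with owner "william" (the same values used by the UI code later in this article); the one-second silent waveform is only a placeholder, so expect an empty transcription until you send real audio:

import json
import numpy as np
import requests

rate = 16000
voice = np.zeros(rate, dtype=np.int16)  # placeholder: one second of silence
resp = requests.post("http://127.0.0.1:9003/model/predict", data={
    "sessionPerUser": "true",
    "sessionPerRequest": "true",
    "owner": "william",
    "dataType": "string",
    "sql": "select voice_to_text(array(feature)) as value",
    "data": json.dumps([{"rate": rate, "voice": voice.tolist()}]),
})
print(resp.status_code, resp.text[:300])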

Deploying LLama 13B

!byzerllm setup single;
!byzerllm setup "num_gpus=2";
-- !byzerllm setup "resources.master=0.001";

run command as LLM.`` where 
action="infer"
and pretrainedModelType="llama"
and localModelDir="/home/byzerllm/models/openbuddy-llama-13b-v5-fp16"
and reconnect="false"
and udfName="llama_13b_chat"
and modelTable="command";
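
As with the Whisper model, you can optionally send a test question to the llama_13b_chat UDF before wiring up the UI. A minimal sketch under the same assumptions about the engine address and owner; the request fields mirror the chat() function later in this article:

import json
import requests

payload = json.dumps([{
    "instruction": "Hello! Please introduce yourself in one sentence.",
    "temperature": 0.1,
    "max_length": 8000,
    "history": [],
}])
resp = requests.post("http://127.0.0.1:9003/model/predict", data={
    "sessionPerUser": "true",
    "sessionPerRequest": "true",
    "owner": "william",
    "dataType": "string",
    "sql": "select llama_13b_chat(array(feature)) as value",
    "data": payload,
})
t = json.loads(resp.text)
print(json.loads(t[0]["value"][0])[0]["predict"])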

Deploying Bark

Please download the model from https://huggingface.co/suno/bark. It needs to be downloaded in advance onto the server where Ray is running.

After the model is downloaded, change into the model directory and run the following command:

git clone https://huggingface.co/bert-base-multilingual-cased  pretrained_tokenizer

Note that this model will also download some audio codecs at runtime, so make sure the network is reachable.

!byzerllm setup single;
!byzerllm setup "num_gpus=1";
--!byzerllm setup "resource.master=0.01";
--!byzerllm setup "resource.worker_2=0";
!byzerllm setup "maxConcurrency=4";

run command as LLM.`` where 
action="infer"
and pretrainedModelType="bark"
and localModelDir="/home/byzerllm/models/bark"
and udfName="text_to_voice"
and reconnect="false"
and modelTable="command";
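
The text_to_voice UDF can be verified in the same way, writing the result to a wav file for listening. A minimal sketch under the same assumptions about the engine address and owner; Bark returns a 24 kHz waveform, which is also the sample rate used by the UI code below:

import json
import numpy as np
import requests
from scipy.io.wavfile import write as write_wav

resp = requests.post("http://127.0.0.1:9003/model/predict", data={
    "sessionPerUser": "true",
    "sessionPerRequest": "true",
    "owner": "william",
    "dataType": "string",
    "sql": "select text_to_voice(array(feature)) as value",
    "data": json.dumps([{"instruction": "Hello, nice to meet you."}]),
})
t = json.loads(resp.text)
audio = np.array(json.loads(t[0]["value"][0])[0]["predict"])
write_wav("bark_test.wav", 24_000, audio.astype(np.float32))
print("wrote bark_test.wav:", audio.shape[0] / 24_000, "seconds")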

Note

If your GPUs are limited, for example you only have a single card but it has plenty of memory, you can use num_gpus to control how much GPU each model is allowed to occupy.

!byzerllm setup single;
!byzerllm setup "num_gpus=0.3";

Here, num_gpus=0.3 means the model will only take up 0.3 of a GPU. If you set this parameter when deploying each model, they will all be deployed on the same GPU. However, you need to make sure yourself that this GPU's memory is actually large enough to run multiple models at the same time.

Developing a UI

Here we use Gradio to build a UI. Suppose the file is named digital_teacher.py:

import concurrent.futures
import json
import re
import time
from base64 import b64encode
from typing import List, Optional, Tuple

import gradio as gr
import numpy as np
import requests

# Call a deployed UDF through the Byzer engine's /model/predict REST endpoint.
def request(sql: str, json_data: str) -> str:
    url = 'http://127.0.0.1:9003/model/predict'
    data = {
        'sessionPerUser': 'true',
        'sessionPerRequest': 'true',
        'owner': 'william',
        'dataType': 'string',
        'sql': sql,
        'data': json_data
    }
    response = requests.post(url, data=data)
    if response.status_code != 200:
        raise Exception(response.text)
    return response.text


# Transcribe audio (sample rate + waveform) with the voice_to_text (faster-whisper) UDF.
def voice_to_text(rate: int, t: np.ndarray) -> str:
    json_data = json.dumps([
        {"rate": rate, "voice": t.tolist()}
    ])

    response = request('''
     select voice_to_text(array(feature)) as value
    ''', json_data)

    t = json.loads(response)
    t2 = json.loads(t[0]["value"][0])
    return t2[0]["predict"]

# Synthesize speech for a piece of text with the text_to_voice (Bark) UDF.
def text_to_voice(sequence) -> np.ndarray:
    json_data = json.dumps([
        {"instruction": sequence}
    ])
    data = request('''
                 select text_to_voice(array(feature)) as value
                ''', json_data)
    t = json.loads(data)
    t2 = json.loads(t[0]["value"][0])
    return np.array(t2[0]["predict"])

# Generic helper: run fn over a list of argument tuples in parallel (not used below; kept for reference).
def execute_parallel(fn, a, num_workers=3):
    import concurrent.futures

    def process_chunk(chunk):
        return [fn(*m) for m in chunk]

    def split_list(lst, n):
        k, m = divmod(len(lst), n)
        return [lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

    # Split the list into three chunks
    chunks = split_list(a,num_workers)

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        results = list(executor.map(lambda x: process_chunk(x), chunks))
    return results

# Ask the llama_13b_chat UDF for a reply, given the user message and the chat history.
def chat(s: str, history: List[Tuple[str, str]]) -> str:
    newhis = [{"role": item[0], "content": item[1]} for item in history]
    template = """You are a helpful assistant. Think it over and answer the user question correctly. 
    User: {context}
    Please answer based on the content above: 
    {query}
    Assistant:"""
    json_data = json.dumps([
        {"instruction": s, "k": 1, "temperature": 0.1, "prompt": template, 'history': newhis, 'max_length': 8000}
    ])
    response = request('''
     select llama_13b_chat(array(feature)) as value
    ''', json_data)
    t = json.loads(response)
    t2 = json.loads(t[0]["value"][0])
    return t2[0]["predict"]


class UserState:
    def __init__(self, history: Optional[List[Tuple[str, str]]] = None, output_state: str = "") -> None:
        # Avoid a shared mutable default: each UserState gets its own history list.
        self.history = history if history is not None else []
        self.output_state = output_state

    def add_chat(self, role, content):
        self.history.append((role, content))
        if len(self.history) > 10:
            self.history = self.history[len(self.history) - 10:]

    def add_output(self, message):
        self.output_state = f"{self.output_state}\n\n{message}"

    def clear(self):
        self.history = []
        self.output_state = ""


# One conversation turn: record the user message, query the LLM, record the reply.
def talk(t: str, state: UserState) -> str:
    state.add_chat('user', t)
    s = chat(t, history=state.history)
    state.add_chat('assistant', s)
    return s


def html_audio_autoplay(audio_bytes: bytes) -> str:
    """Creates an HTML object that autoplays audio in the Gradio app.
    Args:
        audio_bytes (bytes): audio bytes
    Returns:
        str: html snippet that autoplays the audio
    """
    b64 = b64encode(audio_bytes).decode()
    html = f"""
    <audio controls autoplay>
    <source src="data:audio/wav;base64,{b64}" type="audio/wav">
    </audio>
    """
    return html


def convert_to_16_bit_wav(data):
    # Based on: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.write.html
    warning = "Trying to convert audio automatically from {} to 16-bit int format."
    if data.dtype in [np.float64, np.float32, np.float16]:
        data = data / np.abs(data).max()
        data = data * 32767
        data = data.astype(np.int16)
    elif data.dtype == np.int32:
        data = data / 65538
        data = data.astype(np.int16)
    elif data.dtype == np.int16:
        pass
    elif data.dtype == np.uint16:
        data = data - 32768
        data = data.astype(np.int16)
    elif data.dtype == np.uint8:
        data = data * 257 - 32768
        data = data.astype(np.int16)
    else:
        raise ValueError(
            "Audio data cannot be converted automatically from "
            f"{data.dtype} to 16-bit int format."
        )
    return data


# Gradio callback: audio in -> transcription -> LLM reply -> synthesized speech (autoplaying HTML audio).
def main_note(audio, text, state: UserState):
    if audio is None:
        return "", state.output_state, state

    if len(state.history) == 0:
        state.history.append(["system", "You are a helpful assistant."])

    rate, y = audio
    print("voice to text:")

    t = voice_to_text(rate, y)

    if len(t.strip()) == 0:
        return "", state.output_state, state

    if t.strip() == "重新开始":
        state.clear()
        return "", state.output_state, state

    print(t)
    print("talk to llama30b:")
    s = talk(t + " " + text, state)
    print("llama30b:", s)
    print("text to voice")
    message = f"你: {t}\n\n外教: {s}\n"
    sequences = re.split(r'[.!?。!?]', s)
    chunks = []
    start_time = time.time()
    for s in sequences:
        if len(s.strip()) == 0:
            continue
        chunks.append(s.strip())
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        m = list(executor.map(lambda x: text_to_voice(x), chunks))
    m = np.concatenate(m)
    end_time = time.time()
    print(f"total time: {end_time - start_time} seconds")

    from scipy.io.wavfile import write as write_wav
    import io
    wav_file = io.BytesIO()
    write_wav(wav_file, 24_000, convert_to_16_bit_wav(m))
    wav_file.seek(0)
    html = html_audio_autoplay(wav_file.getvalue())

    state.add_output(message)
    return html, state.output_state, state

def main():
    state = gr.State(UserState())
    demo = gr.Interface(
        fn=main_note,
        inputs=[gr.Audio(source="microphone"), gr.TextArea(lines=30, placeholder="message"), state],
        outputs=["html", gr.TextArea(lines=30, placeholder="message"), state],
        examples=[
        ],
        interpretation=None,
        allow_flagging="never",
    )
    demo.launch(server_name="127.0.0.1", server_port=7861, debug=True)

if __name__ == "__main__":
    main()

Then run:

python digital_teacher.py

and open http://127.0.0.1:7861 (the address configured in main()) in your browser.
