Build a financial search application with Cohere multilingual embedding models in Amazon Bedrock | Machine Learning Blog

2026-01-27 12:37:04


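Key takeaways: Enterprises face large volumes of unstructured data, for which traditional keyword or synonym matching is no longer effective. Text embeddings from Cohere's multilingual embedding model make it possible to understand and analyze this data more effectively. This lets financial analysts extract relevant information quickly, reduce errors, and work more efficiently. This post shows how to use the Cohere embedding model in Amazon Bedrock to build a multilingual financial news search application.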

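The challenge of enterprise data

Enterprises have access to massive amounts of data, much of which is hard to discover because it is unstructured. Conventional approaches to analyzing unstructured data rely only on keyword or synonym matching. They don't capture the full context of a document, which makes them less effective for unstructured data.

In contrast, text embeddings use machine learning (ML) to capture the meaning of unstructured data. Embeddings are generated by representational language models that translate text into numerical vectors and encode the contextual information in a document. This enables applications such as semantic search, Retrieval Augmented Generation (RAG), topic modeling, and text classification.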
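To make the idea of "meaning as a vector" concrete, the following is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the sentences are hypothetical values chosen for illustration; they are not output from any real embedding model, which would emit far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (assumption): similar meanings get nearby vectors.
toy_embeddings = {
    "quarterly earnings beat expectations": [0.9, 0.1, 0.2],
    "profits exceeded analyst forecasts": [0.8, 0.2, 0.3],
    "new stadium opens downtown": [0.1, 0.9, 0.1],
}

query_vec = toy_embeddings["quarterly earnings beat expectations"]
scores = {
    text: cosine_similarity(query_vec, vec)
    for text, vec in toy_embeddings.items()
}

# Print the sentences from most to least similar to the query sentence.
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {text}")
```

The two finance sentences share no keywords, yet their toy vectors point in nearly the same direction, which is the property semantic search relies on.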

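In the financial services industry, for example, applications include extracting insights from earnings reports, searching for information in financial statements, and analyzing sentiment about stocks and markets in financial news. Text embeddings let industry professionals extract information from documents quickly, reduce errors, and improve their performance.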

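Cohere's multilingual embedding model

Cohere is a leading enterprise AI platform that builds world-class large language models (LLMs) and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. It provides an easy-to-use interface and robust security and privacy controls.

Cohere's multilingual embedding model generates vector representations of documents in more than 100 languages and is available as an API through Amazon Bedrock. This removes the need to manage infrastructure and helps keep sensitive information secure and protected.

The model groups texts with similar meanings by assigning them positions that are close to each other in a semantic vector space. Developers can therefore process text in multiple languages without switching between models, which makes multilingual applications more efficient and better performing.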

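| Feature | Description |
| --- | --- |
| Focus on document quality | Cohere's model measures not only the similarity between documents but also their quality. |
| Better retrieval for RAG applications | Cohere's embedding model performs strongly in retrieval systems. |
| Cost-efficient data compression | Cohere uses a special compression-aware training method, which yields substantial cost savings for vector databases. |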
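Because the model is exposed through Amazon Bedrock as an API, it can also be called without a dedicated SDK. The following is a rough sketch, assuming the Bedrock runtime `invoke_model` call and a JSON request body with `texts` and `input_type` fields; the helper names `build_embed_body` and `embed` are hypothetical, and the request schema should be checked against the Bedrock documentation before use.

```python
import json

# Model ID as used later in this post.
MODEL_ID = "cohere.embed-multilingual-v3"


def build_embed_body(texts, input_type):
    """Serialize the request body (assumed schema: texts + input_type)."""
    return json.dumps({"texts": texts, "input_type": input_type})


def embed(bedrock_runtime, texts, input_type="search_document"):
    """Embed a list of texts through a Bedrock runtime client.

    `bedrock_runtime` is expected to behave like
    boto3.client("bedrock-runtime"); it is passed in so that the function
    can be exercised without AWS credentials.
    """
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=build_embed_body(texts, input_type),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing JSON with an "embeddings" list.
    return json.loads(response["body"].read())["embeddings"]
```

With credentials and model access in place, a call would look like `embed(boto3.client("bedrock-runtime"), ["some headline"])`. The rest of this post uses the `cohere_aws` client instead, which wraps this kind of call.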

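Use case examples

Text embeddings turn unstructured data into a structured form, so you can compare, analyze, and extract insights from it objectively. The following are example use cases that Cohere's embedding model supports: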

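- Semantic search: Combined with a vector database, enables powerful search applications that return highly relevant results based on the meaning of the search phrase.
- Search engine for a larger system: Finds and retrieves the most relevant information from connected enterprise data sources for RAG systems.
- Text classification: Supports intent recognition, sentiment analysis, and advanced document analysis.
- Topic modeling: Turns a collection of documents into distinct clusters to uncover emerging topics and trends.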

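Enhanced search systems with Rerank

How do you introduce modern semantic search in an enterprise that already has a conventional keyword search system? For systems that have long been part of a company's information architecture, a complete migration to an embeddings-based approach is often not feasible.

Cohere's Rerank endpoint is designed to bridge this gap. It acts as the second stage of a search flow and ranks the relevant documents for a user's query. Enterprises can keep an existing keyword (or even semantic) system for first-stage retrieval and use the Rerank endpoint to improve the quality of the search results.

Rerank offers a fast and simple way to improve search results by introducing semantic search into an existing stack with a single line of code. The endpoint also supports multiple languages.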
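The two-stage flow described above can be sketched as follows. This is an illustration only: `keyword_search` stands in for an existing first-stage system, and `toy_relevance` is a hypothetical placeholder (Jaccard word overlap) for the score that a hosted reranking model would return. Neither reflects Cohere's actual API.

```python
def keyword_search(query, documents, top_n=3):
    """First stage: rank documents by the number of query terms they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    scored = [(hits, doc) for hits, doc in scored if hits > 0]
    # Stable sort: documents with equal hit counts keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_relevance(query, doc):
    """Placeholder scorer (assumption): Jaccard overlap of word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


def rerank(query, candidates, score_fn):
    """Second stage: reorder first-stage candidates by a relevance score."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


documents = [
    "Newsletter on tax reform credit markets and pension accounting",
    "Tax credit deadline extended",
    "Stadium opens downtown",
]
query = "tax credit"

candidates = keyword_search(query, documents)
reranked = rerank(query, candidates, toy_relevance)
print(reranked[0])
```

The first stage treats the two tax-related documents as equally good because both contain the two query terms. The second stage moves the shorter, more focused headline to the top without replacing the first-stage system.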

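Solution overview

Financial analysts need to digest a lot of content, such as financial publications and news media, to stay informed. According to the Association for Financial Professionals (AFP), financial analysts spend 75% of their time gathering data or administering the process instead of performing value-added analysis. Finding the answer to a question across a variety of sources and documents is time-consuming and tedious. With the Cohere embedding model, analysts can quickly search article titles in multiple languages to find the articles most relevant to a specific query, which saves considerable time and effort.

In this example, we show how Cohere's Embed model searches and queries financial news in different languages within a single pipeline. We then show how adding Rerank to the embeddings retrieval, or to a legacy lexical search, can further improve the results.

The supporting notebook is available on GitHub.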

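Enable model access through Amazon Bedrock

Amazon Bedrock users need to request access to models before they are available for use. To request access to additional models, choose Model access on the Amazon Bedrock console. For more information, see Model access. This walkthrough requires access to the Cohere Embed Multilingual model.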

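Install packages and import modules

First, we install the necessary packages and import the modules used in this example:

```python
!pip install --upgrade cohere-aws hnswlib translate

import pandas as pd
import cohere_aws
import hnswlib
import os
import re
import boto3
```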

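Import documents

We use a dataset (MultiFIN) containing real-world article headlines in 15 languages. It is an open source dataset curated for financial natural language processing (NLP) and is available in a GitHub repository.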


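For this example, we created a CSV file containing the MultiFIN data plus a column of translations. We don't feed this column to the model; we use it when printing results so that readers who don't speak Danish or Spanish can follow along. We point to that CSV to create our dataframe:

```python
url = "https://raw.githubusercontent.com/cohere-ai/cohere-aws/main/notebooks/bedrock/multiFIN_train.csv"
df = pd.read_csv(url)

# Inspect the dataset
df.head(5)
```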

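Select a list of documents to query

MultiFIN has more than 6,000 records in 15 different languages. For this example, we focus on three languages: English, Spanish, and Danish. We also sort the headlines by length and pick the longest ones.

Because we are picking the longest articles, we need to make sure that the length is not caused by repeated sequences. We clean up one such example.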

```python
df['text'].iloc[2215]
```

```python
'El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de Desarrollo Sostenible'
```
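The repeated phrase in this headline can be collapsed with a backreference-based regular expression. The following is a minimal sketch, assuming a pattern that captures a short run of words and replaces everything up to the last repetition of that run with a single copy. The pattern is a reconstruction and the sample string is rebuilt from the headline shown above, so treat both as illustrative.

```python
import re

# Assumed pattern: group 1 captures one or more "word, 1-2 chars, word" runs;
# ".+\1" then extends the match to the last later occurrence of the same run.
DUPLICATE_RUN = re.compile(r"((\b\w+\b.{1,2}\w+\b)+).+\1", flags=re.I)


def remove_duplicates(text):
    """Replace a repeated run of words with a single occurrence."""
    return DUPLICATE_RUN.sub(r"\1", text)


phrase = " comprometidas con los Objetivos de Desarrollo Sostenible"
headline = "El 86% de las empresas españolas" + phrase * 4

cleaned = remove_duplicates(headline)
print(cleaned)
```

The pattern is greedy, so it can also shorten headlines that legitimately repeat a pair of words. For a production pipeline it would be worth spot-checking the rows it changes.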

```python
# Ensure there is no duplicated text in the headers
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)

# Keep only the selected languages
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)]

# Pick the 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]

# Language distribution
top_80_df['lang'].value_counts()
```

Our list of documents is reasonably well distributed across the three languages:

```python
lang
Spanish    33
English    29
Danish     18
Name: count, dtype: int64
```

The following is the longest article headline in our dataset:

```python
top_80_df['text'].iloc[0]
```

```python
'CFOdirect Resultater fra PwCs Employee Engagement Landscape Survey herunder hvordan man skaber mere engagement blandt medarbejdere Læs desuden om de regnskabsmæssige konsekvenser for indkomstskat ifbm Brexit'
```

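Embed and index documents

Now we want to embed our documents and store the embeddings. The embeddings are large vectors that encapsulate the semantic meaning of each document. Specifically, we use Cohere's embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions.

When a query is passed, we embed the query as well and use the hnswlib library to find its nearest neighbors.

It takes only a few lines of code to establish a Cohere client, embed the documents, and create the search index. We also keep track of each document's language and translation to enrich the display of the results.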

```python
# Establish the Cohere client
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

# Embed the documents
docs = top_80_df['text'].tolist()
docs_lang = top_80_df['lang'].tolist()
translated_docs = top_80_df['translated_text'].tolist()  # for reference when returning non-English results
doc_embs = co.embed(texts=docs, model_id=model_id, input_type='search_document').embeddings

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))
```
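To clarify what the index is asked to do, the following sketch performs the same kind of lookup exhaustively: it scores every stored vector against the query by inner product and keeps the top `k`. The function `brute_force_knn` and the two-dimensional vectors are hypothetical. An approximate index such as the one above aims to return similar neighbors without scanning every vector.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_knn(query_vec, doc_vecs, k=3):
    """Return (doc_ids, scores) for the k vectors with the largest inner product."""
    scored = sorted(
        ((inner_product(query_vec, vec), doc_id) for doc_id, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    top = scored[:k]
    return [doc_id for _, doc_id in top], [score for score, _ in top]


# Toy 2-dimensional vectors (assumption); the post's embeddings have 1,024 dimensions.
toy_doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
ids, scores = brute_force_knn([0.8, 0.6], toy_doc_vecs, k=2)
print(ids, scores)
```

An exhaustive scan is perfectly adequate for 80 documents. The index becomes worthwhile as the collection grows.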

Build a retrieval system

Next, we build a function that takes a query as input, embeds it, and finds the three headlines most closely related to it:

```python
# Retrieve the 3 documents closest to a query
def retrieval(query):
    # Embed the query and retrieve results
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0]  # retrieve the 3 nearest neighbors

    # Print and append results
    print(f"QUERY: {query.upper()} \n")
    retrieved_docs, translated_retrieved_docs = [], []

    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])

        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} \n")
        else:
            print()

    print("END OF RESULTS \n\n")
    return retrieved_docs, translated_retrieved_docs
```

Query the retrieval system

Let's try a few different queries, starting with English:

```python
queries = [
    "Are businesses meeting sustainability goals?",
    "Can data science help meet sustainability goals?",
]

for query in queries:
    retrieval(query)
```

The results are as follows:

```plaintext
QUERY: ARE BUSINESSES MEETING SUSTAINABILITY GOALS?

ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals improves but has a long way to go to meet and drive targets

ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but businesses remain on starting blocks for integration and progress

ORIGINAL (Spanish): Integrar los criterios ESG y el propósito en la estrategia principal reto de los Consejos de las empresas españolas en el mundo post-COVID
TRANSLATION: Integrate ESG criteria and purpose into the main challenge strategy of the Boards of Spanish companies in the post-COVID world

END OF RESULTS

QUERY: CAN DATA SCIENCE HELP MEET SUSTAINABILITY GOALS?

ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse gas emissions boost global GDP by up to 38 million jobs by 2030

ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals improves but has a long way to go to meet and drive targets

ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but businesses remain on starting blocks for integration and progress

END OF RESULTS
```

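Notice the following:

- We asked related but slightly different questions, and the model surfaced the most relevant results at the top for each.
- The model performs semantic search rather than keyword-based search. Even though we used the term "data science" instead of "AI," the model understood what was being asked and returned the most relevant result first.

Next, let's try a query in Danish: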

```python
query = "Hvor kan jeg finde den seneste danske boligplan?"  # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)
```

```plaintext
QUERY: HVOR KAN JEG FINDE DEN SENESTE DANSKE BOLIGPLAN?

ORIGINAL (Danish): Nyt fra CFOdirect Ny PP&E-guide FAQs om den nye leasingstandard podcast om udfordringerne ved implementering af leasingstandarden og meget mere
TRANSLATION: New from CFOdirect New PP&E guide FAQs on the new leasing standard podcast on the challenges of implementing the leasing standard and much more

ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån udskudt frist for lønsumsafgift førtidig udbetaling af skattekredit og loft på indestående på skattekontoen
TRANSLATION: Legislative proposal presented on interest-free loans deferred payroll tax deadline early payment of tax credit and ceiling on deposits in the tax account

ORIGINAL (Danish): Nyt fra CFOdirect Shareholder-spørgsmål til ledelsen SEC cybersikkerhedsguide den amerikanske skattereform og meget mere
TRANSLATION: New from CFOdirect Shareholder questions for management the SEC cybersecurity guide US tax reform and more

END OF RESULTS
```

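In the preceding example, the English acronym "PP&E" stands for "property, plant, and equipment," and the model was able to connect it to our query.

Although all the returned results here are in Danish, the model can return a document in another language if its semantic meaning is closer to the query. We have complete flexibility: with a few lines of code, we can specify whether the model should look only at documents in the language of the query or at all documents.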
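One way to restrict a search to a single language is to filter the candidate documents by their language tag before ranking them. The following is a minimal sketch using an exhaustive scan over toy vectors. The function `search`, its `lang` parameter, and the sample data are hypothetical and only illustrate the idea; they are not the notebook's implementation.

```python
def search(query_vec, doc_vecs, doc_langs, k=3, lang=None):
    """Return ids of the top-k documents by inner product.

    If `lang` is given, only documents tagged with that language are considered.
    """
    candidates = [
        doc_id for doc_id, doc_lang in enumerate(doc_langs)
        if lang is None or doc_lang == lang
    ]
    candidates.sort(
        key=lambda doc_id: sum(q * d for q, d in zip(query_vec, doc_vecs[doc_id])),
        reverse=True,
    )
    return candidates[:k]


# Toy vectors and language tags (assumption).
toy_vecs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.7, 0.2]]
toy_langs = ["English", "Danish", "Danish", "Spanish"]
query_vec = [1.0, 0.0]

all_languages = search(query_vec, toy_vecs, toy_langs, k=3)
danish_only = search(query_vec, toy_vecs, toy_langs, k=3, lang="Danish")
print(all_languages, danish_only)
```

Searching across all languages returns the English document first because it is closest to the query. Restricting the search to Danish returns only the two Danish documents.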

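Improve results with Cohere Rerank

Embeddings are very powerful. However, we now look at how to refine the results even further with Cohere's Rerank endpoint, which has been trained to …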
