What is transfer learning theory, and how do we apply it in practice?

1. Transfer learning theory

Pretrained model:
In general, a pretrained model is a large model with a complex network structure and a very large number of parameters, trained on a sufficiently large dataset. In NLP, pretrained models are usually language models, because language-model training is unsupervised and can therefore draw on large-scale corpora, and because language models underpin many typical NLP tasks such as machine translation, text generation, and reading comprehension. Common pretrained models include BERT, GPT, RoBERTa, and Transformer-XL.
Fine-tuning: starting from a given pretrained model, we modify some of its parameters or attach new output structures to it, then train on a small dataset so that the whole model adapts better to a specific task.
Fine-tuning script: the code file that implements the fine-tuning process. Such a script should include the call to the pretrained model, the selection of the parameters to fine-tune, and the changes to the fine-tuned output structure. Because fine-tuning is itself a training process, it also requires hyperparameter settings and the choice of a loss function and an optimizer, so a fine-tuning script usually covers the entire transfer learning workflow.
A note on fine-tuning scripts: in principle, developers should write their own script for each task type. However, because the NLP task types currently studied (classification, extraction, generation) and their corresponding fine-tuned output structures are limited, and because some fine-tuning recipes have already been validated on many datasets, existing standardized scripts can also be used.
Two ways to transfer: the first is to use the pretrained model directly for the same kind of task, without adjusting its parameters or structure; such models work out of the box. This generally only applies to broadly applicable tasks, for example the pretrained word-vector models in the fastText toolkit. In addition, to achieve this out-of-the-box behavior, many model developers save the parts of a model's structure as separate pretrained models and provide corresponding loading methods for specific goals.
The more mainstream approach to transfer learning is to exploit the pretrained model's ability to abstract features and then fine-tune it, training and updating a small portion of the parameters so that it adapts to a different task. This approach requires a small amount of labeled data for supervised learning.
A note on the two approaches: using a pretrained model directly has already been covered in the fastText word-vector transfer material. The transfer learning practice below focuses on transfer via fine-tuning; a minimal sketch of the idea is shown right after this note.
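The following is a minimal sketch of the fine-tuning idea, added here for illustration and not part of the original tutorial. It assumes the Hugging Face transformers package is installed, freezes a pretrained bert-base-chinese encoder, and trains only a small classification head on top; the class name, example sentence, and number of labels are placeholders.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentimentClassifier(nn.Module):
    # A frozen pretrained encoder plus a small trainable classification head.
    def __init__(self, model_name="bert-base-chinese", num_labels=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        for p in self.encoder.parameters():
            p.requires_grad = False   # freeze the backbone; only the head would be trained
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids):
        sequence_output = self.encoder(input_ids)[0]   # (batch, seq_len, hidden)
        cls_vector = sequence_output[:, 0]             # representation at the [CLS] position
        return self.head(cls_vector)                   # (batch, num_labels)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = SentimentClassifier()
input_ids = torch.tensor([tokenizer.encode("人生该如何起头")])
print(model(input_ids).shape)   # torch.Size([1, 2])

During training, only the parameters of the small head (and optionally the top encoder layers) receive gradient updates, which is what "updating a small portion of the parameters" refers to above.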
2. Standard NLP datasets: the GLUE benchmark

GLUE was jointly released by New York University, the University of Washington, and Google. It covers a range of NLP task types and, as of January 2020, includes 11 sub-task datasets; it has become a standard yardstick for measuring progress in NLP research.
The GLUE benchmark contains the following datasets: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, SNLI, QNLI, RTE, WNLI, and the diagnostics dataset (not yet finalized by the organizers).

How to download the GLUE benchmark:
Download script code:
Run the script to download all datasets:
# Assuming you have copied the code above into download_glue_data.py,
# running this Python script will create a glue folder in the same directory.
python download_glue_data.py
Output:
Downloading and extracting CoLA...
    Completed!
Downloading and extracting SST...
    Completed!
Processing MRPC...
    Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
    Completed!
Downloading and extracting QQP...
    Completed!
Downloading and extracting STS...
    Completed!
Downloading and extracting MNLI...
    Completed!
Downloading and extracting SNLI...
    Completed!
Downloading and extracting QNLI...
    Completed!
Downloading and extracting RTE...
    Completed!
Downloading and extracting WNLI...
    Completed!
Downloading and extracting diagnostic...
    Completed!

File formats and task types of the GLUE sub-datasets
CoLA dataset file structure
- CoLA/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, which are the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
...
gj04	1		She coughed herself awake as the leaf landed on her nose.
gj04	1		The worm wriggled onto the carpet.
gj04	1		The chocolate melted onto the carpet.
gj04	0	*	The ball wriggled itself loose.
gj04	1		Bill wriggled himself loose.
bc01	1		The sinking of the ship to collect the insurance was very devious.
bc01	1		The ship's sinking was very devious.
bc01	0	*	The ship's sinking to collect the insurance was very devious.
bc01	1		The testing of such drugs on oneself is too risky.
bc01	0	*	This drug's testing on oneself is too risky.
...
train.tsv format notes:
train.tsv has four columns. The first column (e.g. gj04, bc01) identifies the publication each sentence comes from. The second column (0 or 1) marks whether the sentence is grammatically acceptable: 0 means unacceptable, 1 means acceptable. The third column is the author's original acceptability annotation, with the same meaning as the second column; '*' marks an unacceptable sentence. The fourth column is the sentence whose grammatical acceptability is being annotated.

test.tsv sample:
index	sentence
0	Bill whistled past the house.
1	The car honked its way down the road.
2	Bill pushed Harry off the sofa.
3	the kittens yawned awake and played.
4	I demand that the more John eats, the more he pay.
5	If John eats more, keep your mouth shut tighter, OK?
6	His expectations are always lower than mine are.
7	The sooner you call, the more carefully I will word the letter.
8	The more timid he feels, the more people he interviews without asking questions of.
9	Once Janet left, Fred became a lot crazier.
...
test.tsv format notes:
test.tsv has two columns: the first is the index of each example and the second is the sentence to be tested.

CoLA task type:
Binary classification. Evaluation metric: MCC (the Matthews correlation coefficient, a binary classification metric suited to heavily imbalanced label distributions).
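As a quick illustration of this metric (an added sketch, not from the original text), MCC can be computed with scikit-learn; the labels below are made-up placeholders.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 1, 1, 1, 0]   # heavily imbalanced toy labels
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
print(matthews_corrcoef(y_true, y_pred))   # a value in [-1, 1]; 1 means perfect prediction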
SST-2 dataset file structure

- SST-2/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, the training, validation, and test sets. train.tsv and dev.tsv share the same labeled format; test.tsv is unlabeled.

train.tsv sample:
sentence	label
hide new secretions from the parental units	0
contains no wit , only labored gags	0
that loves its characters and communicates something rather beautiful about human nature	1
remains utterly satisfied to remain the same throughout	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up	0
that 's far too tragic to merit such superficial treatment	0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop .	1
of saucy	1
a depressed fifteen-year-old 's suicidal poetry	0
...
train.tsv format notes:
train.tsv has two columns: the first is a review text carrying sentiment, and the second (0 or 1) marks whether the review is negative or positive; 0 means negative, 1 means positive.

test.tsv sample:
index	sentence
0	uneasy mishmash of styles and genres .
1	this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
2	by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
3	director rob marshall went out gunning to make a great one .
4	lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
5	a well-made and often lovely depiction of the mysteries of friendship .
6	none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .
7	although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious .
8	it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another .
9	this is junk food cinema at its greasiest .
...
test.tsv format notes:

test.tsv has two columns: the first is the index of each example and the second is the sentence to be tested.
SST-2 task type:
Binary classification. Evaluation metric: ACC (accuracy).

MRPC dataset file structure

- MRPC/
    - dev.tsv
    - test.tsv
    - train.tsv
    - dev_ids.tsv
    - msr_paraphrase_test.txt
    - msr_paraphrase_train.txt
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, the training, validation, and test sets. train.tsv and dev.tsv share the same labeled format; test.tsv is unlabeled.

train.tsv sample:
Quality#1ID#2ID#1String#2String1702876702977Amroziaccusedhisbrother,whomhecalled“thewitness”,ofdeliberatelydistortinghisevidence.Referringtohimasonly“thewitness”,Amroziaccusedhisbrotherofdeliberatelydistortinghisevidence.021087052108831YucaipaownedDominick’sbeforesellingthechaintoSafewayin1998for\(2.5billion.YucaipaboughtDominick'sin1995for\)693millionandsoldittoSafewayfor\(1.8billionin1998.113303811330521TheyhadpublishedanadvertisementontheInternetonJune10,offeringthecargoforsale,headded.OnJune10,theship'sownershadpublishedanadvertisementontheInternet,offeringtheexplosivesforsale.033446673344648Around0335GMT,Tabshareswereup19cents,or4.4%,atA\)4.56,havingearliersetarecordhighofA\(4.57.Tabsharesjumped20cents,or4.6%,tosetarecordclosinghighatA\)4.57.112368201236712Thestockrose\(2.11,orabout11percent,tocloseFridayat\)21.51ontheNewYorkStockExchange.PG&ECorp.sharesjumped\(1.63or8percentto\)21.03ontheNewYorkStockExchangeonFriday.1738533737951Revenueinthefirstquarteroftheyeardropped15percentfromthesameperiodayearearlier.WiththescandalhangingoverStewart‘scompany,revenuethefirstquarteroftheyeardropped15percentfromthesameperiodayearearlier.0264589264502TheNasdaqhadaweeklygainof17.27,or1.2percent,closingat1,520.15onFriday.Thetech-lacedNasdaqComposite.IXICrallied30.46points,or2.04percent,to1,520.15.1579975579810TheDVD-CCAthenappealedtothestateSupremeCourt.TheDVDCCAappealedthatdecisiontotheU.S.SupremeCourt….
train.tsv format notes:
train.tsv has five columns. The first (0 or 1) indicates whether the two sentences in a pair have the same meaning: 0 means different, 1 means the same. The second and third columns are the IDs of the two sentences, and the fourth and fifth columns are the sentence pair itself.

test.tsv sample:
index#1ID#2ID#1String#2String010898741089925PCCW’schiefoperatingofficer,MikeButcher,andAlexArena,thechieffinancialofficer,willreportdirectlytoMrSo.CurrentChiefOperatingOfficerMikeButcherandGroupChiefFinancialOfficerAlexArenawillreporttoSo.130194463019327Theworld‘stwolargestautomakerssaidtheirU.S.salesdeclinedmorethanpredictedlastmonthasalatesummersalesfrenzycausedmoreofanindustrybacklashthanexpected.DomesticsalesatbothGMandNo.2FordMotorCo.declinedmorethanpredictedasalatesummersalesfrenzypromptedalarger-than-expectedindustrybacklash.219456051945824AccordingtothefederalCentersforDiseaseControlandPrevention(news-websites),therewere19reportedcasesofmeaslesintheUnitedStatesin2002.TheCentersforDiseaseControlandPreventionsaidtherewere19reportedcasesofmeaslesintheUnitedStatesin2002.314304021430329AtropicalstormrapidlydevelopedintheGulfofMexicoSundayandwasexpectedtohitsomewherealongtheTexasorLouisianacoastsbyMondaynight.AtropicalstormrapidlydevelopedintheGulfofMexicoonSundayandcouldhavehurricane-forcewindswhenithitslandsomewherealongtheLouisianacoastMondaynight.433543813354396Thecompanydidn’tdetailthecostsofthereplacementandrepairs.Butcompanyofficialsexpectthecostsofthereplacementworktorunintothemillionsofdollars.513909951391183Thesettlingcompanieswouldalsoassigntheirpossibleclaimsagainsttheunderwriterstotheinvestorplaintiffs,headded.Undertheagreement,thesettlingcompanieswillalsoassigntheirpotentialclaimsagainsttheunderwriterstotheinvestors,headded.622014012201285AirCommodoreQuaifesaidtheHornetsremainedonthree-minutealertthroughouttheoperation.AirCommodoreJohnQuaifesaidthesecurityoperationwasunprecedented.724538432453998AWashingtonCountymanmayhavethecountysfirsthumancaseofWestNilevirus,thehealthdepartmentsaidFriday.ThecountysfirstandonlyhumancaseofWestNilethisyearwasconfirmedbyhealthofficialsonSept.8….
test.tsv format notes:

test.tsv has five columns: the first is the index of each example, and the remaining columns have the same meanings as in train.tsv.
MRPC task type:
Sentence-pair binary classification. Evaluation metrics: ACC and F1.
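Similarly, accuracy and F1 can both be obtained from scikit-learn (an added sketch with made-up labels):

from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # fraction of exact matches
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall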
STS-B dataset file structure

- STS-B/
    - dev.tsv
    - test.tsv
    - train.tsv
    - LICENSE.txt
    - readme.txt
    - original/
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, the training, validation, and test sets. train.tsv and dev.tsv share the same labeled format; test.tsv is unlabeled.

train.tsv sample:
index	genre	filename	year	old_index	source1	source2	sentence1	sentence2	score
0	main-captions	MSRvid	2012test	0001	none	none	A plane is taking off.	An air plane is taking off.	5.000
1	main-captions	MSRvid	2012test	0004	none	none	A man is playing a large flute.	A man is playing a flute.	3.800
2	main-captions	MSRvid	2012test	0005	none	none	A man is spreading shreded cheese on a pizza.	A man is spreading shredded cheese on an uncooked pizza.	3.800
3	main-captions	MSRvid	2012test	0006	none	none	Three men are playing chess.	Two men are playing chess.	2.600
4	main-captions	MSRvid	2012test	0009	none	none	A man is playing the cello.	A man seated is playing the cello.	4.250
5	main-captions	MSRvid	2012test	0011	none	none	Some men are fighting.	Two men are fighting.	4.250
6	main-captions	MSRvid	2012test	0012	none	none	A man is smoking.	A man is skating.	0.500
7	main-captions	MSRvid	2012test	0013	none	none	The man is playing the piano.	The man is playing the guitar.	1.600
8	main-captions	MSRvid	2012test	0014	none	none	A man is playing on a guitar and singing.	A woman is playing an acoustic guitar and singing.	2.200
9	main-captions	MSRvid	2012test	0016	none	none	A person is throwing a cat on to the ceiling.	A person throws a cat on the ceiling.	5.000
...
train.tsv format notes:
train.tsv has ten columns. The first is the example index; the second is the origin of the sentence pair (e.g. main-captions means it comes from captions); the third is the file name of that source; the fourth is the year; the fifth is the index in the original data; the sixth and seventh give the original sources of the two sentences; the eighth and ninth are the sentence pair, with varying degrees of similarity; the tenth is the similarity score of the pair, from low to high, with values in [0, 5].

test.tsv sample:
index	genre	filename	year	old_index	source1	source2	sentence1	sentence2
0	main-captions	MSRvid	2012test	0024	none	none	A girl is styling her hair.	A girl is brushing her hair.
1	main-captions	MSRvid	2012test	0033	none	none	A group of men play soccer on the beach.	A group of boys are playing soccer on the beach.
2	main-captions	MSRvid	2012test	0045	none	none	One woman is measuring another woman's ankle.	A woman measures another woman's ankle.
3	main-captions	MSRvid	2012test	0063	none	none	A man is cutting up a cucumber.	A man is slicing a cucumber.
4	main-captions	MSRvid	2012test	0066	none	none	A man is playing a harp.	A man is playing a keyboard.
5	main-captions	MSRvid	2012test	0074	none	none	A woman is cutting onions.	A woman is cutting tofu.
6	main-captions	MSRvid	2012test	0076	none	none	A man is riding an electric bicycle.	A man is riding a bicycle.
7	main-captions	MSRvid	2012test	0082	none	none	A man is playing the drums.	A man is playing the guitar.
8	main-captions	MSRvid	2012test	0092	none	none	A man is playing guitar.	A lady is playing the guitar.
9	main-captions	MSRvid	2012test	0095	none	none	A man is playing a guitar.	A man is playing a trumpet.
10	main-captions	MSRvid	2012test	0096	none	none	A man is playing a guitar.	A man is playing a trumpet.
...
test.tsv format notes:
test.tsv has nine columns, whose meanings match the first nine columns of train.tsv.

STS-B task type:
Sentence-pair multi-class classification / sentence-pair regression. Evaluation metric: Pearson-Spearman Corr.
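As an added sketch (not from the original text), the Pearson and Spearman correlations between predicted and reference similarity scores can be computed with SciPy; the scores below are placeholders.

from scipy.stats import pearsonr, spearmanr

gold = [5.0, 3.8, 2.6, 4.25, 0.5]   # reference similarity scores in [0, 5]
pred = [4.7, 3.5, 3.0, 4.00, 1.0]   # model predictions
print(pearsonr(gold, pred)[0])       # linear correlation
print(spearmanr(gold, pred)[0])      # rank correlation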
QQP dataset file structure

- QQP/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, the training, validation, and test sets. train.tsv and dev.tsv share the same labeled format; test.tsv is unlabeled.

train.tsv sample:
idqid1qid2question1question2is_duplicate133273213221213222Howisthelifeofamathstudent?Couldyoudescribeyourownexperiences?Whichlevelofpreprationisenoughfortheexamjlpt5?0402555536040536041HowdoIcontrolmyhornyemotions?Howdoyoucontrolyourhorniness?1360472364011490273Whatcausesstoolcolortochangetoyellow?Whatcancausestooltocomeoutaslittleballs?01506621557217256WhatcanonedoafterMBBS?WhatdoidoaftermyMBBS?1183004279958279959WherecanIfindapoweroutletformylaptopatMelbourneAirport?WouldasecondairportinSydney,Australiabeneededifahigh-speedraillinkwascreatedbetweenMelbourneandSydney?0119056193387193388HownottofeelguiltysinceIamMuslimandI‘mconsciouswewon’thavesextogether?Idon‘tbeleiveIambulimic,butIforcethrowupatleastonceadayafterIeatsomethingandfeelguilty.ShouldItellsomebody,andifsowho?035686342286296457Howisairtrafficcontrolled?Howdoyoubecomeanairtrafficcontroller?0106969147570787Whatisthebestselfhelpbookyouhaveread?Why?Howdiditchangeyourlife?WhatarethetopselfhelpbooksIshouldread?1…
train.tsv format notes:
train.tsv has six columns. The first is the example index; the second and third are the IDs of question 1 and question 2; the fourth and fifth are the question pair to be judged for duplication; the sixth is the label indicating whether the questions are duplicates, 0 for not duplicate and 1 for duplicate.

test.tsv sample:
idquestion1question20WouldtheideaofTrumpandPutininbedtogetherscareyou,giventhegeopoliticalimplications?DoyouthinkthatifDonaldTrumpwereelectedPresident,hewouldbeabletorestorerelationswithPutinandRussiaashesaidhecould,basedontherockyrelationshipPutinhadwithObamaandBush?1WhatarethetoptenConsumer-to-ConsumerE-commerceonline?WhatarethetoptenConsumer-to-BusinessE-commerceonline?2Whydon’tpeoplesimply‘Google’insteadofaskingquestionsonQuora?WhydopeopleaskQuoraquestionsinsteadofjustsearchinggoogle?3Isitsafetoinvestinsocialtradebiz?Issocialtradegeniune?4Iftheuniverseisexpandingthendoesmatteralsoexpand?Ifuniverseandspaceisexpanding?Doesthatmeananythingthatoccupiesspaceisalsoexpanding?5Whatisthepluralofhypothesis?Whatisthepluralofthesis?6Whatistheapplicationformyouneedforlaunchingacompany?WhatistheapplicationformyouneedforlaunchingacompanyinAustria?7WhatisBigTheta?WhenshouldIuseBigThetaasopposedtobigO?IsO(Logn)closetoO(n)orO(1)?8Whatarethehealthimplicationsofaccidentallyeatingasmallquantityofaluminiumfoil?Whataretheimplicationsofnoteatingvegetables?…
test.tsv format notes:
test.tsv has three columns: the first is the index of each example, and the second and third are the question pair to be tested.

QQP task type:
Sentence-pair binary classification. Evaluation metrics: ACC / F1.

(MNLI/SNLI) dataset file structure
- (MNLI/SNLI)/
    - dev_matched.tsv
    - dev_mismatched.tsv
    - original/
    - test_matched.tsv
    - test_mismatched.tsv
    - train.tsv
File structure notes:
The commonly used files are train.tsv, dev_matched.tsv, dev_mismatched.tsv, test_matched.tsv, and test_mismatched.tsv: respectively the training set, the validation set collected from the same sources as the training set, the validation set collected from different sources, the test set collected from the same sources, and the test set collected from different sources. train.tsv, dev_matched.tsv, and dev_mismatched.tsv share the same labeled format; test_matched.tsv and test_mismatched.tsv share the same unlabeled format.

train.tsv sample:
indexpromptIDpairIDgenresentence1_binary_parsesentence2_binary_parsesentence1_parsesentence2_parsesentence1sentence2label1gold_label03119331193ngovernment((Conceptually(creamskimming))((has(((two(basicdimensions))-)((productand)geography))).))(((Productand)geography)((are(what(make(cream(skimmingwork))))).))(ROOT(S(NP(JJConceptually)(NNcream)(NNskimming))(VP(VBZhas)(NP(NP(CDtwo)(JJbasic)(NNSdimensions))(:-)(NP(NNproduct)(CCand)(NNgeography))))(..)))(ROOT(S(NP(NNProduct)(CCand)(NNgeography))(VP(VBPare)(SBAR(WHNP(WPwhat))(S(VP(VBPmake)(NP(NP(NNcream))(VP(VBGskimming)(NP(NNwork))))))))(..)))Conceptuallycreamskimminghastwobasicdimensions-productandgeography.Productandgeographyarewhatmakecreamskimmingwork.neutralneutral1101457101457etelephone(you((know(during(((theseason)and)(iguess))))(at(at((yourlevel)(uh(you(((losethem)(to(the(nextlevel))))(if((if(they(decide(to(recall(the(the(parentteam))))))))((theBraves)(decide(to(call(to((recall(aguy))(from((tripleA)(((then((a(double(Aguy)))((goesup)(to(replacehim)))))and)((a(single(Aguy)))((goesup)(to(replacehim)))))))))))))))))))))))(You((((lose(thethings))(to(the(followinglevel))))(if((thepeople)recall))).))(ROOT(S(NP(PRPyou))(VP(VBPknow)(PP(INduring)(NP(NP(DTthe)(NNseason))(CCand)(NP(FWi)(FWguess))))(PP(INat)(INat)(NP(NP(PRP\(your)(NNlevel))(SBAR(S(INTJ(UHuh))(NP(PRPyou))(VP(VBPlose)(NP(PRPthem))(PP(TOto)(NP(DTthe)(JJnext)(NNlevel)))(SBAR(INif)(S(SBAR(INif)(S(NP(PRPthey))(VP(VBPdecide)(S(VP(TOto)(VP(VBrecall)(NP(DTthe)(DTthe)(NNparent)(NNteam))))))))(NP(DTthe)(NNPSBraves))(VP(VBPdecide)(S(VP(TOto)(VP(VBcall)(S(VP(TOto)(VP(VBrecall)(NP(DTa)(NNguy))(PP(INfrom)(NP(NP(RBtriple)(DTA))(SBAR(S(S(ADVP(RBthen))(NP(DTa)(JJdouble)(NNPA)(NNguy))(VP(VBZgoes)(PRT(RPup))(S(VP(TOto)(VP(VBreplace)(NP(PRPhim)))))))(CCand)(S(NP(DTa)(JJsingle)(NNPA)(NNguy))(VP(VBZgoes)(PRT(RPup))(S(VP(TOto)(VP(VBreplace)(NP(PRPhim))))))))))))))))))))))))))))(ROOT(S(NP(PRPYou))(VP(VBPlose)(NP(DTthe)(NNSthings))(PP(TOto)(NP(DTthe)(JJfollowing)(NNlevel)))(SBAR(INif)(S(NP(DTthe)(NNSpeople))(VP(VBPrecall)))))(..)))youknowduringtheseasonandiguessatatyourleveluhyoulosethemtothenextlevelififtheydecidetorecallthetheparentteamtheBravesdecidetocalltorecallaguyfromtripleAthenadoubleAguygoesuptoreplacehimandasingleAguygoesuptoreplacehimYoulosethethingstothefollowinglevelifthepeoplerecall.entailmententailment2134793134793efiction((One(of(ournumber)))((will(((carryout)(yourinstructions))minutely)).))(((Amember)(of(myteam)))((will((execute(yourorders))(with(immenseprecision)))).))(ROOT(S(NP(NP(CDOne))(PP(INof)(NP(PRP\)our)(NNnumber))))(VP(MDwill)(VP(VBcarry)(PRT(RPout))(NP(PRP\(your)(NNSinstructions))(ADVP(RBminutely))))(..)))(ROOT(S(NP(NP(DTA)(NNmember))(PP(INof)(NP(PRP\)my)(NNteam))))(VP(MDwill)(VP(VBexecute)(NP(PRP\(your)(NNSorders))(PP(INwith)(NP(JJimmense)(NNprecision)))))(..)))Oneofournumberwillcarryoutyourinstructionsminutely.Amemberofmyteamwillexecuteyourorderswithimmenseprecision.entailmententailment33739737397efiction((How(((doyou)know)?))((Allthis)(((is(theirinformation))again).)))((Thisinformation)((belongs(tothem)).))(ROOT(S(SBARQ(WHADVP(WRBHow))(SQ(VBPdo)(NP(PRPyou))(VP(VBknow)))(.?))(NP(PDTAll)(DTthis))(VP(VBZis)(NP(PRP\)their)(NNinformation))(ADVP(RBagain)))(..)))(ROOT(S(NP(DTThis)(NNinformation))(VP(VBZbelongs)(PP(TOto)(NP(PRPthem))))(..)))Howdoyouknow?Allthisistheirinformationagain.Thisinformationbelongstothem.entailmententailment…
train.tsv format notes:
train.tsv has twelve columns. The first is the example index; the second and third are two kinds of IDs for the sentence pair; the fourth is the genre the pair comes from; the fifth and sixth are the sentence pair as binary syntactic parses; the seventh and eighth are the sentence pair as full parses with part-of-speech tags; the ninth and tenth are the raw sentence pair; the eleventh and twelfth are labels produced by two different annotation standards, which are always identical here. There are three label types: neutral means the two sentences neither contradict nor entail each other, entailment means the first sentence entails the second, and contradiction means the two sentences contradict each other.

test_matched.tsv sample:
indexpromptIDpairIDgenresentence1_binary_parsesentence2_binary_parsesentence1_parsesentence2_parsesentence1sentence203149331493travel((((((((Hierbas,)(ansseco)),)(ansdulce)),)and)frigola)(((arejust)((a(fewnames))(worth((keeping(alook-out))for)))).))(Hierbas((is((aname)(worth((lookingout)for)))).))(ROOT(S(NP(NP(NNSHierbas))(,,)(NP(NNans)(NNseco))(,,)(NP(NNans)(NNdulce))(,,)(CCand)(NP(NNfrigola)))(VP(VBPare)(ADVP(RBjust))(NP(NP(DTa)(JJfew)(NNSnames))(PP(JJworth)(S(VP(VBGkeeping)(NP(DTa)(NNlook-out))(PP(INfor)))))))(..)))(ROOT(S(NP(NNSHierbas))(VP(VBZis)(NP(NP(DTa)(NNname))(PP(JJworth)(S(VP(VBGlooking)(PRT(RPout))(PP(INfor)))))))(..)))Hierbas,ansseco,ansdulce,andfrigolaarejustafewnamesworthkeepingalook-outfor.Hierbasisanameworthlookingoutfor.19216492164government(((Theextent)(of(the(behavioraleffects))))((would((depend(in(part(on((thestructure)(of(((the(individual(accountprogram)))and)(anylimits))))))))(on(accessing(thefunds))))).))((Manypeople)((would(be(very(unhappy(to((loosecontrol)(over(their(ownmoney))))))))).))(ROOT(S(NP(NP(DTThe)(NNextent))(PP(INof)(NP(DTthe)(JJbehavioral)(NNSeffects))))(VP(MDwould)(VP(VBdepend)(PP(INin)(NP(NP(NNpart))(PP(INon)(NP(NP(DTthe)(NNstructure))(PP(INof)(NP(NP(DTthe)(JJindividual)(NNaccount)(NNprogram))(CCand)(NP(DTany)(NNSlimits))))))))(PP(INon)(S(VP(VBGaccessing)(NP(DTthe)(NNSfunds)))))))(..)))(ROOT(S(NP(JJMany)(NNSpeople))(VP(MDwould)(VP(VBbe)(ADJP(RBvery)(JJunhappy)(PP(TOto)(NP(NP(JJloose)(NNcontrol))(PP(INover)(NP(PRP\(their)(JJown)(NNmoney))))))))(..)))Theextentofthebehavioraleffectswoulddependinpartonthestructureoftheindividualaccountprogramandanylimitsonaccessingthefunds.Manypeoplewouldbeveryunhappytoloosecontrolovertheirownmoney.296629662government(((Timelyaccess)(toinformation))((is(in((the(bestinterests))(of(((bothGAO)and)(theagencies)))))).))(It(((is(in((everyone's)(bestinterest))))(to((haveaccess)(to(information(in(a(timelymanner)))))))).))(ROOT(S(NP(NP(JJTimely)(NNaccess))(PP(TOto)(NP(NNinformation))))(VP(VBZis)(PP(INin)(NP(NP(DTthe)(JJSbest)(NNSinterests))(PP(INof)(NP(NP(DTboth)(NNPGAO))(CCand)(NP(DTthe)(NNSagencies)))))))(..)))(ROOT(S(NP(PRPIt))(VP(VBZis)(PP(INin)(NP(NP(NNeveryone)(POS's))(JJSbest)(NNinterest)))(S(VP(TOto)(VP(VBhave)(NP(NNaccess))(PP(TOto)(NP(NP(NNinformation))(PP(INin)(NP(DTa)(JJtimely)(NNmanner)))))))))(..)))TimelyaccesstoinformationisinthebestinterestsofbothGAOandtheagencies.Itisineveryone'sbestinteresttohaveaccesstoinformationinatimelymanner.359915991travel((Based(in((the(Auvergnat(spatown)))(ofVichy))))(,((the(Frenchgovernment))(often((((proved(morezealous))(than(itsmasters)))(in(((suppressing(civilliberties))and)((drawingup)(anti-Jewishlegislation))))).)))))((The(Frenchgovernment))((passed((anti-Jewishlaws)(aimed(at(helping(theNazi)))))).))(ROOT(S(PP(VBNBased)(PP(INin)(NP(NP(DTthe)(NNPAuvergnat)(NNspa)(NNtown))(PP(INof)(NP(NNPVichy))))))(,,)(NP(DTthe)(JJFrench)(NNgovernment))(ADVP(RBoften))(VP(VBDproved)(NP(JJRmore)(NNSzealous))(PP(INthan)(NP(PRP\)its)(NNSmasters)))(PP(INin)(S(VP(VP(VBGsuppressing)(NP(JJcivil)(NNSliberties)))(CCand)(VP(VBGdrawing)(PRT(RPup))(NP(JJanti-Jewish)(NNlegislation)))))))(..)))(ROOT(S(NP(DTThe)(JJFrench)(NNgovernment))(VP(VBDpassed)(NP(NP(JJanti-Jewish)(NNSlaws))(VP(VBNaimed)(PP(INat)(S(VP(VBGhelping)(NP(DTthe)(JJNazi))))))))(..)))BasedintheAuvergnatspatownofVichy,theFrenchgovernmentoftenprovedmorezealousthanitsmastersinsuppressingcivillibertiesanddrawingupanti-Jewishlegislation.TheFrenchgovernmentpassedanti-JewishlawsaimedathelpingtheNazi….
test_matched.tsv format notes:
test_matched.tsv has ten columns, with the same meanings as the first ten columns of train.tsv.

(MNLI/SNLI) task type:
Sentence-pair multi-class classification. Evaluation metric: ACC.

(QNLI/RTE/WNLI) dataset file structure
* The QNLI, RTE, and WNLI datasets share essentially the same layout.
- (QNLI/RTE/WNLI)/
    - dev.tsv
    - test.tsv
    - train.tsv
File structure notes:
The files used most often are train.tsv, dev.tsv, and test.tsv, the training, validation, and test sets. train.tsv and dev.tsv share the same labeled format; test.tsv is unlabeled.

QNLI train.tsv sample:
indexquestionsentencelabel0WhendidthethirdDigimonseriesbegin?Unlikethetwoseasonsbeforeitandmostoftheseasonsthatfollowed,DigimonTamerstakesadarkerandmorerealisticapproachtoitsstoryfeaturingDigimonwhodonotreincarnateaftertheirdeathsandmorecomplexcharacterdevelopmentintheoriginalJapanese.not_entailment1Whichmissilebatteriesoftenhaveindividuallaunchersseveralkilometresfromoneanother?WhenMANPADSisoperatedbyspecialists,batteriesmayhaveseveraldozenteamsdeployingseparatelyinsmallsections;self-propelledairdefencegunsmaydeployinpairs.not_entailment2WhattwothingsdoesPopperargueTarski‘stheoryinvolvesinanevaluationoftruth?Hebasesthisinterpretationonthefactthatexamplessuchastheonedescribedaboverefertotwothings:assertionsandthefactstowhichtheyrefer.entailment3Whatisthenameofthevillage9milesnorthofCalafatwheretheOttomanforcesattackedtheRussians?On31December1853,theOttomanforcesatCalafatmovedagainsttheRussianforceatChetateaorCetate,asmallvillageninemilesnorthofCalafat,andengagedthemon6January1854.entailment4WhatfamouspalaceislocatedinLondon?LondoncontainsfourWorldHeritageSites:theTowerofLondon;KewGardens;thesitecomprisingthePalaceofWestminster,WestminsterAbbey,andStMargaret’sChurch;andthehistoricsettlementofGreenwich(inwhichtheRoyalObservatory,GreenwichmarksthePrimeMeridian,0°longitude,andGMT).not_entailment5Whenistheterm‘Germandialects’usedinregardtotheGermanlanguage?WhentalkingabouttheGermanlanguage,thetermGermandialectsisonlyusedforthetraditionalregionalvarieties.entailment6WhatwasthenameoftheislandtheEnglishtradedtotheDutchinreturnforNewAmsterdam?AttheendoftheSecondAnglo-DutchWar,theEnglishgainedNewAmsterdam(NewYork)inNorthAmericainexchangeforDutchcontrolofRun,anIndonesianisland.entailment7HowwerethePortugueseexpelledfromMyanmar?Fromthe1720sonward,thekingdomwasbesetwithrepeatedMeitheiraidsintoUpperMyanmarandanaggingrebellioninLanNa.not_entailment8Whatdoestheword‘customer’properlyapplyto?Thebillalsorequiredrotationofprincipalmaintenanceinspectorsandstipulatedthattheword“customer”properlyappliestotheflyingpublic,notthoseentitiesregulatedbytheFAA.entailment…
RTE train.tsv sample:
indexsentence1sentence2label0NoWeaponsofMassDestructionFoundinIraqYet.WeaponsofMassDestructionFoundinIraq.not_entailment1Aplaceofsorrow,afterPopeJohnPaulIIdied,becameaplaceofcelebration,asRomanCatholicfaithfulgatheredindowntownChicagotomarktheinstallationofnewPopeBenedictXVI.PopeBenedictXVIisthenewleaderoftheRomanCatholicChurch.entailment2Herceptinwasalreadyapprovedtotreatthesickestbreastcancerpatients,andthecompanysaid,Monday,itwilldiscusswithfederalregulatorsthepossibilityofprescribingthedrugformorebreastcancerpatients.Herceptincanbeusedtotreatbreastcancer.entailment3JudieVivian,chiefexecutiveatProMedica,amedicalservicecompanythathelpssustainthe2-year-oldVietnamHeartInstituteinHoChiMinhCity(formerlySaigon),saidthatsofarabout1,500childrenhavereceivedtreatment.ThepreviousnameofHoChiMinhCitywasSaigon.entailment4Amanisdueincourtlaterchargedwiththemurder26yearsagoofateenagerwhosecasewasthefirsttobefeaturedonBBCOne‘sCrimewatch.ColetteAram,16,waswalkingtoherboyfriend’shouseinKeyworth,Nottinghamshire,on30October1983whenshedisappeared.Herbodywaslaterfoundinafieldclosetoherhome.PaulStewartHutchinson,50,hasbeenchargedwithmurderandisduebeforeNottinghammagistrateslater.PaulStewartHutchinsonisaccusedofhavingstabbedagirl.not_entailment5Britainsaid,Friday,thatithasbarredcleric,OmarBakri,fromreturningtothecountryfromLebanon,wherehewasreleasedbypoliceafterbeingdetainedfor24hours.Bakriwasbrieflydetained,butwasreleased.entailment6Nearly4millionchildrenwhohaveatleastoneparentwhoenteredtheU.S.illegallywerebornintheUnitedStatesandareU.S.citizensasaresult,accordingtothestudyconductedbythePewHispanicCenter.That‘saboutthreequartersoftheestimated5.5millionchildrenofillegalimmigrantsinsidetheUnitedStates,accordingtothestudy.About1.8millionchildrenofundocumentedimmigrantsliveinpoverty,thestudyfound.ThreequartersofU.S.illegalimmigrantshavechildren.not_entailment7LiketheUnitedStates,U.N.officialsarealsodismayedthatAristidekilledaconferencecalledbyPrimeMinisterRobertMalvalinPort-au-Princeinhopesofbringingallthefeudingpartiestogether.AristidehadPrimeMinisterRobertMalvalmurderedinPort-au-Prince.not_entailment8WASHINGTON–AnewlydeclassifiednarrativeoftheBushadministration’sadvicetotheCIAonharshinterrogationsshowsthatthesmallgroupofJusticeDepartmentlawyerswhowrotememosauthorizingcontroversialinterrogationtechniqueswereoperatingnotontheirownbutwithdirectionfromtopadministrationofficials,includingthen-VicePresidentDickCheneyandnationalsecurityadviserCondoleezzaRice.Atthesametime,thenarrativesuggeststhatthen-DefenseSecretaryDonaldH.Rumsfeldandthen-SecretaryofStateColinPowellwerelargelyleftoutofthedecision-makingprocess.DickCheneywastheVicePresidentofBush.entailment
WNLI train.tsv sample:
indexsentence1sentence2label0Istuckapinthroughacarrot.WhenIpulledthepinout,ithadahole.Thecarrothadahole.11Johncouldn‘tseethestagewithBillyinfrontofhimbecauseheissoshort.Johnissoshort.12Thepolicearrestedallofthegangmembers.Theyweretryingtostopthedrugtradeintheneighborhood.Thepoliceweretryingtostopthedrugtradeintheneighborhood.13StevefollowsFred’sexampleineverything.Heinfluenceshimhugely.Steveinfluenceshimhugely.04WhenTatyanareachedthecabin,hermotherwassleeping.Shewascarefulnottodisturbher,undressingandclimbingbackintoherberth.motherwascarefulnottodisturbher,undressingandclimbingbackintoherberth.05Georgegotfreeticketstotheplay,buthegavethemtoEric,becausehewasparticularlyeagertoseeit.Georgewasparticularlyeagertoseeit.06Johnwasjoggingthroughtheparkwhenhesawamanjugglingwatermelons.Hewasveryimpressive.Johnwasveryimpressive.07Icouldn‘tputthepotontheshelfbecauseitwastootall.Thepotwastootall.18Wehadhopedtoplacecopiesofournewsletteronallthechairsintheauditorium,butthereweresimplynotenoughofthem.Thereweresimplynotenoughcopiesofthenewsletter.1
(QNLI/RTE/WNLI) train.tsv format notes:
train.tsv has four columns. The first is the example index; the second and third are the sentence pair to be judged for entailment; the fourth indicates whether the two sentences stand in an entailment relation, with 0 / not_entailment meaning no entailment and 1 / entailment meaning entailment.

QNLI test.tsv sample:
index	question	sentence
0	What organization is devoted to Jihad against Israel?	For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a "quiescent" stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel's "indulgence" to build up a network of mosques and charitable organizations.
1	In what century was the Yarrow-Schlick-Tweedy balancing system used?	In the late 19th century, the Yarrow-Schlick-Tweedy balancing 'system' was used on some marine triple expansion engines.
2	The largest brand of what store in the UK is located in Kingston Park?	Close to Newcastle, the largest indoor shopping centre in Europe, the MetroCentre, is located in Gateshead.
3	What does the IPCC rely on for research?	In principle, this means that any significant new evidence or events that change our understanding of climate science between this deadline and publication of an IPCC report cannot be included.
4	What is the principle about relating spin and space variables?	Thus in the case of two fermions there is a strictly negative correlation between spatial and spin variables, whereas for two bosons (e.g. quanta of electromagnetic waves, photons) the correlation is strictly positive.
5	Which network broadcasted Super Bowl 50 in the U.S.?	CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game.
6	What did the museum acquire from the Royal College of Science?	To link this to the rest of the museum, a new entrance building was constructed on the site of the former boiler house, the intended site of the Spiral, between 1978 and 1982.
7	What is the name of the old north branch of the Rhine?	From Wijk bij Duurstede, the old north branch of the Rhine is called Kromme Rijn ("Bent Rhine") past Utrecht, first Leidse Rijn ("Rhine of Leiden") and then, Oude Rijn ("Old Rhine").
8	What was one of Luther's most personal writings?	It remains in use today, along with Luther's hymns and his translation of the Bible.
...

(RTE/WNLI) test.tsv sample:

index	sentence1	sentence2
0	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when Maude and Dora came in sight.
1	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when the trains came in sight.
2	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when the puffs came in sight.
3	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when the roars came in sight.
4	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when the whistles came in sight.
5	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.	Horses ran away when the horses came in sight.
6	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming.	Maude and Dora saw a train coming.
7	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming.	The trains saw a train coming.
8	Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming.	The puffs saw a train coming.
...

(QNLI/RTE/WNLI) test.tsv format notes:

test.tsv has three columns: the first is the index of each example, and the second and third are the sentence pair to be judged for entailment.

(QNLI/RTE/WNLI) task type:

Sentence-pair binary classification. Evaluation metric: ACC.

3. Common pretrained models in NLP

Pretrained models popular in NLP today: BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT, ALBERT, T5, XLM-RoBERTa.

BERT and its variants:

bert-base-uncased: 12 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters in total; trained on lower-cased English text.
bert-large-uncased: 24 encoder layers, 1024-dimensional output tensors, 16 self-attention heads, 340M parameters; trained on lower-cased English text.
bert-base-cased: 12 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters; trained on case-preserving English text.
bert-large-cased: 24 encoder layers, 1024-dimensional output tensors, 16 self-attention heads, 340M parameters; trained on case-preserving English text.
bert-base-multilingual-uncased: 12 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters; trained on lower-cased text in 102 languages.
bert-large-multilingual-uncased: 24 encoder layers, 1024-dimensional output tensors, 16 self-attention heads, 340M parameters; trained on lower-cased text in 102 languages.
bert-base-chinese: 12 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters; trained on simplified and traditional Chinese text.

GPT:

openai-gpt: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters; trained by OpenAI on English corpora.

GPT-2 and its variants:

gpt2: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 117M parameters; trained on the OpenAI GPT-2 English corpus.
gpt2-xl: 48 layers, 1600-dimensional output tensors, 25 self-attention heads, 1558M parameters; trained on the large OpenAI GPT-2 English corpus.

Transformer-XL:

transfo-xl-wt103: 18 layers, 1024-dimensional output tensors, 16 self-attention heads, 257M parameters; trained on the WikiText-103 English corpus.

XLNet and its variants:

xlnet-base-cased: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 110M parameters; trained on English corpora.
xlnet-large-cased: 24 layers, 1024-dimensional output tensors, 16 self-attention heads, 240M parameters; trained on English corpora.

XLM:

xlm-mlm-en-2048: 12 layers, 2048-dimensional output tensors, 16 self-attention heads; trained on English text.

RoBERTa and its variants:

roberta-base: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 125M parameters; trained on English text.
roberta-large: 24 layers, 1024-dimensional output tensors, 16 self-attention heads, 355M parameters; trained on English text.

DistilBERT and its variants:

distilbert-base-uncased: a distilled (compressed) version of bert-base-uncased, with 6 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 66M parameters.
distilbert-base-multilingual-cased: a distilled (compressed) version of bert-base-multilingual-cased, with 6 encoder layers, 768-dimensional output tensors, 12 self-attention heads, 66M parameters.

ALBERT:

albert-base-v1: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 125M parameters; trained on English text.
albert-base-v2: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 125M parameters; trained on English text with more data and longer training than v1.

T5 and its variants:

t5-small: 6 layers, 512-dimensional output tensors, 8 self-attention heads, 60M parameters; trained on the C4 corpus.
t5-base: 12 layers, 768-dimensional output tensors, 12 self-attention heads, 220M parameters; trained on the C4 corpus.
t5-large: 24 layers, 1024-dimensional output tensors, 16 self-attention heads, 770M parameters; trained on the C4 corpus.

XLM-RoBERTa and its variants:

xlm-roberta-base: 12 layers, 768-dimensional output tensors, 8 self-attention heads, 125M parameters; trained on 2.5TB of text in 100 languages.
xlm-roberta-large: 24 layers, 1024-dimensional output tensors, 16 self-attention heads, 355M parameters; trained on 2.5TB of text in 100 languages.

Notes on pretrained models:

All of the pretrained models above and their variants are based on the transformer; they differ only in structural details such as how neurons are connected, the number of encoder layers, and the number of attention heads. Most of these design choices are justified by performance on standard datasets, so as users we do not need to analyze the theoretical merits of each architecture in depth; on our own target data, we simply try every applicable model and compare the results to find the one that works best.

4. Loading and using pretrained models

Tools for loading and using pretrained models

Here we use the torch.hub tool to load and use the models. These pretrained models are provided by Hugging Face, a leading NLP research team.

Steps for loading and using a pretrained model

Step 1: decide which pretrained model to load and install the dependencies.
Step 2: load the pretrained model's tokenizer.
Step 3: load the pretrained model, with or without a head.
Step 4: use the model to obtain outputs.

Step 1: decide which pretrained model to load and install the dependencies

The loadable models are listed in section 3, Common pretrained models in NLP, above. Here we assume the task involves Chinese text, so the model to load is BERT's Chinese model: bert-base-chinese.

Before loading the model, install the required dependencies:

pip install tqdm boto3 requests regex sentencepiece sacremoses

Step 2: load the pretrained model's tokenizer

import torch

# Source of the pretrained models
source = 'huggingface/pytorch-transformers'
# Which part of the model to load; here it is the tokenizer
part = 'tokenizer'
# Name of the pretrained model to load
model_name = 'bert-base-chinese'
tokenizer = torch.hub.load(source, part, model_name)

Step 3: load the pretrained model, with or without a head

When loading a pretrained model we can choose whether it comes with a head. The 'head' here refers to the model's task-specific output layer.
Loading the model without a head amounts to using the model only to produce feature representations of the input text. When loading a model with a head, three kinds of head are available: modelWithLMHead (a language-model head), modelForSequenceClassification (a classification head), and modelForQuestionAnswering (a question-answering head). Different heads make the pretrained model emit tensors of specific shapes; for example, with the classification head the output is a tensor of shape (1, 2) that can be used directly as the result of a classification task.

# Load the pretrained model without a head
part = 'model'
model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a language-model head
part = 'modelWithLMHead'
lm_model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a sequence-classification head
part = 'modelForSequenceClassification'
classification_model = torch.hub.load(source, part, model_name)

# Load the pretrained model with a question-answering head
part = 'modelForQuestionAnswering'
qa_model = torch.hub.load(source, part, model_name)

Step 4: use the model to obtain outputs

Using the model without a head:

# The input Chinese text
input_text = "人生该如何起头"
# Map the text to ids with the tokenizer
indexed_tokens = tokenizer.encode(input_text)
# Print the mapped ids
print("indexed_tokens:", indexed_tokens)
# Convert the id list into a tensor to feed the headless pretrained model
tokens_tensor = torch.tensor([indexed_tokens])

# Obtain the output of the headless pretrained model
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor)
print("不带头的模型输出结果:", encoded_layers)
print("不带头的模型输出结果的尺寸:", encoded_layers.shape)

Output:

# Result of the tokenizer mapping; 101 and 102 are the start and end markers,
# and each number in between corresponds to one character of "人生该如何起头".
indexed_tokens: [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]
不带头的模型输出结果: tensor([[[ 0.5421,  0.4526, -0.0179,  ...,  1.0447, -0.1140,  0.0068],
        [-0.1343,  0.2785,  0.1602,  ..., -0.0345, -0.1646, -0.2186],
        [ 0.9960, -0.5121, -0.6229,  ...,  1.4173,  0.5533, -0.2681],
        ...,
        [ 0.0115,  0.2150, -0.0163,  ...,  0.6445,  0.2452, -0.3749],
        [ 0.8649,  0.4337, -0.1867,  ...,  0.7397, -0.2636,  0.2144],
        [-0.6207,  0.1668,  0.1561,  ...,  1.1218, -0.0985, -0.0937]]])

# The output shape is 1 x 9 x 768, i.e. each character is now represented by a 768-dimensional vector.
# We can build further custom operations on top of this encoding, e.g. our own fine-tuning network.
不带头的模型输出结果的尺寸: torch.Size([1, 9, 768])

Using the model with the language-model head:

# Obtain the output of the model with the language-model head
with torch.no_grad():
    lm_output = lm_model(tokens_tensor)
print("带语言模型头的模型输出结果:", lm_output)
print("带语言模型头的模型输出结果的尺寸:", lm_output[0].shape)

Output:

带语言模型头的模型输出结果: (tensor([[[ -7.9706,  -7.9119,  -7.9317,  ...,  -7.2174,  -7.0263,  -7.3746],
        [ -8.2097,  -8.1810,  -8.0645,  ...,  -7.2349,  -6.9283,  -6.9856],
        [-13.7458, -13.5978, -12.6076,  ...,  -7.6817,  -9.5642, -11.9928],
        ...,
        [ -9.0928,  -8.6857,  -8.4648,  ...,  -8.2368,  -7.5684, -10.2419],
        [ -8.9458,  -8.5784,  -8.6325,  ...,  -7.0547,  -5.3288,  -7.8077],
        [ -8.4154,  -8.5217,  -8.5379,  ...,  -6.7102,  -5.9782,  -7.6909]]]),)

# The output shape is 1 x 9 x 21128, i.e. each character is represented by a 21128-dimensional vector.
# As with the headless model, we can build custom operations on top of this encoding.
带语言模型头的模型输出结果的尺寸: torch.Size([1, 9, 21128])

Using the model with the classification head:

# Obtain the output of the model with the sequence-classification head
with torch.no_grad():
    classification_output = classification_model(tokens_tensor)
print("带分类模型头的模型输出结果:", classification_output)
print("带分类模型头的模型输出结果的尺寸:", classification_output[0].shape)

Output:

带分类模型头的模型输出结果: (tensor([[-0.0649, -0.1593]]),)
# The output shape is 1 x 2 and can be used directly as the output of a binary text classification problem.
带分类模型头的模型输出结果的尺寸: torch.Size([1, 2])

Using the model with the question-answering head:

# With the QA head, the input must be a sentence pair:
# the first sentence is a statement about some fact,
# the second sentence is a question about the first sentence.
# The QA model produces two tensors; the index of the maximum value in each tensor
# marks the start and end position of the answer within the text.
input_text1 = "我家的小狗是黑色的"
input_text2 = "我家的小狗是什么颜色的呢?"

# Encode the two sentences together
indexed_tokens = tokenizer.encode(input_text1, input_text2)
print("句子对的indexed_tokens:", indexed_tokens)

# Output: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]

# Use 0s and 1s to distinguish the first and the second sentence
segments_ids = [0] * 11 + [1] * 14
# Convert to tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Obtain the output of the model with the question-answering head
with torch.no_grad():
    start_logits, end_logits = qa_model(tokens_tensor, token_type_ids=segments_tensors)
print("带问答模型头的模型输出结果:", (start_logits, end_logits))
print("带问答模型头的模型输出结果的尺寸:", (start_logits.shape, end_logits.shape))

Output:
句子对的indexed_tokens: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]
带问答模型头的模型输出结果: (tensor([[ 0.2574, -0.0293, -0.8337, -0.5135, -0.3645, -0.2216, -0.1625, -0.2768, -0.8368, -0.2581,  0.0131, -0.1736, -0.5908, -0.4104, -0.2155, -0.0307, -0.1639, -0.2691, -0.4640, -0.1696, -0.4943, -0.0976, -0.6693,  0.2426,  0.0131]]), tensor([[-0.3788, -0.2393, -0.5264, -0.4911, -0.7277, -0.5425, -0.6280, -0.9800, -0.6109, -0.2379, -0.0042, -0.2309, -0.4894, -0.5438, -0.6717, -0.5371, -0.1701,  0.0826,  0.1411, -0.1180, -0.4732, -0.1541,  0.2543,  0.2163, -0.0042]]))
# The output is two tensors of shape 1 x 25, one score per position of the combined sentence pair.
# The index of the maximum value in the first tensor is the start position of the answer,
# and the index of the maximum value in the second tensor is the end position of the answer.
带问答模型头的模型输出结果的尺寸: (torch.Size([1, 25]), torch.Size([1, 25]))

5. Transfer learning in practice

Fine-tuning scripts for specific task types:

The Hugging Face team provides fine-tuning scripts for the task types in the GLUE benchmark. At their core, these scripts fine-tune the last fully connected layer of the model. Through simple parameter configuration you specify one of the task types present in GLUE (e.g. CoLA corresponds to binary text classification, MRPC to sentence-pair binary classification, STS-B to sentence-pair multi-class classification) and the pretrained model to be fine-tuned.

Steps for using a task-specific fine-tuning script

Step 1: download the fine-tuning script files.
Step 2: configure the script's parameters.
Step 3: run it and check the results.

Step 1: download the fine-tuning script files

# Clone the huggingface transformers repository
git clone https://github.com/huggingface/transformers.git
# Enter the transformers directory
cd transformers
# Install the Python transformers package, since the fine-tuning scripts are .py files
pip install .
# The current version may differ from the one used in this tutorial, so also run:
pip install transformers==2.3.0
# Enter the directory containing the fine-tuning scripts and list them
cd examples
ls
# run_glue.py is the fine-tuning script for the GLUE task types

Note:

Because run_glue.py has changed across versions, please copy the code from http://git.itcast.cn/Stephen/AI-key-file/blob/master/run_glue.py and overwrite the original file.

Step 2: configure the script's parameters

Create a run_glue.sh file in the same directory as run_glue.py and write the following into it:

# DATA_DIR: path to the fine-tuning data; here we use the data in glue_data
export DATA_DIR="../../glue_data"
# SAVE_DIR: path where the model will be saved; here the bert_finetuning_test folder in the current directory
export SAVE_DIR="./bert_finetuning_test/"

# Run the fine-tuning script with python
# --model_type: the type of model to fine-tune; BERT, XLNET, XLM, RoBERTa, DistilBERT or ALBERT
# --model_name_or_path: the specific model or variant; we fine-tune on English data, so bert-base-uncased
# --task_name: the task type, e.g. MRPC for sentence-pair binary classification
# --do_train: run training with the fine-tuning script
# --do_eval: run evaluation with the fine-tuning script
# --data_dir: path of the training and validation sets; train.tsv and dev.tsv are found there automatically
# --max_seq_length: maximum input length; longer inputs are truncated, shorter ones padded
# --learning_rate: learning rate
# --num_train_epochs: number of training epochs
# --output_dir $SAVE_DIR: path where the trained model is saved
# --overwrite_output_dir: clear the save path and write it again when training is rerun
python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-uncased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/MRPC/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR \
  --overwrite_output_dir

Step 3: run it and check the results
# Run with the sh command
sh run_glue.sh
Output:
# The final evaluation results printed by the script:
01/05/2020 23:59:53 - INFO - __main__ - Saving features into cached file ../../glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
01/05/2020 23:59:53 - INFO - __main__ - ***** Running evaluation *****
01/05/2020 23:59:53 - INFO - __main__ - Num examples = 408
01/05/2020 23:59:53 - INFO - __main__ - Batch size = 8
Evaluating: 100%|█| 51/51 [00:23<00:00, 2.20it/s]
01/06/2020 00:00:16 - INFO - __main__ - ***** Eval results *****
01/06/2020 00:00:16 - INFO - __main__ - acc = 0.7671568627450981
01/06/2020 00:00:16 - INFO - __main__ - acc_and_f1 = 0.8073344506341863
01/06/2020 00:00:16 - INFO - __main__ - f1 = 0.8475120385232745

Check the contents of $SAVE_DIR:

added_tokens.json
checkpoint-450
checkpoint-400
checkpoint-350
checkpoint-300
checkpoint-250
checkpoint-200
checkpoint-150
checkpoint-100
checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json

File descriptions:

pytorch_model.bin contains the model parameters and can be loaded and inspected with torch.load;
training_args.bin contains the training hyperparameters, such as batch_size and epoch, and can also be inspected with torch.load;
config.json is the model configuration file (number of attention heads, number of encoder layers, and so on); it describes a standard architecture such as bert or xlnet and is normally left unchanged;
added_tokens.json records the ids of custom tokens added in code during training, i.e. vocabulary added with the add_token method;
special_tokens_map.json stores special tokens and their meanings (e.g. separators) when added tokens carry special semantics, so that special characters in the text are first mapped to their meanings, and those meanings are then mapped with add_token as usual;
checkpoint-*: model parameter files saved every fixed number of steps (checkpoint files).

Steps for using the model after fine-tuning with the script

Step 1: create an account at https://huggingface.co/join.
Step 2: log in with transformers-cli in the server terminal.
Step 3: upload the model with transformers-cli and check it.
Step 4: load the model with torch.hub and use it.

Step 1: create an account at https://huggingface.co/join

# If the site is unreachable for network reasons, a default account is provided:
username: ItcastAI
password: ItcastAI

Step 2: log in with transformers-cli in the server terminal

# Log in on the server where the model was fine-tuned,
# using the username and password registered above
# default username: ItcastAI
# default password: ItcastAI
$ transformers-cli login

Step 3: upload the model with transformers-cli and check it

# Upload the model with the transformers-cli upload command,
# pointing it at the fine-tuned model directory
$ transformers-cli upload ./bert_finetuning_test/

# Check the upload result
$ transformers-cli ls

Filename                                                LastModified              ETag                                  Size
------------------------------------------------------  ------------------------  ------------------------------------  ---------
bert_finetuning_test/added_tokens.json                  2020-01-05T17:39:57.000Z  "99914b932bd37a50b983c5e7c90ae93b"            2
bert_finetuning_test/checkpoint-400/config.json         2020-01-05T17:26:49.000Z  "74d53ea41e5acb6d60496bc195d82a42"          684
bert_finetuning_test/checkpoint-400/training_args.bin   2020-01-05T17:26:47.000Z  "b3273519c2b2b1cb2349937279880f50"         1207
bert_finetuning_test/checkpoint-450/config.json         2020-01-05T17:15:42.000Z  "74d53ea41e5acb6d60496bc195d82a42"          684
bert_finetuning_test/checkpoint-450/pytorch_model.bin   2020-01-05T17:15:58.000Z  "077cc0289c90b90d6b662cce104fe4ef"    437982584
bert_finetuning_test/checkpoint-450/training_args.bin   2020-01-05T17:15:40.000Z  "b3273519c2b2b1cb2349937279880f50"         1207
bert_finetuning_test/config.json                        2020-01-05T17:28:50.000Z  "74d53ea41e5acb6d60496bc195d82a42"          684
bert_finetuning_test/eval_results.txt                   2020-01-05T17:28:56.000Z  "67d2d49a96afc4308d33bfcddda8a7c5"           81
bert_finetuning_test/pytorch_model.bin                  2020-01-05T17:28:59.000Z  "d46a8ccfb8f5ba9ecee70cef8306679e"    437982584
bert_finetuning_test/special_tokens_map.json            2020-01-05T17:28:54.000Z  "8b3fb1023167bb4ab9d70708eb05f6ec"          112
bert_finetuning_test/tokenizer_config.json              2020-01-05T17:28:52.000Z  "0d7f03e00ecb582be52818743b50e6af"           59
bert_finetuning_test/training_args.bin                  2020-01-05T17:28:48.000Z  "b3273519c2b2b1cb2349937279880f50"         1207
bert_finetuning_test/vocab.txt                          2020-01-05T17:39:55.000Z  "64800d5d8528ce344256daf115d4965e"       231508

Step 4: load the model with torch.hub and use it (see section 4, Loading and using pretrained models, for more details)
# If you have used huggingface transformers via torch.hub before, clear ~/.cache first
import torch

source = 'huggingface/pytorch-transformers'
# Which part to load; here the tokenizer
part = 'tokenizer'
#############################################
# Name of the pretrained model to load.
# Use your own model name in the form "username/model_name",
# e.g. 'ItcastAI/bert_finetuning_test'
model_name = 'ItcastAI/bert_finetuning_test'
#############################################
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', model_name)
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', model_name)

index = tokenizer.encode("Talk is cheap", "Please show me your code!")
# 102 is the id of the separator (end) token in the bert model
mark = 102
# Find the index of the first 102, i.e. the separator between the two sentences
k = index.index(mark)
# Segment id list made of 0s and 1s: 0 marks the first sentence, 1 marks the second
segments_ids = [0] * (k + 1) + [1] * (len(index) - k - 1)
# Convert to tensors
tokens_tensor = torch.tensor([index])
segments_tensors = torch.tensor([segments_ids])

# Evaluation mode, no gradients
with torch.no_grad():
    # Run the model to obtain the prediction
    result = model(tokens_tensor, token_type_ids=segments_tensors)
    # Print the prediction and its shape
    print(result)
    print(result[0].shape)

Output:

(tensor([[-0.0181, 0.0263]]),)
torch.Size([1, 2])

Two ways of doing transfer learning via fine-tuning

Type 1: use a task-specific fine-tuning script to fine-tune the pretrained model, and obtain the result from the predefined network with an output head attached.
Type 2: load the pretrained model directly to produce feature representations of the input text, then attach a custom network that is fine-tuned to produce the result.

Note: all of the demonstrations below work on Chinese text.

Type 1 in practice

Use the fine-tuning script for the binary text classification task SST-2 to fine-tune a Chinese pretrained model, with a classification head attached to the predefined network. The goal is to judge the sentiment of a sentence.

Prepare a corpus of Chinese hotel reviews for sentiment analysis, in the same format as the SST-2 dataset; label 0 means a negative review and label 1 a positive one. The corpus lives in cn_data/, a sibling of glue_data/, and its SST-2 directory contains train.tsv and dev.tsv.

train.tsv

sentence	label
早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好,餐厅不分吸烟区.房间不分有无烟房.	0
去的时候,酒店大厅和餐厅在装修,感觉大厅有点挤.由于餐厅装修本来该享受的早饭,也没有享受(他们是8点开始每个房间送,但是我时间来不及了)不过前台服务员态度好!	1
有很长时间没有在西藏大厦住了,以前去北京在这里住的较多。这次住进来发现换了液晶电视,但网络不是很好,他们自己说是收费的原因造成的。其它还好。	1
非常好的地理位置,住的是豪华海景房,打开窗户就可以看见栈桥和海景。记得很早以前也住过,现在重新装修了。总的来说比较满意,以后还会住	1
交通很方便,房间小了一点,但是干净整洁,很有香港的特色,性价比较高,推荐一下哦	1
酒店的装修比较陈旧,房间的隔音,主要是卫生间的隔音非常差,只能算是一般的	0
酒店有点旧,房间比较小,但酒店的位子不错,就在海边,可以直接去游泳。8楼的海景打开窗户就是海。如果想住在热闹的地带,这里不是一个很好的选择,不过威海城市真的比较小,打车还是相当便宜的。晚上酒店门口出租车比较少。	1
位置很好,走路到文庙、清凉寺5分钟都用不了,周边公交车很多很方便,就是出租车不太爱去(老城区路窄爱堵车),因为是老宾馆所以设施要陈旧些,	1
酒店设备一般,套房里卧室的不能上网,要到客厅去。	0

dev.tsv

sentence	label
房间里有电脑,虽然房间的条件略显简陋,但环境、服务还有饭菜都还是很不错的。如果下次去无锡,我还是会选择这里的。	1
我们是5月1日通过携程网入住的,条件是太差了,根本达不到四星级的标准,所有的东西都很陈旧,卫生间水龙头用完竟关不上,浴缸的漆面都掉了,估计是十年前的四星级吧,总之下次是不会入住了。	0
离火车站很近很方便。住在东楼标间,相比较在九江住的另一家酒店,房间比较大。卫生间设施略旧。服务还好。10元中式早餐也不错,很丰富,居然还有青菜肉片汤。	1
坐落在香港的老城区,可以体验香港居民生活,门口交通很方便,如果时间不紧,坐叮当车很好呀!周围有很多小餐馆,早餐就在中远后面的南北嚼吃的,东西很不错。我们定的大床房,挺安静的,总体来说不错。前台结账没有银联!	1
酒店前台服务差,对待客人不热情。号称携程没有预定。感觉是客人在求他们,我们一定得住。这样的宾馆下次不会入住!	0
价格确实比较高,而且还没有早餐提供。	1
是一家很实惠的酒店,交通方便,房间也宽敞,晚上没有电话骚扰,住了两次,有一次住501房间,洗澡间排水不畅通,也许是个别问题.服务质量很好,刚入住时没有调好宽带,服务员很快就帮忙解决了.	1
位置非常好,就在西街的街口,但是却闹中取静,环境很清新优雅。	1
房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错.	1

Create a run_cn.sh file in the same directory as run_glue.py and write the following into it:

# DATA_DIR: path to the fine-tuning data
export DATA_DIR="../../cn_data"
# SAVE_DIR: path where the model will be saved; here the bert_cn_finetuning folder in the current directory
export SAVE_DIR="./bert_cn_finetuning/"

# Run the fine-tuning script with python
# --model_type: BERT
# --model_name_or_path: bert-base-chinese
# --task_name: the sentence binary classification task SST-2
# --do_train: run training with the fine-tuning script
# --do_eval: run evaluation with the fine-tuning script
# --data_dir: "../../cn_data/SST-2/"; train.tsv and dev.tsv are found there automatically
# --max_seq_length: 128, the maximum input length
# --output_dir $SAVE_DIR: "./bert_cn_finetuning/", where the trained model is saved
python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-chinese \
  --task_name SST-2 \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/SST-2/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR

Run and check the results

# Run with the sh command
sh run_cn.sh

Output:

# The final evaluation results; accuracy reaches 0.88.
01/06/2020 14:22:36 - INFO - __main__ - Saving features into cached file ../../cn_data/SST-2/cached_dev_bert-base-chinese_128_sst-2
01/06/2020 14:22:36 - INFO - __main__ -
Check the contents of $SAVE_DIR:

added_tokens.json
checkpoint-350
checkpoint-300
checkpoint-250
checkpoint-200
checkpoint-150
checkpoint-100
checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json

Upload the model with transformers-cli:

# default username: ItcastAI
# default password: ItcastAI
$ transformers-cli login

# Upload the model with the transformers-cli upload command,
# pointing it at the fine-tuned model directory
$ transformers-cli upload ./bert_cn_finetuning/
Load the model with torch.hub and use it:
import torch

source = 'huggingface/pytorch-transformers'
# The model name is 'ItcastAI/bert_cn_finetuning'
model_name = 'ItcastAI/bert_cn_finetuning'

tokenizer = torch.hub.load(source, 'tokenizer', model_name)
model = torch.hub.load(source, 'modelForSequenceClassification', model_name)

def get_label(text):
    index = tokenizer.encode(text)
    tokens_tensor = torch.tensor([index])
    # Evaluation mode, no gradients
    with torch.no_grad():
        # Run the model to obtain the prediction
        result = model(tokens_tensor)
    predicted_label = torch.argmax(result[0]).item()
    return predicted_label

if __name__ == "__main__":
    # text = "早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好"
    text = "房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错."
    print("输入文本为:", text)
    print("预测标签为:", get_label(text))
Output:
输入文本为: 早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好
预测标签为: 0

输入文本为: 房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错.
预测标签为: 1

Type 2 in practice

Load the pretrained model directly to produce feature representations of the input text, then attach a custom network that is fine-tuned to produce the result. The corpus and the goal are the same as in the Type 1 demonstration.
Loading the pretrained model directly to produce feature representations of the input text:
import torch
# For truncating/padding sentences to a fixed length
from keras.preprocessing import sequence

source = 'huggingface/pytorch-transformers'
# Use the pretrained Chinese bert model directly
model_name = 'bert-base-chinese'

# Obtain the trained bert-base-chinese model through torch.hub
model = torch.hub.load(source, 'model', model_name)
# Obtain the corresponding tokenizer, which maps each Chinese character to an id
tokenizer = torch.hub.load(source, 'tokenizer', model_name)

# Standard sentence length
cutlen = 32

def get_bert_encode(text):
    """
    description: encode Chinese text with bert-base-chinese
    :param text: the text to encode
    :return: the tensor representation of the text produced by bert
    """
    # First map every character with the tokenizer.
    # Note that bert's tokenizer adds start and end markers (101 and 102) around the result.
    # These markers matter when encoding multiple segments but are not needed here,
    # so [1:-1] slices them off.
    indexed_tokens = tokenizer.encode(text[:cutlen])[1:-1]
    # Truncate or pad the mapped sentence to the standard length
    indexed_tokens = sequence.pad_sequences([indexed_tokens], cutlen)
    # Convert the list structure into a tensor
    tokens_tensor = torch.LongTensor(indexed_tokens)
    # Do not track gradients
    with torch.no_grad():
        # Run the model to obtain the hidden-layer output
        encoded_layers, _ = model(tokens_tensor)
    # The hidden output is a 3-D tensor whose outermost dimension is 1; [0] removes it.
    encoded_layers = encoded_layers[0]
    return encoded_layers
Usage:
if __name__ == "__main__":
    text = "早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好"
    encoded_layers = get_bert_encode(text)
    print(encoded_layers)
    print(encoded_layers.shape)
Output:
tensor([[-1.2282,  1.0551, -0.7953,  ...,  2.3363, -0.6413,  0.4174],
        [-0.9769,  0.8361, -0.4328,  ...,  2.1668, -0.5845,  0.4836],
        [-0.7990,  0.6181, -0.1424,  ...,  2.2845, -0.6079,  0.5288],
        ...,
        [ 0.9514,  0.5972,  0.3120,  ...,  1.8408, -0.1362, -0.1206],
        [ 0.1250,  0.1984,  0.0484,  ...,  1.2302, -0.1905,  0.3205],
        [ 0.2651,  0.0228,  0.1534,  ...,  1.0159, -0.3544,  0.1479]])
torch.Size([32, 768])
A custom single-layer fully connected network as the fine-tuning network:
As a rule of thumb, the total number of parameters in the custom fine-tuning network should be more than 0.5 times and less than 10 times the amount of training data; this helps the model converge within a reasonable time.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """The fine-tuning network."""
    def __init__(self, char_size=32, embedding_size=768):
        """
        :param char_size: number of characters in the input sentence, i.e. the normalized sentence length (32 here).
        :param embedding_size: dimension of the character embeddings; the Chinese bert model uses 768.
        """
        super(Net, self).__init__()
        # Store char_size and embedding_size
        self.char_size = char_size
        self.embedding_size = embedding_size
        # Instantiate a fully connected layer
        self.fc1 = nn.Linear(char_size * embedding_size, 2)

    def forward(self, x):
        # Reshape the input tensor to match what the next layer expects
        x = x.view(-1, self.char_size * self.embedding_size)
        # Apply the fully connected layer
        x = self.fc1(x)
        return x
Usage:
if __name__ == "__main__":
    # A randomly initialized input
    x = torch.randn(1, 32, 768)
    # Instantiate the network with default arguments
    net = Net()
    nr = net(x)
    print(nr)
Output:
tensor([[0.3279, 0.2519]], grad_fn=<ReluBackward0>)
Build batch generators for the training and validation data:
import pandas as pd
from collections import Counter
from functools import reduce
from sklearn.utils import shuffle

def data_loader(train_data_path, valid_data_path, batch_size):
    """
    description: load data from the persisted files
    :param train_data_path: path of the training data
    :param valid_data_path: path of the validation data
    :param batch_size: batch size for the training and validation sets
    :return: training data generator, validation data generator,
             number of training examples, number of validation examples
    """
    # Read the tsv data with pandas and drop the first row (the column names)
    train_data = pd.read_csv(train_data_path, header=None, sep="\t").drop([0])
    valid_data = pd.read_csv(valid_data_path, header=None, sep="\t").drop([0])

    # Print the number of positive and negative samples in each set
    print("训练数据集的正负样本数量:")
    print(dict(Counter(train_data[1].values)))
    print("验证数据集的正负样本数量:")
    print(dict(Counter(valid_data[1].values)))

    # The validation set must contain at least one full batch
    if len(valid_data) < batch_size:
        raise ValueError("Batch size or split not match!")

    def _loader_generator(data):
        """
        description: generator yielding one batch of training/validation data at a time
        :param data: the training or validation data
        :return: a generator over batches of data and labels
        """
        # Walk over the dataset in steps of batch_size
        for batch in range(0, len(data), batch_size):
            # Tensor and label lists for this batch
            batch_encoded = []
            batch_labels = []
            # Take a batch_size slice of the shuffled data and iterate over it item by item
            for item in shuffle(data.values.tolist())[batch: batch + batch_size]:
                # Encode the text with the Chinese bert model
                encoded = get_bert_encode(item[0])
                # Collect the encoded text for this batch
                batch_encoded.append(encoded)
                # Collect the corresponding label for this batch
                batch_labels.append([int(item[1])])
            # Use reduce to turn the lists into the tensors the model expects;
            # encoded has shape (batch_size * max_len, embedding_size)
            encoded = reduce(lambda x, y: torch.cat((x, y), dim=0), batch_encoded)
            labels = torch.tensor(reduce(lambda x, y: x + y, batch_labels))
            # Yield the data and labels as a generator
            yield (encoded, labels)

    # Apply _loader_generator to the training and validation sets,
    # and also return the number of samples in each set
    return _loader_generator(train_data), _loader_generator(valid_data), len(train_data), len(valid_data)
Usage:
if __name__ == "__main__":
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    batch_size = 16
    train_data_labels, valid_data_labels, \
        train_data_len, valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
    print(next(train_data_labels))
    print(next(valid_data_labels))
    print("train_data_len:", train_data_len)
    print("valid_data_len:", valid_data_len)
Output:
训练数据集的正负样本数量:
{'0': 1518, '1': 1442}
验证数据集的正负样本数量:
{'1': 518, '0': 482}
(tensor([[[-0.8328,  0.9376, -1.2489,  ...,  1.8594, -0.4636, -0.1682],
         [-0.9798,  0.5113, -0.9868,  ...,  1.5500, -0.1934,  0.2521],
         [-0.7574,  0.3086, -0.6031,  ...,  1.8467, -0.2507,  0.3916],
         ...,
         [ 0.0064,  0.2321,  0.3785,  ...,  0.3376,  0.4748, -0.1272],
         [-0.3175,  0.4018, -0.0377,  ...,  0.6030,  0.2916, -0.4172],
         [-0.6154,  1.0439,  0.2921,  ...,  0.5048, -0.0983,  0.0061]]]),
 tensor([0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0]))
(tensor([[[-0.1611,  0.9182, -0.3419,  ...,  0.6323, -0.2013,  0.0184],
         [-0.1224,  0.7706, -0.2386,  ...,  0.7925,  0.0444,  0.2160],
         [-0.0301,  0.6867, -0.1510,  ...,  0.9140,  0.0308,  0.2611],
         ...,
         [ 0.3662, -0.4925,  1.2332,  ...,  0.7741, -0.1007, -0.3099],
         [-0.0932, -0.8494,  0.6586,  ...,  0.1235, -0.3152, -0.1635],
         [ 0.5306, -0.5510,  0.3105,  ...,  1.2631, -0.5882, -0.1133]]]),
 tensor([1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0]))
train_data_len: 2960
valid_data_len: 1000
Write the training and validation functions:
import torch.optim as optim

def train(train_data_labels):
    """
    description: training function; updates the model parameters and accumulates loss and accuracy
    :param train_data_labels: generator over training data and labels
    :return: summed average loss over the whole training pass and the count of correct labels
    """
    # Initial loss and correct-label count for the training pass
    train_running_loss = 0.0
    train_running_acc = 0.0
    # Iterate over the training generator, updating the model once per batch
    for train_tensor, train_labels in train_data_labels:
        # Reset the gradients for this batch
        optimizer.zero_grad()
        # Obtain the output of the fine-tuning network
        train_outputs = net(train_tensor)
        # Average loss of this batch
        train_loss = criterion(train_outputs, train_labels)
        # Accumulate the batch loss into train_running_loss
        train_running_loss += train_loss.item()
        # Backpropagate the loss
        train_loss.backward()
        # Update the model parameters with the optimizer
        optimizer.step()
        # Accumulate the number of correct labels in this batch, for the accuracy later
        train_running_acc += (train_outputs.argmax(1) == train_labels).sum().item()
    return train_running_loss, train_running_acc

def valid(valid_data_labels):
    """
    description: validation function; evaluates the model on held-out data, collecting loss and accuracy
    :param valid_data_labels: generator over validation data and labels
    :return: summed average loss over the whole validation pass and the count of correct labels
    """
    # Initial loss and correct-label count for the validation pass
    valid_running_loss = 0.0
    valid_running_acc = 0.0
    # Iterate over the validation generator
    for valid_tensor, valid_labels in valid_data_labels:
        # No gradient updates
        with torch.no_grad():
            # Obtain the output of the fine-tuning network
            valid_outputs = net(valid_tensor)
            # Average loss of this batch
            valid_loss = criterion(valid_outputs, valid_labels)
            # Accumulate the batch loss into valid_running_loss
            valid_running_loss += valid_loss.item()
            # Accumulate the number of correct labels in this batch
            valid_running_acc += (valid_outputs.argmax(1) == valid_labels).sum().item()
    return valid_running_loss, valid_running_acc
Train, validate, and save the model:
if __name__ == "__main__":
    # Data paths
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    # Cross-entropy loss
    criterion = nn.CrossEntropyLoss()
    # SGD optimizer
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    # Number of training epochs
    epochs = 4
    # Batch size
    batch_size = 16

    # Train for the specified number of epochs
    for epoch in range(epochs):
        # Print the epoch number
        print("Epoch:", epoch + 1)
        # Obtain fresh training/validation generators and the corresponding sample counts
        train_data_labels, valid_data_labels, train_data_len, \
            valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
        # Run training
        train_running_loss, train_running_acc = train(train_data_labels)
        # Run validation
        valid_running_loss, valid_running_acc = valid(valid_data_labels)
        # Compute the average loss of the epoch: train_running_loss and valid_running_loss
        # are sums of per-batch average losses, so multiplying by batch_size gives the total loss
        # and dividing by the number of samples gives the epoch's average loss.
        train_average_loss = train_running_loss * batch_size / train_data_len
        valid_average_loss = valid_running_loss * batch_size / valid_data_len
        # train_running_acc and valid_running_acc are counts of correct labels,
        # so dividing by the number of samples gives the epoch's accuracy.
        train_average_acc = train_running_acc / train_data_len
        valid_average_acc = valid_running_acc / valid_data_len
        # Print this epoch's training and validation loss and accuracy
        print("Train Loss:", train_average_loss, "|", "Train Acc:", train_average_acc)
        print("Valid Loss:", valid_average_loss, "|", "Valid Acc:", valid_average_acc)

    print('Finished Training')

    # Save path
    MODEL_PATH = './BERT_net.pth'
    # Save the model parameters
    torch.save(net.state_dict(), MODEL_PATH)
    print('Finished Saving')
Output:
Epoch: 1
Train Loss: 2.144986984236597 | Train Acc: 0.7347972972972973
Valid Loss: 2.1898122818128902 | Valid Acc: 0.704
Epoch: 2
Train Loss: 1.3592962406135032 | Train Acc: 0.8435810810810811
Valid Loss: 1.8816152956699324 | Valid Acc: 0.784
Epoch: 3
Train Loss: 1.5507876996199943 | Train Acc: 0.8439189189189189
Valid Loss: 1.8626576719331536 | Valid Acc: 0.795
Epoch: 4
Train Loss: 0.7825378059198299 | Train Acc: 0.9081081081081082
Valid Loss: 2.121698483480899 | Valid Acc: 0.803
Finished Training
Finished Saving
Load the saved model and use it:
if __name__ == "__main__":
    MODEL_PATH = './BERT_net.pth'
    # Load the saved model parameters
    net.load_state_dict(torch.load(MODEL_PATH))

    # text = "酒店设备一般,套房里卧室的不能上网,要到客厅去。"
    text = "房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错."
    print("输入文本为:", text)
    with torch.no_grad():
        output = net(get_bert_encode(text))
        # Take the index of the maximum value in output as the prediction
        print("预测标签为:", torch.argmax(output).item())
Output:
输入文本为: 房间应该超出30平米,是HK同级酒店中少有的大;重装之后,设备也不错.
预测标签为: 1

输入文本为: 酒店设备一般,套房里卧室的不能上网,要到客厅去。
预测标签为: 0